# RSE-Sheffield/Code_cafe

Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
 .. Failed to load latest commit information. exercise_solutions README.md best_fit.R example_data.csv rstudio.png

# First Steps With R

This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.

## What is R?

R is a free, open-source programming language that has very strong support for statistics. It was originally developed as an open source implementation of the S Programming language. It is used extensively in research and industry for areas such as data analysis, statistics, machine learning, bioinformatics, simulation, linguistics and much more.

Don't just take our word for it though -- here's what others have to say

## Installing R and RStudio

Many users of R use it from within another free piece of software called RStudio. RStudio is a powerful and productive user interface for R. It’s free and open source, and works great on Windows, Mac, and Linux.

Our first task, therefore, is to install R and RStudio.

## Starting RStudio

When you start RStudio, you'll be greeted with a window like the one below

R can be used interactively by typing commands into the Console panel. In this tutorial, everything that is formatted like this:

``````print("this is an R command")
``````

Should be typed into the terminal. Press Return after every command.

## Simple commands and calculations

R is a command based system which means that you (usually) interact with it by entering commands rather than using a Graphical User Interface (GUI). Some of these commands are rather straightforward! For example, R can be used to do arithmetic

``````1+1
3*9
377/120
``````

R can also do all of the mathematical operations that you'd expect to see on a scientific calculator. For example, to take the square root of two:

``````sqrt(2)
``````

This is the first time we've entered a function in R so let's discuss some details. In the above, the function name is `sqrt` and the function argument is 2. In R, all function arguments are enclosed in parentheses `()`

R is case sensitive. For example, the correct command for square root is `sqrt(2)` with everything in lower case. Variations such as `Sqrt(2)` or `SQRT(2)` won't work (try it!).

R can also evaluate all the standard trigonometric functions such as `sin`, `cos` and `tan`. These take their arguments in radians rather than degrees. As such, a right angle is `pi/2` rather than 90.

``````sin(pi/2)
``````

Unlike many scientific calculators, R's `log` function takes the natural logarithm by default.

``````log(10)
``````

If you want to calculate a logarithm to base 10, you'll need to specify the base as a second argument.

``````log(100,base=10)
``````

This shows another feature of R functions -- named arguments. In this case, the named argument is base. Since the second argument to `log` is, by design, always the base you could have simply executed

``````log(100,10)
``````

but the named argument version is more readable.

## Getting help

Built in to R is a large amount of documentation that you can call on any time. For example, if you forget the details about the `log` function described above, ask R for help

`````` help(log)
``````

## Variables

We'll rarely want to perform a calculation and throw away the result. It is much more likely that we'll want to store the result in R's memory for later use; either as part of future calculations or ready for export to external files.

We do this by assigning the results of calculations to variables. For example,

``````a <- sin(1)
b <- 10
c <- a+b
``````

In the above, we created three variables called a, b and c. Note that as you create variables, they are shown, along with their values, in RStudio's Environment window. You can also list all variable names that currently exist in R's memory using the command

``````ls()
``````

To see the value of any given variable, just type it's name followed by enter

``````c
``````

To remove a variable from R's memory, we use the rm() command

``````rm(c)
``````

The `rm` command can also remove a list of variables in one go. For example, we could remove all variables in R's memory by sending the results of ls() to it.

``````rm(list=ls())
``````

## Built in datasets

R comes with a package called datasets that contains a set of classic datasets such as Fisher's Iris data and Anscombe's quartet. This package is one of the few that are loaded when you start R.

To see the full list of available datasets, execute the command

``````library(help="datasets")
``````

We are going to focus on the iris dataset which is stored as an R object called a Data Frame in the variable name `iris`. Learn more about this dataset using the `help` command:

``````help(iris)
``````

If you run the above command, you'll see that R's documentation tells us that "iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species."

Let's confirm this information for ourselves by introducing a few more R commands. `dim()` tells us the dimensions of a data frame

``````dim(iris)
``````

The `names()` function tells us the column names of a data frame.

``````names(iris)
``````

We can extract any of the columns by name using the `\$` operator. To get a list of the petal lengths for example we do

``````iris\$Petal.Length
``````

The `str()` function gives a compact summary of the structure of its input

``````str(iris)
``````

The `head()` function shows us the first 6 data points.

``````head(iris)
``````

You could display the entire data frame by simply entering

``````iris
``````

Alternatively, we can obtain some summary statistics about this data frame using the `summary()` command

``````summary(iris)
``````

## Plotting data

Let's extract the columns Petal.Length and Petal.Width and plot them against each other

``````x = iris\$Petal.Length
y = iris\$Petal.Width
plot(x,y)
``````

We add axis labels and titles by supplying named arguments to the plot command

``````plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data")
``````

Each datapoint has an iris species associated with it - one of setosa, versicolor and virginica. We can see this by asking R what the structure of the `iris\$Species` column is

``````str(iris\$Species)
``````

Factors are how R represent categorical variables. We can see what the factor levels are with

``````levels(iris\$Species)
``````

We can include this information on the plot by coloring each datapoint according to its species.

``````plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data",col=iris\$Species)
``````

``````plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data",col=iris\$Species)
legend(x = 1, y = 2.5, legend = levels(iris\$Species), col = c(1:3), pch=1)
``````

Exercise - Tooth growth:

Try summarising and plotting a different dataset using the commands you've learned. The name of the dataset to investigate is `ToothGrowth`. Again, you can use `help(ToothGrowth)` to see contextual information and metadata.

## Packages

R has many functions built in but there are over 8000 freely available add-on packages that provide thousands more functions. Once you know the name of a package, you call install it very easily.

For example, a package called ggplot2 is widely used to create high quality graphics. To install ggplot2:

``````install.packages("ggplot2")
``````

We make all of the `ggplot2` functions available to our R session with the `library` command

``````library(ggplot2)
``````

Among other things, this makes the qplot function available to us. We can use this as an alternative to the basic `plot` command described above

``````qplot(iris\$Petal.Length, iris\$Petal.Width,col=iris\$Species)
``````

Alternatively, we can save ourselves typing `iris\$` a lot by telling `qplot` that the data we are referring to is the iris data

``````qplot(data=iris,Petal.Length, Petal.Width,col=Species)
``````

To get help about the functionality in the ggplot2 package:

``````help(package=ggplot2)
``````

Exercise (Packages)

A very popular R package is MASS which was created to support the book Modern Applied Statistics with S. This contains many more classic data sets which can be used to develop your R skills.

1. Install the MASS package on your machine.
2. Explore the MASS package's documentation and find a dataset that interests you.
4. Take a look at the dataset you chose in part (2) using what you've learned so far.

## The current working directory

Working with built-in datasets is great for practice but for real-life work its vital that you can import our own data. Before we do this, we must learn where R is expecting to find your files. It does this using the concept of current working directory. To see what the current working directory is, execute

``````getwd()
``````

You can create a new directory using `dir.create()`

``````dir.create('mydata')
``````

Move into this new directory using `setwd()`

``````setwd('mydata')
``````

See its contents with

``````dir()
``````

The current working directory is where R is currently looking for files and also where it will put any files it creates unless you tell it otherwise.

In this section, you'll learn how to import data into R from the common .csv (comma separated values) format.

``````download.file('https://raw.githubusercontent.com/mikecroucher/Code_cafe/master/First_steps_with_R/example_data.csv',destfile="example_data.csv")
``````

Ensure that the file is in your current working directory using the dir() function

``````dir()
``````

Import the .csv file using the read.csv() function

``````example_data <- read.csv('example_data.csv')
``````

The variable `example_data` will be an R data frame -- exactly the same type of object as the iris data we looked at earlier.

Exercise - example_data

• Show the first few lines of example_data
• Create a plot of the example_data
• Show summary statistics of example_data

## Scripts

In the simplest terms, a script is just a text file containing a list of R commands. We can run this list in order with a single command called `source()`

An alternative way to think of a script is as a permanent, repeatable, annotated, shareable, cross-platform archive1 of your analysis! Everything required to repeat your analysis is available in a single place. The only extra required ingredient is a computer.

For example, based on the article at http://www.walkingrandomly.com/?p=5254, we have created a script called `best_fit.R` that finds the parameters `p1` and `p2` such that the curve `p1*cos(p2*xdata) + p2*sin(p1*xdata)` is a best fit for the `example_data` described earlier. The details of this are beyond the scope of this course but you can easily download and run this analysis yourself.

``````download.file('https://raw.githubusercontent.com/mikecroucher/Code_cafe/master/First_steps_with_R/best_fit.R',destfile='best_fit.R')
source('best_fit.R')
``````

By doing this, you have reproduced the analysis that we did. You are able to check and extend our results or apply the code to your own work. Making code and data publicly available like this is the foundation of Open Data Science

## Further reading and next steps

In this session, we told you how to import data from a file but not how to export it. The following link will teach you how to export to .csv.

There are many online resources for learning R. Here are some we like

## References

[1] Getting Started with R - An Introduction for Biologists. Authors: Beckerman and Petchey.