First Steps With R
This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.
We have an interactive discussion notebook at https://v.etherpad.org/p/code_cafe where you can ask questions and make comments.
What is R?
R is a free, open-source programming language with very strong support for statistics. It was originally developed as an open-source implementation of the S programming language. It is used extensively in research and industry for data analysis, statistics, machine learning, bioinformatics, simulation, linguistics and much more.
With over 8000 freely available add-on packages that provide extensive additional functionality, R will probably have something that can help your research.
Don't just take our word for it though -- here's what others have to say:
- Why use R? Five reasons - From the 'Econometrics By Simulation' blog
Installing R and RStudio
Many users of R use it from within another free piece of software called RStudio. RStudio is a powerful and productive user interface for R. It’s free and open source, and works great on Windows, Mac, and Linux.
Our first task, therefore, is to install R and RStudio.
- Install R first. Downloads are available at https://cran.rstudio.com/
- Install RStudio second. Downloads are available at https://www.rstudio.com/products/rstudio/download/
When you start RStudio, you'll be greeted with a window like the one below
R can be used interactively by typing commands into the Console panel. In this tutorial, everything that is formatted like this:
print("this is an R command")
should be typed into the Console panel. Press Return after every command.
Simple commands and calculations
R is a command-based system, which means that you (usually) interact with it by entering commands rather than using a Graphical User Interface (GUI). Some of these commands are rather straightforward! For example, R can be used to do arithmetic:
1+1
3*9
377/120
R can also do all of the mathematical operations that you'd expect to see on a scientific calculator. For example, sqrt(2) takes the square root of two.
This is the first time we've entered a function in R so let's discuss some details. In the above, the function name is sqrt and the function argument is 2. In R, all function arguments are enclosed in parentheses.
R is case sensitive. For example, the correct command for square root is sqrt(2) with everything in lower case. Variations such as SQRT(2) won't work (try it!).
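As a minimal sketch of the points above, the following calls sqrt correctly and then shows what happens with the wrong case. The tryCatch wrapper and the variable names are our own additions for illustration; at the Console you would simply type SQRT(2) and read the error message.

```r
# Correct: the function name is all lower case
root_two <- sqrt(2)
root_two

# Incorrect: R has no function called SQRT, so this raises an error.
# tryCatch captures the error message instead of stopping the session.
wrong_case <- tryCatch(SQRT(2), error = function(e) conditionMessage(e))
wrong_case
```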
R can also evaluate all the standard trigonometric functions such as tan. These take their arguments in radians rather than degrees. As such, a right angle is pi/2 rather than 90.
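Because the trigonometric functions expect radians, angles given in degrees need converting first. The sketch below uses a helper we have named deg2rad (it is not a built-in R function) to do this:

```r
# Hypothetical helper: convert degrees to radians (not part of base R)
deg2rad <- function(degrees) degrees * pi / 180

sin(pi / 2)         # sine of a right angle, given in radians
sin(deg2rad(90))    # the same angle, converted from degrees
tan(deg2rad(45))    # very close to 1, up to floating-point rounding
```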
Unlike many scientific calculators, R's log function takes the natural logarithm by default.
If you want to calculate a logarithm to base 10, you'll need to specify the base as a second argument.
This shows another feature of R functions -- named arguments. In this case, the named argument is base. Since the second argument to log is, by design, always the base, you could have simply supplied the base as a plain second argument, but the named argument version is more readable.
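The different ways of calling log can be sketched as follows. The input value 100 is our own choice for illustration; log10 is a base R shortcut for the base-10 case.

```r
natural_log <- log(100)              # natural logarithm (base e)
named_base  <- log(100, base = 10)   # base 10, using the named argument
positional  <- log(100, 10)          # base 10 again, passing the base by position
shortcut    <- log10(100)            # built-in shortcut for base 10

natural_log
named_base
```

The named and positional forms give the same answer; the named form simply makes the intent easier to read.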
Built in to R is a large amount of documentation that you can call on at any time. For example, if you forget the details of the log function described above, ask R for help with help(log).
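A minimal sketch of asking for help at the Console. Both forms below are standard base R; in RStudio the documentation opens in the Help panel.

```r
help(log)   # open the documentation page for the log function
?log        # shorthand for the same request
```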
We'll rarely want to perform a calculation and throw away the result. It is much more likely that we'll want to store the result in R's memory for later use; either as part of future calculations or ready for export to external files.
We do this by assigning the results of calculations to variables. For example,
a <- sin(1)
b <- 10
c <- a+b
In the above, we created three variables called a, b and c. Note that as you create variables, they are shown, along with their values, in RStudio's Environment window. You can also list all variable names that currently exist in R's memory using the ls() command.
To see the value of any given variable, just type its name followed by Enter.
To remove a variable from R's memory, we use the rm() command.
The rm command can also remove a list of variables in one go. For example, we could remove all variables in R's memory by sending the results of ls() to it.
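Taken together, the variable commands in this section can be sketched as follows. This is an illustrative sequence rather than a definitive recipe, and the last rm call really does clear everything in the current session, so only run it when you no longer need your variables.

```r
a <- sin(1)
b <- 10
c <- a + b

ls()              # names of all variables currently in memory
c                 # typing a name prints its value
rm(a)             # remove a single variable
ls()              # a is no longer listed
rm(list = ls())   # remove everything that ls() reports
ls()              # nothing that we created is left
```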
Built in datasets
To see the full list of available datasets, execute the data() command.
We are going to focus on the iris dataset, which is stored as an R object called a data frame in the variable iris. Learn more about this dataset using the help() function.
If you run the above command, you'll see that R's documentation tells us that "iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species."
Let's confirm this information for ourselves by introducing a few more R commands.
The dim() function tells us the dimensions of a data frame.
The names() function tells us the column names of a data frame.
We can extract any of the columns by name using the $ operator. To get a list of the petal lengths, for example, we use iris$Petal.Length.
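A minimal sketch of these three commands applied to iris. The variable name petal_lengths is our own choice.

```r
dim(iris)      # 150 rows and 5 columns
names(iris)    # the five column names

petal_lengths <- iris$Petal.Length   # one column, extracted by name
length(petal_lengths)                # one value per row
```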
The str() function gives a compact summary of the structure of its input.
The head() function shows us the first 6 rows.
You could display the entire data frame by simply entering its name, iris.
Alternatively, we can obtain some summary statistics about this data frame using the summary() function.
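These inspection commands can be tried on iris as sketched below. The n argument to head and the single-column call to summary are optional extras that we have added for illustration.

```r
str(iris)                    # column types plus the first few values of each
head(iris)                   # first 6 rows
head(iris, n = 3)            # the optional n argument changes how many rows are shown
summary(iris)                # summary statistics for every column
summary(iris$Petal.Length)   # the same, for a single column
```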
Let's extract the columns Petal.Length and Petal.Width and plot them against each other:
x <- iris$Petal.Length
y <- iris$Petal.Width
plot(x,y)
We add axis labels and titles by supplying named arguments to the plot command
plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data")
Each datapoint has an iris species associated with it - one of setosa, versicolor and virginica. We can see this by asking R for the structure of the iris$Species column with str(), which reports it as a factor.
Factors are how R represents categorical variables. We can see what the factor levels are with levels(iris$Species).
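A short sketch of inspecting the Species factor. The call to table is our own addition, included because it is a quick way to count the rows in each category.

```r
str(iris$Species)       # reports a factor with 3 levels
levels(iris$Species)    # the names of the three species
table(iris$Species)     # number of rows for each species
```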
We can include this information on the plot by coloring each datapoint according to its species.
plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data",col=iris$Species)
Finally, let's add a legend
plot(x,y,xlab="Petal Length",ylab="Petal Width",main="Iris Data",col=iris$Species)
legend(x = 1, y = 2.5, legend = levels(iris$Species), col = c(1:3), pch=1)
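Coloring by species works because a factor stores each level as an integer code, and plot interprets those integers as positions in R's color palette. The short sketch below is our own illustration of that mapping; it is also why the legend uses col = c(1:3).

```r
species_codes <- as.integer(iris$Species)   # 1, 2 or 3 for each row
unique(species_codes)                       # the codes, in the order of levels(iris$Species)
palette()[1:3]                              # the colors those codes select
```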
Exercise - Tooth growth:
Try summarising and plotting a different dataset using the commands you've learned. The name of the dataset to investigate is ToothGrowth. Again, you can use help(ToothGrowth) to see contextual information and metadata.
R has many functions built in, but there are over 8000 freely available add-on packages that provide thousands more. Once you know the name of a package, you can install it very easily.
For example, a package called ggplot2 is widely used to create high quality graphics. To install ggplot2, pass its name (in quotes) to the install.packages() function.
We make all of the ggplot2 functions available to our R session with the library() function.
Among other things, this makes the qplot function available to us. We can use this as an alternative to the basic plot command described above.
Alternatively, we can save ourselves typing iris$ a lot by telling qplot that the data we are referring to is the iris data, using its data argument.
To get help about the functionality in the ggplot2 package, use help(package = "ggplot2").
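Putting the package steps together, a minimal sketch might look like the following. The install line is commented out because it only needs to run once and requires an internet connection, and the requireNamespace check is our own addition so that the example does nothing if ggplot2 is not installed. Recent ggplot2 releases mark qplot as deprecated in favor of ggplot(), so you may see a warning.

```r
# install.packages("ggplot2")   # run once; needs an internet connection

# Only continue if ggplot2 is already installed on this machine
have_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (have_ggplot2) {
  library(ggplot2)   # make the ggplot2 functions available

  # Refer to the columns through iris$ ...
  p1 <- qplot(iris$Petal.Length, iris$Petal.Width, colour = iris$Species)

  # ... or name the data frame once with the data argument
  p2 <- qplot(Petal.Length, Petal.Width, data = iris, colour = Species)

  print(p2)   # at the Console, typing p2 on its own also draws the plot
}
```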
- Install the MASS package on your machine.
- Explore the MASS package's documentation and find a dataset that interests you.
- Load the MASS library into your R session.
- Take a look at the dataset you chose in the second step using what you've learned so far.
The current working directory
Working with built-in datasets is great for practice, but for real-life work it's vital that you can import your own data. Before we do this, we must learn where R expects to find your files. It does this using the concept of the current working directory. To see what the current working directory is, execute getwd().
You can create a new directory using dir.create().
Move into this new directory using setwd().
See its contents with dir().
The current working directory is where R is currently looking for files and also where it will put any files it creates unless you tell it otherwise.
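A minimal sketch of working with directories, using the base R functions getwd, dir.create, setwd and dir. The directory name code_cafe_demo is hypothetical, and we create it inside R's temporary directory so that the example leaves your own files alone; in practice you would pick a location of your own.

```r
start_dir <- getwd()                                 # remember where we started
new_dir <- file.path(tempdir(), "code_cafe_demo")    # hypothetical directory name

dir.create(new_dir, showWarnings = FALSE)   # create the new directory
setwd(new_dir)                              # move into it
getwd()                                     # confirm the new working directory
dir()                                       # list its contents (empty for now)

setwd(start_dir)                            # move back to where we started
```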
Importing your own data
In this section, you'll learn how to import data into R from the common .csv (comma separated values) format.
Download the file example_data.csv to your current working directory. You can either do this manually, using your web browser, or you can use the R command download.file.
Ensure that the file is in your current working directory using the dir() function.
Import the .csv file using the read.csv() function:
example_data <- read.csv('example_data.csv')
example_data will be an R data frame -- exactly the same type of object as the iris data we looked at earlier.
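If you don't have example_data.csv to hand, you can still try read.csv on a file you create yourself. The sketch below writes a tiny, made-up table to a temporary file and reads it back. The file name, column names and values are purely illustrative and have nothing to do with the real example_data.csv.

```r
# Made-up data, written to a temporary location (all names are hypothetical)
csv_path <- file.path(tempdir(), "toy_data.csv")
writeLines(c("xdata,ydata",
             "1,2.5",
             "2,3.1",
             "3,4.8"), csv_path)

toy_data <- read.csv(csv_path)   # the first line becomes the column names
class(toy_data)                  # a data frame, like iris
dim(toy_data)                    # 3 rows, 2 columns
```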
Exercise - example_data
- Show the first few lines of example_data
- Create a plot of the example_data
- Show summary statistics of example_data
In the simplest terms, a script is just a text file containing a list of R commands. We can run this list in order with a single command called source.
An alternative way to think of a script is as a permanent, repeatable, annotated, shareable, cross-platform archive of your analysis! Everything required to repeat your analysis is available in a single place. The only extra required ingredient is a computer.
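To make the idea concrete, here is a minimal sketch that creates a two-line script and runs it with source. The file name my_analysis.R and its contents are hypothetical; normally you would write the script in RStudio's editor rather than with writeLines.

```r
# Write a tiny script to a temporary file (file name and contents are made up)
script_path <- file.path(tempdir(), "my_analysis.R")
writeLines(c("# Area of a circle with radius 2",
             "radius <- 2",
             "area <- pi * radius^2"), script_path)

source(script_path)   # run every command in the script, in order
area                  # variables created by the script are now available
```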
For example, based on the article at http://www.walkingrandomly.com/?p=5254, we have created a script called best_fit.R that finds the parameters p1 and p2 such that the curve p1*cos(p2*xdata) + p2*sin(p1*xdata) is a best fit for the example_data described earlier. The details of this are beyond the scope of this course but you can easily download and run this analysis yourself.
By doing this, you have reproduced the analysis that we did. You are able to check and extend our results or apply the code to your own work. Making code and data publicly available like this is the foundation of Open Data Science.
Further reading and next steps
In this session, we told you how to import data from a file but not how to export it. The following link will teach you how to export to .csv.
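As a starting point, base R's write.csv function is the counterpart of read.csv. A minimal sketch, writing the built-in iris data to a temporary file (the file name is our own choice):

```r
out_path <- file.path(tempdir(), "iris_export.csv")   # hypothetical file name
write.csv(iris, out_path, row.names = FALSE)          # row.names = FALSE omits the row-number column

# Read the file back to confirm that the export worked
iris_again <- read.csv(out_path)
dim(iris_again)   # 150 rows, 5 columns
```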
There are many online resources for learning R. Here are some we like:
- Getting Started with R - An Introduction for Biologists. Authors: Beckerman and Petchey.