# The A to Z of R 
## From basic commands to beautiful plots with ggplot

# Introduction

## The aim of this presentation is to introduce you to the basics of the R programming language. We will cover:

## *The basics of R - assigning variables, creating dataframes, simple functions;*

## *The 'grammar of graphics' - beautiful plots created with ggplot2; and*

## *Correlations and correlograms*

# **The R 'starter kit'**

## What you will need:

## A version of R, downloaded from https://cran.r-project.org/; and

## An IDE - RStudio is recommended, but platforms such as PyCharm and Atom also support R. You can also create R environments with Anaconda

## *Optional extras:* Tools like Shiny allow you to build web apps, while R Markdown offers an attractive notebook interface

In [None]:
options(repr.plot.width = 6, repr.plot.height = 4.5)

In [None]:
# First, let's create some variables and plot them
# Generate two pseudo-random normal vectors of x- and y-coordinates.
# (Given a vector, rnorm() draws as many values as the vector has elements.)
x <- rnorm(50)
y <- rnorm(x)
# Note that we used '<-' above to assign to a variable. In R, '=' is generally reserved for function arguments.

plot(x, y) # Plot using the built-in R 'plot' function 
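
Because `rnorm()` draws pseudo-random numbers, the plot above will look different on every run. A minimal sketch of making the draws repeatable with `set.seed()` follows; the seed value 42 and the names `x1` and `x2` are arbitrary choices for illustration.

```r
# Fix the state of the random number generator so the draws can be repeated.
# The seed (42) is arbitrary; any integer works.
set.seed(42)
x1 <- rnorm(50)

set.seed(42)
x2 <- rnorm(50)

identical(x1, x2)   # the same seed gives the same vector

# When given a vector, rnorm() uses its length as the number of draws
length(rnorm(x1))
```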

In [None]:
# We can use ls() to show the variables in our workspace (and remove them with rm())
ls()
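
As a short sketch of how `ls()` and `rm()` work together, the example below creates two throwaway variables and removes one of them. The names `tmp_a` and `tmp_b` are hypothetical and exist only for this illustration.

```r
# Create two throwaway variables (names are illustrative only)
tmp_a <- 1
tmp_b <- "two"

"tmp_a" %in% ls()   # tmp_a is listed in the workspace

rm(tmp_a)           # remove a single object by name
"tmp_a" %in% ls()   # no longer listed after removal

# rm(list = ls()) would clear the whole workspace, so use it with care
```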

In [None]:
# Importantly, we can create a dataframe as follows:
test <- data.frame(x = x, y = x + rnorm(x))
test  # and display it...

In [None]:
# Fit a simple linear regression and look at the analysis.
# With y to the left of the tilde, we are modelling y as dependent on x
fm <- lm(y ~ x, data = test)
summary(fm)
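
`summary()` prints everything at once, but individual results can also be pulled out of a fitted model. The sketch below rebuilds a dataframe like `test` so that it runs on its own; the seed is an arbitrary assumption.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
x <- rnorm(50)
test <- data.frame(x = x, y = x + rnorm(50))

fm <- lm(y ~ x, data = test)

coef(fm)                   # named vector: intercept and slope
summary(fm)$r.squared      # proportion of variance explained
summary(fm)$coefficients   # estimates, standard errors, t and p values
```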

In [None]:
# Note that, now that our variables are inside a dataframe, we need to use a $ sign
# to extract a single variable (column)
test$y

In [None]:
plot(test$x, test$y)

# Note that we can also 'attach()' the data to avoid the need for dollar signs, but this can lead 
# to problems with the search path... Remember to 'detach()' to avoid the wrong object being found. 
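
One alternative to `attach()` that leaves the search path untouched is `with()`. The sketch below uses a small toy dataframe so that the result is easy to check by hand.

```r
# Toy data for illustration only
test <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

# with() evaluates an expression using the columns of 'test',
# without changing the search path
with(test, mean(y - x))
with(test, plot(x, y))
```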

In [None]:
# Now let's try something more complicated... First, load the 'dplyr' library for data manipulation
# and show the built-in 'iris' dataframe, which contains flower morphology parameters.
library(dplyr)
iris
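
As a sketch of what `dplyr` is typically used for, the example below filters and summarises `iris` with the pipe operator. The 3 cm threshold and the names `iris_summary` and `mean_length` are arbitrary choices for illustration.

```r
library(dplyr)

# Mean sepal length per species, for flowers with sepals wider than 3 cm
iris_summary <- iris %>%
  filter(Sepal.Width > 3) %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length), n = n())

iris_summary

head(iris)   # show only the first six rows rather than the whole dataframe
```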

In [None]:
# Now, try a simple t-test comparing flower sepal length between species
t.test(iris[iris$Species == 'setosa',]$Sepal.Length, iris[iris$Species == 'virginica',]$Sepal.Length)

In [None]:
# If you want to know exactly what the 't.test' function does, type:
?t.test

In [None]:
# If we want to be more concise, we can assign the t-test result to a variable...
ttest <- t.test(iris[iris$Species == 'setosa',]$Sepal.Length, iris[iris$Species == 'virginica',]$Sepal.Length)
# ...and access the stats of interest like so:
ttest$p.value 
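
The same comparison can also be written with the formula interface of `t.test()`, which some readers find easier to follow. This is a sketch rather than the only way to do it; the names `two_species` and `ttest_formula` are hypothetical.

```r
# Keep only the two species being compared and drop the unused factor level
two_species <- droplevels(subset(iris, Species %in% c("setosa", "virginica")))

ttest_formula <- t.test(Sepal.Length ~ Species, data = two_species)

ttest_formula$p.value    # same test as the two-vector call
ttest_formula$estimate   # the two group means
ttest_formula$conf.int   # confidence interval for the difference in means
```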

In [None]:
# Data types in R: vectors, lists, matrices, arrays, factors, and data frames 
vector1 <- c(1, 2, 3, 4, 5)
vector1

In [None]:
# Lists can hold different sorts of data
list1 <- list(c(2, 5, 3), 21.3, "apple")
list1 
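
Getting items back out of a list depends on the kind of bracket used. The sketch below shows the difference; `list2` is a hypothetical second list added to illustrate named items.

```r
list1 <- list(c(2, 5, 3), 21.3, "apple")

list1[[1]]      # double brackets return the item itself (a numeric vector)
list1[1]        # single brackets return a list of length one
list1[[1]][2]   # second element of the first item: 5

# Items can also be named and then accessed with $
list2 <- list(numbers = c(2, 5, 3), fruit = "apple")
list2$fruit
```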

In [None]:
# Matrices and arrays are 2D and ND datasets, respectively
matrix1 <- matrix(c('a', 'a', 'b', 'c', 'b', 'a'), nrow = 2, ncol = 3, byrow = TRUE)
matrix1
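
Matrices and arrays are indexed by position, with one index per dimension. A brief sketch follows; `array1` is a hypothetical example added to show the ND case.

```r
matrix1 <- matrix(c('a', 'a', 'b', 'c', 'b', 'a'), nrow = 2, ncol = 3, byrow = TRUE)

dim(matrix1)      # 2 rows, 3 columns
matrix1[2, 1]     # row 2, column 1: "c"
matrix1[, 3]      # the whole third column

# A 3D array: 2 rows x 3 columns x 2 layers, filled column-wise with 1 to 12
array1 <- array(1:12, dim = c(2, 3, 2))
array1[1, 2, 2]   # row 1, column 2, layer 2: 9
```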

In [None]:
# ... and factors are grouping variables with associated labels (levels)
levels(iris$Species)
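
Factors can also be built by hand from a character vector. In this sketch the variable `sizes` and its levels are made up for illustration.

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)       # the levels, in the order we specified
table(sizes)        # counts per level
as.integer(sizes)   # the underlying integer codes: 1 3 2 1
```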

# Advanced plotting with ggplot2

## ggplot - 'grammar of graphics'

## Each plot is envisioned as a series of layers - a theme, facets, scales, data

## Allows for a lot of flexibility - data are easily overplotted, facetted, and grouped using 'facet' and 'aesthetic' mappings

In [None]:
# Load the ggplot2 library and re-plot our scatter plot from earlier...
library(ggplot2)

scatter_plot <- ggplot(data = test, aes(x = x, y = y))
scatter_plot + geom_point()

In [None]:
# Let's try adding a regression line...
scatter_plot + geom_point() + geom_smooth()

In [None]:
# That wasn't quite right: by default, geom_smooth() fits a loess curve to small datasets rather than a straight line.
# Let's specify the method for 'geom_smooth()'
scatter_plot + geom_point() + geom_smooth(method = 'lm')

In [None]:
# Now let's try another common plot type - the boxplot
box_data <- ggplot(data = iris, aes(x = Species, y = Sepal.Width, colour = Species))
box_data + geom_boxplot()

In [None]:
# Jitter shows the raw data points, spread out horizontally to reduce overlap
box_data + geom_jitter(width = 0.2)

In [None]:
# A dot plot bins the values along the y-axis and stacks one dot per observation
box_data + geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.5)

In [None]:
# A violin plot shows the shape of the distribution within each group
box_data + geom_violin(scale = "area")

In [None]:
# Let's try a new dataset - 'diamonds' - to demonstrate histograms
plot_histo <- ggplot(diamonds, aes(carat)) 
plot_histo + geom_histogram()
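
Called without arguments, `geom_histogram()` falls back to 30 bins and prints a message suggesting a better choice. A minimal sketch of setting the bin width explicitly follows; the value 0.1 and the name `hist_plot` are arbitrary choices for illustration.

```r
library(ggplot2)

plot_histo <- ggplot(diamonds, aes(carat))

# Each bar now covers 0.1 carat instead of one thirtieth of the data range
hist_plot <- plot_histo + geom_histogram(binwidth = 0.1)
hist_plot
```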

In [None]:
# Alternatively, display the data as a density plot
plot_histo + geom_density()

In [None]:
# One last useful tool - the ggplot2 'facet'. Let's see what happens if we facet by the factor 'cut'
plot_histo + geom_density() + facet_grid(. ~ cut)

# Exercise

## Add some of the following arguments to the 'geom' calls shown earlier, outside of the aes() function: alpha (= 0-1), fill, colour, size. Set any of these equal to a factor variable inside aes() to make them vary by group

## Try adding (literally '+') these layers to the previous plots, and see what happens:

## theme_bw(), theme_classic()  *Alternatives to ggplot's grey theme* 

## scale_colour_viridis_d()  *A colourblind-friendly colour scale*  

# Summary

## R has several advantages over SPSS: it's open-source, lightweight, flexible, and easily scriptable - for better reproducibility.

## It has a large and engaged community, plus packages and features that are absent from most commercial software - e.g. plotting geospatial data.

## It is a competitor to Python, but is better suited to producing statistical models and beautiful data visualisations, whereas Python is more useful for data science and machine-learning applications.
