# Week 01 - Beginning R

In [None]:
R.Version()

In [None]:
Sys.which("make")

In [None]:
Sys.getenv("PATH")

In [None]:
library("tidyverse")

# Starting data visualisation with R
### Getting used to R
 - Installing and using packages
 - Using supplied datasets
 - Discovering data
 - Changing data types
### Plotting single variables with ggplot

 - Bar graph
 - Pie chart
 - Histogram


### Getting used to R
Once R is installed you have the base functionality, but most additional functionality comes from packages that you install. Many are hosted on the Comprehensive R Archive Network (CRAN); others are available from GitHub and other sources.  To use a package, it must be installed (once per R installation) and loaded (in every session or program that uses it).

R comes with a large number of supplied datasets that can be used for learning.  These are easily loaded into a dataframe and used.

Before using data, you need to understand the structure and content of the dataframe.  Sometimes the structure needs to be altered before the data can be visualised.

### Using packages (e.g. tidyverse): 

A package needs to be installed in the environment once:

install.packages("tidyverse")

It must be loaded into every program that uses it:

library(tidyverse)

You can see all the packages you have installed using the following function:
installed.packages()
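
A common pattern is to install a package only when it is missing. The following is a minimal sketch of that check. The helper name `is_installed` is made up for illustration, and the install call is left commented out so that nothing is downloaded unintentionally.

```r
# Hypothetical helper: TRUE if the package is available in the current library
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (!is_installed("tidyverse")) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") once")
  # install.packages("tidyverse")  # uncomment to install
}
```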

In [None]:
install.packages("tidyverse")

In [None]:
library(tidyverse)

 ### Using supplied datasets
To find what datasets are available:

 data()
 
 To find all datasets that are available to your environment (depending on packages installed), run: 
 
 data(package = .packages(all.available = TRUE))
 
 To load a specific dataset (e.g. mtcars)
 
 data("mtcars")
 
 This will load a dataframe called mtcars into your program.
 
 We'll just look at the datasets in the standard package (package = "datasets"):

In [None]:
data(package="datasets")
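
The listing can also be searched programmatically. This is a small sketch that relies on the `results` matrix returned by `data()`; the variable name `available` is just for illustration.

```r
# data() returns an object whose $results matrix has one row per dataset
available <- data(package = "datasets")$results

# Show the name and title of the first few datasets
head(available[, c("Item", "Title")])

# Check whether a particular dataset is listed
"mtcars" %in% available[, "Item"]
```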

#### Tidyverse

Tidyverse is a bundle of packages, including ggplot2 (for visualisations), dplyr (for data manipulation), tidyr (for tidying) and readr (for reading in data).  We may use other packages as well.

In [None]:
data(package="ggplot2")

## Check contents

To see the documentation for a dataset, precede its name with a question mark.  The help page isn't always detailed, but it describes the variables.

In [None]:
?mtcars

### mtcars dataset
The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

| variable | description |
| --- | --- |
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | Engine (0 = V-shaped, 1 = straight) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |


So we'll do some visualisation from mtcars.  First we load the dataframe using the data() function.

In [None]:
data("mtcars")

#### Discovering data

The data is in a data frame (data.frame).  This is quite similar to the Pandas DataFrame.  It can have both a column index and a row index.

Once the data is in the dataframe, you can:

 - display the first few rows - e.g. **head(mtcars)**
 - display the number of rows or columns - e.g. **nrow(mtcars)** or **ncol(mtcars)**
 - get a summary of the data in each column - e.g. **summary(mtcars)**
 - get the structure of each column - e.g. **str(mtcars)**
 - get the name of each row - e.g. **rownames(mtcars)**
 - display the whole dataframe - **mtcars**
 - get the number of columns or the dimensions - e.g. **length(mtcars)** or **dim(mtcars)**
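
As a quick illustration of the last two functions: for a dataframe, `length()` counts columns (not rows), while `dim()` gives rows and columns together. The variable names below are only for illustration.

```r
data("mtcars")

dims <- dim(mtcars)       # rows, then columns
n_cols <- length(mtcars)  # a dataframe is a list of columns, so this counts columns

dims    # 32 11
n_cols  # 11
```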
 

In [None]:
head(mtcars)

As we can see, the frame has names for each column and names for each row.  Every item in a column is of the same data type.  The data type is shown under the column name.  In this case, all columns are doubles.
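
The column types can also be checked in code rather than read off the display. A minimal sketch on a freshly loaded copy of mtcars (the name `column_types` is illustrative):

```r
data("mtcars")

# typeof() reports the storage type of each column
column_types <- sapply(mtcars, typeof)
column_types

# TRUE when every column is stored as a double
all(column_types == "double")
```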

In [None]:
nrow(mtcars)
ncol(mtcars)

In [None]:
rownames(mtcars)

In [None]:
colnames(mtcars)

In [None]:
summary(mtcars)

In [None]:
str(mtcars)

#### Manipulating the data
You can also access and manipulate columns in the dataset:

 - To get the contents of the hp column in mtcars - **mtcars\\$hp**. This returns a vector.
 - To get a dataframe with just the row names and the hp column - **mtcars['hp']**. This returns a data.frame.
 - To get the class of the data in a column, use class() - e.g. **class(mtcars\\$hp)**
 - To copy the dataframe into another dataframe, use the assignment operator **<-** - e.g. **df<-mtcars**
 - If we want categorical data, we may need to put the values into categories, i.e. convert the column into a *factor*.
 - To convert a column into a factor (a categorical variable) - e.g. the gear column: **df\\$gear <- as.factor(df\\$gear)**
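
The points above can be put together in one short sketch. It works on a copy (`df`) so that mtcars itself is left unchanged; the intermediate names are illustrative.

```r
data("mtcars")
df <- mtcars                   # work on a copy

hp_vector <- df$hp             # $ extracts the column as a vector
hp_frame <- df["hp"]           # [ ] keeps the result as a one-column dataframe

class(hp_vector)               # "numeric"
class(hp_frame)                # "data.frame"

df$gear <- as.factor(df$gear)  # convert gear to a categorical variable
levels(df$gear)                # "3" "4" "5"
```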

In [None]:
class(mtcars$hp)

In [None]:
mtcars$hp

In [None]:
class(mtcars['hp'])

In [None]:
mtcars['hp']

In [None]:
mtcars

In [None]:
mtcars$cyl

In [None]:
unique(mtcars$cyl)

In [None]:
class(mtcars$cyl)

In [None]:
df<-mtcars

In [None]:
df

In [None]:
class(mtcars$gear)
df$gear<-as.factor(df$gear)
class(df$gear)

# Visualisation
### Data, grammar and geometries
 - The data is what needs to be plotted.
 - Graphics are made up of distinct layers of grammatical elements.
 - Meaningful plots are built around appropriate aesthetic mappings. 
 - The aesthetics layer defines the scales where the mapping takes place.
 - The geom layer refers to the shape the data will take.



### Plotting single variables with ggplot

 - Histogram 
 - Bar graph
 - Pie chart


## Layering your plot
The first layer tells the plot what data is to be used. It is generally not run on its own, but we will run it here to show what happens.  Note:  only a blank grey area is shown, and you don't get an error!

In [None]:
ggplot(mtcars)

## How ggplot works

 - ggplot() sets out the area of the graph.
 - The first term defines what data is being used, and the aesthetics (**aes()**) of the graph.  It is followed by + to add a new term.  The + must come at the end of a line, not at the start of the next one.
 - The next term defines what type of graph, or **geom**, will be used.
 - You can save the plot in a variable as you build it and reuse parts of it that you have previously saved.
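
The steps above can be sketched as follows. The variable names and the choice of `hp` with 10 bins are assumptions made for illustration only.

```r
library(ggplot2)

# First term: the data and the aesthetic mapping (no geom yet, so nothing is drawn)
base_plot <- ggplot(mtcars, aes(x = hp))

# Next term: add a geom; note that + ends the line rather than starting the next
hist_plot <- base_plot +
  geom_histogram(bins = 10)

hist_plot
```

Because `base_plot` is saved separately, a different geom can be added to it later without repeating the data and aesthetics.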

#### Single variable exploration
 - Bar graph
 

## Bar graph
 - These show the number of rows for each value of the column.  In this case, mtcars provides the data and there is a bar for every value of cyl (the number of cylinders in the car).
 - The chart type is a bar chart (geom_bar).
 - Because no variable is mapped to colour or fill, all the bars are drawn in the same default grey.
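
The bar heights are simply row counts per distinct value. As a cross-check outside ggplot, base R's `table()` gives the same numbers (the name `cyl_counts` is illustrative):

```r
data("mtcars")

# Count the rows for each distinct value of cyl
cyl_counts <- table(mtcars$cyl)
cyl_counts
```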

The x axis in a bar chart should use factors, so we'll convert cyl to a factor.

In [None]:
mtcars$cyl <- as.factor(mtcars$cyl)

In [None]:
ggplot(mtcars, aes(cyl)) + geom_bar()


Note:  We can save our plot layers and build on them

In [None]:
p<-ggplot(mtcars, aes(cyl)) 
p<-p+geom_bar()
p

Because there is a bar for every distinct value, this doesn't work well for continuous (real-valued) variables.

In [None]:
p<-ggplot(mtcars, aes(mpg))
p+geom_bar()

We could make it categorical, by using the as.factor function, but it's still not great!

In [None]:
df<-mtcars
df$mpg<-as.factor(df$mpg)
p<-ggplot(df, aes(mpg))
p+geom_bar()

Or an integer, by using the as.integer function, which drops the decimal part

In [None]:
df<-mtcars
df$mpg<-as.integer(df$mpg)
p<-ggplot(df, aes(mpg))
p+geom_bar()
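
Note that `as.integer()` truncates rather than rounds, which shifts some values into a lower bar than rounding would. A small sketch with three made-up sample values:

```r
mpg_sample <- c(21.0, 22.8, 18.7)

truncated <- as.integer(mpg_sample)  # drops the decimal part
rounded <- round(mpg_sample)         # rounds to the nearest whole number

truncated  # 21 22 18
rounded    # 21 23 19
```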

To draw a pie chart, we start with a bar chart, then change the coordinate system to polar using coord_polar():

In [None]:
ggplot(data=mtcars, aes(x=cyl)) + 
  geom_bar() 

On a bar chart, the bars are positioned using cartesian coordinates (i.e. using an x,y axis).  The default coordinate system is coord_cartesian.   So far we have placed the different values along the x-axis.  If we instead map the column to fill and give x a single constant value (an empty string), we get a stacked bar chart.

Because the values are stacked, ggplot automatically differentiates the bands by colour, and generates a legend, labelled with the column name, giving which value of the column is represented in each colour.


In [None]:
ggplot(data=mtcars, aes(x="", fill = cyl)) + 
  geom_bar() 

##  Parts of a whole
- Pie charts
- Rose charts
- Stacked bar / Rose charts
- Donut charts


We can also visualise the exact same thing as a stacked bar chart, using cartesian coordinates

## Let's try a pie chart

To get a pie chart, two things must change:

 - The coordinate system must be changed from cartesian to polar. Instead of using the signed distances along the two coordinate axes, polar coordinates specify the location of a point P in the plane by its distance r from the origin and the angle θ between the positive x-axis and the line segment from the origin to P.
 - Because we've changed coordinates, our data needs to be on the y axis, which is mapped to the angle (theta = "y"). We have no data on the x axis.

With the counts mapped to the angle, each value of cyl becomes a slice of the pie.

In [None]:
ggplot(data=mtcars, aes(x = "", fill = cyl))+ 
geom_bar() + 
  coord_polar(theta = "y") 
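
With `theta = "y"`, the angle of each slice is proportional to its share of the rows. As a rough sketch of the arithmetic behind the pie (computed by hand here, not taken from ggplot's internals):

```r
data("mtcars")

cyl_counts <- table(mtcars$cyl)

# Each slice gets its share of the full 360 degrees
slice_degrees <- as.numeric(cyl_counts) / sum(cyl_counts) * 360
names(slice_degrees) <- names(cyl_counts)

slice_degrees
sum(slice_degrees)  # the slices together make a full circle
```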

If we leave out theta = "y", the counts go on the radius instead of the angle. A bullet chart!!!

In [None]:
pie <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
 geom_bar() +
coord_polar()
pie

Hmm... more of a pacman than a bullet! Note that the circle isn't filled.  To fill the circle, give the bar a width of 1.

In [None]:
pie <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
 geom_bar(width=1) +
coord_polar()
pie

A coxcomb chart - each wedge has the same angle, but its radius represents the count, so the blue wedge (8 cylinders) is the biggest.

In [None]:
ggplot(data = mtcars) +
geom_bar(mapping = aes(x = cyl, fill = cyl)) +
coord_polar()

Again, we can fill the area by setting width = 1.  What happens if we set it to 2?  Or to .5?

In [None]:
ggplot(data = mtcars) +
geom_bar(mapping = aes(x = cyl, fill = cyl), width=1) +
coord_polar()

# Making our visualisation notable.

If you were passing by a poster that was showing this visualisation, would you understand it?  **NO!!!**

ggplot allows us to modify our visualisation to make it stand out.

### Themes

The theme function allows us to add, amend or delete options on the graph.  
 - We can get rid of, or change, the axes.
 - We can change the background.
 - We can give it a label and annotate it.
 - We can put a numeric value on each slice of the pie.

First, let's look at it again:

In [None]:
ggplot(data=mtcars, aes(x = "", fill = cyl))+ 
geom_bar() + 
  coord_polar(theta = "y") 

Let's get rid of the axis labels - i.e. the "count" on the x-axis and the "x" on the y axis.

In [None]:
ggplot(data = mtcars) +
  geom_bar(mapping = aes(x = "", fill = cyl), width = 1) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  ) +
  coord_polar(theta = "y")

Now let's get rid of the axis text.

In [None]:
ggplot(data = mtcars) +
  geom_bar(mapping = aes(x = "", fill = cyl), width = 1) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank()
  ) +
  coord_polar(theta = "y")

...and the grey background. This cell also adds a percentage label to each slice.

In [None]:
ggplot(mtcars, aes(x = "", fill = cyl)) +
  geom_bar(position = "fill") +
  # Label each slice with its share of the rows, rounded to a whole percent
  geom_text(
    stat = "count",
    aes(y = after_stat(count),
        label = after_stat(scales::percent(count / sum(count), accuracy = 1))),
    position = position_fill(vjust = 0.5)
  ) +
  coord_polar(theta = "y") +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    panel.background = element_rect(fill = "white")
  )
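
Since the same theme settings are repeated in several cells, they could be saved once and reused, in the same way plot layers were saved earlier. This is a sketch only: `pie_theme` and `pie_chart` are made-up names, and `axis.title` and `axis.text` are used here as shorthand that covers both the x and y versions.

```r
library(ggplot2)

# Hypothetical reusable theme: hides axis titles and text, and whitens the panel
pie_theme <- theme(
  axis.title = element_blank(),
  axis.text = element_blank(),
  panel.background = element_rect(fill = "white")
)

pie_chart <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  pie_theme

pie_chart
```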


## I can change the colours...

In [None]:
ggplot(data = mtcars, aes(x = "", fill = cyl)) +
  geom_bar(position = "fill") +
  # Label each slice with its share of the rows, rounded to a whole percent
  geom_text(
    stat = "count",
    aes(y = after_stat(count),
        label = after_stat(scales::percent(count / sum(count), accuracy = 1))),
    position = position_fill(vjust = 0.5)
  ) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL) +
  scale_fill_brewer(palette = "Pastel1") +
  theme_void()

# Increasing readability using text

In [None]:
ggplot(mtcars, aes(x = "", fill = cyl)) +
  geom_bar(position = "fill") +
  # Label each slice with its share of the rows, rounded to a whole percent
  geom_text(
    stat = "count",
    aes(y = after_stat(count),
        label = after_stat(scales::percent(count / sum(count), accuracy = 1))),
    position = position_fill(vjust = 0.5)
  ) +
  coord_polar(theta = "y") +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    panel.background = element_rect(fill = "white")
  )
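
The labels on the slices come from dividing each count by the total and formatting the result as a percentage. The same calculation can be reproduced outside the plot; this sketch assumes the scales package is installed (it is installed alongside ggplot2).

```r
data("mtcars")

cyl_counts <- table(mtcars$cyl)
shares <- as.numeric(cyl_counts) / sum(cyl_counts)

# accuracy = 1 rounds to the nearest whole percent
slice_labels <- scales::percent(shares, accuracy = 1)
slice_labels  # "34%" "22%" "44%"
```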


# Histograms

In [None]:
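# cyl is now a factor, so convert via character to recover 4, 6 and 8
# (as.integer() on a factor would return the level codes 1, 2 and 3)
ggplot(mtcars, aes(x = as.integer(as.character(cyl)))) + geom_histogram(bins = 3)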
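
Histograms are better suited to continuous variables such as mpg, which made a poor bar chart earlier. A minimal sketch follows; the bin width of 5 is an arbitrary choice for illustration, and `mpg_hist` is a made-up name.

```r
library(ggplot2)

# Group mpg into bins 5 units wide and count the cars in each bin
mpg_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

mpg_hist
```

Try a smaller or larger `binwidth` to see how the shape of the distribution changes.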