# Learning R

This Jupyter notebook outlines some of the core components of the [R](https://www.r-project.org/) programming language. Because I am working in Jupyter, I have downloaded and installed the R kernel for Jupyter using the Anaconda package manager; it was a little tricky, but I got there in the end!

This guide was conceived while reading the following: [r-bloggers](https://www.r-bloggers.com/the-5-most-effective-ways-to-learn-r/), [RStudio](https://www.rstudio.com/online-learning/#r-programming). r-bloggers.com is a great place to keep up with the community and to find all sorts of useful resources. RStudio is a free IDE for R.

R is a statistical programming language. First, let's check that the R kernel is working: it is.

At any point, you can type `?function_name` and R will open the documentation for that function in the console.

In [1]:
?class

## The basics

I will be using this free DataCamp [intro to R](https://www.datacamp.com/courses/free-introduction-to-r) course to learn some of the basics of the R programming language. Let's get started.

### Arithmetic

Like other programming languages, R has many built-in arithmetic operators:
- `+` the addition operator
- `-` the subtraction operator
- `/` the division operator
- `*` the multiplication operator
- `%%` the modulo operator
- `^` the exponent operator
 
These operations can be seen below and will be printed to the console.

In [None]:
4 + 6 # Addition
12 - 3 # Subtraction
60/5 # Division
3*5 # Multiplication
60%%5 # Modulo
3^6 # Exponentiation

### Assigning Variables and Basic Data Types

Unlike some of the languages you may be familiar with, R uses the `<-` operator to assign a value to a variable: the left-hand side is the variable and the right-hand side is the value. For example, typing `x <- 4` sets the value of `x` to 4. As in other languages, you can then perform all sorts of operations on these variables.

The **basic data types** in R are `numeric` for decimal numbers like 63.1 and whole numbers like 7, `logical` for boolean values (`TRUE` or `FALSE`), and `character` for a string of characters. You can check the type of a variable by calling the `class()` function, which returns the type of the data.

Notice that adding a `numeric` value to a `character` value will throw an error.

Also note that R is **case sensitive**.
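
Here is a minimal sketch of these points. The variable names are made up purely for illustration, and `tryCatch()` is only there so that the error message can be inspected without stopping the notebook.

```r
# Illustrative variables (hypothetical names, not from the course)
a_number <- 63.1
a_string <- "hello"
a_flag <- TRUE

class(a_number)  # "numeric"
class(a_string)  # "character"
class(a_flag)    # "logical"

# Adding a number to a string is an error; tryCatch() captures the message
msg <- tryCatch(a_number + a_string, error = function(e) conditionMessage(e))
msg

# Logical values are coerced to numbers, so TRUE counts as 1 here
a_number + a_flag

# Case sensitivity: a_number and A_Number are two different names,
# so uncommenting the next line gives an "object not found" error
# A_Number
```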

In [None]:
# Making a bunch of variables and assigning them values
patrick <- "numpty"
my_integer <- 7
my_numeric <- 63.1
my_bool <- TRUE

# my_integer + patrick will throw an error

class(my_numeric)
class(patrick)
class(my_bool)
class(my_integer)

# typing the variable simply prints it.
my_bool

## Vectors

As a budding data scientist, I want some more robust ways of analysing data, and the `vector` data type is perfect for this. A vector is a structure that holds as many values as we want (practically speaking), but all of a single kind, such as `numeric`, `complex`, `character`, etc.

First of all, we can create a vector by again assigning a variable, but this time we call the `c()` function, which *combines* a number of elements. Note that indexes in R start at 1, not 0, so to access the first element of our vector we reference `my_vector[1]`.

### Operations

Adding two vectors together performs an element-wise addition. Note that if the two vectors differ in length, R recycles the shorter one, and it issues a warning (not an error) when the longer length is not a multiple of the shorter.
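
A quick sketch of element-wise addition with made-up vectors. It also shows what R does when the lengths differ: the shorter vector is recycled, and a warning is raised if the longer length is not a multiple of the shorter one.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)
a + b             # 11 22 33 44

# Lengths 4 and 2: the shorter vector is recycled silently
a + c(100, 200)   # 101 202 103 204

# Lengths 4 and 3: still recycled, but R raises a warning.
# tryCatch() captures the warning message so we can look at it.
warn <- tryCatch(a + c(1, 2, 3), warning = function(w) conditionMessage(w))
warn

# The recycled result itself, with the warning silenced
suppressWarnings(a + c(1, 2, 3))   # 2 4 6 5
```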

There are other operations which may be useful too, such as the `mean()` and `sum()` functions, which return the arithmetic mean and the sum of a vector respectively.

Like matrices below, we can perform operations on vectors that are element-wise.

### Naming Vectors

As a data scientist it is important to have a clear view of what we are analysing. To help with this, we can give the values in a vector names: we call `names()` on the vector we want to name and assign it the vector of names, as in `names(my_spending) <- days`.

Below I have given a hypothetical vector of my spending for a week.

We can also apply operators like `>` and `<` to find the days on which more or less than a certain amount of money was spent. Applying the functions `sum()` and `mean()` gives the total and the average amount of money spent.

### Selecting Values from Vectors

The most basic way to access a vector's values is the syntax `vector[n]`, where `n` is the position in the vector. We can also use the syntax `vector[n:m]` to select a range of values; this is similar to Python slicing, except that both ends are included.

We can go through data to find when certain values arise. For example, I check to see which days I spent more than 50 dollars; this selection vector can then be used to select only the days on which I spent more than 50 dollars, as shown below.
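
As a compact sketch of the different ways to select from a vector, here is a small named vector of made-up scores (the data and names are hypothetical):

```r
# Hypothetical scores, named by student (illustrative data only)
scores <- c(72, 45, 88, 51)
names(scores) <- c("Ann", "Bob", "Cat", "Dan")

scores[2]          # a single element by position
scores[2:3]        # a range of positions, both ends included
scores[c(1, 4)]    # arbitrary positions
scores["Cat"]      # by name
scores[-1]         # everything except the first element

# Logical selection: keep only the scores above 50
above_50 <- scores > 50
scores[above_50]
names(scores)[above_50]   # "Ann" "Cat" "Dan"
```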

In [17]:
# How to instantiate a vector
my_vector1 <- c(1,2,3,4,5)
my_vector2 <- c(12,4,11,9,6)
my_vector3 <- c(1,3,5)
my_string_vector <- c("Pat", "is", "a", "numpty")
my_bool_vector <- c(TRUE, FALSE, TRUE, TRUE)

# Printing a vector prints its elements
my_vector1
my_string_vector

# Element-wise addition, then the sum of the result
new_vector <- my_vector1 + my_vector2
new_vector
sum(new_vector)
# my_vector1 + my_vector3 gives a warning (lengths 5 and 3); the shorter vector is recycled.

# Can print specific elements only too 
my_vector1[1:3]

class(my_string_vector)
class(my_vector1)
class(my_bool_vector)

# Making a vector and naming each element
my_spending <- c(22.5, 31, 65.35, 22, 0, 16, 120)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Find the sum and mean of my spending
sum(my_spending)
mean(my_spending)

# Name the vector
names(my_spending) <- days

# Printing this will show the days that I spent different amounts of money
my_spending

# We can select specific values
my_spending[3:5]

# Finding the days that I spent more than 50 dollars, returns a vector showing the truth value of each entry
g50_selection <- my_spending > 50
g50_selection

# Can also print out days that we spent > 50 dollars by using the selection vector, it will only select TRUE values
g50 <- my_spending[g50_selection]
g50


## Matrices

Matrices are another structure in R that is useful for analysing all sorts of data. A matrix in R is a structure whose elements all share the same data type and which has *dimensionality* (rows and columns). Matrices are very similar to vectors, and a vector is often treated as a component of a matrix: each row or column is a vector, and several vectors together make up a matrix.

I am going to create a hypothetical matrix of my spending over the course of a few weeks to demonstrate some of the principles of matrices.

### Creating a matrix

The `matrix()` function is central to the initialisation of a matrix: it takes a vector of elements (several vectors can be combined with `c()`) and arranges them into a matrix with a given number of rows and columns. We can fill the matrix row by row by passing in `byrow=TRUE`; alternatively, we can set this to `FALSE` (the default) to fill it column by column.

When filling a matrix, the number of elements passed in should be a whole multiple of the number of rows; in other words, there must be enough elements to fill every row completely. Otherwise R recycles the elements to fill the gaps and issues a warning.
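
To make the difference between the two fill orders concrete, here is a tiny sketch using the numbers 1 to 6 (the variable names are just for illustration):

```r
# Filled row by row: row 1 is 1 2 3, row 2 is 4 5 6
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row

# Filled column by column: column 1 is 1 2, column 2 is 3 4, column 3 is 5 6
by_col <- matrix(1:6, nrow = 2, byrow = FALSE)
by_col

# Both have 2 rows and 3 columns
dim(by_row)   # 2 3
dim(by_col)   # 2 3
```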

There are a few functions that you should become familiar with for manipulating matrices, such as `rbind()` and `cbind()`, which add new rows and new columns to a matrix respectively.
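
A short sketch of both functions on a made-up 2 x 2 matrix: `rbind()` binds on a new row and `cbind()` binds on a new column.

```r
m <- matrix(1:4, nrow = 2)

m_more_rows <- rbind(m, c(9, 9))   # adds a third row
m_more_cols <- cbind(m, c(7, 8))   # adds a third column

dim(m_more_rows)   # 3 2
dim(m_more_cols)   # 2 3
```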

### Operations

Like vectors, we can perform arithmetic on matrices. The standard arithmetic operators work element-wise:
- `+` elementwise addition
- `-` elementwise subtraction
- `*` elementwise multiplication
- `/` elementwise division

One can also perform standard matrix multiplication by using the `%*%` operator.

There are also ways to sum the columns and rows by using the `colSums()` and `rowSums()` methods.
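
The difference between `*` and `%*%` is easy to mix up, so here is a small sketch with two made-up 2 x 2 matrices (`A` and `B` are illustrative names):

```r
A <- matrix(1:4, nrow = 2)              # columns are (1, 2) and (3, 4)
B <- matrix(c(2, 0, 0, 2), nrow = 2)    # 2 times the identity matrix

A * B      # element-wise product: only the diagonal survives (2 and 8)
A %*% B    # matrix product: every entry of A is doubled

colSums(A)   # 3 7
rowSums(A)   # 4 6
```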

### Naming matrices

Just like vectors, it's a good idea to name our matrices. We can do this easily by naming both the rows and the columns using `rownames()` and `colnames()`: we pass in the matrix we want to name and assign the names we want to give the rows and columns.

### Selecting matrix elements

Again, this is similar to vectors: we use square brackets to select elements of our matrix. However, this time we must be mindful that there are two dimensions, so we must supply both a row and a column to select from. This would be something like `our_matrix[1,2]` to get a single element, `our_matrix[1:3,3]` to get the third element of each of the first 3 rows, and `our_matrix[,1]` to get the whole first column (that is, the first element of every row).
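
Here is a minimal sketch of these selections on a made-up 3 x 4 matrix. The row and column names are hypothetical, set through the `dimnames` argument of `matrix()`.

```r
# Rows are filled as 1 2 3 4 / 5 6 7 8 / 9 10 11 12
m <- matrix(1:12, nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b", "c", "d")))

m[1, 2]        # single element: 2
m[1:3, 3]      # third column of rows 1 to 3: 3 7 11
m[, 1]         # whole first column: 1 5 9
m[2, ]         # whole second row: 5 6 7 8
m["r2", "d"]   # selection by name: 8
```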



In [26]:
# Creating a matrix with 4 rows, we have 8 elements so we will make 2 columns in each row, each row counting upwards.
my_matrix <- matrix(1:8, byrow=TRUE, nrow=4)
#my_matrix

# We could fill this by column, meaning that each column will contain 1-4 and 5-8
my_matrix <- matrix(1:8, byrow=FALSE, nrow=4)
#my_matrix

week1 <- c(13, 65, 32, 0, 91, 12, 14.3)
week2 <- c(22.5, 31, 65.35, 22, 0, 16, 120)
week3 <- c(0, 81, 14, 16, 45, 0, 12)

row_names <- c("week1", "week2")
col_names <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

spending_matrix <- matrix(c(week1, week2), byrow=TRUE, nrow=2)
spending_matrix

# We can now name the matrix using our name vectors
rownames(spending_matrix) <- row_names
colnames(spending_matrix) <- col_names

# If we want to add another row, i.e. another week, we can use the rbind() method
spending_matrix <- rbind(spending_matrix, week3)
spending_matrix

# We could see the average amount spent on a Sunday
mean(spending_matrix[,7])

# Or the total spent each week
weekly_totals <- rowSums(spending_matrix)
weekly_totals

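| [,1] | [,2] | [,3] | [,4] | [,5] | [,6] | [,7] |
|---|---|---|---|---|---|---|
| 13.0 | 65 | 32.0 | 0 | 91 | 12 | 14.3 |
| 22.5 | 31 | 65.35 | 22 | 0 | 16 | 120.0 |


| | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
|---|---|---|---|---|---|---|---|
| week1 | 13.0 | 65 | 32.0 | 0 | 91 | 12 | 14.3 |
| week2 | 22.5 | 31 | 65.35 | 22 | 0 | 16 | 120.0 |
| week3 | 0.0 | 81 | 14.0 | 16 | 45 | 0 | 12.0 |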


## Factors

Factors are an important data structure in R, used for *categorical variables*. There are two types of categorical variables:
- nominal categorical variables
- ordinal categorical variables

A *nominal categorical variable* is one that does not have an implied order: it is impossible to say that one category has a greater value than another. For example, we could have a `sex_vector` with the categories `"M"` and `"F"`, where neither category has a greater importance or value than the other.

On the other hand, an *ordinal categorical variable* has a natural order, where some values are lower or higher than others. For example, we could have a `temperature_vector` with the categories `"Low"`, `"Medium"` and `"High"`; these clearly have a natural order, which we have to specify when defining the factor. We do this by passing in the parameter `ordered=TRUE` (R also accepts the abbreviation `order=TRUE` through partial argument matching) together with `levels=c(...)`, which lists the levels from lowest to highest.

### Levels

Levels are often used to clarify bits of data. For example, I have a hypothetical vector of the sexes of participants in a survey; if I want to change the abbreviations to Male and Female, I can assign new names to the levels. If no levels are specified when the factor is made, R automatically assigns them in alphabetical order, so the new names must be given in that same order.

### Summaries

A very useful tool for figuring out what data we have is the `summary()` function, which gives us a summary of the contents of a variable. By identifying the levels of a factor, R is able to tell us how many of each category we have!
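
To tie levels, ordering and summaries together, here is a small sketch with made-up clothing sizes. The data and names are hypothetical; `labels` is an argument of `factor()` that renames the levels at creation time.

```r
sizes <- c("S", "L", "M", "S", "S")   # hypothetical survey answers

# Unordered factor: levels default to sorted order
size_factor <- factor(sizes)
levels(size_factor)    # "L" "M" "S"
summary(size_factor)   # counts per level: L 1, M 1, S 3

# Ordered factor with an explicit level order and friendlier labels
size_ordered <- factor(sizes, ordered = TRUE,
                       levels = c("S", "M", "L"),
                       labels = c("Small", "Medium", "Large"))

# The second answer ("L") ranks above the first ("S")
size_ordered[2] > size_ordered[1]   # TRUE
```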

In [8]:
# Creating a factor of nominal categorical variables, there is no natural order
sex_vector <- c("M", "F", "M", "F", "F", "F", "M")
sex_factor <- factor(sex_vector)

# R automatically assigns levels by alphabetical order
levels(sex_factor)

levels(sex_factor) <- c("Female","Male")
sex_factor

# sex_factor[1] < sex_factor[2] returns NA with a warning because there is no order to the levels!
summary(sex_factor)

# Creating a factor of ordinal categorical variables
# A hypothetical vector of productivity of different days
productivity_vector <- c("medium", "low", "high", "medium")
productivity_factor <- factor(productivity_vector,
                              ordered=TRUE,
                              levels=c("low","medium","high"))

# Was day 3 more productive than day 1?
day1 <- productivity_factor[1]
day3 <- productivity_factor[3]

day3 > day1


## Data Frames

A data frame is another structure in R, used for storing a number of different data types. Whereas a matrix is best suited to data that is all of the same type, a data frame is used for things like market research surveys, which may require data of many different types. Each column holds one variable (of a single data type) and each row holds one observation.

R has some built-in data frames. One such data frame is `mtcars`, which holds data on 32 automobiles, looking at things like horsepower and weight. Typing `?mtcars` opens the documentation for the data set, and typing `mtcars` prints the data frame itself.

### head() and tail()

Because we are often working with a lot of data, it can be useful to look at just the first or last few observations in the data frame; to do this we use the `head()` and `tail()` functions respectively.

### Structure of a Data Frame

To probe the structure of a given data frame we can use the `str()` function, which tells us the number of observations (32 for `mtcars`), the number of variables (11) and the data type of each variable.
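
A short sketch of these inspection functions on the built-in `mtcars` data frame. The second argument to `head()` and `tail()` sets how many rows to show.

```r
dim(mtcars)        # 32 rows (observations) and 11 columns (variables)
names(mtcars)      # the variable names

head(mtcars, 3)    # first three observations
tail(mtcars, 3)    # last three observations

str(mtcars)        # compact overview: type and first values of each variable
```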

### Selecting values

Just like matrices and vectors, we can use the `[]` notation to select elements. Again, if we want to get all of the values from a column or row we leave the other entry blank, e.g. `df[1,]` will select all of the columns of the first row and `df[,1]` will select all of the entries of the first column.

We can also select columns and rows by name. For example, in the code below we have a data frame consisting of information about planets in our solar system; to print out the rotation of all of the planets we can simply type `planets_df[,"rotation"]`.

Sometimes we want to print out a specific column, that is, one piece of data about each row. We could use the bracket notation, e.g. `planets_df[,"rings"]`; however, there is a shortcut where we use the dollar sign to select the entire column, as in `planets_df$rings`.

We can also use the function `subset()` to generate a subset of our data frame. This is particularly useful for selecting only the rows that satisfy some condition; for example, to select all planets larger than Earth we could write `subset(planets_df, subset=diameter>1)`.
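
The selection styles above can be compared side by side in a small sketch. The `pets` data frame and its values are made up for illustration.

```r
# A small hypothetical data frame (made-up values)
pets <- data.frame(
  name   = c("Rex", "Tom", "Bubbles"),
  weight = c(30.5, 4.2, 0.1),
  indoor = c(FALSE, TRUE, TRUE)
)

pets[1, ]             # first row, all columns
pets[, "weight"]      # one column, selected by name
pets$weight           # the same column using the $ shortcut
pets[pets$indoor, ]   # rows where indoor is TRUE

# Rows with a weight above 1
heavy <- subset(pets, subset = weight > 1)
heavy

# subset() can also keep only some columns via its select argument
subset(pets, subset = weight > 1, select = c(name, weight))
```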

### Sorting 

Sorting is essential to data science. A very useful function in R is `order()`, which returns the positions of the elements of a vector in sorted order (ascending by default). Indexing with this result reorders a vector, or the rows of a data frame; for example, we can print out the planets from smallest to largest.
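
Because `order()` returns positions rather than sorted values, a tiny sketch helps. The vector and data frame here are made up for illustration.

```r
x <- c(30, 10, 20)

order(x)                          # 2 3 1: the smallest value sits at position 2
x[order(x)]                       # 10 20 30
x[order(x, decreasing = TRUE)]    # 30 20 10

# The same idea reorders the rows of a data frame
df <- data.frame(id = c("a", "b", "c"), value = x)
sorted_df <- df[order(df$value), ]
sorted_df
```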

In [12]:
# We can have a look at the mtcars dataframe
#?mtcars

#mtcars will print the whole dataframe

# These will print the first and last few observations
#head(mtcars)
#tail(mtcars)

# Probe the structure of the dataframe
#str(mtcars)

# We can also create a dataframe of our own
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
planets_df
# Getting various data

# Print out the rotation of Jupiter (row 5, column 4)
planets_df[5,4]

# Print out the type of planet of the first 4 planets
planets_df[1:4, "type"]

# Prints out all the planets with rings
planets_df[planets_df$rings,]

# Prints out all planets with a longer rotation period than Earth (rotation > 1) using subset
subset(planets_df, subset=rotation>1)

# Prints out smallest to largest planets
planets_df[order(planets_df$diameter),]

| name | type | diameter | rotation | rings |
|---|---|---|---|---|
| Mercury | Terrestrial planet | 0.382 | 58.64 | FALSE |
| Venus | Terrestrial planet | 0.949 | -243.02 | FALSE |
| Earth | Terrestrial planet | 1.0 | 1.0 | FALSE |
| Mars | Terrestrial planet | 0.532 | 1.03 | FALSE |
| Jupiter | Gas giant | 11.209 | 0.41 | TRUE |
| Saturn | Gas giant | 9.449 | 0.43 | TRUE |
| Uranus | Gas giant | 4.007 | -0.72 | TRUE |
| Neptune | Gas giant | 3.883 | 0.67 | TRUE |


| | name | type | diameter | rotation | rings |
|---|---|---|---|---|---|
| 5 | Jupiter | Gas giant | 11.209 | 0.41 | TRUE |
| 6 | Saturn | Gas giant | 9.449 | 0.43 | TRUE |
| 7 | Uranus | Gas giant | 4.007 | -0.72 | TRUE |
| 8 | Neptune | Gas giant | 3.883 | 0.67 | TRUE |


| | name | type | diameter | rotation | rings |
|---|---|---|---|---|---|
| 1 | Mercury | Terrestrial planet | 0.382 | 58.64 | FALSE |
| 4 | Mars | Terrestrial planet | 0.532 | 1.03 | FALSE |

| | name | type | diameter | rotation | rings |
|---|---|---|---|---|---|
| 1 | Mercury | Terrestrial planet | 0.382 | 58.64 | FALSE |
| 4 | Mars | Terrestrial planet | 0.532 | 1.03 | FALSE |
| 2 | Venus | Terrestrial planet | 0.949 | -243.02 | FALSE |
| 3 | Earth | Terrestrial planet | 1.0 | 1.0 | FALSE |
| 8 | Neptune | Gas giant | 3.883 | 0.67 | TRUE |
| 7 | Uranus | Gas giant | 4.007 | -0.72 | TRUE |
| 6 | Saturn | Gas giant | 9.449 | 0.43 | TRUE |
| 5 | Jupiter | Gas giant | 11.209 | 0.41 | TRUE |


## Lists 

Lists are another type of data structure in R. They can encapsulate many other data types, which makes them very useful for storing a wide variety of data; for example, a single list can easily store a vector, a matrix and a data frame.
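
As a minimal sketch, the `list()` function builds a list from named components. The component names below are made up for illustration, and `mtcars` is the built-in data frame.

```r
# A list holding a vector, a matrix and a data frame
my_list <- list(numbers = 1:3,
                grid = matrix(1:4, nrow = 2),
                cars = head(mtcars, 2))

length(my_list)   # 3 components
names(my_list)    # "numbers" "grid" "cars"

# Components can be pulled out with $ or with double square brackets
my_list$numbers             # 1 2 3
my_list[["grid"]][2, 1]     # row 2, column 1 of the matrix: 2
class(my_list[["cars"]])    # "data.frame"
```

Note that single square brackets, as in `my_list["numbers"]`, return a shorter list rather than the component itself.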