# Introduction to R

To run R inside Jupyter notebooks you may need to install the IRkernel from within R, see https://irkernel.github.io/installation/

R is a programming language for statistical computing and data visualization. It has been adopted in the fields of data mining, bioinformatics, and data analysis.

R is an interpreted language and it comes with a command line. You can install R on its own or together with RStudio, which provides a richer user experience. In the lectures I will be switching between Jupyter notebooks, the R console/command line, and a file editor in which we will be writing executable scripts.

Some of the most basic structures of R are:
* vectors 
* matrices 
* data frames, typically used to store data containing mixtures of quantitative and qualitative data
* lists

Many of the operations in R are **vectorised**: they can be performed directly over vectors or matrices **without the use of loops.**



## Vectors and matrices

In [None]:
#create a vector with elements: 1, 2, 3, 4
c(1,2,3,4)

In [None]:
#a faster way to do the above is
1:4

In [None]:
#another vector; mixing types coerces every element to a character string
c("a",1,2)

In [24]:
#store the vector in a variable 
x<-c(1,2,3,4)
print(x)

[1] 1 2 3 4


So let's see what *vectorised* operations look like.

In [27]:
#what do you think the following commands do:
x + 2
x*3
x^2
x*x
x%*%x

     [,1]
[1,]   30


All of them operate in a vectorised manner (the last one, ```x%*%x```, computes the inner product of x with itself and returns a 1 x 1 matrix). In a classical programming language, implementing each of these operators would require iterating over the elements of the vector, i.e. writing loops. Not in R, nor in Python libraries such as numpy and PyTorch. Vectorised operations run in optimised code that can exploit parallelism in CPUs and GPUs, so they typically execute much faster than explicit loops.
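
To make the contrast concrete, here is a minimal sketch that computes the squares and the inner product once with an explicit loop and once in vectorised form. The variable names (`squaresLoop`, `squaresVec`, `innerSum`) are only illustrative.

```r
x <- c(1, 2, 3, 4)

# Explicit loop: square the elements one at a time
squaresLoop <- numeric(length(x))
for (i in seq_along(x)) {
    squaresLoop[i] <- x[i]^2
}

# Vectorised: a single expression over the whole vector
squaresVec <- x^2

print(squaresLoop)
print(squaresVec)

# x %*% x is the inner product, i.e. the sum of the elementwise products
innerSum <- sum(x * x)
print(innerSum)
print(drop(x %*% x))  # drop() turns the 1 x 1 matrix into a plain number
```

Both versions give the same values; the vectorised one is shorter and leaves the iteration to R.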

In [None]:
#how about this operation, what happens here?
#note that our two vectors are of different length
y <- c(1,2)
print(x)
print(y)
print(x*y)

y<-c(1,2,3)
print(x)
print(y)
print(x*y)

What we have above are examples of what we can call **recycling**. When an operation on two vectors requires them to have the same length, R automatically recycles, or repeats, the shorter one until it is long enough to match the longer one. If the length of the longer vector is not a multiple of the length of the shorter vector, R still computes a result but issues a warning such as:

``` "longer object length is not a multiple of shorter object length" ```
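
As a rough sketch of what recycling does, we can build the repeated vector by hand with `rep` and compare. The handler below only catches the warning and prints its message; the names `yRecycled`, `y3` and `res` are illustrative.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 2)

# Recycling y amounts to repeating it until it has the length of x
yRecycled <- rep(y, length.out = length(x))
print(yRecycled)
print(x * y)
print(x * yRecycled)

# With lengths 4 and 3 R still returns a result, but raises a warning
y3 <- c(1, 2, 3)
res <- withCallingHandlers(
    x * y3,
    warning = function(w) {
        message("warning caught: ", conditionMessage(w))
        invokeRestart("muffleWarning")
    }
)
print(res)
```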

In [None]:
#Let's make a matrix now
X <- matrix(1:12, nrow = 3, ncol = 4)
print(X)
print(X*X)
print(X%*%t(X))
dim(X) #gives the dimensions of X 

What is the difference between the two operations?

```
X*X
X%*%t(X)
```

The first one does elementwise multiplication of $X$ with itself, while the second does **matrix multiplication** between $X$ and its transpose.
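
A small sketch that checks this on the matrix from above: the elementwise product keeps the shape of the matrix, while each entry of the matrix product is the inner product of two of its rows. The names `E` and `G` are only illustrative.

```r
X <- matrix(1:12, nrow = 3, ncol = 4)

E <- X * X        # elementwise: same shape as X (3 x 4)
G <- X %*% t(X)   # matrix product: (3 x 4) times (4 x 3) gives 3 x 3

print(dim(E))
print(dim(G))

# Entry [1, 2] of the matrix product is the inner product of rows 1 and 2 of X
print(G[1, 2])
print(sum(X[1, ] * X[2, ]))
```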

Let's now try to divide the elements of each row by the sum of the elements of that row. (What will this operation do to each row?)

In [None]:
X <- matrix(c(1,2,3,10,20,30,100,200,300), nrow = 3, ncol = 3, byrow=T)
X
rowSums(X)
X/rowSums(X)
rowSums(X/rowSums(X)) #a check to see whether we get the correct result

What happens above is that the ```rowSums(X)``` vector is recycled down each column of X (a matrix is stored column by column), so its i-th element divides every element of row i, giving us the desired result.
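
One way to convince ourselves of this is to build the recycled divisor explicitly as a matrix and compare. This is only a sketch; `rs` and `R` are illustrative names.

```r
X <- matrix(c(1,2,3,10,20,30,100,200,300), nrow = 3, ncol = 3, byrow = TRUE)
rs <- rowSums(X)

# matrix() also fills column by column, so every column of R is a copy of rs
# and R[i, j] equals rs[i]
R <- matrix(rs, nrow = nrow(X), ncol = ncol(X))
print(R)

# Dividing by the vector and dividing by the explicit matrix agree
print(all.equal(X / rs, X / R))
```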

What about doing the same thing, but this time dividing the elements of each column by the sum of the elements of that column?

In [None]:
X
colSums(X)
X/colSums(X)
colSums(X/colSums(X)) #a check to see whether we get the correct result

As expected, **this does not work**: the vector is again recycled down the columns of the matrix, as in the previous example, so its i-th element divides row i instead of column i.
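
To see what the failed attempt actually computes, the sketch below compares it with a divisor matrix in which the column sums are laid out along the rows. The names `cs`, `wrong`, `C` and `right` are illustrative.

```r
X <- matrix(c(1,2,3,10,20,30,100,200,300), nrow = 3, ncol = 3, byrow = TRUE)
cs <- colSums(X)
print(cs)

# cs is recycled down each column, so cs[i] divides row i, not column i
wrong <- X / cs
print(colSums(wrong))   # the columns do not sum to 1

# Explicit divisor with cs[j] in every row of column j
C <- matrix(cs, nrow = nrow(X), ncol = ncol(X), byrow = TRUE)
right <- X / C
print(colSums(right))   # now every column sums to 1
```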

One solution is to take the transpose of X, repeat the process that we know works, and then transpose the result.


In [None]:
trX<-t(X)
rowSums(trX)
trX/rowSums(trX)
t(trX/rowSums(trX))
colSums(t(trX/rowSums(trX))) #a check to see whether we get the correct result

Another solution is to use the ```sweep``` command, which controls how the vector is recycled over the matrix, i.e. over the rows or the columns.

In [None]:
sweep(X,2,colSums(X),FUN="/") # repeats the vector colSums(X) over the 
                              # lines of the X matrix and
                              # applies the / operation
                              # basically normalises columns to 1
colSums(sweep(X,2,colSums(X),FUN="/"))

In [None]:
sweep(X,1,rowSums(X),FUN="/") # repeats the vector rowSums(X) over the 
                              # columns of the X matrix and
                              # applies the / operation
                              # basically normalises rows to 1
rowSums(sweep(X,1,rowSums(X),FUN="/"))
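
The ```FUN``` argument of ```sweep``` is not limited to division. As a small sketch, the same pattern with subtraction centres every column around zero; the name `centred` is illustrative.

```r
X <- matrix(c(1,2,3,10,20,30,100,200,300), nrow = 3, ncol = 3, byrow = TRUE)

# Subtract the mean of each column from the elements of that column
centred <- sweep(X, 2, colMeans(X), FUN = "-")
print(centred)
print(colMeans(centred))  # the mean of every column is now 0
```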

These operations will come in handy when we revisit the joint and marginal probability distributions.

## Lists ##

Lists are complex objects that can contain **named elements** of arbitrary structure.

In [None]:
x <- c(1,2,3)
y <- c("a","b")
z <- 1:10
X <- matrix(c(1,2,3,10,20,30,100,200,300), nrow = 3, ncol = 3, byrow=T)

aList <- list(num=x, text=y, sequence=z, matr=X)
aList

#we can access each one of the fields of the list using their names as follows
aList$num
aList$text
aList$sequence

#set the value of a field
aList$num <- 4

aList

#access the names of the fields of the list
attributes(aList)
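
Besides ```$```, list elements can be reached in a few other ways. The sketch below uses a smaller list than the one above; the field name `flag` is made up for the example.

```r
aList <- list(num = c(1, 2, 3), text = c("a", "b"))

print(names(aList))        # the field names only
print(aList[["num"]])      # same element as aList$num
print(aList[[2]])          # access by position

# Single brackets return a sub-list, not the element itself
print(class(aList["num"]))
print(class(aList[["num"]]))

# Add a new field and remove an existing one
aList$flag <- TRUE
aList$text <- NULL
print(names(aList))
```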

## Loops

R provides different loop structures but for most operations involving matrices and vectors we will not use them, because such operations are typically vectorised.

In [None]:
for(i in 1:10){print(i)}

for(i in c("a","b","c")){print(i)}

X <- matrix(c(1,2,3,10,20,30,100,200,300), nrow = 3, ncol = 3, byrow=T)
print(X)
d<-dim(X)
#we will NOT be using such loops
for(row in 1:d[1]){
    for(col in 1:d[2])
        print(X[row,col])
}

Looping with the ```apply``` function

In [None]:
#lets compute the maximum element per line of a matrix
#How would you do it?
#In R we will use a function called apply

X <- matrix(c(1,2,3,10,20,30,100,200,300,1000,2000,3000), nrow = 3, ncol = 4, byrow=T)
X

apply(X,1,max) #the 1 argument indicates that the max function is applied over each row of the matrix

apply(X,2,max) #the 2 argument indicates that the max function is applied over each column of the matrix

apply(X,c(1,2),log) #the c(1,2) indicates that the function is applied over each element of the matrix, log computes the natural logarithm

apply(X,c(1,2),log, base=2) #if the function we apply takes additional arguments we set them, 
                            #here base=2 is an argument of the log



Related to the ```apply``` function are ```sapply``` and ```lapply```, which work over vectors and lists.
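
A short sketch of the difference between the two, on a made-up vector `v` and list `l`:

```r
v <- c(1, 4, 9)
l <- list(a = 1:3, b = c(10, 20), c = 5)

# lapply always returns a list, with one element per input element
print(lapply(l, length))

# sapply tries to simplify the result to a vector (or a matrix)
print(sapply(l, length))
print(sapply(v, sqrt))

# Additional arguments are passed on to the function, as with apply
print(sapply(v, log, base = 3))
```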

## Functions 

Let's write a simple function that raises x to the power y, x^y.

In [None]:
myPower <- function(x,y){
    x^y #we could also write return(x^y) but it is not necessary. R returns the last evaluated expression
}

myPower(2,3)

Most of the time we will be raising to the power of 2, so it would be nice to have a default value of 2 for y. R allows default values for arguments.

In [None]:
myPower <- function(x,y=2){
    x^y
}

myPower(3) #if we do not pass y it uses the default

myPower(3,3) #when we pass y it uses the value we pass

myPower(y=2,x=3) #what will be the result here?
                 #In R if we name the arguments we can pass them in any order we want. 

In [None]:
myPower <- function(x=2,y=2){
    x^y
}

myPower() #we can call it without any arguments if every argument has a default value

**Exercise:** 
* create a 3 x 4 matrix with elements 1 to 12 organised by rows
* compute the maximum per row
* compute the maximum per column
* normalise the elements of every row by the maximum element of the row
* do the same for the columns
* write a function that takes as input a matrix and returns by default the above row normalised matrix, unless we specify that we want the column normalised. 

In [10]:
# create a 3 x 4 matrix with elements 1 to 12 organised by rows
# matrix(data, nrow, ncol, byrow, dimnames)
X <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
X

# compute the maximum per row
maxPerRow<-apply(X,1,max)
maxPerRow

# compute the maximum per column
maxPerCol<-apply(X,2,max)
maxPerCol

# Sweep => https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sweep
# normalise the elements of every row by the maximum element of the row
sweep(X,1,maxPerRow,FUN="/")

# normalise the elements of every column by the maximum element of the column
sweep(X,2,maxPerCol,FUN="/")

# write a function that takes as input a matrix and returns by default the above row normalised matrix, unless we specify that we want the column normalised.
normaliseMatrix <- function(X, byRow=TRUE){
    if(byRow)
        sweep(X,1,apply(X,1,max),FUN="/")
    else
        sweep(X,2,apply(X,2,max),FUN="/")
}

normaliseMatrix(X)
normaliseMatrix(X, FALSE)

1,2,3,4
5,6,7,8
9,10,11,12

0.25,0.5,0.75,1
0.625,0.75,0.875,1
0.75,0.8333333,0.9166667,1

0.1111111,0.2,0.2727273,0.3333333
0.5555556,0.6,0.6363636,0.6666667
1.0,1.0,1.0,1.0

0.25,0.5,0.75,1
0.625,0.75,0.875,1
0.75,0.8333333,0.9166667,1

0.1111111,0.2,0.2727273,0.3333333
0.5555556,0.6,0.6363636,0.6666667
1.0,1.0,1.0,1.0


## Reading datasets and working with data frames ##

In [7]:
getwd()

Let's move to the directory where the data are.

In [12]:
#give your working directory
setwd("C:/Users/huniv/jnotebook/datasets")


Let's read the data and store it in an object of type *data frame*.
A data frame is a table that can store both quantitative and qualitative values.



In [13]:
myData <- read.table(file="iris.csv", header=T, sep = ",")
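
If you do not have ```iris.csv``` at hand, the following sketch reproduces the same steps in a self-contained way. It is an assumption of the example that we write R's built-in `iris` data frame to a temporary file first; its column names (`Sepal.Length`, ..., `Species`) differ from those in our file.

```r
# Write the built-in iris data to a temporary CSV file, to have something to read
tmp <- tempfile(fileext = ".csv")
write.csv(iris, file = tmp, row.names = FALSE)

# Same kind of call as above: first line is the header, fields separated by ","
d <- read.table(file = tmp, header = TRUE, sep = ",")

str(d)               # column names, types and first values
print(head(d, 3))    # the first three rows
print(dim(d))        # number of rows and columns
```

For a comma-separated file like this one, ```read.csv(tmp)``` should give the same result, since it is a wrapper around ```read.table``` with suitable defaults.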

In [14]:
myData

sepal_length,sepal_width,petal_length,petal_width,type.
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
5.1,3.5,1.4,0.2,Iris_setosa
4.9,3.0,1.4,0.2,Iris_setosa
4.7,3.2,1.3,0.2,Iris_setosa
4.6,3.1,1.5,0.2,Iris_setosa
5.0,3.6,1.4,0.2,Iris_setosa
5.4,3.9,1.7,0.4,Iris_setosa
4.6,3.4,1.4,0.3,Iris_setosa
5.0,3.4,1.5,0.2,Iris_setosa
4.4,2.9,1.4,0.2,Iris_setosa
4.9,3.1,1.5,0.1,Iris_setosa


In [15]:
#and lets take a look at a summary of the data
summary(myData)

  sepal_length    sepal_width     petal_length    petal_width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
    type.          
 Length:150        
 Class :character  
 Mode  :character  
                   
                   
                   

Let's see the names of the attributes (columns).

In [16]:
names(myData)


Let's see the dimensions of myData, the data frame that contains the data.

In [17]:
dim(myData)


Let's now access different elements of the myData data frame.


In [18]:
# Lets get the first line
myData[1,]

,sepal_length,sepal_width,petal_length,petal_width,type.
,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,5.1,3.5,1.4,0.2,Iris_setosa


In [19]:
#second column
myData[,2]

How do we get rows 1 to 5 and 10 to 15?


In [20]:
myData[ c(1:5,10:15), ]

,sepal_length,sepal_width,petal_length,petal_width,type.
,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,5.1,3.5,1.4,0.2,Iris_setosa
2,4.9,3.0,1.4,0.2,Iris_setosa
3,4.7,3.2,1.3,0.2,Iris_setosa
4,4.6,3.1,1.5,0.2,Iris_setosa
5,5.0,3.6,1.4,0.2,Iris_setosa
10,4.9,3.1,1.5,0.1,Iris_setosa
11,5.4,3.7,1.5,0.2,Iris_setosa
12,4.8,3.4,1.6,0.2,Iris_setosa
13,4.8,3.0,1.4,0.1,Iris_setosa
14,4.3,3.0,1.1,0.1,Iris_setosa


In general, which rows to get is specified before the "," and which columns after the ",":

``` myData[row(s), column(s)] ```
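
A few more indexing variants, sketched on R's built-in `iris` data frame so that the example runs without our CSV file (an assumption of this sketch; its column names differ from those of myData):

```r
d <- iris   # built-in data frame, used here instead of myData

print(d[1:3, c(1, 5)])                        # rows 1 to 3, columns 1 and 5
print(d[1:3, c("Sepal.Length", "Species")])   # the same columns, selected by name
print(head(d$Sepal.Width))                    # one column as a vector, via its name
print(d[-(1:147), ])                          # negative indices drop rows
```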

Now let's build a more elaborate filter: let's retrieve all instances
of type **Iris_virginica**, with all columns.

```myData[,5]=="Iris_virginica"``` returns a vector of TRUE/FALSE values of the
same length as the number of rows/instances in myData, and we can use
it to index the rows that we want to keep (TRUE) as follows.

In [21]:
myData[,5]=="Iris_virginica"

We will now use the TRUE/FALSE vector to index which rows to keep from our data frame.

In [22]:
myData [  myData[,5]=="Iris_virginica" ,   ]

,sepal_length,sepal_width,petal_length,petal_width,type.
,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
101,6.3,3.3,6.0,2.5,Iris_virginica
102,5.8,2.7,5.1,1.9,Iris_virginica
103,7.1,3.0,5.9,2.1,Iris_virginica
104,6.3,2.9,5.6,1.8,Iris_virginica
105,6.5,3.0,5.8,2.2,Iris_virginica
106,7.6,3.0,6.6,2.1,Iris_virginica
107,4.9,2.5,4.5,1.7,Iris_virginica
108,7.3,2.9,6.3,1.8,Iris_virginica
109,6.7,2.5,5.8,1.8,Iris_virginica
110,7.2,3.6,6.1,2.5,Iris_virginica
