# Introduction to R coding
 

This notebook helps you get used to the R environment and carry out basic tasks such as reading data and making simple plots.



## Installation of libraries and necessary software
Install the necessary libraries (only needed once) by executing (shift-enter) the following cell:


In [None]:
install.packages("MASS", repos='http://cran.us.r-project.org')


## Loading data and libraries
This requires that the installation above has finished without errors.
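
Before loading, it can help to check whether the package is actually available. A minimal sketch, assuming only base R; the variable name `has_mass` is made up for this illustration:

```r
# requireNamespace() returns TRUE or FALSE instead of stopping with an error
has_mass <- requireNamespace("MASS", quietly = TRUE)
if (!has_mass) {
  message("MASS is not installed yet; run the install.packages() cell first.")
}
```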

In [None]:
library("MASS")



### Exercise 1
Answer the following questions about the R framework



#### Add your answers here
(double-click here to edit the cell)

##### Question I:  <u>What is the difference between the commands ```install.packages``` and ```library```?</u>

_Answer_

##### Question II:  <u>What is an object in R and what can it contain?</u>

_Answer_

##### Question III: <u>A ```data.frame``` is one of the most important R objects. Explain how rows, columns, rownames and colnames correspond to what you see in an Excel sheet:</u>

_Answer_

##### Question IV:  <u>What is the difference between an object and a function?</u>

_Answer_



### Exercise 2

Data often contain missing values, for example when an experiment did not deliver a measurement for a gene in a specific sample.
We will count the number of missing values to estimate how well an experiment went.

Load the data frame ```Pima.tr2``` from the MASS library.
Display the dimensions of the data frame by executing the code cell below.
Count the number of missing values for each column. To do so, add R code that uses the functions ```is.na()``` and ```colSums()```.

Use ```table()``` on the output of ```rowSums()``` to count how many rows have each number of missing values.
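
The idea can be tried first on a small made-up data frame before applying it to ```Pima.tr2```. This is only a sketch; the data frame `toy` and all variable names are hypothetical:

```r
# Toy data frame (made-up values) with some missing entries
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, 2, 3, NA),
                  c = c(1, 2, 3, 4))

na_mask <- is.na(toy)            # logical matrix, TRUE where a value is missing
na_per_col <- colSums(na_mask)   # TRUE counts as 1, so this counts NAs per column
na_per_row <- rowSums(na_mask)   # number of NAs in each row
row_counts <- table(na_per_row)  # how many rows have 0, 1, 2, ... missing values

na_per_col
row_counts
```

Because ```is.na()``` returns a logical matrix, summing over it counts the missing values.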


In [None]:
?colSums

In [None]:
data(Pima.tr2)
# the data.frame is in the Pima.tr2 object
dim(Pima.tr2)
# add your code here:


### Exercise 3

We will now use standard statistical descriptors on different data sets. Visual aids like histograms help to estimate whether these descriptors describe the data correctly.

Use the functions ```mean()``` and ```range()``` to find the mean and range of:
- the numbers 1, 2, ..., 21
- a sample of 50 random values generated from a normal distribution with mean 0 and variance 1 using ```rnorm(50)```. Repeat several times with a new set of 50 random numbers to get a feeling for random numbers.
- the columns ```height``` and ```weight``` in the data frame ```women```

Repeat all of the above, now with the functions ```median()```, ```sum()``` and ```sd()```.
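
To see how these descriptors can behave differently, one can apply them to a tiny made-up vector that contains one extreme value. This is a sketch for illustration only; the vector `v` is not part of the exercise data:

```r
# Hypothetical sample with one extreme value
v <- c(1, 2, 3, 4, 100)

mean(v)    # 22, pulled upwards by the extreme value
median(v)  # 3, hardly affected by the extreme value
range(v)   # 1 100
sum(v)     # 110
sd(v)      # spread around the mean
```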



In [None]:
x <- 1:21

y <- rnorm(50)

z1 <- women$height
z2 <- women$weight
hist(women$weight)
mean(women$weight)

##### Question I:  <u>Which are good descriptors for the different samples?</u>

_Answer_

##### Question II:  <u>What is the most accurate way to describe normally distributed data?</u>

_Answer_

### Exercise 4
Now we will look into data that cannot be accurately described by standard visualization or descriptors. However, data transformation can often be used to achieve a more appropriate summary and visualization.

Load the dataset ```mammals```, which is part of the package MASS. Plot ```brain``` versus ```body```. Additionally, make the same plot by accessing the columns directly. Try to visualize the data on a logarithmic scale using the ```log``` argument of the ```plot``` function.
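
The different ways of calling ```plot``` can be sketched on a made-up data frame. The data frame `toy` and its columns `u` and `v` are hypothetical and only serve to show the syntax:

```r
# Made-up data frame whose values span several orders of magnitude
toy <- data.frame(u = c(1, 10, 100, 1000, 10000),
                  v = c(5, 40, 300, 2500, 20000))

# Formula interface: the column names are looked up in `data`
plot(v ~ u, data = toy)

# Same plot, accessing the columns directly with `$`
plot(toy$u, toy$v)

# log = "x", "y" or "xy" selects which axes use a logarithmic scale
plot(toy$u, toy$v, log = "xy")
```

Note that a logarithmic axis requires strictly positive values.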


In [None]:
library(MASS)
data("mammals")
x <- mammals$body
y <- mammals$brain
plot(x,y)

##### Question I:  <u>Why is the logarithmic scale more suitable?</u>

_Answer_

##### Question II:  <u>What is the relation between x and y when they show a linear relationship on double-logarithmic scale (both axes on logarithmic scale)?</u>

_Answer_

### Exercise 5

We now turn to a more biological example with different categories. We will inspect the data for expected properties.

Load the data set ```genotype``` from the MASS library and read about it (```?genotype```). Sort the data by column ```Wt``` with the ```order``` function. Also sort the data by column ```Mother```. Then sort by both, first by ```Wt``` and then by ```Mother```.
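
Sorting with ```order``` can be sketched on a small made-up data frame. The data frame `toy` and its columns `group` and `value` are hypothetical; when ```order``` receives several arguments, later ones are used to break ties in earlier ones:

```r
# Made-up data frame with a categorical and a numeric column
toy <- data.frame(group = c("B", "A", "B", "A"),
                  value = c(3, 4, 1, 2))

# order() returns the row indices that put the column into ascending order
by_value <- toy[order(toy$value), ]

# Several keys: sort by group first, then by value within each group
by_both <- toy[order(toy$group, toy$value), ]

by_value
by_both
```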


In [None]:
library(MASS)
data(genotype)
A <- genotype[order(genotype$Wt),]


##### Question I:  <u>Do you see any relation between having a mother of the same genotype and the average weight gain?</u>

_Answer_

### Exercise 6

One can solve the same tasks in multiple ways, such as using a for loop or a while loop.

Look at the ```for``` loop below that prints each number of a vector on a separate line, with its square and cube alongside.

Look up ```help("while")```. Show how to use a ```while``` loop to achieve the same result.
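
As a reminder of the syntax, here is a ```while``` loop for a different, made-up task: doubling a number until it reaches at least 100. The variable names are hypothetical. Unlike ```for```, the loop variable has to be initialized and updated by hand:

```r
n <- 1       # value that is doubled in every iteration
steps <- 0   # counts how often the loop body ran

while (n < 100) {   # the condition is checked before every iteration
  n <- n * 2
  steps <- steps + 1
}

print(paste(n, steps))   # "128 7"
```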


In [None]:
vec <- 1:10
for (i in vec) {
  print(paste(i, i*i, i^3))
}


### Exercise 7

The ```paste``` function is very useful to build strings, as well as merge and display different values.
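
A typical use is building labels from a fixed prefix and a vector of numbers. This is a small sketch with made-up names; ```paste``` works element by element, and shorter arguments are recycled:

```r
# "gene" is recycled for every element of 1:3
labels <- paste("gene", 1:3, sep = "_")
labels     # "gene_1" "gene_2" "gene_3"

# Join the three labels into a single string
one_string <- paste(labels, collapse = ", ")
one_string # "gene_1, gene_2, gene_3"
```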

Run the commands below:



In [None]:
paste("Leo","the","lion")
paste("a","b")
paste("a","b", sep="")
paste(1:5)
paste(1:5, collapse="")


##### Question I:  <u>What do the arguments ```sep``` and ```collapse``` achieve (test by making your own examples)?</u>

_Answer_

### Exercise 8

We will now play with a function and increase its capabilities. You might have to read a bit more about how functions can be used in R.

The following function calculates the mean and standard deviation of a numeric vector.
```
MeanAndSd <- function (x) {
  av <- mean(x)
  sdev <- sd(x)
  c(mean=av, sd=sdev)
}
```

Modify the function so that:   
(a) the default is to use ```rnorm()``` to generate 20 random numbers;  
(b) if there are missing values, the mean and standard deviation are calculated for the remaining values.
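
Default arguments and the handling of missing values can be sketched on an unrelated, hypothetical helper. The function `RangeWidth` is made up for this illustration and is not part of the exercise:

```r
# Hypothetical helper: width of the range of a numeric vector
RangeWidth <- function(x = runif(10), na.rm = TRUE) {
  r <- range(x, na.rm = na.rm)  # c(min, max)
  r[2] - r[1]
}

RangeWidth()                            # uses the default: 10 random numbers
RangeWidth(c(1, NA, 5))                 # 4, the NA is ignored
RangeWidth(c(1, NA, 5), na.rm = FALSE)  # NA, the missing value propagates
```

A default such as `x = runif(10)` is evaluated anew for every call that omits the argument, so repeated calls give different results.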


In [None]:
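# Example: defaults to 100 random numbers and ignores missing values when calculating the mean.
# Note: sd() is still called without na.rm, so it returns NA if x contains missing values.
MeanAndSd <- function (x=rnorm(100), narm=TRUE) {
  av <- mean(x,na.rm=narm)
  sdev <- sd(x)
  c(mean=av, sd=sdev)
}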

sample <- c(rnorm(30),NA,rnorm(30))

MeanAndSd()
MeanAndSd(sample)
?mean
example(mean)

##### Question I:  <u>What are the expected values for the mean and standard deviation?</u>

_Answer_