<a href="https://colab.research.google.com/github/ARU-Bioinf-MSB-2020/week_1/blob/main/R_programming_language.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# R programming language

![picture](https://www.dropbox.com/s/g1thp9n3j3u1tpe/prof_jones.gif?dl=1)
<br>
*The importance of learning statistical analysis for yourself!*









## Overview
R is a popular language and environment that allows powerful and fast manipulation of data, offering many statistical and graphical options.

This course introduces R as a tool for statistics and graphics, with the main aim of becoming comfortable with the R environment. It focuses on entering and manipulating data in R and producing simple graphs. A few functions for basic statistics are briefly introduced, but statistical functions are not covered in detail. A variant call format (VCF) file is used as an example throughout to illustrate the immediate application to medical genetics.

## Audience
This course is aimed at students or early researchers beginning to use statistics and data visualisation in their studies/research who wish to learn more about the R programming language. No prior knowledge of R is required, but undergraduate level knowledge of genetics would be useful.

## What is R?
The official R project web page describes R as a 'language and environment for statistical computing and graphics'. It can be daunting if you haven't done any programming before, but it is worth taking some time to familiarise yourself with the R environment: once you have grasped some of the basics it can be a very useful tool. A wide variety of statistical functions come with the default install, and many other packages can be installed if required.

It is very quick and easy to produce graphs with default parameters for a quick view of your data and there are all manner of parameters that can be specified to customise your graphs. R is often used to perform analysis and produce graphs for publication.

### Good things about R
*   It's free
*   It works on all platforms
*   It can deal with much larger datasets than Excel for example
*   Graphs can be produced to your own specification
*   It can be used to perform powerful statistical analysis
*   It is well supported and documented

### Bad things about R
*   It can struggle to cope with extremely large datasets
*   The environment can be daunting if you don't have any programming experience 
*   It has a rather unhelpful name when it comes to googling problems (though you can use http://www.rseek.org/ or google ‘R help forum’ and try that instead).

## Data in VCF format
This course will make use of data in the form of a variant call format (VCF) file, which will be familiar to students and researchers in clinical bioinformatics. If this data format is unfamiliar to you, please complete the "[Understanding variant call format](https://www.ebi.ac.uk/training/online/courses/human-genetic-variation-introduction/variant-identification-and-analysis/understanding-vcf-format/)" tutorial from EBI.

# Basic commands in R
In this section we will go over some very basic commands that we can issue in R.

The console can work just like a calculator to do addition and division. 

Type `8 + 3` and run the code cell. It doesn't matter whether there are spaces between the values or not.

In [None]:
8 + 3

The answer is printed in the console as above. Now type `27/5` and run the cell.

In [None]:
27/5

These calculations have just produced output in the console - no values have been saved.
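The other common arithmetic operators follow the same pattern. The short sketch below is an illustration with arbitrary numbers rather than part of the original exercise:

```r
# Subtraction, multiplication and exponentiation
8 - 3        # 5
8 * 3        # 24
8 ^ 3        # 512

# Integer division and remainder (modulo)
27 %/% 5     # 5
27 %% 5      # 2

# Brackets control the order of evaluation
(8 + 3) / 5  # 2.2
```

As before, each result is only printed; nothing is stored.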

To save a value, assign it to a named data structure. This is done by drawing an arrow, `<-`. The arrow points from the data you want to store towards the name you want to store it under. For now we'll use x, y and z as names of data structures, though more informative names can be used, as discussed later in this section.

We will assign the calculation 8 + 3 to the data structure x.

If R has performed the command successfully you will not see any output, as the value of  8 + 3  has been saved to the data structure called x. You can access and use this data structure at any time and can print the value of x into the console.

In [None]:
x <- 8 + 3
x

Create another data structure called y and assign the value 3 to it. In this case we’ll draw the arrow the other way around.  Functionally this is the same as the first example and it’s up to you whether you do data `->` name or name `<-` data. 

In [None]:
3 -> y
y

Now that we know how to assign values to a data structure we can use `x` and `y` in a calculation.

In [None]:
x + y
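The result of a calculation can itself be saved rather than just printed. A minimal sketch, assuming `x` and `y` hold the values assigned above and using `z` as the name for the result:

```r
x <- 8 + 3   # x holds 11
y <- 3       # y holds 3

z <- x + y   # store the sum instead of only printing it
z            # 14

z * 2        # stored values can be reused in later calculations: 28
```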

#### Question: Modify the examples above to change the value of x to 21 and y to 37. What is the result of `x + y`?

Warning: R is case sensitive so x and X are not the same. If you try to print the value of X out into the console an error will be returned as X has not been used so far in this session.

In [None]:
X + y

Using an equals sign also works in most situations, e.g. `x = 8+3`, but `<-` is generally preferred. Be careful, however: if you enter a space between the less-than and minus characters that make up the assignment operator, you are instead asking R a question with a logical answer. For example, `x < - 5` asks whether x is less than -5.

In [None]:
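# assign 3 to x using an equals sign
x = 3
# with a space inside the arrow this is a comparison: is x less than -5?
x < - 5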
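To make the difference concrete, the sketch below places the assignment and the comparison side by side. The values are arbitrary and chosen only for illustration:

```r
x <- 3      # assignment: x now holds the value 3
x < -5      # comparison: is 3 less than -5? FALSE
x < - 5     # the same comparison, because of the space
x           # x is unchanged, still 3

x<-5        # no spaces at all: this is an assignment, so x becomes 5
x           # 5
```

Putting a space on each side of `<-` makes the intent easier to see at a glance.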

Data structures can be named almost anything you like, though names should start with a letter. Informative, descriptive names are useful, and this often involves using more than one word. Provided there are no spaces between your words you can join them in various ways: with dots, underscores or capital letters. Older versions of the Google R style guide recommended joining words with a full stop.


Joining names by capitalising words is generally used for function names (don't worry about what these are for now). Suffice to say it is not recommended to create names in the format `genomeSize`; use `genome_size` or preferably `genome.size`. Numbers can be incorporated into the name as long as the name does not begin with a number.

In [None]:
genome.size <- 112
genome.size
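The naming rules can be checked directly in R. The sketch below uses the built-in `make.names()` function, which returns a name unchanged if R considers it valid and alters it otherwise. The name `chromosome21.length` and its value are hypothetical and used only for illustration:

```r
# Three ways of joining words in a name; all are accepted by R
genome.size <- 112
genome_size <- 112
genomeSize  <- 112

# Numbers are allowed, just not as the first character
chromosome21.length <- 46.7   # hypothetical value for illustration

# make.names() shows what R considers a syntactically valid name
make.names("genome.size")     # "genome.size" (already valid)
make.names("2genome")         # "X2genome"    (cannot start with a number)
make.names("genome size")     # "genome.size" (spaces are not allowed)
```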

R has some built-in data structures. The most popular for demonstrating statistical concepts are `lynx`, `mtcars` and `iris`. We will look at R. A. Fisher's iris dataset in detail when we explore machine learning.
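These data structures are available as soon as R starts, so they can be inspected straight away. A quick sketch using `head()`, `names()` and `nrow()`, which are standard R functions not otherwise introduced in this section:

```r
# First three rows of two of the built-in data sets
head(mtcars, 3)
head(iris, 3)

# Column names of the iris data set
names(iris)

# Number of rows (flowers measured) in iris
nrow(iris)      # 150
```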


## Functions

Most of the work that you do in R will involve the use of functions.  A function is simply a named set of code to allow you to manipulate your data in some way.  There are lots of built in functions in R and you can also write your own if there isn’t one which does exactly what you need.  


Functions in R take the format `function.name(parameter 1, parameter 2, …, parameter n)`. The brackets are always needed. Some functions are very simple and the only parameter you need to pass to the function is the data that you want the function to act upon.


For example, the function `dim(data.structure)` returns the dimensions (i.e. the numbers of rows and columns) of the data structure that you insert into the brackets, and it will not accept any additional arguments/parameters.
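In contrast to `dim()`, many functions accept more than one parameter. A minimal sketch using the built-in `round()` function, chosen here purely for illustration:

```r
# One parameter: the value to round (digits defaults to 0)
round(3.14159)               # 3

# Two parameters, passed by position
round(3.14159, 2)            # 3.14

# The same call with the second parameter named explicitly
round(3.14159, digits = 2)   # 3.14
```

Naming a parameter, as in the last call, makes the code easier to read and means the order of the parameters no longer matters.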


Try the `dim()` function out with the `lynx` data structure we encountered in the previous section.

In [None]:
# return the dimensions of the lynx data structure
dim(lynx)

NULL
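The `NULL` result is expected: `lynx` is a time series, a one-dimensional data structure, so it has a length but no rows and columns. A brief sketch contrasting it with the built-in `mtcars` data frame, which is used here only for comparison:

```r
class(lynx)     # "ts": a time series
dim(lynx)       # NULL, no rows or columns
length(lynx)    # 114 yearly observations

class(mtcars)   # "data.frame": a table
dim(mtcars)     # 32 rows and 11 columns
```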

So how do we cope with simple functions that will only accept a single parameter?

For example:

The square root and log10 functions each accept just one parameter as input. So if we wanted to find the log10 of the square root of 245, one way is to calculate it in stages: first calculate `sqrt(245)`, which is 15.65248, and then `log10(15.65248)`.

We can use data structures to keep it tidier. 
The following example assigns the `sqrt(245)` to x, and then calculates the log10 of x.


In [None]:
x <- sqrt(245)
log10(x)
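An alternative to working in stages is to nest one function call inside another, so that the inner result is passed straight to the outer function. The sketch below compares the two approaches; the names `staged` and `nested` are made up for this illustration:

```r
# Step by step, as above
x <- sqrt(245)
staged <- log10(x)

# Nested: sqrt(245) is evaluated first, then passed to log10()
nested <- log10(sqrt(245))

staged
nested
identical(staged, nested)   # TRUE
```

Nesting saves a line, while the staged version keeps the intermediate value available for inspection.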

#### Question: Repeat the last example but this time starting with the value 356

If you’re not sure which parameters to pass to a function, or what additional parameters may be valid, you can use the built-in help. For example, to access the help for the `read.table()` function you could run `?read.table`.

In [None]:
?read.table

...and you would be taken to the help page for the function.
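When only a quick reminder of a function's parameters is needed, the parameter list can also be printed without opening the full help page. A short sketch using the standard `formals()` and `args()` functions, which are not otherwise covered in this section:

```r
# List the parameter names that read.table() accepts
names(formals(read.table))

# Show the parameters together with their default values
args(read.table)

# help() with a quoted name is equivalent to the ? shortcut:
# help("read.table")
```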

#### Question: Repeat the last example to find a help page but this time for the function `dim`.