
<p><img src="https://www.csc.fi/documents/10180/161914/CSC_2012_LOGO_RGB_72dpi.jpg/c65ddc42-63fc-44da-8d0f-9f88c54779d7?t=1411391121769" alt="Heading" style="float:right;width:100px;height:100px;"></p>


 # <font color="#7e8bd9">Introduction to R</font>


#### <font color="#7e8bd9"> Part of the Data analytics, AI and Machine Learning Summer School for CSC Summer Trainees</font>
<br/>

## Outline


* [What is R? And how to use this notebook?](#What)
* [Writing simple things in R](#Simple)
* [Vectors and functions in R](#Vectors)
* [Datatypes in R](#Datatypes)
* [Data frames in R](#Dataframes)
* [Missing values and error messages](#Missing)
* [How to learn more?](#How)
* [Extra examples and exercises](#Extra) 

## What is R? <a id='What'></a>

R is a programming language for statistical computing and graphics. R is available for free, under a free software licence called the [GNU General Public License](https://www.r-project.org/COPYING).

When we are programming, we are actually writing a problem for the computer to solve. The language that we use to describe the problem to the computer is called __code__. In this introductory session you will get to know the basics of the R language and write some __R code__.

R is a language and not a program like, for example, Excel. That is why we need an interface to use R. Today, we will write some R code using a Jupyter notebook. __Jupyter__ is a web based interface that is able to run R, but it can also do a lot of other things, which are not important right now. 

###### Some details for those that are interested

Because R is a language, it can be written in basically any text editor. However, some tools are more practical than others. The most common choice is RStudio (see [rstudio.com](https://www.rstudio.com/)). So, if you later want to start using R on your own computer, you can start by [downloading R](https://www.r-project.org/) and RStudio. Both are free. Today we will not download anything, and the code that is written is run "in the cloud".

## Let's begin!

### Running cells in Jupyter

This notebook consists of cells. For example, this textblock is a cell, that you can edit by double-clicking it (try this!). Now that you have entered the *edit mode*, you can *run* the cell by clicking the <i class="fa fa-step-forward"></i> Run above, or by using `Ctrl-Enter`.

You can edit everything in your notebook (for example, delete stuff). Also, you can add cells in the Insert-menu, run a cell in the Cell-menu and save and download this notebook in the File-menu. __Saving your work__ is possible by using checkpoints (go to the File-menu and click Save and Checkpoint).

There are two important types of cells: __Markdown__ and __Code__ cells. This cell is a markdown cell and it contains text. The cell below that has the marking `In[ ]` next to it is the first code cell. R code can be executed in code cells. You can write, for example, 1+1 in the cell below and run the cell. Test it!

You can erase the calculation, modify it, and <i class="fa fa-step-forward"></i> Run it again.

In [None]:
5+10

## Writing simple things in R <a id='Simple'></a>

### Calculations in R
It is possible to do calculations in R using familiar math symbols. The following math symbols are used as *arithmetic operators* in R:

| R code | Explanation       | Example    |
| -------|:---------:        | -----:     |
| `+`      | addition        | `1+1`        |
| `-`      | subtraction     |   `10-5`     |
| `*`      | multiplication  |    `10*2`    |
| `/`      | division        |    `10/2`    |
| `^`      | exponentiation  |    `10^2`    |
| `.`      | decimal point   | `3.14159`    |
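
The operators in the table can be tried out with a short sketch like the one below. The object name `result` is only an illustration and is not used elsewhere in this notebook.

```r
# Each expression is evaluated and its value is printed
1 + 1      # addition
10 - 5     # subtraction
10 * 2     # multiplication
10 / 2     # division
10^2       # exponentiation

# Parentheses control the order of operations;
# exponentiation is done before multiplication: 5 * 4
result <- (10 - 5) * 2^2
result     # 20
```

Try removing the parentheses and see how the value changes.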

### Assigning values to names/objects
You can store info in objects. This is useful when you want to use your object later.
The standard way to assign a value to an object is `object <- value`. Here we deal mostly with numeric values, but also "Text" can be used as a value. Text values are actually called *strings*, but we will discuss more about that a bit later. Some examples of assignment are given below:
```r
price <- 140
carPrice <- 1
piRounded <- 3.14159
name <- "Text"
cookiePrice <- 2
```


### Calculation using objects
Now you may understand why it is convenient to store values in objects:
```r
discount <- 20
finalPrice <- price - discount
finalSale <- (price - discount)*0.5
```

Now you know how very basic math operations can be done in R, and it is time to put these skills into use! The code cell below already has some code written into it, and you can just modify it according to the exercise.
#### Exercise 0.1

Calculate what is the final price, when the discount is __a)__ 40, or __b)__ 15% of the starting price.

Modify the code in the cell below, and run the cell! Notice that you can see the value of an object by writing its name in the code cell and running it.

In [None]:
price <- 140
discount <- 20
finalPrice <- ???
finalPrice

## Vectors and functions in R <a id='Vectors'></a>

### What vectors are

Even though the examples above dealt with single numbers, this is not how R is typically used. Usually you have plenty of data (many items and many prices, for example).
The way to store a sequence of numbers (or other types of data) is to use a vector. When we are dealing with numeric data, the vector is called a *numeric vector*. When we are dealing with text data, the vector is called a *character vector*.

Let's make one vector and call it *testVector*.<br/>
See what comes out by running the cell!

In [None]:
testVector <- c(10, 30, 1, 4)
testVector
sort(testVector)

Ok, now we have created a numeric vector and sorted it. You can try to figure out how to make a character vector by yourself. If you can not figure it out, do not worry, we will learn it soon.

### What functions are

A surprise! We have already used two functions, `sort()` and `c()`. As you might have guessed, `sort()` is a function that orders the values in a vector, and `c()` is a function that combines its arguments to form a vector. In the case of `testVector <- c(10,30,-20,-1)`, the arguments are 10, 30, -20, and -1.

Some useful functions include convenient ways to calculate things, such as `mean()`, which calculates the mean of a numeric vector. The function call `mean(testVector)` will give you 4.75 as the output, since it treats the vector `c(10,30,-20,-1)` as a whole. On the other hand, the function `abs(testVector)`, will actually return a vector that contains the absolute values of its arguments. In other words, the function `abs` treats its arguments *element by element*. Run the cell below, and try to figure out what is happening!

In [None]:
testVector <- c(10, 30, -20, -1)
mean(testVector)
abs(testVector)
mean(testVector) - abs(testVector)

__Some useful functions__ are listed next. You don't have to memorize any functions, but it is useful to know some that are available. If you are unsure how a function works, you can always open the help. You can also search the internet or [R documentation](https://www.r-project.org/other-docs.html) for many more functions.

* Most mathematical functions also work *element by element* and return multiple values<br/>
   `exp()`, `log()`, `abs()`
* Some functions treat the vector as a whole and return a single value, like the length of a vector<br/>
    `length()`, `sum()`, `mean()`, `median()`, `sd()`, `var()`, `min()`, `max()` <br/>
* Other functions tell several things at once<br/>
    `summary()`, `str()` <br/>
* It is also possible to ask for help and open R documentation. <br/>
    `help(functionName)`, or `?functionName` will both work<br/>
* Basic R can also be used to create simple graphics.<br/>
    `plot()`, `barplot()`<br/>
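
As a rough illustration of the groups listed above, the sketch below applies a few of the functions to a small made-up vector. The names `heights` and `logHeights` and the values are assumptions chosen only for this example.

```r
heights <- c(150, 160, 170, 180)   # hypothetical example data

# Element by element: one result for each value in the vector
logHeights <- log(heights)
length(logHeights)    # still 4 values

# Vector as a whole: a single result for the entire vector
sum(heights)          # 660
mean(heights)         # 165
max(heights)          # 180

# Several things at once
summary(heights)
str(heights)
```

Compare the lengths of the outputs to see which functions work element by element and which treat the vector as a whole.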

By the way, commenting your code is done by using the `#` character in R. Commenting your code is especially necessary when working on longer projects. Comments are used to explain what you are doing in your code. The `#` turns the rest of the line into "non-code", which is ignored by the interpreter that turns the R code into action.

In [None]:
# This is a comment, see how it works
help(summary)
summary(testVector)
# See what happens if you erase the # character below
# mean(testVector)

## Next some exercises! 

Use `help()`, search the internet, or just guess the correct commands.
Writing something "wrong" does not break anything, it will just give an error message. Find the mistake, fix the mistake, and run the cell again!

By the way, if you store the value of some calculation in an object, you can read its content by simply writing its name.

#### Exercise 1: Learning to use external resources
Find out the variance of the following numbers: 1, 4, 9.5, 10, and -2.<br/>
*Hint!* You don't have to know what variance is, there is a function for it :)

In [None]:
# Exercise 1 answer below
variance <- ...

#### Exercise 2: Finding out how a function works
Learn what the function `seq()` does and what arguments it needs to work. Test it. Do not be afraid to get an error message!

In [None]:
# Exercise 2 answer below
mysterySeq <- ...

#### Exercise 3: Thinking on your own
Create a vector `t` that contains the numbers from 0 to 10 with 0.1 interval, and plot the `sin()` of the vector `t` at the points in the `t` vector.

In [None]:
# Exercise 3 answer below
t <- 
y <-
plot(t, y)

#### Extra exercise for the fast ones
Find out how to write a title into the plot and change the scatter plot from exercise 3 into a line plot. <br/>
*Hint!* The `plot()` function can take more arguments than two. Do not worry about finishing this exercise if you run out of time.

In [None]:
# Answer for the extra exercise
help(plot)


#### Extra exercise for the really fast ones

Let's say that there is a small company that sells boxes of books in an auction. The number of books in each box varies, and the price of each box is set in the auction. You are the summer trainee of this book auction company, and must find out: __a)__ how much you sold in total, __b)__ what is the average price of a box, __c)__ what is the average price of a book, and __d)__ what is the standard deviation of book price?

In [None]:
# Answer for the extra exercise 2
prices <- c(100, 49, 10, 16, 88, 2, 51)
booksInBoxes <- c(30, 1, 12, 5, 8, 1, 14)



## Data types <a id='Datatypes'></a>

### Numeric vectors vs. character vectors

Now that you can hopefully write some simple R code and search the R documentation or the internet for information about functions, it is time to learn something new! As you might remember, there are *numeric* and *character* vectors, which are used to store numbers and text. A character vector can also contain numbers, but they are stored as text. Character vectors can be created using double quotes `" "` or single quotes `' ' `.

For example, what is the difference between the following commands?
```r
numVector <- c(1, 2, 3, 4)
chVector <- c("1", "2", "3", "4")
```
Try running the cell below and inspect the error message.

In [None]:
# Demonstrating the difference between character and numeric vectors
numVector <- c(1, 2, 3, 4)
chVector <- c("1", "2", "3", "4")

sum(numVector)
sum(chVector)

### Logical vectors  and logical operators <a id='Logical'></a>

In addition to numeric and character vectors, there are *logical* vectors. Logical vectors have only two possible values, `TRUE` and `FALSE`. They are answers to questions, such as, is the price above 50? Or is the price exactly 10? Or is this object numeric or not? 

A logical vector can be created very similarly to other vectors, for example, 
```r
answers <- c(TRUE, FALSE, TRUE, TRUE)
```

Logical vectors (notice that a single value is a vector with the length of one) are often created as the output of functions. For example, the function `is.numeric()` tests if its argument is numeric and returns a logical value.

In [None]:
# Run this cell to see what happens
is.numeric("some character string")
is.numeric(50385)

Logical vectors are not only created as function outputs, but also as the result of *logical operations* that are written in an intuitive way, very similarly to the arithmetic operators.

The *logical operators* in R are listed below with some use examples:

| R code    | Explanation       | Example      | Output         |
| -------   |:---------:        | -----:       | -----:         |
| `<`       | less than         |  `10 < 15`   |  `TRUE`        |
| `>`       | greater than      |  `3 > 3`     |  `FALSE`       |
| `<=`      | less or equal     |  `3 <= 3`    |  `TRUE`        |
| `>=`      | greater or equal  |  `5 >= 6`    |  `FALSE`       |
| `==`      | equal             |  `7 == 7`    |  `TRUE`        |
| `!=`      | not equal         |  `2 != 2`    |  `FALSE`       |

It is also possible to combine different rules by using the logical operators and (`&`), or (`|`), and not (`!`).
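
Comparisons and the combining operators also work element by element on vectors. The sketch below shows this with a made-up vector; the names `ages`, `isMinor`, `isTeen`, `childOrSenior`, and `notMinor` are hypothetical and chosen only for this example.

```r
ages <- c(12, 17, 18, 30)                 # hypothetical example data

isMinor <- ages < 18                      # TRUE TRUE FALSE FALSE
isTeen <- ages >= 13 & ages <= 19         # AND: both rules must hold
childOrSenior <- ages < 13 | ages > 65    # OR: at least one rule must hold
notMinor <- !isMinor                      # NOT: flips TRUE and FALSE

isTeen          # FALSE TRUE TRUE FALSE
childOrSenior   # TRUE FALSE FALSE FALSE
notMinor        # FALSE FALSE TRUE TRUE
```

Each result is a logical vector with one answer for each value in `ages`.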

Try to play with logical operators a bit in the code cell below.

In [None]:
# Test some logical operators to see how they work
age <- 15
age < 18
# Combining rules
age < 18 & age > 16

Now some exercises to flex your brain with logical operators.

#### Exercise 4: Getting to know logical operators
Write in R code: Is the sum of numbers from 1 to 100 more than 5000 and less than 6000?<br/>
_Hint!_ Some functions you might want to use: `seq()` and `sum()`

In [None]:
# Exercise 4 answer here


#### Exercise 5: Logical operators in use

Let's say that you run a small company that sells boxes of books in an auction. The number of books in each box varies, and the price of each box is set in the auction.

The object `prices` has the prices of the boxes that were sold each day, and `booksInBoxes` has the number of books that were in each box.

Your task is to find out whether the total sales were above our weekly target of 500.

In [None]:
# Exercise 5 answer here
prices <- c(100, 49, 10, 16, 88, 2, 51)
booksInBoxes <- c(30, 1, 12, 5, 8, 1, 14)


#### Exercise 6: Combining rules
Now we have evaluated our goal again, and set the goal to sell above 500 in a week, or at least sell above 40 per day on average. Was this new goal met?<br/>
*Hint!* Combine the rules using a suitable logical operator.

In [None]:
# Exercise 6 answer here


### Choosing values
Sometimes we want to choose a part of the values from a larger dataset for further analysis.
We can do this using square brackets `[ ]`.<br/>
Observe the output of the cell below by running it. Feel free to modify the code and test how it works.

In [None]:
prices[1]             # choosing the first value
prices[2:5]           # choosing the 2nd through 5th values
prices[prices > 40]   # choosing values based on logical operations

### The workspace

The workspace holds the objects that the user (you!) has defined during the session. For example, the object `prices` should be in your workspace. Here are some useful commands to get you started with the workspace:
* `getwd()` prints the current working directory location
* `ls()` lists all the objects in the workspace
* `rm(list=ls())` removes all the objects in the workspace
* `rm(prices)` removes only the object `prices` from the workspace
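
A small sketch of these commands in action is given below. The object `tempValue` is a hypothetical object created only for this demonstration, and the operator `%in%` (not covered in this notebook) tests whether a value is found in a vector.

```r
tempValue <- 42                       # hypothetical object, created only for this demo
before <- "tempValue" %in% ls()       # is the name listed in the workspace?
before                                # TRUE

rm(tempValue)                         # remove only this one object
after <- "tempValue" %in% ls()
after                                 # FALSE

getwd()                               # the current working directory
```

Because only `tempValue` is removed, the other objects in your workspace stay untouched.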

Now might also be a good time to create a checkpoint if you want to save the state of your work before moving on.

In [None]:
# Find out what objects you have in your workspace

### Data frames - the most important data structure in R <a id='Dataframes'></a>
We moved on from having objects that were single values to objects that were single vectors. However, it would be useful if the vectors that are related to each other could also be represented by the same object.
Now we will proceed to storing our data in even bigger entities, __data frames__. Data frames can include our previously very separate vectors in the same object.

Some of the functions will work with entire data frames, like `summary()`, but for some functions, like `mean()`, you should specify which data inside the data frame you want to use in the function.

Some important functions and commands that are useful for working with data frames include:
* `data.frame()` creates a data frame of the arguments that it is given
* the dollar sign `$` is used to take a column from a data frame
* square brackets `[ ]` can be used to take a part of the dataframe
* `subset()` returns a subset of a data frame

In the following examples we will learn how to create, modify, and use data frames in R.
But first, let's create a couple of vectors and combine them into one data frame for storage. 

In [None]:
# Here we create some data to put into a data frame
amount <- c(2, 3, 5, 1)
book <- c("Comics", "Novels", "Manuals", "Comics") 
prices <- c(7.50, 20.00, 66.0, 500)

# Combining the vectors into a dataframe
bookTypes <- data.frame(amount, book, prices)
bookTypes

### Factors

There is still one data type to discuss, __factor__. Factors can be described as categories, such as `Comics`, `Novels`, and `Manuals`. You can check the class of `bookTypes$book` with the function `class()`.
In R versions before 4.0.0, the function `data.frame()` turned character vectors into factors by default, so the class changed from character to factor; the conversion could be avoided with the option `stringsAsFactors=FALSE`. Since R 4.0.0 the default is `stringsAsFactors=FALSE`, so the column stays a character vector unless you set `stringsAsFactors=TRUE` or convert it with `factor()`. Factors are necessary in statistical modeling, where different categories are often compared.
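
The difference between a character vector and a factor can be inspected with a sketch like the following. The names `genres` and `genreFactor` are hypothetical, and the explicit conversion with `factor()` is used here so that the result does not depend on the R version.

```r
genres <- c("Comics", "Novels", "Manuals", "Comics")   # a character vector
class(genres)                    # "character"

genreFactor <- factor(genres)    # convert to a factor explicitly
class(genreFactor)               # "factor"
levels(genreFactor)              # the categories: "Comics" "Manuals" "Novels"
table(genreFactor)               # how many values fall into each category
```

The levels are the distinct categories, listed once each in alphabetical order.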

Let's create a new data frame, but this time, we'll put everything inside it immediately. Also, see how conveniently we can continue our code on the next line.

In [None]:
# Here we create yet another data frame
bookReview <- data.frame(storage = c(4, 1, 3, 2), 
            titles = c("Good book", "Excellent book", "Bad book", "Horrible book"),
            cost = c(12.1, 20.0, 3.33, -10 ),
            stringsAsFactors=FALSE)
bookReview

Now if we want to get only the titles, we must specify which data frame they are in.
This is done by using the `$` sign. You can also try to take the column without specifying the data frame, and see what happens.

In [None]:
bookReview$titles              # take a column from the data frame
titles                         # take the variable 'titles'

### Creating and modifying columns
Now we will create (and modify) some new columns into the data frame using the `$` -sign. See what happens to our `bookReview` and try to figure out how the column `storage.value` is created!<br/>
_Hint!_ The next exercise will be easier if you focus on this part.

In [None]:
# Creating some new columns
bookReview$type <- "book"
bookReview$storage.value <- bookReview$storage * bookReview$cost
bookReview

# Modifying our new columns
bookReview$type[4] <- "trash"
bookReview

#### Exercise 7: Modifying the data frame
Now we have received a shipment of new books, and their information should be stored in our data frame.
The shipment included 2 books of each type. Don't forget to update the storage value!

By the way, the cost of the books stays the same.

In [None]:
# Exercise 7 answer here


### Choosing values from a data frame

As you might have already learned, there are many ways to achieve the same goal in R. Both `subset()` and `[ ]` can be used to choose columns and rows. "Unselecting" columns is done by the minus-sign (`-`).
Uncomment the method you want to test. Can you figure out how each method works?

In [None]:
# Create a column with a sequence of numbers
bookReview$delete.column <- seq(1, 100, length=4)
bookReview

# multiple methods to delete the 6th column in the data frame.
bookReview <- bookReview[1:4, 1:5]
#bookReview <- bookReview[1:4,-6]
#bookReview <- subset(bookReview, select = -6)
#bookReview["delete.column"] <- NULL

bookReview

We can also set rules for the observations that we want to take from the data frame.

In [None]:
# take a part (a subset) of the data frame
bookReview[bookReview$cost > 5, ]
bookReview[bookReview$type == "book", ]

# also test the same with subset function
subset(bookReview, storage > 1 & cost > 5)
bookReview

#### Exercise 8: Finding observations from the data frame
Find out the average cost of the books that are not horrible.

In [None]:
# Answer to exercise 8


Next, let's try to use some functions on our data frame. Run the cell below and observe the output!

In [None]:
summary(bookReview)  # see how this works on a data frame
mean(bookReview)     # see how this does not work on a data frame

### Missing values and error messages <a id='Missing'></a>

Learning to read error messages is a good skill to have, because it will help you locate and fix errors in your code. Did you read the <span style="background-color:#ffcccc"> warning message </span>  that appeared when we tried to take the `mean()` from the entire data frame? The message says that the argument we gave the function is not in suitable format. However, the function did still return something.

`NA` is short for "not available", which means that the answer is a missing value. In real life, datasets are full of missing values. You should remember that missing values can disturb some calculations, for example if you want to take the `mean()` of a vector that has `NA`s. Luckily, you can tell the function `mean()` to skip `NA`s in the calculation by giving the function an extra argument, `na.rm=TRUE`.

There is also another value, `NULL`, that could be confused with `NA`, but the difference between `NULL` and `NA` is a bit out of the scope of this session. Check out the R documentation if you are interested!

Now let's play a bit with missing values and the option `na.rm`.

In [None]:
# Create a column full of missing values
bookReview$test.column <- NA

# Add some values to replace the missing values
bookReview$test.column[1:3] <- c(2,4,1)
bookReview

# Calculations
mean(bookReview$test.column)
mean(bookReview$test.column, na.rm = TRUE)

## How to learn more <a id='How'></a>

### Getting data into R

So far we have dealt with very small "datasets" that were written directly into the R code, but in the real world you would use data from external sources. R has many functions for data import, such as `read.csv()` to read .csv files and `read.table()` to read .txt files. Notice that your data might not be exactly in the format that the function expects! For example, if your data uses a different decimal separator than the default `.`, you can give the function an extra argument to specify it. This would be written as `read.csv("filename.csv", dec = ",")`.
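
The following sketch shows one way this could look in practice. It writes a tiny made-up file to a temporary location and reads it back; the file contents and the names `csvFile` and `fruit` are assumptions for illustration only. Because the comma is used as the decimal separator here, the columns are assumed to be separated by `;`, which is given with the `sep` argument.

```r
# Write a small hypothetical file: ";" between columns, "," as the decimal separator
csvFile <- tempfile(fileext = ".csv")
writeLines(c("item;price", "apple;1,50", "pear;2,25"), csvFile)

# Tell read.csv() about both separators
fruit <- read.csv(csvFile, sep = ";", dec = ",")
fruit
str(fruit)            # price is read as a numeric column
mean(fruit$price)     # 1.875
```

R also has the function `read.csv2()`, which uses `;` and `,` as its defaults.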

R has also some built-in datasets that you can practice with. If you are interested, available datasets are listed by the `data()` function. The same function also loads the dataset when the name of the dataset is given as an argument. The *iris* dataset is a good one to start with.

In [None]:
# An example of exploring a built-in dataset
rm(list=ls())  # clear workspace
data(iris)     # load data
head(iris)     # explore data
str(iris)
?iris

### Packages

Packages are the strength of R! You can extend R by installing new packages. There are several packages already installed with R, but installing new ones will probably be necessary if you are working with more than the very basic stuff. You only have to install a package once, but you must load it from the library every time you start a new session. Here are some useful functions to get you started!

* `installed.packages()` will list all the packages that are already installed
* `install.packages("packageName")` will install the package from CRAN
* `library()` will list all the packages that are available in the library and ready to use
* `library("packageName")` will load the package to be used during the session
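
A typical workflow might look like the sketch below. It uses the `stats` package, which ships with R, so nothing needs to be downloaded; the name `hasStats` is hypothetical, and `"packageName"` is a placeholder rather than a real package.

```r
# Check whether a package is already installed
# (the row names of installed.packages() are the package names)
hasStats <- "stats" %in% rownames(installed.packages())
hasStats

# Installing is done only once and needs an internet connection,
# so the line is left commented out here
# install.packages("packageName")

# Loading is done in every new session
library("stats")
```

Replace `"stats"` with the name of the package you are interested in.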

The number of packages available for R is enormous. See this [article at rstudio.com](https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages) to find a nice list of useful packages.
All available packages at CRAN can be found at [cran.r-project.org](https://cran.r-project.org) under Packages, and a useful listing of packages by topic is under Task Views.

Now you have the basic tools to move on to the next section of this Data Analytics School. If you want to go through these materials later on, go ahead and download this notebook as PDF!




## Extra examples and exercises <a id='Extra'></a>

Here are some extra examples and exercises for those who have some coding experience, have perhaps used R before, or are just very fast.

### Examples with built-in datasets

Let's continue with the dataset *iris* (see [Wikipedia: Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)).

There are often special groups in a study that should be compared. In the *iris* dataset, there are three different species of iris flowers. We can use `summary()` to create all kinds of statistics, but these do not tell anything about the differences between the species. We could also manually separate each group and calculate statistics individually for each species. However, there is also a way to tell R to do all these calculations for us at once.

`aggregate()` will split the data into subsets for grouped calculations, and return the output in a data frame. For example, the following function call will give us the mean of sepal length for each species separately:
```r
aggregate(iris["Sepal.Length"], list(iris$Species), mean)
```
#### Extra 1: Aggregate

Calculate the mean of all the variables by species in the *iris* dataset.



In [None]:
# Answer for Extra 1


You probably solved the previous exercise using `[ ]` and `:`, but there is also another method to tell a function how we want to use our data. The __formula__ interface specifies which columns are used in a function and how. Understanding formulas will be more important when you get to using functions like `ggplot()` for graphics or `lm()` for linear models (e.g. regression). Some references can be found at [rstudio.com](https://rviews.rstudio.com/2017/02/01/the-r-formula-method-the-good-parts/) and [dummies.com](http://www.dummies.com/programming/r/how-to-use-the-formula-interface-in-r/). Reading `?formula` can help you as well.

Here are some examples to get you started! Feel free to modify the formula in the cell below.

In [None]:
# Demonstrating the use of formulas
# Calculating the mean sepal length BY species
aggregate(Sepal.Length~Species, data = iris, FUN = mean)

# Plotting BY species
plot(Sepal.Length~Species, data = iris)


#### Extra 2: Formula

Plot the petal width by petal length of irises, where the color indicates the species.



In [None]:
# Answer for Extra 2


#### Extra 3: Explore
Feel free to explore another built-in dataset in R, _ToothGrowth_ 

In [None]:
# Free exploration