# Introduction

## Goals

By the end of this course, you should be able to:
- Do basic data analysis using R or Python/Pandas, with a special emphasis on
  - Triton, or other similar HPC cluster environments
  - workflows, I/O strategies, etc. that work on HPC clusters.

What this course is NOT:
- A basic course in programming. We don't expect you to have prior knowledge of R or Python, but some programming experience is required.
- A basic course in statistics / machine learning. As part of the course we'll do some simple stuff, but we expect that you either understand the statistics from before or learn it on your own.

Topics that we're going to cover:
- The dataframe data structure, and how it relates to other common data structures.
- Working with dataframes. Indexing, etc.
- Visualizing your results.

## Note about syntax

Anything written like `this` refers to R code. Depending on context it might be a type or a function.

# Basic R structures and objects

## Libraries

R has a wide variety of libraries available through CRAN (the Comprehensive R Archive Network). Installing these libraries is usually as simple as running:

`install.packages("packagename")`

Loading these libraries is done with:

`library(packagename)`
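
The two steps above can be combined so that a package is installed only when it is missing. The following is a minimal sketch: `load_or_install` is a hypothetical helper name, not part of R, and the base package `stats` is used only as a stand-in that is already installed.

```r
# Hypothetical helper: load a package, installing it first if it is missing.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE if the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets us pass the package name as a string
  library(pkg, character.only = TRUE)
}

# 'stats' ships with R, so nothing is downloaded here
load_or_install("stats")
```

On a cluster it is usually worth checking where packages get installed (see `.libPaths()`) before installing anything large.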

## Data types and variables

Most of the time your data in R will be of one of the following types:
- `logical` is a boolean value, either `TRUE` or `FALSE`.
- `integer` is an integer number. It is distinguished from generic numbers by an `L` suffix, e.g. `3L`.
- `numeric` or `double` is a generic floating point number.
- `complex` refers to complex numbers. One can specify them in the form `4 + 17i`.
- `character` refers both to individual characters and to character strings.

The data type of an object can be found with the `typeof` function.

Variables in R are set with the assignment operator `<-`. The `print` function prints variables.

In [51]:
boolean_var <- FALSE
integer_var <- 73L
numeric_var <- 3.2
character_var <- 'Aalto'
complex_var <- 3+2i

typeof(boolean_var)
typeof(integer_var)
typeof(numeric_var)
typeof(character_var)
typeof(complex_var)

print(integer_var)

[1] 73


## Atomic vectors

R is a vectorized language. This means that most objects are `vector`-objects or built from `vector`s.

An atomic `vector` in R is a contiguous series of values where all values share a common data type. A `vector`'s length is specified when it is created.

To create a `vector` one can use the `vector`-function. It takes as its arguments the type and the length.

More often one uses the `c`-function (concatenate/combine) to combine data into a `vector`. When using this it is a good idea to make sure that types of the input data match or you will get a type conversion.

In [52]:
# Initialization using the vector-function
integer_vector <- vector('integer',5)
print(integer_vector)

# A vector of integers (note the L-letters)
integers <- c(1L, 2L, 5L, 7L)
typeof(integers)

# A vector of floating point numbers
floats <- c(7.5, 28.1, 0.3, 3.4)
typeof(floats)

# A vector containing character strings
characters <- c('fox', 'lynx', 'tiger', 'cat')
typeof(characters)

# A vector of generic numbers (note that type is not integer)
not_integers <- c(1, 2, 5, 7)
typeof(not_integers)

# Output determined by highest type in hierarchy:
# logical < integer < double < complex < character
type_conversion_warning <- c(TRUE,1L,7.5,'fox',1)
typeof(type_conversion_warning)

[1] 0 0 0 0 0
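
When the implicit conversion done by `c` is not what you want, the type can be converted explicitly. The sketch below uses the base functions `as.integer` and `as.numeric`. The variable names are made up for illustration.

```r
# Mixing logical and integer gives integer (TRUE becomes 1L)
mixed <- c(TRUE, 2L)
typeof(mixed)                         # "integer"

# Explicit conversion from double to integer truncates toward zero
truncated <- as.integer(c(1.9, 2.1))
print(truncated)                      # [1] 1 2

# Character strings that contain numbers can be converted to numeric
text_numbers <- c('1', '2', '3')
numbers <- as.numeric(text_numbers)
typeof(numbers)                       # "double"
```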


### Vector indexing

The length of a `vector` can be obtained with the `length` function.

Elements of a `vector` can be accessed by index using the `[start:finish]` syntax, but typically one uses functions with `apply`-structures to access and modify them. More on that later.

One can also provide an indexing `vector` with the indices given as numbers. Indices in R start from 1.

In [37]:
# Value at index 2
integers[2]
# Values from index 2 to the end
floats[2:length(floats)]
# Values from index 2 to one before the end
characters[2:(length(characters)-1)]
# Values at indices 2 and 4 (note indexing with a vector)
not_integers[c(2,4)]

Indexing with indices is quite rarely the way you want to access data as that forces you to keep track of data placement. More often you're interested in a certain range of data. For this you need logical indexing, where you create an indexing `vector` with a logical expression.

This indexing `vector` can then be used to slice the `vector`. Most of the time you don't need to store the indexing `vector`.

In [70]:
# Create an indexing vector
indexing_vector <- integers > 3
# Result is a boolean vector with the same length as the original vector
print(indexing_vector)
# This indexing vector can be used to index the original vector
integers[indexing_vector]

# Same as a one-liner
integers[integers > 3]

[1] FALSE FALSE  TRUE  TRUE
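
Logical expressions can also be combined with `&` (and), `|` (or) and `!` (not) before they are used for indexing. A short sketch, with a made-up vector `values`:

```r
values <- c(1L, 2L, 5L, 7L)

# Elements strictly between 1 and 7
values[values > 1 & values < 7]       # 2 5

# Elements below 2 or above 5
values[values < 2 | values > 5]       # 1 7

# Negating a condition
values[!(values > 3)]                 # 1 2

# which() turns a logical vector into the positions that are TRUE
which(values > 3)                     # 3 4
```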


### Performance alert: Changing vector's size slows code down

One can edit a `vector` and its values after it has been created, but one needs to be mindful of which operations are used.

Some operations (like changes in values) are done *in place*. This means that the memory values are changed, but there was no need to create a new `vector`. This type of operation is fast and does not require additional memory.

On the other hand some operations (like appending) change the `vector` length. This forces the creation of a new `vector`, into which values from the old object are copied. This type of operation slows the code down and requires additional memory.

In R it is usually best to reserve a `vector` of the size you will use first and populate it with values afterwards.
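
The difference between the two approaches can be sketched as follows. Both loops produce the same values, but the first one creates a new, longer `vector` on every iteration. The size `n` is an arbitrary choice for illustration.

```r
n <- 10000

# Slow pattern: the vector grows by one element on every iteration,
# so a new vector is created and the old values are copied each time
grown <- c()
for (i in 1:n) {
  grown <- c(grown, i^2)
}

# Faster pattern: reserve the full length first, then fill in place
preallocated <- vector('numeric', n)
for (i in 1:n) {
  preallocated[i] <- i^2
}

# Both approaches give the same result
identical(grown, preallocated)

# When the operation is vectorized, no loop is needed at all
vectorized <- (1:n)^2
```

Wrapping each loop in `system.time()` is a simple way to compare the run times on your own machine.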

## Factors

`factor`s are a special case of `vector`s that are used to store character strings in a more efficient way. This is especially important with data that can be split into categories.

In a `factor` each unique string in a `vector` is stored as an integer and the mapping between these integers and the original strings is stored as levels.

By using the `ordered=TRUE` argument one can tell R that the levels are ordered. With ordered `factor`s one can use comparison operators such as `>`, and therefore logical indexing based on the order of the levels.

In [72]:
# R contains a constant list of month names
month.name

# Get a random sample of 10000 months from the month.name vector (e.g. birth months of participants in a study)
months <- sample(month.name, 10000, replace=TRUE)

# Show months distribution in a table
table(months)

# Check how big the vector is in the memory
object.size(months)

# Now make the vector into a factor
months_f <- factor(months)

# Print month distribution in the table
table(months_f)

# Check how big the vector is in the memory
object.size(months_f)

# Make a new factor, but this time give the possible levels that data can take and tell that the levels are ordered
months_f2 <- factor(months, levels=month.name, ordered=TRUE)

# Print table (note that it is in order)
table(months_f2)

# Check how many of the months in the factor are on the second half of the year
sum(months_f2 > 'June')

months
    April    August  December  February   January      July      June     March 
      794       821       873       825       778       845       852       858 
      May  November   October September 
      860       850       812       832 

80648 bytes

months_f
    April    August  December  February   January      July      June     March 
      794       821       873       825       778       845       852       858 
      May  November   October September 
      860       850       812       832 

41136 bytes

months_f2
  January  February     March     April       May      June      July    August 
      778       825       858       794       860       852       845       821 
September   October  November  December 
      832       812       850       873 
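
The integer-plus-levels representation described earlier can be inspected directly. This is a small sketch with a made-up `sizes` vector. `levels` and `as.integer` are base R functions.

```r
sizes <- c('small', 'large', 'medium', 'small', 'large')
sizes_f <- factor(sizes, levels = c('small', 'medium', 'large'), ordered = TRUE)

# The mapping from integer codes to strings
levels(sizes_f)                # "small" "medium" "large"

# The integer codes that are actually stored
as.integer(sizes_f)            # 1 3 2 1 3
typeof(sizes_f)                # "integer"

# Comparison works because the levels are ordered
sum(sizes_f >= 'medium')       # 3
```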

## Lists

Sometimes you need a `vector` with mixed data types. In this case you need a `list`.

In [74]:
example_list <- list('a',3L,4.3,3+2i)
typeof(example_list)
print(example_list)

[[1]]
[1] "a"

[[2]]
[1] 3

[[3]]
[1] 4.3

[[4]]
[1] 3+2i



A `list` can also have named elements. During list creation they can be given as keyword arguments to the function call.

New elements can also be added with `$name`.

After a `list` has been created there are a few ways to access its contents:
- `[ ]` returns a slice of the list that keeps the names intact. Thus it always returns a list.
- `[[ ]]` or `$` returns the value stored under the given name or index. The type of the result depends on the stored object.



In [93]:
# Initialize a list with various data types
example_list2 <- list(3L,42.4,something='nothing',c(13.2,32.2),boolean=TRUE, l=list(32.2,1.6))

# Add a new variable to the list
example_list2$variable <- 3.4

# Return a slice of the list
example_list2['something']
class(example_list2['something'])

# Return the value of 'something'
example_list2[['something']]
class(example_list2[['something']])
example_list2$something
class(example_list2$something)

# Print the list. Note that the named values are still indexed.
print(example_list2)

[[1]]
[1] 3

[[2]]
[1] 42.4

$something
[1] "nothing"

[[4]]
[1] 13.2 32.2

$boolean
[1] TRUE

$l
$l[[1]]
[1] 32.2

$l[[2]]
[1] 1.6


$variable
[1] 3.4
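
The access operators can be chained to reach values inside nested structures. A minimal sketch, using a made-up list `person`:

```r
person <- list(name = 'Ada', scores = c(9.5, 8.0), 42L)

# Unnamed elements get an empty name
names(person)                  # "name" "scores" ""

# Chain $ or [[ ]] with [ ] to reach a value inside a stored vector
person$scores[2]               # 8
person[['scores']][2]          # 8

# Unnamed elements are reached by position
person[[3]]                    # 42

# [ ] keeps the list wrapper, [[ ]] removes it
is.list(person['name'])        # TRUE
is.list(person[['name']])      # FALSE
```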



### Performance alert: list is still a vector

The comment about `vector` size changes also applies to the contents of `list`s. If the sizes of `vector`s contained in a `list` are modified, they need to be copied. Likewise, if additional entries are added to the list, the whole list needs to be copied as well.
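
As with atomic vectors, a `list` of known length can be reserved first and filled afterwards. A sketch, assuming the number of results `n` is known in advance:

```r
n <- 5

# Reserve a list with n empty (NULL) slots
results <- vector('list', n)

# Fill the slots by position instead of appending
for (i in seq_len(n)) {
  results[[i]] <- seq_len(i)
}

length(results)                # 5
results[[3]]                   # 1 2 3
```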

## Matrices

A `matrix` is a two-dimensional `vector`. It can be created with the `matrix` function. Matrices have dimensions instead of a length, and these can be accessed through the `dim` function.

In [100]:
# Using seq-function to create a sequence
example_matrix <- matrix(seq(0,5),nrow=3,ncol=2)
print(example_matrix)

dim(example_matrix)

     [,1] [,2]
[1,]    0    3
[2,]    1    4
[3,]    2    5


Two-dimensional indexing is done with `[ , ]`. Rows are to the left of the comma, columns to the right.

In [108]:
# Get first row (note that leaving columns empty means all columns)
example_matrix[1,]
# Get second column
example_matrix[,2]
# Get submatrix
example_matrix[2:3,1:2]

# One can also traverse the matrix in a flattened order (R uses column-major order)
example_matrix[4]

     [,1] [,2]
[1,]    1    4
[2,]    2    5
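
Column-major order means that `matrix` fills its columns first, and that a single index walks down the columns. The sketch below contrasts this with the `byrow=TRUE` argument of `matrix`. The variable names are made up.

```r
# Default: values fill the matrix column by column
by_col <- matrix(0:5, nrow = 3, ncol = 2)

# byrow = TRUE fills the matrix row by row instead
by_row <- matrix(0:5, nrow = 3, ncol = 2, byrow = TRUE)

# A single index always traverses in column-major order
by_col[4]                      # 3 (first element of the second column)
by_row[4]                      # 1 (first element of the second column)

# Flattening shows the storage order
as.vector(by_col)              # 0 1 2 3 4 5
as.vector(by_row)              # 0 2 4 1 3 5
```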


## Data frames

The `list` object is more flexible than a `vector`, since it can hold mixed data types.

The single most important object when dealing with data in R is the `data.frame`-structure.

`data.frame` contains data in tabular form.[1](https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html)
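
A `data.frame` can also be built by hand from equal-length vectors, which shows how it relates to the structures above: internally it is a `list` with one entry per column. The sketch below uses made-up values purely for illustration.

```r
# Each argument becomes a column; all columns must have the same length.
# The values are invented for this example.
animals <- data.frame(
  name = c('fox', 'lynx', 'tiger'),
  weight_kg = c(6.5, 20.0, 190.0),
  wild = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

is.data.frame(animals)         # TRUE
typeof(animals)                # "list"
dim(animals)                   # 3 3

# Each column keeps its own data type
sapply(animals, typeof)
```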

Let us load a built-in R sample data set that contains car information.

In [65]:
df <- mtcars
is.data.frame(df)

# Print only first rows
head(df)

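| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |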


The names of the columns and rows can be accessed through `colnames` and `rownames`. 

Number of columns and rows can be obtained with `ncol` and `nrow`.

In [26]:
colnames(df)
rownames(df)
ncol(df)
nrow(df)

`data.frame` access is typically done through these names.

In [58]:
# Get specific row, all columns
df["Dodge Challenger",]
# Get all rows, specific columns
df[,c("cyl","disp")]
# Create a boolean vector where each row is TRUE or FALSE based on whether cyl == 6 or not
cyl6 <- df[,"cyl"] == 6
cyl6
# Get all rows based on an indexing vector
df[cyl6,]

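| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dodge Challenger | 15.5 | 8 | 318 | 150 | 2.76 | 3.52 | 16.87 | 0 | 0 | 3 | 2 |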


| | cyl | disp |
| --- | --- | --- |
| Mazda RX4 | 6 | 160.0 |
| Mazda RX4 Wag | 6 | 160.0 |
| Datsun 710 | 4 | 108.0 |
| Hornet 4 Drive | 6 | 258.0 |
| Hornet Sportabout | 8 | 360.0 |
| Valiant | 6 | 225.0 |
| Duster 360 | 8 | 360.0 |
| Merc 240D | 4 | 146.7 |
| Merc 230 | 4 | 140.8 |
| Merc 280 | 6 | 167.6 |


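| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.9 | 1 | 0 | 4 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |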
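
Row and column selection can be combined in a single expression, and several conditions can be joined with `&`. The following sketch works on a copy of `mtcars` called `cars` (a name chosen here) and also shows the base function `subset` as an alternative spelling of the same selection.

```r
cars <- mtcars

# Rows with 6 cylinders and more than 20 mpg, only three columns
efficient_six <- cars[cars$cyl == 6 & cars$mpg > 20, c('mpg', 'cyl', 'hp')]
rownames(efficient_six)
nrow(efficient_six)            # 3

# The same selection written with subset()
same_rows <- subset(cars, cyl == 6 & mpg > 20, select = c(mpg, cyl, hp))
rownames(same_rows)
```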


Let's create a new column *lkm* from *mpg* that shows fuel consumption in litres per 100 kilometres:

In [66]:
# 235.214583 converts miles per gallon to litres per 100 km
df$lkm <- 235.214583 / df$mpg
head(df)

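| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | lkm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 | 11.20069 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 11.20069 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 | 10.31643 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 10.99134 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 | 12.57832 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 | 12.99528 |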
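
The conversion constant used for *lkm* can be derived from unit definitions. The sketch below assumes that *mpg* means miles per US gallon, which is consistent with the constant 235.214583. The variable names are made up.

```r
# Unit definitions (assumption: US liquid gallon, statute mile)
litres_per_gallon <- 3.785411784
km_per_mile <- 1.609344

# litres per 100 km = 100 * litres_per_gallon / (mpg * km_per_mile)
mpg_to_lkm <- 100 * litres_per_gallon / km_per_mile
print(mpg_to_lkm)

# Check against the first row of the table: 21.0 mpg
mpg_to_lkm / 21.0
```

The last line should agree with the *lkm* value shown for Mazda RX4 (about 11.2).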
