# 03 Basics: Matrices and Data frames

* Matrices
* Data frame
* Data Manipulation

---

Welcome to session 03 of our program. 
This time, we start with matrices, data frames, and how to manage the data they hold. Most data you encounter will be stored in some sort of matrix or table, which is often the basis for your later data analysis.
Well-structured, clean data is always the foundation of good research.
Let's dig right into it!

![](https://raw.githubusercontent.com/GC-alex/QM/master/figs/header_sized_small.jpg)


## Matrices 
A `matrix` is a two-dimensional array. Its two dimensions are specified by two additional arguments: `nrow` and `ncol`.

Create a matrix with the following statement:
`m <- matrix(1:9, nrow = 3, ncol = 3)`
And take a first look at our `matrix` with the following:
`dim(m)`, `nrow(m)`, `ncol(m)`

Create a second matrix called `n` as you like.

---
**Note**

* *If you want to fill your matrix by rows instead of by columns, add `byrow = TRUE` to the `matrix()` call (inside the parentheses!).*
* *Don't forget to type the name of your variable to actually see it!*

---
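The difference between the two fill orders can be sketched as follows, using the same values `1:9` as above. The name `m_by_row` is only an illustrative choice and not part of the session.

```r
# Default: values fill the matrix column by column
m <- matrix(1:9, nrow = 3, ncol = 3)

# byrow = TRUE: values fill the matrix row by row
# (m_by_row is an illustrative name)
m_by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

m[1, ]         # first row of m:        1 4 7
m_by_row[1, ]  # first row of m_by_row: 1 2 3
```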

In [5]:
m <- matrix(1:9, nrow = 3, ncol = 3)
m

| | | |
|---|---|---|
| 1 | 4 | 7 |
| 2 | 5 | 8 |
| 3 | 6 | 9 |



Just as with vectors in session 02, you can do calculations on your matrices.
Try the following and take a closer look at how each one behaves.

`m * 2`

`m * n` (`n` is our second matrix; this only works if it has the same dimensions as `m`)

`cbind(m,n)`
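As a rough sketch of what these three statements do: the second matrix `n` below is an assumed example, since the exercise leaves its contents up to you.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
# n is an assumed example; for m * n it must have the same dimensions as m
n <- matrix(9:1, nrow = 3, ncol = 3)

m * 2             # every element of m is multiplied by 2
m * n             # element-wise product: m[i, j] * n[i, j]
cbind(m, n)       # the columns of n are appended to the right of m
dim(cbind(m, n))  # 3 rows, 6 columns
```

Note that `*` multiplies element by element; it is not the matrix product known from linear algebra.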

Next, find out how to access individual values in our `matrix`!
How do you get the value in the **second row** of the **first column**?

---
**Hint:** *In vectors we accessed values this way: `v[1]`. Now you have two dimensions to specify your value.*

---
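If you want to compare your solution, here is a minimal sketch of two-dimensional indexing on the matrix `m` from above. The row index comes first, the column index second.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

m[2, 1]  # single value: row 2, column 1
m[2, ]   # leaving the column empty returns the whole second row
m[, 1]   # leaving the row empty returns the whole first column
```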

Do you remember the vector classes we discovered in session 02? Check the **type** and **class** of our Matrix `m`!

---
**Note:** *Jupyter Notebook already shows the **type** of your matrix values when it prints the matrix!*

---

In [3]:
class(m)
typeof(m)

Create a new **vector** called `x` with the value `"Hello"`, then `cbind` it to our matrix `m`. What happens if you check the **type** of the result?

---
**Note:** *You might expect the matrix to contain integer values AND character values. Instead, `typeof()` only returns `character`, right? A matrix can hold only one type, so when you combine several types, R converts everything to the most general one. In this case that is `character`. Please keep this in mind when analyzing your data!*

---
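The conversion described in the note can be sketched like this. The name `mx` for the combined matrix is an illustrative choice.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
x <- "Hello"

typeof(m)          # "integer"

# x has length 1, so cbind() repeats it to fill the new column
mx <- cbind(m, x)  # mx is an illustrative name
typeof(mx)         # "character"
mx[1, 1]           # the former integer 1 is now the string "1"
```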

## Data frames

So far we have done a little recap with the help of matrices. Now we want to introduce another important way to store values: data frames.

Data frames are probably your most important tool for storing your data in R.
But how does a data frame differ from a matrix?

Let's use our data from **session 01**: `airquality`. Use the function `as.data.frame` on the data `airquality` and store the result in a variable called `data`.

In [32]:
data <- as.data.frame(airquality)
data

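| Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|
| `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` |
| 41 | 190 | 7.4 | 67 | 5 | 1 |
| 36 | 118 | 8.0 | 72 | 5 | 2 |
| 12 | 149 | 12.6 | 74 | 5 | 3 |
| 18 | 313 | 11.5 | 62 | 5 | 4 |
| NA | NA | 14.3 | 56 | 5 | 5 |
| 28 | NA | 14.9 | 66 | 5 | 6 |
| 23 | 299 | 8.6 | 65 | 5 | 7 |
| 19 | 99 | 13.8 | 59 | 5 | 8 |
| 8 | 19 | 20.1 | 61 | 5 | 9 |
| NA | 194 | 8.6 | 69 | 5 | 10 |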


Take a look at the types of the different columns in the data frame. You can see that each column has its own type, independent of the others (`Wind`, for example, is stored as `double`; all the others are stored as `integer`). This is one of the major differences from matrices.
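One way to check the column types yourself is sketched below. It uses `sapply()` and `as.matrix()`, two base R functions that this session does not otherwise cover; the names `column_types` and `as_one_matrix` are illustrative.

```r
data <- as.data.frame(airquality)

# Apply typeof() to every column: each column reports its own type
column_types <- sapply(data, typeof)
column_types

# For comparison: a matrix can hold only one type,
# so converting the data frame forces all columns into a common type
as_one_matrix <- as.matrix(data)
typeof(as_one_matrix)  # "double"
```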


Now if you look at the print of our `data` again, you may notice some strange `NA` values.
When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector, data frame or matrix may be reserved for it by assigning it the special value `NA`. 

The `is.na()` function helps you determine whether an element is `NA` or an actual value.
With the `na.omit()` function we can drop the rows of a data frame that contain `NA`s. Use both functions on our `data` to see what happens!


In [45]:
data <- na.omit(data)

---
**Note**
*Why do we need to deal with our `NA` values?*

* *Arithmetic functions and operations applied to `NA`s generally return `NA`.*

---
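A short sketch of both functions and of how `NA` spreads through a calculation. The name `clean` is an illustrative choice; the session itself overwrites `data` instead.

```r
data <- as.data.frame(airquality)

is.na(data$Ozone[1:6])  # FALSE FALSE FALSE FALSE TRUE FALSE
sum(is.na(data$Ozone))  # how many Ozone values are missing

mean(data$Ozone)        # NA, because at least one value is missing

# na.omit() drops every row that contains at least one NA
clean <- na.omit(data)  # clean is an illustrative name
nrow(data)              # rows before
nrow(clean)             # rows after
mean(clean$Ozone)       # now an actual number
```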

## Data Manipulation

Before you start working with your data, we need to establish some basics of how to address individual values correctly.

We start by addressing only **rows 10 to 110 of your data**. *Remember how to do this from what you know about matrices and vectors!*

Store this subset in a new **variable** (choose a fitting name yourself!).

Then print only the **number of rows** of this new variable.


In [66]:
# rows 10 to 110 ('zeile' is German for 'row')
airquality_zeile_10_110 <- data[10:110, ]
nrow(airquality_zeile_10_110)


---
**Check** *If your result is 101, then you did it right!*

---


As you may remember from session 01, we can use the function `head()` to print only the first rows of our data. This is often better than printing every row, especially with big datasets.
Alternatively, you can use `names()` to extract only the column names.
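A minimal sketch of both functions on the built-in `airquality` data:

```r
head(airquality)         # first 6 rows (the default)
head(airquality, n = 3)  # first 3 rows only
names(airquality)        # just the column names
```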

Next, we want to address only a certain column of our dataset. Use `$` to specify the column.

`data$Ozone`

What happens if you try this? 
`airquality[, c("Ozone", "Temp")]`
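The two ways of selecting columns return different kinds of objects, which the following sketch makes visible. The names `ozone` and `two_cols` are illustrative.

```r
# $ extracts a single column as a plain vector
ozone <- airquality$Ozone
class(ozone)     # "integer"

# [ , c(...)] with several column names keeps the data frame structure
two_cols <- airquality[, c("Ozone", "Temp")]
class(two_cols)  # "data.frame"
names(two_cols)  # "Ozone" "Temp"
```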

You can also subset your data. Try this:

`airquality[airquality$Temp > 70, ]`

Apply the same filter to our cleaned `data`, assign the subset to a new **variable**, and then print the **number of rows** again!


In [48]:
airquality_temp_gr_70 <- data[data$Temp > 70, ]
nrow(airquality_temp_gr_70)

---
**Check** *If your result is 85, then you did it right!*

---

How about we try this:

`airquality_june_13 <- data[data$Month == 6 & data$Day == 13, ]`

You can see that we can build more complex subsets with logical operators.
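What happens inside the square brackets can be sketched step by step: each comparison produces one `TRUE`/`FALSE` per row, `&` combines them, and the data frame keeps only the rows marked `TRUE`. The helper name `keep` is an illustrative choice.

```r
data <- na.omit(as.data.frame(airquality))

# One logical value per row: TRUE only where both conditions hold
keep <- data$Month == 6 & data$Day == 13  # keep is an illustrative name

length(keep)   # same as nrow(data)
sum(keep)      # number of rows that satisfy both conditions
data[keep, ]   # the same subset as in the statement above
```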

Try to create a subset of the **Ozone** values from the **16th of June**.



In [65]:
data[data$Month == 6 & data$Day == 16, "Ozone"]

---
**Check** *Your result should be 21!*

---

With `x %in% y` you can compare two vectors. For each element in vector `x`, `%in%` checks whether that element is contained in vector `y`.
Try:

`data[data$Month %in% c(6, 7), ]`

In this case it is the same as `data[data$Month == 6 | data$Month == 7, ]`
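A small sketch of `%in%` on a toy vector, followed by a check that both versions of the subset really match. The names `with_in` and `with_or` are illustrative.

```r
# One result per element on the left-hand side
c(5, 6, 9) %in% c(6, 7)  # FALSE TRUE FALSE

data <- na.omit(as.data.frame(airquality))

with_in <- data[data$Month %in% c(6, 7), ]
with_or <- data[data$Month == 6 | data$Month == 7, ]

identical(with_in, with_or)  # TRUE
```

The `%in%` version stays short even when the list of values grows, whereas the `|` version needs one comparison per value.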

Often you can accomplish the same goal in different ways! So don't worry if your solution sometimes looks different from someone else's. You will find your own coding style with practice!

Of course we can also do calculations on our values in our columns. Try to calculate the `log()` of our `Ozone`.


Great, we now have log-transformed data, but we also want to store it in our data frame `data`.
Simply make use of the `$` operator.

`data$log_Ozone <- log(data$Ozone)`

This should add a new column to your data frame! Check it out!

In [78]:
data$log_Ozone <- log(data$Ozone)

data

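| | Ozone | Solar.R | Wind | Temp | Month | Day | log_Ozone |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 41 | 190 | 7.4 | 67 | 5 | 1 | 3.713572 |
| 2 | 36 | 118 | 8.0 | 72 | 5 | 2 | 3.583519 |
| 3 | 12 | 149 | 12.6 | 74 | 5 | 3 | 2.484907 |
| 4 | 18 | 313 | 11.5 | 62 | 5 | 4 | 2.890372 |
| 7 | 23 | 299 | 8.6 | 65 | 5 | 7 | 3.135494 |
| 8 | 19 | 99 | 13.8 | 59 | 5 | 8 | 2.944439 |
| 9 | 8 | 19 | 20.1 | 61 | 5 | 9 | 2.079442 |
| 12 | 16 | 256 | 9.7 | 69 | 5 | 12 | 2.772589 |
| 13 | 11 | 290 | 9.2 | 66 | 5 | 13 | 2.397895 |
| 14 | 14 | 274 | 10.9 | 68 | 5 | 14 | 2.639057 |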


# Summary

* A matrix is a two-dimensional array whose two dimensions are specified by `nrow` and `ncol`
* Data frames store your information in columns that each keep their own type
* `NA` means "not available"; remove or otherwise handle `NA` values before doing any arithmetic on your data


# List of Functions Used
**Matrices**

`matrix(m:n, nrow = x, ncol = y)`

`dim(matrix)` `nrow(matrix)` `ncol(matrix)` `cbind(matrix1,matrix2)`

`class()` `typeof()`

**Data frame**

`as.data.frame()`

`is.na()` `na.omit()`

**Data manipulation**

`head()` `names()`

`$`
