# Data Processing and Security

<img style="float: right;" src="https://www.kozminski.edu.pl/sites/leon/files/favicon-192x192.png" width="80">

## Class 1

*Piotr Zegadło, November 10, 2021*

### Why this format?

Jupyter notebooks are a really popular format for small projects. You can lay everything out nicely, with maximum readability for your readers, as well as for your own future reference. You can take notes here, too!

JupyterLab is a new tool, a standalone application to manage your Notebooks. You can still use the typical browser interface - just go to the `Help` menu and pick `Launch Classic Notebook`.

While we're at it, you can change your theme to Dark in `Settings` -> `JupyterLab Themes`. The `File` menu will enable you to open a new view of the same notebook or run an R terminal.

### Quick testing

Let's test the functionality. Just run the code in the box below. How? Try activating that box (cell) and then hitting "Shift + Enter".

In [None]:
print("Welcome to the class")

Now try adding your own code to the cell below and executing it.

Try one more in the next cell. You can exit the cell you are in by switching from the Edit Mode to Command Mode. Just hit "Esc".

Having your code in the cells above, test 3 ways of running code in a cell: "Shift + Enter", "Ctrl + Enter", "Alt + Enter". What are the differences? Please explain in the cell below. You can easily edit the Markdown cell simply by double-clicking it.

* Shift+Enter
    
    This option runs the code in the cell and

* Ctrl+Enter
    
    This option runs the code in the cell and

* Alt+Enter
    
    This option runs the code in the cell and

To revert the notebook to its original state, use the `Kernel` menu and `Restart Kernel and Clear All Outputs`. To show the final intended state of the notebook, use `Restart Kernel and Run All Cells`.

To get more info on Jupyter Notebooks, you're welcome to check out a short LinkedIn Learning course: <https://www.linkedin.com/learning/introducing-jupyter>.

### Object type and class

Let us revise the basics. This comes up all the time in data processing in R.

#### 1. typeof

`typeof()` tells you how R stores a value internally. The five basic types you will meet most often are shown below.

In [None]:
typeof(1)
typeof(1L)
typeof(1+0i)
typeof(TRUE)
typeof("one")

#### 2. class

Class describes what kind of object your data is, which determines how functions such as `print()` or `summary()` treat it.

In [None]:
class(1)
class(1L)
class(1+0i)
class(TRUE)
class("one")

In other words, `class()` gives you the class of an object, and `typeof()` gives you its internal type. The difference is not that apparent yet. It will become clearer later on.
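
As a small preview, here is a sketch with two objects whose class and type clearly differ. The variable names `f` and `d` are made up for this illustration and are not used elsewhere in the notebook.

```r
# A factor stores integer codes internally but has its own class.
f <- factor(c("low", "high", "low"))
typeof(f)  # "integer"
class(f)   # "factor"

# A Date is stored as a number but behaves like a calendar date.
d <- as.Date("2021-11-10")
typeof(d)  # "double"
class(d)   # "Date"
```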

### Larger objects: vector, matrix, array, list, data frame...

All these objects serve to store single values (or simpler objects) within a single, larger one.

#### Vector

A vector is a single "line" of multiple values. It is often created using `c()`:

In [None]:
v <- c(1,2,3,4,5)
v
typeof(v)
class(v)

A vector cannot store more than one type, so it coerces all its values to the most general type present. The order, from least to most general, is: logical, integer, double, complex, character.

In [None]:
v <- c(TRUE,FALSE)
typeof(v)
v

v <- c(TRUE,FALSE,1L)
typeof(v)
v

v <- c(TRUE,FALSE,1L,1)
typeof(v)
v

v <- c(TRUE,FALSE,1L,1, 1+0i)
typeof(v)
v

v <- c(TRUE,FALSE,1L,1, 1+0i, "one")
typeof(v)
v
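
Coercion also happens outside of `c()`. The sketch below shows two common cases: logical values used in arithmetic and explicit conversion with the `as.*()` functions. The names `answers` and `converted` are assumptions made for this example.

```r
# Logical values become 0/1 when used in arithmetic.
answers <- c(TRUE, FALSE, TRUE, TRUE)
sum(answers)   # 3: the number of TRUE values
mean(answers)  # 0.75: the share of TRUE values

# Explicit coercion with as.*() functions.
as.integer(3.9)        # 3: the fractional part is dropped
as.character(c(1, 2))  # "1" "2"

# Text that does not look like a number becomes NA (R normally warns here).
converted <- suppressWarnings(as.numeric(c("1", "2", "three")))
converted
```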

#### Matrix

A matrix is a collection of vectors of the same length, arranged into rows and columns.

In [None]:
m <- matrix(0L,3,3)
m
typeof(m)
class(m)

Now the difference between type and class becomes more obvious. We are storing data of type `integer` inside an object of class `matrix` (in R 4.0 and later, `class(m)` also lists `array`).

However, similarly to a vector, a matrix cannot store mixed types of data and it will coerce. Let's create a matrix by binding 2 different vectors as columns with `cbind()`.

In [None]:
v1 <- c(1,2,3,4)
typeof(v1)
v1

v2 <- c("one","two","three","four")
typeof(v2)
v2

m <- as.matrix(cbind(v1,v2))
typeof(m)
class(m)
m
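
Under the hood, a matrix is just a vector with a `dim` attribute. The following sketch illustrates that and also shows the `byrow` argument of `matrix()`. The names `vals`, `by_dim`, `m_by_col` and `m_by_row` are chosen only for this example.

```r
vals <- 1:6

# Adding a dim attribute turns a plain vector into a matrix.
by_dim <- vals
dim(by_dim) <- c(2, 3)
is.matrix(by_dim)  # TRUE

# matrix() gives the same layout: it fills column by column by default.
m_by_col <- matrix(vals, nrow = 2, ncol = 3)
all(by_dim == m_by_col)  # TRUE

# byrow = TRUE fills row by row instead.
m_by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)
m_by_col[1, ]  # 1 3 5
m_by_row[1, ]  # 1 2 3
```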


#### Array

Think of an array as a multi-dimensional matrix. What does that mean? Well, a matrix has 2 dimensions (rows and columns). With 3 dimensions, an array is a 3-dimensional data cube. However, an array can have n dimensions. Besides, like a matrix, an array can only contain data of a single type.

To imagine things better:

* Vector is like a single line of text.

* Matrix is like a page full of text.

* 3-dimensional array is like having multiple pages full of text in a book.

* 4-dimensional array is like having multiple books.

* 5-dimensional array is like having multiple shelves full of books in a single bookcase.

* 6-dimensional array is like having multiple bookcases.

* 7-dimensional array is like having multiple rooms with multiple bookcases in a library.

* 8-dimensional array is like having multiple libraries in a city.

* 9-dimensional array is like having multiple ...

I don't need to go on further, do I?

In business practice, 3-dimensional arrays can be useful. They are an intuitive way of presenting panel data. Such an array consists of multiple matrices. For example, imagine that a matrix contains sales data for all types of your products in three US states in January 2020.

In [None]:
pencils <- c(320,543,78)
pens <- c(123,2324,6)
notebooks <- c(12,1212,4)
rulers <- c(7,45,9)

m1 <- rbind(pencils,pens,notebooks,rulers)
colnames(m1) <- c("TX","MA","OR")

m1

We can create another matrix like that, showing us the sales for February 2020.

In [None]:
pencils <- c(534,440,112)
pens <- c(687,1500,3)
notebooks <- c(23,989,9)
rulers <- c(3,23,2)

m2 <- rbind(pencils,pens,notebooks,rulers)
colnames(m2) <- c("TX","MA","OR")

m2

Let us combine these two matrices into a 3-dimensional array. The first dimension of that array will be the row of our `m1` or `m2` matrix: the sales of a specific product. The second dimension will be the column of our `m1` or `m2` matrix: the state where the sales occurred. The third dimension will simply be the number of the month.

In [None]:
a <- array(data = NA, dim = c(4,3,2), dimnames = 
           list(c("pencils","pens","notebooks","rulers"), 
                c("TX","MA","OR"), 
                c("January","February")))
a[,,1] <- m1
a[,,2] <- m2

a

As you can see, displaying the 3-dimensional array just lists its values. But we can access the individual matrices in various cuts across dimensions like this:

In [None]:
a[,,1]
a[,1,]
a[1,,]
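
An array can also be built in one step and queried by dimension names. This is only a sketch: the names `jan`, `feb`, `sales` and `monthly_totals` are assumptions for this example, and the numbers are copied from the two matrices above.

```r
products <- c("pencils", "pens", "notebooks", "rulers")
states <- c("TX", "MA", "OR")

# One row per product, one column per state.
jan <- rbind(c(320, 543, 78), c(123, 2324, 6), c(12, 1212, 4), c(7, 45, 9))
feb <- rbind(c(534, 440, 112), c(687, 1500, 3), c(23, 989, 9), c(3, 23, 2))

# c(jan, feb) lines up the values of both matrices; array() reshapes them.
sales <- array(c(jan, feb), dim = c(4, 3, 2),
               dimnames = list(products, states, c("January", "February")))

dim(sales)                        # 4 3 2
sales["pens", "MA", "February"]   # 1500: a single value picked by name

# Total sales per month: apply sum() over the third dimension.
monthly_totals <- apply(sales, 3, sum)
monthly_totals
```

Picking values by name instead of by position tends to make code easier to read when the array has many dimensions.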

##### Task 1

What do the 3 matrices above exactly show us? Which sales data are they showing?

#### List

Lists are powerful objects in R. Their elements can be any other objects: vectors, matrices, arrays, plots, even other lists.

In [None]:
a <- TRUE

x <- 356

y <- c(1,2,3,4)

z <- c("one","two","three","four")

m <- as.matrix(cbind(v1,v2))

l <- list(a,x,y,z,m)

l
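
List elements can be named, which makes them easier to access. Below is a minimal sketch; the list `order_info` and its contents are invented for this illustration.

```r
order_info <- list(paid = TRUE, amount = 356, items = c("pen", "ruler"))

names(order_info)          # "paid" "amount" "items"
order_info$amount          # 356
order_info[["items"]][2]   # "ruler"

# Single brackets return a smaller list, double brackets the element itself.
class(order_info["amount"])    # "list"
class(order_info[["amount"]])  # "numeric"
```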

#### Data frame

Let us now create a simple data frame.

In [None]:
v1 <- c(1,2,3,4)
typeof(v1)
v1

v2 <- c("one","two","three","four")
typeof(v2)
v2

df <- data.frame("v1" = v1, "v2" = v2)
typeof(df)
class(df)
df

As you can see, a data frame is actually a special case of a list.
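
Because a data frame is a list of columns, the usual list tools work on it. A short sketch, using a made-up data frame `df_demo` so that the objects above stay untouched:

```r
df_demo <- data.frame(id = c(1, 2, 3), label = c("a", "b", "c"))

is.list(df_demo)   # TRUE
length(df_demo)    # 2: the number of columns, not rows
names(df_demo)     # "id" "label"

# Columns are accessed exactly like list elements.
df_demo[["id"]]    # 1 2 3
df_demo$label
```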

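You might also have noticed that, in R versions before 4.0, our `v2` column is treated as yet another type of data: a factor. (From R 4.0 on, strings stay character vectors by default.) How can we avoid the conversion explicitly?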

In [None]:
df2 <- data.frame("v1" = v1, "v2" = v2, stringsAsFactors=FALSE)
df2
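
If you are wondering what a factor actually is: it stores each distinct text value once, as a level, and keeps integer codes pointing to those levels. A small sketch with invented names `sizes_chr` and `sizes_fct`:

```r
sizes_chr <- c("small", "large", "medium", "small")
sizes_fct <- factor(sizes_chr)

levels(sizes_fct)        # "large" "medium" "small" (sorted alphabetically)
as.integer(sizes_fct)    # 3 1 2 3: the codes pointing to the levels
as.character(sizes_fct)  # back to the original text values
```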

Why do we need data frames? They are like matrices, but with two major differences.

##### Task 2

Tell me what the differences between a data frame and a matrix are. Demonstrate them with code.

### Basic operations

##### Task 3

Let `x` be 15 and `y` be 27. 

* Make `z` their product.

* What is the remainder of dividing `z` by 4?

* Create a vector `v` containing all 3 aforementioned variables.

* Calculate the average, median, maximum and minimum of `v`.

##### Task 4

* Create a new vector `w`, which is equal to `3*v`.

* What is the result of element-by-element multiplication of `v` and `w`?

* What is the result of matrix multiplication of `v` and `w`?

* Create a dataframe with columns made up of vectors `v` and `w`. Name the columns `volume` and `weight`. Display the dataframe.

### Referencing and subsetting

With the provided vector, complete the following tasks.

Use this page to help you when you get lost:
https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/

In [None]:
x <- c(3.5, 2.1, 5.7, 8.4, 9.0)

##### Task 5

Use single lines of code and no additional functions.

* Pick the first number from the vector.

* Pick the 3rd and 4th number from the vector.

* Pick the 3rd and 5th number from the vector.

* Pick the 2nd number twice.

##### Task 6

Use single lines of code and no additional functions.

* Pick all but the 2nd number.

* Pick all but the 2nd and 3rd number.

* Pick all but the 2nd and 5th number.

##### Task 7

Use single lines of code and no additional functions.

* Pick all elements larger than 5.

##### Task 8

Using `mtcars` dataset, fix the mistakes in the attempts below.

In [None]:
data(mtcars)

Pick only those rows, where cyl is equal to 4

In [None]:
mtcars[mtcars$cyl = 4, ]

Pick all rows except rows 1 to 4

In [None]:
mtcars[-1:4, ]

Pick only rows with cyl less than 5

In [None]:
mtcars[mtcars$cyl <= 5]

Pick the rows where cyl is between 4 and 6

In [None]:
mtcars[mtcars$cyl == 4 | 6, ]