# **Data Structures**

---

## R Data Types

Last time, we covered the 5 basic data types of the R programming language:
*   character (strings)
*   double (numeric)
*   integer
*   complex
*   logical (boolean, TRUE/FALSE)



## R Data Structures

Data types can be considered the most basic form of data. A collection of values of these types forms what we call a **data structure**.

There are many data structures in R. Data structures we will commonly use in R are:
*  vectors
*  matrices
*  factors (categorical)
*  dataframes
*  lists

## Vectors


### Defined

A vector is a collection of elements that share the same data type. You can create a vector in R using the `c()` function. Think of the letter 'c' in `c()` as combine or concatenate.

The following is a numeric vector comprising a collection of numbers.

In [None]:
# This is a vector of numbers
my_vector <- c(1, 2, 3, 4)
my_vector

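Similar to how we use the `typeof()` function to determine the R data type, we can use the `class()` function to determine an object's class. For a plain vector, `class()` reports the kind of values it holds (e.g., `"numeric"`); for structures such as matrices, factors, and dataframes, it reports the structure itself.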

In [None]:
# Use class to see the type of data structure
class(my_vector)

In [None]:
# use typeof() to check the data type of elements within the vector
typeof(my_vector)



<br>




The following is a vector comprising a collection of character strings.

In [None]:
# This is a vector of character strings
my_vector <- c("hi", "hello", "hey there")
my_vector

In [None]:
# Use class() to see the type of data structure
class(my_vector)

In [None]:
# use typeof() to check the data type of elements within the vector
typeof(my_vector)



<br>




Can we mix different data types in a vector? What do you think will happen in the following command?

In [None]:
# creating a vector of the number 1 and the logical TRUE
my_vector <- c(1, TRUE)
my_vector

typeof(my_vector)



<br>




What happens when we combine a number with a string?

In [None]:
# first let's define a vector
my_vector <- c(1, 5)
my_vector

In [None]:
# check the class and data type
class(my_vector)
typeof(my_vector)

In [None]:
# add a character to the end of the vector 
my_vector <- c(my_vector, "hello there!")
my_vector

In [None]:
class(my_vector)
typeof(my_vector)
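
The results above follow from R's coercion rule: when types are mixed in `c()`, every element is converted to the most flexible type present, in the order logical, integer, double, character. Here is a minimal sketch of that ordering (the variable names are only for illustration):

```r
# Each vector mixes two types; typeof() shows which type wins
logical_and_integer <- c(TRUE, 2L)
integer_and_double  <- c(2L, 3.5)
double_and_string   <- c(3.5, "a")

typeof(logical_and_integer)  # "integer"
typeof(integer_and_double)   # "double"
typeof(double_and_string)    # "character"

# TRUE becomes 1 when coerced to a number
c(1, TRUE)

# numbers become strings when combined with a character
c(1, 5, "hello there!")
```

Because coercion happens silently, it is worth checking `typeof()` after combining values from different sources.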

### Functions of Vectors

R has several built-in functions that calculate summary statistics of a vector for us.

In [None]:
# define a vector
my_vector <- 1:5
my_vector

In [None]:
# length of my_vector
length(my_vector)

In [None]:
# sum of a vector
sum(my_vector)

In [None]:
# mean of my_vector
mean(my_vector)

# the same value computed by hand
sum(my_vector) / length(my_vector)

In [None]:
# variance
var(my_vector)
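
As with the mean, the variance can be reproduced by hand. `var()` computes the sample variance, which divides by `n - 1` rather than `n`. The sketch below checks this for the same vector (the names `n` and `manual_var` are just for illustration):

```r
my_vector <- 1:5
n <- length(my_vector)

# sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((my_vector - mean(my_vector))^2) / (n - 1)
manual_var

# matches the built-in function
var(my_vector)

# the standard deviation is the square root of the variance
sd(my_vector)
sqrt(manual_var)
```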



<br>




### Operations on Vectors

Operations are vectorized in R. This makes math very easy!

In [None]:
# create a vector of the integers 1 to 100
my_vector <- 1:100
my_vector

In [None]:
# add 200 to each element
my_vector + 200



<br>




What if we divide by 0?

In [None]:
# divide each element by 0
my_vector / 0



<br>




We can add two vectors element by element.

In [None]:
# define two vectors
x <- 1:5
y <- 6:10

x
y

In [None]:
# add two vectors
x + y



<br>




This is where things get a little weird. What do you think happens in the following code cell?

In [None]:
x <- 1:5
y <- 1:10

x
y

In [None]:
x + y

length(x + y)



<br>




What if we set `y <- 1:11`?

In [None]:
x <- 1:5
y <- 1:11

x
y

In [None]:
# add vectors
x + y

In [None]:
length(x + y)
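
What happens here is called recycling: when two vectors differ in length, R repeats the shorter one until it matches the longer one. If the longer length is not a multiple of the shorter length, R still returns a result but also issues a warning. Here is a small sketch of the rule, using `rep()` to write out the recycling explicitly (`recycled_x` and `partial_sum` are illustrative names):

```r
x <- 1:5
y <- 1:10

# x is repeated twice to match the length of y
recycled_x <- rep(x, length.out = length(y))
recycled_x

# so x + y is the same as adding the recycled copy
identical(x + y, recycled_x + y)

# with y <- 1:11, x does not fit a whole number of times;
# R recycles part of x and issues a warning (suppressed here to keep the output clean)
y <- 1:11
partial_sum <- suppressWarnings(x + y)
partial_sum
length(partial_sum)
```

The result always has the length of the longer vector, so a silent length mismatch can go unnoticed when the lengths happen to divide evenly.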



<br>




## Matrices

A matrix is a rectangular (two-dimensional) arrangement of elements that share the same data type, most often numbers. You may also think of a matrix as vectors of equal length combined by row or column.

In R, you can create a matrix using the `matrix()` function.

In [None]:
# a matrix consisting of a single number, with dimension 1x1
my_matrix <- matrix(5)
my_matrix

In [None]:
# check the type of data structure
class(my_matrix)

In [None]:
# check the data type of the matrix elements
typeof(my_matrix)

In [None]:
# create a matrix with 3 rows and 2 columns, filled column by column (the default)
my_matrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
my_matrix

In [None]:
# create a matrix with 3 rows and 2 columns, filled row by row
my_matrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2, byrow=TRUE)
my_matrix



<br>




You can also combine vectors to create matrices. 

Combining vectors by row is called row concatenation and is done with the `rbind()` function.

In [None]:
# create two vectors
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)

class(vector1)

In [None]:
# row bind the two vectors to create a matrix
new_matrix <- rbind(vector1, vector2)
new_matrix

In [None]:
# combining the two vectors alters the type of data structure
class(new_matrix)



<br>




You can also combine vectors by columns using the `cbind()` function.

In [None]:
# column bind the two vectors to create a matrix
new_matrix <- cbind(vector1, vector2)
new_matrix



<br>




Operations and functions on matrices are similar to what we observed with vectors.

In [None]:
# mean of all elements in the matrix
mean(new_matrix)

In [None]:
# sum of all elements in the matrix
sum(new_matrix)

In [None]:
# matrix transpose
t(new_matrix)

In [None]:
# adding a matrix to a vector
vector1
new_matrix

In [None]:
new_matrix + vector1
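
Adding a vector to a matrix uses the same recycling rule as adding two vectors. Because R stores a matrix column by column, the vector is recycled down each column. Here is a short sketch that rebuilds the same matrix and writes the sum out column by column:

```r
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
new_matrix <- cbind(vector1, vector2)

# vector1 is added to each column in turn
new_matrix + vector1

# same result, written out column by column
cbind(vector1 + vector1, vector2 + vector1)

# the dimensions of the matrix are unchanged
dim(new_matrix + vector1)
```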



<br>




## Factors

### Defined

* Factors are data structures used to categorize data into different "levels".
* They can be created from both strings and numbers; internally, R stores each value as an integer code that points to its level.
* Factors are very useful for statistical modeling of categorical variables (gender, ethnicity, etc.).

In [None]:
# let's define a few variables to understand factors and how they work
height <- c(132,151,162,139,166,147,122)
weight <- c(48, 49, 66, 53, 67, 52, 40)
gender <- c("male","male","female","female","male","female","male")

In [None]:
# class of the gender variable
class(gender)



<br>




We can convert the `gender` variable into a factor using the `factor()` function.

In [None]:
gender_factor <- factor(gender)
gender_factor



<br>




Unique categories of the factor variable are called levels in R.

We can observe the levels of the variable using the `levels()` function.

In [None]:
levels(gender_factor)



<br>




By default, levels are sorted alphabetically, so here "female" is listed before "male". But what if we wanted males listed before females?

Sometimes this is desired in a statistical analysis.

In [None]:
gender_factor <- factor(gender, levels = c("male", "female"))
gender_factor
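
The order matters because, with R's default settings, modeling functions such as `lm()` treat the first level as the reference category. Besides listing all levels in `factor()`, base R also offers `relevel()` to move a single level to the front. Here is a brief sketch (`gender_male_first` is an illustrative name):

```r
gender <- c("male","male","female","female","male","female","male")

# default: levels are sorted alphabetically
gender_factor <- factor(gender)
levels(gender_factor)

# move "male" to the front without retyping every level
gender_male_first <- relevel(gender_factor, ref = "male")
levels(gender_male_first)

# the data are unchanged; only the level order differs
table(gender_male_first)
```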



<br>




Numeric variables can be converted to factors, but this doesn't make much sense. Why?

Here is an example anyway...


In [None]:
height_factor <- factor(height)
height_factor



<br>




Can we do arithmetic on a factor if the categories are "numbers"?

In [None]:
# adding 10 to a factor
height_factor + 10



<br>




How can we convert the factor back to a number? Let's try...

In [None]:
height_factor

In [None]:
# convert the factor to numeric
as.numeric(height_factor)

CAREFUL! What happened in the result above? `as.numeric()` returned the integer codes of the levels, not the original heights.




<br>




To convert the factor back to its original numeric values, first use `as.character()` and then `as.numeric()`.

In [None]:
height_numeric <- as.numeric(as.character(height_factor))
height_numeric
class(height_numeric)
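
The reason for this behavior is how factors are stored: each value is an integer code pointing to a level, and `as.numeric()` applied directly to a factor returns those codes. The sketch below compares the codes with the recovered values. The last line, which indexes the numeric levels by the factor, is an equivalent alternative described in R's `?factor` help page.

```r
height <- c(132,151,162,139,166,147,122)
height_factor <- factor(height)

# levels are the sorted unique values, stored as character labels
levels(height_factor)

# as.numeric() on the factor returns the position of each value among the levels
as.numeric(height_factor)

# going through character recovers the original values
as.numeric(as.character(height_factor))

# equivalent alternative: convert the levels once, then index by the factor
as.numeric(levels(height_factor))[height_factor]
```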



<br>




* Note: Sometimes, categorical variables are coded with numbers (e.g., 1 = "Female", 2 = "Male")
* Be careful not to treat these variables as numerical, since they are categorical
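
As a sketch of this note, the hypothetical vector `gender_code` below uses the coding from the example (1 = "Female", 2 = "Male"). Treated as numbers, R will compute a mean that has no real interpretation. Converting the codes with the `levels` and `labels` arguments of `factor()` makes the variable categorical:

```r
# hypothetical data coded as 1 = "Female", 2 = "Male"
gender_code <- c(2, 2, 1, 1, 2, 1, 2)

# treated as numeric, R computes a mean, but it is not meaningful
mean(gender_code)

# convert the codes to a labeled factor
gender_factor <- factor(gender_code, levels = c(1, 2), labels = c("Female", "Male"))
gender_factor

# counts per category are the appropriate summary
table(gender_factor)
```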



<br>




## Dataframes

* Dataframes are possibly the most important data structure for this course
* If our data is not in the form of a dataframe, then we must convert the data to a dataframe
* Many of the algorithms in R take dataframes as input

* A dataframe is a special type of list where every element of the list has the same length
* We can combine vectors to form a dataframe using the `data.frame()` function

In [None]:
# variables
height <- c(132,151,162,139,166,147,122)
weight <- c(48, 49, 66, 53, 67, 52, 40)
gender <- c("male","male","female","female","male","female","male")

# create dataframe
dat <- data.frame(height, weight, gender)
dat

In [None]:
# view the first few rows of the dataframe
head(dat, 3)

In [None]:
# view the last few rows of the dataframe
tail(dat, 3)

In [None]:
# view the variable names of the dataframe
names(dat)

In [None]:
# dimensions of the dataframe
dim(dat)

In [None]:
# number of rows
nrow(dat)

# number of columns
ncol(dat)

In [None]:
# description of each variable in the dataframe
str(dat)
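
Because a dataframe is a list of equal-length columns, list tools work on it too. The sketch below rebuilds `dat` and checks this with `is.list()` and `sapply()`:

```r
height <- c(132,151,162,139,166,147,122)
weight <- c(48, 49, 66, 53, 67, 52, 40)
gender <- c("male","male","female","female","male","female","male")
dat <- data.frame(height, weight, gender)

# a dataframe is a list underneath
is.list(dat)

# every element (column) has the same length, equal to the number of rows
sapply(dat, length)

# each column keeps its own class
sapply(dat, class)
```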



<br>




* We call/extract a variable (or column) from the dataframe using the `$` symbol
* Each column is a vector

In [None]:
# this calls the weight variable and multiplies by 2
dat$weight * 2

In [None]:
class(dat$weight)
typeof(dat$weight)



<br>




* We can store new variables using the `$` symbol
* You must be careful not to overwrite your original data!

In [None]:
# store weight * 2 as twice_weight
dat$twice_weight <- dat$weight * 2

head(dat)
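
One way to guard against overwriting the original data, shown here as a sketch rather than a rule, is to modify a copy. Assigning a dataframe to a new name (the hypothetical `dat_copy` below) leaves the original untouched:

```r
weight <- c(48, 49, 66, 53, 67, 52, 40)
dat <- data.frame(weight)

# work on a copy so the original stays intact
dat_copy <- dat
dat_copy$weight <- dat_copy$weight * 2   # overwrites weight in the copy only

head(dat_copy, 3)
head(dat, 3)    # original values are unchanged
```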



<br>




## Lists

* The list class allows you to store multiple data structures in a single variable
* This is useful if you have multiple data structures associated with a single object
* For example, a patient may have demographic data, a CT image of their lungs, and genomic data

* Lists are created using the `list()` function

In [None]:
# create a dataframe
height <- c(132,151,162,139,166,147,122)
weight <- c(48, 49, 66, 53, 67, 52, 40)
gender <- c("male","male","female","female","male","female","male")
dat <- data.frame(height, weight, gender)

# create a vector
vec <- sample(1:100, 5)

# create a matrix
mat <- matrix(1:10, ncol=2)

# create a factor
fac <- factor(c("0-20 hours", "21-40 hours", 
                ">40 hours", "0-20 hours", ">40 hours"))

In [None]:
# combine these into a list
my_list <- list(my_data = dat, my_vector = vec, my_matrix = mat, my_factor = fac)
my_list



<br>




* Notice that each of the elements of the list has a name (e.g., `$my_factor`)
* You can observe the names of the list elements using the `names()` function

In [None]:
names(my_list)



<br>




* You can call elements of the list by their name

In [None]:
my_list$my_data

In [None]:
my_list$my_matrix
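
Besides `$`, list elements can be extracted with double brackets, by name or by position, while single brackets return a smaller list. Here is a short sketch with a small hypothetical list:

```r
my_list <- list(my_vector = c(10, 20, 30), my_matrix = matrix(1:10, ncol = 2))

# these three calls return the same element
my_list$my_vector
my_list[["my_vector"]]
my_list[[1]]

# single brackets keep the result wrapped in a list
class(my_list["my_vector"])

# extracted elements behave like the original data structure
dim(my_list$my_matrix)
mean(my_list$my_vector)
```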