# **Other Data Structures**

---

<br>

## Packages

In [None]:
# install packages for today's lecture
install.packages("dslabs")

In [None]:
# load the necessary packages
library(dslabs)

<br>

<br>



## Data Structures

<br>

* Data types can be considered the most basic form of data

* A collection of values, of one or more data types, forms what we call a **data structure**

* The `R` programming language has many data structures

<br>

* Data structures we covered previously are:
  *  data frames
  *  vectors

* Today, we will cover the `factor` and `list` data structures
  

<br>

<br>

### Factors

* Let's reacquaint ourselves with the `murders` data frame from the `dslabs` package

In [None]:
# view first few lines
head(murders)

In [None]:
# view attributes of data frame
str(murders)

<br>

* From the output above, notice that the `region` column has the class `factor`

<br>

In [None]:
# confirm the class of the region variable
class(murders$region)

<br>

<br>

* Although factors appear similar to character vectors, they are not programmatically the same

* Factors are useful for storing categorical data (i.e. data that can be partitioned into distinct groups)

* Think of factors as categorical variables

<br>

* For example, each observation (row) of the `murders` dataframe can be placed in a `region` category
  * Northeast
  * South
  * North Central
  * West

* In `R`, categories of a `factor` are called "levels"
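
* As a minimal sketch of how a factor differs from a character vector, the example below uses a made-up `sizes_chr` vector (not part of `dslabs`): a factor keeps the set of levels plus an integer code for each element

```r
# a small character vector of categories (hypothetical example data)
sizes_chr <- c("small", "large", "medium", "small", "large")

# convert to a factor; by default the levels are sorted alphabetically
sizes_fct <- factor(sizes_chr)

class(sizes_chr)       # class of the original vector
class(sizes_fct)       # class after conversion
levels(sizes_fct)      # the distinct categories
as.integer(sizes_fct)  # the integer code stored for each element
```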

<br>

* Printing a factor (for example, in the Colab interface) displays its levels below the values

In [None]:
murders$region

<br>

#### `levels()`

* We can list the unique categories of a `factor` using the `levels()` function

In [None]:
# list levels of the region factor
levels(murders$region)

<br>

<br>

#### Change the names of factor levels

* You can also change the names of the unique categories (i.e. levels)

* Simply create a character vector of your desired category names and overwrite the levels

In [None]:
# view levels of the region variable
levels(murders$region)

In [None]:
# create new names for the levels
levels(murders$region) <- c("NE", "S", "NC", "W")

In [None]:
# check our work
head(murders)

<br>

<br>

#### `table()`

* The function `table()` counts the occurrences of each category in a factor

In [None]:
# apply table() to the region variable
table(murders$region)

<br>

* The output of `table()` indicates there are
  * 9 states in the northeast region category
  * 17 states in the southern category
  * 12 states in the north central category
  * 13 states in the western category

<br>

<br>

#### `factor()`

* Note that factors have a preset ordering

* For example, the factor above lists its levels in the order `NE`, `S`, `NC`, `W`

<br>

* The function `factor()` is useful for creating factor data structures and reordering the categories

* This can be useful for data reporting where there is some ordering to the categories (e.g., age groups, education level, etc.)
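
* As a rough illustration of ordering categories for reporting, the sketch below uses a made-up `education` vector; the `levels` argument of `factor()` sets the order that `levels()` and `table()` follow

```r
# hypothetical survey responses
education <- c("college", "high school", "graduate", "college", "high school")

# specify the level order explicitly instead of the default alphabetical order
education <- factor(education, levels = c("high school", "college", "graduate"))

levels(education)  # levels appear in the order we specified
table(education)   # counts are reported in that same order
```

* Keep in mind that any value not listed in `levels` becomes `NA`, so the level names should match the data exactly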

<br>

In [None]:
# reorder the levels of the region variable
new_order <- c("W", "S", "NC", "NE")
murders$region <- factor(murders$region, new_order)

In [None]:
# check our work
murders$region

In [None]:
# apply table() to the region variable with new ordering
table(murders$region)

<br>

<br>

#### `as.factor()`

* When working with a new dataset, it is common for categorical variables to be stored as the `character` class

* We can use the function `as.factor()` to convert these character variables into factors

<br>

In [None]:
# convert the factor to a character using as.character()
example <- as.character(murders$region)
example

In [None]:
class(example)

<br>

In [None]:
# convert the character vector back to a factor vector
example <- as.factor(example)
example

In [None]:
class(example)

<br>

* Note that the levels have been reordered alphabetically, because `as.factor()` sorts the unique values of a character vector

* You can repeat the reordering process to revert back to the order we desire

In [None]:
# reapply the desired ordering to the converted factor
factor(example, new_order)
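
* The difference between the two conversions can be sketched on a small stand-alone vector (`region_chr` is a made-up name used only for this illustration)

```r
# character vector whose values first appear in the order W, S, NC, NE
region_chr <- c("W", "S", "NC", "NE", "S")

# as.factor() sorts the levels alphabetically
alphabetical <- as.factor(region_chr)
levels(alphabetical)

# factor() with an explicit levels argument keeps the order we choose
custom <- factor(region_chr, levels = c("W", "S", "NC", "NE"))
levels(custom)
```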

<br>

<br>

### Lists

* Data frames are a special kind of (structured) list
  * Each variable in a dataframe is an item in the list
  * ***Each list item must be the same length***

* But what if we have list items of different lengths?

* What if our list items aren't vectors? What if they are other lists? Matrices?

* We can then use the (unstructured) `list` data structure
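
* To see that a data frame really is a list underneath, here is a small check on a made-up data frame `df` (the values are illustrative only)

```r
# a small data frame (hypothetical values)
df <- data.frame(id = 1:3, score = c(90, 85, 77))

is.list(df)   # a data frame is built on top of a list
length(df)    # number of list items, i.e. the number of columns
lengths(df)   # every column (list item) has the same length
```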

<br>

* Below is an example showing how to create an unstructured list using the `list()` function

* Notice how the variables in the list do not need to be the same length or the same data structure

In [None]:
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")

In [None]:
record
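
* List items can also be matrices or other lists; the sketch below uses a made-up `course` list to show this

```r
# a list holding a character value, a matrix, and another list (hypothetical data)
course <- list(title   = "Intro to R",
               scores  = matrix(1:6, nrow = 2),
               teacher = list(name = "Jane Roe", office = 101))

is.matrix(course$scores)  # the matrix is stored unchanged
course$teacher$name       # items of a nested list are reached by chaining $
lengths(course)           # each item has its own length
```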

<br>

* Similar to dataframes, we can extract items from a list using the `$` operator

In [None]:
# extract grades from the record list
record$grades
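
* Besides `$`, base R also offers double brackets `[[ ]]` and single brackets `[ ]` for lists; the sketch below rebuilds a shortened version of `record` (under the separate name `record_demo`) to compare them

```r
# shortened copy of the record list, rebuilt here so the example is self-contained
record_demo <- list(name = "John Doe", grades = c(95, 82, 91, 97, 93))

record_demo[["grades"]]   # same result as record_demo$grades: the numeric vector itself
record_demo["grades"]     # single brackets return a list of length one
mean(record_demo$grades)  # an extracted item behaves like an ordinary vector
```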

<br>

#### `names()`

In [None]:
# extract the list names
names(record)

<br>

#### `length()`

In [None]:
# number of list items
length(record)

<br>

#### `as.list()`

* As mentioned earlier, data frames are structured lists
* We can convert data frames to an unstructured list using the function `as.list()`

In [None]:
# convert the murders data frame to an unstructured list
murders_list <- as.list(murders)
murders_list

<br>

* This conversion is useful if you would like to add list items
* (Unstructured) list items do not need to be the same length or class!

In [None]:
# add the indicator for country
murders_list$country <- "USA"
murders_list
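
* The same idea on a small stand-in data frame (`df_small` and its values are assumptions for illustration): after conversion, `lengths()` shows that the added item does not have to match the length of the others

```r
# start from a small data frame (hypothetical values)
df_small <- data.frame(state = c("AL", "AK", "AZ"), total = c(135, 19, 232))
df_list  <- as.list(df_small)

# add an item of length one; the other items keep their length of three
df_list$country <- "USA"
lengths(df_list)
```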

<br>

<br>

## Vectorized Operations

* A major advantage of using R is that many operations are vectorized

* That is, R can apply a mathematical operation to every value in a vector with a single expression, rather than requiring us to loop over each element ourselves

* Other languages are capable of vectorized operations, but typically require libraries to perform these operations
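
* As a minimal sketch of what "vectorized" saves us, compare an explicit loop with the equivalent single expression (`v` is a throwaway vector used only here)

```r
v <- c(1, 2, 3, 4)

# element-by-element version using an explicit loop
loop_result <- numeric(length(v))
for (i in seq_along(v)) {
  loop_result[i] <- v[i] * 2
}

# vectorized version: one expression, no loop
vec_result <- v * 2

# both approaches give the same values
identical(loop_result, vec_result)
```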

<br>

<br>

* Let's define two vectors

In [None]:
# define two numeric vectors
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)

<br>

* Adding a scalar to a vector adds the scalar to each element of the vector

In [None]:
# show the contents of x
x

In [None]:
# add a scalar to a vector
10 + x

<br>

* Adding two vectors adds each corresponding element of the two vectors

In [None]:
# show the contents of x
x

In [None]:
# show the contents of y
y

In [None]:
# add the two vectors
x + y
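
* One detail worth knowing, sketched below with a made-up vector `short_vec`: when the two vectors differ in length, R recycles the shorter one, and it warns if the longer length is not a multiple of the shorter

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# short_vec is reused as 10, 20, 10, 20 to match the length of long_vec
recycled <- long_vec + short_vec
recycled
```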

<br>

* The same is true for subtraction, multiplication, division, and exponentiation!

In [None]:
# show the contents of x
x

In [None]:
# scalar subtraction
x - 1

In [None]:
# vector subtraction
y - x

<br>

In [None]:
# show the contents of x
x

In [None]:
# scalar multiplication
2 * x

In [None]:
# vector multiplication
x * y

<br>

In [None]:
# show the contents of x
x

In [None]:
# scalar division
x / 4

In [None]:
# vector division
x / y

<br>

In [None]:
# show the contents of x
x

In [None]:
# raise each element to a scalar power
x^2

In [None]:
# raise each element of x to the corresponding element of y
x^y

<br>

* Many other functions are vectorized as well

* For example, we can convert every numeric value into a character

In [None]:
# show the contents of x
x

In [None]:
# convert each element of x into a character
as.character(x)
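
* A few more base R functions that work element by element, shown on made-up vectors `squares` and `words`

```r
squares <- c(1, 4, 9, 16)
words   <- c("happy", "sad")

roots   <- sqrt(squares)   # square root of each element
shouted <- toupper(words)  # upper-case version of each string
n_chars <- nchar(words)    # number of characters in each string

roots
shouted
n_chars
```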

<br>

<br>

## Overwriting and Inserting Columns in a Data Frame

#### Overwriting columns/variables

* In the last lecture, we created the following data frame with columns
  * mood
  * hours
  * weather
  * temperature

In [None]:
# create variables using c()
mood        <- c("happy", "sad", "sad", "happy")
hours       <- c(1, 2, 1, 4)
weather     <- c("sunny", "cloudy", "cloudy", "sunny")
temperature <- c(75, 62, 71, 83)

In [None]:
# create data frame from vectors
my_data <- data.frame(mood, hours, weather, temperature)
my_data

<br>

* Notice that the `mood` and `weather` variables are `character` class

* However, these are categorical and should therefore be a `factor` class

* We can convert each of these variables using `as.factor()`

In [None]:
# convert character data types to factors
mood_factor    <- as.factor(my_data$mood)
weather_factor <- as.factor(my_data$weather)

<br>

* To overwrite the existing variables, we simply assign the factors to the same columns

In [None]:
# overwrite the existing columns with the factor versions
my_data$mood    <- mood_factor
my_data$weather <- weather_factor

In [None]:
# check our work
head(my_data)

<br>

* Or we can convert and overwrite in a single step!

In [None]:
# convert character data types to factors
my_data$mood    <- as.factor(my_data$mood)
my_data$weather <- as.factor(my_data$weather)

In [None]:
# check our work
head(my_data)
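
* When several columns need the same conversion, one possible shortcut is `lapply()`; the sketch below works on a separate copy named `demo` (an assumption for illustration) so that `my_data` is left untouched

```r
# rebuild a small version of the data frame so the example is self-contained
demo <- data.frame(mood    = c("happy", "sad", "sad", "happy"),
                   hours   = c(1, 2, 1, 4),
                   weather = c("sunny", "cloudy", "cloudy", "sunny"),
                   stringsAsFactors = FALSE)

# apply as.factor() to each selected column and overwrite those columns
cols <- c("mood", "weather")
demo[cols] <- lapply(demo[cols], as.factor)

# check the class of every column
sapply(demo, class)
```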

<br>

<br>

#### Inserting columns/variables

* We often need to create new columns in a data frame

* For example, what if we have a new variable?

* What if we want to store the result of a vector operation?

<br>

* To illustrate this concept, let's first create a vector describing if we ate lunch or not

In [None]:
# create a vector indicating if we ate lunch
lunch <- c("yes", "no", "no", "yes")
lunch

<br>

* We can insert the `lunch` variable in the `my_data` data frame using the `$` operator

* We will call this new variable `ate_lunch`

In [None]:
# insert the variable lunch in the data frame
my_data$ate_lunch <- lunch
my_data
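
* A new column needs one value per row; the sketch below, on a throwaway data frame `demo_rows`, shows that a vector of the wrong length is rejected (the message text returned by the handler is our own, not R's)

```r
demo_rows <- data.frame(hours = c(1, 2, 1, 4))

# a vector with one value per row is inserted as a new column
demo_rows$ate_lunch <- c("yes", "no", "no", "yes")

# a vector with three values does not fit four rows; tryCatch() captures the error
result <- tryCatch({
  demo_rows$bad <- c("yes", "no", "no")
  "inserted"
}, error = function(e) "error: length mismatch")

result
names(demo_rows)
```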

<br>

* Note this should be a factor as well

In [None]:
# overwrite ate_lunch with a factor version of lunch
my_data$ate_lunch <- as.factor(lunch)
my_data

<br>

* What if we want our temperature in Celsius?

* We know the equation for this conversion is

\begin{align*}
\text{celsius} = \frac{5}{9}(\text{fahrenheit} - 32)
\end{align*}

In [None]:
# convert temperature into celsius and insert as new column
my_data$temperature_celsius <- (5/9) * (my_data$temperature - 32)

In [None]:
# check our work
head(my_data)
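
* The conversion can also be wrapped in a small helper; `fahrenheit_to_celsius()` below is a hypothetical function name, and because the arithmetic is vectorized it converts a whole column at once

```r
# hypothetical helper implementing celsius = (5/9) * (fahrenheit - 32)
fahrenheit_to_celsius <- function(fahrenheit) {
  (5 / 9) * (fahrenheit - 32)
}

# sanity check with the freezing and boiling points of water
fahrenheit_to_celsius(c(32, 212))

# convert the temperatures used in the data frame above
temps_celsius <- fahrenheit_to_celsius(c(75, 62, 71, 83))
round(temps_celsius, 1)
```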

<br>

<br>

<br>