# **Data Types and Data Structures**

---

<br>

## Packages

In [None]:
# install packages for today's lecture
install.packages("dslabs")

In [None]:
# load the necessary packages
library(dslabs)

<br>

<br>



## Data Types

* The `R` programming language has several data types

* For this class, and in practice, the data types most often used for data science are
  *   `numeric`
  *   `character`
  *   `logical`
  *   `integer`

<br>

<br>

* **Today, we focus on the `numeric` and `character` data types**

<br>

<br>

* The `logical` data type will be covered extensively in the lecture on subsetting

* The `integer` data type will be briefly mentioned in the lecture on loops

<br>

### `numeric` data type

* The numeric data type is, well, a number!

* You can check the data type of your variables using the `class()` function

In [None]:
# store a numeric value as a variable
x <- 3.14
x

<br>

In [None]:
# check the data type of the stored variable using class()
class(x)

<br>

* We see above that the class of the variable `x` is `numeric`

<br>

In [None]:
# store a value as a variable
y <- 5
y

<br>

In [None]:
# check the data type of the stored variable using class()
class(y)

<br>

* Although the variable `y` is assigned the "integer" value of 5, the class of the variable `y` is still `numeric`

<br>

<br>

### `character` data type

* The character data type is essentially text

* A character is denoted by quotations `" "` or apostrophes `' '` surrounding a letter or phrase

In [None]:
# store text value as a variable
my_text <- "this is my text"
my_text

<br>

In [None]:
# check the data type of the stored variable using class()
class(my_text)

<br>

* We see above that the class of the variable `my_text` is `character`

<br>

* We can also use apostrophes to define a `character` data type

In [None]:
# store a value as a variable
my_text2 <- 'this is my text'
my_text2

<br>

In [None]:
# check the data type of the stored variable using class()
class(my_text2)

<br>

* Caution: Numeric data can be stored as a `character` data type

In [None]:
# store a value as a variable
my_text3 <- '3.14'
my_text3

<br>

In [None]:
# check the data type of the stored variable using class()
class(my_text3)

<br>
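
* Why the caution matters: a number wrapped in quotes behaves like text, so arithmetic on it fails. The sketch below re-creates `my_text3` and checks it with `is.numeric()` and `is.character()`, then converts it with `as.numeric()` (covered later in this lecture) before adding 1.

```r
# a number wrapped in quotes is text, not a number
my_text3 <- '3.14'

# ask R directly whether the value is numeric or character
is.numeric(my_text3)     # FALSE
is.character(my_text3)   # TRUE

# my_text3 + 1 would stop with an error because my_text3 is a character
# convert to numeric first, then do the arithmetic
as.numeric(my_text3) + 1   # 4.14
```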

<br>

### `logical` data type

* The `logical` data type (also known as a Boolean) is VERY IMPORTANT!!!
* A logical is a binary value that is either `TRUE` or `FALSE`. When R converts between logicals and numbers,
  *   `0` corresponds to `FALSE`
  *   `1` corresponds to `TRUE`
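* A minimal sketch of this correspondence, using `as.numeric()` (covered later in this lecture) to turn logicals into numbers:

```r
# converting logicals to numbers
as.numeric(TRUE)    # 1
as.numeric(FALSE)   # 0

# in arithmetic, R treats TRUE as 1 and FALSE as 0
TRUE + TRUE         # 2
```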

<br>

In [None]:
# store a logical variable
my_logical <- TRUE
my_logical

<br>

In [None]:
# check the data type of the stored variable using class()
class(my_logical)

<br>

In [None]:
# store a logical variable
my_logical <- FALSE
my_logical

<br>

In [None]:
class(my_logical)

<br>

<br>

### `integer` data type

* The `integer` data type is often used for looping mechanisms (covered later)
* Numbers are stored as the `numeric` data type by default
* We can convert the `numeric` data type to an `integer` data type using `as.integer()`

<br>

In [None]:
# store a numeric data type
b <- 5

<br>

In [None]:
# check the data type of the stored variable using class()
class(b)

<br>

In [None]:
# convert to an integer data type using as.integer()
b <- as.integer(b)

<br>

In [None]:
# check the data type of the stored variable using class()
class(b)

<br>
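
* Two related details, shown as a short sketch: R also lets us write an integer directly by adding the suffix `L` to a number, and `as.integer()` drops the decimal part rather than rounding. The variable names `b_numeric` and `b_integer` are made up for this example.

```r
# without the L suffix, a whole number is still numeric
b_numeric <- 5
class(b_numeric)   # "numeric"

# with the L suffix, the value is stored as an integer
b_integer <- 5L
class(b_integer)   # "integer"

# as.integer() drops the decimal part (it does not round)
as.integer(3.9)    # 3
```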

<br>

## Converting Between Data Types

* Sometimes, you may encounter data stored in the incorrect data type
* Therefore, it is helpful to convert between data types using the following functions
  * `as.numeric()`
  * `as.character()`
  * `as.logical()`
  * `as.integer()`

<br>

### `as.numeric()`

In [None]:
# store a character data type
c <- '3.14'
c

In [None]:
# checking the data type using class()
class(c)

<br>

In [None]:
# convert the character data type to numeric
c <- as.numeric(c)
c

In [None]:
# checking the data type using class()
class(c)

<br>
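
* Conversion only works when the text actually looks like a number. In the sketch below (the variable names are made up for this example), `as.numeric()` cannot interpret the word `"three"`, so it returns `NA` (a missing value) and prints a warning.

```r
# text that does not look like a number
not_a_number <- "three"

# the conversion returns NA and R prints a warning
converted <- as.numeric(not_a_number)
converted          # NA

# is.na() checks whether a value is missing
is.na(converted)   # TRUE
```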

<br>

### `as.character()`

In [None]:
# store a numeric data type
c <- 3.14
c

In [None]:
# checking the data type using class()
class(c)

<br>

In [None]:
# convert the numeric data type to character
c <- as.character(c)
c

In [None]:
# checking the data type using class()
class(c)

<br>

<br>

### `as.logical()`

In [None]:
# store a numeric data type
c <- 1
c

In [None]:
# checking the data type using class()
class(c)

<br>

In [None]:
# convert the numeric data type to logical
c <- as.logical(c)
c

In [None]:
# checking the data type using class()
class(c)

<br>

In [None]:
# store a numeric data type
c <- 0
c

In [None]:
# checking the data type using class()
class(c)

<br>

In [None]:
# convert the numeric data type to logical
c <- as.logical(c)
c

In [None]:
# checking the data type using class()
class(c)

<br>
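
* The examples above use only `1` and `0`. The sketch below shows how `as.logical()` handles a few other inputs: any non-zero number becomes `TRUE`, the text `"TRUE"` is recognized, and text that R does not recognize as a logical becomes `NA`.

```r
# any non-zero number converts to TRUE
as.logical(5)        # TRUE

# the text "TRUE" is recognized as a logical
as.logical("TRUE")   # TRUE

# text that R does not recognize becomes NA
as.logical("yes")    # NA
```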

<br>

---

<br>

## Data Structures

<br>

* Data types can be considered the most basic form of data

* A collection of values, each with a data type, forms what we call a **data structure**

* The `R` programming language has many data structures

<br>

* Data structures we will commonly use in `R` are:
  *  dataframes
  *  vectors
  *  factors (categorical data)
  *  lists
  

<br>

* In today's lecture, we will cover data frames as they are the most important data structure for data science in R!

<br>

<br>

### Dataframes

* Dataframes are possibly the most important data structure for this course

* If our data is not in the form of a dataframe, then we typically convert our data into a dataframe

<br>

* We can load the `murders` dataset from the `dslabs` library as a data frame example

* A dataframe, as shown below, is essentially a table with rows and columns

* Each row represents an **observation**

* Each column represents a **variable** describing some attribute of each observation

In [None]:
murders

<br>

* Similar to the data types, we can also check the type of data structure using the `class()` function

In [None]:
# check the type of data structure using class()
class(murders)

<br>

<br>

#### `head()`

* Data frames can have many observations
* The `head()` function is useful for viewing the first few lines of a data frame for initial exploration

In [None]:
# using head() to view first 6 lines (by default)
head(murders)

<br>

In [None]:
# view the first 2 lines using head()
head(murders, 2)

<br>

<br>

#### `tail()`

* The `tail()` function is useful for viewing the last few lines of a data frame for initial exploration

In [None]:
# using tail() to view last 6 lines (by default)
tail(murders)

<br>

In [None]:
# view the last 2 lines using tail()
tail(murders, 2)

<br>

<br>

#### `str()`

* We can use the `str()` function to describe attributes of the dataframe

* From the output below, we observe the following
  * The dataframe `murders` contains 51 observations (rows)
  * There are 5 variables (columns)
  * The variables in the dataframe are `state`, `abb`, `region`, `population`, `total`
  * The dataframe contains `character` (chr), `factor`, and `numeric` (num) data

In [None]:
# using str()
str(murders)

<br>

<br>

#### `ncol()`, `nrow()`, `dim()`

* `ncol()` provides the number of variables (columns) in a data frame


In [None]:
# view the number of columns using ncol()
ncol(murders)

<br>

* `nrow()` provides the number of observations (rows) in a data frame


In [None]:
# view the number of rows using nrow()
nrow(murders)

<br>

* `dim()` provides both the number of observations and variables of a data frame


In [None]:
# view the dimensions of a data frame
dim(murders)

<br>

<br>

#### `names()`

* The `names()` function extracts all of the variable names of a data frame

In [None]:
# extract the variable names of the murders data frame
names(murders)

In [None]:
# variable names are stored as a character vector
class(names(murders))

<br>

<br>

#### Extracting variables using the `$` operator

* To access a single variable, we use the `$` operator

* The following code extracts the `population` variable from the `murders` dataframe

In [None]:
# extract a variable from a dataframe using $
murders$population

<br>

* Accessing a single variable from the dataframe results in a vector!

* This is our next data structure...

<br>

<br>

### Vectors

* A vector is a collection of elements that share the same data type

* That is,
  * all values in a numeric vector must be numeric
  * all values in a character vector must be characters



<br>

* For example, the `population` variable from the `murders` dataframe is a numeric vector

* We can check the data type in the vector using the `class()` function

In [None]:
# store the population column as pop
pop <- murders$population

# check stored properly
pop

<br>

In [None]:
# check the data type of the vector pop
class(pop)

<br>

<br>

* The `state` variable from the `murders` dataframe is a character vector

* We can check the data type in the vector using the `class()` function

In [None]:
# show the contents of the state variable
murders$state

<br>

In [None]:
# check the data type of the state variable
class(murders$state)

<br>

<br>

#### `length()`

* The `length()` function outputs the length of the vector

In [None]:
# determine the length of a vector
length(murders$total)

<br>

<br>

## Creating Vectors and Data Frames

* If we collected our own data, we would like to store the data in a data frame

* Therefore, it would be helpful to know how to create our own vectors and data frames



<br>

<br>

### Creating Vectors

* We can create vectors using the `c()` function

* Here, the 'c' in `c()` represents the word "concatenate"

* To concatenate a set of objects is to link them or connect them together

<br>

<br>

* The example below concatenates the numbers 2 and 8 using the `c()` function to create a vector

* The vector is then assigned to the variable `my_vector`

In [None]:
# create a vector using the c() function
my_vector <- c(2, 8)
my_vector

In [None]:
# check the class of my_vector
class(my_vector)

<br>
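
* Recall that all elements of a vector share the same data type. A short sketch of what happens when we mix types inside `c()`: R quietly converts every element to one common type, here `character`. The name `mixed` is made up for this example.

```r
# mix a number, text, and a logical in one vector
mixed <- c(1, "a", TRUE)

# every element has been converted to text
mixed          # "1" "a" "TRUE"

# the vector has a single data type
class(mixed)   # "character"
```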

<br>

### Creating data frames

* Variables (columns) of a data frame are vectors

* We can create a data frame from several vectors

* Note that these vectors must all be the same length

<br>

<br>

* Let's create a data frame with 4 observations (rows) and 4 variables (columns)

* The first step is to create the vectors for each variable
  * mood
  * hours
  * weather
  * temperature

In [None]:
# create variables using c()
mood        <- c("happy", "sad", "sad", "happy")
hours       <- c(1, 2, 1, 4)
weather     <- c("sunny", "cloudy", "cloudy", "sunny")
temperature <- c(75, 62, 71, 83)

<br>

<br>

* Using the four vectors above, we can create a data frame using the `data.frame()` function

In [None]:
my_data <- data.frame(mood, hours, weather, temperature)
my_data

<br>
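
* As noted earlier, the vectors should be the same length. The sketch below shows what happens otherwise: combining a vector of length 4 with a vector of length 3 stops with an error. The vector `short_weather` is made up for this example, and `tryCatch()` (not covered in this lecture) is used only so the code keeps running after the error.

```r
# one vector of length 4 and one of length 3
mood          <- c("happy", "sad", "sad", "happy")
short_weather <- c("sunny", "cloudy", "cloudy")

# data.frame() stops with an error; tryCatch() replaces the error with a message
result <- tryCatch(
  data.frame(mood, short_weather),
  error = function(e) "could not create the data frame"
)
result
```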

<br>

#### Variable names

* Setting variable names when defining the data frame

In [None]:
# specifying variable names
my_data <- data.frame(my_mood = mood, number_hours = hours, type_of_weather = weather, temperatureF = temperature)
my_data

<br>

* Overwriting existing variable names

In [None]:
# redefine the data frame using old names
my_data <- data.frame(mood, hours, weather, temperature)
my_data

In [None]:
# check names of the data frame using names()
names(my_data)

<br>

* Create a new character vector of variable names

In [None]:
# we can create a whole new character vector of names
new_names <- c("my_mood", "number_hours", "type_of_weather", "temperatureF")
new_names

<br>

* Overwrite the existing variable names with the new variable names

In [None]:
# overwrite the old names using the new names
names(my_data) <- new_names

In [None]:
# check our work
names(my_data)

<br>

<br>

#### Row names

* In addition to variable (column) names, R also allows names for rows of a data frame
* We can check the row names of a data frame using the `rownames()` function

In [None]:
my_data

In [None]:
# view the row names
rownames(my_data)

<br>

* By default, the row names of a data frame simply number the rows (1, 2, 3, ...) as shown above
* We can customize our row names, similar to the variable names of a data frame

In [None]:
# create character vector of new row names
new_row_names <- c("Person 1", "Person 2", "Person 3", "Person 4")

In [None]:
# overwrite old row names
rownames(my_data) <- new_row_names
my_data

<br>
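
* The functions we used to explore `murders` work the same way on a data frame we build ourselves. As a quick check, the sketch below re-creates `my_data` from the four vectors and inspects it with `nrow()`, `ncol()`, `dim()`, and `str()`.

```r
# re-create the four vectors and the data frame
mood        <- c("happy", "sad", "sad", "happy")
hours       <- c(1, 2, 1, 4)
weather     <- c("sunny", "cloudy", "cloudy", "sunny")
temperature <- c(75, 62, 71, 83)
my_data     <- data.frame(mood, hours, weather, temperature)

# number of observations (rows) and variables (columns)
nrow(my_data)   # 4
ncol(my_data)   # 4
dim(my_data)    # 4 4

# describe the variables and their data types
str(my_data)
```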

<br>