# <a name="03-title"><font size="6">Module 03: Creating and Slicing Data Frames</font></a>

---

# <a name="02structure">The Structure of Data Frames</a>
---

Recall from [Module 02](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap4/13-Estimation-MLE.ipynb) that <font color="dodgerblue">**data frames**</font> are two-dimensional data objects and are the fundamental data structure used by most of R's libraries of functions and data sets. Tabular data is tidy if:

- Each row corresponds to a different observation.
- Each column corresponds to a different variable, stored as a vector.
- Each column vector must be a single data type, though different columns may have different data types.



# <a name="create-df">Creating Data Frames from Scratch</a>

---

Data frames are created by passing vectors into the `data.frame()` function.

- First we assign each variable (column) to a separate vector.
- Then we assign the variables to the data frame using the `data.frame()` function.

Consider the following example:

In [None]:
# define the variables as separate vectors
d <- c(2L, 4L, 6L, 8L)  # vector of integers (note 2L is the integer 2)
e <- c(2, 2.1, 2.2, 2.3)  # vector of decimals (note 2 is a decimal)
f <- c("red", "white", "blue", NA)  # vector of characters with one missing value
g <- c(TRUE, TRUE, TRUE, FALSE)  # vector of logicals

# create a data frame named df with 4 columns
df <- data.frame(d, e, f, g)

# print the data frame df to screen
df
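
One way to confirm that each column holds a single data type is to inspect the class of every column. The sketch below rebuilds the same four vectors so it runs on its own; the names `df_types` and `col_types` are introduced here only for illustration.

```r
# rebuild the example vectors so this cell runs on its own
d <- c(2L, 4L, 6L, 8L)
e <- c(2, 2.1, 2.2, 2.3)
f <- c("red", "white", "blue", NA)
g <- c(TRUE, TRUE, TRUE, FALSE)
df_types <- data.frame(d, e, f, g)

# class of each column: one type per column, different types across columns
col_types <- sapply(df_types, class)
col_types

# dimensions: number of rows, then number of columns
dim(df_types)
```

Column `f` is reported as `character` in R 4.0 or later; older versions of R convert strings to `factor` by default.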

## <a name="name-col">Naming Column Headers</a>

---

The columns of a data frame can be renamed using the `names()` function on the data frame.

In [None]:
# name columns of data frame
names(df) <- c("ID", "Measure", "Color", "Passed")
df

The columns of a data frame can be named when you are first creating the
data frame by using `[new_name] = [orig_vec_name]` for each vector of
data.

In [None]:
# create data frame with better column names
df2 <- data.frame(ID = d, Measure = e, Color = f, Passed = g)
df2
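
If only one column needs a new name, indexing the result of `names()` is one possible approach. This is a minimal sketch using a small hypothetical data frame `df_demo` and a hypothetical new name `Value`.

```r
# small hypothetical data frame for illustration
df_demo <- data.frame(ID = c(2L, 4L), Measure = c(2, 2.1))

# rename only the second column and leave the first unchanged
names(df_demo)[2] <- "Value"
names(df_demo)
```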

## <a name="check-structure">Checking Data Structure</a>
---

-   The `is.matrix(x)` function tests whether or not an object `x` is a matrix.
-   The `is.vector(x)` function tests whether `x` is a vector.
-   The `is.data.frame(x)` function tests whether `x` is a data frame.

In [None]:
is.matrix(df)
is.vector(df)
is.data.frame(df)
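
The checks above can be complemented with `class()` and `is.list()`. The sketch below uses a small hypothetical data frame `df_demo` to illustrate that a data frame is stored as a list of column vectors, while an individual column is an ordinary vector.

```r
# small hypothetical data frame for illustration
df_demo <- data.frame(ID = c(2L, 4L), Passed = c(TRUE, FALSE))

class(df_demo)         # reports the class of the whole object
is.list(df_demo)       # a data frame is a list of equal-length columns
is.vector(df_demo$ID)  # a single column is an ordinary vector
```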

# <a name="extract">Extracting and Slicing Data Frames</a>
---



## <a name="extract-name">Extracting a Column By Name</a>
---

The column vectors of a data frame may be extracted using `$` and
specifying the name of the desired vector.

-   `df$Color` would access the `Color` column of data frame `df`.

In [None]:
df$Color  # prints column of data frame df named Color
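
Double square brackets with a quoted name give the same column as `$`. The following sketch compares the two on a small hypothetical data frame `df_demo`; the names `by_dollar` and `by_brackets` are made up for this example.

```r
# small hypothetical data frame for illustration
df_demo <- data.frame(ID = c(2L, 4L, 6L), Color = c("red", "white", "blue"))

by_dollar <- df_demo$Color         # extract with $
by_brackets <- df_demo[["Color"]]  # extract with [[ ]] and a quoted name

# both approaches return the same vector
identical(by_dollar, by_brackets)
```

The `[[ ]]` form can be handy when the column name is stored in a variable.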

## <a name="indexing">Slicing Rows and Columns By Indexing</a>
---

Part of a data frame can also be extracted by thinking of it as a general matrix and specifying the desired rows or columns in square brackets `[ , ]` after the object name.

- As with matrices, we first indicate the row indices we want to slice inside the square brackets, followed by a comma, and then we indicate the column indices.

- For a continuous range of rows or columns, we use a colon, as in `3:8`.

- For a non-continuous range of rows or columns, we enter the indices as a vector using the syntax `c(index1, index2, ...)`.

-  <font color="dodgerblue">**Note R starts with index 1 which is different from Python which indexes starting from 0.**</font>

For example, if we had a data frame named `df`:

-   `df[6, ]` would slice row 6 of `df` and include all columns.
-   `df[3:8, ]` would slice all rows 3 thru 8 of `df` and include all columns.
-   `df[c(1, 5, 9), 4]` would slice rows 1, 5 and 9 of `df` and keep only column 4 of those rows.
-   `df[, c(1, 8)]` would keep all rows and slice only columns 1 and 8 of `df`.
-   `df[1:4, 2:6]` would slice rows 1 thru 4 of columns 2 thru 6 of `df`.
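
Since the data frame `df` created earlier has only 4 rows and 4 columns, the slicing patterns listed above can be tried on a larger table. The sketch below uses a hypothetical 10-row, 4-column data frame named `grid_df`, so the indices are adapted to fit its size.

```r
# hypothetical data frame with 10 rows and 4 columns
grid_df <- data.frame(x1 = 1:10, x2 = 11:20, x3 = 21:30, x4 = 31:40)

row6 <- grid_df[6, ]              # row 6, all columns
rows3to8 <- grid_df[3:8, ]        # rows 3 thru 8, all columns
col4 <- grid_df[c(1, 5, 9), 4]    # rows 1, 5 and 9 of column 4
cols13 <- grid_df[, c(1, 3)]      # all rows, columns 1 and 3
block <- grid_df[1:4, 2:3]        # rows 1 thru 4 of columns 2 thru 3

col4
dim(block)
```

Note that selecting a single column this way returns a plain vector rather than a data frame.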

In [None]:
df

In [None]:
head(df)  # prints the first 6 rows (all 4 rows here), with row indices on the left

## <a name="quest1">Question 1</a>

---

Let data frame `df` be the data frame defined above. What would be the output of the following commands? Explain what the output would be in the text cell, then run the code cells below to check your work.

<br>  

a.  `df[2, ]`

<br>  

b.  `df[, 2]`

<br>  

c.  `df[1:2, c(1,3)]`

<br>  

### <a name="sol1">Solution to Question 1</a>

---

Explain what the output would be in the text cell, then run the code cells below to check your work.

<br>  

a.  

<br>  

b.  

<br>  

c.  

<br>  
<br>  

In [None]:
df[2, ]

In [None]:
df[, 2]

In [None]:
df[1:2, c(1,3)]

## <a name="mult-names">Extracting Multiple Columns by Name</a>

----

If you need to select multiple columns of a data frame by name, you can pass a character vector with column names in the column position of `[]`.

-   `df[, c("ID", "Passed")]` would extract the `ID` and `Passed` columns of `df`.

In [None]:
df[, c("ID", "Color", "Passed")]

In [None]:
df[, c(1, 3, 4)]  # another way to pick columns 1, 3 and 4

## <a name="exclude">Excluding Rows and/or Columns</a>

---

We can exclude rows or columns from a data frame using a minus sign `-`.

For example, if we had a data frame named `df`:

-   `df[-6, ]` would slice all rows and columns from `df` *except for row 6*.
-   `df[, -c(3, 6, 11)]` would extract all rows and all columns from `df` *except for columns 3, 6, and 11*.
-   `df[-c(2:4), -c(4:7)]` would extract all rows *except rows 2 thru 4* and all columns *except columns 4 thru 7* of `df`.


In [None]:
# another way to pick columns 1, 3 and 4
df[, -2]  # exclude column 2

In [None]:
df[-c(1:2), ]  # exclude rows 1 thru 2

# <a name="Extract-vec">Extracting Parts of a Vector</a>

---

Similarly, subsets of the elements of a vector can be extracted by appending an index vector in square brackets `[]` to the name of the vector.

- With vectors, we only need to specify one dimension, the positional index, inside the square brackets.
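
As a brief illustration, the sketch below indexes a hypothetical vector `v`, which is different from the vector used in the question that follows.

```r
# hypothetical vector for illustration
v <- c(10, 20, 30, 40, 50)

v[2]         # second element
v[c(1, 5)]   # first and fifth elements
v[2:4]       # elements 2 thru 4
v[-1]        # every element except the first
```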



## <a name="quest2">Question 2</a>

---

Consider the vector `a` created by the code cell below.


In [None]:
# define a sequence 2, 4, ..., 16
a <- seq(2, 16, by = 2)
a

### <a name="quest2a">Question 2a</a>

---

Extract the 2nd, 4th, and 6th elements of the vector `a`.

### <a name="quest2b">Question 2b</a>

---

Extract all elements in `a` except the 2nd, 4th, and 6th using the minus sign (`-`) method.



### <a name="quest2c">Question 2c</a>

---

Extract all elements in `a` except elements 3 through 6 using the minus sign (`-`) method.



# <a name="importing">Importing an External File as a Data Frame</a>

---

The `read.table` function imports data from a file into R as a data frame.

Usage: `read.table(file, header = TRUE, sep = ",")`

-   `file` is the file path and name of the file you want to import into
    R.
    -   If you don’t know the file path, setting `file = file.choose()`
        will bring up a dialog box asking you to locate the file you want
        to import.
-   `header` specifies whether the data file has a header (variable
    labels for each column of data in the first row of the data file).
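    -   If you use `header = FALSE`, then R will assume the file doesn’t
        have any headings. If you don’t specify this option, `read.table`
        usually makes the same assumption (it only treats the first row as
        a header when that row has one fewer field than the other rows).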
    -   `header = TRUE` tells R to read in the data as a data frame with
        column names taken from the first row of the data file.
-   `sep` specifies the delimiter separating elements in the file.
    -   If each column of data in the file is separated by a space, then
        use `sep = " "`
    -   If each column of data in the file is separated by a comma, then
        use `sep = ","`
    -   If each column of data in the file is separated by a tab, then
        use `sep = "\t"`.
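
To see how `header` and `sep` work together without relying on an external file, the sketch below writes a tiny tab-separated file to a temporary location and reads it back. The file contents and the names `tmp` and `demo` are made up for this illustration.

```r
# write a small tab-separated file with a header row to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tprice", "1\t9.5", "2\t12.0"), tmp)

# read it back: first row gives column names, tab separates the columns
demo <- read.table(file = tmp, header = TRUE, sep = "\t")
demo
str(demo)

# remove the temporary file
unlink(tmp)
```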

Here is an example reading a CSV (comma-separated values) file with a header:

In [None]:
# import data as data frame
bike_store <- read.table(file = "https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Data/Transactions.csv",
                  header = TRUE,  # Keep column headers as names
                  sep = ",")  # comma as separator of columns

In [None]:
str(bike_store)

## <a name="read-csv">Loading .csv Files with `read.csv`</a>

---

If the data we are importing is stored in a comma-separated values file (.csv), then we can also use the function `read.csv()` to import the csv file into an R data frame.

In [None]:
# import data as data frame
bike_store <- read.csv(file = "https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Data/Transactions.csv")

In [None]:
summary(bike_store)
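
For a simple file, `read.csv()` behaves like `read.table()` with `header = TRUE` and `sep = ","`. The sketch below checks this on a small temporary file; the file contents and the names `tmp`, `via_table` and `via_csv` are hypothetical.

```r
# write a small comma-separated file with a header row
tmp <- tempfile(fileext = ".csv")
writeLines(c("brand,list_price", "A,100.5", "B,250"), tmp)

# import the same file two ways
via_table <- read.table(file = tmp, header = TRUE, sep = ",")
via_csv <- read.csv(file = tmp)

# compare the two data frames
all.equal(via_table, via_csv)

# remove the temporary file
unlink(tmp)
```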

## <a name="clean-bike">Cleaning the Bike Store Data</a>

---

- `na.omit()` removes all observations that have an `NA` value for at least one of the variables.
- We convert categorical variables to factors using the `factor()` function.
- We extract data from columns 6 thru 12 and will focus our analysis on data from those columns.

In [None]:
# remove observations with missing values
bike_store <- na.omit(bike_store)

bike_store$order_status <- factor(bike_store$order_status)
bike_store$brand <- factor(bike_store$brand)
bike_store$product_line <- factor(bike_store$product_line)
bike_store$product_class <- factor(bike_store$product_class)
bike_store$product_size <- factor(bike_store$product_size)
bike_clean <- bike_store[, c(6:12)]

summary(bike_clean)
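
The effect of `na.omit()` and `factor()` can be easier to see on a tiny table. Below is a minimal sketch with a made-up data frame `toy` that has missing values in two different rows.

```r
# made-up data frame with missing values in rows 2 and 3
toy <- data.frame(brand = c("A", "B", NA, "A"),
                  price = c(100, NA, 300, 400))

# drop every row that has at least one NA (rows 1 and 4 remain)
toy_clean <- na.omit(toy)

# convert the categorical variable to a factor
toy_clean$brand <- factor(toy_clean$brand)

nrow(toy_clean)
levels(toy_clean$brand)
summary(toy_clean)
```

With `brand` stored as a factor, `summary()` reports a count for each level instead of a generic description of a character column.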

## <a name="quest3">Question 3</a>

---

Use indexing to slice rows 23 thru 28 and variables `brand`, `product_line` and `list_price` from the data frame `bike_clean` created in the previous code cell. Do not write the subsetted data over the data frame `bike_clean`. Simply print the extracted data to the screen and do not assign it to any object.

<br>

*Hint: Be sure you have run the previous code cells to load the original data frame `bike_store` and stored the cleaned data to `bike_clean` before solving this question.*

In [None]:
bike_clean[??, ??]

# <a name="logical">Logical Statements</a>

---

Sometimes we need to know if the elements of an object satisfy certain
conditions. This can be determined using the comparison operators `<`,
`<=`, `>`, `>=`, `==`, `!=`, each of which returns a logical value.

-   `<` means strictly less than.
-   `<=` means less than or equal to.
-   `>` means strictly greater than.
-   `>=` means greater than or equal to.
-   `==` means equal to.
-   `!=` means NOT equal to.




Execute the following commands in R and see what you get.

In [None]:
a <- seq(2, 16, by = 2) # creating the vector a
a
a > 10
a <= 4
a == 10
a != 10
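
Because `TRUE` counts as 1 and `FALSE` as 0 in arithmetic, a logical vector can also be summarized. The sketch below counts and locates the elements of `a` that exceed 10; the names `n_big` and `pos_big` are introduced only for this example.

```r
a <- seq(2, 16, by = 2)  # the vector 2, 4, ..., 16

n_big <- sum(a > 10)      # number of elements greater than 10
pos_big <- which(a > 10)  # positions of those elements

n_big
pos_big
```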

## <a name="and-or">And and Or Statements</a>

---

More complicated logical statements can be made using `&` and `|`.

-   `&` means “and”
    -   Both statements must be true for `state1 & state2` to return
        `TRUE`.
-   `|` means “or”
    -   At least one of the two statements must be true for
        `state1 | state2` to return `TRUE`.
    -   If both statements are true in an “or” statement, the statement is also `TRUE`.

Below is a summary of “and” and “or” logic:

-   `TRUE & TRUE` returns `TRUE`
-   `FALSE & TRUE` returns `FALSE`
-   `FALSE & FALSE` returns `FALSE`
-   `TRUE | TRUE` returns `TRUE`
-   `FALSE | TRUE` returns `TRUE`
-   `FALSE | FALSE` returns `FALSE`

In [None]:
# relationship between logicals & (and), | (or)
TRUE & TRUE
FALSE & TRUE
FALSE & FALSE
TRUE | TRUE
FALSE | TRUE
FALSE | FALSE

We can execute the following commands in R and check the output.

In [None]:
b <- 3  # b is equal to the number 3

# complex logical statements
(b > 6) & (b <= 10)  # FALSE and TRUE
(b <= 4) | (b >= 12)  # TRUE or FALSE
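
The `&` and `|` operators also work element by element on vectors. As a sketch, the code below applies them to the vector `a` from earlier; `in_range` and `extremes` are hypothetical names.

```r
a <- seq(2, 16, by = 2)  # the vector 2, 4, ..., 16

# TRUE where an element is greater than 4 AND less than 12
in_range <- (a > 4) & (a < 12)

# TRUE where an element is at most 4 OR at least 14
extremes <- (a <= 4) | (a >= 14)

in_range
extremes
```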

## <a name="logic-index">Logical Indexing</a>
---

We can use a logical statement as an index to extract certain entries from a vector or data frame. For example, if we want to know the `order_status` (column 1), `brand` (column 2), `product_line` (column 3), and `list_price` (column 6) of all transactions that have a `list_price` greater than \$2,090, then:

-   We use a logical index for the row to extract just the rows that have a `list_price` value strictly greater than 2090.
-   We indicate we want to keep just columns 1 thru 3, and 6 with the column index `c(1:3, 6)`.
-   We store the results to a new data frame named `expensive`.
-   Finally, we print the first 6 rows of our new data frame with the `head()` function to check the results.

In [None]:
head(bike_clean)

In [None]:
expensive <- bike_clean[bike_clean$list_price > 2090, c(1:3, 6)]
head(expensive)
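
The same idea can be tried on smaller objects. The sketch below applies a logical index to the vector `a` and to a made-up data frame `toy`; the names `big_a`, `toy` and `pricey` are hypothetical.

```r
# logical index on a vector: keep elements greater than 10
a <- seq(2, 16, by = 2)
big_a <- a[a > 10]
big_a

# logical index on the rows of a made-up data frame
toy <- data.frame(item = c("w", "x", "y", "z"),
                  price = c(5, 25, 15, 40))
pricey <- toy[toy$price > 20, ]  # keep rows with price above 20, all columns
pricey
```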

## <a name="quest4">Question 4</a>

---

Use logicals and indexing to create a new data frame from the `bike_clean` data frame that satisfies the given conditions.

<br>  



### <a name="quest4a">Question 4a</a>

---

Contains all observations in the `bike_clean` data frame with `product_line` equal to `Road`. Assign the extracted data to a new data frame named `road_sales`.

<br>  



In [None]:
road_sales <- bike_clean[??, ??]
head(road_sales)  # check first 6 rows

### <a name="quest4b">Question 4b</a>

---

Contains all observations in the `bike_clean` data frame with `list_price` that is strictly less than the average list price. Assign the extracted data to a new data frame named `below_ave`.

<br>  


In [None]:
below_ave <- bike_clean[??, ??]
head(below_ave)  # check first 6 rows

### <a name="quest4c">Question 4c</a>

---


Contains all observations in the `bike_clean` data frame with `list_price` that is strictly less than the average list price AND has a `product_line` equal to `Road`. Assign the extracted data to a new data frame named `both_conditions`.

<br>  


In [None]:
both_conditions <- bike_clean[??, ??]
head(both_conditions)  # check first 6 rows

# <a name="subset">Slicing Data with the `subset()` Function</a>

---

As the name implies, the `subset()` function in base R is a really useful function for subsetting! We can open the help documentation with `?subset` to learn how to apply this function. Below are some examples of different ways we may want to subset the `bike_clean` data frame.

In [None]:
# keeps all variables for observations with product_line equal to Road
road_ver1 <- subset(bike_clean,  # name of data frame
                    product_line == "Road")  # logical condition

# keeps only the list_price and product_line columns
road_ver2 <- subset(bike_clean,  # name of data frame
                    select = c(list_price, product_line),  # column(s) to select
                    product_line == "Road")  # logical condition

# stores object as a vector instead of a data frame
road_ver3  <- subset(bike_clean,  # name of data frame
                     select = list_price,  # column(s) to select
                     product_line == "Road",  # logical condition
                     drop = TRUE)  # store object as a vector not a data frame

In [None]:
# all variables of product_line equal to Road are selected
head(road_ver1)

In [None]:
# just list_price and product_line columns are selected
head(road_ver2)

In [None]:
# list_price for product_line equal to Road stored in a vector
head(road_ver3)

In [None]:
ave_price <- mean(bike_clean$list_price)  # compute and store mean list price

road_below <- subset(bike_clean,  # data frame
                    select = c(list_price, product_line),  # name(s) of selected variable(s)
                    product_line == "Road" & list_price < ave_price)  # logical condition(s)

head(road_below)
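
For comparison, the same kind of selection can be written with `subset()` or with square brackets. The sketch below uses a made-up data frame `toy` with hypothetical columns `line` and `price`, and a made-up cutoff of 250.

```r
# made-up data frame for illustration
toy <- data.frame(line = c("Road", "Mountain", "Road", "Touring"),
                  price = c(100, 200, 300, 400))

# subset(): column names are used directly, without toy$
by_subset <- subset(toy, line == "Road" & price < 250,
                    select = c(line, price))

# square brackets: columns must be referenced through the data frame
by_bracket <- toy[toy$line == "Road" & toy$price < 250, c("line", "price")]

# both approaches give the same result here
all.equal(by_subset, by_bracket)
```

One difference to keep in mind is that `subset()` treats a missing value in the condition as `FALSE`, whereas square brackets return a row of `NA` values for it.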

## <a name="CC License">Creative Commons License Information</a>
---

![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

These materials were created by the [Department of Mathematical and Statistical Sciences at the University of Colorado Denver](https://github.com/CU-Denver-MathStats-OER/)
and are licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/).