# What Are Packages in R?

------------------------------------------------------------------------

An R package is a collection of functions, sample data, and/or other
code scripts. R installs a set of default packages during installation.
In this case, we are working with R in the cloud using [Google
Colaboratory](https://colab.research.google.com/). The files, code, and
data associated with installed packages are saved in the cloud and not
locally on your computer. Many R packages have already been installed.

## Loading Packages with the `library()` Command

------------------------------------------------------------------------

Each time we start or restart a new session and want to access the
library of functions and data in the package, we need to load the
library of files in the package with the `library()` command.

To demonstrate how to create common statistical plots in R, we will use
the `storms` data set which is located in the package `dplyr`.

-   The `dplyr` package is already installed in Google Colaboratory
-   We still need to use a `library` command to load the package.
-   **Run the code cell below to load the `dplyr` package.**

In [None]:
# load the library of functions and data in dplyr
library(dplyr)

## Reloading Packages When Restarting a Session

------------------------------------------------------------------------

If we take a break in our work, it is possible our R session will time
out and close. <span style="color: tomato;">**Each time we restart an R
session, we will need to rerun `library()` commands in order to reload
any packages we plan to use**</span>.

The same caution applies to any objects, vectors, or data frames we
create or edit in an R session. If a session times out, and we want to
use an object `x` that we previously created, we will need to run the
code cell(s) where object `x` is created again before we can refer back
to `x` in the current session.

**BE SURE YOU RUN THE COMMAND `library(dplyr)` BEFORE ATTEMPTING TO RUN
ANY OF THE CODE CELLS BELOW!**
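If you are unsure whether a restarted session still has what you need, a quick check along the following lines can help. This is a minimal sketch using base R's `search()` and `exists()`; the object name `x` is only the illustrative name used above.

```r
# Is dplyr currently attached in this session?
dplyr_attached <- "package:dplyr" %in% search()
dplyr_attached  # TRUE only after library(dplyr) has been run in this session

# Does an object named x currently exist in this session?
x_exists <- exists("x")
x_exists  # FALSE means the cell that creates x needs to be rerun
```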

## Help Documentation

------------------------------------------------------------------------

The plotting functions introduced in this document have robust help
documentation with lots of options to customize your plots. If you want
to view help documentation for any of the functions used in this
document, run commands such as `?hist`, `?plot`, `?table`, and so on.

In [None]:
# access help documentation for storms
?storms  # side panel should open with help manual for storms

In [None]:
# access help documentation for typeof
?typeof

# Getting to Know Our Data

------------------------------------------------------------------------

The package `dplyr` contains a data set called `storms`. Let’s find some
useful information about this data.

-   The code cell below will provide a numeric summary of all variables
    in the `storms` data.
-   Recall we need to first run the command `library(dplyr)` in the code
    cell above to be able to access `storms`.

In [None]:
# get a numerical summary of all variables
summary(storms)

## Missing Data

------------------------------------------------------------------------

A <span style="color: dodgerblue;">**missing value**</span> occurs when
the value of something isn’t known. R uses the special object `NA` to
represent a missing value. If you have a missing value, you should
represent that value as `NA`. Note: The character string `"NA"` is not
the same thing as `NA`.

-   The `storms` data has properly coded 14,382 missing values for
    `category` since storms that are not hurricanes do not have a
    category.
-   The `storms` data has properly coded 9,512 missing values for each
    of `tropicalstorm_force_diameter` and `hurricane_force_diameter`
    since these were recorded starting in 2004.
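The following sketch shows how `NA` behaves in practice. It uses a small made-up vector `v` rather than the `storms` data, together with the base R function `is.na()` and the `na.rm` argument of `mean()`.

```r
# a small made-up vector with one missing value
v <- c(3, NA, 7)

is.na(v)               # flags which entries are missing
sum(is.na(v))          # counts the missing entries
mean(v)                # NA, because one value is unknown
mean(v, na.rm = TRUE)  # 5, the mean of the non-missing values
is.na("NA")            # FALSE, the string "NA" is not a missing value
```

Many summary functions, such as `mean()`, `sd()`, and `sum()`, accept `na.rm = TRUE` to skip missing values.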

# Assignment to New (or Existing) Objects

------------------------------------------------------------------------

To store a data structure in the computer’s memory we must assign it a
name.

Data structures can be stored using the assignment operator `<-` or `=`.

Some comments:

-   In general, both `<-` and `=` *can* be used for assignment.
-   `<-` and `=` can be used identically most of the time, but not
    always.
-   <span style="color: dodgerblue;">**It’s safer and more conventional
    to use `<-` for assignment**</span>.
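One place where `<-` and `=` behave differently is inside a function call. The sketch below uses the base R function `median()` and a hypothetical object name `vals_demo` chosen only for illustration.

```r
# `=` inside a call names an argument; it does not create an object
median(x = c(1, 2, 3))

# `<-` inside a call creates vals_demo in the workspace, then passes it along
median(vals_demo <- c(1, 2, 3))
vals_demo  # the vector now exists as an object
```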

In the following code, we compute the mean of a vector. **Why can’t we
see the result after running it**?

In [None]:
w <- storms$wind  # wind is now stored in w
xbar.w <- mean(w)  # compute mean wind speed and assign to xbar.w

-   Once an object has been assigned a name, it can be printed by
    executing the name of the object.

In [None]:
xbar.w  # print the mean wind speed to screen

-   We can also print an object to screen using the `print()` function.

In [None]:
print(xbar.w)  # print the mean with print() command

-   We can calculate, assign, and print the result by putting
    parentheses around the assignment.

In [None]:
# calculate, assign, and print standard deviation
(s <- sd(w))  # note ( ) around the entire command

-   **Sometimes you want to see the result of a code cell, and sometimes
    you will not.**

# Basic Data Types

------------------------------------------------------------------------

R has 6 basic [data
types](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Basic-types):

1.  **character**: collections of characters. E.g., `"a"`,
    `"hello world!"`
2.  **double**: decimal numbers. E.g., `1.2`, `1.0`
3.  **integer**: whole numbers. In R, you must add `L` to the end of a
    number to specify it as an integer. E.g., `1L` is an integer but `1`
    is a double.
4.  **logical**: Boolean values, `TRUE` and `FALSE`
5.  **complex**: complex numbers. E.g., `1+3i`
6.  **raw**: a type to hold raw bytes.

## Checking Data Type Using `typeof()`

------------------------------------------------------------------------

-   The `typeof()` function returns the R internal type or storage mode
    of any object.

In [None]:
typeof(1.0)
typeof(2)
typeof(3L)
typeof("hello")
typeof(TRUE)
typeof(storms$status)
typeof(storms$year)
typeof(storms$name)
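Because `typeof()` reports how R stores an object internally, its answer for a factor can be surprising. The sketch below uses a small made-up factor `f` and compares `typeof()` with the base R function `class()`; it also shows the two basic types not covered in the cell above.

```r
# a small made-up factor
f <- factor(c("low", "high", "low"))

typeof(f)  # "integer": factors are stored as integer codes
class(f)   # "factor": how R treats the object
levels(f)  # the labels attached to the integer codes

# the two basic types not shown above
typeof(1+3i)         # "complex"
typeof(as.raw(255))  # "raw"
```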

## Investigating Data Types with `is.numeric()` and `str()`

------------------------------------------------------------------------

-   The `is.numeric(x)` function tests whether or not an object `x` is
    numeric.
-   The `is.character(x)` function tests whether `x` is a character or
    not.
-   The `is.factor(x)` function tests whether `x` is a factor or not.
-   <span style="color: dodgerblue;">**Note: Categorical data is
    typically stored as a `factor` in R.**</span>

In [None]:
is.numeric(storms$year)  # year is numeric
is.numeric(storms$category)  # category is also numeric
is.numeric(storms$name)  # name is not numeric
is.character(storms$name)  # name is character string

In [None]:
is.numeric(storms$status)  # status is not numeric
is.character(storms$status)  # status is not a character
is.factor(storms$status)  # status is a factor which is categorical

-   The `str(x)` function provides information about the structure of
    `x`, such as its class and levels.

In [None]:
str(storms$status)

## Changing Data Types

------------------------------------------------------------------------

### Converting to Categorical Data with `factor()`

------------------------------------------------------------------------

-   Sometimes we think a variable is one data type, but it is actually
    being stored (and thus interpreted by R) as a different data type.
-   One common issue is that categorical data is stored as characters.
    We would like observations with the same values to be grouped
    together.
-   The `status` variable in `storms` is being properly stored as a
    `factor`!
-   The `category` variable in `storms` is being stored as a `numeric`
    since it is ordinal.
-   With an ordinal variable, we may choose to keep it stored as
    `numeric`, or we may prefer to treat it as a factor.

In [None]:
summary(storms$category)

-   The summary of `category` computes statistics such as mean and
    median.
-   Typically with categorical data, we prefer to count how many
    observations are in each class of the variable.
-   In the code cell below, we convert `category` to a factor, and then
    observe the resulting summary.

In [None]:
storms$category <- factor(storms$category)
summary(storms$category)
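The same idea applies to categorical data stored as characters. Below is a minimal sketch with a made-up character vector `sizes` (not part of `storms`); the `levels` argument of `factor()` sets the order in which the categories are listed.

```r
# made-up character data that is really categorical
sizes <- c("small", "large", "medium", "small")
summary(sizes)  # only reports length, class, and mode

# convert to a factor, listing the levels in a meaningful order
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))
summary(sizes_f)  # counts per level: small 2, medium 1, large 1
levels(sizes_f)   # "small" "medium" "large"
```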

### Converting Data Types with `as.numeric()`, `as.integer()`, etc.

------------------------------------------------------------------------

From the summary of the `storms` data set we first found above, we see
that the variables `year` and `month` are being stored as `double`.
These variables actually are integer values.

We can convert a variable from one data type to another using
`as.[new_datatype]()`.

-   For example, to convert `year` to `integer`, we use
    `as.integer(storms$year)`.
-   To convert a data type to character, we can use `as.character(x)`.
-   To convert to a decimal (`double`), we can use `as.numeric(x)`.

In [None]:
typeof(storms$year)
typeof(storms$month)
storms$year <- as.integer(storms$year)
storms$month <- as.integer(storms$month)
typeof(storms$year)
typeof(storms$month)
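A few conversions on made-up values illustrate what the `as.*()` functions do, including two cases that often cause surprises. The objects `bad` and `f_num` are hypothetical names used only in this sketch.

```r
as.integer("12")    # 12L, a string of digits becomes an integer
as.numeric("3.14")  # 3.14
as.character(42)    # "42"
as.integer(3.9)     # 3, the decimal part is dropped, not rounded

# values that cannot be converted become NA (normally with a warning)
bad <- suppressWarnings(as.numeric("abc"))
bad

# converting a factor of numbers: go through character first
f_num <- factor(c("10", "20", "10"))
as.numeric(f_num)                # 1 2 1, the internal codes
as.numeric(as.character(f_num))  # 10 20 10, the original values
```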

# Data Structures

------------------------------------------------------------------------

R operates on <span style="color: dodgerblue;">**data
structures**</span>. A data structure is simply some sort of “container”
that holds certain kinds of information.

R has 5 basic data structures:

-   **vector**: One dimensional object of a single data type.
-   **matrix**: Two dimensional object of a single data type.
-   **array**: $n$ dimensional object of a single data type.
-   **data frame**: Two dimensional object where each column can be a
    different data type.
-   **list**: An object that can contain elements of different types
    (and possibly another list inside it).

[See R
documentation](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#List-objects)
for more info.
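As a quick illustration, the sketch below builds one small example of each structure with base R functions. All object names and values are made up for this example.

```r
vec <- c(1.5, 2, 3)                      # vector: one dimension, one type
mat <- matrix(1:6, nrow = 2, ncol = 3)   # matrix: 2 rows by 3 columns
arr <- array(1:24, dim = c(2, 3, 4))     # array: here with 3 dimensions
dat <- data.frame(id = 1:2,              # data frame: columns may differ in type
                  label = c("a", "b"))
lst <- list(1, "two", TRUE, list(4, 5))  # list: mixed types, with a nested list

dim(mat)     # 2 3
dim(arr)     # 2 3 4
dim(dat)     # 2 2
length(lst)  # 4
```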

## Vectors

------------------------------------------------------------------------

A <span style="color: dodgerblue;">**vector**</span> is a
single-dimensional set of data of the same type.

### Creating Vectors from Scratch

------------------------------------------------------------------------

The most basic way to create a vector is the combine function `c()`. The
first three commands below create vectors of type numeric (double),
character, and logical, respectively. The fourth vector, `x4`, mixes
types, so R coerces every element to character.

In [None]:
x1 <- c(1, 2, 5.3, 6, -2, 4)
x2 <- c("one", "two", "three")
x3 <- c(TRUE, TRUE, FALSE, TRUE)
x4 <- c(TRUE, 3.4, "hello")
typeof(x1)
typeof(x2)
typeof(x3)
typeof(x4)
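When `c()` receives values of different types, as with `x4` above, R converts them all to the most flexible type present. The short sketch below, on made-up values, shows the order logical, then integer, then double, then character.

```r
typeof(c(TRUE, 2L))    # "integer": TRUE becomes 1L
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"
c(TRUE, 3.4, "hello")  # "TRUE" "3.4" "hello"
```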

-   We can check the data structure of an object using commands such as
    `is.vector()`, `is.list()`, `is.matrix()`, and so on.

In [None]:
is.list(x1)
is.vector(x1)
is.list(x4)
is.vector(x4)

## Data Frames

------------------------------------------------------------------------

<span style="color: dodgerblue;">**Data frames**</span> are
two-dimensional data objects and are the **fundamental** data structure
used by most of R’s libraries of functions and data sets.

-   Tabular data is <span style="color: dodgerblue;">**tidy**</span> if
    each row corresponds to a different observation and each column
    corresponds to a different variable.

Each column of a data frame is a <span
style="color: dodgerblue;">**variable**</span> (stored as a **vector**).
If the variable:

-   Is measured or counted by a number, it is a <span
    style="color: dodgerblue;">**quantitative**</span> or <span
    style="color: dodgerblue;">**numerical**</span> variable.
-   Groups observations into different categories or rankings, it is a
    <span style="color: dodgerblue;">**qualitative**</span> or <span
    style="color: dodgerblue;">**categorical**</span> variable.

### Creating Data Frames from Scratch

------------------------------------------------------------------------

Data frames are created by passing vectors into the `data.frame()`
function.

The names of the columns in the data frame are the names of the vectors
you give the `data.frame` function.

Consider the following simple example.

In [None]:
# create basic data frame
d <- c(1, 2, 3, 4)
e <- c("red", "white", "blue", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(d,e,f)
df

### Naming Column Headers

------------------------------------------------------------------------

The columns of a data frame can be renamed using the `names()` function
on the data frame.

In [None]:
# name columns of data frame
names(df) <- c("ID", "Color", "Passed")
df

The columns of a data frame can be named when you are first creating the
data frame by using `[new_name] = [orig_vec_name]` for each vector of
data.

In [None]:
# create data frame with better column names
df2 <- data.frame(ID = d, Color = e, Passed = f)
df2

## Checking Data Structure

------------------------------------------------------------------------

-   The `is.matrix(x)` function tests whether or not an object `x` is a
    matrix.
-   The `is.vector(x)` function tests whether `x` is a vector.
-   The `is.data.frame(x)` function tests whether `x` is a data frame.

In [None]:
is.matrix(df)
is.vector(df)
is.data.frame(df)
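A data frame is neither a matrix nor a plain vector. Internally it is a list of equal-length columns, which the following sketch illustrates on a made-up data frame named `demo` (a hypothetical name, separate from `df`).

```r
demo <- data.frame(ID = c(1, 2),
                   Color = c("red", "blue"),
                   stringsAsFactors = FALSE)

is.list(demo)        # TRUE: a data frame is a list of columns
length(demo)         # 2, the number of columns
dim(demo)            # 2 2, rows then columns
sapply(demo, class)  # the class of each column
```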

# Extracting and Slicing Data Frames

------------------------------------------------------------------------

## Extracting a Column By Name

------------------------------------------------------------------------

The column vectors of a data frame may be extracted using `$` and
specifying the name of the desired vector.

-   `df$Color` would access the `Color` column of data frame `df`.

In [None]:
df$Color  # prints column of data frame df named Color

## Slicing Rows and Columns By Indexing

------------------------------------------------------------------------

Part of a data frame can also be extracted by thinking of it as a
general matrix and specifying the desired rows or columns in square
brackets after the object name.

-   <span style="color: dodgerblue;">**Note R starts with index 1 which
    is different from Python which indexes starting from 0.**</span>

For example, if we had a data frame named `df`:

-   `df[1,]` would access the first row of `df`.
-   `df[1:2,]` would access the first two rows of `df`.
-   `df[,2]` would access the second column of `df`.
-   `df[1:2, 2:3]` would access the information in rows 1 and 2 of
    columns 2 and 3 of `df`.

In [None]:
df[,2]  # second column is Color

In [None]:
df[2,]  # second row of df

In [None]:
df[1:2,2:3]  # first and second rows of columns 2 and 3

If you need to select multiple columns of a data frame by name, you can
pass a character vector with column names in the column position of
`[]`.

-   `df[, c("ID", "Passed")]` would extract the `ID` and `Passed`
    columns of `df`.

In [None]:
df[, c("Color", "Passed")]

In [None]:
df[, c(1, 3)]  # another way to pick columns 1 and 3

In [None]:
# another way to pick columns 1 and 3
df[, -2]  # exclude column 2
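One detail worth knowing: selecting a single column with `[ , ]` on a base R data frame returns a vector rather than a data frame. The sketch below uses a made-up data frame `demo` (a hypothetical name) and the `drop = FALSE` argument to keep the result as a data frame.

```r
demo <- data.frame(ID = c(1, 2, 3),
                   Color = c("red", "white", "blue"),
                   stringsAsFactors = FALSE)

col_vec <- demo[, 2]                # a single column comes back as a vector
col_df  <- demo[, 2, drop = FALSE]  # drop = FALSE keeps it as a data frame
is.data.frame(col_vec)  # FALSE
is.data.frame(col_df)   # TRUE

demo[["Color"]]  # same vector as demo$Color and demo[, "Color"]
```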

# Importing an External File as a Data Frame

------------------------------------------------------------------------

The `read.table` function imports data from file into R as a data frame.

Usage: `read.table(file, header = TRUE, sep = ",")`

-   `file` is the file path and name of the file you want to import into
    R.
    -   If you don’t know the file path, setting `file = file.choose()`
        will bring up a dialog box asking you to locate the file you
        want to import.
-   `header` specifies whether the data file has a header (variable
    labels for each column of data in the first row of the data file).
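    -   If you use `header = FALSE`, R treats the first row as data and
        assigns default column names (`V1`, `V2`, and so on). If you
        omit the option, `read.table()` decides from the file layout
        and, in most cases, likewise assumes there is no header.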
    -   `header = TRUE` tells R to read in the data as a data frame with
        column names taken from the first row of the data file.
-   `sep` specifies the delimiter separating elements in the file.
    -   If each column of data in the file is separated by a space, then
        use `sep = " "`
    -   If each column of data in the file is separated by a comma, then
        use `sep = ","`
    -   If each column of data in the file is separated by a tab, then
        use `sep = "\t"`.
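To see how `header` and `sep` behave without needing a file, we can pass a small made-up data set through the `text` argument of `read.table()`. The names `csv_text`, `toy`, and `toy_no_header` are hypothetical and used only in this sketch.

```r
# a tiny made-up comma-separated data set, supplied as text instead of a path
csv_text <- "id,price\n1,2.50\n2,3.00"

toy <- read.table(text = csv_text,
                  header = TRUE,  # first row holds the column names
                  sep = ",")      # commas separate the columns
toy
names(toy)  # "id" "price"

# with header = FALSE the first row is treated as data
toy_no_header <- read.table(text = csv_text, header = FALSE, sep = ",")
nrow(toy_no_header)  # 3
```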

Here is an example reading a CSV (comma-separated values) file with a
header:

In [None]:
# import data as data frame
bike.store <- read.table(file = "https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Data/Transactions.csv",
                  header = TRUE,  # Keep column headers as names
                  sep = ",")  # comma as separator of columns

glimpse(bike.store)

-   The `glimpse()` function provides a nice summary of the structure.
-   Run the code cell below to see the various options of
    `read.table()`.
-   There are other functions and packages that may be better at reading
    in tabular data. `read.table()` is a good place to start!

In [None]:
?read.table

# Logical Statements

------------------------------------------------------------------------

## Basic Comparisons

------------------------------------------------------------------------

Sometimes we need to know if the elements of an object satisfy certain
conditions. This can be determined using the logical operators `<`,
`<=`, `>`, `>=`, `==`, `!=`.

-   `==` means equal to.
-   `!=` means NOT equal to.

Execute the following commands in R and see what you get.

In [None]:
a <- seq(2, 16, by = 2) # creating the vector a
a
a > 10
a <= 4
a == 10
a != 10
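Because `TRUE` counts as 1 and `FALSE` as 0, the result of a comparison can be summarized directly. The sketch below recreates the vector `a` from the cell above and applies the base R functions `sum()`, `mean()`, `which()`, `any()`, and `all()`.

```r
a <- seq(2, 16, by = 2)  # 2 4 6 8 10 12 14 16

sum(a > 10)    # 3, the number of entries above 10
mean(a > 10)   # 0.375, the proportion of entries above 10
which(a > 10)  # 6 7 8, the positions of the matching entries
any(a == 10)   # TRUE, at least one entry equals 10
all(a > 0)     # TRUE, every entry is positive
```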

## And and Or Statements

------------------------------------------------------------------------

More complicated logical statements can be made using `&` and `|`.

-   `&` means “and”
    -   Both statements must be true for `state1 & state2` to return
        `TRUE`.
-   `|` means “or”
    -   At least one of the two statements must be true for
        `state1 | state2` to return `TRUE`.
    -   If both statements are true in an “or” statement, the statement
        is also `TRUE`.

Below is a summary of “and” and “or” logic:

-   `TRUE & TRUE` returns `TRUE`
-   `FALSE & TRUE` returns `FALSE`
-   `FALSE & FALSE` returns `FALSE`
-   `TRUE | TRUE` returns `TRUE`
-   `FALSE | TRUE` returns `TRUE`
-   `FALSE | FALSE` returns `FALSE`

In [None]:
# relationship between logicals & (and), | (or)
TRUE & TRUE
FALSE & TRUE
FALSE & FALSE
TRUE | TRUE
FALSE | TRUE
FALSE | FALSE

Execute the following commands in R and see what you get.

In [None]:
b <- 3  # b is equal to the number 3

# complex logical statements
(b > 6) & (b <= 10)  # FALSE and TRUE
(b <= 4) | (b >= 12)  # TRUE or FALSE
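The operators `&` and `|` also work element by element on vectors, and `!` negates a logical value. The following sketch uses the same made-up vector `a` as the comparison examples.

```r
a <- seq(2, 16, by = 2)  # 2 4 6 8 10 12 14 16

(a > 4) & (a < 12)       # TRUE only for 6, 8, 10
(a <= 4) | (a >= 14)     # TRUE for 2, 4, 14, 16
!(a > 10)                # ! negates: same result as a <= 10
sum((a > 4) & (a < 12))  # 3
```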

## Logical Indexing

------------------------------------------------------------------------

We can use a logical statement as an index to extract certain entries
from a vector or data frame. For example, if we want to know the
`product_id` (column 2), `brand` (column 7), `product_line` (column 8),
and `list_price` (column 11) of all transactions that have a
`list_price` greater than \$2,090, we can run the code cell below.

-   We use a logical index for the row to extract just the rows that
    have a `list_price` value strictly greater than 2090.
-   We indicate we want to keep just columns 2, 7 through 8, and 11 with
    the column index `c(2, 7:8, 11)`.
-   We store the results to a new data frame named `expensive`.
-   Finally, we print the first 6 rows of our new data frame with the
    `head()` function to check the results.

In [None]:
expensive <- bike.store[bike.store$list_price > 2090, c(2, 7:8, 11)]
head(expensive)
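Two refinements are sometimes useful with logical indexing: handling missing values and selecting columns by name. The sketch below uses a small hypothetical data frame `sales` as a stand-in (it is not the real `bike.store` data), and wraps the condition in the base R function `which()`.

```r
# hypothetical stand-in for a transactions table; not the real bike.store data
sales <- data.frame(product_id = c(1, 2, 3, 4),
                    list_price = c(2100, 1500, NA, 2500))

# a missing price makes the comparison NA, which yields a row of NAs
with_na <- sales[sales$list_price > 2090, ]
nrow(with_na)  # 3: two matches plus one all-NA row

# which() keeps only the positions where the condition is TRUE
no_na <- sales[which(sales$list_price > 2090), ]
nrow(no_na)  # 2

# columns can be selected by name rather than by position
sales[which(sales$list_price > 2090), c("product_id", "list_price")]
```

Selecting columns by name keeps the code correct even if the column order of the data changes.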