# Table of Contents:

------------------------------------------------------------------------

This file is intended to be a quick reference to give an overview of
some common aspects of working in R.

-   [Loading Packages and Getting Help in
    R](#loading-packages-and-getting-help-in-r)
-   [Basic Data Types](#basic-data-types)
-   [Checking Data Type Using
    `typeof()`](#checking-data-type-using-typeof)
-   [Changing Data Types](#changing-data-types)
-   Different [Data structures](#data-structures): Vectors, Matrices,
    Data Frames, etc.
-   [Importing an External File as a Data
    Frame](#importing-an-external-file-as-a-data-frame)
-   [Logical statements](#logical-statements)

# Loading Packages and Getting Help in R

------------------------------------------------------------------------

To illustrate some of the features and terminology in R, we will use the
`storms` dataset, which is located in the `dplyr` package.

-   The `dplyr` package should already be installed (if not, look for the
    link to install at the top of this window).
-   You still need to load the package to access the libraries of
    functions and datasets in the package.
-   Run the code cell below to load `dplyr`.

In [None]:
# load the library of functions and data in dplyr
library(dplyr)

## Getting Help in R

------------------------------------------------------------------------

-   You can click on the `Help` menu in the menu bar; for example,
    `Markdown Quick Reference` opens in the Help Pane.
-   You can enter `?` followed by a command, function, or dataset to view
    help documentation that opens in the Help Pane.

In [None]:
?storms

In [None]:
?mean

In [None]:
summary(storms)

## Missing Data

------------------------------------------------------------------------

A <span style="color: blue;">**missing value**</span> occurs when the
value of something isn’t known. R uses the special object `NA` to
represent a missing value.

If you have a missing value, you should represent that value as `NA`.
Note: `"NA"` is not the same thing as `NA`.
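
The difference between `NA` and `"NA"` can be seen in a small sketch. The vector `v` below is made up for illustration and is not part of the `storms` data.

```r
# A made-up numeric vector with one missing value
v <- c(1, NA, 3)

is.na(v)               # flags which elements are missing
mean(v)                # a mean that involves NA is itself NA
mean(v, na.rm = TRUE)  # remove missing values before averaging

# The string "NA" is ordinary text, not a missing value
is.na("NA")
is.na(NA)
```

Many summary functions accept `na.rm = TRUE`, which is often the simplest way to handle missing values in a quick calculation.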

# Assignment to New (or Existing) Objects

------------------------------------------------------------------------

To store a data structure in the computer’s memory we must assign it a
name.

Data structures can be stored using the assignment operator `<-` or `=`.

Some comments:

-   In general, both `<-` and `=` *can* be used for assignment.
-   `<-` and `=` can be used identically most of the time, but not
    always.
-   It’s safer and more conventional to use `<-` for assignment.
-   **Pressing the “Alt” and “-” keys simultaneously on a PC** or Linux
    machine **(Option and - on a Mac)** will **insert `<-` into the R**
    console and script files.
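
One situation where `<-` and `=` behave differently is inside a function call. The following is a minimal sketch; the names `m1`, `m2`, and `vals` are hypothetical and chosen only for this example.

```r
# Inside a call, `=` matches the argument named x; no object x is created
m1 <- mean(x = c(1, 2, 3))

# Inside a call, `<-` first assigns to vals, then passes the value along
m2 <- mean(vals <- c(4, 5, 6))

m1
m2
vals  # vals now exists in the workspace as a side effect
```

Assigning inside a function call is rarely a good idea in practice; the example is only meant to show that the two operators are not interchangeable everywhere.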

In the following code, we compute the mean of a vector. **Why can’t we
see the result after running it**?

In [None]:
w <- storms$wind  # wind is now stored in w
xbar.w <- mean(w)  # compute mean windspeed and assign to xbar.w

-   Once an object has been assigned a name, it can be printed by
    entering the name of the object or by using the `print` function.

In [None]:
xbar.w  # print the mean wind speed to screen
print(xbar.w)  # print a different way

-   **Sometimes you want to see the result of a code cell, and sometimes
    you will not.**

# Basic Data Types

------------------------------------------------------------------------

R has 6 basic [data
types](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Basic-types):

1.  **character**: collections of characters. E.g., `"a"`,
    `"hello world!"`
2.  **double**: decimal numbers. E.g., `1.2`, `1.0`
3.  **integer**: whole numbers. In R, you must add `L` to the end of a
    number to specify it as an integer. E.g., `1L` is an integer but `1`
    is a double.
4.  **logical**: Boolean values, `TRUE` and `FALSE`
5.  **complex**: complex numbers. E.g., `1+3i`
6.  **raw**: a type to hold raw bytes.

## Checking Data Type Using `typeof()`

------------------------------------------------------------------------

-   The `typeof()` function returns the R internal type or storage mode
    of any object.

In [None]:
typeof(1.0)
typeof(2)
typeof(3L)
typeof("hello")
typeof(TRUE)
typeof(storms$status)
typeof(storms$year)
typeof(storms$category)

## Investigating Data Types with `is.numeric()` and `str()`

------------------------------------------------------------------------

-   The `is.numeric(x)` function tests whether or not an object `x` is
    numeric.
-   The `is.character(x)` function tests whether or not `x` is a
    character object.
-   The `str(x)` function compactly displays the structure of `x`,
    including its type or class.

In [None]:
is.numeric(storms$year)
is.numeric(storms$category)
is.numeric(storms$status)
is.character(storms$status)
str(storms$category)

## Changing Data Types

------------------------------------------------------------------------

### Converting to Categorical Data with `factor()`

------------------------------------------------------------------------

-   Sometimes we think a variable is one data type, but it is actually
    being stored (and thus interpreted by R) as a different data type.
-   One common issue is that categorical data is stored as characters. We
    would like observations with the same values to be grouped together.
-   Categorical data should be stored as a `factor` in R.

In [None]:
storms$status <- factor(storms$status)
summary(storms$status)
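
To see what `factor()` changes, it can help to compare the summaries of a small character vector before and after conversion. The vector `sizes` below is made-up data used only for illustration.

```r
# Made-up categorical data stored as character strings
sizes <- c("small", "large", "small", "medium", "large", "small")

summary(sizes)            # character vector: only length, class, and mode

sizes_f <- factor(sizes)  # convert to a factor
levels(sizes_f)           # the distinct categories, sorted alphabetically
summary(sizes_f)          # factor: a count for each category
```

By default the levels are ordered alphabetically; `factor()` also has a `levels` argument if a different order is wanted.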

### Converting Data Types with `as.numeric()`, `as.integer()`, etc.

------------------------------------------------------------------------

From the summary of the `storms` dataset above, we see that the variable
`year` is being stored as `double`. The same is true for `day` and
`month`. All of these variables take whole-number values. We can convert
a variable from one data type to another using `as.[new_datatype]()`.

-   For example, to convert `year` to `integer`, we use
    `as.integer(storms$year)`.
-   To convert a data type to character, we can use `as.character(x)`.
-   To convert to a decimal (`double`), we can use `as.numeric(x)`.
In [None]:
storms$year <- as.integer(storms$year)
summary(storms$year)
storms$day <- as.integer(storms$day)
storms$month <- as.integer(storms$month)

In [None]:
storms$month <- as.numeric(storms$month)
summary(storms$month)
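
Conversions do not always preserve the original value, so it is worth checking the result. The sketch below uses made-up values (and hypothetical object names) to show a few cases that can be surprising.

```r
int_from_dbl <- as.integer(3.9)    # truncates toward zero; it does not round
num_from_chr <- as.numeric("2.5")  # text that looks like a number converts cleanly
chr_from_int <- as.character(42L)  # the integer becomes the string "42"
int_from_lgl <- as.integer(TRUE)   # TRUE becomes 1 (FALSE would become 0)

# Text that cannot be parsed becomes NA. Without suppressWarnings(),
# R would also print a warning about the coercion.
bad <- suppressWarnings(as.numeric("abc"))

int_from_dbl
num_from_chr
chr_from_int
int_from_lgl
bad
```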

# Data structures

------------------------------------------------------------------------

R operates on <span style="color: blue;">**data structures**</span>. A
data structure is simply some sort of “container” that holds certain
kinds of information.

R has 5 basic data structures:

-   **vector**: One dimensional object of a single data type.
-   **matrix**: Two dimensional object of a single data type.
-   **array**: $n$ dimensional object of a single data type.
-   **data frame**: Two dimensional object where each column can be a
    different data type.
-   **list**: An object that can contain elements of different types
    (possibly including another list).

[See R
documentation](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#List-objects)
for more info.
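
As a rough illustration of the five structures listed above, the following sketch builds one small object of each kind. All object names and values are made up for this example.

```r
vec <- c(10, 20, 30)                     # vector: one dimension, one type
mat <- matrix(1:6, nrow = 2, ncol = 3)   # matrix: 2 rows by 3 columns
arr <- array(1:24, dim = c(2, 3, 4))     # array: 2 x 3 x 4
dat <- data.frame(id = 1:3,              # data frame: columns of
                  label = c("a", "b", "c"))  # different types
lst <- list(number = 1.5,                # list: mixed contents,
            text = "hello",              # including a nested list
            nested = list(TRUE, 2L))

dim(mat)   # dimensions of the matrix
dim(arr)   # dimensions of the array
str(dat)   # structure of the data frame
str(lst)   # structure of the list
```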

## Vectors

------------------------------------------------------------------------

A <span style="color: blue;">**vector**</span> is a single-dimensional
set of data of the same type.

### Creation

------------------------------------------------------------------------

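The most basic way to create a vector is the combine function `c`. The
first three commands below create vectors of type numeric (`double`),
character, and logical, respectively. The fourth mixes data types, so R
coerces every element to a single common type (here, character).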

In [None]:
x1 <- c(1, 2, 5.3, 6, -2, 4)
x2 <- c("one", "two", "three")
x3 <- c(TRUE, TRUE, FALSE, TRUE)
x4 <- c(TRUE, 3.4, "hello")
typeof(x1)
typeof(x2)
typeof(x3)
typeof(x4)
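
Because a vector holds a single data type, `c` converts mixed inputs to the most flexible type present. The following sketch, using made-up values and hypothetical names, shows the usual order: logical, then integer, then double, then character.

```r
t1 <- typeof(c(TRUE, 2L))    # logical and integer combine to integer
t2 <- typeof(c(2L, 3.5))     # integer and double combine to double
t3 <- typeof(c(3.5, "a"))    # double and character combine to character
t1
t2
t3

mixed <- c(TRUE, FALSE, 3.4) # TRUE is stored as 1 and FALSE as 0
mixed
```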

-   We can check the data structure of an object using commands such as
    `is.vector()`, `is.list()`, `is.matrix()`, and so on.

In [None]:
is.list(x1)
is.vector(x1)
is.list(x4)
is.vector(x4)

## Data Frames

------------------------------------------------------------------------

<span style="color: blue;">**Data frames**</span> are two-dimensional
data objects and are the **fundamental** data structure used by most of
R’s libraries of functions and datasets.

-   Tabular data is <span style="color: blue;">**tidy**</span> if each
    row corresponds to a different observation and each column
    corresponds to a different variable.

Each column of a data frame is a
<span style="color: blue;">**variable**</span> (stored as a **vector**).
If the variable:

-   Is measured or counted by a number, it is a
    <span style="color: blue;">**quantitative**</span> or
    <span style="color: blue;">**numerical**</span> variable.
-   Groups observations into different categories or rankings, it is a
    <span style="color: blue;">**qualitative**</span> or
    <span style="color: blue;">**categorical**</span> variable.

### Creating Data Frames from Scratch

------------------------------------------------------------------------

Data frames are created by passing vectors into the `data.frame`
function.

The names of the columns in the data frame are the names of the vectors
you give the `data.frame` function.

Consider the following simple example.

In [None]:
# create basic data frame
d <- c(1, 2, 3, 4)
e <- c("red", "white", "blue", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(d,e,f)
df

The columns of a data frame can be renamed using the `names` function on
the data frame.

In [None]:
# name columns of data frame
names(df) <- c("ID", "Color", "Passed")
df

The columns of a data frame can be named when you are first creating the
data frame by using `name =` for each vector of data.

In [None]:
# create data frame with better column names
df2 <- data.frame(ID = d, Color = e, Passed = f)
df2
df[1,]

## Checking Data Structure with `is.data.frame()`

------------------------------------------------------------------------

In [None]:
is.matrix(df)
is.vector(df)
is.data.frame(df)

## Extracting and Slicing Data Frames

------------------------------------------------------------------------

The column vectors of a data frame may be extracted using `$` and
specifying the name of the desired vector.

-   `df$Color` would access the `Color` column of data frame `df`.

Part of a data frame can also be extracted by thinking of it as a
general matrix and specifying the desired rows or columns in square
brackets after the object name.

-   <span style="color: blue;">**Note: R indexing starts at 1, unlike
    Python, which starts at 0.**</span>

For example, if we had a data frame named `df`:

-   `df[1,]` would access the first row of `df`.
-   `df[1:2,]` would access the first two rows of `df`.
-   `df[,2]` would access the second column of `df`.
-   `df[1:2, 2:3]` would access the information in rows 1 and 2 of
    columns 2 and 3 of `df`.

If you need to select multiple columns of a data frame by name, you can
pass a character vector with column names in the column position of
`[]`.

-   `df[, c("Color", "Passed")]` would extract the `Color` and `Passed`
    columns of `df`.
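
The extraction commands described in this section can be tried together in one cell. This sketch rebuilds the small example data frame from earlier so that it runs on its own.

```r
# Rebuild the small example data frame
df <- data.frame(ID = c(1, 2, 3, 4),
                 Color = c("red", "white", "blue", NA),
                 Passed = c(TRUE, TRUE, TRUE, FALSE))

df$Color                    # one column, returned as a vector
df[1, ]                     # first row
df[1:2, ]                   # first two rows
df[, 2]                     # second column (same as df$Color)
df[1:2, 2:3]                # rows 1 and 2 of columns 2 and 3
df[, c("Color", "Passed")]  # columns selected by name
```

Note that selecting a single column, as in `df[, 2]`, returns a vector, while selecting several columns returns a data frame.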

# Importing an External File as a Data Frame

------------------------------------------------------------------------

The `read.table` function imports data from a file into R as a data frame.

Usage: `read.table(file, header = TRUE, sep = ",")`

-   `file` is the file path and name of the file you want to import into
    R.
    -   If you don’t know the file path, setting `file = file.choose()`
        will bring up a dialog box asking you to locate the file you want
        to import.
-   `header` specifies whether the data file has a header (variable
    labels for each column of data in the first row of the data file).
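    -   If you use `header = FALSE`, R treats the first row as data and
        names the columns `V1`, `V2`, and so on. If you omit `header`, R
        infers it from the file: it assumes a header only when the first
        row has one fewer field than the number of columns.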
    -   `header = TRUE` tells R to read in the data as a data frame with
        column names taken from the first row of the data file.
-   `sep` specifies the delimiter separating elements in the file.
    -   If each column of data in the file is separated by a space, then
        use `sep = " "`
    -   If each column of data in the file is separated by a comma, then
        use `sep = ","`
    -   If each column of data in the file is separated by a tab, then
        use `sep = "\t"`.

Here is an example reading a CSV (comma-separated values) file with a header:

In [None]:
# import data as data frame
dtf <- read.table(file = "https://raw.githubusercontent.com/jfrench/DataWrangleViz/master/data/covid_dec4.csv",
                  header = TRUE,
                  sep = ",")
str(dtf)

Note that the `read_table` function in the **readr** package and the
`fread` function in the **data.table** package are perhaps better ways
of reading in tabular data and use similar syntax.
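
To see how the `header` and `sep` arguments of `read.table` affect the result without relying on an internet connection, the following sketch writes a tiny tab-separated file to a temporary location and reads it back. The file contents and the object names (`tmp`, `scores`, `no_header`) are made up for illustration.

```r
# Write a small tab-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tscore", "ann\t90", "bob\t85"), tmp)

# First row used as column names
scores <- read.table(file = tmp, header = TRUE, sep = "\t")
scores

# First row treated as data; columns get the default names V1 and V2
no_header <- read.table(file = tmp, header = FALSE, sep = "\t")
no_header

unlink(tmp)  # remove the temporary file
```

With `header = FALSE`, the column labels end up as the first row of data, which also forces the score column to be read as text rather than numbers.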

# Logical statements

------------------------------------------------------------------------

## Basic comparisons

------------------------------------------------------------------------

Sometimes we need to know if the elements of an object satisfy certain
conditions. This can be determined using the comparison operators `<`,
`<=`, `>`, `>=`, `==`, `!=`, each of which returns a logical value
(`TRUE` or `FALSE`) for every element.

-   `==` means equal to.
-   `!=` means NOT equal to.

Execute the following commands in R and see what you get.

In [None]:
a <- seq(2, 16, by = 2) # creating the vector a
a
a > 10
a <= 4
a == 10
a != 10

## And and Or statements

------------------------------------------------------------------------

More complicated logical statements can be made using `&` and `|`.

-   `&` means “and”
    -   Only `TRUE & TRUE` returns `TRUE`. Otherwise the `&` operator
        returns `FALSE`.
-   `|` means “or”
    -   Only a single value in an `|` statement needs to be true for
        `TRUE` to be returned.

Note that:

-   `TRUE & TRUE` returns `TRUE`
-   `FALSE & TRUE` returns `FALSE`
-   `FALSE & FALSE` returns `FALSE`
-   `TRUE | TRUE` returns `TRUE`
-   `FALSE | TRUE` returns `TRUE`
-   `FALSE | FALSE` returns `FALSE`

In [None]:
# relationship between logicals & (and), | (or)
TRUE & TRUE
FALSE & TRUE
FALSE & FALSE
TRUE | TRUE
FALSE | TRUE
FALSE | FALSE

Execute the following commands in R and see what you get.

In [None]:
# complex logical statements
(a > 6) & (a <= 10)
(a <= 4) | (a >= 12)
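
One common use of logical statements like these is to pick out the elements that satisfy a condition. The sketch below recreates the vector `a` so it runs on its own; the name `in_range` is hypothetical.

```r
a <- seq(2, 16, by = 2)          # the vector a from above

in_range <- (a > 6) & (a <= 10)  # one TRUE or FALSE per element of a
in_range

a[in_range]    # keep only the elements where the condition is TRUE
sum(in_range)  # TRUE counts as 1, so this counts the matches
```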