# <a name="02overview">An Overview of Exploratory Data Analysis</a>
---

<font color="dodgerblue">**Exploratory data analysis**</font>, or EDA
for short, can be thought of as a cycle:

-   Generate questions about your data.
-   Search for answers by visualizing, transforming, and modeling your
    data.
-   Use what you learn to refine your questions and/or generate new
    questions.

The main goal of EDA is to develop an understanding of your data. When
you ask a question, the question focuses your attention on a specific
part of your dataset and helps you decide which graphs, models, or
transformations to make.



# <a name="02loading">Preliminaries: Loading Packages in R</a>
---

- The `dplyr` package should already be installed in Google Colaboratory.

- Each time we start a new session and want to access the library of functions and data in the package, we need to load the library with the `library` command.

**Run the code cell below to load the `dplyr` package.**


In [3]:
library(dplyr)

# <a name="02summary">Some Useful Functions for Summarizing Data</a>
---

Here are some commands for creating commonly used tables and graphics.

-   `help(package = "dplyr")` displays an index of the documented
    functions and datasets in the package `dplyr`.
-   `data()` lists the datasets available in all currently loaded
    packages.
-   `summary(df)` gives a numerical summary of every variable in the
    data frame with generic name `df`.
-   `glimpse(df)` gives a compact overview of the data frame `df`: its
    dimensions and the name, type, and first few values of each variable.
-   `head(df)` displays the first 6 rows of the data frame.
-   `tail(df)` displays the last 6 rows of the data frame.
-   `View(df)` opens the full data frame in a separate viewer (note the
    capital `V`).
-   `table(x)` creates a frequency table for categorical variable `x`.
-   `table(x, y)` creates a contingency table for two categorical
    variables `x` and `y`.
-   `prop.table(x)` creates a joint distribution table for **table** `x`
    relative to the grand total.
    -   `prop.table(x, 1)` creates a conditional distribution table in
        which each row of table `x` sums to 1.
    -   `prop.table(x, 2)` creates a conditional distribution table in
        which each column of table `x` sums to 1.
-   `barplot(x)` creates a bar chart of data in table `x`.
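
As a quick illustration of the table functions listed above, here is a minimal sketch that uses the built-in `mtcars` dataset. The choice of `mtcars` is an assumption made only for illustration (it is not part of `dplyr` and is not used elsewhere in this notebook), and its variables `cyl` and `gear` are treated here as categorical.

```r
# Frequency table for one categorical variable (number of cylinders)
cyl.counts <- table(mtcars$cyl)
cyl.counts

# Contingency table for two categorical variables (cylinders by gears)
cyl.gear <- table(mtcars$cyl, mtcars$gear)
cyl.gear

# Joint distribution: every cell divided by the grand total
joint <- prop.table(cyl.gear)

# Conditional distributions: rows (margin 1) or columns (margin 2) sum to 1
by.row <- prop.table(cyl.gear, 1)
by.col <- prop.table(cyl.gear, 2)

rowSums(by.row)  # every row total is 1
colSums(by.col)  # every column total is 1
```

Checking the row and column totals is a simple way to confirm which margin was used.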



## <a name="02ques1">Question 1</a>
---

The package `dplyr` contains several datasets, one of which is `storms`.
How many observations are in `storms`? How many variables? Which
variables are numerical and which are categorical? Enter R code in the
blank cell below and then type your answer in the space below.


In [None]:
# Enter R command(s) to answer the questions above:


### <a name="02sol1">Solution to Question 1</a>
---

<br> <br> <br> <br>



# Displaying Categorical Variables: Bar Plots

---

## Question 2

---

Let’s explore the following question:

> Which storm types occurred most frequently over the period from 1975
> to 2020?

### Question 2a

---

Create a frequency table to identify how many storm observations there
are in each `status` category.

-   Recall you can enter `?plot`, `?barplot`, `?table`, etc. to open the
    help documentation for a command, which often summarizes the
    available options.

#### Solution to Question 2a

---

  
  

### Question 2b

---

Create a bar graph to visually present the table.

-   Optionally, try adding a title, axis labels, color, a legend, etc.
    if they are helpful.
-   [See Overview of Plots Help
    Document](https://htmlpreview.github.io/?https://github.com/aspiegler/Statistical-Theory/blob/main/Overview-of-Plots.html)
    for examples with code of displaying different types of variables.

#### Solution to Question 2b

---

  
  

# Displaying Numerical Variables

---

We consider the variable `status` in the dataset `storms` to be a
**categorical variable**, so we can **count** how many observations, or
what **proportion** of the observations, fall into each classification.

-   There is no natural notion of the **average value** of the
    categorical variable `status`.
-   The types of analysis and visualization we can use depend on what
    type of data we have.
-   We will revisit categorical data shortly.
-   For now we turn our attention to summarizing and presenting
    **numerical variables**.

## Histograms with `hist()`

---

A <font color="dodgerblue">**histogram**</font> is a special bar chart
we use to display the distribution of values for a numerical variable.

-   Values of the numerical variable are measured on the horizontal
    axis.
-   The height of each bar gives the total number of observations in the
    dataset (called the <font color="dodgerblue">**frequency**</font>)
    in the specified <font color="dodgerblue">**bin range**</font>.
-   There are no gaps between bars. Empty space means no values are in
    that bin range.
-   The R function `hist(x, [options])` creates a histogram.
-   There are lots of ways to customize options for your plots. Run
    `?hist` for more info.
-   If you like using colors, [here’s a guide to colors in
    R](https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html).

In [None]:
hist(storms$wind, 
     breaks = 15,
     main = "Distribution of Windspeed from 1975-2020",
     xlab="Wind speed (in knots)",
     xlim = c(0, 160), 
     ylim = c(0,2500), 
     col = "steelblue")
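
To see the frequencies behind the bars, the sketch below computes a histogram without drawing it. The vector `x` holds made-up values chosen only for illustration. By default each bin includes its right endpoint, so the value 4 is counted in the bin from 2 to 4.

```r
# Ten made-up values (illustrative only)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# plot = FALSE returns the bins and counts instead of drawing the plot
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

h$breaks  # bin boundaries: 0 2 4 6 8 10
h$counts  # frequency in each bin: 3 5 1 0 1
```

The empty bin from 6 to 8 is the kind of gap described above: empty space means no values fall in that bin range.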

## Question 3

---

How would you describe the shape of the distribution of wind speed shown
in the histogram above?

### Solution to Question 3

---

  
  
  

## Question 4

---

Create a histogram to display the variable `month`. What does the shape
of that graph tell you?

### Solution to Question 4

---

  
  
  

## Question 5

---

Create a histogram to display the variable `long`. What does the shape
of that graph tell you?

### Solution to Question 5

---

  
  
  

# The Shape of Data

---

In [None]:
par(mfrow = c(1, 3))  # Create a 1 x 3 array of plots

# The next 3 plots created will be arranged in one row

hist(storms$wind, xlab = "wind speed (in knots)",   # x-axis label
     ylab = "Frequency",  # y-axis label
#     main = "Distribution of Storm Wind Speed 1975-2020",  # main label
     col = "steelblue")  # change color of bars

hist(storms$month, 
     breaks = 12, 
     xlab="Month",
     xlim = c(1, 12), 
     ylim = c(0,4500), 
     col = "coral1",
#     main = "Distribution of Storms by Month",
     xaxt='n')
axis(1, at=seq(1, 12, 1), pos=0)

hist(storms$long, 
     breaks = 15, 
     xlab="Degrees of Longitude",
     xlim = c(-120, 0), 
     ylim = c(0,1000), 
     col = "aquamarine4",
#     main = "Distribution of Storms by Longitude",
     xaxt='n')
axis(1, at=seq(-120, 0, 10), pos=0)

In [None]:
par(mfrow = c(1, 1))  # reset so one plot per figure

-   The distribution of wind speeds is <span
    style="color: blue;">**skewed right**</span>.
-   The distribution of months is <font color="dodgerblue">**skewed
    left**</font>.
-   The distribution of longitude is approximately <span
    style="color: blue;">**symmetric**</span>.

# Measurements of Center

---

Typical measurements of center are:

-   The <font color="dodgerblue">**mean**</font> is the average.
    -   Use the command `mean(x)`.
    -   We use $\color{dodgerblue}{\mathbf{\bar{x}}}$ (pronounced x-bar) to
        denote a <font color="dodgerblue">**sample**</font> mean.
    -   We use $\color{dodgerblue}{\mathbf{\mu}}$ (Greek letter mu) to denote
        a <font color="dodgerblue">**population**</font> mean.
-   The <font color="dodgerblue">**median**</font> is the
    $50^{\mbox{th}}$ percentile. About 50% of the values in the dataset
    are less than or equal to the median.
    -   Use the command `median(x)`.

## Question 6

---

Compute the mean wind speed of all storms and the median wind speed of
all storms. Interpret in practical terms what each tells us.

### Solution to Question 6

---

  
  
  

## Question 7

---

Why do you think the mean wind speed is greater than the median wind
speed of all storms?

### Solution to Question 7

---

  
  
  

## Relation of Shape to Measurements of Center

---

<figure>
<img
src="https://lh6.googleusercontent.com/ndXutxHp17jiMQ8ee8YI_wTfqKwaK94xnGYRqnw5W9ZADDPTyuQ7Wirv_4tIbKzmZmM=w2400"
width="600" alt="Symmetric Distributions" />
<figcaption aria-hidden="true">Symmetric Distributions</figcaption>
</figure>

<figure>
<img
src="https://lh5.googleusercontent.com/P72V9y4FyHTKPEHufXCE_jygITZnOvc5WDhAi9Dd05BZ1qQ0jSYISY7gQqsvacPZ6JU=w2400"
width="600" alt="Skewed Distributions" />
<figcaption aria-hidden="true">Skewed Distributions</figcaption>
</figure>

-   If the shape of the histogram is <span
    style="color: blue;">**symmetric**</span>, then the <span
    style="color: blue;">**mean is approximately equal to the
    median**</span>.
-   If the shape of a histogram is <span style="color: red;">**skewed to
    the left**</span>, the <span style="color: red;">**mean is typically
    less than the median**</span>.
-   If the shape of a histogram is <span style="color: green;">**skewed
    to the right**</span>, the <span style="color: green;">**mean is
    typically greater than the median**</span>.
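
The relationships above can be checked on small made-up vectors. This is only a sketch: the three vectors are hypothetical and chosen so that the skew is easy to see.

```r
right.skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)  # long tail to the right
left.skewed  <- 21 - right.skewed            # mirror image: long tail to the left
symmetric    <- c(1, 2, 3, 4, 5)

mean(right.skewed); median(right.skewed)  # 4.75 and 3: mean > median
mean(left.skewed);  median(left.skewed)   # 16.25 and 18: mean < median
mean(symmetric);    median(symmetric)     # 3 and 3: mean = median
```

A single extreme value pulls the mean toward the tail, while the median barely moves.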

# Filtering and Subsetting Data

---

We have seen that the most frequent month is August, followed by July as
the second most frequent month. How can we compare the strength of
storms that occur in July to those that occur in August?

Here are different methods for filtering out a subset of all
observations based on some additional condition(s).

## Using `filter` in `dplyr`

---

Using the `filter` function in `dplyr`, we can keep just the July
observations:

In [None]:
july <- filter(storms, month == 7)  # month is numeric; filter requires the dplyr package

## Using `subset` in `base` R

---

Using the `subset` function in base R, we can perform the same
operation:

In [None]:
# keeps all variables, same as filter above
july <- subset(storms, month == 7)

# keeps only the wind speed variable for July; the result is still a data frame
july.wind <- subset(storms, month == 7, select = wind)

# drop = TRUE drops the data frame structure and returns the wind speeds as a vector
july.wind.vec <- subset(storms, month == 7, select = wind, drop = TRUE)

## Using Logical Statements

---

We can also extract rows and columns using **logical statements** inside
square brackets.

-   `storms[storms$month == 7, ]` extracts just the rows that have a
    `month` value equal to 7.
-   If we store those rows as `july.logic`, then
    `july.logic[ , c("wind")]` keeps just the wind speed column from
    `july.logic`.
-   We could do this in one step with
    `july.logic.wind <- storms[storms$month == 7, c("wind")]`.
-   This method requires more proficiency, which is why the functions
    `filter` and `subset` are nice!

In [None]:
# pull all rows from storms that have a month value of 7
july.logic <- storms[storms$month == 7, ]

# keep just the wind speed column of the July rows
july.logic.wind <- july.logic[ , c("wind")]
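
One practical difference between the methods shows up when the condition involves missing values. The sketch below uses a made-up data frame `toy` (hypothetical, for illustration only) in which one `month` value is `NA`. `subset` drops rows where the condition is `NA`, while logical indexing returns them as rows of `NA`.

```r
# Toy data frame (made up for illustration); one month value is missing
toy <- data.frame(month = c(7, 8, NA, 7),
                  wind  = c(30, 45, 50, 60))

by.subset <- subset(toy, month == 7)  # rows where the condition is NA are dropped
by.logic  <- toy[toy$month == 7, ]    # rows where the condition is NA come back as NA rows

nrow(by.subset)  # 2
nrow(by.logic)   # 3

# Wrapping the condition in which() keeps only the rows where it is TRUE
by.which <- toy[which(toy$month == 7), ]
nrow(by.which)   # 2
```

When the column has no missing values, `subset` and logical indexing select the same rows.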

## Question 8

---

Compute the mean and median wind speed of all storms in July. Compare
the values of the mean and median. What does this tell us about the
shape of the data?

### Solution to Question 8

---

  
  
  

## Question 9

---

In which month are the storms more severe? What statistics did you use
to draw your conclusion?

### Solution to Question 9

---

  
  
  

# Measurements of Spread

---

Typical measurements of spread are:

-   The <font color="dodgerblue">**range**</font>
    $= \mbox{max} - \mbox{min}$.
-   The <font color="dodgerblue">**standard deviation**</font>
    roughly measures the typical distance of each value from the
    mean value.
    -   For a sample,
        $\displaystyle s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$.
    -   The command `sd(x)` computes the sample standard deviation in R.
    -   We use $\color{dodgerblue}{\mathbf{s}}$ to denote a <span
        style="color: blue;">**sample**</span> standard deviation.
    -   We use $\color{tomato}{\mathbf{\sigma}}$ (Greek letter sigma) to
        denote a <span style="color: red;">**population**</span>
        standard deviation.
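
As a worked check of the sample formula, the sketch below computes the standard deviation step by step and compares it with `sd`. The vector `x` is made up for illustration. Note that `range(x)` returns the minimum and maximum as a pair, so the range as defined above is their difference.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values
n <- length(x)
xbar <- mean(x)                  # sample mean, equal to 5 here

# Sum of squared deviations divided by n - 1, then the square root
s.manual <- sqrt(sum((x - xbar)^2) / (n - 1))

s.manual  # sqrt(32 / 7), about 2.14
sd(x)     # same value

range(x)        # 2 9 (min and max)
diff(range(x))  # 7, the range as max - min
```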

<figure>
<img
src="https://lh5.googleusercontent.com/xZmodsIk96IYr3fpnM3MDx_mChwucKeAY6ZmKu_M7UDVL4J6Y1z4340ugDU2TTd3st4=w2400"
width="600" alt="Comparing Standard Deviations of Distributions" />
<figcaption aria-hidden="true">Comparing Standard Deviations of
Distributions</figcaption>
</figure>

## Question 10

---

Which of the histograms (i)-(vi) has the largest range? The smallest
range?

### Solution to Question 10

---

  
  
  

## Question 11

---

Which of the histograms (i)-(vi) has the largest standard deviation? The
smallest standard deviation?

### Solution to Question 11

---

  
  
  

# Quantiles

---

-   The $25^{\mbox{th}}$ percentile, or <font color="dodgerblue">**first
    quartile**</font>, is denoted $\color{dodgerblue}{\mathbf{Q_1}}$. Use
    `quantile(x, probs = 0.25)`.
-   The $75^{\mbox{th}}$ percentile, or <font color="dodgerblue">**third
    quartile**</font>, is denoted $\color{dodgerblue}{\mathbf{Q_3}}$. Use
    `quantile(x, probs = 0.75)`.
-   The <font color="dodgerblue">**Interquartile Range
    (IQR)**</font> $\color{dodgerblue}{=Q_3-Q_1}$. Use `IQR(x)`.
-   The <font color="dodgerblue">**five number summary**</font> can
    also provide a good description of the spread of the values since
    about 25% of the values in a dataset fall between each consecutive
    pair of values.
    $$\color{dodgerblue}{(\mbox{min}, Q_1 , \mbox{median}, Q_3, \mbox{max} )}$$
-   Use `summary(x)` to compute the five number summary (along with the
    mean) in R. Note `x` can be a vector or a data frame.
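
Here is a small sketch of these commands on the made-up vector `1:9`, chosen so the quartiles are easy to verify by eye. The base R function `fivenum`, which is not mentioned above, is included as one more way to get the five values; on other data its quartiles can differ slightly from those of `quantile`, because the two use different conventions.

```r
x <- 1:9  # made-up values: 1, 2, ..., 9

q1 <- quantile(x, probs = 0.25)  # first quartile: 3
q3 <- quantile(x, probs = 0.75)  # third quartile: 7

IQR(x)      # 4, equal to q3 - q1
fivenum(x)  # 1 3 5 7 9
summary(x)  # min, Q1, median, mean, Q3, max
```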

## Question 12

---

Give the five number summary for the wind speed of all storms in July.

### Solution to Question 12

---

  
  

# Boxplots and Five Number Summaries

---

The five number summary for August wind speeds is
$(10, 30, 45, 65, 150)$. Below is a <span
style="color: blue;">**boxplot**</span> for this data.

In [None]:
aug <- subset(storms, month == 8)  # keep only the August observations
summary(aug$wind)  # five number summary (plus the mean)
boxplot(aug$wind, 
        main = "August Wind Speeds", 
        xlab = "Wind Speed (in knots)",
        horizontal = TRUE)

## Question 13

---

Create a boxplot to illustrate the distribution of wind speeds of July
storms.

### Solution to Question 13

---

  
  

## Question 14

---

Create side-by-side boxplots to compare the distribution of wind
speeds between July and August.

### Solution to Question 14

---

  
  

## How to Read and Create Boxplots

---

To create a boxplot:

-   Find the values of $Q_1$, median, and $Q_3$.
-   Draw a box with edges at $Q_1$ and $Q_3$ and a line inside the
    box at the median.
-   Identify the upper and lower fence:
    -   Upper fence $=Q_3 + 1.5(\mbox{IQR})$.
    -   Lower fence $=Q_1 - 1.5(\mbox{IQR})$.
-   Extend whiskers from the lower edge of the box to the smallest
    observation that is not below the lower fence, and from the upper
    edge to the largest observation that is not above the upper fence.
-   The observations that are less than the lower fence or greater than
    the upper fence are considered <span
    style="color: blue;">**outliers**</span>. These values are marked by
    individual points.
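
The steps above can be sketched in code on a made-up vector with one unusually large value. The data are hypothetical. Note also that R's own `boxplot` computes its box edges with a slightly different quartile convention (hinges), so its fences can differ a little from the ones computed here, although both flag the same outlier in this example.

```r
x <- c(10, 12, 13, 15, 16, 18, 20, 22, 24, 60)  # made-up values

q1  <- unname(quantile(x, probs = 0.25))  # 13.5
q3  <- unname(quantile(x, probs = 0.75))  # 21.5
iqr <- q3 - q1                            # 8

upper.fence <- q3 + 1.5 * iqr  # 33.5
lower.fence <- q1 - 1.5 * iqr  # 1.5

# Outliers lie beyond the fences
outliers <- x[x < lower.fence | x > upper.fence]
outliers  # 60

# Whiskers end at the most extreme observations inside the fences
whisker.low  <- min(x[x >= lower.fence])  # 10
whisker.high <- max(x[x <= upper.fence])  # 24

boxplot.stats(x)$out  # outliers according to R's boxplot rule: 60
```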

# Appendix: Assignment of Objects

---

To store a data structure in the computer’s memory we must assign it a
name.

Data structures can be stored using the assignment operator `<-` or `=`.

Some comments:

-   In general, both `<-` and `=` *can* be used for assignment.
-   `<-` and `=` can be used identically most of the time, but not
    always.
-   It’s safer and more conventional to use `<-` for assignment.
-   In editors that support this shortcut, such as RStudio, pressing the
    “Alt” and “-” keys simultaneously on a PC or Linux machine (Option
    and “-” on a Mac) inserts `<-` into the console and script files.
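
One place where `<-` and `=` behave differently is inside a function call. The sketch below is a minimal illustration using `round`; the object name `k` is hypothetical.

```r
# Inside a function call, `=` names an argument; no object called digits is created
round(3.14159, digits = 2)
exists("digits", inherits = FALSE)  # FALSE unless you created digits yourself earlier

# `<-` inside a call first creates the object k, then passes its value along
round(3.14159, k <- 2)
k  # 2
```

Both calls return the same rounded value, but only the second leaves a new object behind, which is one reason the two operators are not interchangeable everywhere.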

## Why Can’t I See the Output?

---

In the following code, we compute the mean of a vector. **Why can’t we
see the result after running it**?

In [None]:
w <- storms$wind  # wind is now stored in w
xbar.w <- mean(w)  # compute mean wind speed and assign to xbar.w

In the code cell above, each result was assigned to an object rather
than printed. Assignment stores the output silently so that we can refer
to it later.

## Printing Output to Screen

---

Once an object has been assigned a name, it can be printed by entering
the name of the object or by using the `print` function.

In [None]:
xbar.w  # print the mean wind speed to screen
print(xbar.w)  # print a different way

## Assigning and Printing An Object At Once

---

A nice way to execute, store, and print the output of a command all at
once is to wrap the command in parentheses `( )`.

In [None]:
(sd.w <- sd(w))  # using ( ) around a command will execute, store and print output

### Sometimes you want to see the result of a code cell, and sometimes you will not.