Click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap1//03-EDA-Quantitative.ipynb) to open an interactive version of the full text section.


For a shorter [in-class lab version of the section, part 1, click here.](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Labs/03-EDA-LabA-Quantitative.ipynb)

For a shorter [in-class lab version of the section, part 2, click here.](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Labs/03-EDA-LabB-Quantitative.ipynb)

# <a name="03intro">1.3: Exploring Quantitative Data</a>

---


Additional Reading:

-   See [Overview of Plotting Data in R](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Appendix/Overview-of-Plots.ipynb) for further reading and examples about plotting in R.
-   See [Fundamentals of Working with Data](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Appendix/Intro-to-Vectors-Dataframes.ipynb) for more information about data types and structures in R.
-   The [R Graph Gallery](https://r-graph-gallery.com/) has examples of many other types of graphs.


# <a name="03data">Getting to Know Our Data</a>

---

The `dplyr` package includes the `storms` data set from the [NOAA Hurricane Best
Track Data](https://www.nhc.noaa.gov/data/#hurdat), which records
the following attributes of tracked North Atlantic storms since 1975:

-   Storm name: `name`
-   Date and time: `year`, `month`, `day`, and `hour`
-   Storm position: `lat` and `long`
-   Storm classification: `status`
-   Category of hurricane: `category` (non-hurricanes are `NA`)
-   Wind speed (in knots): `wind`
-   Pressure (in millibars): `pressure`
-   Tropical storm force diameter (in nautical miles): `tropicalstorm_force_diameter`
-   Hurricane force diameter (in nautical miles): `hurricane_force_diameter`

See [Exploring Categorical Data](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap1/02-EDA-Categorical.ipynb) for a refresher on our initial exploration with the `storms` data frame.



## <a name="03load">Loading Required Package</a>

---

In order to access the `storms` data frame in the `dplyr` package, we first load the package with the `library()` function.

In [None]:
library(dplyr)  # load dplyr package

In [None]:
# enter your comments after each #
storms$year <- as.integer(storms$year)  #
storms$month <- as.integer(storms$month)  #
storms$hour <- as.integer(storms$hour)  #
storms$category <- factor(storms$category)  #

In [None]:
# view the resulting data structure
str(storms)

# <a name="03quartile">Quartiles</a>

---

-   The $25^{\mbox{th}}$ percentile is called the
     <font color="dodgerblue">**first quartile** </font> and is
    denoted $\color{dodgerblue}{\mathbf{Q_1}}$.
    - In R, use the function `quantile(x, probs = 0.25)`.
-   The $75^{\mbox{th}}$ percentile is called the
     <font color="dodgerblue">**third quartile** </font> and is
    denoted $\color{dodgerblue}{\mathbf{Q_3}}$.
    - In R, use the function `quantile(x, probs = 0.75)`.
-   The  <font color="dodgerblue">**Interquartile Range
    (IQR)** </font>$\color{dodgerblue}{=Q_3-Q_1}$.
    - In R, use the function `IQR(x)`.
-   The  <font color="dodgerblue">**five number summary** </font>
    can also provide a good description of the spread of the values
    since we know <font color="dodgerblue">**roughly 25% of the values fall between each consecutive pair of values**</font>.
    $$\color{dodgerblue}{(\mbox{min}, Q_1 , \mbox{median}, Q_3, \mbox{max} )}$$
    -   In R, use the function `fivenum(x)` to compute the five number summary.
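
As a minimal sketch of the functions listed above, the cell below applies them to a small made-up vector `x` (an assumption for illustration, not part of the `storms` data). Note that `fivenum()` and `quantile()` use slightly different rules for locating quartiles, so their results can differ on other data even though they agree here.

```r
# small made-up vector (hypothetical values, not from storms)
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

q1 <- quantile(x, probs = 0.25)  # first quartile
q3 <- quantile(x, probs = 0.75)  # third quartile
iqr <- IQR(x)  # interquartile range, Q3 - Q1
five <- fivenum(x)  # (min, Q1, median, Q3, max)

q1
q3
iqr
five
```

For this vector, $Q_1 = 4$, $Q_3 = 10$, and the IQR is $10 - 4 = 6$.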



## <a name="03q14">Question 14</a>

---

Give the five number summary for the wind speed of all observations in the `storms` data set.

### <a name="03sol14">Solution to Question 14</a>

---

<br> <br> <br>
  


## <a name="03five-num">Five Number Summaries and Boxplots</a>

---

The five number summary for wind speeds is $(10, 30, 45, 65, 165)$.
Below is a  <font color="dodgerblue">**boxplot** </font> for this
data.

- 25% of the wind speeds are between 10 and 30 knots.
- 25% of the wind speeds are between 30 and 45 knots.
- 25% of the wind speeds are between 45 and 65 knots.
- 25% of the wind speeds are between 65 and 165 knots.


In [None]:
boxplot(storms$wind,  # data to plot
        main = "Wind Speeds of Storms",  # main title
        xlab = "Wind Speed (in knots)",  # x-axis label
        xaxt='n',  # turn off default ticks on x-axis
        cex.lab=1.75, cex.axis=1.75, cex.main=1.75,  # increase font size
        horizontal = TRUE)  # align horizontally
axis(1, at = fivenum(storms$wind))  # add tickmarks at five number summary

## <a name="03read-boxplot">How to Read and Create Boxplots</a>

---

To create a boxplot:

-   Find the values of $Q_1$, median, and $Q_3$.
-   Draw a box with edges at $Q_1$ and $Q_3$ and a line inside the box for the median.
-   Identify the upper and lower fence to classify outliers:
    -   Upper fence $=Q_3 + 1.5(\mbox{IQR})$.
    -   Lower fence $=Q_1 - 1.5(\mbox{IQR})$.
-   Extend a line (whisker) from the lower edge of the box to the smallest
    value that is greater than the lower fence.
-   Extend a line (whisker) from the upper edge of the box to the largest
    value that is less than the upper fence.
-   The observations that are less than the lower fence or greater than
    the upper fence are considered
     <font color="dodgerblue">**outliers** </font>.
     -  Outlier values are marked with individual points.
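
The steps above can be sketched on a small made-up vector `y` (hypothetical values chosen so that one observation is unusually large). This is only an illustration of the fence arithmetic. R's `boxplot()` locates the box edges with the hinges from `fivenum()`, which can differ slightly from `quantile()` on other data.

```r
# small made-up vector with one unusually large value (hypothetical data)
y <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)

q1 <- unname(quantile(y, probs = 0.25))  # first quartile
q3 <- unname(quantile(y, probs = 0.75))  # third quartile
iqr <- q3 - q1  # interquartile range

lower.fence <- q1 - 1.5 * iqr
upper.fence <- q3 + 1.5 * iqr

# observations beyond either fence are flagged as outliers
outliers <- y[y < lower.fence | y > upper.fence]

# whiskers end at the most extreme observations that are not outliers
whiskers <- range(y[y >= lower.fence & y <= upper.fence])

c(lower = lower.fence, upper = upper.fence)
outliers
whiskers
```

Here $Q_1 = 4$ and $Q_3 = 10$, so the fences are $4 - 1.5(6) = -5$ and $10 + 1.5(6) = 19$. The value 30 lies above the upper fence and is the only outlier.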



## <a name="03q15">Question 15</a>

---

Compute the upper and lower fences for the wind speed observations in
`storms`.

### <a name="03sol15">Solution to Question 15</a>

---

<br> <br> <br>
  



# <a name="03ecdf">The Empirical Cumulative Distribution Function (ecdf)</a>

---

A question we often wish to explore is: what proportion of values in our data are less than or equal to a specified value $x$? To answer this question, we count the total number of observations in our data that are less than or equal to $x$, and then divide by the total number of observations in our data.



## <a name="03counting">Counting Observations with Logical Statements</a>

---

To illustrate how we can count observations that satisfy a given condition, consider a vector of 5 values: $31$, $33$, $34$, $36$, and $38$. We store these values in the vector named `test.data` below. The command `test.data <= 35` applies a logical test to each of the 5 values in the vector:

> Is the value less than or equal to 35?

Run the code cell below and check the output to verify the test works as expected.

In [None]:
test.data <- c(31, 33, 34, 36, 38)  # vector of test data
test.data <= 35  # logical test

-   The result `TRUE` is counted as 1.
-   The result `FALSE` is counted as 0.
-   We can use the `sum()` function to count how many `TRUE` results we have.
-   Running the code cell below, we verify that 3 values in `test.data` are less than or equal to 35.

In [None]:
sum(test.data <= 35)  # sum the TRUE results

We can convert the count to a proportion by dividing by the total number of values in our data. Our vector `test.data` has a total of 5 observations; therefore, the proportion of values that are less than or equal to 35 is 3 out of 5, or $0.6$. Since `TRUE` counts as 1 and `FALSE` counts as 0, the `mean()` function sums the `TRUE` results and divides by the total number of observations in one command, which simplifies the code.

In [None]:
mean(test.data <= 35)  # total values <= 35 divided by total number of values

## <a name="03q16">Question 16</a>

---

What proportion of observations in `storms$wind` have a wind speed less
than or equal to 50 knots?

### <a name="03sol16">Solution to Question 16</a>

---

<br>

In [None]:
# what proportion of observations have wind less than or equal to 50


## <a name="03formula-ecdf">What is the Empirical Cumulative Distribution Function?</a>

---

The  <font color="dodgerblue">**empirical cumulative distribution function (ecdf)** </font> is typically denoted by the notation $\mathbf{\color{dodgerblue}{\widehat{F}(x)}}$. We read the notation $\hat{F}$ as **F hat**, and we will make use of the hat notation throughout the semester.

-   The input $x$ is a value.
-   The output $\widehat{F}(x)$ of the ecdf is the proportion of values in the sample that are less than or equal to $x$.

Recall the vector `test.data` contains the values $31$, $33$, $34$, $36$, and $38$. We can express the ecdf as a piecewise function.

$$
\widehat{F}(x) = \left\{
\begin{array}{ll}
0  & x < 31 \\
0.2 &  31 \leq x < 33 \\
0.4 &  33 \leq x < 34 \\
0.6 &  34 \leq x < 36 \\
0.8 &  36 \leq x < 38 \\
1 & x \geq 38
\end{array} \right.
$$
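
One way to check a piecewise formula like this is with the base R function `ecdf()`, which returns the ecdf as a function that can be evaluated at any input. The sketch below redefines `test.data` so the cell runs on its own. The name `Fhat` is just a label chosen for this illustration.

```r
test.data <- c(31, 33, 34, 36, 38)  # same five values as above
Fhat <- ecdf(test.data)  # ecdf() returns a function

Fhat(32)  # proportion of values <= 32
Fhat(35)  # proportion of values <= 35
Fhat(c(33, 36))  # the function also accepts a vector of inputs
```

The outputs $0.2$, $0.6$, and $(0.4, 0.8)$ agree with the corresponding pieces of the formula.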



## <a name="03graph-ecdf">Graphing the Empirical Cumulative Distribution Function</a>

---

We can plot the ecdf using the `plot.ecdf()` function in R. The resulting plot is a piecewise step function.

In [None]:
plot.ecdf(test.data, col = "steelblue",  # data to plot and line color
          cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5)  # increase font size

## <a name="03q17">Question 17</a>

---

Complete the statements below to identify some key properties of ecdf’s.

### <a name="03sol17">Solution to Question 17</a>

---

-   The minimum output value of an ecdf is <mark>??</mark>.
-   The maximum output value of an ecdf is <mark>??</mark>.
-   The ecdf is a <mark>??</mark> function since as $x$ increases, $\widehat{F}(x)$
    cannot decrease.

<br>

## <a name="03q18">Question 18</a>

---

Plot the empirical cumulative distribution function for the wind speeds
in the `storms` data set and check your answer to [Question
16](#03q16).

### <a name="03sol18">Solution to Question 18</a>

---

<br>

In [None]:
# plot the ecdf for wind speeds in storms

# <a name="03compare">Comparing Quantitative and Categorical Data</a>

---

We have explored some of the categorical variables in the `storms` data set in our work with [Exploring Categorical Data](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap1/02-EDA-Categorical.ipynb). We have discussed how we can summarize and plot a quantitative variable. Often in statistics we would like to compare the distribution of a quantitative variable for different classes of a categorical variable. For example, we may be interested in investigating the following:

> In which month do storms have the greatest wind speed?

We first check the data type of the month variable in `storms` using the `typeof()` function.

In [None]:
typeof(storms$month)  # check how month is stored

## <a name="03factor-month">Converting a Quantitative Variable to a Categorical Variable with `factor()`</a>

---

Months were initially stored as decimals. We converted `month` to an integer earlier, and we can see `month` is still stored as an integer. Let’s convert `month` to a `factor` so R will treat each month as a separate class.

In [None]:
storms$month <- factor(storms$month)  # convert month to a categorical variable
summary(storms$month)  # check summary output after converting to factor
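
As a small illustration of what `factor()` does, the sketch below converts a made-up vector of month numbers `m` (hypothetical values, not taken from `storms`). The classes of a factor are stored as character labels, so a class can be matched by comparing against its label in quotes.

```r
m <- c(8, 9, 8, 7, 9, 8)  # hypothetical month numbers
m.factor <- factor(m)  # convert to a categorical variable

levels(m.factor)  # the classes, stored as character labels
table(m.factor)  # number of observations in each class
sum(m.factor == "8")  # count the observations in class "8"
```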

## <a name="03side-by-side">Side by Side Boxplots with `plot()`</a>

---

The `plot()` function creates different types of plots depending on the data type and number of variables we enter.

-   If `x` is quantitative, `plot(x)` creates an index plot which is generally not too useful.
-   If `x` is categorical, `plot(x)` creates a bar chart.

In [None]:
par(mfrow = c(1,2))  # create a 1 by 2 array of plots
plot(storms$month)  # bar chart is created for categorical data
plot(storms$wind)  # index plot is created for quantitative data

-   If `x` is categorical and `y` is quantitative,
    `plot(y ~ x, data = [name])` creates side by side boxplots, one for
    each class of `x`.
-   If both `x` and `y` are quantitative variables,
    `plot(y ~ x, data = [name])` creates a scatterplot.

In [None]:
par(mfrow = c(1,2))  # create a 1 by 2 array of plots
plot(wind ~ month, data = storms)  # side by side boxplots
plot(wind ~ pressure, data = storms)  # scatterplot

The side by side boxplots created above are hard to read since we have
12 boxplots in total. The two months with the most storm observations are
August and September.

> How can we compare storms only in August and September?

## <a name="03subset-filter">Subsetting and Filtering Data</a>

---

We can compare data for only August and September using various methods. One common method is to subset all of the data in `storms` into two separate data frames, one for each month. Below are three different ways we can subset data:

- Using the [`subset()`](#03subset) function in base R.
- Using the [`filter()`](#03filter) function in the `dplyr` package.
- Using [logical statements](#03logic).

Other methods exist as well.



### <a name="03subset">The `subset()` Function in Base R</a>

---

As the name implies, the `subset()` function in base R is a really useful function for subsetting! We can open the help documentation with `?subset` to learn how to apply this function. Below are some examples of different ways we may want to subset the `storms` data to analyze storms that occurred in August.

In [None]:
# keeps all variables for storms in August
aug <- subset(storms, month == "8")

# keeps only the wind speed variable for August storms
aug.wind <- subset(storms, month == "8", select = wind)

# drop = TRUE drops the column name and creates a vector instead of a data frame
aug.wind.vec <- subset(storms, month == "8", select = wind, drop = TRUE)

In [None]:
# we can see all variables are selected
head(aug)

In [None]:
# just the wind variable is selected
head(aug.wind)

In [None]:
# wind speeds in august stored in a vector
head(aug.wind.vec)

## <a name="03q19">Question 19</a>

---

Compute the mean and median wind speed of storms in August. Compare the values of the mean and median. What does this tell us about the shape of the data?

### <a name="03sol19">Solution to Question 19</a>

---

<br> <br> <br>
  



### <a name="03filter">The `filter()` Function in `dplyr`</a>

---

Using the `filter()` function in the `dplyr` package, we can keep just
the August observations.

-   Note you need to load the `dplyr` package with `library()` in
    order to use `filter()`.
-   We have already loaded `dplyr` since that is where the `storms` data
    is found.
-   The command below gives the same result as
    `subset(storms, month == "8")`.

In [None]:
aug2 <- filter(storms, month == "8")  # filter requires dplyr package
head(aug2)  # selects all variables

### <a name="03logic">Using Logical Statements</a>

---

When writing more complex code such as for loops, it is often useful to subset data using logical statements. For example, `storms[storms$month == "8", ]` extracts just the rows that have a `month` value equal to 8.

In [None]:
# extract rows from storms with month equal to 8
aug.logic <- storms[storms$month == "8", ]
head(aug.logic)
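
To see that these approaches select the same rows, here is a minimal sketch on a made-up data frame `toy` (a hypothetical stand-in for `storms`). It compares only the two base R methods so that it runs without loading any packages. One difference worth knowing: `subset()` drops rows where the condition is `NA`, while a logical statement inside `[ ]` returns a row of `NA` values for them.

```r
# hypothetical toy data frame standing in for storms
toy <- data.frame(month = factor(c(7, 8, 8, 9)),
                  wind = c(30, 45, 60, 50))

by.subset <- subset(toy, month == "8")  # base R subset()
by.logic <- toy[toy$month == "8", ]  # logical statement

by.subset$wind  # wind speeds selected by subset()
by.logic$wind  # wind speeds selected by the logical statement
nrow(by.subset) == nrow(by.logic)  # same number of rows selected
```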

## <a name="03q20">Question 20</a>

---

Using one of the methods above, create a data frame named `sept` that
contains all variables for only the observations that occurred in
September.

### <a name="03sol20">Solution to Question 20</a>

---

<br>

In [None]:
# keeps all variables for storms in September


## <a name="03side-boxplot">Creating Side by Side Boxplots with `boxplot`</a>

---

Once we have created the data frames `aug` and `sept`, we can create side by side boxplots to compare the wind speeds for storms in these two months.

In [None]:
# need to answer previous question first
boxplot(aug$wind, sept$wind,  # enter two vectors of data
        main = "Comparing Wind Speeds in Aug. and Sept.",   # main title
        xlab = "Wind Speed (in knots)",  # x-axis label
        horizontal = TRUE,  # align boxplots horizontally
        names = c("August", "September"),  # label each boxplot
        cex.lab=1.5, cex.axis=1.5, cex.main=1.5,  # increase font size
        col = c("seagreen", "steelblue"))  # fill color for box

## <a name="03q21">Question 21</a>

---

In which month (August or September) are the wind speeds of storms more
severe? What statistics did you use to draw your conclusion?

### <a name="03sol21">Solution to Question 21</a>

---

<br> <br> <br>  
  



## <a name="03q22">Question 22</a>

---

Create side by side boxplots to compare the distribution of wind speeds
in July, August and September.

### <a name="03sol22">Solution to Question 22</a>

---

<br> <br> <br>  
  



# <a name="CC License">Creative Commons License Information</a>
---


![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

*Statistical Methods: Exploring the Uncertain* by [Adam
Spiegler (University of Colorado Denver)](https://github.com/CU-Denver-MathStats-OER/Statistical-Theory)
is licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/). This work is funded by an [Institutional OER Grant from the Colorado Department of Higher Education (CDHE)](https://cdhe.colorado.gov/educators/administration/institutional-groups/open-educational-resources-in-colorado).

For similar interactive OER materials in other courses funded by this project in the Department of Mathematical and Statistical Sciences at the University of Colorado Denver, visit <https://github.com/CU-Denver-MathStats-OER>.