Click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap1//03-EDA-Quantitative.ipynb) to open an interactive version of the full text section.


For a shorter [in-class lab version of the section, part 1, click here.](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Labs/03-EDA-LabA-Quantitative.ipynb)

For a shorter [in-class lab version of the section, part 2, click here.](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Labs/03-EDA-LabB-Quantitative.ipynb)

# <a name="03intro">1.3: Exploring Quantitative Data</a>

---


Additional Reading:

-   See [Overview of Plotting Data in R](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Appendix/Overview-of-Plots.ipynb) for further reading and examples about plotting in R.
-   See [Fundamentals of Working with Data](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Appendix/Intro-to-Vectors-Dataframes.ipynb) for more information about data types and structures in R.
-   The [R Graph Gallery](https://r-graph-gallery.com/) has examples of many other types of graphs.


# <a name="03variables">Types of Variables</a>

---

In statistics,  <font color="dodgerblue">**variables** </font> are the attributes measured or collected in data. We refer to them as variables since the values or classes of attributes typically vary from observation to observation. The term variable is used differently in statistics from the notion of a variable in algebra. There are two types of variables in statistics:

-   If a variable is measured or counted by a number, it is called a <font color="dodgerblue">**quantitative** </font> or <font color="dodgerblue">**numerical** </font> variable.
  -   Quantitative variables may be <font color="dodgerblue">**discrete (integers)**</font> or <font color="dodgerblue">**continuous (decimals)**</font>.

-   If a variable groups observations into different categories or rankings, it is a <font color="dodgerblue">**qualitative** </font> or <font color="dodgerblue">**categorical** </font> variable.
  -   The different categories of a qualitative variable are called <font color="dodgerblue">**levels** </font> or <font color="dodgerblue">**classes** </font>.
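
As a minimal sketch of these distinctions, the toy vectors below show how R stores each kind of variable. The names `counts`, `speeds`, and `status` and their values are made up for illustration and do not come from any data set in this section.

```r
# hypothetical toy data, for illustration only
counts <- c(3L, 0L, 5L)                    # discrete: stored as integer (note the L suffix)
speeds <- c(27.5, 40.2, 61.0)              # continuous: stored as double
status <- factor(c("low", "high", "low"))  # categorical: stored as a factor

typeof(counts)      # "integer"
typeof(speeds)      # "double"
is.numeric(status)  # FALSE, a factor is not treated as a number
levels(status)      # "high" "low" (levels are sorted alphabetically by default)
```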

The type of statistical analysis we can do depends on whether:

-   We are investigating a single variable, or looking for an association between multiple variables.
-   The variable(s) are quantitative or categorical.
-   The data satisfies certain assumptions.

In our work with [Exploring Categorical
Data](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap1/02-EDA-Categorical.ipynb), we performed an initial summary of the categorical variables in the `storms` data set. Today, we will investigate how to numerically and visually summarize quantitative variables.



# <a name="03data">Getting to Know Our Data</a>

---

The `dplyr` package includes a data set from the [NOAA Hurricane Best
Track Data](https://www.nhc.noaa.gov/data/#hurdat) with data on
the following attributes of tracked North Atlantic storms since 1975:

-   Storm name: `name`
-   Date and time: `year`, `month`, `day`, and `hour`
-   Storm position: `lat` and `long`
-   Storm classification: `status`
-   Category of hurricane: `category` (non-hurricanes are `NA`)
-   Wind speed (in knots): `wind`
-   Pressure (in millibars): `pressure`
-   Tropical storm force diameter (in nautical miles): `tropicalstorm_force_diameter`
-   Hurricane force diameter (in nautical miles): `hurricane_force_diameter`

See [Exploring Categorical Data](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap1/02-EDA-Categorical.ipynb) for a refresher on our initial exploration with the `storms` data frame.



## <a name="03load">Loading Required Package</a>

---

In order to access the `storms` data frame in the `dplyr` package, we first load the package with the `library()` function.

In [None]:
library(dplyr)  # load dplyr package

## <a name="03help">Help Documentation for `storms`</a>

---

The `?` help operator and `help()` function provide access to the help manuals for R functions, data sets, and other objects. If at any point we want to learn more about data or a function used in this notebook, we can use the help operator. For example, `?typeof`, `?str`, `?hist`, and `?boxplot` will each open a help tab with further details about that function.

-   **Run the code cell below to access the help documentation for the `storms` data set.**

In [None]:
?storms  # open help tab

## <a name="03q1">Question 1</a>

---

List all the quantitative variables in `storms`. Which are being stored
as `integer`, and which are stored as `double` (decimals)?

- You can edit, run, and rerun the `typeof()` function in the first code cell below to help identify the data types of individual variables in `storms`.
- You can use the `str()` function in the second code cell to identify the data types of all variables at once.

In [None]:
typeof(storms$year)

In [None]:
str(storms)

### <a name="03sol1">Solution to Question 1</a>

---

<br> <br> <br>
  



## <a name="03q2">Question 2</a>

---

What wind speeds are classified as a Category 2 hurricane?

### <a name="03sol2">Solution to Question 2</a>

---

<br> <br> <br>
  
  



## <a name="03q3">Question 3</a>

---

What does the variable `tropicalstorm_force_diameter` measure? What does it mean if a storm observation has a 0 for `tropicalstorm_force_diameter`?

### <a name="03sol3">Solution to Question 3</a>

---

<br> <br> <br>
  



## <a name="03q4">Question 4</a>

---

Enter comments in the code cell below to help describe what each command performs. Then run the `str()` function after running the commands to see the updated data structure of `storms`.

### <a name="03sol4">Solution to Question 4</a>

---

In [None]:
# enter your comments after each #
storms$year <- as.integer(storms$year)  #
storms$month <- as.integer(storms$month)  #
storms$hour <- as.integer(storms$hour)  #
storms$category <- factor(storms$category)  #

In [None]:
# view the resulting data structure
str(storms)

# <a name="03plot">Plotting Quantitative Data</a>

---

Additional resources for help with plotting data:

-   See [Overview of Plotting Data in R](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Appendix/Overview-of-Plots.ipynb) for further reading and examples about plotting in R.
-   The [R Graph Gallery](https://r-graph-gallery.com/) has examples of many other types of graphs.



## <a name="03hist">Histograms</a>

---

A  <font color="dodgerblue">**histogram** </font> is a special bar chart we use to display the distribution of values for a quantitative variable.

-   We first group the values into different ranges of values called <font color="dodgerblue">**bins** </font> of equal width.
    -   This essentially converts the quantitative variable to an ordinal categorical variable, with each bin representing a different level.
    -   Consider the quantitative variable `wind`. We can use bin ranges such as 0-10 knots, 10-20 knots, … , 160-170 knots.
      -   Each bin range should have the same width.
      -   The bins do not overlap.
      -   The ordering of the bins is very important.
-   Then we count how many values in the data are in each bin.
-   A histogram is a bar chart that represents the number of values that are in each bin range.
-   Values of the quantitative variable are measured on the horizontal axis.
-   The height of the bars over each bin range is the number of values (or frequency) in each bin range.
-   **By default, the counts are right closed.** For example, a wind value of 20 knots would be counted in the bin range 10-20 knots and not counted in the bin range 20-30 knots.
-   A histogram should not have any spaces between consecutive bars. Empty space means no values are in that bin range.
-    <font color="dodgerblue">**The R function `hist(x, [options])` creates a histogram.** </font>
-   Run `?hist` for more information about the available options for customizing a histogram, some of which are illustrated in the code cell below.

In [None]:
# create a histogram
hist(storms$wind,  # vector of values to plot
     breaks = 15,  # suggested number of bin ranges (R may adjust it)
     xlab = "wind speed (in knots)",   # x-axis label
     xlim = c(0,200),  # sets window for x-axis
     ylab = "Frequency",  # y-axis label
     ylim = c(0,5000),  # sets window for y-axis
     main = "Distribution of Storm Wind Speed",  # main label
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5,  # increase font size
     col = "steelblue")  # fill color of bars
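
To make the binning rules above concrete, here is a small sketch using a made-up vector `x` (an assumption for illustration, not the `storms` data). Setting `plot = FALSE` asks `hist()` to return the bin counts instead of drawing the plot, and the `right` argument controls which end of each bin is closed.

```r
# hypothetical wind speeds (in knots), for illustration only
x <- c(5, 10, 12, 20, 20, 25)
bins <- c(0, 10, 20, 30)  # bin edges: 0-10, 10-20, 20-30

# default (right = TRUE): a value of 20 is counted in the 10-20 bin
h_right <- hist(x, breaks = bins, plot = FALSE)
h_right$counts  # 2 3 1

# right = FALSE: a value of 20 is counted in the 20-30 bin instead
h_left <- hist(x, breaks = bins, right = FALSE, plot = FALSE)
h_left$counts  # 1 2 3

# cut() applies the same rule and labels each bin as a level of a factor
table(cut(x, breaks = bins, include.lowest = TRUE))
```

The two sets of counts differ only in where the boundary values 10 and 20 are placed, which is why it matters to know which convention a histogram uses.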

## <a name="03q5">Question 5</a>

---

Based on the histogram above, approximately how many storm observations
have a wind speed less than or equal to 40 knots?

### <a name="03sol5">Solution to Question 5</a>

---

<br> <br> <br>
  
  
  



## <a name="03q6">Question 6</a>

---

The code cell below can help us check our answer.

1.  Explain what operation(s) the command in the first code cell below
    performs. Run the code cell and compare the last 10 entries in the
    vector `le.40` and the vector `storms$wind` to help determine your answer.

2.  Then run and explain what the second code cell below does. *Hint: R
    reads the logical `TRUE` as the number 1 and `FALSE` as the number
    0.*

3.  How accurate was your previous answer in [Question 5](#03q5)?

### <a name="03sol6">Solution to Question 6</a>

---

1.  Enter comment in first code cell.

2.  Enter comment in second code cell.

3.  How accurate was your answer in [Question 5](#03q5)?

In [None]:
le.40 <- storms$wind <= 40  # ??

tail(storms$wind, 10)  # prints last 10 entries of wind speed vector
tail(le.40, 10)  # prints last 10 entries of logical vector le.40

In [None]:
# enter comment to interpret this command
sum(le.40)  # ??

### <a name="03number-bins">Changing the Number of Bins</a>

---

A histogram can illustrate the general shape of the distribution of
a quantitative variable; however, the number of breaks we use can have a
substantial impact.

-   If we include too few bins, we do not get much detail, and we may
    even get a misleading picture.
-   If we include too many bins, the histogram may be difficult to read.
-   The fun of interacting with data in R is we can play around and
    adjust the number of breaks and other options until we are
    satisfied.

In [None]:
# plots appear in an array with 1 row and 2 columns
par(mfrow = c(1, 2))  # create an array of plots

# create a histogram
hist(storms$wind,  # vector of values to plot
     breaks = 5,  # suggested number of bin ranges (R may adjust it)
     xlab = "wind speed (in knots)",   # x-axis label
     xlim = c(0,200),  # sets window for x-axis
     ylab = "Frequency",  # y-axis label
     ylim = c(0,15000),  # sets window for y-axis
     main = "breaks = 5",  # main label
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5,  # increase font size
     col = "steelblue")  # fill color of bars

# create a histogram
hist(storms$wind,  # vector of values to plot
     breaks = 50,  # suggested number of bin ranges (R may adjust it)
     xlab = "wind speed (in knots)",   # x-axis label
     xlim = c(0,200),  # sets window for x-axis
     ylab = "Frequency",  # y-axis label
     ylim = c(0,3000),  # sets window for y-axis
     main = "breaks = 50",  # main label
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5,  # increase font size
     col = "seagreen")  # fill color of bars
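
One detail worth knowing when experimenting: according to `?hist`, a single number passed to `breaks` is treated as a suggestion, so the number of bars drawn may differ from the number requested. The sketch below uses simulated data (the vector `x`, the seed, and the bin width of 10 are assumptions for illustration) to compare the suggested and actual number of bins. It also shows that passing a vector of bin edges gives exact control.

```r
set.seed(1)                           # make the simulated data reproducible
x <- round(rexp(500, rate = 1 / 50))  # hypothetical right-skewed values

# a single number is only a suggestion for the number of bins
h_suggest <- hist(x, breaks = 5, plot = FALSE)
length(h_suggest$counts)              # actual number of bins R chose

# a vector of bin edges is used exactly as given
edges <- seq(0, 10 * ceiling(max(x) / 10), by = 10)
h_exact <- hist(x, breaks = edges, plot = FALSE)
length(h_exact$counts) == length(edges) - 1  # TRUE
```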

## <a name="03q7">Question 7</a>

---

How would you describe the shape of the distribution of wind speed in the histograms above?

### <a name="03sol7">Solution to Question 7</a>

---

<br> <br> <br>
  
  



## <a name="03q8">Question 8</a>

---

Create a histogram to display the quantitative variable `month`. What does the shape of that graph tell you about the data?

### <a name="03sol8">Solution to Question 8</a>

---

<br> <br> <br>
  
  
  

## <a name="03q9">Question 9</a>

---

Create a histogram to display the quantitative variable `long`. What does the shape of that graph tell you about the data?

### <a name="03sol9">Solution to Question 9</a>

---

  
<br> <br> <br>

  



## <a name="03skewness">The Skewness of Data</a>

---

The  <font color="dodgerblue">**skewness** </font> of the data describes the direction of the tail of the data. The tail of the data indicates the direction of outliers (if any).

In [None]:
par(mfrow = c(1, 3))  # Create a 1 x 3 array of plots

hist(storms$wind,
     xlab = "wind speed (in knots)",   # x-axis label
     ylab = "Frequency",  # y-axis label
     main = "Distribution of Wind Speeds",  # main title
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5,  # increase font size
     col = "steelblue")  # fill color of bars

hist(storms$month,
     breaks = 12,  # number of breaks
     xlab="Month",   # x-axis label
     ylab = "Frequency",  # y-axis label
     main = "Distribution of Months",  # main title
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5,  # increase font size
     col = "coral1")  # fill color of bars

hist(storms$long,
     breaks = 15,  # number of breaks
     xlab="Degrees of Longitude",   # x-axis label
     ylab = "Frequency",  # y-axis label
     main = "Distribution of Longitude",  # main title
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5,  # increase font size
     col = "aquamarine4")  # fill color of bars

-   The distribution of wind speeds is
     <font color="dodgerblue">**skewed right** </font>.
-   The distribution of months is
     <font color="dodgerblue">**skewed left** </font>.
-   The distribution of longitude is approximately
     <font color="dodgerblue">**symmetric** </font>.



# <a name="03center">Measurements of Center</a>

---

Typical measurements of center are:

-   The  <font color="dodgerblue">**mean** </font> is the average
    value.

$${\large \bar{x} = \frac{\mbox{sum of all values}}{\mbox{total number of values}} =  \sum_{i=1}^{n} \frac{x_i}{n}}. $$

-   We use $\color{dodgerblue}{\mathbf{\bar{x}}}$ (pronounced x-bar) to
    denote a  <font color="dodgerblue">**sample** </font> mean.
    -   We use $\color{mediumseagreen}{\mathbf{\mu}}$ (Greek letter mu)
    to denote a
     <font color="mediumseagreen">**population** </font> mean.
    -   In R, we use the function `mean()`.
-   The  <font color="dodgerblue">**median** </font> is the
    $50^{\mbox{th}}$ percentile. This means roughly half of the values in the
    data set are less than or equal to the median.
    -   In R, we use the function `median()`.
    - If there are an odd number of values, the median is the middle value.
    -   If there are an even number of values, the median is the
    midpoint between the two middle values.
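
As a quick check of these definitions, the sketch below computes the mean and median of two small made-up vectors (`x_odd` and `x_even` are hypothetical values chosen for illustration).

```r
# hypothetical values, for illustration only
x_odd  <- c(2, 4, 7, 10, 12)  # five values
x_even <- c(2, 4, 7, 10)      # four values

sum(x_odd) / length(x_odd)  # 7, the mean computed from the formula
mean(x_odd)                 # 7, same result from the built-in function

median(x_odd)   # 7, the middle value
median(x_even)  # 5.5, the midpoint between 4 and 7
```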



## <a name="03q10">Question 10</a>

---

Compute the mean and median wind speed of the `storms` data. Interpret
each value in practical terms. Be sure to include the units in your
interpretation.

*Hint: We can input the vector of wind speeds with the code
`storms$wind`.*

### <a name="03sol10">Solution to Question 10</a>

---

<br> <br> <br>

  
  



## <a name="03q11">Question 11</a>

---

Why do you think the mean wind speed is greater than the median wind
speed?

### <a name="03sol11">Solution to Question 11</a>

---

<br> <br> <br>
  
  



## <a name="03shape-center">Relation of Shape to Measurements of Center</a>

---

<figure>
<img
src="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Images/03fig-skewness.png"
alt="Image Credit: Adam Spiegler, CC BY-SA 4.0." />
<figcaption aria-hidden="true">Image Credit: Adam Spiegler, <a
href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA
4.0</a>.</figcaption>
</figure>

-   The mean is more sensitive to outliers than the median. The mean is
    pulled in the direction of the tail.
-   If the shape of the histogram is
     <font color="dodgerblue">**symmetric** </font>, then the
     <font color="dodgerblue">**mean is (approximately) equal to the
    median** </font>.
-   If the shape of a histogram is  <font color="tomato">**skewed
    to the left** </font>, the  <font color="tomato">**mean is
    less than the median** </font>.
-   If the shape of a histogram is
     <font color="mediumseagreen">**skewed to the right** </font>,
    the  <font color="mediumseagreen">**mean is greater than the
    median** </font>.
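
The relationships above can be checked with simulated data. The sketch below is only an illustration: the three vectors are randomly generated (not taken from `storms`), and the choice of distributions, sample size, and seed are assumptions.

```r
set.seed(42)                # reproducible simulated data
right_skew <- rexp(10000)   # long tail to the right
left_skew  <- -rexp(10000)  # mirror image: long tail to the left
symmetric  <- rnorm(10000)  # bell-shaped and symmetric

# difference between mean and median for each shape
mean(right_skew) - median(right_skew)  # positive
mean(left_skew) - median(left_skew)    # negative
mean(symmetric) - median(symmetric)    # close to 0
```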



# <a name="03spread">Measurements of Spread</a>

---


Typical measurements of spread are:

-   The  <font color="dodgerblue">**range = max - min** </font>.
    -   The advantage of the range is that it is easy to compute.
    -   However, the range ignores all values in the data other than the
    maximum and minimum values.
-   The  <font color="dodgerblue">**standard deviation** </font>
    approximately measures the average distance of all values from the
    mean value.
    -   For a sample,
$$\color{dodgerblue}{s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}}.$$

    -   The standard deviation takes all values into account and thus
    involves many calculations. We typically use technology to help!
    -   The command `sd(var_name)` computes the sample standard
    deviation in R.
    -   We use $\color{dodgerblue}{\mathbf{s}}$ to denote a
     <font color="dodgerblue">**sample** </font> standard
    deviation.
    -   We use $\color{tomato}{\mathbf{\sigma}}$ (Greek letter sigma) to
    denote a  <font color="tomato">**population** </font>
    standard deviation.
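
To connect the formulas above to R, here is a minimal sketch on a small made-up vector `x` (hypothetical values, for illustration only). It computes the range and the sample standard deviation by hand and compares the latter with `sd()`.

```r
# hypothetical values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

max(x) - min(x)  # 7, the range as defined above
range(x)         # 2 9, R's range() returns the min and max, not their difference

# sample standard deviation computed from the formula
n <- length(x)
s_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))
s_manual         # about 2.14
sd(x)            # same value from the built-in function
```

Note that `diff(range(x))` is one way to get the range as a single number.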

<figure>
<img
src="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Images/03fig-stdev.png"
alt="Image Credit: Adam Spiegler, CC BY-SA 4.0." />
<figcaption aria-hidden="true">Image Credit: Adam Spiegler, <a
href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA
4.0</a>.</figcaption>
</figure>



## <a name="03q12">Question 12</a>

---

Which of the histograms (i)-(vi) has the largest range? The smallest
range?

### <a name="03sol12">Solution to Question 12</a>

---

<br> <br> <br>
  
  



## <a name="03q13">Question 13</a>

---

Which of the histograms (i)-(vi) has the largest standard deviation? The
smallest standard deviation?

### <a name="03sol13">Solution to Question 13</a>

---

<br> <br> <br>

  



# <a name="CC License">Creative Commons License Information</a>
---


![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

*Statistical Methods: Exploring the Uncertain* by [Adam
Spiegler (University of Colorado Denver)](https://github.com/CU-Denver-MathStats-OER/Statistical-Theory)
is licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/). This work is funded by an [Institutional OER Grant from the Colorado Department of Higher Education (CDHE)](https://cdhe.colorado.gov/educators/administration/institutional-groups/open-educational-resources-in-colorado).

For similar interactive OER materials in other courses funded by this project in the Department of Mathematical and Statistical Sciences at the University of Colorado Denver, visit <https://github.com/CU-Denver-MathStats-OER>.