# DS102 Statistical Programming in R : Lesson Five - Statistical Plots

### Table of Contents <a class="anchor" id="DS105L5_toc"></a>

* [Table of Contents](#DS105L5_toc)
    * [Page 1 - Introduction](#DS105L5_page_1)
    * [Page 2 - Installing Packages](#DS105L5_page_2)
    * [Page 3 - Histograms](#DS105L5_page_3)
    * [Page 4 - Box Plots](#DS105L5_page_4)
    * [Page 5 - Normal Probability Plots](#DS105L5_page_5)
    * [Page 6 - Key Terms](#DS105L5_page_6)
    * [Page 7 - Hands-On](#DS105L5_page_7)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS105L5_page_1"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('331822031', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102L05overview.zip)**.

# Introduction

R has very strong data visualization capabilities. Unfortunately, creating plots in R can be a challenging process. To the newcomer, there often appear to be mysterious incantations needed to get the plot to appear at all, much less in a form that is desired.

In this lesson, you will be introduced to the plotting package ```ggplot2```. By the end of this lesson, you should be able to create: 

* Histograms
* Box plots
* Normal probability plots

You should also be able to make judgments about the normality of your data and the presence of outliers.

This lesson will culminate in a Hands-On in which you will create all three of these charts for a dataset about river lengths. 


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Installing Packages<a class="anchor" id="DS105L5_page_2"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327336804', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L05-pg2tutorial.zip)**.

# What is ggplot2? 

R has many different systems to create plots of data; the two most popular ones are ```R Base Graphics``` and a system called ```ggplot2```. ```R Base Graphics``` is built into R; you do not need to do anything to make these capabilities available. ```ggplot2``` is a package that you must install. ```ggplot2``` is part of a collection of R packages called the ```tidyverse```. You can find more information about the ```tidyverse``` **[here](https://www.tidyverse.org/)**.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You know what would be helpful? <a href="https://vimeo.com/430057601"><b>This!</b></a> Discover the <i>Anatomy of ggplot</i>. The video will give you the "need to know basics" of running and using ggplot. You can refer back to this video at any time throughout the course.</p>
    </div>
</div>


Although you may find it hard to believe when you start working with ```ggplot2``` and other ```tidyverse``` packages, they are somewhat simpler to use, and their structure is far more consistent, than the packages with corresponding functionality in the base R language. Even more importantly, ```ggplot2``` allows for better customization.  

As you go through this lesson and later lessons, you will learn about what ```ggplot2``` is doing and how it works. But expect a fairly steep learning curve; it may take time before you reach the point where you can bend ```ggplot2``` to your will consistently.

---

## Installing ggplot2

The first item of business is to install the ```ggplot2``` package. Fortunately, RStudio makes this easy. You must have a connection to the Internet to do this installation, since it involves downloading packages.

In RStudio, click on the ```Packages``` tab. If you have not rearranged your panes in RStudio, it should be a tab in the lower right window. You should get a display that looks something like this:

![An R studio window showing the packages pane with a variety of system packages displayed.](Media/L05-InstallTab.png)

Click on the ```Install``` button. This will bring up a dialog box that looks like this:

![An RStudio window for installing packages. At the top of the window is a help button for configuring repositories. There are three fields, install from, packages, and install to library. There is an option to install dependencies if desired. At the bottom of the window are two buttons, install and cancel.](Media/L05-InstallTab1.png)

Type ```ggplot2``` into the "Packages" field, then click on the ```Install``` button. Your Console pane will go crazy for a moment as it installs ```ggplot2``` and the different packages needed to support ```ggplot2```. It may take a few minutes.

Another way to install packages is to use the function ```install.packages()``` and place the name of the package you are trying to install in the parentheses. For instance, to install ```ggplot2``` without the use of the RStudio controls, you would do the following: 

```{r}
install.packages("ggplot2")
```

The end result is the same, however, no matter which method you employ.

The last thing to do to make ```ggplot2``` available for use is to type the following command in the Console pane:

```{r}
library("ggplot2")
```

If your Packages tab is still visible, you should be able to find ```ggplot2``` in the list of packages, and the box by it should be checked. (You could have just checked this box rather than typing the library command.)

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>It is good practice to put double quotes around the package name. The install.packages() function requires a quoted name, while library() accepts the name with double quotes, single quotes, or no quotes at all. If you always use double quotes, the same habit works in both functions and you won't have to fiddle with things!</p>
    </div>
</div>

---

## Installing Other Packages

You'll need many other packages throughout this course, and the best thing to do is download all of them now! This way, you won't end up with any weird versioning issues. Please make sure you run this code in R:

```{r}
install.packages("ggplot2")
# The datasets package ships with base R, so it does not need to be installed
install.packages("readxl")
install.packages("dplyr")
install.packages("PerformanceAnalytics")
install.packages("corrplot")
install.packages("gapminder")
install.packages("gridExtra")  # package names are case-sensitive
install.packages("Ecdat")
install.packages("corpcor")
install.packages("GPArotation")
install.packages("psych")
install.packages("IDPmisc")
install.packages("lattice")
install.packages("treetop")
install.packages("scales")
install.packages("rcompanion")
install.packages("gmodels")
install.packages("car")
install.packages("caret")
install.packages("gvlma")
install.packages("predictmeans")
install.packages("magrittr")
install.packages("tidyr")
install.packages("lmtest")
install.packages("popbio")
install.packages("e1071")
install.packages("data.table")
install.packages("effects")
install.packages("multcomp")
install.packages("mvnormtest")
```

And then you should be all set, and hopefully save your future self some heartache. 
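Running the full list again later will reinstall every package. As a minimal sketch, assuming you only want to install what is missing, a small helper can check first. The name `install_if_missing` is hypothetical and not part of the course materials; `requireNamespace()` and `install.packages()` are standard R functions.

```{r}
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!is_installed]

  # Only contact the package repository when something is actually missing
  if (length(to_install) > 0) {
    install.packages(to_install)
  }

  # Return the names that had to be installed, without printing them
  invisible(to_install)
}
```

You could then call, for example, `install_if_missing(c("ggplot2", "dplyr", "readxl"))` and extend the vector with the rest of the course packages.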

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [1]:
try:
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *
except:
    !pip install DS_Students
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *

In [2]:
try:
    display(L5P2Q1, L5P2Q2)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Histograms<a class="anchor" id="DS105L5_page_3"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327336769', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L05-pg3tutorial.zip)**.

# Histograms

As you have learned previously, a histogram provides a visual representation of the distribution of values in a sample. Graphing each value by itself would not be very useful, so instead the range of the data is divided into equal-width segments called bins. The number of values that fall into each bin is counted, and this count is used to create a bar graph in which the height of each bar is proportional to the number of values in that bin.

Suppose you want to make a histogram of the height data you used previously.  You will first create the vector of heights and assign it to the variable ```height```:

```{r}
height <- c(171, 192, 183, 177, 154, 176)
```

You then create a data frame from ```height```:

```{r}
height_df <- data.frame(height)
```

You create a histogram with these commands:

```{r}
h <- ggplot(height_df, aes(x = height))
h + geom_histogram()
```

The ```aes()``` function is where you specify your variables.  You want the variable ```height```, from the data frame ```height_df```. These commands create a histogram that looks like this:

![A histogram. The x axis is h and runs from one hundred fifty to one hundred ninety five, in increments of ten. The y axis is count and runs from zero point zero zero to one point zero zero in increments of zero point two five. Five vertical bars are shaded on the histogram.](Media/L05-Histogram01.png)

---

## Adding in Bins

Running this code also produces a message: "stat_bin() using bins = 30. Pick better value with binwidth." With the default of 30 bins, each height value ends up in its own narrow bar. Visually, this is not very satisfying. You can make a much more informative histogram by setting the width of each bin to be 10. You accomplish this with the ```binwidth=``` argument: 

```{r}
h + geom_histogram(binwidth = 10)
```

![The x axis is h and runs from one hundred forty to two hundred, in increments of ten. The y axis is count and runs from zero to three in increments of one. Four vertical bars are shaded on the histogram.](Media/L05-Histogram02.png)

As you can see, the histogram is just a special kind of bar chart. The horizontal variable is the height in centimeters. Each bar in the chart is associated with a lower value and an upper value on the horizontal axis.

For example, the bar of height three in the middle of the chart has a lower value of 175 and an upper value of 185. The height of each bar is the number of data values that fall between its lower and upper values.

If you look at the height data, there are three values that fall between 175 and 185. These lower and upper values define the bin for each bar; the height of the bar represents the number of data values that fall into the bin.

There is no bar (or you can think of it as a bar that has zero height) corresponding to the bin for values between 155 and 165; that is because there are no heights that are between 155 and 165.

The vertical axis is labeled ```count``` to indicate that the vertical height of the bars is the number of values that fall into the corresponding bins.
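To check the bar heights by hand, you can count the values per bin in base R. This is a sketch rather than what ```ggplot2``` does internally: the bin edges below (145 to 195 in steps of 10) are an assumption chosen to match the bins described above, and `bin_edges` and `bin_counts` are illustrative names.

```{r}
height <- c(171, 192, 183, 177, 154, 176)

# Assumed bin edges of width 10, matching the bins described above
bin_edges <- seq(145, 195, by = 10)

# cut() assigns each height to a bin; table() counts the heights in each bin
bin_counts <- table(cut(height, breaks = bin_edges))
bin_counts
```

The bin from 155 to 165 should show a count of zero, and the bin from 175 to 185 a count of three, in line with the histogram.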

---

## Adding a Title and Labels

You can improve this histogram by giving it a title of "Histogram of Heights" and adding a label to the horizontal axis of "Height (in cm)". You can do this with the following command:

```{r}
h + geom_histogram(binwidth = 10) +
ggtitle("Histogram of Heights") +
xlab("Height (in cm)")
```

This results in the histogram below:

![A histogram titled histogram of heights. The x axis is height in centimeters and runs from one hundred forty to two hundred, in increments of ten. The y axis is count and runs from zero to three in increments of one. Four vertical bars are shaded on the histogram.](Media/L05-Histogram03.png)
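As an alternative sketch, ```ggplot2``` also provides the ```labs()``` function, which sets the title and the axis labels in a single call. The variable name `labeled_histogram` and the y-axis label "Count" are illustrative choices, not part of the lesson's code.

```{r}
library("ggplot2")

height_df <- data.frame(height = c(171, 192, 183, 177, 154, 176))

# labs() sets the title and both axis labels in one call
labeled_histogram <- ggplot(height_df, aes(x = height)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Histogram of Heights", x = "Height (in cm)", y = "Count")
labeled_histogram
```

Either style works; ```ggtitle()``` and ```xlab()``` are convenient when you only need to change one label.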

---

## Relative Frequencies

Sometimes, when plotting a histogram, you want the height of the bars to be related to the *relative frequency*, or the fraction of the total number of values that fall into each bin. This requires a somewhat more complicated command:

```{r}
h + geom_histogram(binwidth = 10, aes(y = after_stat(count / sum(count)))) +
  ggtitle("Histogram of Heights") + xlab("Height (in cm)") +
  ylab("Relative frequency")
```

Adding ```aes(y = after_stat(count / sum(count)))``` as an argument to ```geom_histogram()``` is the code that changes the counts to relative frequency. (Versions of ```ggplot2``` before 3.3.0 write this as ```aes(y = ..count../sum(..count..))```, a notation that newer versions flag as deprecated.) Adding ```ylab("Relative frequency")``` gives a reasonable label for the vertical axis. These commands give the following histogram. The vertical axis is the relative frequency.

![A histogram titled histogram of heights. The x axis is height in centimeters and runs from one hundred forty to two hundred, in increments of ten. The y axis is relative frequency and runs from zero point zero to zero point five in increments of zero point one. Four vertical bars are shaded on the histogram.](Media/L05-Histogram05.png)
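The relative frequencies can be worked out by hand as a check on the plot. The sketch below uses base R only and assumes bin edges from 145 to 195 in steps of 10, matching the bins in the histogram above; `bin_counts` and `relative_frequency` are illustrative names.

```{r}
height <- c(171, 192, 183, 177, 154, 176)

# Assumed bin edges of width 10, as in the histogram above
bin_counts <- table(cut(height, breaks = seq(145, 195, by = 10)))

# Relative frequency = count in each bin / total number of values
relative_frequency <- bin_counts / sum(bin_counts)
round(relative_frequency, 2)
```

The tallest bar comes out at 0.5, since three of the six heights fall between 175 and 185, and the relative frequencies add up to 1.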

You can change the color of the bars and the color of the lines outlining the bars by adding arguments to ```geom_histogram()```; the color of the bars is specified by adding a ```fill=``` argument, while the color of the lines is specified by adding a ```color=``` argument. You can create a histogram with these colors:

```{r}
h + geom_histogram(binwidth = 10, fill = "goldenrod", color = "deepskyblue4") +
ggtitle("Histogram of Heights") + xlab("Height (in cm)")
```

![A histogram titled histogram of heights. The x axis is height in centimeters and runs from one hundred forty to two hundred, in increments of ten. The y axis is count and runs from zero to three in increments of one. Four vertical bars are shaded on the histogram.](Media/L05-Histogram04.png)

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Spacing around the equals sign in an argument does not matter to R: fill="goldenrod" and fill = "goldenrod" behave identically. Leaving a space on either side of the equals sign is typically preferred for readability.</p>
    </div>
</div>

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to see all the predefined color names for ggplot2, <a href="http://sape.inf.usi.ch/quick-reference/ggplot2/colour">click here</a></p>
    </div>
</div>

---

# Eruptions Histogram

You will now look at a larger data set and how it can be interpreted in a histogram. You will use the eruption times for Old Faithful. ```faithful``` is a data frame with two columns; one is labeled ```eruptions```. You can create the histogram with the following commands:

```{r}
faithful_histogram <- ggplot(faithful, aes(x = eruptions))
faithful_histogram + geom_histogram()
```

This gives the following histogram:

![A histogram. The x axis is eruptions and runs from more than one to more than five in increments of one. The y axis is count and runs from zero to more than twenty five in increments of ten. Vertical bars of various height are spread across the histogram.](Media/L05-Histogram06.png)

This histogram has 30 bins (the default number of bins for the ```geom_histogram()``` function). You could use the ```binwidth=``` argument to change the number of bins. 

However, there is another way that gives even more control over the bin boundaries. You will create a vector of bin boundaries (sometimes called breaks), and pass this vector as the breaks argument to ```geom_histogram()```. In the following, you create bins with a width of 0.2:

```{r}
faithful_histogram + geom_histogram(breaks = seq(1.4, 5.2, by = 0.2))
```

![A histogram. The x axis is eruptions and runs from more than one to more than five in increments of one. The y axis is count and runs from zero to forty in increments of ten. Vertical bars of various height are spread across the histogram.](Media/L05-Histogram07.png)

Although the general shape doesn't look much different, you will notice that the counts in the second histogram go up much higher than the counts in the first. That is because the bins in the second histogram are wider than the 30 default bins, so each one collects more eruption times.
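A quick calculation shows how the two sets of bins compare. This sketch uses base R only; the width computed for the default case is an approximation (the data range divided by 30), since ```ggplot2``` places its default bins slightly differently. The variable names are illustrative.

```{r}
# Bin boundaries passed to breaks=; n boundaries define n - 1 bins
bin_breaks <- seq(1.4, 5.2, by = 0.2)
number_of_bins <- length(bin_breaks) - 1
number_of_bins

# Rough width of each bin if the data range is split into 30 bins instead
# (an approximation of the default, not ggplot2's exact calculation)
approx_default_width <- diff(range(faithful$eruptions)) / 30
approx_default_width
```

With a width of 0.2, the custom breaks give fewer, wider bins than the default, which is why each bar holds more values.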

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [3]:
try:
    display(L5P3Q1)
except:
    pass


<p style="text-align: center">
  <img src="Media/L05-ExerciseWaiting.png" alt="Drawing" style="width: 500px;"/>
</p>

In [4]:
try:
    display(L5P3Q2, L5P3Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Box Plots<a class="anchor" id="DS105L5_page_4"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [5]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327336741', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L05-pg4tutorial.zip)**.

# Box Plots

A box plot is similar to a histogram in that it summarizes data values in a visual format.

It is reasonably straightforward to create a box plot using ```ggplot2```. In this section, you will use the ```cars``` dataset that is supplied as part of R. It includes speed measurements (in miles per hour) and stopping distances (in feet) for cars measured in the 1920s. The ```cars``` dataset is in a data frame format.  If you want to see the first six rows, you can use the ```head()``` function:

```{r}
head(cars)
```

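<table class="table table-striped">
    <tr>
        <th></th>
        <th>speed</th>
        <th>dist</th>
    </tr>
    <tr><td>1</td><td>4</td><td>2</td></tr>
    <tr><td>2</td><td>4</td><td>10</td></tr>
    <tr><td>3</td><td>7</td><td>4</td></tr>
    <tr><td>4</td><td>7</td><td>22</td></tr>
    <tr><td>5</td><td>8</td><td>16</td></tr>
    <tr><td>6</td><td>9</td><td>10</td></tr>
</table>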

You can use the following commands to create a box plot that represents the stopping distance.  For a single box plot, set ```x=``` to empty quotes and put the variable you want to summarize under ```y=```. 

```{r}
d <- ggplot(cars, aes(x = "", y = dist))
d + geom_boxplot() + xlab("")
```

This produces the following box plot:

![A box plot that shows the stopping distance in feet for cars measured in the nineteen twenties. The y axis is distance and runs from zero to one hundred and twenty five in increments of twenty five.](Media/L05-Boxplot01.png)

The box plot is created from the following values:

* Minimum: The smallest value in the vector.
* 1st Quartile: The value below which one quarter of the values lie.
* Median: The middle value in the vector: one-half of the values are larger, and one-half of the values are smaller.
* 3rd Quartile: The value below which three quarters of the values lie.
* Maximum: The largest value in the vector.

Remember, you can use the ```summary()``` function to compute all of these values (plus the mean).  It works on data frames just as well as vectors, but you will need to specify the variable in the data frame you are interested in using the dollar sign ```$``` after the dataset name.  So, ```cars$dist``` specifies that you want a summary of the ```dist``` variable from the ```cars``` dataset.

```{r}
summary(cars$dist)
```

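<table class="table table-striped">
    <tr>
        <th>Min.</th>
        <th>1st Qu.</th>
        <th>Median</th>
        <th>Mean</th>
        <th>3rd Qu.</th>
        <th>Max.</th>
    </tr>
    <tr><td>2.00</td><td>26.00</td><td>36.00</td><td>42.98</td><td>56.00</td><td>120.00</td></tr>
</table>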

The plot below is labeled with these values just for ease of interpretation for you; R will not normally label them.

![A box plot that shows the stopping distance in feet for cars measured in the nineteen twenties. The y axis is distance and runs from zero to one hundred and twenty five in increments of twenty five. The maximum, which is an outlier, the third quartile, the median, the first quartile, the minimum, and the I Q R are labeled on the box plot.](Media/L05-Boxplot02.png)
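If you only want the five values that define the box plot, base R's ```fivenum()``` function returns them directly. This is a small sketch; the names attached to the result are illustrative. Note that ```fivenum()``` reports "hinges", which can differ slightly from the quartiles that ```summary()``` reports for other datasets, although they agree for ```cars$dist```.

```{r}
# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five_numbers <- fivenum(cars$dist)

# Attach readable names to the five values
names(five_numbers) <- c("min", "q1", "median", "q3", "max")
five_numbers
```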

---

## Outliers

There is one value in this box plot that is an outlier. How do you know if a value is an outlier? Two ways: 

1. Visually.  Anything outside the "whiskers" of the plot (the lines at the end) is an outlier.  

2. Mathematically.  Anything more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile is an outlier.

If you were to inspect the plot above for outliers, you would notice that the point labeled "outlier" is in fact after the end of the top whisker.  You could also calculate the IQR (3rd Quartile minus 1st Quartile) to come up with an IQR value of 30 for the stopping distance data. Multiply 30 by 1.5: 

```text
30 * 1.5 = 45
```

Then add this value to your third quartile to find upper outliers: 

```text
45 + 56 = 101
```

And subtract it from the first quartile to find your lower outliers:

```text
26 - 45 = -19
```

So anything below -19 or above 101 would be considered outliers in your data.  On the chart above, you can easily see that the point labeled an outlier is higher than 101. 

This procedure of using 1.5\*IQR beyond the "box" part of the box plot is pretty standard, but it is not a hard and fast rule. There are statistical software packages out there that will use different criteria to determine whether a point is an outlier or not; but many packages use this rule of thumb.
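The arithmetic above can be reproduced in R. The following is a minimal sketch of the 1.5\*IQR rule of thumb applied to the stopping distances; the variable names (`lower_fence`, `upper_fence`, and so on) are illustrative.

```{r}
stopping_distance <- cars$dist

# Quartiles and interquartile range of the stopping distances
q1 <- unname(quantile(stopping_distance, 0.25))
q3 <- unname(quantile(stopping_distance, 0.75))
iqr <- q3 - q1

# Cutoffs: 1.5 * IQR below the first quartile and above the third quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# Values outside the cutoffs are flagged as outliers
outliers <- stopping_distance[stopping_distance < lower_fence |
                              stopping_distance > upper_fence]
outliers
```

For the ```cars``` data this gives cutoffs of -19 and 101, and the only value flagged is 120.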

---

### The Importance of Outliers

You may be wondering why outliers matter.  Especially in a small dataset, one or two outliers can really make a difference! Outliers can shift the mean of a dataset considerably: a few high outliers raise the mean, and a few low outliers pull it down.  Consider the ```cars``` dataset: with the outlier in, the mean is 42.98.  When the outlier of 120 is removed, the mean drops to 41.41.  That's not a whole lot in this case, but add in just one or two more upper outliers and you may find a mean that is much higher than the typical value in the data.

Why is that a problem? Those few high points have undue influence over the result.  They may be genuinely rare data points that sit far from the rest of the data, or they may be mistakes in data collection or data entry.  Ever accidentally add an extra zero to something?  A 20 pound baby can become 200 pounds in just one typo!

Because outliers sit so far from the rest of the data, they can bias your results. Outliers always need to be noted, and for some analyses you may also remove them.  You will learn more about the process of automatic outlier detection and removal at a later point, but it's good to be aware of their influence now! 
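You can check the effect of the outlier on the mean yourself. The sketch below recomputes the mean of the stopping distances with and without the value of 120; the median is included as an extra comparison that is not part of the discussion above, and the variable names are illustrative.

```{r}
stopping_distance <- cars$dist

# Mean and median with every value included
mean_with_outlier <- mean(stopping_distance)
median_with_outlier <- median(stopping_distance)

# Drop the single outlier (120) and recompute
without_outlier <- stopping_distance[stopping_distance != 120]
mean_without_outlier <- mean(without_outlier)
median_without_outlier <- median(without_outlier)

round(c(with = mean_with_outlier, without = mean_without_outlier), 2)
c(with = median_with_outlier, without = median_without_outlier)
```

The mean moves from 42.98 to 41.41, while the median stays at 36 in both cases, which illustrates why the mean is the more outlier-sensitive of the two.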

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [5]:
try:
    display(L5P4Q1, L5P4Q2, L5P4Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Normal Probability Plots<a class="anchor" id="DS105L5_page_5"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [6]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to R
VimeoVideo('327336815', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L05-pg5tutorial.zip)**.

# Normal Probability Plots

You can do a sanity check to see if data come from a normal distribution using a normal probability plot. As you've already seen, determining whether data are normally distributed is important because many statistical procedures assume normality.  

If data are approximately normally distributed, the points in their normal probability plot will lie approximately in a straight line. On the other hand, if they are not normally distributed, the plot will not be straight. As a guideline, use the "fat pencil test": if a fat pencil placed over the graph will cover all or virtually all of the data points in the normal probability plot, then the distribution can be assumed normal.

Using ```ggplot2```, it is straightforward to create a normal probability plot. You will use the eruption times from the ```faithful``` dataset to demonstrate this. The following command creates the normal probability plot:

```{r}
ggplot(faithful, aes(sample = eruptions)) + geom_qq()
```

The name of the variable goes after ```sample=```, and adding ```geom_qq()``` sets the type of plot.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>The "QQ" in geom_qq() stands for quantile-quantile. Normal probability plots are often called QQ plots, and that is where the command gets its name.</p>
    </div>
</div>

![A probability plot. The x axis is labeled theoretical and runs from negative three to three in increments of one. The y axis is labeled sample and runs from less than two to just over five in increments of one starting at two. Data is plotted but does not fall in a straight line.](Media/L05-NormalProbPlot.png)

When you look at this plot, it is clear that the data **do not** fall on a straight line - it is not even close - so the eruption times in the ```faithful``` data frame do not come from a normal distribution. You can also see this by looking at the histogram of eruption times you created previously; this histogram is repeated below:

![A histogram. The x axis is eruptions and runs from more than one to more than five in increments of one. The y axis is count and runs from zero to forty in increments of ten. Vertical bars of various height are spread across the histogram.](Media/L05-Histogram07.png)

This histogram does not have the same shape as a normal distribution; a normal distribution has one "peak," (that beautiful bell-shaped curve) while this histogram has two peaks. Distributions with two peaks are called *bimodal*. This is further evidence that the eruption times do not come from a normal distribution.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>As a data scientist, a healthy dose of common sense is a great asset. You just illustrated how to do a 'diagnostic' test on a set of data to see if it is reasonable to assume the data are normally distributed.</p>
    </div>
</div>

In reality, if you were working with the ```faithful``` data, a glance at the histogram would make it obvious that the data are not normally distributed. You probably wouldn't bother to create a normal probability plot to verify this.

---

## A Normally Distributed Example

You can create a normal probability plot from the speed of light data in the ```morley``` data set. The following command creates the normal probability plot:

```{r}
ggplot(morley, aes(sample = Speed)) + geom_qq()
```

This creates the following normal probability plot:

![A normal probability plot. The x axis is labeled theoretical and runs from approximately negative two point seven five to approximately two point seven five in increments of one starting at negative two. The y axis is labeled sample and runs from six hundred to approximately one thousand one hundred in increments of one hundred. Data is plotted in more or less a straight line running from the bottom left to the upper right.](Media/L05-NormalProbPlot2.png)

The data fall more or less in a straight line, so this indicates that these data have a distribution that is approximately normal. In your mind's eye, place a fat pencil over the data. Does your pencil cover all or very nearly all of the data? If your answer is 'yes' you can assume that the data are close enough to being normally distributed to treat them as such.
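If imagining a pencil feels too subjective, you can add a reference line to the plot. This is a sketch of one option rather than part of the lesson's method: ```geom_qq_line()``` is a ```ggplot2``` function that, by default, draws a line through the first and third quartiles of the data. The variable name `qq_with_line` is illustrative.

```{r}
library("ggplot2")

# geom_qq() plots the points; geom_qq_line() adds a straight reference line
qq_with_line <- ggplot(morley, aes(sample = Speed)) +
  geom_qq() +
  geom_qq_line()
qq_with_line
```

Points that stay close to the line suggest an approximately normal distribution; points that curve away from it suggest otherwise.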

---

## Summary

Screening and cleaning your data is one of the most important steps a data scientist can take to ensure high quality data.  Histograms, box plots, and normal probability plots are all great visuals to use for your own internal visualization purposes, so you can get a handle on the normality and number of outliers in your data.  Armed with that knowledge, you will be better able to make judgments about which statistics to use and whether the data you have is accurate.  

```ggplot2``` is a very useful package for visualizations in R that you will use over and over again, so getting familiar with the syntax of ```ggplot2``` will speed you on your way to bigger and better things!

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [6]:
try:
    display(L5P5Q1)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Key Terms<a class="anchor" id="DS105L5_page_6"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ggplot2</td>
        <td>A graphics-creation library in R. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Outlier</td>
        <td>An extreme value in the data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Normal Probability (QQ) Plot</td>
        <td>Shows the normality of your data.  Data are approximately normally distributed if the points form a roughly straight line that can be covered with a fat pencil.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Bimodal</td>
        <td>Describes a distribution that has two peaks.</td>
    </tr>
</table>

---

## Key R Functions

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>install.packages() </td>
        <td>Installs new packages for R.  Done only once, but requires internet. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>library()</td>
        <td>Makes new packages in R available for your use.  Must be done every time you open R. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>data.frame()</td>
        <td>Creates a data frame. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ggplot()</td>
        <td>Begins a chart.</td>
    </tr>
</table>


---

## Key Arguments in ggplot()

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>aes()</td>
        <td>Stands for aesthetics; where you specify the columns of data you want to graph. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>x=</td>
        <td>Argument of aes() where you specify the x variable. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>y=</td>
        <td>Argument of aes() where you specify the y variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>geom_histogram()</td>
        <td>Makes a histogram.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>binwidth=</td>
        <td>Specifies the width of each bin (data bucket) in a histogram.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ggtitle()</td>
        <td>Adds a title for your graph.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>xlab()</td>
        <td>Adds a label for the x axis. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ylab()</td>
        <td>Adds a label for the y axis.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>fill=</td>
        <td>Specifies a main color in the chart.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>color=</td>
        <td>Specifies an outline color in the chart.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>geom_boxplot()</td>
        <td>Creates a box plot.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sample=</td>
        <td>An argument for aes() used when making a normal probability plot to specify the data column.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>geom_qq()</td>
        <td>Creates a normal probability plot.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Hands-On<a class="anchor" id="DS105L5_page_7"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


### Lesson 5 Hands-On
#### Directions
This Hands-On will be graded, so be sure you complete all requirements. Please complete this Hands-On within RStudio and submit both your presentation and your code (R Script file) for review when completed. 

---

## Requirements

The built-in data set ```rivers``` has the length in miles of 141 major rivers in North America. You can build a data frame of this data set that is suitable for graphing as follows:

```rr = data.frame(rivers)```

Using the following command will provide a view of the data in the data frame:

```View(rr)```

With this entire data set, create a histogram with suitable bin widths, a box plot, and a normal probability plot. Then answer the following questions:

+ Are there any outliers in this data set? Are they high or low outliers?

+ Do these data appear to come from a normal distribution?

Create a presentation (MS PowerPoint or equivalent) that answers these questions, and includes the graphs and code that you used.


<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire document when finished!</p>
    </div>
</div>

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>To zip your file on <b>Windows</b>, right click on the file and select "Send to", then select "Compressed (zipped) folder". For <b>Mac</b> users, right click on the file and select "Compress", then select your file from the options.</p>
    </div>
</div>