# Lecture 5.1: More About Exploratory Data Analysis 

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* More about histograms
* The **covariation** between two variables:
    * a categorical and a continuous variable
    * two categorical variables
    * two continuous variables
    
This lecture note corresponds to parts of Chapter 7 of your book.
</div>


In [1]:
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## More about Histogram

Let us try to plot a histogram for the variable `dep_delay` in our flights data set. 
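A first attempt might look like the sketch below. The original cell is not reproduced in these notes, so the bin width of 5 minutes and the name `p_raw` are assumptions made for illustration.

```r
library(tidyverse)
library(nycflights13)

# `p_raw` and binwidth = 5 are illustrative choices, not taken from the lecture.
# Rows with a missing dep_delay are dropped with a warning when the plot is drawn.
p_raw <- ggplot(flights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 5)
p_raw
```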

The warning message tells us that 8255 rows have missing values for `dep_delay`, so let us remove those rows before we plot the histogram again.
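One way to drop the missing values is `filter()` with `!is.na()`. This is a sketch; the name `flights_no_na` and the bin width are assumptions, not the lecture's own choices.

```r
library(tidyverse)
library(nycflights13)

# `flights_no_na` is a hypothetical name used only in this sketch.
# Keep only the flights that have a recorded departure delay.
flights_no_na <- filter(flights, !is.na(dep_delay))

# The difference in row counts is the number of rows that were removed.
nrow(flights) - nrow(flights_no_na)

p_no_na <- ggplot(flights_no_na, aes(x = dep_delay)) +
  geom_histogram(binwidth = 5)
p_no_na
```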

Since we have already manually removed all of the missing values, `ggplot` will not output a warning message for us now.

Let us zoom into the left part of the plot. Let us only look at flights with departure delays of less than an hour.
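Zooming in can be done by filtering before plotting. The sketch below assumes "less than an hour" means `dep_delay < 60` (the variable is recorded in minutes); `flights_short` is an illustrative name.

```r
library(tidyverse)
library(nycflights13)

# `flights_short` is a hypothetical name; dep_delay is measured in minutes.
flights_short <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60)

p_short <- ggplot(flights_short, aes(x = dep_delay)) +
  geom_histogram(binwidth = 5)
p_short
```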

We can look at the underlying bins and their counts by combining `count()` with the `cut_width()` function from ggplot2.

`cut_width()` cuts a continuous variable into bins of a given width (five here), and `count()` then reports how many observations fall into each bin.
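A minimal sketch of that computation, assuming the same one-hour filter as in the zoomed-in plot. The column name `bin` and the object name `bin_counts` are introduced here for illustration.

```r
library(tidyverse)
library(nycflights13)

# `bin_counts` and the column name `bin` are illustrative.
# cut_width() assigns each delay to an interval of width 5;
# count() then tallies the rows in each interval.
bin_counts <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60) %>%
  count(bin = cut_width(dep_delay, width = 5))
print(bin_counts)
```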

#### Remark: 
The appearance of a histogram does depend on your choice of the bin width. It is a good idea to try several values to see if different choices reveal different patterns.
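To illustrate the remark, the sketch below draws the same data with a narrow and a wide bin width. The values 1 and 15 are arbitrary choices for comparison.

```r
library(tidyverse)
library(nycflights13)

# `delays`, `p_narrow`, and `p_wide` are hypothetical names for this sketch.
delays <- filter(flights, !is.na(dep_delay), dep_delay < 60)

# One-minute bins show fine detail; 15-minute bins smooth it away.
p_narrow <- ggplot(delays, aes(x = dep_delay)) + geom_histogram(binwidth = 1)
p_wide <- ggplot(delays, aes(x = dep_delay)) + geom_histogram(binwidth = 15)
p_narrow
p_wide
```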

We can also bring in a third variable to our histogram just like we did for `geom_bar` and others.

Let us bring in the categorical variable **carrier** and map the color aesthetic to it.
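A sketch of that mapping, assuming we keep the one-hour filter and the 5-minute bins from before:

```r
library(tidyverse)
library(nycflights13)

# `p_carrier` is an illustrative name; the filter and bin width are assumptions.
p_carrier <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60) %>%
  ggplot(aes(x = dep_delay, color = carrier)) +
  geom_histogram(binwidth = 5)
p_carrier
```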


Oops! The legend is a bit crowded. Let us see who the major carriers are by number of flights.
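One way to rank the carriers is `count()` with `sort = TRUE`. This is a sketch, and `carrier_counts` is a name made up for it.

```r
library(tidyverse)
library(nycflights13)

# `carrier_counts` is a hypothetical name.
# count() tallies flights per carrier; sort = TRUE puts the largest first.
carrier_counts <- flights %>%
  count(carrier, sort = TRUE)
print(carrier_counts)
```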

Maybe let us just plot the histogram for the top 5 carriers. Let us find out which carriers are the top five by using the tools that we have learnt so far.

Now we can additionally filter out rows that do not belong to the top 5 carriers.
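The two steps above could be sketched as follows. The names `top5` and `flights_top5` are illustrative, and ranking carriers by total number of flights is an assumption about what "top" means here.

```r
library(tidyverse)
library(nycflights13)

# `top5` and `flights_top5` are hypothetical names for this sketch.
# Step 1: the five carriers with the most flights.
top5 <- flights %>%
  count(carrier, sort = TRUE) %>%
  head(5) %>%
  pull(carrier)

# Step 2: keep short delays for those carriers only.
flights_top5 <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60, carrier %in% top5)

p_top5 <- ggplot(flights_top5, aes(x = dep_delay, color = carrier)) +
  geom_histogram(binwidth = 5)
p_top5
```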

Hmmm... Maybe it is not a good idea to stick with histograms here. The plot is still too crowded and it is hard to see what is going on. So let us use a new geometry, **freqpoly**, which is like a histogram but draws lines instead of bars. Overlapping lines are easier to see than overlapping bars.
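A sketch of the frequency polygon for the top five carriers. The filtering steps are repeated so the block runs on its own; all object names are illustrative.

```r
library(tidyverse)
library(nycflights13)

# `top5` and `p_freq` are hypothetical names; binwidth = 5 is an assumption.
top5 <- flights %>%
  count(carrier, sort = TRUE) %>%
  head(5) %>%
  pull(carrier)

# geom_freqpoly() bins the data like a histogram but draws one line per carrier.
p_freq <- flights %>%
  filter(!is.na(dep_delay), dep_delay < 60, carrier %in% top5) %>%
  ggplot(aes(x = dep_delay, color = carrier)) +
  geom_freqpoly(binwidth = 5)
p_freq
```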

## Covariation Between Two Variables

### A Categorical and A Continuous Variable

In [19]:
print(mpg)

# A tibble: 234 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
 2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
 3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
 4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
 5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     co

We can map a categorical variable to, say, the **color** aesthetic in a frequency polygon of a continuous variable.
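For example, with the `mpg` data printed above. The choice of `cty` as the continuous variable, `class` as the categorical one, and a bin width of 2 are assumptions for this sketch.

```r
library(tidyverse)

# `p_poly` is an illustrative name; cty, class, and binwidth = 2 are assumptions.
# One frequency polygon of city mileage per class of car.
p_poly <- ggplot(mpg, aes(x = cty, color = class)) +
  geom_freqpoly(binwidth = 2)
p_poly
```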

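Mapping the `fill` aesthetic to the `class` variable in a histogram does not work well: the bars for the different classes are stacked on top of each other, which makes the classes hard to compare.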

Changing the **fill** aesthetic to the **color** aesthetic improves the appearance but the plot remains problematic.
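The two histogram variants discussed above can be sketched side by side, once with `class` mapped to `fill` and once with it mapped to `color`. The variable `cty` and the bin width are assumptions.

```r
library(tidyverse)

# `p_fill` and `p_color` are hypothetical names; binwidth = 2 is an assumption.
# Bars are stacked by default, so classes are hard to compare in both versions.
p_fill <- ggplot(mpg, aes(x = cty, fill = class)) +
  geom_histogram(binwidth = 2)
p_color <- ggplot(mpg, aes(x = cty, color = class)) +
  geom_histogram(binwidth = 2)
p_fill
p_color
```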

Another thing we can do with a categorical, continuous pair is to use a **boxplot**.

* The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles).
* The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles).
* The lower whisker extends from the hinge to the smallest value no further than 1.5 * IQR from the hinge.
* Data beyond the end of the whiskers are called "outlying" points and are plotted individually.
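The sketch below draws the boxplot and then recomputes the whisker ends for a single class by hand, following the rules in the list. The class `"compact"` is an arbitrary choice, and all object names are illustrative.

```r
library(tidyverse)

# `p_box` is a hypothetical name: city mileage by class of car.
p_box <- ggplot(mpg, aes(x = class, y = cty)) +
  geom_boxplot()
p_box

# Whisker ends for one class ("compact" is an arbitrary example).
x <- mpg$cty[mpg$class == "compact"]
q1 <- unname(quantile(x, 0.25))  # lower hinge
q3 <- unname(quantile(x, 0.75))  # upper hinge
iqr <- q3 - q1

# Largest / smallest observations within 1.5 * IQR of the hinges.
upper_whisker <- max(x[x <= q3 + 1.5 * iqr])
lower_whisker <- min(x[x >= q1 - 1.5 * iqr])

# Anything beyond the whiskers would be plotted as an individual point.
outliers <- x[x > upper_whisker | x < lower_whisker]
```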

To replot with `class` values listed in order of the median value for `cty`, we can use the `reorder()` function.

```
reorder(cat, con, FUN = median)
```

reorders the levels of the categorical variable `cat` according to the continuous variable `con`. The function `median()` is applied to the `con` values corresponding to each fixed level of `cat`, and the levels are sorted by the result. The default value of the `FUN` argument is `mean`.
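Applied to `mpg`, this might look like the sketch below. The axis label is added by hand because the default label would be the whole `reorder(...)` expression; the object names are illustrative.

```r
library(tidyverse)

# `ordered_levels` is a hypothetical name.
# Levels of class, sorted by the median cty within each class.
ordered_levels <- levels(reorder(mpg$class, mpg$cty, FUN = median))
ordered_levels

# Boxplot with classes ordered by median city mileage.
p_reorder <- ggplot(mpg, aes(x = reorder(class, cty, FUN = median), y = cty)) +
  geom_boxplot() +
  labs(x = "class")
p_reorder
```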

We can flip the x and y axes if the names of the categorical levels are long.
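A sketch using `coord_flip()`, which swaps the two axes after the plot has been built:

```r
library(tidyverse)

# `p_flip` is an illustrative name.
# Same ordered boxplot, with class names now along the vertical axis.
p_flip <- ggplot(mpg, aes(x = reorder(class, cty, FUN = median), y = cty)) +
  geom_boxplot() +
  labs(x = "class") +
  coord_flip()
p_flip
```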

Contrast this with faceting the `cty` histogram on the `class` variable.
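For comparison, a faceted histogram might be written as below. The use of `facet_wrap()` and the bin width are assumptions; the lecture may have used a different faceting function.

```r
library(tidyverse)

# `p_facet` is a hypothetical name; binwidth = 2 is an assumption.
# One small histogram of cty per class.
p_facet <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ class)
p_facet
```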

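We can also superimpose the points themselves on top of the boxplot by adding `geom_jitter`. But it is a good idea to hide the boxplot's outliers by setting `outlier.shape = NA` first, because otherwise each outlier is drawn twice: once by the boxplot and once by the jitter layer.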
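A sketch of the combined plot. The jitter settings (`width = 0.2`, `height = 0`) and the transparency are assumptions chosen so that points move only sideways and do not hide the boxes.

```r
library(tidyverse)

# `p_jitter` is an illustrative name; width, height, and alpha are assumptions.
p_jitter <- ggplot(mpg, aes(x = class, y = cty)) +
  geom_boxplot(outlier.shape = NA) +               # hide the boxplot's own outliers
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) # every observation, spread sideways
p_jitter
```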

### Two Categorical Variables

`geom_count` can be used to visualize two categorical variables.
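The notes do not say which variables were used, so the sketch below assumes the `cut` and `color` columns of the `diamonds` data that ships with ggplot2.

```r
library(tidyverse)

# `p_count` is a hypothetical name; diamonds, cut, and color are assumptions.
# The size of each dot shows how many diamonds have that cut/color combination.
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()
p_count
```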

We can compute these numbers using `count()`.
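Continuing with the assumed `diamonds` example, the counts behind the dots could be computed like this:

```r
library(tidyverse)

# `cut_color_counts` is an illustrative name; the variables are assumptions.
# One row per (color, cut) combination, with the number of diamonds in column n.
cut_color_counts <- diamonds %>%
  count(color, cut)
print(cut_color_counts)
```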

These counts can be fed to other geometries.
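For instance, `geom_tile()` can show the same counts as a heat map. This is one possible choice of geometry, not necessarily the one used in the lecture, and it keeps the assumed `diamonds` variables.

```r
library(tidyverse)

# `p_tile` is a hypothetical name; geom_tile() is an assumed choice of geometry.
# Each tile is one (color, cut) pair; its fill encodes the count n.
p_tile <- diamonds %>%
  count(color, cut) %>%
  ggplot(aes(x = color, y = cut, fill = n)) +
  geom_tile()
p_tile
```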

### Two Continuous Variables

We already know a lot about scatterplots. Once you have too many points, you may want to use `geom_bin2d` or `geom_hex`.
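A sketch with `geom_bin2d()`, assuming the `carat` and `price` columns of `diamonds`. `geom_hex()` is used the same way but needs the separate hexbin package, so it is only shown as a comment.

```r
library(tidyverse)

# `p_bin2d` is an illustrative name; carat and price are assumed variables.
# Rectangular bins: the fill colour shows how many diamonds fall in each cell.
p_bin2d <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()
p_bin2d

# Hexagonal alternative (requires the hexbin package to be installed):
# ggplot(diamonds, aes(x = carat, y = price)) + geom_hex()
```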

Sometimes setting the transparency of points using `alpha` can help.
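For example, with an assumed transparency of `1 / 100`, dense regions show up dark while isolated points almost disappear:

```r
library(tidyverse)

# `p_alpha` is a hypothetical name; alpha = 1 / 100 is an arbitrary choice.
p_alpha <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 1 / 100)
p_alpha
```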

Let us try to see what happens if we use a boxplot with 2 continuous variables: `price` as a function of `carat` for the `diamonds` tibble.
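With a continuous `x`, `geom_boxplot()` needs a `group` to know where one box ends and the next begins. The sketch below bins `carat` with `cut_width()`; the width of 0.1 is an assumption.

```r
library(tidyverse)

# `p_box2` is an illustrative name; the bin width 0.1 is an assumption.
# One box of price for each 0.1-carat-wide slice of carat.
p_box2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))
p_box2
```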

If outliers run into each other, you could adjust `outlier.alpha`.
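A sketch of the same grouped boxplot with semi-transparent outliers. The value 0.1 for `outlier.alpha` is an arbitrary choice.

```r
library(tidyverse)

# `p_box3` is a hypothetical name; outlier.alpha = 0.1 is an assumption.
# Faint outlier points make overlapping outliers easier to read.
p_box3 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), outlier.alpha = 0.1)
p_box3
```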