A few things you should keep in mind when working on assignments:

1. Fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere else. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 5.3. Density Estimation.

In this problem, we use Seaborn to create Kernel Density Estimation (KDE) plots of travel time in the flights data.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from nose.tools import assert_equal, assert_is_instance, assert_is_not
from numpy.testing import assert_array_equal, assert_array_almost_equal, assert_almost_equal

We use the `AirTime` column for flights departing from Willard Airport (`CMI`).

In [None]:
df = pd.read_csv('/home/data_scientist/data/2001.csv', encoding='latin-1', usecols=(13, 16))

local = df[df['Origin'] == 'CMI'].dropna()
local = local.drop(['Origin'], axis=1) # we don't need the Origin column anymore.
local = local.reset_index(drop=True) # reset the index and discard the old one.

print(local.head(10))
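
The filtering steps in the cell above can be tried on a small frame without the full `2001.csv` file. The following is a minimal sketch: the `toy` DataFrame and its values are made up for illustration, and only the column names `AirTime` and `Origin` come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data (values are invented).
toy = pd.DataFrame({
    'AirTime': [28.0, np.nan, 31.0, 95.0, 26.0],
    'Origin': ['CMI', 'CMI', 'ORD', 'ORD', 'CMI'],
})

# Keep only flights that depart from CMI, then drop rows with missing values.
toy_local = toy[toy['Origin'] == 'CMI'].dropna()

# Every remaining row has Origin == 'CMI', so the column carries no information.
toy_local = toy_local.drop(['Origin'], axis=1)

# Renumber the rows 0..n-1 and discard the old index labels.
toy_local = toy_local.reset_index(drop=True)

print(toy_local)
```

Without `reset_index(drop=True)`, the surviving rows would keep their original labels (0 and 4 here), which makes positional and label-based indexing disagree.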

## Plot KDE

- Use [seaborn.distplot](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html) to write a function named `plot_distplot()` that plots a histogram, a KDE, and a rug plot, all in the same figure.

- KDE is covered in the [Introduction to Density Estimation](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week7/notebooks/intro2de.ipynb) notebook. See the [Seaborn documentation](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html) for more examples.

- `plot_distplot()` accepts a second argument `bins`, which should be passed to the `bins` parameter of `distplot()`. In other words, you should be able to change the number of bins in the histogram by passing different values of `bins` to `plot_distplot(df, bins=bins)`. For example,
```python
>>> dist_10_bins = plot_distplot(df=local, bins=10)
```
should create a histogram with 10 bins, and
```python
>>> dist_50_bins = plot_distplot(df=local, bins=50)
```
should create a histogram with 50 bins.

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/dist_10_bins.png)

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/dist_50_bins.png)

(Your plots do not have to look exactly like the above figures. As long as your plots look reasonable and your function passes the tests, your solution is correct.)

Note that the histograms look quite different when we change the number of bins, but the KDE is able to smooth out the variations in the histograms.
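
To see why the KDE is less sensitive to binning, it may help to compute both estimates by hand. The sketch below is not how Seaborn implements its KDE and is not a solution to this problem: `gaussian_kde_sketch` is a hypothetical helper, the sample is synthetic rather than the flights data, and the bandwidth uses the normal reference rule of thumb as an assumption.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid` (hypothetical helper for illustration)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Place one Gaussian bump on every sample and evaluate it at each grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # The density estimate is the average of the bumps.
    return bumps.mean(axis=1)

# Synthetic sample standing in for air times (assumption, not real data).
rng = np.random.default_rng(0)
air_time = rng.normal(loc=40.0, scale=8.0, size=500)

# Density-normalized histograms: the bar heights depend on the number of bins.
density_10, edges_10 = np.histogram(air_time, bins=10, density=True)
density_50, edges_50 = np.histogram(air_time, bins=50, density=True)

# Normal reference rule of thumb for the bandwidth (assumed choice).
bandwidth = 1.06 * air_time.std(ddof=1) * len(air_time) ** (-1 / 5)

# Extend the grid past the data so the Gaussian tails are captured.
grid = np.linspace(air_time.min() - 5 * bandwidth,
                   air_time.max() + 5 * bandwidth, 1000)
kde = gaussian_kde_sketch(air_time, grid, bandwidth)

# Trapezoidal rule: a valid density estimate integrates to about 1.
area = np.sum((kde[1:] + kde[:-1]) * np.diff(grid)) / 2.0
print(len(density_10), len(density_50), round(area, 4))
```

The KDE never uses `bins`, so it is the same curve in both figures. Its smoothness is controlled by the bandwidth instead.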

In [None]:
def plot_distplot(df, bins, column="AirTime"):
    """
    Uses seaborn.distplot() to plot a KDE, a histogram, and a rug plot
    all in the same figure.
    
    Parameters
    ----------
    df: A pandas.DataFrame
    bins: The number of bins
    column: The column to use in "df"
    
    Returns
    -------
    A matplotlib.axes.Axes
    """
    
    # YOUR CODE HERE
    
    return ax

In [None]:
dist_10_bins = plot_distplot(df=local, bins=10)

In [None]:
assert_is_instance(dist_10_bins, mpl.axes.Axes)

# test histogram
patches = dist_10_bins.patches
assert_equal(len(patches), 10)

bins_a = [25.,  31.,  37.,  43.,  49.,  55.,  61.,  67.,  73.,  79.,  85.]
freq_a = [0.05453103,  0.07852469,  0.0242911,   0.00604799,  0.00148721,
          0.00069403,  0.00039659,  0.00039659,  0.,          0.00029744]

for i in range(len(patches)):
    assert_equal(patches[i].get_x(), bins_a[i])
    assert_almost_equal(patches[i].get_height(), freq_a[i])

# test kde + rug plots
lines = dist_10_bins.lines
assert_equal(len(dist_10_bins.lines), len(local) + 1) # 1 kde + rug plots

# test kde
kdex, kdey = dist_10_bins.lines[0].get_xydata().T
assert_almost_equal(np.trapz(kdey, kdex), 1.0, 4)

# test rug plots
for i in range(len(local)):
    rugx, rugy = dist_10_bins.lines[i + 1].get_xydata().T
    assert_array_equal(rugx, local.iloc[i, 0])
    assert_equal(rugy[1] - rugy[0] > 0, True)
    
# check label texts
# compare by value rather than by identity
assert_equal(len(dist_10_bins.title.get_text()) > 0, True,
    msg="Your plot doesn't have a title.")
assert_equal(dist_10_bins.yaxis.get_label_text() != '', True,
    msg="Change the y-axis label to something more descriptive.")

In [None]:
dist_50_bins = plot_distplot(df=local, bins=50)

In [None]:
assert_is_instance(dist_50_bins, mpl.axes.Axes)
# test histogram
patches = dist_50_bins.patches
assert_equal(len(patches), 50)