# Plotting Distributions

Now it is time to apply and expand our knowledge of plotting in matplotlib to the visualization of distributions. This is an important part of exploring datasets to understand the composition of the various fields. 

### Learning Objective

At the end of this notebook you will be able to:

* Implement binning of continuous data
* Create histograms to visualize distributions
* Build bar charts from a dataset
* Visualize two distributions in one chart
* Build complex and composite charts from a dataset
* Split charts using categorical variables

We'll be using Pandas DataFrames as the basis for these exercises, as this is a common use case in exploratory data analysis (EDA).
We will continue to use Matplotlib and introduce Seaborn as an additional way to create nice charts.

We will be working with data from the Bureau of Transportation Statistics of the U.S. Department of Transportation.

The specific dataset we are working with can be downloaded here:
https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr


In [None]:
import matplotlib # Imports entire matplotlib library
import matplotlib.pyplot as plt # Imports the plotting interface and gives a shortcut
import numpy as np # we will use numpy to generate the data used in our sample charts
import pandas as pd
import seaborn as sns

### What is a histogram?
It is a common chart type that shows how the values of the observations in a dataset are distributed. In other words, it shows how many observations belong to each value or range of values.

Here is a simple example of 1000 random observations

In [None]:
x = np.random.randn(1000)
plt.hist(x);

This looks like a nice bell curve because the NumPy `randn` function draws its samples from a standard normal distribution.
We will look at a real world dataset of flight delays to see how they are distributed.

In [None]:
# Path of the file to read
flights_filepath = '../data/586855489_T_ONTIME_REPORTING_April_2020_2.csv'

# Read the file into a DataFrame named df
df = pd.read_csv(flights_filepath, header=0)

# Show the shape (rows, columns) and a summary of the columns
display(df.shape)
df.info() # info() prints its summary and returns None, so it does not need display()



In [None]:
df.describe().round(2)

In [None]:
# Clean up dataframe
df.drop(columns='Unnamed: 12', inplace=True) # drop the extra unnamed column

# Change data type for date
df['FL_DATE'] = pd.to_datetime(df['FL_DATE'])
df.info() # info() prints its summary itself

## Histograms

A histogram shows the distribution of the data along the x-axis by splitting the values into 'bins' and counting the observations in each bin. Bins are equally sized and span the range of x values. You can specify the number of bins you would like, and the bin width is then chosen by dividing the range accordingly.
The cell below is an easy way of showing a histogram, but it is not a great visualization: it is small and full of chart junk.
Fortunately we can use the object-oriented interface of Matplotlib to customize the chart by passing our own axes to the `ax` parameter of the `hist()` method, which we will show next.

In [None]:
df.hist(column='AIR_TIME', bins=20)
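
The binning that `hist()` performs can also be reproduced by hand, which makes it easier to see what a bin actually is. The following is a minimal sketch, assuming a handful of made-up air times (they are not taken from the flights dataset): `np.histogram` computes equally sized bins and their counts, and `pd.cut` assigns each observation to one of those bins.

```python
import numpy as np
import pandas as pd

# Hypothetical air times in minutes (illustrative values, not from the dataset)
air_time = pd.Series([35, 48, 52, 61, 75, 90, 118, 140, 190, 235])

# np.histogram splits the range [min, max] into equally sized bins
counts, edges = np.histogram(air_time, bins=4)
print(edges)   # 5 edges define 4 bins
print(counts)  # number of observations in each bin

# pd.cut labels every observation with the bin it falls into
binned = pd.cut(air_time, bins=edges, include_lowest=True)
per_bin = binned.value_counts().sort_index()
print(per_bin)
```

Note that `np.histogram` and `pd.cut` treat values that sit exactly on a bin edge differently by default, so the two counts only agree here because no value lies on an interior edge.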

In [None]:
# Here we introduce a technique for making 'chained' code more readable, understandable and editable

# Count the number of observations with flight time above 420 minutes
(
    df    # We start with our dataframe
    .loc[df['AIR_TIME'] > 420] # filter it down to rows that have flight time more than 420 minutes
    .count()['AIR_TIME']  # we count the resulting rows
)

In [None]:
# Note the above is equivalent to the following
df.loc[df['AIR_TIME'] > 420].count()['AIR_TIME']
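
A third way to get the same count is to sum the boolean mask directly, since `True` counts as 1. This is a small sketch on a made-up dataframe (`toy` and its values are assumptions for illustration), comparing it with the chained version:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataframe
toy = pd.DataFrame({'AIR_TIME': [95, 430, 510, 120, None, 421]})

# Chained version, written the same way as above
n_chained = (
    toy
    .loc[toy['AIR_TIME'] > 420]  # keep only the long flights
    .count()['AIR_TIME']         # count the non-missing AIR_TIME values
)

# Summing the boolean mask counts the True values directly
n_mask = (toy['AIR_TIME'] > 420).sum()

print(n_chained, n_mask)
```

Missing values compare as `False`, so they are left out of both counts.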

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

xmax = 420 # based on the above we can limit the x-axis and leave out the flights of 7 hours or more
bins = 21 # 21 bins over the range 0-420 minutes gives 20-minute bins

(
    df
    .loc[df['AIR_TIME'] < xmax]
    .hist(ax=ax, 
          column='AIR_TIME',
          bins=bins, 
          range=(0, xmax), # fix the bin edges so each bin is exactly xmax/bins minutes wide
          edgecolor="black", 
          linewidth=1, color='w')
)

# The grid can also be switched off with grid=False in the hist method
ax.grid(False)
ax.set_title('Histogram of Air Travel Time')
ax.set_ylabel('Count of Flights')
ax.set_xlabel(f'Air Travel Time  ({int(xmax/bins)} minute bins)');


In [None]:
# Bar Chart

fig, ax = plt.subplots(figsize=(10,5))
(
    df
    .groupby('OP_UNIQUE_CARRIER') # aggregate data to level of airline
    .count()['FL_DATE'] # count the observations
    .sort_values() # sort ascending, so that barh draws the largest bar at the top
    .plot.barh(ax=ax) # make a horizontal bar chart on our axes
)
ax.set_title('Flight Volume by Airline in the USA April 2020')

ax.set_xlabel('Count of Flights')
ax.set_ylabel('Airline Unique Code');
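
The group-and-count step that feeds a bar chart like this can often be written more compactly with `value_counts()`. Here is a minimal sketch on a made-up dataframe (`toy` and its values are assumptions, not the real flights data):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the flights dataframe
toy = pd.DataFrame({
    'OP_UNIQUE_CARRIER': ['AA', 'DL', 'AA', 'UA', 'DL', 'AA'],
    'FL_DATE': pd.to_datetime(['2020-04-01'] * 6),
})

# groupby + count, following the same pattern as the bar chart
by_group = toy.groupby('OP_UNIQUE_CARRIER').count()['FL_DATE'].sort_values()

# value_counts counts the rows per carrier in one step
by_value_counts = toy['OP_UNIQUE_CARRIER'].value_counts().sort_values()

print(by_group)
print(by_value_counts)
```

The two are not always identical: `count()` only counts rows where `FL_DATE` is not missing, while `value_counts()` counts every row that has a carrier code.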


## Seaborn
Seaborn is built on top of Matplotlib and has some nice features for plotting distributions, such as drawing the kernel density estimate (KDE), a smoothed version of the histogram, directly on top of the histogram.

We can explore different ways to display two distributions at the same time. We can determine roughly if the data behave similarly across the two dimensions.

In [None]:
# As Subplots
fig, ax = plt.subplots(2,1, figsize=(10,5))
fig.tight_layout(h_pad=4)

fig.suptitle('Distributions of Flight Performance')
plt.subplots_adjust(top=0.90)

xmax = 120 # Limit our misery and ignore delays over 2 hours
bins = 60

sns.histplot(ax = ax[0], 
             data = df.loc[df['ARR_DELAY'] < xmax],
             x = 'ARR_DELAY', 
             bins = bins,
             kde = True)
sns.histplot(ax = ax[1], 
             data = df.loc[df['AIR_TIME'] < 420], 
             x = 'AIR_TIME', 
             bins = bins, 
             kde = True);



### Overlapping Charts
By plotting multiple charts on a single axes it is easy to compare distributions.

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

xmax = 120
bins = 75
sns.histplot(ax = ax,
             data = df.loc[df['ARR_DELAY'] < xmax],
             x = 'ARR_DELAY',
             bins = bins,
             kde = True, 
             color = 'r')

sns.histplot(ax = ax, 
             data = df.loc[df['DEP_DELAY'] < xmax], 
             x = 'DEP_DELAY', 
             bins = bins,
             kde = True, 
             color = 'b', 
             alpha = 0.5); # Setting the alpha creates transparency

ax.set_xlabel('Delay in Minutes')
ax.legend(['Arrival Delay', 'Departure Delay']);

## Criticism of Histograms
The decisions made while building a histogram lead to different results, so the histogram is not an unbiased method of displaying data.
The output changes with the bin size and the treatment of extreme values. The interpretation of the result therefore depends on the choices made while creating the chart.  
Just check the number of parameters that can be adjusted in the `sns.histplot()` function to see how much influence your choices have on the output.

In general, histograms have the following issues:
* Changes to the number of bins can radically change the shape.
* The maximum and minimum values (and their treatment) will change the shape
* Consideration of specific individual observations is not possible
* The difference between discrete and continuous data is lost
* Comparing distributions is hard due to the impact of choices made above
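
The first point can be checked without drawing anything. The sketch below bins one synthetic sample (two made-up clusters of values, not real delays) with different numbers of bins and prints the resulting counts. The data never changes, only the binning does.

```python
import numpy as np

# Synthetic sample with two clusters (an assumption for illustration only)
rng = np.random.default_rng(seed=0)
sample = np.concatenate([rng.normal(-5, 3, 500), rng.normal(10, 3, 500)])

results = {}
for n_bins in (2, 5, 40):
    counts, edges = np.histogram(sample, bins=n_bins)
    results[n_bins] = counts
    # Same 1000 observations each time, summarized at a different resolution
    print(f'{n_bins:>2} bins, width {edges[1] - edges[0]:.1f}: {counts}')
```

With very few bins the two clusters may blur into each other, and with many bins the counts become noisy. Which of these views is 'right' is a judgment call.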


In [None]:
# As Subplots
fig, ax = plt.subplots(2,2, figsize=(20,5))
fig.tight_layout(h_pad=4)

fig.suptitle('4 histograms of the same data')
plt.subplots_adjust(top=0.90)

xmax = 120 # Reasonable filter of extreme values
xmax_2 = 500 # Removes only the most extreme values
bins_1 = 60 
bins_2 = 150
sns.histplot(ax = ax[0,0], 
             data = df.loc[df['ARR_DELAY'] < xmax], # ignore outliers
             x = 'ARR_DELAY', 
             bins = bins_1) # fewer bins
sns.histplot(ax = ax[0,1], 
             data = df, # no outlier removal
             x = 'ARR_DELAY', 
             bins = bins_1) # fewer bins

sns.histplot(ax = ax[1,0], 
             data = df.loc[df['ARR_DELAY'] < xmax], # ignore outliers
             x = 'ARR_DELAY', 
             bins = bins_2) # more bins
sns.histplot(ax = ax[1,1], 
             data = df.loc[df['ARR_DELAY'] < xmax_2], # remove only the most extreme values
             x = 'ARR_DELAY', 
             bins = bins_2); # more bins



### Cumulative Distribution Plots to the rescue?
https://towardsdatascience.com/6-reasons-why-you-should-stop-using-histograms-and-which-plot-you-should-use-instead-31f937a0a81c  
According to this article, we should use cumulative distribution plots (CDPs) instead. That is worth considering, but histograms are well understood by audiences and the classic normal curve is instantly recognisable.
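
To make the idea concrete: an empirical cumulative distribution shows, for each value, the share of observations that are less than or equal to it, and it needs no bins at all. The following is a minimal sketch with NumPy on a few made-up delay values (assumptions for illustration only):

```python
import numpy as np

# Hypothetical arrival delays in minutes (not taken from the dataset)
delays = np.array([-12, -5, -5, 0, 3, 8, 15, 15, 42, 120])

# Sort the values; the i-th sorted value has i/n of the data at or below it
x = np.sort(delays)
y = np.arange(1, len(x) + 1) / len(x)

# Any point on the curve can be read off directly,
# for example the share of flights that are at most 15 minutes late
share_15 = np.mean(delays <= 15)
print(share_15)
```

Plotting `y` against `x` as a step line gives the cumulative distribution plot. Seaborn's `ecdfplot` function draws this kind of curve directly from a dataframe column.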


### Exercise:
Can you revisit our overlapping chart example of ARR_DELAY and DEP_DELAY and plot the CDP using Seaborn?
Does this solve all the issues as promised?


## Composite Charts
Seaborn has a number of composite chart types that allow the comparison of multiple data columns using a combination of charts. 

For example, we can compare the distributions of two columns with the jointplot, which is a built-in Seaborn plot type. It shows a histogram for each column and an additional chart, such as a scatter plot, that displays the distribution of observations across the two dimensions at the same time.



In [None]:
df_plot = (
            df
            .loc[df['ARR_DELAY'] < 80]
            .loc[df['AIR_TIME'] < 240]
            .sample(n=1000) # take a small sample of the data for purpose of visualization
            )

sns.jointplot(data = df_plot, 
                  x = 'ARR_DELAY', 
                  xlim = (-80,60), 
                  y = 'AIR_TIME', 
                  ylim = (0,240), 
                  kind = 'scatter');

## Figure Level Functions

We have previously introduced the two ways Matplotlib creates figures: the MATLAB-style (pyplot) interface and the object-oriented interface. In both interfaces some methods control the figure (`fig.`) and others control the axes (`ax.`).
Some Seaborn plot functions work at the axes level: they draw onto a Matplotlib axes object, just like Matplotlib's own chart methods.

When you use Seaborn's composite charts, Seaborn generates the figure (grid) and the axes it needs. To customize such a chart you need to get hold of these objects and apply the relevant methods to them.



As we did in the first notebook, we can use the `type()` function to get an idea of what objects we are working with.

In [None]:
# KDE functions can be quite CPU intensive for large datasets
df_plot = (
            df
            .loc[df['ARR_DELAY'] < 80]
            .loc[df['AIR_TIME'] < 240]
            .sample(n=1000))

df_plot["WEEKDAY"] = df_plot["FL_DATE"].apply(lambda x: x.weekday())


grid = sns.jointplot(data = df_plot, 
                      x = 'ARR_DELAY', 
                      xlim = (-80,60), 
                      y = 'AIR_TIME', 
                      ylim = (0,240), 
                      kind = 'scatter')

grid.plot_joint(sns.kdeplot) # We can add an additional chart to the joint axes of the grid

Here is a list of the composite chart type objects from Seaborn:  
PairGrid, FacetGrid, JointGrid

Note that `jointplot()` created a JointGrid object.

## Examples of working with seaborn
Work carefully through the below code and examine the output to see how you can customise the figure and charts

In [None]:
# Set up dataframe
df_plot = (
            df
            .loc[df['ARR_DELAY'] < 80]
            .loc[df['AIR_TIME'] < 240]
            .loc[df['OP_UNIQUE_CARRIER'].isin(['UA', 'OH', 'OO', 'AA', 'DV', 'MQ', 'DL', 'F9'])] # List of Carriers we want to view in more detail
            .copy() # work on a copy so that adding a column below does not trigger a SettingWithCopyWarning
)
df_plot["WEEKDAY"] = df_plot["FL_DATE"].apply(lambda x: x.weekday()) # add a category for day of week (0 = Monday)

# Create a grid object with columns and rows that split based on categories
grid = sns.FacetGrid(df_plot,
                     row = 'OP_UNIQUE_CARRIER',
                     col = 'WEEKDAY',
                     despine = True, # Removes the top and right spines of the subplots
                     margin_titles = True) # Displays row and column titles only on margins instead of on each axes

# Fill the grid with subplots
grid.map(sns.scatterplot, # Fill with scatterplots
         'ARR_DELAY', # x-axis
         'DEP_DELAY', # y-axis
         color = 'k', # color things black 
         s = 100, # set the size of the dots
         alpha = 0.1) # set the transparency           

ax1 = grid.axes_dict[('UA', 4)] # We can get ahold of the axes object to work with it directly by referencing the combination of row and column values
ax1.set_visible(False) # For some reason hide this plot

ax2 = grid.axes_dict[('MQ', 2)]
ax2.set_facecolor('y') # Change background color for this 'interesting' chart
ax2.set_title('This is yellow'); # Just in case people missed it


The above is a deconstruction of the FacetGrid. With the built-in `pairplot` and other Seaborn functions, many specific combinations of charts can be generated without so much manual coding.

## Next Steps
We can use the groupby function to aggregate and explore at a metric level.

In [None]:
# We can also aggregate the data up to 'day' level so we have one row per day. 
# We use the mean aggregation to summarize the values.
# These new summaries can then be plotted to show the relationships between days

# numeric_only=True skips text columns, which recent pandas versions refuse to average
df_plot_2 = df.groupby(by='FL_DATE').mean(numeric_only=True)
g = sns.jointplot(data = df_plot_2, 
                  x = 'ARR_DELAY', 
                  y = 'DEP_DELAY',
                  kind = 'scatter')
g.plot_joint(sns.kdeplot)

In [None]:
# We can view the table to see what the underlying data now looks like. Remember the values are now the average of all flights on that day
(
    df_plot_2
    .sort_values(by='ARR_DELAY')
    .round(2)
)

### Considerations when grouping
The grouping has some impacts that we may not have wanted or expected.  
What has happened to the categorical columns?  
What about the 'ID' columns, are these still valid?
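
A toy example makes these side effects easy to inspect. The sketch below uses a made-up dataframe (`FLIGHT_ID` is a hypothetical identifier column, and all values are assumptions), first taking the mean of everything and then choosing an aggregation per column with named aggregation:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'FL_DATE': pd.to_datetime(['2020-04-01', '2020-04-01', '2020-04-02']),
    'OP_UNIQUE_CARRIER': ['AA', 'DL', 'AA'],
    'FLIGHT_ID': [101, 205, 309],   # hypothetical identifier column
    'ARR_DELAY': [10.0, -4.0, 6.0],
})

# Mean of everything: the text column is dropped and the ID column is averaged
daily = toy.groupby('FL_DATE').mean(numeric_only=True)
print(daily)

# Named aggregation: pick a sensible summary for each column
daily_named = toy.groupby('FL_DATE').agg(
    mean_arr_delay=('ARR_DELAY', 'mean'),
    n_flights=('ARR_DELAY', 'size'),
    n_carriers=('OP_UNIQUE_CARRIER', 'nunique'),
)
print(daily_named)
```

Choosing the aggregation per column keeps meaningless numbers, such as an average of identifiers, out of the summary table.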

### Exercise
Are there other relationships that could be explored?

1. Use Bar charts to display the airlines based on one aggregated measure
2. Build multiple bar charts in either subplots or a single panel to explore how airlines perform across a few metrics