## COMP 4433: Week 5 Live Session

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from matplotlib import gridspec

In [None]:
# %matplotlib inline

__First a note on style and theme__

We'll discuss more next week.

#### First, the reset_defaults() function will restore all RC params (runtime configuration parameters) to their default values.

In [None]:
sns.reset_defaults()

#### The style parameters from seaborn work with the matplotlib rcParams system to control the general appearance of plots.  We'll discuss two primary functions: axes_style() and set_style()

#### The seaborn function, axes_style(), has two main purposes.

1. To capture current settings, either from a passed rcParams dict or from a predefined style.

2. For use as a context manager.

#### Below we save off the default style settings.  Calling this function with no args captures the current values.

In [None]:
defaults = sns.axes_style()

defaults

#### axes_style() can also be employed as a context manager.

In [None]:
# here we plot with the default style.

foo = ['A', 'B', 'C', 'D']
bar = [5, 6, 8, 9]

sns.barplot(x=foo, y=bar)
plt.show()

In [None]:
# here we use axes_style() in a with statement to avoid altering global defaults.

with sns.axes_style('dark'):
    sns.barplot(x=foo, y=bar)
    plt.show()

#### axes_style() accepts strings indicating a predefined style (dark, darkgrid, white, whitegrid, ticks) or a dictionary of style parameters. This function won't modify your global style settings. It will return a dict of the supported parameters.
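As a quick check of the behavior described above, here is a minimal sketch that compares one global rcParam before, during and after a context-managed style. The variable names (`before`, `inside`, `after`) are illustrative only, and `axes.facecolor` is just one convenient parameter to watch.

```python
import matplotlib as mpl
import seaborn as sns

# record a global setting before touching any style
before = mpl.rcParams['axes.facecolor']

# axes_style() only returns a dict-like object; nothing global changes yet
dark = sns.axes_style('dark')
unchanged = mpl.rcParams['axes.facecolor'] == before

# inside the with block the style is applied temporarily
with sns.axes_style('dark'):
    inside = mpl.rcParams['axes.facecolor']

# after the block the original value is restored
after = mpl.rcParams['axes.facecolor']

print(unchanged, inside == dark['axes.facecolor'], after == before)
```

If all three comparisons print True, the style was only active inside the with block.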

#### For example, let's access the default matplotlib RC params.  Below we'll confirm that all of the parameters supported by the seaborn axes styles are also RC params.

In [None]:
import matplotlib as mpl

rcparam_keys = list(mpl.rcParams.keys())

seaborn_style_keys = list(defaults.keys())

print('RCparam keys:', len(rcparam_keys))
print('Seaborn style keys:', len(seaborn_style_keys))

for i in seaborn_style_keys:
    print(i in rcparam_keys)

#### If we pass the entire RC params dict to the style param, we'll be returned a seaborn style dict with only the supported parameters.

#### We'll also use both the _style_ and _rc_ keyword arguments in our function call.  We pass a dictionary to the _rc_ parameter of axes_style() to override one (or more if we wish) of the settings defined by the _style_ argument.

In [None]:
default_mpl = sns.axes_style(style=mpl.rcParams, rc={'axes.facecolor': '#b4eb34',
                                                    'axes.spines.left': False,
                                                    'axes.spines.right': False,
                                                    'axes.spines.top': False})

default_mpl 

In [None]:
with sns.axes_style(default_mpl):
    sns.barplot(x=foo, y=bar)
    plt.show()

#### Here we use the rc argument to override some grid-related attributes in another context-managed plot.

In [None]:
with sns.axes_style(default_mpl, rc={'axes.grid': True,
                                    'grid.color': '#0a0909'}):
    
    sns.barplot(x=foo, y=bar)
    plt.show()

#### We'll discuss set_style() more next week, but it can be used to modify global defaults for all plots.  set_style() takes the same arguments as axes_style(). Here's an example.

In [None]:
sns.set_style('dark') # using set_style() to modify global defaults

sns.axes_style() # inspecting current style specs

#### Now we've changed the style setting globally

In [None]:
sns.barplot(x=foo, y=bar)
plt.show()

#### Here we use set_style() along with the default parameters that we saved off earlier to change back to our defaults globally. 

In [None]:
sns.set_style(defaults) # or sns.set_style(style=defaults) 

In [None]:
sns.axes_style()

#### figure-level vs axes-level functions in Seaborn

It's useful to review the documentation on this concept.
This will help you fully incorporate seaborn plotting
with matplotlib object-oriented approaches.

Each seaborn plotting module has a single figure-level function: relplot() for relational plots, displot() for distributions and catplot() for categorical plots.
Figure-level functions interface with matplotlib through a seaborn object (almost always a FacetGrid).  Axes-level functions return a matplotlib Axes object.

Figure-level functions can produce their associated axes-level plots by specifying the 'kind' parameter.  There are advantages and disadvantages to both.

Axes-level plots are easy to use and are matplotlib objects, so they're simple to incorporate with other matplotlib functionality.

In [None]:
diamonds = sns.load_dataset('diamonds')

In [None]:
# histplot() is an axes-level function
hist = sns.histplot(data=diamonds, x='price', hue='cut', multiple='stack')

print(type(hist))

# similar to the .gca() method to access current axes of a figure, we can
# use .gcf() to get (or generate...if it doesn't exist) the current figure from a plot.

fig = plt.gcf() 
print(type(fig))

# we can employ any of the matplotlib Axes methods on our Seaborn axes-level plots.
hist.set_title('Price Distribution by Cut', fontsize=12)
plt.xticks(rotation=45)

hist.spines.left.set_visible(False)
hist.spines.top.set_visible(False)
hist.spines.right.set_visible(False)
hist.spines.bottom.set_color('gray')

# now that we have access to the figure, we can also utilize figure-level methods
fig.suptitle('DIAMONDS', fontsize=14)

# setting figure width and height with figure methods
fig.set_figwidth(8)
fig.set_figheight(4)

plt.show()

In [None]:
# displot() is a figure-level plotting function.

hist2 = sns.displot(data=diamonds, 
            x='price', hue='cut', multiple='stack', 
            kind='hist',
            facet_kws=dict(legend_out=False),
            height=4, aspect=1.5)

# the move_legend() function works on matplotlib axes objects as well as seaborn objects
# if we use this function on a facet grid, we'll get extra white space to the right of the plot
# to prevent this you can set legend_out = False on the facet_grid (see above)
sns.move_legend(hist2, 'upper right')

print(type(hist2.figure))
print(type(hist2))

# while hist2 is a seaborn object it has components that are matplotlib objects
# below we access the figure object of the facet grid
hist2.figure.suptitle('DIAMONDS', fontsize=14)

# setting figure width and height with figure methods
hist2.figure.set_figwidth(8)
hist2.figure.set_figheight(4)

plt.show()

#### One advantage of the figure-level plotting functions is that you can facet them from your function call.

In [None]:
# Below we use the figure-level method and facet the plots by cut

sns.displot(data=diamonds, x='price', hue='cut', kind='hist', col='cut')

#fig = plt.gcf()
#fig.set_figwidth(10)
#fig.set_figheight(4)

plt.show()

__Axes-level plots can be used to build complex matplotlib plots
using an object-oriented approach.
Figure-level plotting functions can't be used to draw on subplot axes.__

In [None]:
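# for example, this won't draw on ax: depending on your seaborn version,
# displot() either raises an error or warns and ignores the ax argument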
fig, ax = plt.subplots(figsize=(8, 6))
sns.displot(data=diamonds, x='price', hue='cut', kind='hist', ax=ax)

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(8, 4))

sns.histplot(data=diamonds, x='price', hue='cut', multiple='stack', ax=axs[0])
sns.kdeplot(data=diamonds, x='price', hue='cut', multiple='stack', ax=axs[1])

axs[0].set_title('Histogram')
axs[1].set_title('KDE')

plt.tight_layout()
plt.show()

__More on integrating fig, ax object-oriented pyplot approach with sns.__

Specify ax argument in call to axes-level sns plotting functions.

Don't forget about the shape of the ndarray axes objects.
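As a reminder of what "shape" means here, this short sketch shows how the array returned by plt.subplots() changes with the requested layout. The figure and variable names are illustrative only.

```python
from matplotlib import pyplot as plt

# a single row (or column) of subplots comes back as a 1-D array
fig_a, axs_a = plt.subplots(1, 2)
print(axs_a.shape)

# a grid with several rows and columns comes back as a 2-D array
fig_b, axs_b = plt.subplots(2, 3)
print(axs_b.shape)

# squeeze=False always returns a 2-D array, even for one row
fig_c, axs_c = plt.subplots(1, 2, squeeze=False)
print(axs_c.shape)

# .flat iterates over every Axes regardless of the array's shape
for k, ax in enumerate(axs_b.flat):
    ax.set_title(f'panel {k}')

# close the demo figures so they don't linger
for f in (fig_a, fig_b, fig_c):
    plt.close(f)
```

So `axs[0]` works for a 1-D layout, while a 2-D layout needs `axs[row, col]` when passing an Axes to the `ax` argument of a seaborn function.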

In [None]:
mpg = sns.load_dataset('mpg')

In [None]:
#sns.set()
sns.set_style('darkgrid')

In [None]:
fig = plt.figure(figsize=(12, 9))

gs = fig.add_gridspec(3, 4)
ax0 = plt.subplot(gs[:2, :2])
ax1 = plt.subplot(gs[2:, :2])
ax2 = plt.subplot(gs[:3, 2:])

sns.scatterplot(data=mpg, x='acceleration',
                y='mpg', hue='cylinders',
                ax=ax0)

sns.histplot(data=mpg, x='mpg', bins=20,
             ax=ax1)

sns.boxplot(data=mpg, x='model_year', y='mpg',
            ax=ax2)

ax2.set_xlabel('model year')

fig.suptitle('Automobile Data (1970-1982)', fontsize=18)

plt.tight_layout()

plt.show()

__Multi-Classifier Consensus Density Plot__

I made this name up, but the plot itself is occasionally useful,
especially if you're trying to assess differential classification,
identify hard-to-classify cases or determine voting in an ensemble model.

A similar technique is employed in the week 6 asynch for missingness.

In [None]:
# some random data meant to simulate the application of some different classifiers
np.random.seed(57)

pid = np.arange(8000, 9000).astype(str) # 1000 consecutive ids for records
outcome = np.random.randint(0, 2, 1000)

classifier_1_y_hat = np.round((outcome + 0.5) * np.random.rand(1000)).astype(int)
classifier_2_y_hat = np.round((outcome + 0.4) * np.random.rand(1000)).astype(int)
classifier_3_y_hat = np.round((outcome + 0.3) * np.random.rand(1000)).astype(int)

In [None]:
# building a dataframe
df = pd.DataFrame(list(zip(pid, outcome, classifier_1_y_hat, classifier_2_y_hat, classifier_3_y_hat)),
                  columns=['id', 'outcome', 'c_1', 'c_2', 'c_3'])

In [None]:
df.head()

In [None]:
# we could use a boolean mask here, but I'm creating new columns to assess classifications
# (True where the classifier's prediction matches the outcome)
for i, j in zip(['c_1', 'c_2', 'c_3'], ['c_1_bool', 'c_2_bool', 'c_3_bool']):
    df[j] = df[i] == df['outcome']

In [None]:
# Bool value heatmap...classification results unsorted.
sns.heatmap(df[['c_1_bool', 'c_2_bool', 'c_3_bool']], cbar=True, cmap='vlag')
plt.tight_layout()
plt.show()

In [None]:
# This might be easier to see if we sort by the best classifier.
# Let's inspect the average to see proportion of correct classifications.
df[['c_1_bool', 'c_2_bool', 'c_3_bool']].astype(int).describe()

In [None]:
# we'll sort the values to achieve a better grouping of classifier consensus.

df.sort_values(by=['c_1_bool', 'c_2_bool', 'c_3_bool'],
              ascending=[True, True, True],
              inplace=True)

sns.heatmap(df[['c_1_bool', 'c_2_bool', 'c_3_bool']], cbar=True, 
            cmap=sns.diverging_palette(360, 145, as_cmap=True))
plt.tight_layout()
plt.show()
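To complement the heatmap, the consensus can also be summarized numerically. The sketch below uses a tiny hand-made frame (`flags`) with the same column names as above; the values are invented for illustration and are not the simulated data.

```python
import pandas as pd

# toy correctness flags for three hypothetical classifiers
flags = pd.DataFrame({'c_1_bool': [True, True, False, False, True],
                      'c_2_bool': [True, False, False, True, True],
                      'c_3_bool': [True, True, False, False, False]})

# number of classifiers that got each record right (0 to 3)
n_correct = flags.sum(axis=1)

# records that every classifier missed are candidates for "hard to classify"
hard_cases = flags.index[n_correct == 0].tolist()

# how many records fall at each level of consensus
consensus_counts = n_correct.value_counts().sort_index()

print(hard_cases)
print(consensus_counts.to_dict())
```

Sorting by `n_correct` instead of by the individual flag columns would be another reasonable way to order the rows of the heatmap.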

__Facet Grids__

Note that we can facet figure-level plots by specifying column and row args.
Also note that we can essentially replicate any axes-level plot from the figure-level
method by specifying the kind argument. 

In [None]:
# basic figure-level relational plot
sns.relplot(data=mpg, x='mpg', y='horsepower', col='origin')
plt.show()

In [None]:
# specifying kind for relplot()
sns.relplot(data=mpg, x='mpg', y='horsepower', col='origin', kind='line', errorbar=None)
plt.show()

In [None]:
sns.catplot(data=mpg, x='model_year', y='mpg', col='origin', kind='strip') 
plt.show()

In [None]:
sns.catplot(data=mpg, x='model_year', y='mpg', row='origin', kind='violin',
                     height=6, aspect=1.5)


plt.show()

In [None]:
sns.set_style('whitegrid')

__Specifying a FacetGrid__

Note the availability of the methods FacetGrid.map() and FacetGrid.map_dataframe() for applying plotting functions.


In [None]:
g = sns.FacetGrid(diamonds, col='cut', row='color')

# mapping an axes-level plotting method
# and specifying a member of self.data (a feature from diamonds in this case)
# here g.data will return a reference to our diamonds dataframe.

g.map(sns.kdeplot, 'price') 
sns.despine(left=True, bottom=True)
plt.show()

#### map_dataframe() is very similar to map(), but it's designed for use with plotting functions that accept the data keyword argument and allow you to access column values by passing strings.

In [None]:
g2 = sns.FacetGrid(mpg, col='origin')
g2.map_dataframe(sns.scatterplot, 'mpg', 'acceleration', hue='model_year')
g2.add_legend()
plt.show()

#### Below, using .map() gives us the same result as above; however, we pass slightly different arguments.  The args are column names from self.data (the mpg dataframe in this case), and all kwargs are passed to the function (sns.scatterplot in this case).

In [None]:
g2 = sns.FacetGrid(mpg, col='origin')
g2.map(sns.scatterplot, 'mpg', 'acceleration', hue=mpg['model_year'])
plt.show()

#### Same result...

In [None]:
g2 = sns.FacetGrid(mpg, col='origin')
g2.map(sns.scatterplot, 'mpg', 'acceleration', data=mpg, hue='model_year')
plt.show()

#### A modified plot using scatter and line kws.

In [None]:
g3 = sns.FacetGrid(mpg, col='origin')
# scatter_kws and line_kws are additional args that get passed to plt.scatter and plt.plot
g3.map_dataframe(sns.regplot, 'mpg', 'acceleration', scatter_kws={"color": "black"}, line_kws={"color": "red"})
plt.show()

In [None]:
# note that lmplot is figure-level while regplot is axes-level
# lmplot combines facetgrids with elements of regplot()

g4 = sns.lmplot(data=mpg, x='mpg', y='acceleration', col='origin', hue='origin')
plt.show()

In [None]:
# note that lmplot is the figure-level equivalent of regplot

g5 = sns.lmplot(data=mpg, x='mpg', y='acceleration', hue='origin')
plt.show()

__pairplots__

These were tricky for us to achieve in matplotlib but easy with seaborn.
These will detect and operate only on numeric columns.

In [None]:
sns.pairplot(mpg)
plt.show()
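To preview which columns are likely to end up in the grid, you can list the numeric columns with pandas first. This is only an approximation of seaborn's own detection, and the frame below (`toy`) is a made-up stand-in for mpg.

```python
import pandas as pd

# toy stand-in with a mix of numeric and string columns
toy = pd.DataFrame({'mpg': [18.0, 15.0, 36.1],
                    'cylinders': [8, 8, 4],
                    'origin': ['usa', 'usa', 'japan'],
                    'name': ['a', 'b', 'c']})

# columns with a numeric dtype; string columns are left out
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```

If the full grid is too large, pairplot() also accepts a `vars` argument to restrict it to a chosen list of columns.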

In [None]:
sns.reset_orig()

#### More on Gridspec and customizing figure layouts

Subplots is probably the most common approach for specifying multiple axes,
but as we've seen there are options that provide greater flexibility.
Gridspec allows us to set the geometry in terms of #rows and #cols.

#### a basic 2x2 subplot space

In [None]:
fig1, axes1 = plt.subplots(2, 2, figsize=(9, 6))
plt.show()

#### The above is more cumbersome to achieve with gridspec...

but notice the flexibility in terms of achieving differential axes sizing.

Note: constrained layout is similar to tight_layout() but needs
to be activated before axes are added to a figure.

In [None]:
fig2 = plt.figure(constrained_layout=True)

gs = gridspec.GridSpec(2, 2, figure=fig2)

ax2_1 = fig2.add_subplot(gs[0, 0])
ax2_2 = fig2.add_subplot(gs[0, 1])
ax2_3 = fig2.add_subplot(gs[1, 0])
ax2_4 = fig2.add_subplot(gs[1, 1])

plt.show()

#### we can easily achieve something more nuanced

In [None]:
fig3 = plt.figure(constrained_layout=True)

gs = gridspec.GridSpec(3, 3, figure=fig3)

ax3_1 = fig3.add_subplot(gs[0, :])
ax3_2 = fig3.add_subplot(gs[1, :2])
ax3_3 = fig3.add_subplot(gs[1, 2:])
ax3_4 = fig3.add_subplot(gs[2:, :1])
ax3_5 = fig3.add_subplot(gs[2:, 1:])

plt.show()

#### note that .add_gridspec() is a convenience method to accomplish the above. This can save you an import.

In [None]:
fig4 = plt.figure(constrained_layout=True)

gs = fig4.add_gridspec(3, 3)

ax4_1 = fig4.add_subplot(gs[0, :])
ax4_1.set_title('gs[0, :]')
ax4_2 = fig4.add_subplot(gs[1, :2])
ax4_2.set_title('gs[1, :2]')
ax4_3 = fig4.add_subplot(gs[1, 2:])
ax4_3.set_title('gs[1, 2:]')
ax4_4 = fig4.add_subplot(gs[2:, :1])
ax4_4.set_title('gs[2:, :1]')
ax4_5 = fig4.add_subplot(gs[2:, 1:])
ax4_5.set_title('gs[2:, 1:]')

plt.show()

#### Now we'll specify some width and height ratios.

In [None]:
fig5 = plt.figure(constrained_layout=True)

"""Note the absolute values don't matter here...

We're only concerned with the ratios.
[2, 3, 1.5] is equiv to [4, 6, 3]"""

widths = [2, 3, 1.5]
heights = [1, 3, 2]

# initializing the gridspec geometry
gs5 = fig5.add_gridspec(nrows=3, ncols=3, width_ratios=widths,
                        height_ratios=heights)

"""We have our gridspec. Now we're just assigning
gridspec components to axes and annotating for clarity."""

for row in range(3):
    for col in range(3):
        ax = fig5.add_subplot(gs5[row, col])
        label = 'Width: {}\nHeight: {}'.format(widths[col], heights[row])
        ax.annotate(label, (0.1, 0.5), xycoords='axes fraction', va='center')

plt.show()
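The note in the cell above, that only the ratios matter, can be checked with plain arithmetic. This sketch uses a hypothetical helper, `normalized()`, and ignores the padding between subplots, so it describes relative shares rather than exact on-screen sizes.

```python
import math

def normalized(ratios):
    """Return each ratio as a share of the total."""
    total = sum(ratios)
    return [r / total for r in ratios]

# two ratio lists that differ only by a constant factor
a = normalized([2, 3, 1.5])
b = normalized([4, 6, 3])

# both give the same relative column widths
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(a)
print(same)
```

The same reasoning applies to height_ratios.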

#### Now we use the gridspec_kw parameter with subplots.

Note, we're passing the width and height params (as a dict) to gridspec_kw
as part of our call to subplots() instead of passing them to
add_gridspec() or gridspec.GridSpec()

In [None]:
# using gridspec_kw...a subplots() parameter.
# any parameter accepted by GridSpec() can be passed to subplots() through the gridspec_kw parameter.
# below width_ratios and height_ratios are keyword params accepted by GridSpec(), but we'll hand them to subplots() in the gridspec_kw dict.

widths = [1, 1, 2] # col width
heights = [1, 1, 1] # row height

gs_kw = dict(width_ratios=widths, height_ratios=heights)

fig, axs = plt.subplots(ncols=3, nrows=3, constrained_layout=True,
                         gridspec_kw=gs_kw)

"""Since we're passing gridspec params through subplots
we already have our axes specified, so we'll iterate a bit differently than above."""

for i, ax in np.ndenumerate(axs):
    label = 'Width: {}\nHeight: {}'.format(widths[i[1]], heights[i[0]])
    ax.annotate(label, (0.1, 0.5), xycoords='axes fraction', va='center')

plt.show()

__IN-Class__

Read in the following csv files. These are US higher education enrollment data.

chars = pd.read_csv('https://nces.ed.gov/ipeds/datacenter/data/HD2021.zip', 
                    compression='zip',
                    encoding="ISO-8859-1")

enr = pd.read_csv('https://nces.ed.gov/ipeds/datacenter/data/EFFY2021.zip',
                  compression='zip',encoding="ISO-8859-1")
                  

- Retain INSTNM,  STABBR, CONTROL and UNITID from chars.  
- Only retain enr records where EFFYALEV = 1 (all credit seeking students).
- Retain EFYTOTLT and UNITID from enr.  

CONTROL (1=public, 2=private, 3=for profit); drop records coded -3.

UNITID can be used to join these two DataFrames. 

Listwise drop records with any null values.

INSTNM = school
STABBR = state
CONTROL = control
EFYTOTLT = enrollment

_Try to address the first task below. If you have time attempt the second and third items._

1. Using subplots in conjunction with sns plotting functions, plot a histogram of total enrollment and overlay the cumulative distribution function on top of it.  There may be some extreme enrollment values, so think about an appropriate approach for excluding records that will allow us to get a good view of the distribution.

2. Compare the enrollment distributions of public, private and for profit institutions.

3. Plot the enrollment distribution of Colorado institutions, and try to call out DU's enrollment specifically.
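As a starting point for the first task, here is one possible way to handle extreme values and compute an empirical CDF. It is a sketch on synthetic data: the lognormal sample and the 99th-percentile cutoff are assumptions chosen for illustration, not properties of the IPEDS data.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic, right-skewed stand-in for enrollment counts (not real IPEDS data)
enrollment = rng.lognormal(mean=7, sigma=1.2, size=2000)

# keep records at or below the 99th percentile (the cutoff is a judgment call)
cutoff = np.quantile(enrollment, 0.99)
trimmed = enrollment[enrollment <= cutoff]

# empirical CDF: sorted values against their cumulative proportion
x = np.sort(trimmed)
ecdf = np.arange(1, x.size + 1) / x.size

print(enrollment.size, trimmed.size)
```

For the plot itself, seaborn's histplot() and ecdfplot() both accept an `ax` argument, and `ax.twinx()` gives a second y-axis on the same subplot for the overlay.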

The preliminary cleaning steps are provided below so you can focus on the plotting.

In [None]:
chars = pd.read_csv('https://nces.ed.gov/ipeds/datacenter/data/HD2021.zip', 
                    compression='zip',
                    encoding="ISO-8859-1")

#### The second zip file contains two csv files, so you can run the following bash commands to curl and unzip it, or you can download it manually.

In [None]:
! curl -O https://nces.ed.gov/ipeds/datacenter/data/EFFY2021.zip

In [None]:
! unzip EFFY2021.zip

In [None]:
enr = pd.read_csv('effy2021.csv',
                  encoding="ISO-8859-1")

In [None]:
chars = chars[['UNITID', 'INSTNM', 'STABBR', 'CONTROL']]

enr = enr.loc[enr['EFFYALEV'] == 1, ['UNITID', 'EFYTOTLT']]

data = pd.merge(chars, enr, how='left', on='UNITID')

In [None]:
data.dropna(how='any', axis=0, inplace=True)

In [None]:
data.rename(columns={'INSTNM': 'school',
                    'STABBR': 'state',
                    'CONTROL': 'control',
                    'EFYTOTLT': 'enrollment'},
           inplace=True)

In [None]:
data['control'] = data['control'].map({1: 'public', 2: 'private', 3: 'for-profit'})
data = data.dropna(subset=['control'])  # CONTROL == -3 maps to NaN above; drop those records