### Visualization in Python using seaborn

Included with this notebook is an anonymized dataset from a thesis project assessing the rates of natural history digitization. We'll use these data to practice visualization in Python.

In [18]:
# import pandas under its conventional short alias, pd
import pandas as pd

# When loading the file, tell pandas to parse the 'Date' column as dates rather than strings
df = pd.read_csv('imaging_rates.csv', parse_dates=['Date'])
df.head(2)

Unnamed: 0,Date,name,qty_imaged,minutes_spent,Were there any setbacks? Please describe.,rate,tot_min,binned,tot_hours
0,2017-01-10,Emma,190.0,180.0,,1.056,180.0,"(120.0, 180.0]",3.0
1,2017-01-10,Vincent,242.0,175.0,,1.383,175.0,"(120.0, 180.0]",2.92
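To see what `parse_dates` changes, here is a minimal sketch that reads a small in-memory sample instead of `imaging_rates.csv`. The `sample_csv` text and the `sample` variable are stand-ins made up for illustration; only the first two rows of the real file are mirrored.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for imaging_rates.csv
sample_csv = io.StringIO(
    "Date,name,qty_imaged,minutes_spent\n"
    "2017-01-10,Emma,190,180\n"
    "2017-01-10,Vincent,242,175\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["Date"])

# With parse_dates, the column holds datetime64 values rather than plain strings
print(sample.dtypes)
date_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Date"])
print(date_is_datetime)
```

Once the column is a datetime type, accessors such as `sample["Date"].dt.year` become available.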


In [19]:
# add a rate column, as qty imaged / minutes spent
df['rate'] = round(df['qty_imaged'] / df['minutes_spent'], 3)
df.head(2)

Unnamed: 0,Date,name,qty_imaged,minutes_spent,Were there any setbacks? Please describe.,rate,tot_min,binned,tot_hours
0,2017-01-10,Emma,190.0,180.0,,1.056,180.0,"(120.0, 180.0]",3.0
1,2017-01-10,Vincent,242.0,175.0,,1.383,175.0,"(120.0, 180.0]",2.92


In [20]:
# add a cumulative total minutes column
# first group by name, then calculate the cumulative sum of minutes_spent within each group (that is, for each person)
df['tot_min'] = df.groupby('name')['minutes_spent'].cumsum()
# a total hours column is left for the activity below
df.head(3)

Unnamed: 0,Date,name,qty_imaged,minutes_spent,Were there any setbacks? Please describe.,rate,tot_min,binned,tot_hours
0,2017-01-10,Emma,190.0,180.0,,1.056,180.0,"(120.0, 180.0]",3.0
1,2017-01-10,Vincent,242.0,175.0,,1.383,175.0,"(120.0, 180.0]",2.92
2,2017-01-10,Emma,198.0,180.0,,1.1,360.0,"(120.0, 180.0]",6.0
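The grouped cumulative sum can be easier to follow on a tiny example. This sketch rebuilds just the `name` and `minutes_spent` values from the three rows shown above in a stand-in DataFrame called `toy` (a name chosen here for illustration).

```python
import pandas as pd

# Stand-in data: the name and minutes_spent values from the three rows above
toy = pd.DataFrame({
    "name": ["Emma", "Vincent", "Emma"],
    "minutes_spent": [180.0, 175.0, 180.0],
})

# cumsum restarts for each name, so every person gets their own running total
toy["tot_min"] = toy.groupby("name")["minutes_spent"].cumsum()
print(toy)
```

Emma's second session adds to her own total only, while Vincent's single session is unaffected by hers.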


### Activity: add a calculated column.

To make the time values easier to read, add a `tot_hours` column to the DataFrame, computed from `tot_min` in the same way we added the `rate` column.
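If you get stuck, one possible approach is sketched below on a stand-in DataFrame named `toy` (an illustrative name). Rounding to two decimal places is an assumption inferred from the displayed values, not something the notebook states.

```python
import pandas as pd

# Stand-in data: tot_min values taken from the rows displayed earlier
toy = pd.DataFrame({
    "name": ["Emma", "Vincent", "Emma"],
    "tot_min": [180.0, 175.0, 360.0],
})

# Convert cumulative minutes to hours; 2 decimal places is an assumed precision
toy["tot_hours"] = round(toy["tot_min"] / 60, 2)
print(toy)
```

The same line should work on `df` once `tot_min` exists.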


Notice that there is a setbacks field. Should we remove the rows that report a setback? One way to keep only the rows with no setback recorded:

```{python}
df = df.loc[df['Were there any setbacks? Please describe.'].isna()]
```
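Before dropping rows, it can help to count how many would be lost. The sketch below does this on a stand-in DataFrame; `toy`, `setback_col`, `has_setback`, and the setback text itself are invented for illustration.

```python
import pandas as pd

setback_col = "Were there any setbacks? Please describe."

# Stand-in data; the setback description is an invented example
toy = pd.DataFrame({
    "name": ["Emma", "Vincent", "Emma"],
    "rate": [1.056, 1.383, 0.4],
    setback_col: [None, None, "camera battery died"],
})

# True where a setback was described, False where the field is empty
has_setback = toy[setback_col].notna()
n_setbacks = int(has_setback.sum())
print(f"{n_setbacks} of {len(toy)} rows report a setback")

# Keep only the rows without a setback, leaving the original untouched
kept = toy.loc[~has_setback]
print(kept.shape)
```

Assigning the result to a new name such as `kept` leaves the full data available in case the setback rows turn out to be worth examining.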

In [None]:
display(df.loc[df['Were there any setbacks? Please describe.'].notna()].sample(5))

In [None]:

df.shape

There are many options for visualization. Most plotting in Python is built on the [Matplotlib](https://matplotlib.org/) library, which is worth getting to know. However, the [seaborn](https://seaborn.pydata.org/) library offers simplified access to the Matplotlib functions, making it a great place to start.

In [None]:
# import seaborn using the sns alias
import seaborn as sns
# To control the plot size, we set Matplotlib's default figure size (seaborn draws its plots with Matplotlib).
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [20, 10]

sns.lineplot(x='tot_min', 
             y='rate', 
             hue='name',
             data=df,
             linewidth=2.5)

### Activity: explore various visualization options for the large dataset.

Presume you're the principal investigator on an NSF-funded project to digitize a quarter million natural history specimens. Over the course of a semester, many students have been earning credit hours by helping with this digitization effort. You want to hire 3 students to continue their work over the summer, but you're not sure which students to keep. Explore the results from the various visualization options below to make an informed decision.

In [None]:
# this initiates the grid, which will be filled in by whichever option is chosen below.
g = sns.FacetGrid(df, hue="name", col="name", col_wrap=5)

# Choose from these options by uncommenting them one at a time.
#g.map(plt.scatter, 'tot_hours', "rate", edgecolor="white")
#g.map(plt.plot, 'tot_hours', 'rate', marker="o")
#g.map(sns.regplot, 'tot_hours', "rate")#, x_bins=range(0,60))

# this sets the limits to the plots
g.set(ylim=(0, 6), xlim=(0, 60))

In [None]:
# here is one example of estimating an average trend across all students combined

g = sns.regplot(x='tot_hours', y='rate', data=df, x_bins=range(1, 61))
g.set(ylim=(0, 6), xlim=(0, 60))
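A numeric summary per student can complement the plots when weighing the hiring decision. This is a rough sketch on stand-in data, not a recommended ranking method: `toy`, `summary`, and the output column names are illustrative, and the final Vincent row is invented.

```python
import pandas as pd

# Stand-in data; the final Vincent row is invented for illustration
toy = pd.DataFrame({
    "name": ["Emma", "Vincent", "Emma", "Vincent"],
    "qty_imaged": [190.0, 242.0, 198.0, 210.0],
    "minutes_spent": [180.0, 175.0, 180.0, 150.0],
})
toy["rate"] = round(toy["qty_imaged"] / toy["minutes_spent"], 3)

# One row per student: average rate, total hours worked, and number of sessions
summary = (
    toy.groupby("name")
       .agg(mean_rate=("rate", "mean"),
            total_hours=("minutes_spent", lambda m: m.sum() / 60),
            sessions=("rate", "size"))
       .sort_values("mean_rate", ascending=False)
)
print(summary)
```

A high average rate from only a few sessions may be less informative than a steady rate over many hours, so the columns are best read together and alongside the plots.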

### Activity: try some of the example recipes from seaborn

seaborn [has a great example gallery](https://seaborn.pydata.org/examples/index.html). Try one of their recipes in the cell below.