# Advanced Visualization with Seaborn
**Introduction to Python Programming for Earth Scientists**, session #XX, 27 November 2023

## Goals
* Find data online and understand how to use metadata to interpret it
* Use seaborn and pandas to understand how data is correlated
* Use the "hue" argument to categorize data

Earlier this semester, some of your fellow students introduced you to Seaborn, a library for making statistical visualizations of data.  We will spend a little more time using this library to explore a dataset of water samples from Gordon Gulch, west of Boulder.  This data was collected by CU Boulder geography master's student Maggie Burns and her advisor, Holly Barnard.  The data is available here: https://www.hydroshare.org/resource/25ba8374892541c4bfe0c9cf18f520ca/  
Please read about what the data is on this website.  Once you understand what the data is, please download it from Hydroshare and upload it to the JupyterHub.

Seaborn is built on pandas and matplotlib.  While we won't have to engage with matplotlib directly today, understanding matplotlib is necessary to customize seaborn plots, and many advanced examples freely blend the two.  For now, we will import pandas and seaborn.

In [None]:
import pandas as pd
import seaborn as sns

Next, we will load our data CSV with pandas.  First we will define the path variable as a string (the exact value will depend on where you uploaded the data file!) and then load the file as a dataframe.

In [None]:
!ls ../../Scratch/DOM*

In [None]:
data_path = "../../Scratch/DOM_MA_Data.csv/DOM_MA_Data.csv" # for me, it was "cub/teaching/python/DOM_MA_Data.csv"
df = pd.read_csv(data_path)

Now, preview the dataframe to see how we will reference its columns.

In [None]:
df

In seaborn, like pandas, we pass a dataframe and reference columns with a string corresponding to the column name.  This is in contrast to matplotlib, where we pass arrays of data.  Let's start by making a line plot to look at how Dissolved Organic Carbon (DOC) changes over the year.

In [None]:
sns.lineplot(data=df, x="Date", y="DOC (mg/L)")

Wow!  Not just a simple line plot!  There are a few things we should think about.  First, did you get some scary-looking warnings?  Read what they *actually* say.  Do you think this is something you need to worry about?  Second, notice that there is a line and a shaded area.  What do you think that area is?  Take a look at the lineplot documentation: https://seaborn.pydata.org/generated/seaborn.lineplot.html
Lastly, that x-axis is troublesome.  Let's clean it up a bit.  The immediate issue is that the "Date" column contains strings, which pandas is not parsing as dates.  This means seaborn can't space or group the labels on the x-axis correctly.  It is also a problem because, as you may have seen from the table, the rows of observations aren't in chronological order!
 ## <font color = green> IN-CLASS PRACTICE </font> 
Make a lineplot of DOC that is in chronological order.  

In [None]:
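# HINT: There are a few ways to do this, but there is a column
# in the table that you can use without issue
# "DOY" is numeric (presumably day of year), so it sorts in order,
# unlike the "Date" strings, which would sort alphabetically
df = df.sort_values(by="DOY")
sns.lineplot(data=df, x="DOY", y="DOC (mg/L)")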
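
Another route is to convert the date strings into real datetimes with `pd.to_datetime` and sort on those.  The sketch below uses a small made-up table rather than the real file, and the month/day/year format is an assumption; check how dates are written in your own copy of the data before reusing it.

```python
import pandas as pd

# Hypothetical mini-table; the values are made up for illustration only
demo = pd.DataFrame({
    "Date": ["10/2/2019", "6/15/2019", "8/1/2019"],
    "DOC (mg/L)": [2.1, 4.3, 3.0],
})

# As strings, "10/2/2019" sorts before "6/15/2019" because "1" < "6"
string_order = demo.sort_values(by="Date")["Date"].tolist()

# Parse the strings (assumed month/day/year) into datetimes, then sort
demo["Date"] = pd.to_datetime(demo["Date"], format="%m/%d/%Y")
demo = demo.sort_values(by="Date").reset_index(drop=True)
date_order = demo["Date"].dt.strftime("%Y-%m-%d").tolist()

print(string_order)
print(date_order)
```

Once the column holds datetimes, passing it as `x` to `sns.lineplot` should give a properly spaced time axis.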

## Exploring Correlations
Now we will look at correlations in this dataset.  There are a LOT of columns here, so we will select some that we think might be interesting.  Feel free to add and remove columns from the list

In [None]:
variables = ["DOC (mg/L)", # Dissolved organic carbon
             "HIX", # Humification Index
             "BIX", # Freshness Index
             "FI", # Fluorescence Index
             "Percent Protein"]

Now we will subset our dataframe so that we are only working with the columns we care about and avoid things that might confuse Python, like sample IDs.

In [None]:
df_subset = df[variables]
df_subset

Dataframes have a method `corr`, which generates a correlation matrix.  The matrix scores each variable pair by how correlated they are, using the Pearson correlation coefficient by default.  We can pass that matrix to the seaborn heatmap function to get a nice figure:

In [None]:
sns.heatmap(df_subset.corr(), vmax=1, vmin=-1, cmap="seismic")
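
To see what each cell of that matrix means, here is a minimal sketch that computes the Pearson coefficient by hand for one pair of columns and compares it with the output of `corr`.  The table `demo` and its columns `a` and `b` are made up for illustration and are not values from the Gordon Gulch data.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration; not values from the real dataset
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
})

a = demo["a"]
b = demo["b"]

# Pearson r: sum of products of deviations from the mean, divided by
# the square root of the product of the summed squared deviations
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / np.sqrt(
    ((a - a.mean()) ** 2).sum() * ((b - b.mean()) ** 2).sum()
)

# The same number appears in the off-diagonal cell of the matrix
r_pandas = demo.corr().loc["a", "b"]
print(r_manual, r_pandas)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship, which is why the heatmap above fixes its color scale from -1 to 1.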

Let's look at how these data compare directly.  Seaborn has a plot for that!  The pairplot!  It makes a scatterplot of every pair of variables and a histogram of each variable on its own.

In [None]:
sns.pairplot(df_subset)

### Categorizing Data
Seaborn also lets you color data by category in a plot using the "hue" parameter.  This is an easy way to plot multiple groups of the same data in the same plot.  We will try categorizing by site

In [None]:
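# Build a new list rather than appending in place, so re-running
# this cell does not add "Site" to the list twice
df_subset = df[variables + ["Site"]]
sns.pairplot(df_subset, hue="Site")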

 ## <font color = green> IN-CLASS PRACTICE </font> 
Choose a pair of variables to explore in more depth with seaborn.  I suggest using the function `regplot` (https://seaborn.pydata.org/generated/seaborn.regplot.html#seaborn.regplot) but explore the seaborn documentation and use any function or functions you want!

In [None]:
sns.regplot(data=df, x="DOC (mg/L)", y="HIX")

In [None]:
help(sns.regplot)
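
As far as I know, `regplot` draws the fitted line but does not hand back the slope and intercept.  If you want those numbers, one option is to fit the line yourself with `np.polyfit`.  This is a sketch on made-up data, with hypothetical column names `x` and `y`; substitute two columns from `df` to try it on the real samples.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration; substitute two columns from df
demo = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 3.0, 5.0, 7.0, 9.0],
})

# Drop rows with missing values first, since polyfit does not skip NaNs
clean = demo.dropna(subset=["x", "y"])

# A degree-1 polynomial fit returns the slope, then the intercept
slope, intercept = np.polyfit(clean["x"], clean["y"], deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```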