# Week 6: Data Visualization with `matplotlib`

## a Brief Recap:

* Hello, how are you?
* Today: visualizing data in Python with:
    - `matplotlib`
    - `seaborn`
    - and some other helpful examples
* Next week: (can you believe it will be week 8 already?!?!)
    - building data dashboards
    - catching up with Python Classes

## about `matplotlib`


*"My goal is to make high quality, publication quality plotting easy in python, with syntax familiar to MATLAB users"* - John D. Hunter


<table>
<tr>
<td> <img src="https://paw.princeton.edu/sites/default/files/styles/portrait_feature/public/images/content/90-Hunter-John-D.jpg?itok=wDYFhIbT" alt="John D. Hunter" style="width: 200px;"/> </td>
<td> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Created_with_Matplotlib-logo.svg/1024px-Created_with_Matplotlib-logo.svg.png" alt="Created with Matplotlib logo" style="width: 200px;"/> </td>
</tr>
</table>

## more about `matplotlib`

* a plotting library designed to have MATLAB-like functionality
* built on top of `NumPy`
* many other Python visualizations are implemented as extensions of `matplotlib`
* interfaces with `.py`, `.ipynb`, web applications and GUI toolkits
* can export to various hard-copy formats (publication quality)
* bummer: not great for 3D visualizations


## `matplotlib` architecture

* Three-layer stack:
    1. **scripting** - what the user uses to create
        * the [**pyplot API**](https://matplotlib.org/stable/api/index.html) - the command style functions that make `matplotlib` work like MATLAB
    2. **Artist** - does the internal work of rendering
        * the Artist object draws various elements of a plot
            - primitives: what to render (lines, text, shapes)
            - containers: where to render (axes, figure)
    3. **backend** - displays the plot
        * **user-interface (interactive)** - GUIs, etc
        * **hard-copy (non-interactive)** - .png, .svg, .pdf
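
The three layers can be seen side by side in a few lines. This is a minimal sketch, not part of the original notebook: it assumes the non-interactive `Agg` backend, lets pyplot build the containers, attaches one primitive by hand, and renders to an in-memory PNG. The data values and variable names are only illustrative.

```python
import io

import matplotlib
matplotlib.use("Agg")  # backend layer: a hard-copy (non-interactive) renderer
import matplotlib.pyplot as plt  # scripting layer
from matplotlib.lines import Line2D  # Artist layer: a primitive

# containers: pyplot builds the Figure and one Axes for us
fig, ax = plt.subplots()

# primitive: create a line Artist by hand and attach it to the Axes
line = Line2D([0, 1], [1, 3])
ax.add_line(line)
ax.set_xlim(0, 1)
ax.set_ylim(1, 3)

# the backend renders every Artist in the figure, here to PNG bytes in memory
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

In everyday use `plt.plot([0, 1], [1, 3])` does the Artist work for you; building the `Line2D` by hand just makes the middle layer visible.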

## `matplotlib` figure

**Figure** - the top-level object that contains all elements of a graph: the primitives and the containers
<img src="anatomy_matplotlib.png" width="50%" style="margin-left:auto; margin-right:auto">
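
The containment described above can be checked directly on a live figure. The following is a small sketch with made-up data: it creates a figure, then looks up the Axes inside the Figure and the line and title inside the Axes.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1], [1, 3])  # ax.plot returns a list of line Artists
ax.set_title("Example Plot")

# the Figure holds the Axes; the Axes holds the primitives
print(fig.axes)              # list containing the one Axes
print(ax.lines[0] is line)   # the line we just plotted lives in the Axes
print(ax.title.get_text())   # the title is a Text Artist owned by the Axes
```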

## `matplotlib` getting started

We'll be working in non-interactive mode, so a plot is only displayed when we explicitly call `plt.show()`.

In [None]:
# set the screen output as the backend
#%matplotlib inline

# import the pyplot module of matplotlib with the alias 'plt'
import matplotlib.pyplot as plt

In [None]:
# create a simple line graph
plt.plot([1,3]) 
plt.show()

### invite Artists

In [None]:
# add some labels to the axes and a title
#plt.plot([1,3]) equivalent to:
plt.plot([0,1],[1,3])
plt.title( 'Example Plot' )
plt.xlabel( 'some units' )
plt.ylabel( 'some other units' )
plt.show()

## plotting Data

the `.plot()` function accepts Python lists, NumPy arrays or pandas DataFrame columns  
we need our data to be formatted correctly for `matplotlib`
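
As a quick illustration (a sketch with made-up values, not data from this course), the same four numbers can be passed to `plot()` as a list, a NumPy array and a DataFrame column. The three lines fall on top of each other, so each gets a different width and style to stay visible.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

y_list = [1, 3, 2, 5]                                 # plain Python list
y_array = np.array(y_list)                            # NumPy array
y_series = pd.DataFrame({"value": y_list})["value"]   # DataFrame column

fig, ax = plt.subplots()
ax.plot(y_list, linewidth=6, alpha=0.3, label="list")
ax.plot(y_array, linewidth=3, linestyle="--", label="array")
ax.plot(y_series, linewidth=1, color="black", label="DataFrame column")
ax.legend()
plt.show()
```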

In [None]:
import numpy as np
import pandas as pd

## Revisit the Pima Indians data

In [None]:
path = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/diabetes.csv'
diabetes_pima = pd.read_csv( path )
pimacolumns_2change = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_pima[ pimacolumns_2change ] = diabetes_pima[ pimacolumns_2change ].replace(0, np.nan )
diabetes_pima.info()

## visualize `DiabetesPedigreeFunction` ~ Age

**Diabetes Pedigree Function** - provides a measure of the diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. Could be considered a measure of the hereditary risk associated with the onset of diabetes mellitus. 

In [None]:
diabetes_pima.plot( x = 'Age', y = 'DiabetesPedigreeFunction')
plt.show()

## we can do so much better...

a line plot connects the rows in the order they appear in the file, so it is clearly inappropriate for showing how the data are distributed.  
`matplotlib` has many other plotting methods we can choose from. Here are some basics:  

* Bar 
* Scatter 
* Histogram
* Box
* Violin
* for extensive lists with example code:
    - [python-graph-gallery](https://www.python-graph-gallery.com/)
    - [Matplotlib Gallery](https://matplotlib.org/stable/gallery/)
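
To get a feel for the five basic types listed above, here is a compact sketch that draws each one in its own panel. It uses synthetic random numbers rather than the diabetes data, and the group names `a` and `b` are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the figure is repeatable
group_a = rng.normal(0.0, 1.0, 200)   # synthetic data, not the diabetes set
group_b = rng.normal(1.0, 0.5, 200)

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
axes[0].bar(["a", "b"], [group_a.mean(), group_b.mean()])
axes[0].set_title("Bar")
axes[1].scatter(group_a, group_b)
axes[1].set_title("Scatter")
axes[2].hist(group_a, bins=20)
axes[2].set_title("Histogram")
axes[3].boxplot([group_a, group_b])
axes[3].set_title("Box")
axes[4].violinplot([group_a, group_b])
axes[4].set_title("Violin")
fig.tight_layout()
plt.show()
```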

## Scatter plot

In [None]:
plt.scatter( x = diabetes_pima['Age'], 
            y = diabetes_pima['DiabetesPedigreeFunction'])
plt.show()

In [None]:
fig, ax = plt.subplots()
colors = [ 'red', 'blue' ]
for idx, outcome in enumerate( set( diabetes_pima['Outcome'] ) ):
    #print( outcome )
    outcome_idx = np.where( diabetes_pima['Outcome'] == outcome )
    #print( outcome_idx )
    ax.scatter(  x = diabetes_pima['Age'].iloc[outcome_idx], 
               y = diabetes_pima['DiabetesPedigreeFunction'].iloc[outcome_idx],
               color = colors[ idx ] )
plt.legend(labels = ['0','1'], title = 'Outcome')
plt.show()

## Histogram

In [None]:
bins = np.linspace( 20,90,70 )
colors = [ 'red', 'blue' ]

for idx, outcome in enumerate( set( diabetes_pima['Outcome'] ) ):
    outcome_idx = np.where( diabetes_pima['Outcome'] == outcome )
    plt.hist(  diabetes_pima['Age'].iloc[outcome_idx], 
             bins, alpha = 0.5, color = colors[ idx ] )
    
plt.legend(labels = ['0','1'], title = 'Outcome')
plt.title( 'Age' )
plt.xlabel( 'Age' )
plt.ylabel( 'Count' )
plt.show()

In [None]:
bins = np.linspace( 0,3 )
colors = [ 'red', 'blue' ]

for idx, outcome in enumerate( set( diabetes_pima['Outcome'] ) ):
    outcome_idx = np.where( diabetes_pima['Outcome'] == outcome )
    plt.hist(  diabetes_pima['DiabetesPedigreeFunction'].iloc[outcome_idx], 
             bins, alpha = 0.5, color = colors[ idx ] )
    
plt.legend(labels = ['0','1'], title = 'Outcome')
plt.title( 'Diabetes Pedigree Function Distribution' )
plt.xlabel( 'Diabetes Pedigree Function' )
plt.ylabel( 'Count' )
plt.show()

In [None]:
# create both panels up front so no extra, empty axes is left underneath them
fig, (ax_age, ax_dpf) = plt.subplots( 1, 2, figsize = (10,5) )
binsAge = np.linspace( 20,90, 70 )
binsDPF = np.linspace( 0,3 )
colors = [ 'red', 'blue' ]

for idx, outcome in enumerate( set( diabetes_pima['Outcome'] ) ):
    outcome_idx = np.where( diabetes_pima['Outcome'] == outcome )
    # left panel: Age
    ax_age.hist(  diabetes_pima['Age'].iloc[outcome_idx], binsAge, alpha = 0.5, color = colors[ idx ] )
    # right panel: Diabetes Pedigree Function
    ax_dpf.hist(  diabetes_pima['DiabetesPedigreeFunction'].iloc[outcome_idx], 
             binsDPF, alpha = 0.5, color = colors[ idx ] )
    
ax_age.set_xlabel( 'Age' )
ax_dpf.set_xlabel( 'Diabetes Pedigree Function' )
ax_dpf.legend(labels = ['0','1'], title = 'Outcome')
plt.show()

## Grouped bar chart

In [None]:
fig, ax = plt.subplots( figsize = (10,5) )
colors = [ 'red', 'blue' ]
width = 0.35

outcome0_idx = np.where( diabetes_pima['Outcome'] == 0 )
outcome1_idx = np.where( diabetes_pima['Outcome'] == 1 )

series0 = ax.bar( diabetes_pima['Age'].iloc[outcome0_idx] - width/2, 
                 diabetes_pima['DiabetesPedigreeFunction'].iloc[outcome0_idx], width, 
                 label = '0', alpha = 0.5, color = colors[0])
series1 = ax.bar( diabetes_pima['Age'].iloc[outcome1_idx ] + width/2, 
                 diabetes_pima['DiabetesPedigreeFunction'].iloc[outcome1_idx ], width,
                 label = '1', alpha = 0.5, color = colors[1])
    
plt.legend(labels = ['0','1'], title = 'Outcome')
plt.show()

## Hmmmm let's bin the data by `Age`

In [None]:
# binning by age
diabetes_pima['rounded_Age'] = diabetes_pima['Age'].apply( lambda x: int( 10 * round( float(x)/10 ) ) )
diabetes_pima.head()
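
One thing to keep in mind about the cell above: `round()` goes to the *nearest* ten, and Python rounds exact halves to the even neighbor. So 25 maps to 20 while 35 maps to 40, and the bin called 30 holds ages 26 to 34 rather than 30 to 39. If true decades are wanted, floor division is one alternative. The sketch below compares the two on a few hand-picked ages; it is an illustration, not a change to the notebook's pipeline.

```python
ages = [21, 25, 29, 35, 45]  # hand-picked example ages

# same formula as the notebook: round to the nearest ten
nearest_ten = [int(10 * round(float(a) / 10)) for a in ages]

# alternative: drop the last digit to get the decade
decade = [(a // 10) * 10 for a in ages]

print(nearest_ten)  # [20, 20, 30, 40, 40]
print(decade)       # [20, 20, 20, 30, 40]
```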

## Aggregate the data around the age bins

In [None]:
# aggregate by the new feature 'rounded_Age'
diabetes_pima_ageAgg = diabetes_pima.groupby(['rounded_Age', 'Outcome']).mean().reset_index(level='Outcome')
diabetes_pima_ageAgg 

In [None]:
fig, ax = plt.subplots( figsize = (10,5) )
colors = [ 'red', 'blue' ]
labels = [ '20s', '30s', '40s', '50s', '60s', '70s', '80s' ]
x0 = np.arange( len( labels ) )
x1 = x0[:-1]
width = 0.35


outcome0_idx = np.where( diabetes_pima_ageAgg['Outcome'] == 0 )
outcome0_means = list( diabetes_pima_ageAgg['DiabetesPedigreeFunction'].iloc[outcome0_idx] )
outcome1_idx = np.where( diabetes_pima_ageAgg['Outcome'] == 1 )
outcome1_means = list( diabetes_pima_ageAgg['DiabetesPedigreeFunction'].iloc[outcome1_idx] )

series0 = ax.bar( x0 - width/2, outcome0_means, width, label = '0', alpha = 0.5, color = colors[0])
series1 = ax.bar( x1 + width/2, outcome1_means, width, label = '1', alpha = 0.5, color = colors[1])

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('mean DPF')
ax.set_title('Diabetes Pedigree Function grouped by Age')
ax.set_xticks(x0)
ax.set_xticklabels(labels)
ax.legend()

fig.tight_layout()
    
plt.legend(labels = ['0','1'], title = 'Outcome')
plt.show()

## reshape the data to facilitate visualization

use `pandas` to create a pivot table  

**pivot table** - restructure around existing patterns in the data and extract (aggregate) statistics about the groupings.
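
A tiny made-up table shows what the reshaping does. The column names `age_bin` and `mean_value` and all numbers below are hypothetical. Each distinct `Outcome` becomes its own column, and a group with no rows (here bin 40 with outcome 1) is filled with `NaN`.

```python
import pandas as pd

# long format: one row per (age_bin, Outcome) pair
long_df = pd.DataFrame({
    "age_bin": [20, 20, 30, 30, 40],
    "Outcome": [0, 1, 0, 1, 0],
    "mean_value": [0.40, 0.50, 0.45, 0.60, 0.50],
})

# wide format: one row per age_bin, one column per Outcome
wide_df = long_df.pivot(index="age_bin", columns="Outcome", values="mean_value")
print(wide_df)
```

With the data in this shape, each bar series is simply one column of `wide_df`.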

In [None]:
diabetes_pima_2plot = diabetes_pima_ageAgg.reset_index().pivot( index = 'rounded_Age', 
                                                               columns = 'Outcome', 
                                                               values = 'DiabetesPedigreeFunction')
diabetes_pima_2plot

In [None]:
fig, ax = plt.subplots( figsize = (10,5) )
colors = [ 'red', 'blue' ]
labels = [ '20s', '30s', '40s', '50s', '60s', '70s', '80s' ]
x = np.arange( len( labels ) )
width = 0.35

series0 = ax.bar( x - width/2, diabetes_pima_2plot.iloc[:,0], width, label = '0', alpha = 0.5, color = colors[0])
series1 = ax.bar( x + width/2, diabetes_pima_2plot.iloc[:,1], width, label = '1', alpha = 0.5, color = colors[1])

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('mean DPF')
ax.set_title('Diabetes Pedigree Function grouped by Age')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

fig.tight_layout()
    
plt.legend(labels = ['0','1'], title = 'Outcome')
plt.show()

## Is there another relationship in the data you would like to visualize?

Let's take a moment here.  
Think about the data and about what each feature/variable/column means.
Visualize another variable or combination of variables in a way that interests you.

In [None]:
# create a plot of your choice

## Revisit the Psychophysics data from last week

### The Vernier Hyperacuity task:

* **stimulus**: 2 drifting sinusoid gratings were presented
    - luminance, equiluminant chromatic
    - variable gap size between the two gratings
* the task:
    - subject must discriminate whether the left or right grating is shifted higher than its partner
    - 2 conditions: same polarity alignment, reverse polarity alignment
* the threshold measure:
    - we will take the mean of the last 4 reversals for the 2 staircases in each file
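
To make the threshold measure concrete, here is a worked example on two invented staircases. The reversal values and variable names are hypothetical; real files will differ.

```python
# hypothetical reversal values for the two staircases in one file
stim1_reversals = [8, 4, 6, 3, 5, 2, 4, 3]
stim2_reversals = [7, 5, 6, 4, 5, 3, 4, 2]

# keep the last 4 reversals of each staircase and pool them
pooled = stim1_reversals[-4:] + stim2_reversals[-4:]

# the threshold is the mean of the pooled reversals
threshold = sum(pooled) / len(pooled)
print(pooled)     # [5, 2, 4, 3, 5, 3, 4, 2]
print(threshold)  # 3.5
```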

here we will load a .csv file with the fields we need for aggregating and visualizing the results from this example psychophysics task

In [None]:
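# the reversal columns are stored as text; eval turns each entry back into a Python list
# (eval runs whatever the file contains, so only use it on files you trust)
gratvernier_df = pd.read_csv( 'gratvernier_mj.csv', converters={'stim1_reversals': eval,'stim2_reversals': eval} )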
#gratvernier_df.head()

In [None]:
gratvernier_df.info()

In [None]:
# let's make some new columns that hold the last 4 reversals of each staircase.
gratvernier_df['stim1_tail4'] = [ x[-4:] for x in gratvernier_df['stim1_reversals'] ]
gratvernier_df['stim2_tail4'] = [ x[-4:] for x in gratvernier_df['stim2_reversals'] ]
gratvernier_df['reversals'] = gratvernier_df['stim1_tail4'] + gratvernier_df['stim2_tail4']
gratvernier_df.head()

## Aggregating and filtering information to plot 

we need to shape our data in a way that will facilitate passing the data of interest to `matplotlib` scripting functions  

we are most interested in:  

* `grat_gap` - the size of the gap between the target gratings
* `type_vernier` - the polarity of alignment (In/Out of Phase)

we will aggregate the data on `grat_gap` and `type_vernier` and summarize the mean thresholds

In [None]:
res = gratvernier_df.groupby( [ 'grat_gap', 'type_vernier' ] ).agg({'reversals':'sum'})
res['reversal_mean'] = [ np.array( x ).mean() for x in res['reversals'] ]
res = res.reset_index()

In [None]:
res

## Pivot the data to facilitate visualization

* index: which column do you want to be the explanatory variable
* columns: which variable do you want to split into columns (typically a categorical feature)
* values: which data do you want to be the response variable

In [None]:
plotres = res.pivot( index = '______', columns = '______', values = '______')
plotres

## a simple line plot to sketch the outcome

In [None]:
plt.plot( plotres['InPhase'], color = 'red' )
plt.plot( plotres['OutOfPhase'], color = 'blue' )
plt.show()

## Take a moment to add more elements to the figure

Can you add:  

* title
* x,y labels
* legend
* anything else?...
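
If you want a starting point, the sketch below shows one way to attach these elements. It uses a small stand-in table with invented numbers instead of `plotres`; only the column names `InPhase` and `OutOfPhase` and the index name `grat_gap` come from this notebook, and the label wording is just a suggestion.

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical thresholds standing in for `plotres`; the numbers are made up
demo = pd.DataFrame(
    {"InPhase": [1.0, 1.5, 2.5], "OutOfPhase": [2.0, 2.2, 2.6]},
    index=pd.Index([0, 1, 2], name="grat_gap"),
)

fig, ax = plt.subplots()
# label= gives each line the name that the legend will display
ax.plot(demo.index, demo["InPhase"], color="red", marker="o", label="InPhase")
ax.plot(demo.index, demo["OutOfPhase"], color="blue", marker="o", label="OutOfPhase")
ax.set_title("Vernier threshold by grating gap")
ax.set_xlabel("grat_gap")
ax.set_ylabel("mean of last 4 reversals")
ax.legend(title="type_vernier")
plt.show()
```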

## We've used `matplotlib` for basic figures. Next week we'll learn about 'higher level' plots with `seaborn` and some more specialized visualization packages.
<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">