In [None]:
# Setting up the Colab environment. DO NOT EDIT!
try:
  from applied_biostats import setup_environment
except ImportError:
  !pip -q install applied-biostats-helper
  from applied_biostats import setup_environment
finally:
  grader = setup_environment('Module05_walkthrough')

# Walkthrough

## Learning Objectives
At the end of this learning activity you will be able to:
 - Create barplots of categorical count data.
 - Adjust the limits, labels, and titles of matplotlib axes.
 - Create boxplots of continuous numerical data.
 - Generate histograms of continuous numerical data.
 - Construct scatterplots to compare continuous variables.

In [None]:
import numpy as np
import pandas as pd

# A common import style you'll see across the web
import matplotlib.pyplot as plt

# Make the notebook show images as we make them
%matplotlib inline

## Matplotlib

Matplotlib is a highly influential plotting library in Python dating back to the early 2000s.
It was initially created by John D. Hunter, a neurobiologist, as an alternative to MATLAB, which was widely used at the time for scientific computing and data visualization.
His primary motivation was to have an open-source tool that could replicate MATLAB's plotting capabilities, which he needed for his work in electrophysiology.
Over the years, it has grown with contributions from a large community of developers, evolving to support a wide range of plots and visualizations.

A key to Matplotlib's success has been its flexibility and integration with other Python libraries.
It works well with NumPy and Pandas, making it a go-to choice for data analysis and manipulation tasks.
Its integration with Jupyter notebooks has also made it popular for exploratory data analysis in a notebook environment.

Matplotlib's design philosophy revolves around the idea of allowing users to create simple plots with just a few lines of code, while also giving them the ability to make complex customizations.
This balance between simplicity and power has contributed significantly to its widespread adoption.

If you are interested, you can read more about the history of the package at their [website](https://matplotlib.org/stable/users/project/history.html).


## Data

This week we will look at data from a cohort of People Living with HIV (PLH) here at Drexel.

As we discussed in the introduction, this data collection effort was done to provide a resource for many projects across the fields of HIV, aging, inflammation, neurocognitive impairment, immune function, and unknowable future projects.
In this walkthrough we will explore a collection of cytokines and chemokines measured by a Luminex panel of common biomarkers of inflammation.

In [None]:
data = pd.read_csv('cytokine_data.csv')
data.head()

## Basic Plotting

`pandas` and `matplotlib` are tightly coupled and provide a number of ways to make simple plots easily.
Most pandas objects have a `.plot()` method that can graph the data within them and control many aspects of the output.

Columns (or any `pd.Series` object) have a method for easily counting categorical values:
`.value_counts()`

In [None]:
data['Sex'].value_counts()

In [None]:
# Just plot it.

data['Sex'].value_counts().plot()

That's _almost_ what we want.
By default, the `kind` of plot is a line plot, because pandas was originally designed for time-series financial data.
Nicely, pandas allows many different ways to customize a plot.
One of them is to change its `kind`, like so.

In [None]:
data['Sex'].value_counts().plot(kind = 'bar')

Like we learned last week, grouping samples by categories can be insightful.
What if we wanted to know whether there was a balance of racial minorities across our gender categories?

To do this, you can use `groupby` to create multiple levels.

In [None]:
data.groupby('Sex')['isAA'].value_counts()

In [None]:
# Notice kind='barh' to make it horizontal

data.groupby('Sex')['isAA'].value_counts().plot(kind = 'barh')

We can also pivot the data so that we have a table with one column for each value of `isAA`.

In [None]:
gender_race_piv = pd.pivot_table(data,
                                 index = 'Sex',
                                 columns = 'isAA',
                                 values = 'Age', # Can be any column, we're just counting them
                                 aggfunc = 'count')
gender_race_piv

Plotting the pivoted table then draws each column as a different bar.

In [None]:
gender_race_piv.plot(kind = 'bar')

In [None]:
gender_race_piv.plot(kind = 'bar', stacked=True)
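
One thing to watch with count-style pivot tables: a combination that never occurs shows up as `NaN`, not `0`. As a hedged sketch, `pd.crosstab` builds the same kind of count table and fills missing combinations with zero. The `toy` table and its `'Yes'`/`'No'` coding below are made up for illustration and are not the cohort data.

```python
import pandas as pd

# Made-up rows purely for illustration; the real isAA coding may differ
toy = pd.DataFrame({
    'Sex':  ['Male', 'Male', 'Female', 'Female', 'Male'],
    'isAA': ['Yes', 'No', 'Yes', 'Yes', 'Yes'],
    'Age':  [50, 61, 47, 55, 39],
})

# Same recipe as the walkthrough: count rows per Sex/isAA combination
piv = pd.pivot_table(toy, index='Sex', columns='isAA',
                     values='Age', aggfunc='count')

# crosstab counts co-occurrences directly and reports 0 for absent pairs
ct = pd.crosstab(toy['Sex'], toy['isAA'])

print(piv)
print(ct)
```

Either table can be passed to `.plot(kind = 'bar')`; the crosstab version simply avoids gaps from missing values.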

There are _dozens_ of things you can customize about your plots in this manner.
You can see them by checking the `help` here in Colab.
To do this, run `data.plot?` in a cell by itself, and Colab will bring up some information to read.
You can also check out the documentation on the `pandas` website [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) and in their tutorial [here](https://pandas.pydata.org/docs/user_guide/visualization.html).


## Plot Handles

If we want to make edits to the plot, we need to capture the `handle` that is generated by the plot.
This variable represents the object of the plot and allows us to manipulate its properties like the axis limits, labels, etc.
This must be done in the same cell before the image is presented.

In [None]:
axis_handle = data.groupby('Sex')['isAA'].value_counts().plot(kind = 'barh')
axis_handle.set_xlim(0, 160)

In [None]:
axis_handle = data.groupby('Sex')['isAA'].value_counts().plot(kind = 'barh')
axis_handle.set_xlim(0, 160)
axis_handle.set_xlabel('Participants')
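
The same handle also controls the y-axis label and the title. Here is a minimal sketch on made-up counts (the `toy_counts` Series is hypothetical, not the cohort data), using the standard matplotlib `Axes` setters.

```python
import pandas as pd

# Made-up counts purely for illustration
toy_counts = pd.Series({'Male': 12, 'Female': 8})

# .plot() returns the matplotlib Axes, which we keep as our handle
toy_ax = toy_counts.plot(kind = 'barh')

# All edits happen on the handle, in the same cell as the plot
toy_ax.set_xlim(0, 15)
toy_ax.set_xlabel('Participants')
toy_ax.set_ylabel('Sex')
toy_ax.set_title('Participants by sex (toy data)')
```

The matching `get_` methods, such as `toy_ax.get_title()`, are handy for checking what a plot currently shows.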

### Q1: Explore the `cocaine_use` and `cannabinoid_use` columns.

Create a barplot of the number of participants in each category: cocaine use, cannabinoid use, multi-use, and non-use.

|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 4  |

_Points:_ 2

In [None]:
# Add a new column indicating True for multi-use

...

# Add a new column indicating True for non-use
...


In [None]:
# Sum the number of True's in each use column

use_counts = ...
use_counts

In [None]:
# Create a barplot
use_axis = ...

In [None]:
grader.check("q1_drug_use_plot")

## Numeric Variables

We can summarize numerical columns in a number of ways.

### Box Plots

In [None]:
data['Age'].plot(kind = 'box')

Breaking it down:
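 - The middle green line is the _median_
 - The box spans the 25th to 75th percentiles (the interquartile range, or IQR)
 - The whiskers extend to the most extreme points within 1.5 x IQR of the box edges (the matplotlib default)
 - The dots are outliers that fall beyond the whiskers.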
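
To see where those pieces come from, here is a small sketch that computes the box, whisker, and outlier values by hand under matplotlib's default rule, where whiskers stop at the last point within 1.5 x IQR of the box. The `ages` array is made up for illustration.

```python
import numpy as np

# Made-up ages purely for illustration
ages = np.array([31, 42, 45, 47, 50, 52, 53, 55, 58, 60, 88])

# Box edges and the middle line
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Default whisker rule: reach to the most extreme points within 1.5 * IQR
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
inside = ages[(ages >= low_limit) & (ages <= high_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the limits is drawn as an individual dot
outliers = ages[(ages < low_limit) | (ages > high_limit)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Note that the whiskers land on real data points, so they are usually shorter than the 1.5 x IQR limits themselves.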

You can do multiple box plots if your data is in `wide` form.

In [None]:
data[['egf', 'eotaxin', 'fgfbasic', 'gcsf', 'gmcsf']].plot(kind='box')

You can also group by another column to create subplots.

In [None]:
data[['Sex', 'egf', 'eotaxin']].plot(kind='box', by = 'Sex')

### Q2: Is the expression of `ifnalpha` or `vegf` different across neurological impairment status?

Create a set of boxplots to visualize `ifnalpha` or `vegf` at the different neurological states in the `neuro_screen_impairment_level` column.

|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 4  |

_Points:_ 2

In [None]:
q2_axes = ...

In [None]:
grader.check("q2_neuro_use_plot")

In [None]:
# DO NOT REMOVE!
plt.close()
# For the grader

### Histograms

In [None]:
data['eotaxin'].plot(kind = 'hist')

Personally, I prefer to specify my bin edges explicitly instead of letting the computer decide.

In [None]:
data['eotaxin'].plot(kind = 'hist',
                     bins = np.arange(0, 300, 25))
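
One caveat when writing bin edges by hand: `np.arange` excludes its stop value, so `np.arange(0, 300, 25)` ends at 275 and any measurement above that is silently left out of the histogram. A quick sketch with made-up `values` shows the effect using `np.histogram`, the routine matplotlib relies on for binning.

```python
import numpy as np

edges = np.arange(0, 300, 25)        # stop is excluded: last edge is 275
edges_incl = np.arange(0, 301, 25)   # nudge the stop so 300 is included

# Made-up measurements purely for illustration
values = np.array([10, 120, 280, 290])

counts, _ = np.histogram(values, bins = edges)
counts_incl, _ = np.histogram(values, bins = edges_incl)

print(edges[-1], counts.sum())            # two values fall past the last edge
print(edges_incl[-1], counts_incl.sum())  # all four values are counted
```

It is worth checking `data['eotaxin'].max()` against the last edge before trusting the plot.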

In [None]:
data.groupby('Sex')['eotaxin'].plot(kind = 'hist',
                                    bins = np.arange(0, 300, 25),
                                    alpha = 0.75,
                                    legend=True)

## Comparison of Variables

In [None]:
data.plot(kind = 'scatter', x = 'mip1alpha', y = 'mip1beta')

In [None]:
# We can also add colors
colors = data['Sex'].replace({'Male': 'b', 'Female': 'r', 'Transgender': 'g'})

ax = data.plot(kind = 'scatter', x = 'il13', y = 'ifngamma',
               s = 'Age', # Make the size proportional to age
               c = colors)

One can also make a _GIANT_ matrix of different comparisons.

In [None]:
# It is helpful to pick columns first to prevent a figure explosion
cols = ['Age', 'gcsf', 'gmcsf',
       'ifnalpha', 'ifngamma', 'il10', 'il12', 'il13', 'il15', 'il17',
       'il1beta', 'il2', 'il2r', 'il4', 'il5', 'il6', 'il7', 'il8', 'ilra']

pd.plotting.scatter_matrix(data[cols], figsize=(10, 10));

We can also get a numeric summary of these correlations.

Method:
 - `method = 'pearson'` - Pearson's correlation is ideal for continuous variables that have a linear relationship and are normally distributed.
 - `method = 'kendall'` - Kendall's tau is suitable for ordinal data or when dealing with non-linear relationships, especially in small samples or when data contains ties.
 - `method = 'spearman'` - Spearman's rank is best used with ordinal or non-normal data to assess monotonic relationships, being robust to outliers.
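
To make the difference concrete, here is a small sketch on made-up data where `y` always increases with `x`, but not in a straight line. The rank-based Spearman method reports a perfect relationship, while Pearson is pulled down by the curvature. The `toy` table is hypothetical and is not the cytokine data.

```python
import numpy as np
import pandas as pd

# Made-up data: y rises with x every step, but exponentially
x = np.arange(1, 11)
toy = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson_r = toy.corr(method = 'pearson').loc['x', 'y']
spearman_r = toy.corr(method = 'spearman').loc['x', 'y']

print(pearson_r, spearman_r)
```

When the two methods disagree strongly on real data, it is a hint to look at the scatterplot before picking one.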


In [None]:
cross_corr = data[cols].corr(method = 'pearson')

# Using .style we can create a visually accented table
cross_corr.style.background_gradient(cmap='RdBu', vmin=-1, vmax=1)

`cross_corr` is just a `DataFrame`, which means we can extract columns.

In [None]:
# How does each cytokine correlate with Age?

cross_corr['Age'].plot(kind='bar')

These exercises should provide a basic set of plotting tools to visualize tabular data.
Next week we'll explore more advanced 'statistical plotting' with the `seaborn` library.
This will add additional features like better faceting across groups, confidence intervals through bootstrapping, better legends, and more control to our plots.
In future weeks we'll also explore how to assess statistical significance across groups and strategies for finding correlated parameters.

## Matplotlib Gotchas

![Rakes](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExbDdhNHg4NjE2N2s1cnd2MTdhYjV3NGttaThwbHE5MG93MDIydWhwdyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/3o6Mbtdd7dhvbvugg0/giphy.gif)

While Matplotlib is great, it is sometimes incredibly frustrating.
Here's a handful of common rakes that I run across.

How do you get plots out of here?

In [None]:
# Make the plot and grab the axis object

ax = data['eotaxin'].plot(kind = 'hist')

ax.set_xlabel('eotaxin')

# Get the Figure handle this axis is on
fig = ax.figure


# Save the figure
fig.savefig('eotaxin_hist.png', # Can be any extension, but you probably want PNGs
            dpi = 50 # Good quality for viewing and debugging, use 300 for publications
            )

Overlapping labels.

In [None]:
data[['Sex', 'egf', 'eotaxin', 'hgf', 'gmcsf']].plot(kind='box', by = 'Sex')

In [None]:
# Grab the series of axis objects
ax_ser = data[['Sex', 'egf', 'eotaxin', 'hgf', 'gmcsf']].plot(kind='box', by = 'Sex')

# Somehow get the figure object
fig = ax_ser.iloc[0].figure

# Re-layout the figure
fig.tight_layout()

Rotating labels.

In [None]:
# Grab the series of axis objects
ax_ser = data[['Sex', 'egf', 'eotaxin', 'hgf', 'gmcsf']].plot(kind='box', by = 'Sex')

# Somehow get the figure object
fig = ax_ser.iloc[0].figure

# Create a function that fixes each axis
# lambda ax: ax.tick_params(axis='x', labelrotation=90)

# Apply that function across all axes BEFORE the re-layout
ax_ser.map(lambda ax: ax.tick_params(axis='x', labelrotation=90))

# Re-layout the figure
fig.tight_layout()

---------------------------------------------

## Submission

You do not need to submit this walkthrough notebook.
Simply complete the quiz.