In [None]:
# Setting up the Colab environment. DO NOT EDIT!
try:
  from applied_biostats import setup_environment
except ImportError:
  !pip -q install applied-biostats-helper
  from applied_biostats import setup_environment
finally:
  grader = setup_environment('Module05_lab')

# Lab

## Introduction

This week we will look at data from a cohort of People Living with HIV (PLwH) here at Drexel.

As we discussed in the introduction, this data was collected to provide a resource for many projects across the fields of HIV, aging, inflammation, neurocognitive impairment, and immune function, as well as future projects that have not yet been envisioned.
In this lab we will explore a collection of cytokines and chemokines measured by a Luminex panel of common biomarkers of inflammation.

## Learning Objectives
At the end of this learning activity you will be able to:
 - Practice creating barplots and scatterplots.
 - Employ `DataFrame.corr` to measure the correlation between variables.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
data = pd.read_csv('cytokine_data.csv')
data.head()

### Q1: Explore the neurological function of the participants in the dataset.

Create a barplot of the counts for each of the `neuro_screen_impairment_level` categories.

 - Set the y-axis limits to 0 and 150 and the y-label to `'Participants'`
 - Set the x-label to `'Impairment Level'`

**Checked variables:**
 * `q1_ax` - A matplotlib Axes object with a bar plot of impairment level counts
   - Should have x-label: 'Impairment Level'
   - Should have y-label: 'Participants'
   - Should have y-axis limits from 0 to 150
 * `q1_smallest` - A string of the impairment level with the fewest participants
 * `q1_largest` - A string of the impairment level with the most participants

<details><summary>Hint</summary>
Use value_counts() to count each impairment level, then .plot(kind='bar') to create the plot. Use set_xlabel(), set_ylabel(), and set_ylim() to adjust labels and limits. See Module 5 walkthrough for plotting examples.</details>
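
The steps in the hint can be sketched on toy data. This is a minimal illustration, not the lab answer: the `toy` DataFrame, its `group` column, and the label text are all made up for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical): a categorical column unrelated to the lab dataset
toy = pd.DataFrame({'group': ['a', 'b', 'a', 'c', 'a', 'b']})

# Count how many rows fall in each category
counts = toy['group'].value_counts()

# Series.plot returns the matplotlib Axes, which can then be adjusted
fig, toy_ax = plt.subplots()
counts.plot(kind='bar', ax=toy_ax)
toy_ax.set_xlabel('Group')
toy_ax.set_ylabel('Rows')
toy_ax.set_ylim(0, 5)

# The most and least common categories can also be read from the counts
largest = counts.idxmax()
smallest = counts.idxmin()
```

Holding on to the Axes object is what makes the later `set_xlabel`, `set_ylabel`, and `set_ylim` calls possible.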

|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 4  |
| Hidden Tests | 2  |

_Points:_ 2

In [None]:

# Generate the figure
q1_ax = ...


# Adjust labels and limits

# Looking at the plot, which impairment level has the most participants?
q1_largest = ...

# Looking at the plot, which impairment level has the fewest participants?
q1_smallest = ...


In [None]:
# DO NOT REMOVE!
plt.close()
# For the grader

In [None]:
grader.check("q1_impairement_plot")

### Q2: Consider how pro-inflammatory markers are related to neurological impairment.

Examine the expression of the following cytokines:
 - `tnfalpha`
 - `il6`
 - `mcp1`
 - `mip1alpha`

**Checked variables:**
 * `q2_axs` - A set of 4 matplotlib Axes objects (one for each cytokine)
   - Should have boxplots showing each cytokine vs neuro_screen_impairment_level
   - Each plot should be titled with the cytokine name
 * `q2_ans` - The name of the cytokine that seems most strongly associated with impairment (string)

<details><summary>Hint</summary>
Use a for loop to create 4 subplots, one for each cytokine. Use .boxplot() or seaborn's boxplot to show each cytokine grouped by impairment level. Visually identify which shows the clearest differences across groups. See Module 5 walkthrough for boxplot examples.</details>
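
A rough sketch of the loop described in the hint, on toy data. The `toy` DataFrame, the `marker_a` and `marker_b` columns, and the `low`/`high` groups are hypothetical and stand in for the cytokines and impairment levels. It uses `DataFrame.boxplot`, which is only one of several ways to draw grouped boxplots.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical): two made-up markers measured in two made-up groups
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'group': np.repeat(['low', 'high'], 20),
    'marker_a': rng.normal(loc=5, size=40),
    'marker_b': rng.normal(loc=5, size=40),
})
markers = ['marker_a', 'marker_b']

# Mean of each marker within each group: rows are groups, columns are markers
toy_summary = toy.groupby('group')[markers].mean()

# Absolute difference in group means, one value per marker
toy_diff = (toy_summary.loc['high'] - toy_summary.loc['low']).abs()

# One subplot per marker, one box per group
fig, toy_axs = plt.subplots(1, len(markers), figsize=(8, 3))
for ax, col in zip(toy_axs, markers):
    toy.boxplot(column=col, by='group', ax=ax)
    ax.set_title(col)
fig.suptitle('')  # drop the automatic "Boxplot grouped by ..." super-title
```

The summary table gives the numbers behind the picture, so a visual impression from the boxplots can be checked against the group means.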



|               |    |
| --------------|----|
| Points        | 10 |
| Public Checks | 5  |
| Hidden Tests | 1  |

_Points:_ 10

In [None]:
# Use data.groupby(...) to take the mean of each cytokine for each neuro_screen_impairment_level
# The table should have each of the 4 cytokines as columns
#  and each row should be one of the different impairment levels.


q2_cytokine_summary = ...


In [None]:
# Use `.plot(kind = 'box', ...)` to create a set of boxplots for each cytokine split across each `neuro_screen_impairment_level` value.
# Each axis should be a cytokine
# Each category in each axis should be a neuro_screen_impairment_level


q2_axs = ...


In [None]:
# Which cytokine has the largest *absolute* difference in mean expression between `mild` and `none`?
# Answer as a string

q2_ans = ...

In [None]:
# DO NOT REMOVE!
plt.close()
# For the grader

In [None]:
grader.check("q2_pro_inflam")

### Q3: Hypothesis generation

One advantage of a cohort-style study is that the data can be used to generate new hypotheses to test.
Here, we have collected the cytokine expression of many people along with their BMI.
Use the `.corr()` method to find the correlation between BMI and all cytokines.
Then, generate a hypothesis about which 5 cytokines are most worth following up.

**Checked variables:**
 * `q3_cross_cor` - Correlation coefficients between BMI and all cytokines (Series)
 * `q3_top5` - A list of the 5 cytokine names with strongest correlations to BMI
 * `q3_bar_ax` - A bar plot showing the correlations

<details><summary>Hint</summary>
Select only cytokine columns, then calculate correlation with BMI using .corr()['bmi']. Sort by absolute value to find strongest correlations. Plot as a bar chart. Write a hypothesis about why these cytokines might be related to BMI. See Module 5 walkthrough for correlation examples.</details>
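
The correlate, rank, and plot workflow from the hint might look like this on toy data. Everything here is made up for illustration: the `target` column plays the role of BMI, and `strong_pos`, `strong_neg`, and `noise` play the role of cytokines.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical): one target column and three candidate columns
rng = np.random.default_rng(1)
x = rng.normal(size=100)
toy = pd.DataFrame({
    'target': x,
    'strong_pos': x + rng.normal(scale=0.1, size=100),
    'strong_neg': -x + rng.normal(scale=0.5, size=100),
    'noise': rng.normal(size=100),
})

# Pairwise Pearson correlations; keep the 'target' column and drop its
# trivial self-correlation of 1
toy_cor = toy.corr()['target'].drop('target')

# Bar plot of the signed correlation coefficients
fig, toy_bar_ax = plt.subplots()
toy_cor.plot(kind='bar', ax=toy_bar_ax)

# Rank by absolute value so strong negative correlations are not missed
ranked = toy_cor.abs().sort_values(ascending=False)
top2 = list(ranked.index[:2])

# Scatterplot of the target against its most correlated column
toy_scatter_ax = toy.plot(kind='scatter', x='target', y=top2[0])
```

Ranking on the absolute value matters because a correlation of -0.9 is a stronger relationship than one of +0.3, even though it sorts lower numerically.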

|               |    |
| --------------|----|
| Points        | 10 |
| Public Checks | 6  |
| Hidden Tests | 2  |

_Points:_ 10

In [None]:
# Use this list of all cytokines in the dataset to answer the following questions
all_cytokines = list(data.columns[3:-5])
print(', '.join(all_cytokines))

In [None]:
# Calculate the cross correlation matrix that only includes bmi and all_cytokines


q3_cross_cor = ...


In [None]:
# Plot the correlation between BMI and all other columns as a bar plot

q3_bar_ax = ...

In [None]:
# Extract a Series of the top 5 cytokines
# Be sure to remove BMI

q3_top5 = ...

In [None]:
# Create a scatterplot between the bmi (on the x-axis) and the most correlated cytokine (on the y-axis)

q3_scatter_ax = ...

# Leave the axes labels as defaults for the grader

In [None]:
grader.check("q3_bmi_hypothesis_gen")

In [None]:
# DO NOT REMOVE!
plt.close()
# For the grader

With this information in hand, one could design more directed experiments to further understand whether these correlations are biologically meaningful.
This hypothesis-generating technique is useful in a number of ways.

<!-- BEGIN QUESTION -->

### Q4: Exploration

Use this technique to find correlations between cytokines and any other demographic variable.

Include at least one barplot of correlation coefficients.
If your variable is categorical, show a boxplot of the most correlated cytokine.
If your variable is continuous, instead show a scatterplot.

With each figure, include a text-box with a figure caption.

There is no grader for this question.

**Checked variables:**
 * None - This question is manually graded

<details><summary>Hint</summary>
Choose a demographic variable (e.g., age, sex, education). Calculate correlations between it and all cytokines. Create visualizations showing the relationships. Include figure captions explaining what you observe. See Module 5 walkthrough for examples of exploratory analysis.</details>
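
For a categorical variable, `.corr()` needs numbers rather than labels. One possible approach, shown here only as a sketch on toy data, is to encode a two-level category as 0/1 before correlating. The `status` column, the encoding choice, and the marker names are assumptions made for this example; variables with more than two levels may call for a different treatment.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical): a two-level category and two made-up markers
rng = np.random.default_rng(2)
toy = pd.DataFrame({'status': np.repeat(['no', 'yes'], 50)})

# Encode the category as 0/1 so it can be used in a correlation (an assumption)
toy['status_code'] = (toy['status'] == 'yes').astype(int)

# marker_a is built to differ between groups; marker_b is pure noise
toy['marker_a'] = toy['status_code'] * 2.0 + rng.normal(size=100)
toy['marker_b'] = rng.normal(size=100)

# Correlation of each marker with the encoded category
numeric_cols = ['status_code', 'marker_a', 'marker_b']
toy_cor = toy[numeric_cols].corr()['status_code'].drop('status_code')

# Marker with the largest absolute correlation
best = toy_cor.abs().idxmax()

# Boxplot of that marker split by the original category labels
fig, toy_ax = plt.subplots()
toy.boxplot(column=best, by='status', ax=toy_ax)
toy_ax.set_title(best)
fig.suptitle('')  # drop the automatic super-title
```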

_Points:_ 10

<!-- END QUESTION -->



In [None]:
# DO NOT REMOVE!
plt.close()
# For the grader

--------------------------------------------

## Submission

Check:
 - That all tables and graphs are rendered properly.
 - That the code completes without errors when run with `Restart & Run All`.
 - That all checks **pass**.

Then save the notebook and use `File` -> `Download` -> `Download .ipynb`. Upload this file to BBLearn.
