In [1]:
# Setting up the Colab environment. DO NOT EDIT!
try:
  from applied_biostats import setup_environment
except ImportError:
  !pip -q install applied-biostats-helper
  from applied_biostats import setup_environment
finally:
  grader = setup_environment('Module04_lab')

# Lab

## Introduction

In this session, we will delve into the relationship between the microbiome of the sinus system and the severity and duration of sinus infections.
We have classified our patients into three groups: those with typical infections that resolve without complications on standard therapy, those with severe infections that require aggressive medical intervention, and those with persistent infections that relapse shortly after initial symptoms resolve.
Using pivot tables we will compare the microbiomes of these different groups and explore the clinical implications of our findings.
Throughout this lab, you will have the opportunity to practice adding data to a `DataFrame`, employing pivot tables to compare microbiomes, describing biostatistical results, and formulating clinical uses for your analysis.

Let's get started!

In this learning activity you will:
  - Practice adding data into a `DataFrame`
  - Employ pivot tables to compare microbiomes across disease outcomes
  - Describe biostatistical results in "paragraph form"
  - Formulate a clinical use for your analysis

In [2]:
import numpy as np
import pandas as pd

In [3]:
data = pd.read_csv('microbiome_phylum_data.tsv', delimiter = '\t')
data

Unnamed: 0,Patient,Location,CollectionType,Actinobacteria,Bacteroidetes,Firmicutes,Proteobacteria,num_otu,Predominant
0,3062,Nasal Vestibule,Swab,2516,44,14987,0,16,Firmicutes
1,3094,Nasal Vestibule,Swab,103,0,1397,0,15,Firmicutes
2,3095,Nasal Vestibule,Swab,1474,0,5510,29,21,Firmicutes
3,3115,Nasal Vestibule,Swab,0,0,5480,0,2,Firmicutes
4,3116,Nasal Vestibule,Swab,2,0,2324,1,4,Firmicutes
...,...,...,...,...,...,...,...,...,...
103,3094,Sphenoid Tissue,Biopsy,1540,0,784,13,14,Actinobacteria
104,3095,Sphenoid Tissue,Biopsy,670,0,703,0,14,Firmicutes
105,3116,Sphenoid Tissue,Biopsy,309,9,6709,495,17,Firmicutes
106,3117,Sphenoid Tissue,Biopsy,1812,0,1954,129,18,Firmicutes


### Q1: Merge the microbiome `data` table with the sample information

We are interested in exploring the relationship between the microbiome of the sinus system and the severity and duration of a sinus infection.
To do this, we need to first classify each patient into one of three groups:
 - those with a `typical` infection that resolved without complications on standard therapy
 - those with a `severe` infection that required aggressive medical intervention
 - those with a `persistent` infection that relapsed shortly after the initial symptoms resolved

By labeling each row of our table with the appropriate outcome, we can begin to compare the microbiomes of these different groups and consider the potential clinical implications of our analysis.

This step is crucial because it allows us to accurately analyze and interpret the data, and is a necessary foundation for the rest of the lab.

Use pandas to load the `sample_info.csv` file and merge it with the microbiome table loaded above (`data`), storing the result in `merged_data`.
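
As a minimal sketch of how a merge on differently named key columns behaves, the example below uses two small made-up tables (`toy_samples` and `toy_info` are hypothetical stand-ins, not the lab files). It assumes, as in the lab, that the patient identifier is called `Patient` in one table and `PID` in the other.

```python
import pandas as pd

# Toy stand-ins for the lab tables; all values are made up for illustration.
toy_samples = pd.DataFrame({
    'Patient': [1, 1, 2, 3],
    'Location': ['Sphenoid', 'Middle Meatus', 'Sphenoid', 'Sphenoid'],
})
toy_info = pd.DataFrame({
    'PID': [1, 2, 4],
    'disease_type': ['typical', 'severe', 'persistent'],
})

# The key columns have different names, so name each side explicitly.
toy_merged = pd.merge(toy_samples, toy_info,
                      left_on='Patient', right_on='PID',
                      how='inner')

# Patient 3 has no row in toy_info and PID 4 has no samples,
# so the inner merge keeps only the three rows for patients 1 and 2.
print(toy_merged)
```

With `how='inner'`, rows without a match on either side are silently dropped, so it is worth comparing the row count before and after a merge.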

|               |    |
| --------------|----|
| Points        | 3  |
| Public Checks | 5  |

_Points:_ 3

In [4]:
# BEGIN SOLUTION NO PROMPT

sample_info = pd.read_csv('sample_info.csv')
merged_data = pd.merge(data, sample_info,
                       left_on = 'Patient',
                       right_on = 'PID',
                       how = 'inner')

# END SOLUTION
""" # BEGIN PROMPT

# Load the sample information from sample_info.csv
# Merge that information with the microbiome table (`data`)

merged_data = ...

"""; # END PROMPT


In [None]:
grader.check("q1_add_outcomes")

### Q2: Determine the predominant phylum across regions.

Use a pivot table to count the number of unique patients that have `Actinobacteria` or `Firmicutes` as the `Predominant` phylum at each body site.
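
The difference between counting rows and counting unique patients can be sketched on a small made-up table (`toy_sites` is hypothetical). Here patient 1 contributes two rows at the same site, which is an assumption chosen to make the two aggregations disagree.

```python
import pandas as pd

# Made-up data: patient 1 has two Sphenoid samples with the same predominant phylum.
toy_sites = pd.DataFrame({
    'Patient': [1, 1, 2, 3, 3],
    'Location': ['Sphenoid', 'Sphenoid', 'Sphenoid',
                 'Middle Meatus', 'Middle Meatus'],
    'Predominant': ['Firmicutes', 'Firmicutes', 'Actinobacteria',
                    'Firmicutes', 'Firmicutes'],
})

# 'nunique' counts each patient once per (Location, Predominant) cell.
toy_unique = pd.pivot_table(toy_sites, index='Location', columns='Predominant',
                            values='Patient', aggfunc='nunique')

# 'count' counts rows, so repeated samples from one patient inflate the cell.
toy_rows = pd.pivot_table(toy_sites, index='Location', columns='Predominant',
                          values='Patient', aggfunc='count')

print(toy_unique)
print(toy_rows)
```

Cells with no matching rows come back as `NaN` rather than 0, which is why some entries in a pivot table of counts can be missing.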

|               |    |
| --------------|----|
| Points        | 5  |
| Public Checks | 7  |
| Hidden Tests  | 1  |

_Points:_ 5

In [10]:
# BEGIN SOLUTION NO PROMPT
q2_pivot = pd.pivot_table(merged_data,
                          index = 'Location',
                          columns = 'Predominant',
                          values = 'Patient',
                          aggfunc = 'nunique')
# END SOLUTION
""" # BEGIN PROMPT

# Create the pivot table described in the question.
# Pay attention to using the correct `index`, `columns`, `values`, and `aggfunc` parameters
# The rows should be body sites
# The columns should be phyla
# The values should be the number of unique patients with that predominant phylum at that body site

q2_pivot = ...

"""; # END PROMPT

In [11]:
q2_pivot

Location,Actinobacteria,Firmicutes
Ethmoid Culture (Deep to Ethmoid Bulla),2.0,10.0
Ethmoid Tissue (Deep to Ethmoid Bulla),6.0,3.0
Head of Inferior Turbinate Tissue,,11.0
Maxillary Sinus,2.0,2.0
Maxillary Sinus Tissue,1.0,2.0
Middle Meatus,,11.0
Nasal Vestibule,,10.0
Sphenoethmoidal Recess Tissue,3.0,6.0
Sphenoid,3.0,9.0
Sphenoid Tissue,2.0,3.0


In [12]:
# Which regions have at least twice as many patients with Firmicutes as predominant relative to Actinobacteria
# This should be a subset of the q2_pivot DataFrame
q2_firmi_regions = q2_pivot.query('Firmicutes >= 2*Actinobacteria') # SOLUTION
q2_firmi_regions

Location,Actinobacteria,Firmicutes
Ethmoid Culture (Deep to Ethmoid Bulla),2.0,10.0
Maxillary Sinus Tissue,1.0,2.0
Sphenoethmoidal Recess Tissue,3.0,6.0
Sphenoid,3.0,9.0
Superior Meatus,2.0,10.0


In [None]:
grader.check("q2_count_pivot")

### Q3: Which body site has the largest increase in Actinobacteria when comparing typical and severe disease outcomes?

Find which body site has the largest increase in Actinobacteria when comparing typical and severe disease outcomes.
Utilize pivot tables to compare the relative abundances of Actinobacteria across disease states and body sites.
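
The general pattern (average per site and disease state, take a difference, then find the largest value) can be sketched on made-up numbers. The table `toy_abund` and its values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Made-up counts for two body sites and two disease states.
toy_abund = pd.DataFrame({
    'Location': ['Sphenoid', 'Sphenoid', 'Sphenoid',
                 'Middle Meatus', 'Middle Meatus', 'Middle Meatus'],
    'disease_type': ['typical', 'typical', 'severe',
                     'typical', 'severe', 'severe'],
    'Actinobacteria': [100, 300, 900, 50, 150, 250],
})

# pivot_table averages by default (aggfunc='mean') when no aggfunc is given.
toy_means = pd.pivot_table(toy_abund, index='Location',
                           columns='disease_type', values='Actinobacteria')

# Positive values mean more Actinobacteria in severe than in typical disease.
toy_diff = toy_means['severe'] - toy_means['typical']

# idxmax returns the index label (the body site) of the largest difference.
toy_top_site = toy_diff.idxmax()
print(toy_diff)
print(toy_top_site)
```

Because `idxmax` returns the row label rather than the value, the result is already a text string naming the body site.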

|               |    |
| --------------|----|
| Points        | 5  |
| Public Checks | 8  |
| Hidden Tests  | 1  |

_Points:_ 5

In [21]:
# Create a pivot table which averages the count of Actinobacteria across each patient
# for each body-site and disease type
# BEGIN SOLUTION NO PROMPT
q3_pivot = pd.pivot_table(merged_data,
                          index = 'Location',
                          columns = 'disease_type',
                          values = 'Actinobacteria')
# END SOLUTION
""" # BEGIN PROMPT
q3_pivot = ...

"""; # END PROMPT

In [22]:
# Add a relative_abundance column
# This should be the difference between the severe and typical columns
q3_pivot['relative_abundance'] = q3_pivot['severe'] - q3_pivot['typical'] # SOLUTION

In [23]:
# Which body site has the largest *increase* in Actinobacteria in those with *severe* disease?

# Display the table above (optionally with sorting)
# Or find it programmatically: https://pandas.pydata.org/docs/search.html?q=idxmax
# Answer as a text string.

q3_ans = q3_pivot['relative_abundance'].idxmax()  # SOLUTION

In [None]:
grader.check("q3_mean_pivot")

The above analysis describes a population level result.
A _sample_ of people with severe disease have more Actinobacteria in some regions than those that have typical disease.
However, that tells us little about the impact on an _individual_.
Let's reframe this into a clinical application.

Can you use the amount of Actinobacteria as a predictor of disease?

### Q4: Which tissues are "swabbable"?

If we would like to use microbiome sampling as a clinical assay to detect severe infections, it would be helpful if those areas are "easy to access".
This dataset is a collection of samples that came from both biopsies and swabs.
Create a subset of the data that only contains `Swab` samples.
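
Boolean indexing and `query()` are two routes to the same subset. The sketch below shows both on a small made-up table (`toy_collect` is hypothetical), along with the trick of averaging a boolean column to get a fraction.

```python
import pandas as pd

# Made-up samples: three swabs and one biopsy.
toy_collect = pd.DataFrame({
    'CollectionType': ['Swab', 'Biopsy', 'Swab', 'Swab'],
    'Actinobacteria': [10, 20, 30, 40],
})

# Boolean indexing: build a True/False mask, then select the True rows.
is_swab = toy_collect['CollectionType'] == 'Swab'
by_mask = toy_collect[is_swab]

# query(): the same filter written as a string expression.
by_query = toy_collect.query('CollectionType == "Swab"')

# True counts as 1 and False as 0, so the mean of the mask is a fraction.
toy_fraction = is_swab.mean()
print(by_mask.equals(by_query), toy_fraction)
```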

|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 6  |
| Hidden Tests  | 1  |

_Points:_ 2

In [33]:
# Use boolean indexing or a query to create a new table containing only the samples collected by swab.
swabbable_data = merged_data.query('CollectionType == "Swab"')  # SOLUTION

In [34]:
# What fraction of the data came from swab samples?
# Your answer should be between 0 and 1.
q4_fraction_swabbable = (merged_data['CollectionType'] == 'Swab').mean()


In [35]:
print(f'{q4_fraction_swabbable*100:0.2f}% of the data came from swabbable samples')

56.48% of the data came from swabbable samples


In [None]:
grader.check("q4_swabable")

### Q5: Which samples are _high_?

Previously, we saw that there was more Actinobacteria in certain regions in severe disease.
In this case, we'll consider **high** as being 1 standard-deviation above the average of typical patients for that region.



Create a new column in `swabbable_data` called `is_high` that is true if the Actinobacteria count is more than 1 standard deviation above the average of typical patients for that region.
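
One way to build a per-region cutoff and apply it row by row is sketched below on made-up data (`toy_regions` and the region names `A`, `B`, `C` are hypothetical). Region `C` has a single typical sample on purpose: pandas uses the sample standard deviation by default, which is undefined for one observation, so its cutoff is `NaN`.

```python
import pandas as pd

# Made-up counts; region C has only one typical sample.
toy_regions = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'disease_type': ['typical', 'typical', 'severe',
                     'typical', 'typical', 'severe',
                     'typical', 'severe'],
    'Actinobacteria': [10, 30, 100, 200, 400, 300, 50, 5000],
})

# The reference distribution comes from typical patients only.
toy_typical = toy_regions.query('disease_type == "typical"')
toy_stats = toy_typical.groupby('Location')['Actinobacteria'].agg(['mean', 'std'])
toy_cutoff = toy_stats['mean'] + toy_stats['std']

# map() looks up each row's Location in the cutoff Series.
toy_row_cutoff = toy_regions['Location'].map(toy_cutoff)

# Comparisons against NaN are False, so region C is never flagged.
toy_regions = toy_regions.assign(
    is_high=toy_regions['Actinobacteria'] > toy_row_cutoff)
print(toy_cutoff)
print(toy_regions)
```

A `NaN` cutoff does not raise an error, so a region can quietly end up with no high samples. It is worth inspecting the cutoff table before trusting the flags.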

|               |    |
| --------------|----|
| Points        | 5  |
| Public Checks | 2  |
| Hidden Tests  | 0  |

_Points:_ 5

In [43]:
# Isolate the subset of swabbable_data that come from typical disease types
typical_swab_data = swabbable_data.query('disease_type == "typical"') # SOLUTION

In [44]:
# Use `groupby()` to compute the mean and standard deviation of Actinobacteria for each region

typical_region_means = typical_swab_data.groupby('Location')['Actinobacteria'].agg('mean') # SOLUTION
typical_region_stds = typical_swab_data.groupby('Location')['Actinobacteria'].agg('std') # SOLUTION

In [45]:
# Combine the values to create a cutoff 
typical_region_cutoff = typical_region_means+typical_region_stds

In [46]:
typical_region_cutoff

Location
Ethmoid Culture (Deep to Ethmoid Bulla)    2519.803017
Maxillary Sinus                                    NaN
Middle Meatus                               369.665695
Nasal Vestibule                              97.540631
Sphenoid                                   3822.517645
Superior Meatus                            3416.455422
Name: Actinobacteria, dtype: float64

In [47]:
# Add a new column called `is_high`
# If you've done the above cells correctly, this will run.

# Get the appropriate cutoff into each row
row_cutoff = swabbable_data['Location'].map(typical_region_cutoff.get)

swabbable_data = swabbable_data.assign(is_high = swabbable_data['Actinobacteria'] > row_cutoff)

In [None]:
grader.check("q5_high_values")

### Q6: Which swabbable region has the highest positive predictive value when predicting **persistent** disease?

The positive predictive value (PPV) is the fraction of patients with a positive test result who truly have the condition: true positives divided by all positives (true positives plus false positives).
It indicates the likelihood that a patient "has the condition" given a positive test.
This makes PPV one of the most useful metrics when describing diagnostic tests to clinicians and patients.
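
As a small worked sketch of the definition, PPV = TP / (TP + FP), the example below computes it two ways on made-up predictions (`toy_tests` and the regions `A` and `B` are hypothetical): once by counting true and false positives explicitly, and once per region using a group-wise mean over the positive tests.

```python
import pandas as pd

# Made-up test results (is_high) and true outcomes (is_persistent).
toy_tests = pd.DataFrame({
    'Location': ['A'] * 4 + ['B'] * 4,
    'is_high': [True, True, True, False, True, True, False, False],
    'is_persistent': [True, True, False, True, True, False, False, True],
})

# Explicit counts over the whole table.
true_pos = (toy_tests['is_high'] & toy_tests['is_persistent']).sum()
false_pos = (toy_tests['is_high'] & ~toy_tests['is_persistent']).sum()
toy_overall_ppv = true_pos / (true_pos + false_pos)

# Per region: keep only positive tests, then average the true outcome.
toy_positives = toy_tests[toy_tests['is_high']]
toy_region_ppv = toy_positives.groupby('Location')['is_persistent'].mean()

print(toy_overall_ppv)
print(toy_region_ppv)
```

Negative tests never enter the calculation, so PPV says nothing about how many persistent cases the test misses.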

Use the `is_high` column as a prediction of whether a patient has a persistent infection.
Calculate the PPV for each region.
In the provided space, place the region with the highest PPV and its calculated PPV.

|               |    |
| --------------|----|
| Points        | 10 |
| Public Checks | 3  |
| Hidden Tests  | 2  |

_Points:_ 10

In [50]:
# BEGIN SOLUTION NO PROMPT

swabbable_data['is_persistent'] = swabbable_data['disease_type'] == 'persistent'


region_ppvs = swabbable_data.query('is_high').groupby('Location')['is_persistent'].mean()

q6_highest_region = region_ppvs.idxmax()
q6_best_ppv = region_ppvs.max()

# END SOLUTION
""" # BEGIN PROMPT

# There are a number of ways to approach this problem
# Look back through the groupby and pivot table explanations


q6_highest_region = ...
q6_best_ppv = ...
"""; # END PROMPT

In [None]:
grader.check("q6_swabbable_ppv")

<!-- BEGIN QUESTION -->

### Q7: Context

Put these results into context.

_Points:_ 5

Write your solution here in this box and consider the following:
 - What are the number 2 and 3 swabbable regions when ranking by PPV?
 - How many patients are these results based on?
 - Consider the likelihood of persistent infections relative to the PPV. Is the test giving more information compared to a null assumption?

**SOLUTION**

In our data, only one other region had a PPV similar to the Superior Meatus: the Nasal Vestibule, at 62% PPV.
The remaining regions fell into the 50/50 range.

However, a persistent infection is likely much rarer than 50/50 among patients overall.
A PPV near 50% would then still sit above that baseline, which means that a positive test is still providing new information.
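
One way to make the comparison against a null assumption concrete is to put the PPV next to the overall rate of persistent infection and to report how many positive tests it rests on. The sketch below uses made-up data (`toy_cohort` is hypothetical), not the lab results.

```python
import pandas as pd

# Made-up cohort: 10 samples, 3 persistent infections, 4 positive tests.
toy_cohort = pd.DataFrame({
    'is_high': [True, True, True, True,
                False, False, False, False, False, False],
    'is_persistent': [True, True, False, False,
                      True, False, False, False, False, False],
})

# Null assumption: guess "persistent" at the overall base rate.
toy_base_rate = toy_cohort['is_persistent'].mean()

# PPV: the rate of persistent infection among positive tests only.
toy_ppv = toy_cohort.loc[toy_cohort['is_high'], 'is_persistent'].mean()

# The number of positive tests behind the PPV estimate.
toy_n_positive = int(toy_cohort['is_high'].sum())

print(f'base rate {toy_base_rate:.2f}, PPV {toy_ppv:.2f}, '
      f'based on {toy_n_positive} positive tests')
```

A PPV above the base rate suggests the test adds information, but with only a handful of positive tests the estimate can shift a lot with each additional patient.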

<!-- END QUESTION -->

--------------------------------------------

## Submission

Check:
 - That all tables and graphs are rendered properly.
 - Code completes without errors by using `Restart & Run All`.
 - All checks **pass**.
 
Then save the notebook and use `File` -> `Download` -> `Download .ipynb`. Upload this file to BBLearn.