In [None]:
# Setting up the Colab environment. DO NOT EDIT!
import os
import warnings
warnings.filterwarnings("ignore")

try:
    import otter

except ImportError:
    ! pip install -q otter-grader==4.0.0
    import otter

if not os.path.exists('walkthrough-tests'):
    zip_files = [f for f in os.listdir() if f.endswith('.zip')]
    assert len(zip_files)>0, 'Could not find any zip files!'
    assert len(zip_files)==1, 'Found multiple zip files!'
    ! unzip {zip_files[0]}

grader = otter.Notebook(colab=True,
                        tests_dir = 'walkthrough-tests')

# Walkthrough

## Introduction

An emerging area of biomedical research over the past decade has been the human microbiome.
This field studies the commensal bacteria that inhabit our bodies and how they influence our health.
These bacteria can be found everywhere, from our digestive system to our skin to our ears.
Often, hundreds of different bacterial species can be isolated from a single body site of a single individual.
Disease can be caused or exacerbated by an imbalance in these species.

This week, we will explore the data generated by researchers here at Drexel.
From a collection of 12 patients, they measured the microbiome of 12 body sites of the inner ear.
Some of these patients had inner ear infections (_otitis media_) and different disease outcomes.
This week, we will use Python to generate pivot tables and bar plots to understand whether the microbiome is impacted by disease outcome.

## Learning Objectives
At the end of this learning activity you will be able to:
 - Practice using `query` to extract data from a larger table.
 - Calculate summary values across a `pd.DataFrame` using methods like `sum`, `mean`, and `max`.
 - Utilize `pd.DataFrame.groupby` to aggregate and transform data by group.
 - Use `pd.merge` to combine data held in two different tables.
 - Employ `pd.pivot_table` and `pd.melt` to reshape and summarize data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
data = pd.read_csv('microbiome_phylum_data.csv', sep = '\t')
data

## Exploring a single patient

First, we'll explore the distribution of bacteria in a single individual across body sites.
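
As a reminder of how `query` behaves, here is a minimal sketch on a made-up toy table. The column values and the names `toy`, `toy_pat_2`, `toy_high`, and `min_count` are assumptions for illustration only and do not come from the real dataset.

```python
import pandas as pd

# Toy table with made-up numbers, loosely shaped like the microbiome data
toy = pd.DataFrame({
    'Patient': [1, 1, 2, 2],
    'Location': ['site_a', 'site_b', 'site_a', 'site_b'],
    'Firmicutes': [10, 30, 5, 15],
})

# Keep only the rows where the Patient column equals 2
toy_pat_2 = toy.query('Patient == 2')

# Python variables can be referenced inside the query string with @
min_count = 10
toy_high = toy.query('Firmicutes >= @min_count')
```

The query string is evaluated against the column names, so no extra quoting of `Patient` is needed.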

### Q1: Extract the information for patient 3116


|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 3  |

_Points:_ 2

In [None]:
pat_3116 = ...

In [None]:
pat_3116.head()

In [None]:
grader.check("q1_extract_single")

### Q2: Calculate the average count across regions for each phylum for patient 3116.


|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 4  |

_Points:_ 2

In [None]:
q2_Actinobacteria_mean = ...
q2_Bacteroidetes_mean = ...
q2_Firmicutes_mean = ...
q2_Proteobacteria_mean = ...

In [None]:
grader.check("q2_summary_vals")

## Summarizing by grouping

Now that we've looked at the summary values for a single individual, how would we look at this for each individual?
Copy-pasting that over and over is unsustainable. Fortunately, `DataFrame`s have useful methods for dealing with this problem.

All of these fall into the same basic strategy.

**Split** - **Apply** - **Combine**.
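
To make the three steps concrete, here is a rough sketch that performs them by hand on a toy table and compares the result to the one-line `groupby`. The data and variable names are made up for illustration, and this is not how pandas implements grouping internally.

```python
import pandas as pd

# Toy table with made-up numbers
toy = pd.DataFrame({
    'Patient': [1, 1, 2, 2],
    'Firmicutes': [10.0, 30.0, 5.0, 15.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
pieces = {key: group for key, group in toy.groupby('Patient')}

# Apply: compute one summary value per piece
means = {key: group['Firmicutes'].mean() for key, group in pieces.items()}

# Combine: gather the per-group results back into a single object
by_hand = pd.Series(means, name='Firmicutes')

# The usual groupby "sentence" does all three steps at once
one_liner = toy.groupby('Patient')['Firmicutes'].mean()
```

Both `by_hand` and `one_liner` hold one mean per patient.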

In [None]:
# Split

grouped_patients = data.groupby('Patient')

In [None]:
# Apply - Combine

# Capitalizing constants is useful if you will re-use them often.
PHYLUM_COLS = ['Actinobacteria', 'Bacteroidetes',
                'Firmicutes', 'Proteobacteria']

mean_vals = grouped_patients[PHYLUM_COLS].mean()
mean_vals

In [None]:
# This is commonly done in a single "sentence"

mean_vals = data.groupby('Patient')[PHYLUM_COLS].mean()

### Q3: Calculate the average counts of each phylum by body site.

|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 4  |

_Points:_ 2

In [None]:
q3_mean_phylum_site = ...

In [None]:
q3_mean_phylum_site

In [None]:
grader.check("q3_mean_by_site")

There are a number of different built-in summary functions like this.

In [None]:
data.groupby('Patient')[PHYLUM_COLS].median()

In [None]:
data.groupby('Patient')[PHYLUM_COLS].count()

In [None]:
data.groupby('Patient')[PHYLUM_COLS].max()

You can see an extensive list of available summary functions in the [pandas documentation](https://pandas.pydata.org/docs/reference/groupby.html#dataframegroupby-computations-descriptive-stats).

If there isn't a function that does what you want, you can also make your own.

Here is a simple one that scales the data to a _unit-norm_: each value is shifted by its group mean and divided by its group standard deviation.

In [None]:
def unit_norm(values):
    "Given a series, return a scaled version"

    mu = values.mean()
    std = values.std()

    return (values-mu)/std

unit_normed_data = data.groupby('Patient', as_index=False)[PHYLUM_COLS].transform(unit_norm)

In [None]:
unit_normed_data

Notice I used the `transform` method here instead of a built-in summary method like `mean`.

When applying custom functions to groups of data there are three different methods depending on your final output shape:

* `.aggregate()` or `.agg()` - Each group of data produces a single summary number. Commonly used to summarize groups.
* `.transform()` - The output will have the same number (and order) of rows as the input. Commonly used for normalizations.
* `.apply()` - Everything else.
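
The difference in output shape is easiest to see side by side. This is a small sketch on made-up data. The lambdas are only examples of custom functions, not something the assignment requires.

```python
import pandas as pd

# Toy table with made-up numbers: 2 rows for patient 1, 3 rows for patient 2
toy = pd.DataFrame({
    'Patient': [1, 1, 2, 2, 2],
    'Firmicutes': [10.0, 30.0, 5.0, 15.0, 40.0],
})
grouped = toy.groupby('Patient')['Firmicutes']

# agg: one value per group (2 groups -> 2 rows)
ranges = grouped.agg(lambda values: values.max() - values.min())

# transform: same number and order of rows as the input (5 rows)
fractions = grouped.transform(lambda values: values / values.sum())

# apply: any other shape, here the two largest values of each group (4 rows)
top_two = grouped.apply(lambda values: values.nlargest(2))

print(len(ranges), len(fractions), len(top_two))
```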

## Merging data

Now we come to a common problem: our sample information is in a different file.

In [None]:
sample_info = pd.read_csv('sample_info.csv')
sample_info.head()

Now that we have two `DataFrame`s with a common key we can use `pd.merge`.
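
Before merging the real tables, it may help to see how the `how` argument changes the result. The sketch below uses two made-up tables (`counts` and `info`) whose key columns have different names, mirroring `Patient` and `PID`. The values are invented for illustration.

```python
import pandas as pd

# Made-up measurement table: patient 3 has no entry in the info table
counts = pd.DataFrame({
    'Patient': [1, 1, 2, 3],
    'Firmicutes': [10, 30, 5, 8],
})

# Made-up sample information: patient 4 has no measurements
info = pd.DataFrame({
    'PID': [1, 2, 4],
    'severe_disease': [True, False, True],
})

# inner: keep only keys present in BOTH tables (patients 1 and 2)
inner = pd.merge(counts, info, left_on='Patient', right_on='PID', how='inner')

# left: keep every row of the left table; unmatched rows are filled with NaN
left = pd.merge(counts, info, left_on='Patient', right_on='PID', how='left')
```

With `how='inner'`, patients 3 and 4 both disappear. With `how='left'`, patient 3 is kept but has missing sample information.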

In [None]:
merged_info = pd.merge(data, sample_info,
                       left_on = 'Patient', # The column of the key in data
                       right_on = 'PID', # The column of the key in sample_info
                       how = 'inner') # Keep only those in both

In [None]:
merged_info.head()

### Q4: Calculate the average counts of each phylum by `severe_disease`.

|               |    |
| --------------|----|
| Points        | 2  |
| Public Checks | 4  |

_Points:_ 2

In [None]:
q4_severe_means = ...

In [None]:
grader.check("q4_servere")

We can also do more advanced things like this:

In [None]:
merged_info.groupby(['Location', 'severe_disease'])[PHYLUM_COLS].aggregate(['mean', 'std'])

Here I've broken things down by body-site and disease status and calculated both a mean and standard deviation.
In future lectures we will explore how to quantify this with a significance test.
For now, we'll leave it as a visual comparison.
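
Passing a list of functions to `aggregate` produces columns with two levels: the phylum and the statistic. Here is a brief sketch, on made-up data, of one way to pull values back out of such a table. The names `toy`, `summary`, `firm_mean`, and `all_means` are illustrative assumptions.

```python
import pandas as pd

# Toy table with made-up numbers
toy = pd.DataFrame({
    'severe_disease': [True, False, True, False],
    'Firmicutes': [10.0, 20.0, 30.0, 50.0],
    'Proteobacteria': [1.0, 2.0, 3.0, 5.0],
})

summary = (toy.groupby('severe_disease')[['Firmicutes', 'Proteobacteria']]
              .aggregate(['mean', 'std']))

# Each column label is a (phylum, statistic) pair
firm_mean = summary[('Firmicutes', 'mean')]

# xs selects one statistic across every phylum
all_means = summary.xs('mean', axis=1, level=1)
```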

## Pivoting & Melting Dataframes

This is a process of reshaping, and optionally summarizing, your data as you convert it between `wide` and `long` format.
These techniques are often required for generating different types of plots.

These are best shown by example.

### Pivoting

`long` -> `wide`

In [None]:
pd.pivot_table(merged_info,
               index = 'Patient',
               columns = 'Location',
               values = 'Firmicutes',
               aggfunc = 'mean')

This took our "long" data format, in which each row represents the observation at one site of one person,
and converted it into a "wide" data format in which each row is a patient and each column holds the mean Firmicutes count at one location.
`NaN`s represent missing information.
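
Here is the same idea on a tiny made-up table, small enough to check by eye. Patient 1 has two rows at `site_a`, so `aggfunc = 'mean'` averages them, and patient 2 has no `site_b` row, so that cell becomes `NaN`. All values and names are assumptions for illustration.

```python
import pandas as pd

# Toy long-format table with made-up numbers
long_toy = pd.DataFrame({
    'Patient': [1, 1, 1, 2],
    'Location': ['site_a', 'site_a', 'site_b', 'site_a'],
    'Firmicutes': [10.0, 20.0, 30.0, 5.0],
})

wide_toy = pd.pivot_table(long_toy,
                          index = 'Patient',
                          columns = 'Location',
                          values = 'Firmicutes',
                          aggfunc = 'mean')
wide_toy
```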

We also had to "give up" some information for this transformation: the table contains only Firmicutes.

One _can_ do this to include more information:

In [None]:
pd.pivot_table(merged_info,
               index = 'Patient',
               columns = 'Location',
               values = ['Actinobacteria', 'Firmicutes'],
               aggfunc = 'mean')

But that is usually not a great idea.

### Melting

`wide` -> `long`

Our data is part `long` and part `wide` (like most real datasets).
In some plotting instances we may want to make it "longer" by having each phylum be a different row instead of a different column.

In [None]:
pd.melt(merged_info,
        id_vars = ['Patient', 'Location'], # The things you want preserved in each row
        value_vars = PHYLUM_COLS, # The columns you want to melt
        var_name = 'Phylum', # The name of the column that will have the value_var name
        value_name = 'Counts') # The name of the column that will have the value

We'll explore these in more detail as we move into plotting next week.
This is just a taste.
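
To tie the two directions together, here is a minimal sketch on a made-up table that melts `wide` data into `long` form and then pivots it back. The names `wide`, `long`, and `back` are illustrative, and the round trip only works this cleanly because each patient and phylum pair appears exactly once.

```python
import pandas as pd

# Toy wide-format table with made-up numbers
wide = pd.DataFrame({
    'Patient': [1, 2],
    'Firmicutes': [10, 5],
    'Proteobacteria': [3, 7],
})

# wide -> long: one row per (Patient, Phylum) pair
long = pd.melt(wide,
               id_vars = ['Patient'],
               value_vars = ['Firmicutes', 'Proteobacteria'],
               var_name = 'Phylum',
               value_name = 'Counts')

# long -> wide: each cell has a single value, so the mean returns it unchanged
back = pd.pivot_table(long,
                      index = 'Patient',
                      columns = 'Phylum',
                      values = 'Counts',
                      aggfunc = 'mean')
```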

---------------------------------------------

## Submission

Submit this assignment through BBLearn.