# Problem 0: Calculating statistics with Pandas

In this problem you'll use `pandas` to calculate conditional probabilities on a set of simulated blood pressure GWAS data. The goal is to practice working with `pandas` dataframes, performing some of the same tasks that until now we've solved with basic Python syntax.

You will read in file data, calculate summary statistics, and determine conditional probabilities with `pandas` tools.

### Blood pressure GWAS
A recent study of Americans of Japanese ancestry (http://www.ncbi.nlm.nih.gov/pubmed/26476085) looked at the association of several so-called "longevity" genetic variants with blood pressure. The authors found that the SNP rs2802288 was associated with hypertension in women. Women who were homozygous for the minor allele (AA) were significantly less likely to have hypertension, defined as a systolic blood pressure above 140 mmHg, than women homozygous for the major allele (GG).

I didn't have access to the original data from the study, so I simulated the data for this problem based on the published means and standard deviations for each genotype. For simplicity, the data include only homozygous genotypes.

Make sure that the file `bloodpressure.txt` is in your ps4 folder. This file consists of two columns of data: a systolic blood pressure measurement and a genotype. (You can open this file from the Jupyter dashboard to take a look before writing your code.)



## 0.1 Import `pandas` and load file data
In the cell below, import `pandas` (as `pd`). Then use `pd.read_csv` to read in the data from the file `bloodpressure.txt`. This file is **tab-delimited**, so you'll need to include the separator argument indicating that `'\t'` is the delimiter. The argument to use is `sep='\t'`.

Load the data into a variable called `data`. What kind of Python object is `data`?
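
If you'd like to see how `sep='\t'` behaves before touching the real file, here is a minimal sketch on a made-up table. The text, the column names (`height`, `group`), and the variable `toy` are all hypothetical; `io.StringIO` simply lets `pd.read_csv` read from a string instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text standing in for a file on disk.
toy_text = "height\tgroup\n170\tX\n165\tY\n180\tX\n"

# io.StringIO lets read_csv treat the string like an open file.
toy = pd.read_csv(io.StringIO(toy_text), sep='\t')

print(type(toy))              # the kind of object read_csv returns
print(toy.columns.tolist())   # column names come from the first line
print(len(toy))               # number of data rows (the header is not counted)
```

With a real file, you would pass the filename as the first argument in place of the `io.StringIO(...)` object.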

In [None]:
# YOUR ANSWER HERE

# check the input - .head() displays the header and first few rows
print(type(data))
data.head()

In [None]:
assert len(data) == 261

## 0.2 Calculate allele frequencies

As you can see, the data table consists of two columns. The first is the systolic blood pressure and the second is the genotype (`AA` or `GG`). Notice that the columns are named (`BP` and `allele`). You can use these column names (as strings) to access those columns in the dataframe. For example, `data['allele']` would get you the second column of the dataframe.

When you have different levels of categorical data in a `pandas` dataframe, you can use the convenient function `.groupby('column_name')` to calculate summary statistics on different groups within your data. In this case, you can quickly determine the mean blood pressure by genotype with the following syntax:

```python
data.groupby('allele')['BP'].mean()
```
Look carefully at the syntax, and make sure you understand the meaning of all of the parentheses, square brackets, and periods. You've seen chained method calls like this before, with code like `line.strip().split()`.
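
One way to read a chained expression like this is to split it into separate steps. The sketch below does that on a small made-up dataframe; the column names `group` and `height` and all variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    'group': ['X', 'Y', 'X', 'Y'],
    'height': [170, 160, 180, 164],
})

grouped = toy.groupby('group')   # method call: split the rows by group label
heights = grouped['height']      # square brackets: pick one column within each group
means = heights.mean()           # method call: compute one mean per group

print(means)

# The chained one-liner gives the same result as the three steps above.
one_liner = toy.groupby('group')['height'].mean()
```

Each period hands the result of the previous step to the next one, which is why the one-liner and the three-step version agree.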

Run the code in the cell below and look at the output, which shows mean blood pressure for both genotypes. Let's say you wanted to assign those genotype means to their own variables. How could you use `dataframe.groupby()['column'].mean()`?

The trick is to use unpacking. Recall how unpacking works: you can put multiple variables on the **left** of the `=` operator to unpack an object into its elements. For example, the following code unpacks the list into two variables:

```python
item1, item2 = [10, 20]
```
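
Unpacking is not limited to lists. As a quick refresher, the sketch below (all names are hypothetical) unpacks a tuple and a `range`, and shows what happens when the number of names does not match the number of elements.

```python
# Unpacking a tuple works the same way as unpacking a list.
low, high = (10, 20)

# Any iterable can be unpacked, for example a range of two numbers.
first, second = range(2)

# The number of names must match the number of elements.
try:
    a, b = [1, 2, 3]
except ValueError as err:
    message = str(err)   # Python reports that there are too many values to unpack

print(low, high, first, second)
print(message)
```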

**In the scratch cell below**, unpack the output of `data.groupby('allele')['BP'].mean()` so that the mean blood pressure for each genotype is assigned to its own variable. (You might have to experiment a little.)

In [None]:
# SCRATCH CELL

Now use unpacking with `.groupby()` to answer the following: How many data points are there for each genotype? **Hint:** Rather than `.mean()`, the function `.count()` will be useful here.

In the cell below, use `.groupby()` and `.count()` to determine the count of each genotype in `data`. Unpack the output into the variables `aa` and `gg`.

In [None]:
# YOUR ANSWER HERE
print(aa, gg)

In [None]:
assert aa + gg == 261

## 0.3 Boolean masks

With dataframes, you often want to select rows of data that meet certain criteria. For example, you may want all rows with a `GG` genotype. A useful way to select data from a dataframe (or a numpy array) is to use a *boolean mask*: an object that holds one `True`/`False` value for each row of the dataframe. Placed within square brackets as an indexing statement, the mask selects the rows where its value is `True`.
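
To make the idea concrete, here is a minimal sketch on a made-up four-row dataframe. The column names `group` and `height` and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with four rows.
toy = pd.DataFrame({
    'group': ['X', 'Y', 'X', 'Y'],
    'height': [170, 160, 180, 164],
})

# Comparing a column to a value produces one True/False per row.
mask = toy['group'] == 'X'
print(mask)

# Indexing with the mask keeps only the rows where the mask is True.
x_rows = toy[mask]
print(x_rows)
```

The mask has as many entries as the dataframe has rows, while the selected dataframe contains only the matching rows.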

Here's an example to select all rows with the `GG` genotype:
```python
mask = data['allele'] == 'GG'
gg_data = data[mask]
```

Run this code in the cell below. What does the object `mask` look like? Print it out to see.

In [None]:
# SCRATCH CELL


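You can use masks together with other functions like `.mean()`. For example, to get the mean blood pressure of the selected rows (the `'BP'` column is selected first because the `allele` column holds strings, which cannot be averaged):

```python
data[mask]['BP'].mean()
```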

Also, while it's often helpful to define the mask as its own variable, that's not necessary. You can put the mask statement itself between the indexing brackets:

```python
gg_mean = data[data['allele'] == 'GG']['BP'].mean()
```

In the cell below, use a boolean mask to calculate the probability of high blood pressure (BP greater than 140) in the data. More specifically, do the following:

1) Define a boolean mask to select rows in which the `'BP'` value is > 140.

2) Use the mask to select those rows, and use `len()` to count the number of high blood pressure rows/subjects.

3) Calculate the probability of high blood pressure: Divide the number of subjects with high blood pressure by the total number of subjects. (Again, use `len()` to get the number of dataframe rows.) **Save the result as the variable `p_hi`.**
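
The same three steps can be sketched on made-up data. Everything here is hypothetical (the `height` column, the cutoff of 168, and the variable names), so treat it as an illustration of the pattern rather than the answer to this problem.

```python
import pandas as pd

# Hypothetical toy data with five rows.
toy = pd.DataFrame({'height': [170, 160, 180, 164, 175]})

# Step 1: a mask that is True where the condition holds.
tall_mask = toy['height'] > 168

# Step 2: select the matching rows, then count them.
n_tall = len(toy[tall_mask])

# Step 3: divide by the total number of rows.
frac_tall = n_tall / len(toy)

# Note: len() of the mask itself counts every row, not just the True ones.
print(n_tall, len(tall_mask), frac_tall)
```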

In [None]:
# YOUR ANSWER HERE

# Probability of high blood pressure:
print(p_hi)

In [None]:
assert 0.264367 < p_hi < 0.2643679 

## 0.4 Calculate conditional probabilities
Calculate the probability of high blood pressure, given the allele:

$P(\text{hypertension} \mid \text{allele})$

To do this, use a mask to apply two selection criteria to `data`. Select rows with BP > 140 and one of the two alleles.

You can string together multiple boolean statements using `&` (for `AND`) and `|` (for `OR`). The only trick here is to enclose each boolean statement in parentheses before adding another:

```python
(data['allele'] == 'GG') | (second boolean statement)
```
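
The parentheses matter because Python evaluates `&` and `|` before comparisons such as `==` and `>`. Here is a minimal sketch of both operators on made-up data; the column names, the thresholds, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with four rows.
toy = pd.DataFrame({
    'group': ['X', 'Y', 'X', 'Y'],
    'height': [170, 160, 180, 164],
})

# AND: both conditions must be true in the same row.
both = (toy['group'] == 'X') & (toy['height'] > 175)

# OR: at least one of the conditions must be true.
either = (toy['group'] == 'X') | (toy['height'] > 162)

print(both.tolist())
print(either.tolist())
```

A combined mask is still one `True`/`False` value per row, so it can be used for indexing exactly like a single-condition mask.
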
In the cell below do the following:

1) Define two masks, one for each genotype, to select rows with BP > 140 and the particular genotype. *Save the masks* as `hi_mask_gg` and `hi_mask_aa`.

2) Calculate the probability of high blood pressure for each genotype (the number of subjects with the genotype and high BP, divided by the number of subjects with the genotype). As you did above, use the mask and `len()` to count the number of subjects. You already calculated the total number of subjects with each genotype: the variables `aa` and `gg`.

3) Assign the calculated probabilities to the variables `p_hi_aa` and `p_hi_gg`.
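
Written as a formula, step 2 estimates the conditional probability as a ratio of counts. The notation $N(\cdot)$ for "number of subjects" is introduced here only for illustration.

$$
P(\text{hypertension} \mid \text{allele}) = \frac{N(\text{hypertension and allele})}{N(\text{allele})}
$$

The numerator counts rows that satisfy both criteria, and the denominator counts all rows with that genotype.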

In [None]:
# YOUR ANSWER HERE

print(p_hi_aa, p_hi_gg)

In [None]:
assert 0.317 < p_hi_gg < 0.318
assert 0.0714 < p_hi_aa < 0.075
assert len(hi_mask_aa) + len(hi_mask_gg) == 522

## Observed vs expected `GG` subjects with high blood pressure

Note that the probability of high blood pressure among subjects with the `GG` genotype is about 32%, while the probability of high blood pressure in the overall sample is about 26%. In the last stats lectures, we'll discuss how to test whether this difference is statistically significant.

For the final task of this problem, calculate the *expected* number of `GG` subjects with high blood pressure, given a baseline probability of `p_hi` and the number of `GG` subjects, `gg`. Assign the result to `exp_hi`.

Then calculate the observed number of `GG` subjects with high blood pressure. (Use `hi_mask_gg` to select the matching rows, then `len()` to count them.) Assign the answer to the variable `obs_hi`.
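
As a rough illustration of what an expected count means, here is a sketch with made-up numbers. The values and variable names are hypothetical and unrelated to the blood pressure data.

```python
# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
p_baseline = 0.25   # assumed overall probability of the outcome
n_group = 40        # assumed number of subjects in the group of interest

# Expected count if the group behaved like the overall sample.
expected = p_baseline * n_group

# An expected count is a float and need not be a whole number;
# int() truncates toward zero rather than rounding.
truncated = int(10.9)

print(expected, truncated)
```

The observed count, by contrast, is always a whole number because it comes from counting rows.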

In [None]:
# YOUR ANSWER HERE


print(int(exp_hi), obs_hi)

In [None]:
assert int(exp_hi) + obs_hi == 119