# Problem Set 1

For this problem set we will look at patterns of genetic diversity at the _LCT_ and _MCM6_ genes. Studies have shown that this locus exhibits signals consistent with positive selection in European populations and is associated with lactase persistence, a trait that is unique among mammals and is thought to result from cattle domestication and the incorporation of milk into the adult diets of several human populations (Enattah et al. 2002; Marnetto and Huerta-Sánchez 2017; Smith et al. 2018).

__ASSIGNMENT__
- There are six coding problems and one interpretation problem.
- For partial credit, please annotate your code (i.e., un-annotated code will not receive partial credit).
- The Python modules imported below are required to complete this problem set; you may use other modules at your own risk.
- Do not alter the code in the `Data Processing` section except for changing file paths.
- You may not work with other students, but you may ask the instructor questions by email or at office hours, and you may reference package documentation, coding exercises from previous lectures, and the course GitHub.

__HELPFUL HINTS__
- Feel free to add more cells if needed!
- Don't forget to consider ploidy.
- Remember the bounds for the site frequency spectrum.
- Take a deep breath, and remember that you are very capable!

In [1]:
# Import modules.
import h5py
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import toyplot

## Data Processing 

First, I will load the VCF file, which has been converted to an HDF5 file, and extract the information needed to complete this problem set.

In [2]:
# Load the hdf5 data.
lct_mcm6_h5 = h5py.File('./data/tgp_lct_mcm6_biallelic_snps_anc_calls_filtered.h5', mode='r')
# Extract the genotypes.
lct_mcm6_gt = lct_mcm6_h5['calldata/GT'][:]
# Convert the genotypes to an alternative allele count matrix.
lct_mcm6_aac_mat = np.sum(lct_mcm6_gt, axis=2)
# Extract the variable positions array.
lct_mcm6_pos = lct_mcm6_h5['variants/POS'][:]
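
To see why summing over the last axis gives alternative allele counts, here is a minimal sketch on a made-up genotype array. It assumes the usual layout of `calldata/GT` as (sites, samples, ploidy) with 0 for the reference allele and 1 for the alternative allele; the array `toy_gt` is hypothetical and is not part of the data set.

```python
import numpy as np

# Hypothetical genotype array: 2 sites x 2 diploid samples x 2 chromosomes.
# 0 = reference allele, 1 = alternative allele (assumed encoding).
toy_gt = np.array([
    [[0, 0], [0, 1]],  # site 0: homozygous reference, heterozygous
    [[1, 1], [1, 0]],  # site 1: homozygous alternative, heterozygous
])
# Summing over the ploidy axis counts alternative alleles per individual.
toy_aac = np.sum(toy_gt, axis=2)
print(toy_aac)
```

Each entry of `toy_aac` is 0, 1, or 2, which is the same encoding that `lct_mcm6_aac_mat` uses.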

Next, I will define a function to polarize the allele count matrix and subsequently convert the alternative allele count matrix to the derived allele count matrix.

In [3]:
# Define a function to convert alternative allele counts to derived allele counts.
def aac_2_dac(aac):
    """Returns a derived allele count matrix where an individual can have
       the following possible genotype entries:
       
       0 = homozygous for the ancestral allele
       1 = heterozygous
       2 = homozygous for the derived allele
    
    aac -- alternative allele count matrix with the outgroup encoded in 
           the last column of the matrix.
    """
    # Initialize a derived allele count matrix.
    dac = np.empty_like(aac[:, 0:-1])
    # For every site...
    for site in range(aac.shape[0]):
        # Extract the tgp samples and ancestor.
        tgp = aac[site, 0:-1]
        anc = aac[site, -1]
        # If the ancestor carries the alternative allele (i.e., the
        # alternative allele is ancestral and the reference allele is derived)...
        if anc > 0:
            # Polarize the tgp.
            p_tgp = np.abs(tgp - 2)
            # Fill the derived allele count matrix.
            dac[site, :] = p_tgp
        # Else...
        else:
            # Fill the derived allele count matrix.
            dac[site, :] = tgp
    return dac

In [4]:
# Convert the alternative allele count matrix to
# the derived allele count matrix.
lct_mcm6_dac_mat = aac_2_dac(lct_mcm6_aac_mat)
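
As an illustration of what polarization does, the sketch below applies the same rule to a made-up matrix using a vectorized `np.where` call. This is an alternative formulation for checking your understanding, not a replacement for `aac_2_dac`; the matrix `toy_aac` and the name `toy_dac` are hypothetical.

```python
import numpy as np

# Hypothetical alternative allele counts: 2 sites x (3 samples + 1 ancestor).
# The ancestral call sits in the last column, as in aac_2_dac.
toy_aac = np.array([
    [0, 1, 2, 0],  # ancestor carries the reference allele
    [0, 1, 2, 2],  # ancestor carries the alternative allele
])
# Split the samples from the ancestor.
tgp = toy_aac[:, :-1]
anc = toy_aac[:, -1]
# Where the ancestor carries the alternative allele, flip the counts
# (0 <-> 2, 1 stays 1); otherwise keep the counts as they are.
toy_dac = np.where(anc[:, None] > 0, 2 - tgp, tgp)
print(toy_dac)
```

In the second row the counts are flipped because the alternative allele is ancestral there, so the reference allele is the derived one.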

As a sanity check, let's make sure there are the same number of sites in the positions array as in the derived allele count matrix. Note: If the size of the positions array and the size of the first dimension of the derived allele count matrix are not both 2464, contact the instructor ASAP.

In [5]:
# Show the size of the positions array.
lct_mcm6_pos.size

2464

In [6]:
# Show the shape of the derived allele count matrix.
lct_mcm6_dac_mat.shape

(2464, 2504)

Great, the positions array and the derived allele count matrix are in agreement. Throughout this problem set you will run analyses on each super-population from the 1000 Genomes Project (TGP), so the last thing I will do for you is subset the original derived allele count matrix by super-population. However, feel free to work from the original derived allele count matrix if you wish!

In [7]:
# Load the tgp metadata as a pandas dataframe.
meta_df = pd.read_csv(
    './data/tgp_meta_data.txt', sep='\t',
    names=['Individual', 'Population', 'Super-Population'],
)
# Initialize a super population list.
superpop_list = ['AFR', 'SAS', 'EAS', 'EUR', 'AMR']
# Initialize a dictionary to store indices.
superpop_idx_dicc = {}
# For every super population...
for superpop in superpop_list:
    # Fill the dictionary.
    superpop_idx_dicc[superpop] = meta_df[meta_df['Super-Population'] == superpop].index.values
# Extract the derived allele count matrix for each super population.
afr_dac_mat = lct_mcm6_dac_mat[:, superpop_idx_dicc['AFR']]
sas_dac_mat = lct_mcm6_dac_mat[:, superpop_idx_dicc['SAS']]
eas_dac_mat = lct_mcm6_dac_mat[:, superpop_idx_dicc['EAS']]
eur_dac_mat = lct_mcm6_dac_mat[:, superpop_idx_dicc['EUR']]
amr_dac_mat = lct_mcm6_dac_mat[:, superpop_idx_dicc['AMR']]
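
The subsetting above relies on the row order of the metadata matching the column order of the derived allele count matrix. A minimal sketch of the same pattern on made-up data, where `toy_meta` and `toy_dac` are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Hypothetical metadata: one row per individual, in column order of toy_dac.
toy_meta = pd.DataFrame({
    'Individual': ['ind0', 'ind1', 'ind2', 'ind3'],
    'Super-Population': ['AFR', 'EUR', 'AFR', 'EUR'],
})
# Hypothetical derived allele counts: 2 sites x 4 individuals.
toy_dac = np.array([
    [0, 1, 2, 0],
    [1, 1, 0, 2],
])
# Row labels of the EUR individuals; with the default integer index
# these are also their column positions in toy_dac.
eur_idx = toy_meta[toy_meta['Super-Population'] == 'EUR'].index.values
# Keep every site, but only the EUR columns.
toy_eur_dac = toy_dac[:, eur_idx]
print(toy_eur_dac)
```

This only works because the dataframe keeps its default integer index; if the index were changed, the labels would no longer be valid column positions.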

## Segregating Sites ($S$)

In the `Data Processing` section we determined how many segregating sites were observed among all individuals in the TGP.

__(1) Compute the number of segregating sites observed in each super-population.__
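
As a reminder of the definition, the sketch below counts segregating sites on a made-up matrix. It assumes that a site is segregating in a sample when the derived allele is neither absent nor fixed, and that individuals are diploid; the function `count_segregating_sites` and the matrix `toy_dac` are hypothetical and only illustrate the idea.

```python
import numpy as np

def count_segregating_sites(dac):
    """Count sites where the derived allele is neither absent nor fixed."""
    # Diploid individuals contribute two chromosomes each.
    n_chroms = 2 * dac.shape[1]
    # Derived allele count per site across all chromosomes.
    site_counts = dac.sum(axis=1)
    # A site segregates if its count lies strictly between 0 and n_chroms.
    is_seg = (site_counts > 0) & (site_counts < n_chroms)
    return int(is_seg.sum())

# Hypothetical derived allele counts: 4 sites x 3 diploid individuals.
toy_dac = np.array([
    [0, 0, 0],  # fixed ancestral
    [0, 1, 2],  # segregating
    [2, 2, 2],  # fixed derived
    [1, 0, 0],  # segregating
])
toy_s = count_segregating_sites(toy_dac)
print(toy_s)
```

Note that a site can segregate in the full TGP sample and still be fixed within a single super-population.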

## Gene Diversity ($H$)

__(2) Compute the average gene diversity for each super-population.__
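
One common definition of gene diversity at a biallelic site is the unbiased expected heterozygosity, $h = \frac{n}{n-1}\left(1 - \sum_i p_i^2\right)$, where $n$ is the number of sampled chromosomes. The sketch below applies it to a made-up matrix and averages over all sites; whether to use the $\frac{n}{n-1}$ correction and which sites to average over are assumptions here, so follow the definition from lecture if it differs. The names `mean_gene_diversity` and `toy_dac` are hypothetical.

```python
import numpy as np

def mean_gene_diversity(dac):
    """Average unbiased expected heterozygosity over all sites (assumed definition)."""
    # Diploid individuals contribute two chromosomes each.
    n_chroms = 2 * dac.shape[1]
    # Derived allele frequency per site.
    p = dac.sum(axis=1) / n_chroms
    # Per-site gene diversity with the n / (n - 1) sample-size correction.
    h = (n_chroms / (n_chroms - 1)) * (1 - (p**2 + (1 - p)**2))
    return h.mean()

# Hypothetical derived allele counts: 4 sites x 3 diploid individuals.
toy_dac = np.array([
    [0, 0, 0],
    [0, 1, 2],
    [2, 2, 2],
    [1, 0, 0],
])
toy_h = mean_gene_diversity(toy_dac)
print(toy_h)
```

Sites that are fixed in the sample contribute zero to the average, which is why the choice of sites matters for the final value.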

## Nucleotide Diversity ($\pi$)

__(3) Compute the average nucleotide diversity—do not normalize by the number of sites—for each super-population.__
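
Nucleotide diversity can be viewed as the average number of differences between all pairs of sampled chromosomes. For biallelic sites with derived allele count $k$ out of $n$ chromosomes, each site contributes $k(n-k)/\binom{n}{2}$. The sketch below sums those contributions on a made-up matrix without dividing by the number of sites; the function `nucleotide_diversity` and the matrix `toy_dac` are hypothetical.

```python
import numpy as np

def nucleotide_diversity(dac):
    """Average pairwise differences, summed over sites (not per-site)."""
    # Diploid individuals contribute two chromosomes each.
    n = 2 * dac.shape[1]
    # Derived allele count per site.
    k = dac.sum(axis=1)
    # Number of distinct chromosome pairs.
    n_pairs = n * (n - 1) / 2
    # At each site, k * (n - k) pairs differ; sum over sites and average over pairs.
    return np.sum(k * (n - k)) / n_pairs

# Hypothetical derived allele counts: 4 sites x 3 diploid individuals.
toy_dac = np.array([
    [0, 0, 0],
    [0, 1, 2],
    [2, 2, 2],
    [1, 0, 0],
])
toy_pi = nucleotide_diversity(toy_dac)
print(toy_pi)
```

Working from counts rather than comparing chromosomes pair by pair avoids a loop over all $\binom{n}{2}$ pairs, which matters for samples with hundreds of individuals.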

## Derived Allele Frequency Spectrum (aka Unfolded SFS)

__(4) Compute the derived allele frequency spectrum for each super-population.__
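
The unfolded SFS tallies how many sites have each possible derived allele count. The sketch below builds it on a made-up matrix with `np.bincount`, assuming diploid individuals and assuming the spectrum covers counts from 1 to $n-1$ (sites that are fixed in the sample are left out); check that convention against the lecture notes. The names `unfolded_sfs` and `toy_dac` are hypothetical.

```python
import numpy as np

def unfolded_sfs(dac):
    """Number of sites with derived allele count 1, 2, ..., n - 1."""
    # Diploid individuals contribute two chromosomes each.
    n = 2 * dac.shape[1]
    # Derived allele count per site.
    k = dac.sum(axis=1)
    # Tally sites for every possible count from 0 to n.
    counts = np.bincount(k, minlength=n + 1)
    # Drop the classes 0 and n, which are not segregating in the sample.
    return counts[1:n]

# Hypothetical derived allele counts: 4 sites x 3 diploid individuals.
toy_dac = np.array([
    [0, 0, 0],
    [0, 1, 2],
    [2, 2, 2],
    [1, 0, 0],
])
toy_sfs = unfolded_sfs(toy_dac)
print(toy_sfs)
```

The entries of the spectrum should add up to the number of segregating sites in the same sample, which makes a handy consistency check.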

## Interpretation

__(5) Generate a table displaying the number of segregating sites, average gene diversity, and average nucleotide diversity for each super-population.__

__(6) Plot the derived allele frequency spectrum for each super-population.__

__(7) Compare the results for each super-population and interpret your results.__ (Hint: Reflect on our conversations about the assigned reading: 1000 Genomes Project Consortium. "A global reference for human genetic variation." _Nature_ 526.7571 (2015): 68.)


__Your Interpretation:__