# Epistasis Data Set #
## Goal ##
The goal of this test set is to evaluate the ability of the new covariation method to predict epistatic interactions. 
## Warning ##
Before using this notebook, make sure that your .env file has been set up with the correct locations of the command line tools, files, and directories needed for execution.
### Initial Import ###
This first cell performs the imports required to begin this notebook.

In [None]:
from dotenv import find_dotenv, load_dotenv
try:
    dotenv_path = find_dotenv(raise_error_if_not_found=True)
except IOError:
    dotenv_path = find_dotenv(raise_error_if_not_found=True, usecwd=True)
load_dotenv(dotenv_path)
import os
import sys
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src', 'SupportingClasses'))
input_dir = os.environ.get('INPUT_PATH')

## Data Set Construction ##
The first step in testing the data set is to download the required data and construct the input files needed for all downstream analyses.
In this case that means:
* Downloading PDB files for the proteins in our small test set.
* Extracting a query sequence from each PDB file.
* Searching for paralogs, homologs, and orthologs in a custom BLAST database built by filtering the UniRef90 database.
* Filtering the hits from the BLAST search to meet minimum and maximum length requirements, as well as minimum and maximum identity requirements.
* Building alignments using CLUSTALW in both the FASTA and MSF formats, since the tools used for comparison require different formats.
* Filtering the alignment by the maximum identity allowed between sequences.
* Re-aligning the filtered sequences using CLUSTALW.

This is all handled by the DataSetGenerator class found in the src/SupportingClasses folder.
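
As a rough illustration of the hit-filtering step listed above, the sketch below keeps only hits whose length and identity fall within given bounds. The `Hit` record, the `filter_hits` helper, and every threshold value are hypothetical and chosen for illustration; they are not taken from DataSetGenerator, whose actual interface and cutoffs may differ.

```python
from collections import namedtuple

# Hypothetical record for a BLAST hit; not the DataSetGenerator representation.
Hit = namedtuple('Hit', ['seq_id', 'length', 'identity'])


def filter_hits(hits, query_length, min_fraction=0.7, max_fraction=1.5,
                min_identity=0.4, max_identity=0.98):
    """Keep hits whose length (relative to the query) and identity lie within the bounds."""
    kept = []
    for hit in hits:
        length_ok = min_fraction * query_length <= hit.length <= max_fraction * query_length
        identity_ok = min_identity <= hit.identity <= max_identity
        if length_ok and identity_ok:
            kept.append(hit)
    return kept


# Made-up hits: 'b' is too short and 'c' is too similar to the query.
example_hits = [Hit('a', 56, 0.95), Hit('b', 20, 0.90), Hit('c', 60, 0.99), Hit('d', 70, 0.45)]
kept_ids = [hit.seq_id for hit in filter_hits(example_hits, query_length=56)]
print(kept_ids)
```

Applying an upper identity bound as well as a lower one removes near-duplicates of the query, which add little information to an alignment.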

In [None]:
protein_list_dir = os.path.join(input_dir, 'ProteinLists')
if not os.path.isdir(protein_list_dir):
    os.makedirs(protein_list_dir)
small_list_fn = os.path.join(protein_list_dir, 'EpistasisDataSet.txt')
if not os.path.isfile(small_list_fn):
    proteins_of_interest = ['1pgaA']
    with open(small_list_fn, 'w') as small_list_handle:
        for p_id in proteins_of_interest:
            small_list_handle.write('{}\n'.format(p_id))

In [None]:
from time import time
from DataSetGenerator import DataSetGenerator
generator = DataSetGenerator(input_dir)
start = time()
generator.build_pdb_alignment_dataset(protein_list_fn=os.path.basename(small_list_fn), num_threads=10,
                                      database='customuniref90.fasta', max_target_seqs=2500, remote=False, verbose=False)
end = time()
print('It took {} min to generate the data set.'.format((end - start) / 60.0))

In [None]:
import numpy as np


def epistasis_model_product(single_mut1, single_mut2, double_mut):
    # Product model: the expected double mutant value is the product of the single mutant values.
    product = single_mut1 * single_mut2
    epistasis_value = double_mut - product
    return epistasis_value


def epistasis_model_additive(single_mut1, single_mut2, double_mut, wild_type=1.0):
    # Additive model: (double mutant + wild type) - (single mutant 1 + single mutant 2).
    single_muts = single_mut1 + single_mut2
    double_muts = double_mut + wild_type
    epistasis_value = double_muts - single_muts
    return epistasis_value


def epistasis_model_log(single_mut1, single_mut2, double_mut, wild_type=1.0):
    # Log model: the expected value is log2((2^s1 - wild type) * (2^s2 - wild type) + wild type).
    power1 = np.power(2, single_mut1) - wild_type
    power2 = np.power(2, single_mut2) - wild_type
    product = power1 * power2
    inner = product + wild_type
    log_value = np.log2(inner)
    epistasis_value = double_mut - log_value
    return epistasis_value


def epistasis_model_min(single_mut1, single_mut2, double_mut):
    # Min model: the expected double mutant value is the smaller of the two single mutant values.
    min_value = np.min([single_mut1, single_mut2])
    epistasis_value = double_mut - min_value
    return epistasis_value
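
As a quick sanity check of the product model, the sketch below scores a made-up pair of mutations. The fitness values are invented for illustration and do not come from any dataset, and the function is restated under the hypothetical name `product_epistasis` so the example runs on its own.

```python
def product_epistasis(single_mut1, single_mut2, double_mut):
    """Observed double mutant fitness minus the product of the single mutant fitnesses."""
    return double_mut - single_mut1 * single_mut2


# Hypothetical fitness values, expressed relative to a wild type of 1.0.
positive = product_epistasis(0.5, 0.5, 0.5)    # double mutant fitter than the expected 0.25
negative = product_epistasis(0.5, 0.5, 0.125)  # double mutant less fit than the expected 0.25
neutral = product_epistasis(0.5, 0.5, 0.25)    # double mutant matches the expectation
print(positive, negative, neutral)
```

A positive value means the double mutant is fitter than the product model predicts, a negative value means it is less fit, and zero means no epistasis under this model.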

## Importing Epistasis Data ##
Each epistasis dataset is formatted differently and uses different endpoints to measure fitness. The following cells import data from epistasis studies of different protein domains.

### GB1 ###
The following dataset characterizes the protein G domain GB1, covering 1,045 single mutants and 509,693 double mutants across 1,485 of the possible pairs of positions in this 56-amino-acid domain.
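
The counts quoted above can be checked with simple arithmetic. The sketch assumes that 55 of the 56 positions were mutated and that each position can take 19 alternative amino acids; both are assumptions made for illustration, not statements from this notebook.

```python
# Assumption: one of the 56 positions is left unmutated, leaving 55 mutated positions.
mutated_positions = 55
# Assumption: 20 standard amino acids minus the wild-type residue at each position.
alternatives = 19

single_mutants = mutated_positions * alternatives
position_pairs = mutated_positions * (mutated_positions - 1) // 2
possible_double_mutants = position_pairs * alternatives ** 2
print(single_mutants, position_pairs, possible_double_mutants)
```

Under these assumptions the single mutant and position pair counts match the figures quoted above, and the 509,693 characterized double mutants would be roughly 95% of the possible ones.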
#### Reference ####
Olson, C. A., Wu, N. C., & Sun, R. (2014). A Comprehensive Biophysical Description of Pairwise Epistasis throughout an Entire Protein Domain. Current Biology, 24(22), 2643–2651. https://doi.org/10.1016/J.CUB.2014.09.072

In [None]:
import pandas as pd

gb1_raw = pd.read_excel(os.environ.get('GB1_EXCEL_FILE'))
# Group the spreadsheet columns by mutation class. A class label appears only above the first column
# of its block; the remaining columns of the block are read in as 'Unnamed: N'.
mut_cols = {}
last_col = None
for col in gb1_raw.columns:
    if not col.startswith('Unnamed:'):
        last_col = col
        mut_cols[last_col] = []
    if last_col:
        mut_cols[last_col].append(col)
# Build one table per mutation class, using the first row of the sheet as the column headers.
mut_tables = {}
for mut_class in mut_cols:
    # Copy the slice so the in-place edits below do not act on a view of gb1_raw.
    curr_table = gb1_raw.loc[1:, mut_cols[mut_class]].copy()
    curr_table.rename(columns={curr_table.columns[i]: gb1_raw.loc[0, mut_cols[mut_class]].iloc[i]
                               for i in range(len(mut_cols[mut_class]))}, inplace=True)
    curr_table.dropna(axis='index', how='all', inplace=True)
    curr_table.dropna(axis='columns', how='all', inplace=True)
    mut_tables[mut_class] = curr_table
# Normalize the double mutant counts by the wild type counts, then compute fitness as the ratio of the
# selected fraction to the input fraction.
mut_tables['DOUBLE MUTANTS']['Input Fraction'] = mut_tables['DOUBLE MUTANTS']['Input Count'].apply(
    lambda x: float(x) / float(mut_tables['WILD TYPE'].loc[1, 'Input Count']))
mut_tables['DOUBLE MUTANTS']['Selection Fraction'] = mut_tables['DOUBLE MUTANTS']['Selection Count'].apply(
    lambda x: float(x) / float(mut_tables['WILD TYPE'].loc[1, 'Selection Count']))
mut_tables['DOUBLE MUTANTS']['Double Mut Fitness'] = mut_tables['DOUBLE MUTANTS'].apply(
    lambda row: row['Selection Fraction'] / row['Input Fraction'], axis=1)
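
To make the fitness calculation above concrete, the sketch below applies the same normalization to a toy table. All counts are invented for illustration, and the column names simply mirror those used in the cell above.

```python
import pandas as pd

# Hypothetical wild type counts before and after selection.
wt_input, wt_selection = 1000.0, 2000.0

# Two made-up double mutants with their input and selection read counts.
toy = pd.DataFrame({'Input Count': [100, 200], 'Selection Count': [100, 800]})
toy['Input Fraction'] = toy['Input Count'] / wt_input
toy['Selection Fraction'] = toy['Selection Count'] / wt_selection
toy['Double Mut Fitness'] = toy['Selection Fraction'] / toy['Input Fraction']
fitness_values = toy['Double Mut Fitness'].tolist()
print(fitness_values)
```

A fitness below 1 means the variant was depleted relative to wild type during selection, while a value above 1 means it was enriched.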