# Small Test Set #
## Goal ##
The goal of this test set is to perform proof-of-concept testing on a small number of proteins that span a wide range of sizes and of available homologs, orthologs, and paralogs. Doing so should make it possible to find the best parameterization for this tool and to identify its strengths and weaknesses, using various measurements as end points.
## Warning ##
Before attempting to use this notebook, make sure that your .env file has been properly set up to reflect the correct locations of the command line tools and of the files and directories needed for execution.
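
As a rough sanity check, the sketch below reports which of the environment variables read by this notebook (`PROJECT_PATH`, `INPUT_PATH`, `OUTPUT_PATH`) are unset or empty. The helper name `find_missing_vars` is hypothetical and not part of the project. The variables that point at command line tools are not listed in this notebook, so they would have to be added to the tuple by hand.

```python
import os

# Names read by this notebook via os.environ.get(). Any tool-location
# variables defined in your .env are not known here and must be added.
REQUIRED_VARS = ('PROJECT_PATH', 'INPUT_PATH', 'OUTPUT_PATH')


def find_missing_vars(environ, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


# Intended to be run after load_dotenv() has populated os.environ.
missing = find_missing_vars(os.environ)
if missing:
    print('Missing from the environment: {}'.format(', '.join(missing)))
```

Running this right after the import cell surfaces a misconfigured .env before any download or BLAST step starts.
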
### Initial Import ###
This first cell performs the necessary imports required to begin this notebook.

In [15]:
from dotenv import find_dotenv, load_dotenv
try:
    dotenv_path = find_dotenv(raise_error_if_not_found=True)
except IOError:
    dotenv_path = find_dotenv(raise_error_if_not_found=True, usecwd=True)
load_dotenv(dotenv_path)
import os
import sys
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src', 'SupportingClasses'))
input_dir = os.environ.get('INPUT_PATH')

## Data Set Construction ##
The first task required to test the data set is to download the required data and construct the input files needed for all downstream analyses.
In this case that means:
* Downloading PDB files for the proteins in our small test set.
* Extracting a query sequence from each PDB file.
* Searching for paralogs, homologs, and orthologs in a custom BLAST database built by filtering the UniRef90 database.
* Filtering the hits from the BLAST search to meet minimum and maximum length requirements, as well as minimum and maximum identity requirements.
* Building alignments using CLUSTALW in both the FASTA and MSF formats, since the tools used for comparison require different formats.
* Filtering the alignment by maximum sequence identity between sequences.
* Re-aligning the filtered sequences using CLUSTALW.
This is all handled by the DataSetGenerator class found in the src/SupportingClasses folder.
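
To make the hit-filtering step concrete, here is a minimal sketch of filtering by length and identity. It is not the DataSetGenerator implementation: the function name `filter_hits`, the tuple layout of a hit, and all default thresholds are assumptions chosen only for illustration.

```python
def filter_hits(hits, query_length, min_fraction=0.7, max_fraction=1.5,
                min_identity=40.0, max_identity=98.0):
    """Keep hits whose length and percent identity fall within the bounds.

    hits: iterable of (hit_id, sequence, percent_identity) tuples.
    Length bounds are expressed as fractions of the query length.
    All default thresholds are illustrative assumptions.
    """
    min_len = min_fraction * query_length
    max_len = max_fraction * query_length
    kept = []
    for hit_id, sequence, identity in hits:
        if not (min_len <= len(sequence) <= max_len):
            continue  # too short or too long relative to the query
        if not (min_identity <= identity <= max_identity):
            continue  # too divergent or nearly identical to the query
        kept.append(hit_id)
    return kept


example_hits = [
    ('hit_a', 'M' * 100, 85.0),  # passes both filters
    ('hit_b', 'M' * 40, 90.0),   # too short
    ('hit_c', 'M' * 110, 99.5),  # identity above the maximum
    ('hit_d', 'M' * 120, 30.0),  # identity below the minimum
]
print(filter_hits(example_hits, query_length=100))
```

Scaling the length bounds to the query keeps a single pair of settings usable across proteins of very different sizes, which matters for a test set chosen to span a wide size range.
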

In [16]:
protein_list_dir = os.path.join(input_dir, 'ProteinLists')
if not os.path.isdir(protein_list_dir):
    os.makedirs(protein_list_dir)
small_list_fn = os.path.join(protein_list_dir, 'SmallDataSet.txt')
if not os.path.isfile(small_list_fn):
    proteins_of_interest = ['2ysdA', '1c17A', '3tnuA', '7hvpA', '135lA', '206lA', '2b59A', '2werA', '1bolA', '3q05A',
                            '1axbA', '2rh1A', '1hckA', '3b6vA', '2z0eA', '1jwlA', '1a26A', '1c0kA', '1h1vA', '4lliA',
                            '4ycuA', '2iopA', '2zxeA']
    with open(small_list_fn, 'w') as small_list_handle:  # text mode, since str values are written
        for p_id in proteins_of_interest:
            small_list_handle.write('{}\n'.format(p_id))

In [17]:
from time import time
from DataSetGenerator import DataSetGenerator
generator = DataSetGenerator(input_dir)
start = time()
generator.build_pdb_alignment_dataset(protein_list_fn=os.path.basename(small_list_fn), num_threads=10,
                                      database='customuniref90.fasta', max_target_seqs=2500, remote=False, verbose=False)
end = time()
print('It took {} min to generate the data set.'.format((end - start) / 60.0))

Importing protein list
Downloading structures and parsing in query sequences
BLASTing query sequences
Filtering BLAST hits, aligning, filtering by identity, and re-aligning
It took 0.153495001793 min to generate the data set.


In [18]:
output_dir = os.environ.get('OUTPUT_PATH')
small_set_out_dir = os.path.join(output_dir, 'SmallTestSet')
if not os.path.isdir(small_set_out_dir):
    os.makedirs(small_set_out_dir)
from SeqAlignment import SeqAlignment

# Generating Values For Comparison #
To determine the effectiveness of the new method and implementation, the covariation of the same proteins will be computed using the previous Evolutionary Trace covariation method (ET-MIp) and other methods in the field.

## ET-MIp ##
Scoring the covariation of the proteins using the previous Evolutionary Trace covariation method (ET-MIp).

In [19]:
from ETMIPWrapper import ETMIPWrapper
etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp')
if not os.path.isdir(etmip_out_dir):
    os.makedirs(etmip_out_dir)
etmip_scores = {}
for p_id in generator.protein_data:
    print('Attempting to calculate ET-MIp covariance for: {}'.format(p_id))
    try:
        protein_out_dir = os.path.join(etmip_out_dir, p_id)
        if not os.path.isdir(protein_out_dir):
            os.makedirs(protein_out_dir)
        curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id, polymer_type='Protein')
        curr_aln.import_alignment()
        curr_etmip = ETMIPWrapper(alignment=curr_aln)
        curr_etmip.calculate_scores(out_dir=protein_out_dir, delete_files=False)
        etmip_scores[p_id] = curr_etmip  # store the ET-MIp wrapper, not the DCA one
        print('Successfully computed ET-MIp covariance for: {}'.format(p_id))
    except ValueError:
        print('Could not compute ET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
            p_id, curr_aln.seq_length, curr_aln.size))

Attempting to calculate ET-MIp covariance for: 2iop
Output:

Error:
allocation failure 2 in dmatrix().

Could not compute ET-MIp covariance for: 2iop with seq_length: 1690 and size: 2473
Attempting to calculate ET-MIp covariance for: 7hvp
15.208096981
Successfully computed ET-MIp covariance for: 7hvp
Attempting to calculate ET-MIp covariance for: 1c0k
129.20468998
Successfully computed ET-MIp covariance for: 1c0k
Attempting to calculate ET-MIp covariance for: 1c17
1031.44045806
Successfully computed ET-MIp covariance for: 1c17
Attempting to calculate ET-MIp covariance for: 135l
3960.81864905
Successfully computed ET-MIp covariance for: 135l
Attempting to calculate ET-MIp covariance for: 2ysd
319.227418184
Successfully computed ET-MIp covariance for: 2ysd
Attempting to calculate ET-MIp covariance for: 206l
68.6146600246
Successfully computed ET-MIp covariance for: 206l
Attempting to calculate ET-MIp covariance for: 1hck
Output:

Error:
allocation failure 2 in imatrix() for int.
allocati

## DCA ##
Scoring the covariation of the proteins using a Julia implementation of DCA.

In [9]:
from DCAWrapper import DCAWrapper
dca_out_dir = os.path.join(small_set_out_dir, 'DCA')
if not os.path.isdir(dca_out_dir):
    os.makedirs(dca_out_dir)
dca_scores = {}
for p_id in generator.protein_data:
    print('Attempting to calculate DCA covariance for: {}'.format(p_id))
    try:
        protein_out_dir = os.path.join(dca_out_dir, p_id)
        if not os.path.isdir(protein_out_dir):
            os.makedirs(protein_out_dir)
        curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                polymer_type='Protein')
        curr_aln.import_alignment()
        # Since the DCA implementation used here does not provide a way to specify the query sequence we remove the gaps
        # from the query sequences so positions will be referenced correctly for that sequence (and unnecessary
        # computations can be avoided).
        curr_aln = curr_aln.remove_gaps()
        new_aln_fn = os.path.join(protein_out_dir, '{}_no_gap.fasta'.format(p_id))
        curr_aln.write_out_alignment(new_aln_fn)
        curr_aln.file_name = new_aln_fn
        curr_dca = DCAWrapper(alignment=curr_aln)
        curr_dca.calculate_scores(out_dir=protein_out_dir, delete_file=False)
        dca_scores[p_id] = curr_dca
        print('Successfully computed DCA covariance for: {}'.format(p_id))
    except ValueError:
        print('Could not compute DCA covariance for: {} with seq_length: {} and size: {}'.format(
            p_id, curr_aln.seq_length, curr_aln.size))

Attempting to calculate DCA covariance for: 2iop
Removing gaps took 0.121132250627 min
Output:
theta = 0.24214823332465546 threshold = 149.0
M = 2473 N = 618 Meff = 640.4080550722614

Error:

30.1861131191
Successfully computed DCA covariance for: 2iop
Attempting to calculate DCA covariance for: 7hvp
Removing gaps took 0.000621664524078 min
Output:
theta = 0.19415079847616118 threshold = 18.0
M = 59 N = 97 Meff = 24.145491838655456

Error:

5.05751204491
Successfully computed DCA covariance for: 7hvp
Attempting to calculate DCA covariance for: 1c0k
Removing gaps took 0.000751682122548 min
Output:
theta = 0.26199960512671616 threshold = 95.0
M = 54 N = 363 Meff = 36.266666666666666

Error:

10.8456110954
Successfully computed DCA covariance for: 1c0k
Attempting to calculate DCA covariance for: 1c17
Removing gaps took 0.00439543326696 min
Output:
theta = 0.25974123850629577 threshold = 20.0
M = 813 N = 79 Meff = 164.31823105079218

Error:

4.85462784767
Successfully computed DCA covarian
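
The gap-removal step described in the comments of the DCA cell above can be sketched as follows. This is an assumption about the behavior of `remove_gaps`, based only on those comments, and not the SeqAlignment implementation. The function name `remove_query_gaps` and the dictionary representation of an alignment are hypothetical.

```python
def remove_query_gaps(alignment, query_id, gap_chars='-.'):
    """Drop every alignment column in which the query sequence has a gap.

    alignment: dict mapping sequence id -> aligned sequence (equal lengths).
    Returns a new dict; the input alignment is left unchanged.
    """
    query = alignment[query_id]
    # Column indices where the query carries a residue rather than a gap.
    keep = [i for i, char in enumerate(query) if char not in gap_chars]
    return {seq_id: ''.join(seq[i] for i in keep)
            for seq_id, seq in alignment.items()}


toy = {'query': 'MK-LV-', 'hom1': 'MRALVG', 'hom2': 'M-ALIG'}
trimmed = remove_query_gaps(toy, 'query')
print(trimmed['query'])  # ungapped query; column i now maps to query residue i
```

After trimming, column indices in the alignment coincide with positions in the ungapped query, so pair scores from a tool with no notion of a query sequence can be read directly as query positions.
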

## EVCouplings ##
Scoring the covariation of the proteins using the EVCouplings method.