# Small Test Set #
## Goal ##
The goal of this test set is to perform proof-of-concept testing on a small number of proteins that span a wide range of sizes and of available homologs, orthologs, and paralogs. Doing so should make it possible to find the best parameterization for this tool and to identify its strengths and weaknesses, using various measurements as end points.
## Warning ##
Before attempting to use this notebook, make sure that your .env file has been properly set up to reflect the correct locations of the command line tools and of the files and directories needed for execution.
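A quick check along the following lines can catch a missing entry before any cell fails part-way through. This is a minimal sketch: the variable names are the ones read by the cells in this notebook, while `missing_env_vars` is a hypothetical helper and not part of the project code. It is meant to be run after `load_dotenv` has been called.

```python
import os

# Environment variables read by the cells in this notebook.
REQUIRED_VARS = ('PROJECT_PATH', 'INPUT_PATH', 'OUTPUT_PATH')


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print('Missing .env entries: {}'.format(', '.join(missing)))
```

Command line tool locations could be checked the same way, but their variable names are not shown in this notebook, so they are left out here.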
### Initial Import ###
This first cell performs the imports required to begin this notebook.

In [1]:
from dotenv import find_dotenv, load_dotenv
try:
    dotenv_path = find_dotenv(raise_error_if_not_found=True)
except IOError:
    dotenv_path = find_dotenv(raise_error_if_not_found=True, usecwd=True)
load_dotenv(dotenv_path)
import os
import sys
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src'))
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src', 'SupportingClasses'))
input_dir = os.environ.get('INPUT_PATH')

## Data Set Construction ##
The first task required to test the data set is to download the required data and construct any necessary input files for all downstream analyses.
In this case that means:
* Downloading PDB files for the proteins in our small test set.
* Extracting a query sequence from each PDB file.
* Searching for paralogs, homologs, and orthologs in a custom BLAST database built by filtering the UniRef90 database.
* Filtering the hits from the BLAST search to meet minimum and maximum length requirements, as well as minimum and maximum identity requirements.
* Building alignments using CLUSTALW in both the fasta and msf formats since some of the tools which will be used for comparison need different formats.
* Filtering the alignment for maximum identity similarity between sequences.
* Re-aligning the filtered sequences using CLUSTALW.
This is all handled by the DataSetGenerator class found in the src/SupportingClasses folder.
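The BLAST hit filtering step can be pictured with the sketch below. It is not the DataSetGenerator implementation: `filter_hits`, the tuple layout of a hit, and every threshold value are assumptions chosen only for illustration, since the actual bounds are not stated in this notebook.

```python
def filter_hits(query_length, hits, min_fraction=0.7, max_fraction=1.5,
                min_identity=0.4, max_identity=0.98):
    """Keep hits whose length and identity to the query fall inside the bounds.

    hits: iterable of (hit_id, hit_length, identity), identity in [0, 1].
    All default thresholds are hypothetical placeholders.
    """
    kept = []
    for hit_id, hit_length, identity in hits:
        # Length bounds are expressed relative to the query length.
        length_ok = min_fraction * query_length <= hit_length <= max_fraction * query_length
        identity_ok = min_identity <= identity <= max_identity
        if length_ok and identity_ok:
            kept.append(hit_id)
    return kept


example_hits = [('a', 100, 0.90),  # kept
                ('b', 50, 0.90),   # too short
                ('c', 110, 0.30),  # identity too low
                ('d', 120, 0.99),  # identity too high (near duplicate)
                ('e', 140, 0.50)]  # kept
print(filter_hits(100, example_hits))
```

An upper identity bound of this kind removes near duplicates of the query, while the lower bound removes hits too distant to align reliably.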

In [2]:
from time import time
from DataSetGenerator import DataSetGenerator
protein_list_dir = os.path.join(input_dir, 'ProteinLists')
if not os.path.isdir(protein_list_dir):
    os.makedirs(protein_list_dir)
small_list_fn = os.path.join(protein_list_dir, 'SmallDataSet.txt')
if not os.path.isfile(small_list_fn):
    proteins_of_interest = ['2ysdA', '1c17A', '3tnuA', '7hvpA', '135lA', '206lA', '2werA', '1bolA', '3q05A', '1axbA',
                            '2rh1A', '1hckA', '3b6vA', '2z0eA', '1jwlA', '1a26A', '1c0kA', '4lliA', '4ycuA', '2iopA',
                            '2zxeA', '2b59B', '1h1vG']
    with open(small_list_fn, 'w') as small_list_handle:
        for p_id in proteins_of_interest:
            small_list_handle.write('{}\n'.format(p_id))
generator = DataSetGenerator(input_dir)
start = time()
summary = generator.build_pdb_alignment_dataset(protein_list_fn=os.path.basename(small_list_fn), processes=10,
                                                database='customuniref90.fasta', max_target_seqs=2500, remote=False,
                                                verbose=False)
summary['Chain'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Chain'])
summary['Accession'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Accession'])
summary['Length'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Length'])
summary['Total_Size'] = summary.apply(lambda x: float(x['Length']) * float(x['Filtered_Alignment']), axis=1)
summary.sort_values(by=['Filtered_Alignment', 'Length'], axis=0, inplace=True)
summary_columns = ['Protein_ID', 'Chain', 'Accession', 'BLAST_Hits', 'Filtered_BLAST',
                   'Filtered_Alignment', 'Length', 'Total_Size']
print(summary[summary_columns])
end = time()
print('It took {} min to generate the data set.'.format((end - start) / 60.0))
summary.to_csv(os.path.join(input_dir, 'small_data_set_summary.tsv'), sep='\t', index=False, header=True,
               columns=summary_columns)

Importing protein list
Downloading structures and parsing in query sequences
Unique Sequences Found: 23!
BLASTing query sequences
Filtering BLAST hits, aligning, filtering by identity, and re-aligning
   Protein_ID Chain     Accession  BLAST_Hits  Filtered_BLAST  \
9        2b59     B    CIPA_CLOTM        1606               4   
5        7hvp     A     POL_HV1A2        2500              33   
20       1c0k     A    OXDA_RHOTO        2500              51   
15       206l     A      LYS_BPT4        1039             131   
13       1bol     A    RNRH_RHINI        2500             131   
7        3q05     A     P53_HUMAN         932             306   
16       1jwl     A    LACI_ECOLI        2500             228   
21       1a26     A   PARP1_CHICK        2500             267   
1        2ysd     A   MAGI1_HUMAN        2500             412   
17       2z0e     A   ATG4B_HUMAN        2111             360   
4        4lli     A   MYO5A_HUMAN        2500             441   
19       2rh1     A

Create a location to store the output of this method comparison.

In [3]:
output_dir = os.environ.get('OUTPUT_PATH')
small_set_out_dir = os.path.join(output_dir, 'SmallTestSet')
if not os.path.isdir(small_set_out_dir):
    os.makedirs(small_set_out_dir)

## Setting Up Scoring For Each Method
To reduce memory load during prediction and evaluation, the scoring objects needed to compute the metrics used to compare methods are created ahead of time, so that they are available to each method when it computes its predictions for a given protein. This ensures that results do not need to be kept in memory while waiting for all other results to be computed; only the metrics measured for each method are recorded.
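The scorers in the next cell are built with `cutoff=8.0` and two distance definitions (`'CB'` and `'Any'`). The sketch below shows what a distance cutoff of that kind amounts to for one coordinate per residue. It is only an illustration: `contact_map` is a hypothetical function, and whether ContactScorer treats the cutoff as inclusive is an assumption.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix, True where two residues are in contact.

    coords: one (x, y, z) tuple per residue, e.g. its beta carbon position.
    Two residues are called a contact when their distance is <= cutoff
    (inclusive comparison is an assumption made for this sketch).
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


# Four residues on a line, 3.8 apart for the first three, the last one far away.
example_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
example_contacts = contact_map(example_coords)
```

For the `'Any'` definition, the same comparison would be applied to the minimum distance over all atom pairs of the two residues rather than to a single representative atom.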

In [4]:
import pandas as pd
from SeqAlignment import SeqAlignment
from PDBReference import PDBReference
from ContactScorer import ContactScorer, plot_z_scores
protein_order = list(summary['Protein_ID'])
method_order = ['DCA', 'EVC Standard', 'EVC Mean Field', 'ET-MIp', 'cET-MIp', 'ET-MIp_MAX']
sequence_separation_order = ['Any', 'Neighbors', 'Short', 'Medium', 'Long']
protein_scorers = {}
small_comparison_df = None
small_comparison_fn = os.path.join(small_set_out_dir, 'Small_Comparision_Data.csv')
if os.path.isfile(small_comparison_fn):
    small_comparison_df = pd.read_csv(small_comparison_fn, sep='\t', header=0, index_col=False)
else:
    for p_id in summary['Protein_ID']:
        protein_scorers[p_id] = {}
        # Import alignment and remove gaps
        full_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id)
        full_aln.import_alignment()
        non_gap_aln = full_aln.remove_gaps()
        # Import structure
        pdb_structure = PDBReference(pdb_file=generator.protein_data[p_id]['PDB'])
        pdb_structure.import_pdb(structure_id=p_id)
        protein_scorers[p_id]['Structure'] = pdb_structure
        # Initialize Beta Carbon distance scorer
        contact_scorer_cb = ContactScorer(query=p_id, seq_alignment=non_gap_aln,
                                          pdb_reference=pdb_structure, cutoff=8.0)
        contact_scorer_cb.best_chain = generator.protein_data[p_id]['Chain']
        contact_scorer_cb.fit()
        contact_scorer_cb.measure_distance(method='CB')
        protein_scorers[p_id]['Scorer_CB'] = contact_scorer_cb
        # Initialize distance scorer minimizing distance between any atoms
        contact_scorer_any = ContactScorer(query=p_id, seq_alignment=non_gap_aln,
                                           pdb_reference=pdb_structure, cutoff=8.0)
        contact_scorer_any.best_chain = generator.protein_data[p_id]['Chain']
        contact_scorer_any.fit()
        contact_scorer_any.measure_distance(method='Any')
        protein_scorers[p_id]['Scorer_Any'] = contact_scorer_any
        # Initialize z-scoring subproblems
        protein_scorers[p_id]['biased_w2_ave'] = None
        protein_scorers[p_id]['unbiased_w2_ave'] = None
output_columns = ['Protein', 'Protein Length', 'Alignment Size', 'Method', 'Distance', 'Init Time', 'Import Time', 'Dist Tree Time', 'Trace Time', 'Total Time', 
                  'Sequence_Separation', 'AUROC', 'AUPRC', 'AUTPRFDRC',
                  'Top K Predictions', 'Precision', 'Recall', 'F1 Score',
                  'Biased Z-Score at 10%', 'Biased Z-Score at 30%', 'Max Biased Z-Score', 'AUC Biased Z-Score',
                  'Unbiased Z-Score at 10%', 'Unbiased Z-Score at 30%', 'Max Unbiased Z-Score', 'AUC Unbiased Z-Score']

Removing gaps took 0.00040810108184814454 min
Importing the PDB file took 0.0029550949732462567 min




Removing gaps took 0.0005920926729838053 min
Importing the PDB file took 0.0014386137326558432 min
Mapping query sequence and pdb took 0.0025120178858439126 min
Computing the distance matrix based on the PDB file took 0.0029610554377237958 min
Removing gaps took 0.0005215525627136231 min
Importing the PDB file took 0.0013111114501953125 min
Mapping query sequence and pdb took 0.0024066805839538575 min




Computing the distance matrix based on the PDB file took 0.00402217706044515 min
Removing gaps took 0.0008041739463806152 min
Importing the PDB file took 0.001248776912689209 min
Removing gaps took 0.0008612235387166341 min




Importing the PDB file took 0.0006717840830485026 min
Mapping query sequence and pdb took 0.0018079121907552083 min
Computing the distance matrix based on the PDB file took 0.0012967427571614583 min
Removing gaps took 0.0007571895917256673 min




Importing the PDB file took 0.0016282161076863607 min
Mapping query sequence and pdb took 0.0026790658632914227 min
Computing the distance matrix based on the PDB file took 0.00202560822168986 min
Removing gaps took 0.0008410453796386719 min
Importing the PDB file took 0.0020082831382751466 min
Removing gaps took 0.0006177385648091634 min
Importing the PDB file took 0.0016821980476379394 min
Mapping query sequence and pdb took 0.002474025885264079 min
Computing the distance matrix based on the PDB file took 0.01726288398106893 min
Removing gaps took 0.0005861759185791015 min
Importing the PDB file took 0.0022502779960632325 min
Mapping query sequence and pdb took 0.0029920061429341634 min
Computing the distance matrix based on the PDB file took 0.015199518203735352 min
Removing gaps took 0.000313111146291097 min
Importing the PDB file took 0.0007047812143961588 min
Removing gaps took 0.0003086407979329427 min
Importing the PDB file took 0.00034815470377604166 min
Mapping query sequence



Removing gaps took 0.005238882700602214 min
Importing the PDB file took 0.0027441898981730144 min
Mapping query sequence and pdb took 0.008429086208343506 min




Computing the distance matrix based on the PDB file took 0.006136651833852132 min
Removing gaps took 0.005414235591888428 min
Importing the PDB file took 0.0027650038401285807 min
Mapping query sequence and pdb took 0.008622888724009197 min




Computing the distance matrix based on the PDB file took 0.007420738538106282 min
Removing gaps took 0.002460193634033203 min




Importing the PDB file took 0.0020055333773295087 min
Removing gaps took 0.003102584679921468 min




Importing the PDB file took 0.0015887618064880371 min
Mapping query sequence and pdb took 0.004878489176432291 min
Computing the distance matrix based on the PDB file took 0.012115057309468586 min
Removing gaps took 0.0030904213587443032 min
Importing the PDB file took 0.0015345056851704916 min
Mapping query sequence and pdb took 0.0048231045405069985 min




Computing the distance matrix based on the PDB file took 0.013928079605102539 min
Removing gaps took 0.01654237906138102 min
Importing the PDB file took 0.0010261972745259603 min
Removing gaps took 0.01667482058207194 min
Importing the PDB file took 0.0006369948387145996 min
Mapping query sequence and pdb took 0.01790359417597453 min
Computing the distance matrix based on the PDB file took 0.0135992964108785 min
Removing gaps took 0.015516177813212077 min
Importing the PDB file took 0.0006197730700174968 min
Mapping query sequence and pdb took 0.016702409585316977 min
Computing the distance matrix based on the PDB file took 0.0164287805557251 min
Removing gaps took 0.026801021893819173 min
Importing the PDB file took 0.005480766296386719 min
Removing gaps took 0.026091845830281575 min
Importing the PDB file took 0.004432189464569092 min
Mapping query sequence and pdb took 0.04291700124740601 min
Computing the distance matrix based on the PDB file took 0.0003768324851989746 min
Removing



Removing gaps took 0.009690157572428386 min
Importing the PDB file took 0.0008085449536641439 min
Mapping query sequence and pdb took 0.01080549160639445 min




Computing the distance matrix based on the PDB file took 0.013539222876230876 min
Removing gaps took 0.0096137801806132 min
Importing the PDB file took 0.000773477554321289 min
Mapping query sequence and pdb took 0.010704509417215983 min




Computing the distance matrix based on the PDB file took 0.012912603219350179 min
Removing gaps took 0.05212547381718954 min
Importing the PDB file took 0.0017905155817667642 min




Removing gaps took 0.052713127930959065 min
Importing the PDB file took 0.0014315048853556316 min




Mapping query sequence and pdb took 0.05706042448679606 min
Computing the distance matrix based on the PDB file took 0.01618785858154297 min
Removing gaps took 0.05081588824590047 min
Importing the PDB file took 0.0013565421104431152 min




Mapping query sequence and pdb took 0.055082090695699054 min
Computing the distance matrix based on the PDB file took 0.019417003790537516 min
Removing gaps took 0.009499796231587728 min
Importing the PDB file took 0.0015480399131774902 min
Removing gaps took 0.00931399663289388 min
Importing the PDB file took 0.0008538285891215007 min
Mapping query sequence and pdb took 0.018233350912729897 min
Computing the distance matrix based on the PDB file took 0.021357421080271402 min
Removing gaps took 0.009117523829142252 min
Importing the PDB file took 0.0008279204368591309 min
Mapping query sequence and pdb took 0.018087502320607504 min
Computing the distance matrix based on the PDB file took 0.021988725662231444 min
Removing gaps took 0.027496095498402914 min
Importing the PDB file took 0.001483309268951416 min




Removing gaps took 0.030768632888793945 min
Importing the PDB file took 0.0010478814442952474 min
Mapping query sequence and pdb took 0.03289890686670939 min




Computing the distance matrix based on the PDB file took 0.010174997647603353 min
Removing gaps took 0.028402650356292726 min
Importing the PDB file took 0.00103528102238973 min
Mapping query sequence and pdb took 0.030061777432759604 min




Computing the distance matrix based on the PDB file took 0.012347022692362467 min
Removing gaps took 0.046332287788391116 min
Importing the PDB file took 0.0016238768895467123 min




Removing gaps took 0.04246002038319906 min
Importing the PDB file took 0.001286907990773519 min
Mapping query sequence and pdb took 0.04440484046936035 min




Computing the distance matrix based on the PDB file took 0.01140980323155721 min
Removing gaps took 0.0428015391031901 min
Importing the PDB file took 0.0012775699297587076 min
Mapping query sequence and pdb took 0.0447906772295634 min




Computing the distance matrix based on the PDB file took 0.0129135529200236 min
Removing gaps took 0.0848992387453715 min
Importing the PDB file took 0.0025701642036437987 min




Removing gaps took 0.08556890885035197 min
Importing the PDB file took 0.0021897594134012857 min




Mapping query sequence and pdb took 0.08970083793004353 min
Computing the distance matrix based on the PDB file took 0.11183563868204753 min
Removing gaps took 0.0857035756111145 min
Importing the PDB file took 0.0021821260452270508 min
Mapping query sequence and pdb took 0.08982928991317748 min




Computing the distance matrix based on the PDB file took 0.11220600207646687 min
Removing gaps took 0.0042406121889750166 min
Importing the PDB file took 0.003572078545888265 min
Removing gaps took 0.004219349225362142 min
Importing the PDB file took 0.003180090586344401 min
Mapping query sequence and pdb took 0.007545757293701172 min
Computing the distance matrix based on the PDB file took 0.0007024010022481283 min
Removing gaps took 0.002185229460398356 min
Importing the PDB file took 0.003153971831003825 min
Mapping query sequence and pdb took 0.007818988958994548 min
Computing the distance matrix based on the PDB file took 0.0009917338689168294 min
Removing gaps took 0.014275777339935302 min
Importing the PDB file took 0.0005586425463358561 min
Removing gaps took 0.011546754837036132 min
Importing the PDB file took 0.00027225414911905926 min
Mapping query sequence and pdb took 0.012006103992462158 min
Computing the distance matrix based on the PDB file took 0.0019278287887573241 mi



Removing gaps took 0.10059206485748291 min
Importing the PDB file took 0.0008927226066589355 min
Mapping query sequence and pdb took 0.10268610715866089 min




Computing the distance matrix based on the PDB file took 0.004586756229400635 min
Removing gaps took 0.1005159298578898 min
Importing the PDB file took 0.0030982335408528644 min
Mapping query sequence and pdb took 0.10489243666330973 min




Computing the distance matrix based on the PDB file took 0.006357665856679281 min
Removing gaps took 0.07042027314503987 min
Importing the PDB file took 0.0011283198992411295 min




Removing gaps took 0.07048648993174235 min
Importing the PDB file took 0.0005630016326904297 min
Mapping query sequence and pdb took 0.07213631073633829 min
Computing the distance matrix based on the PDB file took 0.0007193406422932942 min




Removing gaps took 0.07063308954238892 min
Importing the PDB file took 0.0005658984184265137 min
Mapping query sequence and pdb took 0.07208762963612875 min
Computing the distance matrix based on the PDB file took 0.0009574254353841145 min




Removing gaps took 0.06768288612365722 min
Importing the PDB file took 0.0008033355077107748 min
Removing gaps took 0.07069494326909383 min
Importing the PDB file took 0.0005538304646809896 min
Mapping query sequence and pdb took 0.07229482730229696 min
Computing the distance matrix based on the PDB file took 0.009499577681223552 min
Removing gaps took 0.06852363348007202 min
Importing the PDB file took 0.0005452195803324381 min
Mapping query sequence and pdb took 0.07017403443654378 min
Computing the distance matrix based on the PDB file took 0.009795443216959635 min
Removing gaps took 0.047313690185546875 min
Importing the PDB file took 0.000951850414276123 min
Removing gaps took 0.0453234593073527 min
Importing the PDB file took 0.00047654310862223307 min
Mapping query sequence and pdb took 0.04654147227605184 min
Computing the distance matrix based on the PDB file took 0.006908794244130452 min
Removing gaps took 0.04769770701726277 min
Importing the PDB file took 0.0004740834236145



Importing the PDB file took 0.004469144344329834 min
Removing gaps took 0.1601968765258789 min




Importing the PDB file took 0.006264722347259522 min
Mapping query sequence and pdb took 0.168387770652771 min
Computing the distance matrix based on the PDB file took 0.027894926071166993 min
Removing gaps took 0.15968736012776694 min
Importing the PDB file took 0.0032224416732788085 min




Mapping query sequence and pdb took 0.16484851042429607 min
Computing the distance matrix based on the PDB file took 0.03382241725921631 min
Removing gaps took 0.10231130520502726 min




Importing the PDB file took 0.007267216841379802 min
Removing gaps took 0.1058316946029663 min




Importing the PDB file took 0.0040301322937011715 min
Mapping query sequence and pdb took 0.1111327568689982 min
Computing the distance matrix based on the PDB file took 0.041872306664784746 min
Removing gaps took 0.10257659355799358 min




Importing the PDB file took 0.0070395231246948246 min
Mapping query sequence and pdb took 0.11093360980351766 min
Computing the distance matrix based on the PDB file took 0.04497403303782145 min


# Generating Values For Comparison #
To determine the effectiveness of the new method and implementation, the covariation of the same proteins will be computed using the previous Evolutionary Trace covariation method (ET-MIp) and other methods in the field.

## ET-MIp ##
Scoring the covariation of the proteins using the previous Evolutionary Trace covariation method (ET-MIp).

In [5]:
# from ETMIPWrapper import ETMIPWrapper
# etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp')
# if not os.path.isdir(etmip_out_dir):
#     os.makedirs(etmip_out_dir)
# etmip_scores = {}
# counts = {'success':0, 'value': 0, 'attribute':0}
# for p_id in generator.protein_data:
#     print('Attempting to calculate ET-MIp covariance for: {}'.format(p_id))
#     try:
#         protein_out_dir = os.path.join(etmip_out_dir, p_id)
#         if not os.path.isdir(protein_out_dir):
#             os.makedirs(protein_out_dir)
#         curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id, polymer_type='Protein')
#         curr_aln.import_alignment()
#         curr_etmip = ETMIPWrapper(alignment=curr_aln)
#         curr_etmip.calculate_scores(out_dir=protein_out_dir, delete_files=False)
#         etmip_scores[p_id] = curr_etmip
#         print('Successfully computed ET-MIp covariance for: {}'.format(p_id))
#         counts['success'] += 1
#     except ValueError:
#         print('Could not compute ET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
#             p_id, curr_aln.seq_length, curr_aln.size))
#         counts['value'] += 1
#     except AttributeError:
#         print('Could not compute ET-MIp covariance for: {} no alignment'.format(p_id))
#         counts['attribute'] += 1
# print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
#                                                                      counts['attribute']))

## ET-MIp (Continued)
The previous implementation is not able to run for alignments of the size used here. Instead, we use the new implementation with the same parameterization as the previous implementation (distance model: blosum62 similarity; tree: ET UPGMA variant; scoring metric: filtered average product corrected mutual information; ranks: all).
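For readers unfamiliar with the scoring metric, the sketch below computes mutual information between alignment columns and applies the average product correction, MIp(i, j) = MI(i, j) - MI(i, .) * MI(., j) / mean(MI). It is not the EvolutionaryTrace code: both function names are hypothetical, the natural logarithm is an arbitrary choice, gaps are treated like any other character, and the "filtered" step of the metric used in this notebook is not reproduced.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns of equal length."""
    n = float(len(col_a))
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log(p(a,b) / (p(a) * p(b))), written with raw counts.
        mi += (c / n) * math.log((c * n) / (count_a[a] * count_b[b]))
    return mi


def apc_corrected_mi(columns):
    """Return {(i, j): MIp} for all i < j; expects at least two columns."""
    n = len(columns)
    mi = {(i, j): mutual_information(columns[i], columns[j])
          for i in range(n) for j in range(i + 1, n)}
    # Mean MI of each column against every other column.
    col_mean = [sum(mi[tuple(sorted((i, k)))] for k in range(n) if k != i) / (n - 1)
                for i in range(n)]
    overall_mean = sum(mi.values()) / len(mi)
    if overall_mean == 0.0:
        return dict(mi)  # nothing to correct when every pair has zero MI
    return {(i, j): value - col_mean[i] * col_mean[j] / overall_mean
            for (i, j), value in mi.items()}


# Columns 0 and 1 covary perfectly; column 2 is fully conserved.
example_columns = ['AACC', 'GGTT', 'KKKK']
example_scores = apc_corrected_mi(example_columns)
```

The correction subtracts the background signal shared by a column with all others, so a pair scores highly only when its mutual information exceeds what the two columns show on average.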

In [5]:
from EvolutionaryTrace import EvolutionaryTrace
import numpy as np
import pandas as pd
if not os.path.isfile(small_comparison_fn):
    etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp')
    if not os.path.isdir(etmip_out_dir):
        os.makedirs(etmip_out_dir)
    etmip_method_fn = os.path.join(etmip_out_dir, 'ET-MIp_Method_Data.csv')
    if os.path.isfile(etmip_method_fn):
        etmip_method_df = pd.read_csv(etmip_method_fn, sep='\t', header=0, index_col=False)
    else:    
        etmip_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in summary['Protein_ID']:
            print('Attempting to calculate ET-MIp covariance for: {}'.format(p_id))
            protein_dir = os.path.join(etmip_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:

                    start_time = time()
                    curr_etmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                                   aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                                   distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                                   ranks=None, position_type='pair',
                                                   scoring_metric='filtered_average_product_corrected_mutual_information',
                                                   gap_correction=None, maximize=False, out_dir=protein_dir,
                                                   output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                                   processors=10, low_memory=True)
                    init_time = time()
                    curr_etmip.import_and_process_aln()
                    import_time = time()
                    curr_etmip.compute_distance_matrix_tree_and_assignments()
                    dist_tree_time = time()
                    curr_etmip.perform_trace()
                    end_time = time()
                    print('Successfully computed ET-MIp covariance for: {}'.format(p_id))
                    # Compute statistics for the final scores of the ET-MIp model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_etmip, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_etmip.scorer.position_size,
                        rank_type=curr_etmip.scorer.rank_type, file_prefix='ET-MIp_Scores_', plots=True)
                    # Score Prediction Clustering
                    z_score_fn = os.path.join(protein_dir, 'ET-MIp_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'ET-MIp_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = init_time - start_time
                    protein_df['Import Time'] = import_time - init_time
                    protein_df['Dist Tree Time'] = dist_tree_time - import_time
                    protein_df['Trace Time'] = end_time - dist_tree_time
                    protein_df['Total Time'] = end_time - start_time
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'ET-MIp'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    temp_data = os.path.join(protein_dir, 'unique_node_data')
                    for temp_fn in os.listdir(temp_data):
                        if not temp_fn.endswith("_pair_rank_filtered_average_product_corrected_mutual_information_score.npz"):
                            os.remove(os.path.join(temp_data, temp_fn))
                    print('Metrics measured for ET-MIp covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute ET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_etmip.original_aln.seq_length, curr_etmip.original_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute ET-MIp covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            if etmip_method_df is None:
                etmip_method_df = protein_df
            else:
                # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
                etmip_method_df = pd.concat([etmip_method_df, protein_df])
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        etmip_method_df.to_csv(etmip_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = etmip_method_df
    else:
        # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
        small_comparison_df = pd.concat([small_comparison_df, etmip_method_df])
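
The "Z-Score at 10%" and "Z-Score at 30%" columns in the cell above are filled by taking the first row of the z-score table whose `Num_Residues` reaches that fraction of the query chain length. The same lookup is sketched below in plain Python. `z_score_at_fraction` is a hypothetical helper; unlike the indexing used in the cell, it returns `None` instead of raising when no row reaches the requested coverage.

```python
def z_score_at_fraction(num_residues, z_scores, seq_length, fraction):
    """Return the z-score of the first row covering at least `fraction` of the sequence.

    num_residues and z_scores are parallel sequences, scanned in the given order.
    Returns None when no row reaches the requested coverage.
    """
    threshold = float(seq_length) * fraction
    for n_res, z in zip(num_residues, z_scores):
        if n_res >= threshold:
            return z
    return None


# Made-up values for a chain of length 100.
example_num_residues = [2, 5, 12, 15, 33]
example_z_scores = [0.5, 1.2, 2.0, 1.8, 0.9]
z_at_10 = z_score_at_fraction(example_num_residues, example_z_scores, 100, 0.10)
z_at_30 = z_score_at_fraction(example_num_residues, example_z_scores, 100, 0.30)
```

Pulling the lookup into a helper of this kind would also shorten the four long assignment lines that repeat the chain length expression.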

## cET-MIp
This segment applies the ET-MIp method constrained to an arbitrary set of nodes (ranks 1, 2, 3, 5, 7, 10, 25) at the top of the phylogenetic tree.

In [6]:
if not os.path.isfile(small_comparison_fn):
    cetmip_out_dir = os.path.join(small_set_out_dir, 'cET-MIp')
    if not os.path.isdir(cetmip_out_dir):
        os.makedirs(cetmip_out_dir)
    cetmip_method_fn = os.path.join(cetmip_out_dir, 'cET-MIp_Method_Data.csv')
    if os.path.isfile(cetmip_method_fn):
        cetmip_method_df = pd.read_csv(cetmip_method_fn, sep='\t', header=0, index_col=False)
    else:
        cetmip_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0, 'key': 0}
        for p_id in summary['Protein_ID']:
            print('Attempting to calculate cET-MIp covariance for: {}'.format(p_id))
            protein_dir = os.path.join(cetmip_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:
                    start_time = time()
                    curr_cetmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                                   aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                                   distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                                   ranks=[1, 2, 3, 5, 7, 10, 25], position_type='pair',
                                                   scoring_metric='filtered_average_product_corrected_mutual_information',
                                                   gap_correction=None, maximize=False, out_dir=protein_dir,
                                                   output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                                   processors=10, low_memory=True)
                    init_time = time()
                    curr_cetmip.import_and_process_aln()
                    import_time = time()
                    curr_cetmip.compute_distance_matrix_tree_and_assignments()
                    dist_tree_time = time()
                    curr_cetmip.perform_trace()
                    end_time = time()
                    # Compute statistics for the final scores of the cET-MIp model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_cetmip, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_cetmip.scorer.position_size,
                        rank_type=curr_cetmip.scorer.rank_type, file_prefix='cET-MIp_Scores_', plots=True)
                    # Score Prediction Clustering
                    z_score_fn = os.path.join(protein_dir, 'cET-MIp_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'cET-MIp_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_cetmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_cetmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = init_time - start_time
                    protein_df['Import Time'] = import_time - init_time
                    protein_df['Dist Tree Time'] = dist_tree_time - import_time
                    protein_df['Trace Time'] = end_time - dist_tree_time
                    protein_df['Total Time'] = end_time - start_time
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'cET-MIp'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    temp_data = os.path.join(protein_dir, 'unique_node_data')
                    for temp_fn in os.listdir(temp_data):
                        if not temp_fn.endswith("_pair_rank_filtered_average_product_corrected_mutual_information_score.npz"):
                            os.remove(os.path.join(temp_data, temp_fn))
                    print('Successfully computed cET-MIp covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute cET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_cetmip.original_aln.seq_length, curr_cetmip.original_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute cET-MIp covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
                except KeyError:
                    print('Could not compute cET-MIp covariance for: {} not enough sequences'.format(p_id))
                    counts['key'] += 1
                    continue
            if cetmip_method_df is None:
                cetmip_method_df = protein_df
            else:
                cetmip_method_df = cetmip_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors\n{}\tKey Errors'.format(
            counts['success'], counts['value'], counts['attribute'], counts['key']))
        cetmip_method_df.to_csv(cetmip_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = cetmip_method_df
    else:
        small_comparison_df = small_comparison_df.append(cetmip_method_df)
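The long indexing expressions that fill the "Z-Score at 10%" and "Z-Score at 30%" columns all do the same thing: they take the first Z-score whose `Num_Residues` value reaches a given fraction of the query sequence length. A minimal sketch of that lookup on plain lists follows. The helper name `z_score_at_fraction` is hypothetical and the input values are made up for illustration.

```python
def z_score_at_fraction(num_residues, z_scores, seq_length, fraction):
    """Return the first Z-score whose residue count reaches fraction * seq_length.

    Mirrors array[mask][0] in the cell: raises IndexError when no row qualifies.
    """
    threshold = float(seq_length) * fraction
    for n, z in zip(num_residues, z_scores):
        if n >= threshold:
            return z
    raise IndexError('no row reaches {} residues'.format(threshold))


# Made-up clustering results for a 100-residue protein.
example_num_residues = [2, 5, 10, 20, 40]
example_z_scores = [0.1, 1.2, 2.5, 3.0, 2.0]
at_10 = z_score_at_fraction(example_num_residues, example_z_scores, 100, 0.10)
at_30 = z_score_at_fraction(example_num_residues, example_z_scores, 100, 0.30)
```

Factoring the lookup into one function would also remove the four near-identical lines repeated in each method cell.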

### ET-MIp with Group Maximization ###
This cell generates data for and tests the effect of maximizing the group score when moving from a parent node to child nodes.
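One plausible reading of group maximization is sketched here. It is an assumption, not a description of what `EvolutionaryTrace` does when `maximize=True`: each child node keeps, per position, the larger of its own score and the score already carried down from its parent. The dictionaries and the function name are hypothetical.

```python
def maximize_down_tree(children, scores, root):
    """Propagate per-position maxima from parents to children.

    children: dict mapping a node to its list of child nodes.
    scores:   dict mapping a node to a list of per-position scores.
    """
    maximized = {root: list(scores[root])}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            # Element-wise maximum of the parent's carried score and the child's own score.
            maximized[child] = [max(p, c) for p, c in zip(maximized[parent], scores[child])]
            stack.append(child)
    return maximized


# Toy tree with two positions per node (values are made up).
toy_children = {'root': ['A', 'B'], 'A': ['A1']}
toy_scores = {'root': [0.2, 0.9], 'A': [0.5, 0.1], 'B': [0.1, 0.3], 'A1': [0.4, 0.95]}
toy_max = maximize_down_tree(toy_children, toy_scores, 'root')
```

Under this reading a score can only stay the same or increase along any path from the root to a leaf.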

In [7]:
from EvolutionaryTrace import EvolutionaryTrace
import numpy as np
import pandas as pd
if not os.path.isfile(small_comparison_fn):
    etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp_MAX')
    if not os.path.isdir(etmip_out_dir):
        os.makedirs(etmip_out_dir)
    etmip_method_fn = os.path.join(etmip_out_dir, 'ET-MIp_MAX_Method_Data.csv')
    if os.path.isfile(etmip_method_fn):
        etmip_method_df = pd.read_csv(etmip_method_fn, sep='\t', header=0, index_col=False)
    else:    
        etmip_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in summary['Protein_ID']:
            print('Attempting to calculate ET-MIp MAXIMIZED covariance for: {}'.format(p_id))
            protein_dir = os.path.join(etmip_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:

                    start_time = time()
                    curr_etmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                                   aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                                   distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                                   ranks=None, position_type='pair',
                                                   scoring_metric='filtered_average_product_corrected_mutual_information',
                                                   gap_correction=None, maximize=True, out_dir=protein_dir,
                                                   output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                                   processors=10, low_memory=True)
                    init_time = time()
                    curr_etmip.import_and_process_aln()
                    import_time = time()
                    curr_etmip.compute_distance_matrix_tree_and_assignments()
                    dist_tree_time = time()
                    curr_etmip.perform_trace()
                    end_time = time()
                    print('Successfully computed ET-MIp MAX covariance for: {}'.format(p_id))
                    # Compute statistics for the final scores of the ET-MIp model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_etmip, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_etmip.scorer.position_size,
                        rank_type=curr_etmip.scorer.rank_type, file_prefix='ET-MIp_MAX_Scores_', plots=True)
                    # Score Prediction Clustering
                    z_score_fn = os.path.join(protein_dir, 'ET-MIp_MAX_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'ET-MIp_MAX_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = init_time - start_time
                    protein_df['Import Time'] = import_time - init_time
                    protein_df['Dist Tree Time'] = dist_tree_time - import_time
                    protein_df['Trace Time'] = end_time - dist_tree_time
                    protein_df['Total Time'] = end_time - start_time
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'ET-MIp_MAX'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    temp_data = os.path.join(protein_dir, 'unique_node_data')
                    for temp_fn in os.listdir(temp_data):
                        if not temp_fn.endswith("_pair_rank_filtered_average_product_corrected_mutual_information_score.npz"):
                            os.remove(os.path.join(temp_data, temp_fn))
                    print('Metrics measured for ET-MIp MAX covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute ET-MIp MAX covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_etmip.original_aln.seq_length, curr_etmip.original_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute ET-MIp MAX covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            if etmip_method_df is None:
                etmip_method_df = protein_df
            else:
                etmip_method_df = etmip_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        etmip_method_df.to_csv(etmip_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = etmip_method_df
    else:
        small_comparison_df = small_comparison_df.append(etmip_method_df)

Attempting to calculate ET-MIp MAXIMIZED covariance for: 2b59
Attempting to calculate ET-MIp MAXIMIZED covariance for: 7hvp
Attempting to calculate ET-MIp MAXIMIZED covariance for: 1c0k
Attempting to calculate ET-MIp MAXIMIZED covariance for: 206l
Attempting to calculate ET-MIp MAXIMIZED covariance for: 1bol
Attempting to calculate ET-MIp MAXIMIZED covariance for: 3q05
Attempting to calculate ET-MIp MAXIMIZED covariance for: 1jwl
Attempting to calculate ET-MIp MAXIMIZED covariance for: 1a26
Attempting to calculate ET-MIp MAXIMIZED covariance for: 2ysd
Attempting to calculate ET-MIp MAXIMIZED covariance for: 2z0e
Attempting to calculate ET-MIp MAXIMIZED covariance for: 4lli
Attempting to calculate ET-MIp MAXIMIZED covariance for: 2rh1
Attempting to calculate ET-MIp MAXIMIZED covariance for: 3b6v
Attempting to calculate ET-MIp MAXIMIZED covariance for: 1h1v
Attempting to calculate ET-MIp MAXIMIZED covariance for: 2zxe
Attempting to calculate ET-MIp MAXIMIZED covariance for: 1c17
Attempti

100%|██████████| 4693/4693 [00:03<00:00, 1306.41characterizations/s]
100%|██████████| 4693/4693 [00:03<00:00, 1364.41group/s]
  0%|          | 0/2347 [00:00<?, ?rank/s]

Group Score MAXIMIZATION Took: 192.91375775337218 min


100%|██████████| 2347/2347 [30:02<00:00,  1.30rank/s] 
100%|██████████| 177906/177906 [03:00<00:00, 987.55variation/s] 


Results written to file in 3.0754732489585876 min
Successfully computed ET-MIp MAX covariance for: 4ycu




Compute SCW Z-Score took 0.3458731174468994 min
Compute SCW Z-Score took 0.33072091738382975 min


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Metrics meastured for ET-MIp MAX covariance for: 4ycu
Attempting to calculate ET-MIp MAXIMIZED covariance for: 2iop




Removing gaps took 0.09433327913284302 min


100%|██████████| 2472/2472 [00:02<00:00, 1051.07sequences/s]
100%|██████████| 3054156/3054156 [1:27:27<00:00, 582.01distances/s]


Constructing tree took: 3.2616514364878335 min


100%|██████████| 4943/4943 [1:08:01<00:00,  1.55s/characterizations]  
100%|██████████| 4943/4943 [07:51<00:00,  1.74group/s]  
  0%|          | 0/2472 [00:00<?, ?rank/s]

Group Score MAXIMIZATION Took: 236.40004972219467 min


100%|██████████| 2472/2472 [36:38<00:00,  1.12rank/s]  
100%|██████████| 194376/194376 [03:29<00:00, 929.03variation/s] 


Results written to file in 3.5659597675005594 min
Successfully computed ET-MIp MAX covariance for: 2iop
Compute SCW Z-Score took 0.5762609799702962 min
Compute SCW Z-Score took 0.5886143922805787 min
Metrics meastured for ET-MIp MAX covariance for: 2iop
2	Successes
0	Value Errors
0	Attribute Errors


## DCA ##
Scoring the covariation of the proteins using a Julia implementation of DCA.

In [8]:
from DCAWrapper import DCAWrapper
from utils import compute_rank_and_coverage
if not os.path.isfile(small_comparison_fn):
    dca_out_dir = os.path.join(small_set_out_dir, 'DCA')
    if not os.path.isdir(dca_out_dir):
        os.makedirs(dca_out_dir)
    dca_method_fn = os.path.join(dca_out_dir, 'DCA_Method_Data.csv')
    if os.path.isfile(dca_method_fn):
        dca_method_df = pd.read_csv(dca_method_fn, sep='\t', header=0, index_col=False)
    else:
        dca_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in generator.protein_data:
            print('Attempting to calculate DCA covariance for: {}'.format(p_id))
            protein_dir = os.path.join(dca_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:
                    curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                            polymer_type='Protein')
                    curr_aln.import_alignment()
                    # Since the DCA implementation used here does not provide a way to specify the query sequence we remove the gaps
                    # from the query sequences so positions will be referenced correctly for that sequence (and unnecessary
                    # computations can be avoided).
                    curr_aln = curr_aln.remove_gaps()
                    new_aln_fn = os.path.join(protein_dir, '{}_no_gap.fasta'.format(p_id))
                    curr_aln.write_out_alignment(new_aln_fn)
                    curr_aln.file_name = new_aln_fn
                    curr_dca = DCAWrapper(alignment=curr_aln)
                    curr_dca.calculate_scores(out_dir=protein_dir, delete_file=False)
                    # Compute statistics for the final scores of the DCA model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_dca, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2, rank_type='max', file_prefix='DCA_Scores_', plots=True)
                    # Score Prediction Clustering
                    _, dca_coverage  = compute_rank_and_coverage(seq_length=curr_dca.alignment.seq_length, scores=curr_dca.scores, pos_size=2,
                        rank_type='max')
                    z_score_fn = os.path.join(protein_dir, 'DCA_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'DCA_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - dca_coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - dca_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = None
                    protein_df['Import Time'] = None
                    protein_df['Dist Tree Time'] = None
                    protein_df['Trace Time'] = None
                    protein_df['Total Time'] = None
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'DCA'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    print('Successfully computed DCA covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute DCA covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_aln.seq_length, curr_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute DCA covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            if dca_method_df is None:
                dca_method_df = protein_df
            else:
                dca_method_df = dca_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        dca_method_df.to_csv(dca_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = dca_method_df
    else:
        small_comparison_df = small_comparison_df.append(dca_method_df)
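The gap-removal step described in the comments of the DCA cell can be pictured as dropping every alignment column in which the query sequence has a gap, so that column indices line up with query residue positions. The sketch below works on a plain dictionary of aligned strings. It is an illustration only and does not reproduce `SeqAlignment.remove_gaps`; the function name and the gap characters are assumptions.

```python
def remove_query_gaps(alignment, query_id, gap_chars='-.'):
    """Drop all columns where the query sequence is gapped.

    alignment: dict mapping sequence id to an aligned string (equal lengths).
    """
    query = alignment[query_id]
    keep = [i for i, char in enumerate(query) if char not in gap_chars]
    return {seq_id: ''.join(seq[i] for i in keep) for seq_id, seq in alignment.items()}


# Toy alignment: the query 'q' is gapped in columns 1 and 4 (0-based).
toy_aln = {'q': 'A-CD-', 's1': 'AGC-E', 's2': '--CDE'}
ungapped = remove_query_gaps(toy_aln, 'q')
```

After filtering, every remaining column corresponds to exactly one residue of the query, and other sequences may still contain gaps.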

## EVCouplings ##
Scoring the covariation of the proteins using the standard protocol of the EVCouplings method.

In [9]:
from EVCouplingsWrapper import EVCouplingsWrapper
if not os.path.isfile(small_comparison_fn):
    evc_standard_out_dir = os.path.join(small_set_out_dir, 'EVCouplings_Standard')
    if not os.path.isdir(evc_standard_out_dir):
        os.makedirs(evc_standard_out_dir)
    evc_standard_method_fn = os.path.join(evc_standard_out_dir, 'EVCouplings_Standard_Method_Data.csv')
    if os.path.isfile(evc_standard_method_fn):
        evc_standard_method_df = pd.read_csv(evc_standard_method_fn, sep='\t', header=0, index_col=False)
    else:
        evc_standard_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in generator.protein_data:
            print('Attempting to calculate EV couplings standard protocol covariance for: {}'.format(p_id))
            protein_dir = os.path.join(evc_standard_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:
                    curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                            polymer_type='Protein')
                    curr_aln.import_alignment()
                    curr_evc = EVCouplingsWrapper(alignment=curr_aln, protocol='standard')
                    curr_evc.calculate_scores(out_dir=protein_dir, cores=10, delete_files=True)
                    # Compute statistics for the final scores of the EVCouplings model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_evc, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2,
                        rank_type='max', file_prefix='EVC_Standard_Scores_', plots=True)
                    # Score Prediction Clustering
                    _, evc_standard_coverage  = compute_rank_and_coverage(seq_length=curr_evc.alignment.seq_length, scores=curr_evc.scores, pos_size=2,
                        rank_type='max')
                    z_score_fn = os.path.join(protein_dir, 'EVC_Standard_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'EVC_Standard_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - evc_standard_coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - evc_standard_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = None
                    protein_df['Import Time'] = None
                    protein_df['Dist Tree Time'] = None
                    protein_df['Trace Time'] = None
                    protein_df['Total Time'] = None
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'EVC Standard'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    print('Successfully computed EV couplings standard protocol covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute EV couplings standard protocol covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_aln.seq_length, curr_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute EV couplings standard protocol covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            if evc_standard_method_df is None:
                evc_standard_method_df = protein_df
            else:
                evc_standard_method_df = pd.concat([evc_standard_method_df, protein_df])
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        evc_standard_method_df.to_csv(evc_standard_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = evc_standard_method_df
    else:
        small_comparison_df = pd.concat([small_comparison_df, evc_standard_method_df])

Scoring the covariation of the proteins using the EVCouplings method mean field protocol.

In [10]:
if not os.path.isfile(small_comparison_fn):
    evc_mf_out_dir = os.path.join(small_set_out_dir, 'EVCouplings_Mean_Field')
    if not os.path.isdir(evc_mf_out_dir):
        os.makedirs(evc_mf_out_dir)
    evc_mf_method_fn = os.path.join(evc_mf_out_dir, 'EVCouplings_Mean_Field_Method_Data.csv')
    if os.path.isfile(evc_mf_method_fn):
        evc_mf_method_df = pd.read_csv(evc_mf_method_fn, sep='\t', header=0, index_col=False)
    else:
        evc_mf_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in generator.protein_data:
            print('Attempting to calculate EV couplings covariance for: {}'.format(p_id))
            try:
                protein_dir = os.path.join(evc_mf_out_dir, p_id)
                if not os.path.isdir(protein_dir):
                    os.makedirs(protein_dir)
                curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                        polymer_type='Protein')
                curr_aln.import_alignment()
                curr_evc = EVCouplingsWrapper(alignment=curr_aln, protocol='mean_field')
                curr_evc.calculate_scores(out_dir=protein_dir, cores=10, delete_files=True)
                # Compute statistics for the final scores of the EVCouplings mean field model
                protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                    predictor=curr_evc, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                    unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2, rank_type='max',
                    file_prefix='EVC_Mean_Field_Scores_', plots=True)
                # Score Prediction Clustering
                _, evc_mf_coverage = compute_rank_and_coverage(seq_length=curr_evc.alignment.seq_length, scores=curr_evc.scores, pos_size=2,
                    rank_type='max')
                z_score_fn = os.path.join(protein_dir, 'EVC_Mean_Field_Scores_Dist-Any_{}_ZScores.tsv')
                z_score_plot_fn = os.path.join(protein_dir, 'EVC_Mean_Field_Scores_Dist-Any_{}_ZScores.png')
                z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                    1.0 - evc_mf_coverage, bias=True, file_path=z_score_fn.format('Biased'),
                    w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                if protein_scorers[p_id]['biased_w2_ave'] is None:
                    protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                    1.0 - evc_mf_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                    w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                    protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                # Record execution times
                protein_df['Init Time'] = None
                protein_df['Import Time'] = None
                protein_df['Dist Tree Time'] = None
                protein_df['Trace Time'] = None
                protein_df['Total Time'] = None
                # Record static data for this protein
                protein_df['Protein'] = p_id
                protein_df['Method'] = 'EVC Mean Field'
                protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                print('Successfully computed EV couplings covariance for: {}'.format(p_id))
                counts['success'] += 1
            except ValueError:
                print('Could not compute EV couplings covariance for: {} with seq_length: {} and size: {}'.format(
                    p_id, curr_aln.seq_length, curr_aln.size))
                counts['value'] += 1
                continue
            except AttributeError:
                print('Could not compute EV couplings covariance for: {} no alignment'.format(p_id))
                counts['attribute'] += 1
                continue
            if evc_mf_method_df is None:
                evc_mf_method_df = protein_df
            else:
                evc_mf_method_df = pd.concat([evc_mf_method_df, protein_df])
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        evc_mf_method_df.to_csv(evc_mf_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = evc_mf_method_df
    else:
        small_comparison_df = pd.concat([small_comparison_df, evc_mf_method_df])

In [11]:
# Write out final comparison data so it can be loaded later for generating figures.
if not os.path.isfile(small_comparison_fn):
    small_comparison_df['Protein Length'] = small_comparison_df['Protein'].apply(lambda x: generator.protein_data[x]['Length'])
    small_comparison_df.to_csv(small_comparison_fn, sep='\t', header=True, index=False, columns=output_columns)

# Comparing Execution Time for ET-MIp and cET-MIp
Computing the trace constrained to a subset of the top levels of the phylogenetic tree (cET-MIp) should take significantly less time than computing the trace for the full tree (ET-MIp). Here we evaluate whether that is in fact the case.

## Data Cleaning
At least one protein in this data set has a very small alignment and could not be evaluated by cET-MIp because its tree was too small to reach the levels set for the other proteins. Here we remove any protein that was not evaluated by every method.

In addition, since this analysis focuses on execution times and the data contain many other measurements (some of which cause the time data to be repeated across rows), we use this opportunity to subset the data to the timing columns and drop duplicates.

Finally, for some readers it will be more informative to view execution time in minutes or hours rather than the originally reported seconds, so we add columns for these units as well.
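
The cleaning steps described above can be illustrated on a toy table. This is only a minimal sketch: the protein names and timings are hypothetical, and `toy_df`, `methods_per_protein`, `kept`, and `toy_time_df` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values): protein 'P2' was never scored by cET-MIp,
# and each timing is repeated because other per-row measurements were dropped.
toy_df = pd.DataFrame({
    'Protein': ['P1', 'P1', 'P1', 'P2', 'P2'],
    'Method': ['ET-MIp', 'ET-MIp', 'cET-MIp', 'ET-MIp', 'ET-MIp'],
    'Total Time': [120.0, 120.0, 30.0, 7200.0, 7200.0],
})

# Count the distinct methods per protein and keep proteins covered by every method.
methods_per_protein = (toy_df[['Protein', 'Method']].drop_duplicates()
                       .groupby('Protein')['Method'].count())
kept = methods_per_protein.index[methods_per_protein == methods_per_protein.max()]

# Subset to comparable proteins and drop the repeated timing rows.
toy_time_df = toy_df[toy_df['Protein'].isin(kept)].drop_duplicates().copy()

# Add the converted time units (seconds to minutes to hours).
toy_time_df['Total Time (min)'] = toy_time_df['Total Time'] / 60.0
toy_time_df['Total Time (hr)'] = toy_time_df['Total Time (min)'] / 60.0
print(toy_time_df)
```

In this toy case only `P1` survives, with one row per method.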

In [12]:
protein_method_groups = small_comparison_df[['Protein', 'Method']].drop_duplicates().groupby('Protein').count()
method_max = protein_method_groups['Method'].max()
proteins_to_keep = protein_method_groups.index[protein_method_groups['Method'] == method_max]
comparable_method_proteins = small_comparison_df[small_comparison_df['Protein'].isin(proteins_to_keep)]
time_columns = ['Protein', 'Protein Length', 'Alignment Size', 'Method', 'Init Time', 'Import Time', 'Dist Tree Time',
                'Trace Time', 'Total Time']
time_subset_df = comparable_method_proteins.loc[comparable_method_proteins['Method'].isin(['ET-MIp', 'cET-MIp']), time_columns]
time_subset_df['Total Time (min)'] = time_subset_df['Total Time'].apply(lambda x: x / 60.0)
time_subset_df['Total Time (hr)'] = time_subset_df['Total Time (min)'].apply(lambda x: x / 60.0)
time_columns += ['Total Time (min)', 'Total Time (hr)']
time_subset_df = time_subset_df.drop_duplicates(subset=None, inplace=False, keep='first')
time_subset_df.to_csv(os.path.join(small_set_out_dir, 'Small_Time_Comaprison_Data.csv'), sep='\t', header=True, index=False,
                      columns=time_columns)

## Time Comparison
Now that only comparable proteins are present in the data, we compare the runtime of individual proteins by method, ordered first by protein length and then by alignment size.

In [13]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Displayed by protein length order
protein_length_order = comparable_method_proteins.sort_values('Protein Length')['Protein'].unique()
protein_length_time_plot = sns.barplot(x='Protein', y='Total Time', hue='Method', order=protein_length_order,
                                       hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_length_time_plot.set_xticklabels(protein_length_time_plot.get_xticklabels(), rotation=90)
protein_length_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Length_Time_Comparison_Sec.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_length_time_plot = sns.barplot(x='Protein', y='Total Time (min)', hue='Method', order=protein_length_order,
                                       hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_length_time_plot.set_xticklabels(protein_length_time_plot.get_xticklabels(), rotation=90)
protein_length_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Length_Time_Comparison_Min.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_length_time_plot = sns.barplot(x='Protein', y='Total Time (hr)', hue='Method', order=protein_length_order,
                                       hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_length_time_plot.set_xticklabels(protein_length_time_plot.get_xticklabels(), rotation=90)
protein_length_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Length_Time_Comparison_Hr.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Displayed by alignment size order
protein_alignment_order = comparable_method_proteins.sort_values('Alignment Size')['Protein'].unique()
protein_alignment_time_plot = sns.barplot(x='Protein', y='Total Time', hue='Method', order=protein_alignment_order,
                                          hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_alignment_time_plot.set_xticklabels(protein_alignment_time_plot.get_xticklabels(), rotation=90)
protein_alignment_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Alignment_Time_Comparison_Sec.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_alignment_time_plot = sns.barplot(x='Protein', y='Total Time (min)', hue='Method', order=protein_alignment_order,
                                          hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_alignment_time_plot.set_xticklabels(protein_alignment_time_plot.get_xticklabels(), rotation=90)
protein_alignment_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Alignment_Time_Comparison_Min.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_alignment_time_plot = sns.barplot(x='Protein', y='Total Time (hr)', hue='Method', order=protein_alignment_order,
                                          hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_alignment_time_plot.set_xticklabels(protein_alignment_time_plot.get_xticklabels(), rotation=90)
protein_alignment_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Alignment_Time_Comparison_Hr.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()

Overall comparison of time by method.
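
Because the Wilcoxon signed-rank test is a paired test, the two samples must list the same proteins in the same order. The sketch below shows one way to guarantee that pairing by pivoting to one row per protein before testing. The timings are hypothetical, and `toy_times` and `paired` are illustrative names rather than variables from this notebook.

```python
import pandas as pd
from scipy.stats import wilcoxon

# Hypothetical timings (seconds) for six proteins; the rows for the two
# methods are deliberately stored in opposite protein order.
toy_times = pd.DataFrame({
    'Protein': ['A', 'B', 'C', 'D', 'E', 'F', 'F', 'E', 'D', 'C', 'B', 'A'],
    'Method': ['ET-MIp'] * 6 + ['cET-MIp'] * 6,
    'Total Time': [100.0, 220.0, 340.0, 460.0, 580.0, 700.0,
                   650.0, 545.0, 430.0, 320.0, 205.0, 95.0],
})

# One row per protein and one column per method, so each row is a matched pair.
# dropna() discards any protein that is missing a timing for either method.
paired = toy_times.pivot(index='Protein', columns='Method', values='Total Time').dropna()

stat, p_val = wilcoxon(x=paired['ET-MIp'], y=paired['cET-MIp'], zero_method='wilcox')
print(paired)
print(stat, p_val)
```

Pivoting makes the pairing explicit, so the test no longer depends on the row order of the input table.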

In [49]:
# Comparison of ET-MIp and cET-MIp total computation time (sec).
protein_method_comp_plot = sns.boxplot(x='Method', y='Total Time', order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_method_comp_plot.set_xticklabels(protein_method_comp_plot.get_xticklabels(), rotation=90)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_Time_Comparison_Sec.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Comparison of ET-MIp and cET-MIp total computation time (min).
protein_method_comp_plot = sns.boxplot(x='Method', y='Total Time (min)', order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_method_comp_plot.set_xticklabels(protein_method_comp_plot.get_xticklabels(), rotation=90)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_Time_Comparison_Min.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Comparison of ET-MIp and cET-MIp total computation time (hr).
protein_method_comp_plot = sns.boxplot(x='Method', y='Total Time (hr)', order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_method_comp_plot.set_xticklabels(protein_method_comp_plot.get_xticklabels(), rotation=90)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_Time_Comparison_Hr.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Statistical comparison
from scipy.stats import wilcoxon
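# Sort both subsets by protein so the paired test compares the same protein across methods.
et_mip_sub_df = time_subset_df[time_subset_df['Method'] == 'ET-MIp'].sort_values('Protein')
cet_mip_sub_df = time_subset_df[time_subset_df['Method'] == 'cET-MIp'].sort_values('Protein')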
sec_stat, sec_p_val = wilcoxon(x=et_mip_sub_df['Total Time'], y=cet_mip_sub_df['Total Time'], zero_method='wilcox')
min_stat, min_p_val = wilcoxon(x=et_mip_sub_df['Total Time (min)'], y=cet_mip_sub_df['Total Time (min)'], zero_method='wilcox')
hr_stat, hr_p_val = wilcoxon(x=et_mip_sub_df['Total Time (hr)'], y=cet_mip_sub_df['Total Time (hr)'], zero_method='wilcox')
time_statistics = {'Time Unit': ['sec', 'min', 'hr'], 'Statistic': [sec_stat, min_stat, hr_stat], 'P-Value': [sec_p_val, min_p_val, hr_p_val]}
pd.DataFrame(time_statistics).to_csv(os.path.join(small_set_out_dir, 'Small_Time_Comaprison_Statistics.csv'), sep='\t', header=True,
                                     index=False, columns=['Time Unit', 'Statistic', 'P-Value'])

## Method Comparison
We now begin comparing methods based on their ability to predict the structural contacts in the proteins in this test set. Two considerations shape this comparison: the sequence separation between the residues in a predicted contact, and the metric used to measure success.

### Data Cleaning
These data need an additional cleaning step beyond what was performed for the timing comparison. There are multiple categories of sequence separation, and a protein may not have any True Positive contacts in a given category, in which case the scoring for that protein is incomplete. We remove all such proteins from the comparison, performing the clean up separately for each metric of success. A second reason this cleaning is needed is the assessment of the top K predictions (or the best L/K predictions) for a protein: for poor predictions, this set may not include any True Positives.

### Sequence Separation
One important consideration, both for the difficulty of a prediction and for how interesting it is, is the separation in sequence between the residues for which coupling was predicted. As has been documented in the literature, especially in the CASP competitions, there are several categories of prediction:
* Neighbors (1 - 5 residues apart) - This is the least interesting category of predictions. It is highly likely that residues this close together will show covariance signal. Predicting that two residues this close together are in contact is trivial and uninformative.
* Short (6 - 12 residues apart) - This is also not a very interesting type of prediction. Residues this close together can be more easily modeled by algorithms which focus on secondary structure modeling (identifying beta sheets, alpha helices, etc.).
* Medium (13 - 24 residues apart) - This is a more interesting type of prediction. The residues in this range of separation are on the edge of the range covered by secondary structure prediction.
* Long (24 and more residues apart) - The most interesting category of predictions. Residues this far apart are not easily modeled by secondary structure modeling systems. They are also very useful for tertiary and quaternary structure prediction because they provide constraints on potential protein folds (similar to NMR data), which makes protein folding a more tractable problem for modelers.
* Any/All - All categories can be considered at once. This provides a summary value, but it is often skewed by one particularly good category of predictions.
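
The categories above can be expressed as a small lookup function. This is a minimal sketch, not the categorization used by the scoring classes in this notebook. `separation_category` is a hypothetical helper, and because the list gives 24 as a boundary for both Medium and Long, the sketch assumes that a separation of exactly 24 counts as Long.

```python
def separation_category(i, j):
    """Return the sequence separation category for residue positions i and j."""
    separation = abs(i - j)
    if separation < 1:
        raise ValueError('A residue cannot be paired with itself.')
    if separation <= 5:
        return 'Neighbors'
    if separation <= 12:
        return 'Short'
    # Assumption: a separation of exactly 24 is treated as Long, not Medium.
    if separation < 24:
        return 'Medium'
    return 'Long'


# Every pair also belongs to the summary 'Any' category, which is not returned here.
for pair in [(10, 12), (10, 20), (10, 30), (10, 34), (10, 110)]:
    print(pair, separation_category(*pair))
```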

### Metrics of Success
* AUROC - The area under the receiver operating characteristic curve, which plots the True Positive Rate against the False Positive Rate as the score threshold is varied. It summarizes how well a method ranks contacts above non-contacts. This metric can be strongly influenced by the class imbalance which is present when predicting structural contacts, since there are many fewer contacts than non-contacts. A residue pair is a True Positive if the C-beta atoms of the two amino acids are within 8.0 Angstroms of one another (as is done in the CASP competitions).
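
To make the definition concrete, the sketch below labels residue pairs as contacts from their C-beta distances and computes AUROC as the fraction of (contact, non-contact) pairs that the scores rank correctly. It is a minimal illustration under stated assumptions: the distances and scores are made up, `auroc` is a hypothetical helper rather than the scorer used in this notebook, and "within 8.0 Angstroms" is assumed to include exactly 8.0.

```python
def auroc(labels, scores):
    """AUROC as the chance a random contact outscores a random non-contact (ties count 0.5)."""
    pos = [s for is_contact, s in zip(labels, scores) if is_contact]
    neg = [s for is_contact, s in zip(labels, scores) if not is_contact]
    if not pos or not neg:
        return None  # Undefined when only one class is present.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical C-beta distances (Angstroms) and covariation scores for six residue pairs.
cb_distances = [4.2, 7.9, 8.0, 12.5, 15.1, 20.3]
scores = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]
labels = [d <= 8.0 for d in cb_distances]  # Assumption: the cutoff is inclusive.

toy_auroc = auroc(labels, scores)
print(toy_auroc)
```

The `None` return value mirrors the situation described under Data Cleaning: when a protein has no True Positive contacts in a category, the metric is undefined and the protein has to be dropped from that comparison.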

In [36]:
method_order = ['DCA', 'EVC Standard', 'EVC Mean Field', 'ET-MIp', 'cET-MIp']

In [None]:
auroc_columns = ['Protein', 'Method', 'Sequence_Separation', 'AUROC']
protein_auroc_groups = comparable_method_proteins[auroc_columns].drop_duplicates().groupby('Protein')['AUROC'].apply(
    lambda x: not x.isnull().any())
complete_proteins = protein_auroc_groups.index[protein_auroc_groups.values]
comparable_auroc_proteins = small_comparison_df[small_comparison_df['Protein'].isin(complete_proteins)]
auroc_protein_length_order = [x for x in protein_length_order if x in complete_proteins]
auroc_protein_alignment_order = [x for x in protein_alignment_order if x in complete_proteins]
auroc_subset_df = comparable_auroc_proteins.loc[:, auroc_columns].drop_duplicates()
# Save the AUROC data used for the plots and statistics
auroc_subset_df.to_csv(os.path.join(small_set_out_dir, 'Small_AUROC_Comaprison_Data.csv'), sep='\t', header=True, index=False,
                       columns=auroc_columns)
# Plot the methods vs AUROC per protein ordered by protein length
protein_order_auroc_plot = sns.catplot(x="Protein", y="AUROC", hue="Method", row="Sequence_Separation", data=auroc_subset_df, kind="bar",
                                       ci=None, order=auroc_protein_length_order, hue_order=method_order, legend=True, legend_out=True)
protein_order_auroc_plot.set_xticklabels(auroc_protein_length_order, rotation=90)
protein_order_auroc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUROC_Comparison_Protein_Length_Order.png'),
                                 bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUROC per protein ordered by protein alignment size
alignment_order_auroc_plot = sns.catplot(x="Protein", y="AUROC", hue="Method", row="Sequence_Separation", data=auroc_subset_df, kind="bar",
                                       ci=None, order=auroc_protein_alignment_order, hue_order=method_order, legend=True, legend_out=True)
alignment_order_auroc_plot.set_xticklabels(auroc_protein_alignment_order, rotation=90)
alignment_order_auroc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUROC_Comparison_Alignment_Size_Order.png'),
                                   bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUROC grouped together to see overall trends
overall_auroc_plot = sns.boxplot(x="Sequence_Separation", y="AUROC", hue="Method", data=auroc_subset_df,
                                 order=sequence_separation_order, hue_order=method_order)
overall_auroc_plot.set_xticklabels(sequence_separation_order, rotation=90)
overall_auroc_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUROC_Comparison.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Compute statistics comparing methods at each sequence separation
auroc_statistics = {'Sequence Separation': [], 'Method 1': [], 'Method 2': [], 'Statistic': [], 'P-Value': []}
for sep in sequence_separation_order:
    sep_auroc_subset_df = auroc_subset_df.loc[auroc_subset_df['Sequence_Separation'] == sep, :]
    for i in range(len(method_order)):
        m1_sep_auroc_subset_df = sep_auroc_subset_df.loc[sep_auroc_subset_df['Method'] == method_order[i], :]
        for j in range(i + 1, len(method_order)):
            m2_sep_auroc_subset_df = sep_auroc_subset_df.loc[sep_auroc_subset_df['Method'] == method_order[j], :]
            stat, p_val = wilcoxon(x=m1_sep_auroc_subset_df['AUROC'], y=m2_sep_auroc_subset_df['AUROC'], zero_method='wilcox')
            auroc_statistics['Sequence Separation'].append(sep)
            auroc_statistics['Method 1'].append(method_order[i])
            auroc_statistics['Method 2'].append(method_order[j])
            auroc_statistics['Statistic'].append(stat)
            auroc_statistics['P-Value'].append(p_val)
pd.DataFrame(auroc_statistics).to_csv(os.path.join(small_set_out_dir, 'Small_AUROC_Comaprison_Statistics.csv'), sep='\t', header=True,
                                      index=False, columns=['Sequence Separation', 'Method 1', 'Method 2', 'Statistic', 'P-Value'])

### Metrics of Success (Continued)
* AUPRC - The area under the precision-recall curve, which plots Precision against Recall as the score threshold is varied. Because neither quantity depends on the number of True Negatives, this metric is more informative than AUROC under the class imbalance which is present when predicting structural contacts, since there are many fewer contacts than non-contacts. A residue pair is a True Positive if the C-beta atoms of the two amino acids are within 8.0 Angstroms of one another (as is done in the CASP competitions).
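
One common way to summarize the precision-recall curve is average precision: the mean of the precision values at the ranks where a true contact is recovered. The sketch below uses that estimator on made-up labels and scores. `average_precision` is a hypothetical helper, tied scores are not treated specially, and the scorer used in this notebook may integrate the curve differently.

```python
def average_precision(labels, scores):
    """Mean of the precision values at each rank where a true contact is found."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_contacts = sum(1 for _, is_contact in ranked if is_contact)
    if total_contacts == 0:
        return None  # Undefined when there are no true contacts.
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_contact) in enumerate(ranked, start=1):
        if is_contact:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_contacts


# Hypothetical, imbalanced example: 2 contacts among 10 scored residue pairs.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
labels = [True, False, True, False, False, False, False, False, False, False]

toy_ap = average_precision(labels, scores)
print(toy_ap)
```

Adding more low-scoring non-contacts to this example leaves the value unchanged, which illustrates why the metric is less sensitive to the large number of non-contacts.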

In [48]:
auprc_columns = ['Protein', 'Method', 'Sequence_Separation', 'AUPRC']
protein_auprc_groups = comparable_method_proteins[auprc_columns].drop_duplicates().groupby('Protein')['AUPRC'].apply(
    lambda x: not x.isnull().any())
complete_proteins = protein_auprc_groups.index[protein_auprc_groups.values]
comparable_auprc_proteins = small_comparison_df[small_comparison_df['Protein'].isin(complete_proteins)]
auprc_protein_length_order = [x for x in protein_length_order if x in complete_proteins]
auprc_protein_alignment_order = [x for x in protein_alignment_order if x in complete_proteins]
auprc_subset_df = comparable_auprc_proteins.loc[:, auprc_columns].drop_duplicates()
# Save the AUPRC data used for the plots and statistics
auprc_subset_df.to_csv(os.path.join(small_set_out_dir, 'Small_AUPRC_Comaprison_Data.csv'), sep='\t', header=True, index=False,
                       columns=auprc_columns)
# Plot the methods vs AUPRC per protein ordered by protein length
protein_order_auprc_plot = sns.catplot(x="Protein", y="AUPRC", hue="Method", row="Sequence_Separation", data=auprc_subset_df, kind="bar",
                                       ci=None, order=auprc_protein_length_order, hue_order=method_order, legend=True, legend_out=True)
protein_order_auprc_plot.set_xticklabels(auprc_protein_length_order, rotation=90)
protein_order_auprc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUPRC_Comparison_Protein_Length_Order.png'),
                                 bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUPRC per protein ordered by protein alignment size
alignment_order_auprc_plot = sns.catplot(x="Protein", y="AUPRC", hue="Method", row="Sequence_Separation", data=auprc_subset_df, kind="bar",
                                       ci=None, order=auprc_protein_alignment_order, hue_order=method_order, legend=True, legend_out=True)
alignment_order_auprc_plot.set_xticklabels(auprc_protein_alignment_order, rotation=90)
alignment_order_auprc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUPRC_Comparison_Alignment_Size_Order.png'),
                                   bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUPRC grouped together to see overall trends
overall_auprc_plot = sns.boxplot(x="Sequence_Separation", y="AUPRC", hue="Method", data=auprc_subset_df,
                                 order=sequence_separation_order, hue_order=method_order)
overall_auprc_plot.set_xticklabels(sequence_separation_order, rotation=90)
overall_auprc_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUPRC_Comparison.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Compute statistics comparing methods at each sequence separation
auprc_statistics = {'Sequence Separation': [], 'Method 1': [], 'Method 2': [], 'Statistic': [], 'P-Value': []}
for sep in sequence_separation_order:
    sep_auprc_subset_df = auprc_subset_df.loc[auprc_subset_df['Sequence_Separation'] == sep, :]
    for i in range(len(method_order)):
        m1_sep_auprc_subset_df = sep_auprc_subset_df.loc[sep_auprc_subset_df['Method'] == method_order[i], :]
        for j in range(i + 1, len(method_order)):
            m2_sep_auprc_subset_df = sep_auprc_subset_df.loc[sep_auprc_subset_df['Method'] == method_order[j], :]
            stat, p_val = wilcoxon(x=m1_sep_auprc_subset_df['AUPRC'], y=m2_sep_auprc_subset_df['AUPRC'], zero_method='wilcox')
            auprc_statistics['Sequence Separation'].append(sep)
            auprc_statistics['Method 1'].append(method_order[i])
            auprc_statistics['Method 2'].append(method_order[j])
            auprc_statistics['Statistic'].append(stat)
            auprc_statistics['P-Value'].append(p_val)
pd.DataFrame(auprc_statistics).to_csv(os.path.join(small_set_out_dir, 'Small_AUPRC_Comaprison_Statistics.csv'), sep='\t', header=True,
                                      index=False, columns=['Sequence Separation', 'Method 1', 'Method 2', 'Statistic', 'P-Value'])

* Precision at K - 

* Recall at K - 

* F1 at K - 

* Structural Cluster Weighting Z-Score - 