# Small Test Set #
## Goal ##
The goal of this test set is to perform proof-of-concept testing on a small number of proteins with a wide range of sizes and of available homologs, orthologs, and paralogs. This should make it possible to find the best parameterization for this tool and to identify its strengths and weaknesses, using various measurements as end points.
## Warning ##
Before attempting to use this notebook, make sure that your .env file has been properly set up to reflect the correct locations of the command line tools and of the files and directories needed for execution.
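A quick check along the following lines can report which settings are missing before any cell is run. This is only a sketch: `PROJECT_PATH`, `INPUT_PATH`, and `OUTPUT_PATH` are the variables this notebook reads, while the helper name `missing_settings` is hypothetical and not part of the project. Any additional variables the supporting classes may need (for example, paths to command line tools) are not covered.

```python
import os

# Environment variables read directly by this notebook; extend as needed.
REQUIRED_SETTINGS = ('PROJECT_PATH', 'INPUT_PATH', 'OUTPUT_PATH')


def missing_settings(required=REQUIRED_SETTINGS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_settings()
if missing:
    print('Missing .env settings: {}'.format(', '.join(missing)))
else:
    print('All required settings are present.')
```

Running this after `load_dotenv` gives a readable message instead of the `TypeError` that `os.path.join` raises when `os.environ.get` returns `None`.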
### Initial Import ###
This first cell performs the necessary imports required to begin this notebook.

In [1]:
from dotenv import find_dotenv, load_dotenv
try:
    dotenv_path = find_dotenv(raise_error_if_not_found=True)
except IOError:
    dotenv_path = find_dotenv(raise_error_if_not_found=True, usecwd=True)
load_dotenv(dotenv_path)
import os
import sys
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src'))
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src', 'SupportingClasses'))
input_dir = os.environ.get('INPUT_PATH')

## Data Set Construction ##
The first task required to test the data set is to download the required data and construct the input files needed for all downstream analyses.
In this case that means:
* Downloading PDB files for the proteins in our small test set.
* Extracting a query sequence from each PDB file.
* Searching for paralogs, homologs, and orthologs in a custom BLAST database built by filtering the UniRef90 database.
* Filtering the hits from the BLAST search to meet minimum and maximum length requirements, as well as minimum and maximum identity requirements.
* Building alignments using CLUSTALW in both the FASTA and MSF formats, since the tools used for comparison require different formats.
* Filtering the alignment by maximum identity between sequences.
* Re-aligning the filtered sequences using CLUSTALW.

This is all handled by the DataSetGenerator class found in the src/SupportingClasses folder.
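The length and identity filtering step in the list above can be sketched as follows. This is not the DataSetGenerator implementation: the function name `filter_hits`, the tuple layout of a hit, and every default threshold are assumptions chosen for illustration, since the actual cutoffs are not shown in this notebook.

```python
def filter_hits(hits, query_length, min_fraction=0.7, max_fraction=1.5,
                min_identity=0.3, max_identity=0.98):
    """Keep the ids of hits whose length and identity fall inside the bounds.

    `hits` is an iterable of (hit_id, hit_length, fraction_identity) tuples.
    All default thresholds are illustrative assumptions, not project values.
    """
    min_length = min_fraction * query_length
    max_length = max_fraction * query_length
    kept = []
    for hit_id, hit_length, identity in hits:
        if not min_length <= hit_length <= max_length:
            continue  # too short or too long relative to the query
        if not min_identity <= identity <= max_identity:
            continue  # too divergent from, or nearly identical to, the query
        kept.append(hit_id)
    return kept


example_hits = [('h1', 100, 0.95), ('h2', 40, 0.80), ('h3', 120, 0.99),
                ('h4', 90, 0.25), ('h5', 140, 0.50)]
print(filter_hits(example_hits, query_length=100))
```

Bounding identity from above as well as below removes near-duplicate sequences, which add alignment size without adding evolutionary signal.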

In [2]:
from time import time
from DataSetGenerator import DataSetGenerator
protein_list_dir = os.path.join(input_dir, 'ProteinLists')
if not os.path.isdir(protein_list_dir):
    os.makedirs(protein_list_dir)
small_list_fn = os.path.join(protein_list_dir, 'SmallDataSet.txt')
if not os.path.isfile(small_list_fn):
    proteins_of_interest = ['2ysdA', '1c17A', '3tnuA', '7hvpA', '135lA', '206lA', '2werA', '1bolA', '3q05A', '1axbA',
                            '2rh1A', '1hckA', '3b6vA', '2z0eA', '1jwlA', '1a26A', '1c0kA', '4lliA', '4ycuA', '2iopA',
                            '2zxeA', '2b59B', '1h1vG']
    with open(small_list_fn, 'w') as small_list_handle:
        for p_id in proteins_of_interest:
            small_list_handle.write('{}\n'.format(p_id))
generator = DataSetGenerator(input_dir)
start = time()
summary = generator.build_pdb_alignment_dataset(protein_list_fn=os.path.basename(small_list_fn), processes=10,
                                                database='customuniref90.fasta', max_target_seqs=2500, remote=False,
                                                verbose=False)
summary['Chain'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Chain'])
summary['Accession'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Accession'])
summary['Length'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Length'])
summary['Total_Size'] = summary.apply(lambda x: float(x['Length']) * float(x['Filtered_Alignment']), axis=1)
summary.sort_values(by=['Filtered_Alignment', 'Length'], axis=0, inplace=True)
summary_columns = ['Protein_ID', 'Chain', 'Accession', 'BLAST_Hits', 'Filtered_BLAST',
                   'Filtered_Alignment', 'Length', 'Total_Size']
print(summary[summary_columns])
end = time()
print('It took {} min to generate the data set.'.format((end - start) / 60.0))
summary.to_csv(os.path.join(input_dir, 'small_data_set_summary.tsv'), sep='\t', index=False, header=True,
               columns=summary_columns)

Importing protein list
Downloading structures and parsing in query sequences
Unique Sequences Found: 23!
BLASTing query sequences
Filtering BLAST hits, aligning, filtering by identity, and re-aligning
   Protein_ID Chain     Accession  BLAST_Hits  Filtered_BLAST  \
19       2b59     B    CIPA_CLOTM        1606               4   
18       7hvp     A     POL_HV1A2        2500              33   
13       1c0k     A    OXDA_RHOTO        2500              51   
14       206l     A      LYS_BPT4        1039             131   
16       1bol     A    RNRH_RHINI        2500             131   
5        3q05     A     P53_HUMAN         932             306   
3        1jwl     A    LACI_ECOLI        2500             228   
7        1a26     A   PARP1_CHICK        2500             267   
9        2ysd     A   MAGI1_HUMAN        2500             412   
12       2z0e     A   ATG4B_HUMAN        2111             360   
4        4lli     A   MYO5A_HUMAN        2500             441   
22       2rh1     A

Create a location to store the output of this method comparison.

In [3]:
output_dir = os.environ.get('OUTPUT_PATH')
small_set_out_dir = os.path.join(output_dir, 'SmallTestSet')
if not os.path.isdir(small_set_out_dir):
    os.makedirs(small_set_out_dir)

## Setting Up Scoring For Each Method
To reduce memory load during prediction and evaluation, the scoring objects needed to compute the comparison metrics are created ahead of time, so they are available to each method when it computes its predictions for a given protein. As a result, full prediction results do not need to be kept in memory while the remaining results are computed; only the metrics measured for each method are recorded.
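To make the scoring idea concrete, here is a minimal sketch of a distance-based contact definition and a top-K precision measure. It assumes one coordinate per residue (for example the beta carbon) and the 8.0 cutoff used in the cell that follows; `contact_pairs` and `top_k_precision` are hypothetical helpers, not ContactScorer methods, and whether ContactScorer treats the cutoff as inclusive is not shown here. `math.dist` requires Python 3.8 or later.

```python
import math


def contact_pairs(coords, cutoff=8.0, min_separation=1):
    """Return the set of residue index pairs (i, j), i < j, within `cutoff`.

    `coords` holds one (x, y, z) point per residue. `min_separation` is the
    smallest sequence separation |i - j| that is considered.
    """
    pairs = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.add((i, j))
    return pairs


def top_k_precision(ranked_pairs, true_contacts, k):
    """Fraction of the first `k` predicted pairs that are true contacts."""
    top = ranked_pairs[:k]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in true_contacts)
    return hits / len(top)


# Toy structure: four residues on a line plus one residue off to the side.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (11.4, 0.0, 0.0), (3.8, 5.0, 0.0)]
contacts = contact_pairs(coords)
ranked = [(0, 2), (0, 3), (1, 4), (3, 4)]  # predictions, best first
print(top_k_precision(ranked, contacts, k=4))
```

Because the contact set is computed once per protein and reused, each method only has to supply its ranked pairs, which is the memory-saving arrangement described above.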

In [4]:
import pandas as pd
from SeqAlignment import SeqAlignment
from PDBReference import PDBReference
from ContactScorer import ContactScorer, plot_z_scores
protein_order = list(summary['Protein_ID'])
method_order = ['DCA', 'EVC Standard', 'EVC Mean Field', 'ET-MIp', 'cET-MIp', 'ET-MIp_MAX']
sequence_separation_order = ['Any', 'Neighbors', 'Short', 'Medium', 'Long']
protein_scorers = {}
small_comparison_df = None
small_comparison_fn = os.path.join(small_set_out_dir, 'Small_Comparision_Data.csv')
if os.path.isfile(small_comparison_fn):
    small_comparison_df = pd.read_csv(small_comparison_fn, sep='\t', header=0, index_col=False)
else:
    for p_id in summary['Protein_ID']:
        protein_scorers[p_id] = {}
        # Import alignment and remove gaps
        full_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id)
        full_aln.import_alignment()
        non_gap_aln = full_aln.remove_gaps()
        # Import structure
        pdb_structure = PDBReference(pdb_file=generator.protein_data[p_id]['PDB'])
        pdb_structure.import_pdb(structure_id=p_id)
        protein_scorers[p_id]['Structure'] = pdb_structure
        # Initialize Beta Carbon distance scorer
        contact_scorer_cb = ContactScorer(query=p_id, seq_alignment=non_gap_aln,
                                          pdb_reference=pdb_structure, cutoff=8.0)
        contact_scorer_cb.best_chain = generator.protein_data[p_id]['Chain']
        contact_scorer_cb.fit()
        contact_scorer_cb.measure_distance(method='CB')
        protein_scorers[p_id]['Scorer_CB'] = contact_scorer_cb
        # Initialize distance scorer minimizing distance between any atoms
        contact_scorer_any = ContactScorer(query=p_id, seq_alignment=non_gap_aln,
                                           pdb_reference=pdb_structure, cutoff=8.0)
        contact_scorer_any.best_chain = generator.protein_data[p_id]['Chain']
        contact_scorer_any.fit()
        contact_scorer_any.measure_distance(method='Any')
        protein_scorers[p_id]['Scorer_Any'] = contact_scorer_any
        # Initialize z-scoring subproblems
        protein_scorers[p_id]['biased_w2_ave'] = None
        protein_scorers[p_id]['unbiased_w2_ave'] = None
# Columns written for every method; the unbiased block mirrors the biased block.
output_columns = ['Protein', 'Protein Length', 'Alignment Size', 'Method', 'Distance',
                  'Init Time', 'Import Time', 'Dist Tree Time', 'Trace Time', 'Total Time',
                  'Sequence_Separation', 'AUROC', 'AUPRC', 'AUTPRFDRC',
                  'Top K Predictions', 'Precision', 'Recall', 'F1 Score',
                  'Biased Z-Score at 10%', 'Biased Z-Score at 30%', 'Max Biased Z-Score', 'AUC Biased Z-Score',
                  'Unbiased Z-Score at 10%', 'Unbiased Z-Score at 30%', 'Max Unbiased Z-Score', 'AUC Unbiased Z-Score']

Removing gaps took 0.00035761197408040366 min
Importing the PDB file took 0.0029230634371439617 min




Removing gaps took 0.0005341450373331706 min
Importing the PDB file took 0.0012839953104654947 min
Mapping query sequence and pdb took 0.0023559808731079103 min
Computing the distance matrix based on the PDB file took 0.0029813647270202637 min
Removing gaps took 0.00022375186284383138 min
Importing the PDB file took 0.0007433017094930013 min
Mapping query sequence and pdb took 0.001366718610127767 min




Computing the distance matrix based on the PDB file took 0.003194904327392578 min
Removing gaps took 0.0012140790621439615 min
Importing the PDB file took 0.0010810414950052896 min
Removing gaps took 0.0006622950236002604 min




Importing the PDB file took 0.0007343888282775879 min
Mapping query sequence and pdb took 0.0016265432039896646 min
Computing the distance matrix based on the PDB file took 0.0012339472770690918 min
Removing gaps took 0.0006877462069193522 min




Importing the PDB file took 0.0019040981928507487 min
Mapping query sequence and pdb took 0.002870655059814453 min
Computing the distance matrix based on the PDB file took 0.0012443224589029948 min
Removing gaps took 0.0006171027819315592 min
Importing the PDB file took 0.0020156065622965497 min
Removing gaps took 0.0006418267885843913 min
Importing the PDB file took 0.0015513340632120768 min
Mapping query sequence and pdb took 0.0023674686749776204 min
Computing the distance matrix based on the PDB file took 0.01543415387471517 min
Removing gaps took 0.0005736827850341796 min
Importing the PDB file took 0.002214447657267253 min
Mapping query sequence and pdb took 0.0029568831125895184 min
Computing the distance matrix based on the PDB file took 0.015213024616241456 min
Removing gaps took 0.00043114821116129555 min
Importing the PDB file took 0.000867907206217448 min
Removing gaps took 0.00037651856740315756 min
Importing the PDB file took 0.0004058718681335449 min
Mapping query sequen



Removing gaps took 0.005387806892395019 min
Importing the PDB file took 0.0028868714968363443 min
Mapping query sequence and pdb took 0.008743973573048909 min




Computing the distance matrix based on the PDB file took 0.006230103969573975 min
Removing gaps took 0.0054977178573608395 min
Importing the PDB file took 0.002827322483062744 min
Mapping query sequence and pdb took 0.00878314177195231 min




Computing the distance matrix based on the PDB file took 0.007442951202392578 min
Removing gaps took 0.0021175265312194822 min




Importing the PDB file took 0.0020781397819519044 min
Removing gaps took 0.003196132183074951 min




Importing the PDB file took 0.0018346786499023437 min
Mapping query sequence and pdb took 0.005248645941416423 min
Computing the distance matrix based on the PDB file took 0.011430474122365315 min
Removing gaps took 0.004404115676879883 min
Importing the PDB file took 0.0016899903615315754 min
Mapping query sequence and pdb took 0.006508477528889974 min




Computing the distance matrix based on the PDB file took 0.014408087730407715 min
Removing gaps took 0.017411466439565024 min
Importing the PDB file took 0.0012807687123616537 min
Removing gaps took 0.0181235671043396 min
Importing the PDB file took 0.0006534973780314128 min
Mapping query sequence and pdb took 0.02002082665761312 min
Computing the distance matrix based on the PDB file took 0.013660172621409098 min
Removing gaps took 0.016089022159576416 min
Importing the PDB file took 0.0006278832753499349 min
Mapping query sequence and pdb took 0.018182289600372315 min
Computing the distance matrix based on the PDB file took 0.014986809094746907 min
Removing gaps took 0.027816406885782876 min
Importing the PDB file took 0.005727513631184896 min
Removing gaps took 0.027064164479573567 min
Importing the PDB file took 0.0057994683583577475 min
Mapping query sequence and pdb took 0.04560620387395223 min
Computing the distance matrix based on the PDB file took 0.0003823121388753255 min
Rem



Removing gaps took 0.009760093688964844 min
Importing the PDB file took 0.0008462270100911459 min
Mapping query sequence and pdb took 0.010921068986256917 min




Computing the distance matrix based on the PDB file took 0.010933597882588705 min
Removing gaps took 0.011046934127807616 min
Importing the PDB file took 0.0029242555300394695 min
Mapping query sequence and pdb took 0.01472158432006836 min




Computing the distance matrix based on the PDB file took 0.011497060457865397 min
Removing gaps took 0.05157386859258016 min
Importing the PDB file took 0.0017905195554097494 min




Removing gaps took 0.051024095217386885 min
Importing the PDB file took 0.0014230569203694662 min




Mapping query sequence and pdb took 0.05539208253224691 min
Computing the distance matrix based on the PDB file took 0.016651960213979085 min
Removing gaps took 0.052708768844604494 min
Importing the PDB file took 0.0014222423235575358 min




Mapping query sequence and pdb took 0.057111823558807374 min
Computing the distance matrix based on the PDB file took 0.01781843105951945 min
Removing gaps took 0.009731245040893555 min
Importing the PDB file took 0.001206827163696289 min
Removing gaps took 0.009265279769897461 min
Importing the PDB file took 0.0008852044741312663 min
Mapping query sequence and pdb took 0.01847729682922363 min
Computing the distance matrix based on the PDB file took 0.021158961455027263 min
Removing gaps took 0.009330145517985026 min
Importing the PDB file took 0.0008737921714782714 min
Mapping query sequence and pdb took 0.020247141520182293 min
Computing the distance matrix based on the PDB file took 0.022627011934916178 min
Removing gaps took 0.028204989433288575 min
Importing the PDB file took 0.0014022588729858398 min




Removing gaps took 0.03028901815414429 min
Importing the PDB file took 0.0010670940081278482 min
Mapping query sequence and pdb took 0.03238084316253662 min




Computing the distance matrix based on the PDB file took 0.010306803385416667 min
Removing gaps took 0.028255903720855714 min
Importing the PDB file took 0.0010560274124145508 min
Mapping query sequence and pdb took 0.02994236946105957 min




Computing the distance matrix based on the PDB file took 0.012384398778279623 min
Removing gaps took 0.045598928133646646 min
Importing the PDB file took 0.0021667718887329102 min




Removing gaps took 0.042883690198262533 min
Importing the PDB file took 0.0029928843180338544 min
Mapping query sequence and pdb took 0.04658692280451457 min




Computing the distance matrix based on the PDB file took 0.012373145421346028 min
Removing gaps took 0.04329909880956014 min
Importing the PDB file took 0.0013120333353678385 min
Mapping query sequence and pdb took 0.04528274138768514 min




Computing the distance matrix based on the PDB file took 0.012838935852050782 min
Removing gaps took 0.085567307472229 min
Importing the PDB file took 0.002627249558766683 min




Removing gaps took 0.0866447408994039 min
Importing the PDB file took 0.0022523880004882814 min




Mapping query sequence and pdb took 0.09087692101796468 min
Computing the distance matrix based on the PDB file took 0.1012054721514384 min
Removing gaps took 0.08756890296936035 min
Importing the PDB file took 0.0022269129753112794 min




Mapping query sequence and pdb took 0.09173396825790406 min
Computing the distance matrix based on the PDB file took 0.11261076927185058 min
Removing gaps took 0.004369878768920898 min
Importing the PDB file took 0.0037349343299865724 min
Removing gaps took 0.0042858044306437176 min
Importing the PDB file took 0.003286643822987874 min
Mapping query sequence and pdb took 0.007710385322570801 min
Computing the distance matrix based on the PDB file took 0.000721124807993571 min
Removing gaps took 0.0045958757400512695 min
Importing the PDB file took 0.00322189728418986 min
Mapping query sequence and pdb took 0.007970778147379558 min
Computing the distance matrix based on the PDB file took 0.0010088562965393066 min
Removing gaps took 0.011838038762410482 min
Importing the PDB file took 0.0007766087849934896 min
Removing gaps took 0.012195146083831787 min
Importing the PDB file took 0.0002846240997314453 min
Mapping query sequence and pdb took 0.01274776856104533 min
Computing the distance 



Removing gaps took 0.10475523074467977 min
Importing the PDB file took 0.0009712457656860351 min
Mapping query sequence and pdb took 0.1069344719250997 min




Computing the distance matrix based on the PDB file took 0.004884541034698486 min
Removing gaps took 0.10564639170964558 min
Importing the PDB file took 0.0009310801823933919 min
Mapping query sequence and pdb took 0.10788713693618775 min




Computing the distance matrix based on the PDB file took 0.005985097090403239 min
Removing gaps took 0.07256233294804891 min
Importing the PDB file took 0.0009890039761861166 min




Removing gaps took 0.07266640663146973 min
Importing the PDB file took 0.0029077887535095214 min
Mapping query sequence and pdb took 0.0765071431795756 min




Computing the distance matrix based on the PDB file took 0.0008477648099263509 min
Removing gaps took 0.070345405737559 min
Importing the PDB file took 0.0005972822507222494 min
Mapping query sequence and pdb took 0.07190616130828857 min
Computing the distance matrix based on the PDB file took 0.0009584228197733561 min




Removing gaps took 0.07161407868067424 min
Importing the PDB file took 0.0007194916407267253 min
Removing gaps took 0.0713257114092509 min
Importing the PDB file took 0.000567928949991862 min
Mapping query sequence and pdb took 0.07296269337336223 min
Computing the distance matrix based on the PDB file took 0.009139927228291829 min
Removing gaps took 0.06906507809956869 min
Importing the PDB file took 0.000557871659596761 min
Mapping query sequence and pdb took 0.07077186107635498 min
Computing the distance matrix based on the PDB file took 0.010944620768229166 min
Removing gaps took 0.045420360565185544 min
Importing the PDB file took 0.0030422210693359375 min
Removing gaps took 0.045657014846801756 min
Importing the PDB file took 0.0004896203676859538 min
Mapping query sequence and pdb took 0.0467659592628479 min
Computing the distance matrix based on the PDB file took 0.007254894574483236 min
Removing gaps took 0.04579707384109497 min
Importing the PDB file took 0.002964206536610921



Importing the PDB file took 0.004579714934031169 min
Removing gaps took 0.16463130712509155 min




Importing the PDB file took 0.0034204403559366862 min
Mapping query sequence and pdb took 0.17005398273468017 min
Computing the distance matrix based on the PDB file took 0.03083344300587972 min
Removing gaps took 0.1584663947423299 min




Importing the PDB file took 0.006429104010264078 min
Mapping query sequence and pdb took 0.16995570659637452 min
Computing the distance matrix based on the PDB file took 0.03356578747431437 min
Removing gaps took 0.10802408456802368 min




Importing the PDB file took 0.007451208432515463 min
Removing gaps took 0.10627371470133463 min




Importing the PDB file took 0.0041050593058268225 min
Mapping query sequence and pdb took 0.11167031129201253 min
Computing the distance matrix based on the PDB file took 0.042313822110493976 min
Removing gaps took 0.10394990841547648 min




Importing the PDB file took 0.007165094216664632 min
Mapping query sequence and pdb took 0.11242412726084391 min
Computing the distance matrix based on the PDB file took 0.04454504251480103 min


# Generating Values For Comparison #
To determine the effectiveness of the new method and implementation, the covariation of the same proteins will be computed using the previous Evolutionary Trace covariation method (ET-MIp) and other methods in the field.

## ET-MIp ##
Scoring the covariation of the proteins using the previous Evolutionary Trace covariation method (ET-MIp).

In [5]:
# from ETMIPWrapper import ETMIPWrapper
# etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp')
# if not os.path.isdir(etmip_out_dir):
#     os.makedirs(etmip_out_dir)
# etmip_scores = {}
# counts = {'success':0, 'value': 0, 'attribute':0}
# for p_id in generator.protein_data:
#     print('Attempting to calculate ET-MIp covariance for: {}'.format(p_id))
#     try:
#         protein_out_dir = os.path.join(etmip_out_dir, p_id)
#         if not os.path.isdir(protein_out_dir):
#             os.makedirs(protein_out_dir)
#         curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id, polymer_type='Protein')
#         curr_aln.import_alignment()
#         curr_etmip = ETMIPWrapper(alignment=curr_aln)
#         curr_etmip.calculate_scores(out_dir=protein_out_dir, delete_files=False)
#         etmip_scores[p_id] = curr_etmip
#         print('Successfully computed ET-MIp covariance for: {}'.format(p_id))
#         counts['success'] += 1
#     except ValueError:
#         print('Could not compute ET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
#             p_id, curr_aln.seq_length, curr_aln.size))
#         counts['value'] += 1
#     except AttributeError:
#         print('Could not compute ET-MIp covariance for: {} no alignment'.format(p_id))
#         counts['attribute'] += 1
# print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
#                                                                      counts['attribute']))

## ET-MIp (Continued)
The previous implementation cannot run on alignments of the size used here. Instead, we use the new implementation with the same parameterization as the previous implementation (distance model: blosum62 similarity; tree: ET UPGMA variant; scoring metric: filtered average product corrected mutual information; ranks: all).
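For readers unfamiliar with the scoring metric named above, the following is a textbook sketch of mutual information with the average product correction (MIp) for a single alignment. It is not the EvolutionaryTrace implementation: the filtering step, gap handling, and the per-group computation over the phylogenetic tree are all omitted, and the function names are hypothetical. It assumes all sequences have the same length.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (natural log) between two alignment columns."""
    n = float(len(col_i))
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    counts_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi


def apc_corrected_mi(sequences):
    """Return {(i, j): MIp} for all column pairs i < j of an alignment.

    MIp(i, j) = MI(i, j) - mean_MI(i) * mean_MI(j) / overall_mean_MI
    """
    columns = list(zip(*sequences))
    n_cols = len(columns)
    if n_cols < 2:
        return {}
    mi = {(i, j): mutual_information(columns[i], columns[j])
          for i, j in combinations(range(n_cols), 2)}
    # Mean MI of each column with every other column.
    col_mean = [sum(mi[tuple(sorted((i, j)))] for j in range(n_cols) if j != i)
                / (n_cols - 1) for i in range(n_cols)]
    overall_mean = sum(mi.values()) / len(mi)
    if overall_mean == 0.0:
        return dict.fromkeys(mi, 0.0)
    return {(i, j): value - col_mean[i] * col_mean[j] / overall_mean
            for (i, j), value in mi.items()}


# Columns 0 and 1 vary together; column 2 varies independently of both.
scores = apc_corrected_mi(['AAK', 'AAR', 'CCK', 'CCR'])
print(scores)
```

In this toy alignment the coupled pair (0, 1) keeps a positive score after correction, while the pairs involving the independent column score zero.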

In [5]:
from EvolutionaryTrace import EvolutionaryTrace
import numpy as np
import pandas as pd
if not os.path.isfile(small_comparison_fn):
    etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp')
    if not os.path.isdir(etmip_out_dir):
        os.makedirs(etmip_out_dir)
    etmip_method_fn = os.path.join(etmip_out_dir, 'ET-MIp_Method_Data.csv')
    if os.path.isfile(etmip_method_fn):
        etmip_method_df = pd.read_csv(etmip_method_fn, sep='\t', header=0, index_col=False)
    else:    
        etmip_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in summary['Protein_ID']:
            print('Attempting to calculate ET-MIp covariance for: {}'.format(p_id))
            protein_dir = os.path.join(etmip_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:

                    start_time = time()
                    curr_etmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                                   aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                                   distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                                   ranks=None, position_type='pair',
                                                   scoring_metric='filtered_average_product_corrected_mutual_information',
                                                   gap_correction=None, maximize=False, out_dir=protein_dir,
                                                   output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                                   processors=10, low_memory=True)
                    init_time = time()
                    curr_etmip.import_and_process_aln()
                    import_time = time()
                    curr_etmip.compute_distance_matrix_tree_and_assignments()
                    dist_tree_time = time()
                    curr_etmip.perform_trace()
                    end_time = time()
                    print('Successfully computed ET-MIp covariance for: {}'.format(p_id))
                    # Compute statistics for the final scores of the ET-MIp model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_etmip, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_etmip.scorer.position_size,
                        rank_type=curr_etmip.scorer.rank_type, file_prefix='ET-MIp_Scores_', plots=True)
                    # Score Prediction Clustering
                    z_score_fn = os.path.join(protein_dir, 'ET-MIp_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'ET-MIp_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = init_time - start_time
                    protein_df['Import Time'] = import_time - init_time
                    protein_df['Dist Tree Time'] = dist_tree_time - import_time
                    protein_df['Trace Time'] = end_time - dist_tree_time
                    protein_df['Total Time'] = end_time - start_time
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'ET-MIp'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    temp_data = os.path.join(protein_dir, 'unique_node_data')
                    for temp_fn in os.listdir(temp_data):
                        if not temp_fn.endswith("_pair_rank_filtered_average_product_corrected_mutual_information_score.npz"):
                            os.remove(os.path.join(temp_data, temp_fn))
                    print('Metrics measured for ET-MIp covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute ET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_etmip.original_aln.seq_length, curr_etmip.original_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute ET-MIp covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.
            if etmip_method_df is None:
                etmip_method_df = protein_df
            else:
                etmip_method_df = pd.concat([etmip_method_df, protein_df])
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        etmip_method_df.to_csv(etmip_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.
    if small_comparison_df is None:
        small_comparison_df = etmip_method_df
    else:
        small_comparison_df = pd.concat([small_comparison_df, etmip_method_df])
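The cell above looks up the Z-score at 10% and 30% coverage with four long indexing expressions. The lookup itself can be sketched as a small helper, shown here on plain sequences. The name `z_score_at_fraction` is hypothetical, and the sketch assumes the rows are ordered by increasing residue count, as the `[0]` indexing in the cell implies.

```python
def z_score_at_fraction(num_residues, z_scores, chain_length, fraction):
    """Return the Z-score at the first row covering `fraction` of the chain.

    `num_residues` and `z_scores` are parallel sequences ordered by
    increasing residue count, mirroring the 'Num_Residues' and 'Z-Score'
    columns used above. Raises IndexError if no row reaches the threshold,
    matching what indexing an empty selection with [0] would do.
    """
    threshold = float(chain_length) * fraction
    for count, z_score in zip(num_residues, z_scores):
        if count >= threshold:
            return z_score
    raise IndexError('No row covers {:.0%} of the chain.'.format(fraction))


nums = [1, 3, 5, 7, 9]
zs = [0.1, 0.5, 1.2, 2.0, 1.7]
print(z_score_at_fraction(nums, zs, chain_length=20, fraction=0.10))
print(z_score_at_fraction(nums, zs, chain_length=20, fraction=0.30))
```

With pandas objects, the two columns could be passed as `df['Num_Residues'].tolist()` and the coerced Z-score array, with the chain length computed once per protein rather than four times.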

## cET-MIp
This segment evaluates the ET-MIp method when it is constrained to an arbitrary set of nodes (ranks 1, 2, 3, 5, 7, 10, and 25) at the top of the phylogenetic tree.

In [6]:
if not os.path.isfile(small_comparison_fn):
    cetmip_out_dir = os.path.join(small_set_out_dir, 'cET-MIp')
    if not os.path.isdir(cetmip_out_dir):
        os.makedirs(cetmip_out_dir)
    cetmip_method_fn = os.path.join(cetmip_out_dir, 'cET-MIp_Method_Data.csv')
    if os.path.isfile(cetmip_method_fn):
        cetmip_method_df = pd.read_csv(cetmip_method_fn, sep='\t', header=0, index_col=False)
    else:
        cetmip_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0, 'key': 0}
        for p_id in summary['Protein_ID']:
            print('Attempting to calculate cET-MIp covariance for: {}'.format(p_id))
            protein_dir = os.path.join(cetmip_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:
                    start_time = time()
                    curr_cetmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                                   aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                                   distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                                   ranks=[1, 2, 3, 5, 7, 10, 25], position_type='pair',
                                                   scoring_metric='filtered_average_product_corrected_mutual_information',
                                                   gap_correction=None, maximize=False, out_dir=protein_dir,
                                                   output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                                   processors=10, low_memory=True)
                    init_time = time()
                    curr_cetmip.import_and_process_aln()
                    import_time = time()
                    curr_cetmip.compute_distance_matrix_tree_and_assignments()
                    dist_tree_time = time()
                    curr_cetmip.perform_trace()
                    end_time = time()
                    # Compute statistics for the final scores of the cET-MIp model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_cetmip, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_cetmip.scorer.position_size,
                        rank_type=curr_cetmip.scorer.rank_type, file_prefix='cET-MIp_Scores_', plots=True)
                    # Score Prediction Clustering
                    z_score_fn = os.path.join(protein_dir, 'cET-MIp_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'cET-MIp_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_cetmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_cetmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = init_time - start_time
                    protein_df['Import Time'] = import_time - init_time
                    protein_df['Dist Tree Time'] = dist_tree_time - import_time
                    protein_df['Trace Time'] = end_time - dist_tree_time
                    protein_df['Total Time'] = end_time - start_time
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'cET-MIp'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    temp_data = os.path.join(protein_dir, 'unique_node_data')
                    for temp_fn in os.listdir(temp_data):
                        if not temp_fn.endswith("_pair_rank_filtered_average_product_corrected_mutual_information_score.npz"):
                            os.remove(os.path.join(temp_data, temp_fn))
                    print('Successfully computed cET-MIp covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute cET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_cetmip.original_aln.seq_length, curr_cetmip.original_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute cET-MIp covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
                except KeyError:
                    print('Could not compute cET-MIp covariance for: {} not enough sequences'.format(p_id))
                    counts['key'] += 1
                    continue
            if cetmip_method_df is None:
                cetmip_method_df = protein_df
            else:
                cetmip_method_df = cetmip_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors\n{}\tKey Errors'.format(
            counts['success'], counts['value'], counts['attribute'], counts['key']))
        cetmip_method_df.to_csv(cetmip_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = cetmip_method_df
    else:
        small_comparison_df = small_comparison_df.append(cetmip_method_df)
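
The "Z-Score at 10%" and "Z-Score at 30%" columns in the cell above are each computed with a long inline expression that selects the first Z-score whose `Num_Residues` value reaches a fraction of the chain length. The following is a minimal sketch of how that lookup could be factored into one helper. The function name `z_score_at_fraction` and the toy values are assumptions made for illustration and are not part of the project code. Unlike the inline expression, this version returns NaN instead of raising an `IndexError` when no row reaches the requested fraction.

```python
import numpy as np


def z_score_at_fraction(z_scores, num_residues, seq_length, fraction):
    """Return the first Z-score whose residue count reaches fraction * seq_length.

    z_scores and num_residues are parallel 1-D sequences ordered by increasing
    residue count, mirroring the 'Z-Score' and 'Num_Residues' columns used above.
    Returns NaN if no row reaches the requested fraction.
    """
    z_scores = np.asarray(z_scores, dtype=float)
    num_residues = np.asarray(num_residues, dtype=float)
    reached = num_residues >= float(seq_length) * fraction
    if not reached.any():
        return float('nan')
    return float(z_scores[reached][0])


# Toy data (made up for illustration): a 20-residue chain.
toy_z = [0.5, 1.2, 2.0, 1.7]
toy_n = [1, 3, 7, 12]
at_10 = z_score_at_fraction(toy_z, toy_n, seq_length=20, fraction=0.10)
at_30 = z_score_at_fraction(toy_z, toy_n, seq_length=20, fraction=0.30)
print(at_10, at_30)
```

In the notebook, `seq_length` would be the length of the best chain of the query structure, and the two sequences would come from the data frame returned by `score_clustering_of_contact_predictions`.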

### ET-MIp with Group Maximization ###
This cell generates data for and tests the effect of maximizing the group score when moving from a parent node to child nodes.

In [7]:
from EvolutionaryTrace import EvolutionaryTrace
import numpy as np
import pandas as pd
if not os.path.isfile(small_comparison_fn):
    etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp_MAX')
    if not os.path.isdir(etmip_out_dir):
        os.makedirs(etmip_out_dir)
    etmip_method_fn = os.path.join(etmip_out_dir, 'ET-MIp_MAX_Method_Data.csv')
    if os.path.isfile(etmip_method_fn):
        etmip_method_df = pd.read_csv(etmip_method_fn, sep='\t', header=0, index_col=False)
    else:
        etmip_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in summary['Protein_ID']:
            print('Attempting to calculate ET-MIp MAXIMIZED covariance for: {}'.format(p_id))
            protein_dir = os.path.join(etmip_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:
                    start_time = time()
                    curr_etmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                                   aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                                   distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                                   ranks=None, position_type='pair',
                                                   scoring_metric='filtered_average_product_corrected_mutual_information',
                                                   gap_correction=None, maximize=True, out_dir=protein_dir,
                                                   output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                                   processors=10, low_memory=True)
                    init_time = time()
                    curr_etmip.import_and_process_aln()
                    import_time = time()
                    curr_etmip.compute_distance_matrix_tree_and_assignments()
                    dist_tree_time = time()
                    curr_etmip.perform_trace()
                    end_time = time()
                    print('Successfully computed ET-MIp MAX covariance for: {}'.format(p_id))
                    # Compute statistics for the final scores of the ET-MIp model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_etmip, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_etmip.scorer.position_size,
                        rank_type=curr_etmip.scorer.rank_type, file_prefix='ET-MIp_MAX_Scores_', plots=True)
                    # Score Prediction Clustering
                    z_score_fn = os.path.join(protein_dir, 'ET-MIp_MAX_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'ET-MIp_MAX_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - curr_etmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = init_time - start_time
                    protein_df['Import Time'] = import_time - init_time
                    protein_df['Dist Tree Time'] = dist_tree_time - import_time
                    protein_df['Trace Time'] = end_time - dist_tree_time
                    protein_df['Total Time'] = end_time - start_time
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'ET-MIp_MAX'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    temp_data = os.path.join(protein_dir, 'unique_node_data')
                    for temp_fn in os.listdir(temp_data):
                        if not temp_fn.endswith("_pair_rank_filtered_average_product_corrected_mutual_information_score.npz"):
                            os.remove(os.path.join(temp_data, temp_fn))
                    print('Metrics measured for ET-MIp MAX covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute ET-MIp MAX covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_etmip.original_aln.seq_length, curr_etmip.original_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute ET-MIp MAX covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            if etmip_method_df is None:
                etmip_method_df = protein_df
            else:
                etmip_method_df = etmip_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        etmip_method_df.to_csv(etmip_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = etmip_method_df
    else:
        small_comparison_df = small_comparison_df.append(etmip_method_df)
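
Each ET-MIp cell records stage timings by calling `time()` between steps and subtracting neighbouring timestamps. As a hedged sketch of the same bookkeeping, the hypothetical helper `time_stages` below runs a list of labelled callables and returns the elapsed seconds per stage. It uses the standard library's `perf_counter`, a monotonic clock, in place of `time()`; that substitution is a suggestion, not what the notebook does. The stand-in lambdas only mark where the predictor's methods would be called.

```python
from time import perf_counter


def time_stages(stages):
    """Run (label, callable) pairs in order and return elapsed seconds per label.

    A 'Total Time' entry covering all stages is added at the end.
    """
    timings = {}
    start = perf_counter()
    previous = start
    for label, func in stages:
        func()
        now = perf_counter()
        timings[label] = now - previous
        previous = now
    timings['Total Time'] = previous - start
    return timings


# Stand-in callables; in the notebook these would be the predictor's methods.
calls = []
timings = time_stages([
    ('Import Time', lambda: calls.append('import')),
    ('Dist Tree Time', lambda: calls.append('dist_tree')),
    ('Trace Time', lambda: calls.append('trace')),
])
print(timings)
```

The returned dictionary could then be copied into `protein_df` one key at a time, which keeps the column names and the measured intervals in a single place.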

## DCA ##
Scoring the covariation of the proteins using a Julia implementation of DCA.

In [9]:
from DCAWrapper import DCAWrapper
from utils import compute_rank_and_coverage
if not os.path.isfile(small_comparison_fn):
    dca_out_dir = os.path.join(small_set_out_dir, 'DCA')
    if not os.path.isdir(dca_out_dir):
        os.makedirs(dca_out_dir)
    dca_method_fn = os.path.join(dca_out_dir, 'DCA_Method_Data.csv')
    if os.path.isfile(dca_method_fn):
        dca_method_df = pd.read_csv(dca_method_fn, sep='\t', header=0, index_col=False)
    else:
        dca_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in generator.protein_data:
            print('Attempting to calculate DCA covariance for: {}'.format(p_id))
            protein_dir = os.path.join(dca_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:
                    curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                            polymer_type='Protein')
                    curr_aln.import_alignment()
                    # Since the DCA implementation used here does not provide a way to specify the query sequence we remove the gaps
                    # from the query sequences so positions will be referenced correctly for that sequence (and unnecessary
                    # computations can be avoided).
                    curr_aln = curr_aln.remove_gaps()
                    new_aln_fn = os.path.join(protein_dir, '{}_no_gap.fasta'.format(p_id))
                    curr_aln.write_out_alignment(new_aln_fn)
                    curr_aln.file_name = new_aln_fn
                    curr_dca = DCAWrapper(alignment=curr_aln)
                    curr_dca.calculate_scores(out_dir=protein_dir, delete_file=False)
                    # Compute statistics for the final scores of the DCA model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_dca, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2, rank_type='max', file_prefix='DCA_Scores_', plots=True)
                    # Score Prediction Clustering
                    _, dca_coverage  = compute_rank_and_coverage(seq_length=curr_dca.alignment.seq_length, scores=curr_dca.scores, pos_size=2,
                        rank_type='max')
                    z_score_fn = os.path.join(protein_dir, 'DCA_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'DCA_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - dca_coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - dca_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = None
                    protein_df['Import Time'] = None
                    protein_df['Dist Tree Time'] = None
                    protein_df['Trace Time'] = None
                    protein_df['Total Time'] = None
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'DCA'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    print('Successfully computed DCA covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute DCA covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_aln.seq_length, curr_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute DCA covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            if dca_method_df is None:
                dca_method_df = protein_df
            else:
                dca_method_df = dca_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        dca_method_df.to_csv(dca_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = dca_method_df
    else:
        small_comparison_df = small_comparison_df.append(dca_method_df)
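
The DCA cell removes gaps relative to the query sequence before scoring so that column indices line up with query residue positions. The sketch below illustrates that idea on a plain dictionary of aligned strings. It is not the `SeqAlignment.remove_gaps` implementation; the function `remove_query_gaps`, its `gap_chars` parameter, and the toy alignment are assumptions made for illustration.

```python
def remove_query_gaps(alignment, query_id, gap_chars='-.'):
    """Drop every alignment column in which the query sequence has a gap.

    alignment: dict mapping sequence id -> aligned sequence (equal lengths).
    Returns a new dict; column i of the result corresponds to query residue i.
    """
    query = alignment[query_id]
    if any(len(seq) != len(query) for seq in alignment.values()):
        raise ValueError('All aligned sequences must have the same length')
    # Indices of the columns where the query holds a residue rather than a gap.
    keep = [i for i, char in enumerate(query) if char not in gap_chars]
    return {seq_id: ''.join(seq[i] for i in keep) for seq_id, seq in alignment.items()}


# Toy alignment (made up for illustration).
toy_aln = {
    'query': 'MK-LV-A',
    'hit_1': 'MRTLVGA',
    'hit_2': 'M-TL-GA',
}
no_gap_aln = remove_query_gaps(toy_aln, 'query')
print(no_gap_aln)
```

Gaps in the other sequences are kept; only columns that are gapped in the query are dropped, so a score reported for column pair (i, j) refers to query residues i and j.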

## EVCouplings ##
Scoring the covariation of the proteins using the EVCouplings method standard protocol.

In [11]:
from EVCouplingsWrapper import EVCouplingsWrapper
if not os.path.isfile(small_comparison_fn):
    evc_standard_out_dir = os.path.join(small_set_out_dir, 'EVCouplings_Standard')
    if not os.path.isdir(evc_standard_out_dir):
        os.makedirs(evc_standard_out_dir)
    evc_standard_method_fn = os.path.join(evc_standard_out_dir, 'EVCouplings_Standard_Method_Data.csv')
    if os.path.isfile(evc_standard_method_fn):
        evc_standard_method_df = pd.read_csv(evc_standard_method_fn, sep='\t', header=0, index_col=False)
    else:
        evc_standard_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in generator.protein_data:
            print('Attempting to calculate EV couplings standard protocol covariance for: {}'.format(p_id))
            protein_dir = os.path.join(evc_standard_out_dir, p_id)
            if not os.path.isdir(protein_dir):
                os.makedirs(protein_dir)
            protein_fn = os.path.join(protein_dir, '{}_Protein_Data.csv'.format(p_id))
            if os.path.isfile(protein_fn):
                protein_df = pd.read_csv(protein_fn, sep='\t', header=0, index_col=False)
            else:
                try:
                    curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                            polymer_type='Protein')
                    curr_aln.import_alignment()
                    curr_evc = EVCouplingsWrapper(alignment=curr_aln, protocol='standard')
                    curr_evc.calculate_scores(out_dir=protein_dir, cores=10, delete_files=True)
                    # Compute statistics for the final scores of the EVCouplings standard protocol model
                    protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                        predictor=curr_evc, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                        unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2,
                        rank_type='max', file_prefix='EVC_Standard_Scores_', plots=True)
                    # Score Prediction Clustering
                    _, evc_standard_coverage  = compute_rank_and_coverage(seq_length=curr_evc.alignment.seq_length, scores=curr_evc.scores, pos_size=2,
                        rank_type='max')
                    z_score_fn = os.path.join(protein_dir, 'EVC_Standard_Scores_Dist-Any_{}_ZScores.tsv')
                    z_score_plot_fn = os.path.join(protein_dir, 'EVC_Standard_Scores_Dist-Any_{}_ZScores.png')
                    z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - evc_standard_coverage, bias=True, file_path=z_score_fn.format('Biased'),
                        w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['biased_w2_ave'] is None:
                        protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                    biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                    protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                    protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                    plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                    z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                        1.0 - evc_standard_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                        w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                    if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                        protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                    unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                    protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                    protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                    protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                    protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                    plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                    # Record execution times
                    protein_df['Init Time'] = None
                    protein_df['Import Time'] = None
                    protein_df['Dist Tree Time'] = None
                    protein_df['Trace Time'] = None
                    protein_df['Total Time'] = None
                    # Record static data for this protein
                    protein_df['Protein'] = p_id
                    protein_df['Method'] = 'EVC Standard'
                    protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                    protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                    print('Successfully computed EV couplings standard protocol covariance for: {}'.format(p_id))
                    counts['success'] += 1
                except ValueError:
                    print('Could not compute EV couplings standard protocol covariance for: {} with seq_length: {} and size: {}'.format(
                        p_id, curr_aln.seq_length, curr_aln.size))
                    counts['value'] += 1
                    continue
                except AttributeError:
                    print('Could not compute EV couplings standard protocol covariance for: {} no alignment'.format(p_id))
                    counts['attribute'] += 1
                    continue
            if evc_standard_method_df is None:
                evc_standard_method_df = protein_df
            else:
                evc_standard_method_df = evc_standard_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        evc_standard_method_df.to_csv(evc_standard_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = evc_standard_method_df
    else:
        small_comparison_df = small_comparison_df.append(evc_standard_method_df)
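
The cells in this notebook grow each results table with repeated `DataFrame.append` calls. `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer installations these cells would need `pd.concat` instead. The following is a minimal sketch of one way to do that; the helper name `combine_method_frames` and the toy frames (including the `Score` column) are assumptions made for illustration. Passing `sort=False` keeps columns in order of first appearance, and `ignore_index=True` renumbers the rows, which should not affect the saved files because they are written with `index=False`.

```python
import pandas as pd


def combine_method_frames(frames):
    """Concatenate per-protein or per-method tables, skipping missing ones."""
    present = [df for df in frames if df is not None]
    if not present:
        return None
    return pd.concat(present, ignore_index=True, sort=False)


# Toy frames (made up for illustration) with partially different columns.
toy_a = pd.DataFrame({'Protein': ['p1'], 'Method': ['DCA'], 'Score': [0.7]})
toy_b = pd.DataFrame({'Protein': ['p1'], 'Method': ['EVC Standard'], 'Total Time': [1.5]})
combined = combine_method_frames([toy_a, None, toy_b])
print(combined)
```

Collecting the per-protein frames in a list and concatenating once after the loop also avoids copying the accumulated table on every iteration.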

Scoring the covariation of the proteins using the EVCouplings method mean field protocol.

In [12]:
if not os.path.isfile(small_comparison_fn):
    evc_mf_out_dir = os.path.join(small_set_out_dir, 'EVCouplings_Mean_Field')
    if not os.path.isdir(evc_mf_out_dir):
        os.makedirs(evc_mf_out_dir)
    evc_mf_method_fn = os.path.join(evc_mf_out_dir, 'EVCouplings_Mean_Field_Method_Data.csv')
    if os.path.isfile(evc_mf_method_fn):
        evc_mf_method_df = pd.read_csv(evc_mf_method_fn, sep='\t', header=0, index_col=False)
    else:
        evc_mf_method_df = None
        counts = {'success':0, 'value': 0, 'attribute':0}
        for p_id in generator.protein_data:
            print('Attempting to calculate EV couplings mean field protocol covariance for: {}'.format(p_id))
            try:
                protein_dir = os.path.join(evc_mf_out_dir, p_id)
                if not os.path.isdir(protein_dir):
                    os.makedirs(protein_dir)
                curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                        polymer_type='Protein')
                curr_aln.import_alignment()
                curr_evc = EVCouplingsWrapper(alignment=curr_aln, protocol='mean_field')
                curr_evc.calculate_scores(out_dir=protein_dir, cores=10, delete_files=True)
                # Compute statistics for the final scores of the EVCouplings mean field model
                protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
                    predictor=curr_evc, verbosity=2, out_dir=protein_dir, dist='CB', biased_w2_ave=None,
                    unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2, rank_type='max',
                    file_prefix='EVC_Mean_Field_Scores_', plots=True)
                # Score Prediction Clustering
                _, evc_mf_coverage = compute_rank_and_coverage(
                    seq_length=curr_evc.alignment.seq_length, scores=curr_evc.scores, pos_size=2, rank_type='max')
                z_score_fn = os.path.join(protein_dir, 'EVC_Mean_Field_Scores_Dist-Any_{}_ZScores.tsv')
                z_score_plot_fn = os.path.join(protein_dir, 'EVC_Mean_Field_Scores_Dist-Any_{}_ZScores.png')
                z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                    1.0 - evc_mf_coverage, bias=True, file_path=z_score_fn.format('Biased'),
                    w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
                if protein_scorers[p_id]['biased_w2_ave'] is None:
                    protein_scorers[p_id]['biased_w2_ave'] = biased_w2_ave
                biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
                protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
                protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
                plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
                z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
                    1.0 - evc_mf_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
                    w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
                if protein_scorers[p_id]['unbiased_w2_ave'] is None:
                    protein_scorers[p_id]['unbiased_w2_ave'] = unbiased_w2_ave
                unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
                protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
                protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
                protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
                protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
                plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
                # Execution times are not recorded for this method, so the time columns are left empty
                protein_df['Init Time'] = None
                protein_df['Import Time'] = None
                protein_df['Dist Tree Time'] = None
                protein_df['Trace Time'] = None
                protein_df['Total Time'] = None
                # Record static data for this protein
                protein_df['Protein'] = p_id
                protein_df['Method'] = 'EVC Mean Field'
                protein_df['Alignment Size'] = summary['Filtered_Alignment'].values[summary['Protein_ID'] == p_id][0]
                protein_df.to_csv(protein_fn, sep='\t', header=True, index=False, columns=output_columns)
                print('Successfully computed EV couplings covariance for: {}'.format(p_id))
                counts['success'] += 1
            except ValueError:
                print('Could not compute EV couplings covariance for: {} with seq_length: {} and size: {}'.format(
                    p_id, curr_aln.seq_length, curr_aln.size))
                counts['value'] += 1
                continue
            except AttributeError:
                print('Could not compute EV couplings covariance for: {} no alignment'.format(p_id))
                counts['attribute'] += 1
                continue
            if evc_mf_method_df is None:
                evc_mf_method_df = protein_df
            else:
                evc_mf_method_df = evc_mf_method_df.append(protein_df)
        print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                             counts['attribute']))
        evc_mf_method_df.to_csv(evc_mf_method_fn, sep='\t', header=True, index=False, columns=output_columns)
    if small_comparison_df is None:
        small_comparison_df = evc_mf_method_df
    else:
        small_comparison_df = small_comparison_df.append(evc_mf_method_df)

In [13]:
# Write out final comparison data so it can be loaded later for generating figures.
if not os.path.isfile(small_comparison_fn):
    small_comparison_df['Protein Length'] = small_comparison_df['Protein'].apply(lambda x: generator.protein_data[x]['Length'])
    small_comparison_df.to_csv(small_comparison_fn, sep='\t', header=True, index=False, columns=output_columns)

# Comparing Execution Time for ET-MIp and cET-MIp
Computing the trace constrained to a subset of the top levels of the phylogenetic tree (cET-MIp) should take significantly less time than computing the trace for the full tree (ET-MIp). Here we evaluate whether that is in fact the case.

## Data Cleaning
At least one protein in this data set has a very small alignment and could not be evaluated by cET-MIp because the tree was too small to reach the levels set for other proteins. Here we remove those proteins.

In addition, since this analysis focuses on times and there are many other types of data (some of which cause redundancies in the time data), we will use this opportunity to subset the data and drop duplicates.

Finally, execution time is sometimes easier to interpret in minutes or hours than in the originally reported seconds, so we add columns for these units as well.
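The cleaning steps described above can be sketched on toy data. This is a minimal, standard-library illustration and not the notebook's own code: the protein identifiers and the helper names `proteins_with_all_methods` and `seconds_to_minutes_and_hours` are hypothetical.

```python
# Toy (protein, method) records; the protein identifiers are hypothetical.
records = [
    ('prot_a', 'ET-MIp'), ('prot_a', 'cET-MIp'),
    ('prot_b', 'ET-MIp'),  # cET-MIp could not be run for prot_b
    ('prot_c', 'ET-MIp'), ('prot_c', 'cET-MIp'),
]


def proteins_with_all_methods(records):
    """Return proteins scored by the largest number of distinct methods."""
    methods_by_protein = {}
    for protein, method in records:
        methods_by_protein.setdefault(protein, set()).add(method)
    method_max = max(len(methods) for methods in methods_by_protein.values())
    return sorted(p for p, methods in methods_by_protein.items() if len(methods) == method_max)


def seconds_to_minutes_and_hours(seconds):
    """Convert a time in seconds to (minutes, hours)."""
    minutes = seconds / 60.0
    return minutes, minutes / 60.0


kept = proteins_with_all_methods(records)
print(kept)
print(seconds_to_minutes_and_hours(5400.0))
```

Keeping only proteins with the maximum method count mirrors the idea of comparing methods on the same set of proteins, so that a failed run does not bias one method's distribution.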

In [14]:
protein_method_groups = small_comparison_df[['Protein', 'Method']].drop_duplicates().groupby('Protein').count()
method_max = protein_method_groups['Method'].max()
proteins_to_keep = protein_method_groups.index[protein_method_groups['Method'] == method_max]
comparable_method_proteins = small_comparison_df[small_comparison_df['Protein'].isin(proteins_to_keep)]
time_columns = ['Protein', 'Protein Length', 'Alignment Size', 'Method', 'Init Time', 'Import Time', 'Dist Tree Time',
                'Trace Time', 'Total Time']
time_subset_df = comparable_method_proteins.loc[comparable_method_proteins['Method'].isin(['ET-MIp', 'cET-MIp']), time_columns]
time_subset_df['Total Time (min)'] = time_subset_df['Total Time'].apply(lambda x: x / 60.0)
time_subset_df['Total Time (hr)'] = time_subset_df['Total Time (min)'].apply(lambda x: x / 60.0)
time_columns += ['Total Time (min)', 'Total Time (hr)']
time_subset_df = time_subset_df.drop_duplicates(subset=None, inplace=False, keep='first')
time_subset_df.to_csv(os.path.join(small_set_out_dir, 'Small_Time_Comaprison_Data.csv'), sep='\t', header=True, index=False,
                      columns=time_columns)

## Time Comparison
Now that only comparable proteins are present in the data, we compare the runtime of individual proteins by method, ordered by their length and by the size of their alignments.

In [15]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Displayed by protein length order
protein_length_order = comparable_method_proteins.sort_values('Protein Length')['Protein'].unique()
protein_length_time_plot = sns.barplot(x='Protein', y='Total Time', hue='Method', order=protein_length_order,
                                       hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_length_time_plot.set_xticklabels(protein_length_time_plot.get_xticklabels(), rotation=90)
protein_length_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Length_Time_Comparison_Sec.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_length_time_plot = sns.barplot(x='Protein', y='Total Time (min)', hue='Method', order=protein_length_order,
                                       hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_length_time_plot.set_xticklabels(protein_length_time_plot.get_xticklabels(), rotation=90)
protein_length_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Length_Time_Comparison_Min.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_length_time_plot = sns.barplot(x='Protein', y='Total Time (hr)', hue='Method', order=protein_length_order,
                                       hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_length_time_plot.set_xticklabels(protein_length_time_plot.get_xticklabels(), rotation=90)
protein_length_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Length_Time_Comparison_Hr.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Displayed by alignment size order
protein_alignment_order = comparable_method_proteins.sort_values('Alignment Size')['Protein'].unique()
protein_alignment_time_plot = sns.barplot(x='Protein', y='Total Time', hue='Method', order=protein_alignment_order,
                                          hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_alignment_time_plot.set_xticklabels(protein_alignment_time_plot.get_xticklabels(), rotation=90)
protein_alignment_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Alignment_Time_Comparison_Sec.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_alignment_time_plot = sns.barplot(x='Protein', y='Total Time (min)', hue='Method', order=protein_alignment_order,
                                          hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_alignment_time_plot.set_xticklabels(protein_alignment_time_plot.get_xticklabels(), rotation=90)
protein_alignment_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Alignment_Time_Comparison_Min.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
protein_alignment_time_plot = sns.barplot(x='Protein', y='Total Time (hr)', hue='Method', order=protein_alignment_order,
                                          hue_order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_alignment_time_plot.set_xticklabels(protein_alignment_time_plot.get_xticklabels(), rotation=90)
protein_alignment_time_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Alignment_Time_Comparison_Hr.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()

Overall comparison of time by method.

In [16]:
# Comparison of ET-MIp and cET-MIp total computation time (sec).
protein_method_comp_plot = sns.boxplot(x='Method', y='Total Time', order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_method_comp_plot.set_xticklabels(protein_method_comp_plot.get_xticklabels(), rotation=90)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_Time_Comparison_Sec.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Comparison of ET-MIp and cET-MIp total computation time (min).
protein_method_comp_plot = sns.boxplot(x='Method', y='Total Time (min)', order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_method_comp_plot.set_xticklabels(protein_method_comp_plot.get_xticklabels(), rotation=90)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_Time_Comparison_Min.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Comparison of ET-MIp and cET-MIp total computation time (hr).
protein_method_comp_plot = sns.boxplot(x='Method', y='Total Time (hr)', order=['ET-MIp', 'cET-MIp'], data=time_subset_df)
protein_method_comp_plot.set_xticklabels(protein_method_comp_plot.get_xticklabels(), rotation=90)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_Time_Comparison_Hr.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Statistical comparison
from scipy.stats import wilcoxon
et_mip_sub_df = time_subset_df[time_subset_df['Method'] == 'ET-MIp']
cet_mip_sub_df = time_subset_df[time_subset_df['Method'] == 'cET-MIp']
sec_stat, sec_p_val = wilcoxon(x=et_mip_sub_df['Total Time'], y=cet_mip_sub_df['Total Time'], zero_method='wilcox')
min_stat, min_p_val = wilcoxon(x=et_mip_sub_df['Total Time (min)'], y=cet_mip_sub_df['Total Time (min)'], zero_method='wilcox')
hr_stat, hr_p_val = wilcoxon(x=et_mip_sub_df['Total Time (hr)'], y=cet_mip_sub_df['Total Time (hr)'], zero_method='wilcox')
time_statistics = {'Time Unit': ['sec', 'min', 'hr'], 'Statistic': [sec_stat, min_stat, hr_stat], 'P-Value': [sec_p_val, min_p_val, hr_p_val]}
pd.DataFrame(time_statistics).to_csv(os.path.join(small_set_out_dir, 'Small_Time_Comaprison_Statistics.csv'), sep='\t', header=True,
                                     index=False, columns=['Time Unit', 'Statistic', 'P-Value'])
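The Wilcoxon signed-rank test is a paired test, so the two samples passed to it need to be in the same protein order. The following is a minimal sketch, using only the standard library and toy values, of one way to align the two methods by protein before testing. The helper `paired_values` and the protein identifiers are hypothetical and are not part of the notebook.

```python
def paired_values(rows, method_1, method_2):
    """Align values from two methods by protein.

    rows: iterable of (protein, method, value) tuples.
    Returns (proteins, x, y) where x[i] and y[i] belong to proteins[i].
    Proteins missing from either method are dropped.
    """
    by_method = {method_1: {}, method_2: {}}
    for protein, method, value in rows:
        if method in by_method:
            by_method[method][protein] = value
    shared = sorted(set(by_method[method_1]) & set(by_method[method_2]))
    x = [by_method[method_1][p] for p in shared]
    y = [by_method[method_2][p] for p in shared]
    return shared, x, y


# Toy timing data in seconds, deliberately out of order.
rows = [
    ('prot_b', 'cET-MIp', 40.0),
    ('prot_a', 'ET-MIp', 120.0),
    ('prot_a', 'cET-MIp', 30.0),
    ('prot_b', 'ET-MIp', 300.0),
    ('prot_c', 'ET-MIp', 50.0),  # no cET-MIp value, so prot_c is dropped
]
shared, x, y = paired_values(rows, 'ET-MIp', 'cET-MIp')
print(shared, x, y)
```

The aligned lists `x` and `y` could then be passed to `wilcoxon(x=x, y=y, zero_method='wilcox')` as in the cell above. Because the test ranks the paired differences, rescaling seconds to minutes or hours should not change the statistic or the p-value.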

## Method Comparison
We now begin comparing methods based on their ability to predict the structural contacts in the proteins in this test set. Two considerations shape this comparison: the sequence separation of the predicted residue pairs and the metric used to compare the methods.

### Data Cleaning
These data need an additional cleaning step beyond what was performed for the timing comparison. Since there are multiple categories of sequence separation, some proteins may not have any True Positive contacts in a given category, which leaves the scoring for that protein incomplete. We remove all such proteins from the comparison, performing the cleanup separately for each metric of success. Another factor that necessitates this kind of cleaning is the assessment of the top K or best L/K predictions for a protein, which for poor predictions may not include any True Positives.
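To illustrate why some scores end up missing, here is a minimal sketch of precision, recall, and F1 for the top K predictions. It assumes the standard textbook definitions and returns `None` when a value is undefined. The function `metrics_at_k` is hypothetical, and the scorer used in this notebook may handle these cases differently.

```python
def metrics_at_k(ranked_labels, k):
    """Compute (precision, recall, f1) for the top k predictions.

    ranked_labels: list of booleans ordered from best to worst prediction,
    where True marks a pair that is a real structural contact.
    Undefined values are returned as None.
    """
    top = ranked_labels[:k]
    true_positives = sum(top)
    total_positives = sum(ranked_labels)
    precision = true_positives / len(top) if top else None
    recall = true_positives / total_positives if total_positives else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


good = metrics_at_k([True, False, True, False, False, True], k=3)
no_contacts = metrics_at_k([False, False], k=1)       # category with no contacts
poor_ranking = metrics_at_k([False, False, True], k=2)  # no True Positive in the top k
print(good, no_contacts, poor_ranking)
```

For the best L/K predictions, `k` would be derived from the sequence length, for example `k = L // K`. A category with no contacts leaves recall undefined, and a top K without any True Positive leaves F1 undefined, which is the kind of incomplete scoring the cleaning step removes.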

### Sequence Separation
One important consideration, both for the difficulty of a prediction and for the interest in it, is the separation along the sequence between the residues for which coupling was predicted. As has been documented in the literature, especially in the CASP competitions, there are several categories of prediction:
* Neighbors (1 - 5 residues apart) - This is the least interesting category of predictions. It is highly likely that residues this close together will show covariance signal. Predicting that two residues this close together are in contact is trivial and uninformative.
* Short (6 - 12 residues apart) - This is also not a very interesting type of prediction. Residues this close in proximity can be more easily modeled by algorithms which focus on 2D protein structure modeling (identifying beta sheets, alpha helices, etc.).
* Medium (13 - 23 residues apart) - This is a more interesting type of prediction. The residues in this range of separation are on the edge of the 2D protein structure prediction range.
* Long (24 or more residues apart) - The most interesting category of predictions. Residues this far apart are not easily modeled by 2D protein structure modeling systems. They are also very useful for 3D and 4D protein structure prediction because they provide constraints on potential protein folds (similar to NMR data), which makes protein folding a more tractable problem for modelers.
* Any/All - All categories can be considered at once. This provides a summary value, but it is often skewed by one particularly good category of predictions.
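The categories above can be expressed as a small lookup. This is a sketch only: the function `separation_category` is hypothetical, and it assumes that a separation of exactly 24 residues belongs to the long category so that the medium and long bins do not overlap.

```python
def separation_category(i, j):
    """Return the sequence separation category for residue positions i and j."""
    separation = abs(i - j)
    if separation < 1:
        raise ValueError('A residue cannot be paired with itself.')
    if separation <= 5:
        return 'Neighbors'
    if separation <= 12:
        return 'Short'
    if separation <= 23:  # assumption: 24 is counted as Long, not Medium
        return 'Medium'
    return 'Long'


for pair in [(10, 12), (1, 7), (1, 14), (1, 25)]:
    print(pair, separation_category(*pair))
```

Every pair also counts toward the Any/All category, so that category is a union of the four bins rather than a separate branch.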

### Metrics of Success
* AUROC - The area under the receiver operating characteristic curve, which plots the True Positive Rate against the False Positive Rate as the score threshold varies. It summarizes how well a method ranks contacts above non-contacts. Its interpretation can be strongly influenced by the class imbalance present when predicting structural contacts, since there are many fewer contacts than non-contacts. A residue pair is a True Positive if the C-beta atoms of the two amino acids are within 8.0 Angstroms of one another (as is done in the CASP competitions).
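As a rough illustration of this metric, the sketch below labels toy residue pairs as contacts using the 8.0 Angstrom cutoff and computes AUROC as the fraction of (contact, non-contact) pairs that the scores rank correctly, which is equivalent to the area under the ROC curve. The distances, scores, and the function `auroc` are hypothetical, and treating a distance of exactly 8.0 as a contact is an assumption.

```python
def auroc(scores, labels):
    """AUROC as the probability that a contact outscores a non-contact.

    Ties count as half a win. Returns None if either class is empty.
    """
    positives = [s for s, is_contact in zip(scores, labels) if is_contact]
    negatives = [s for s, is_contact in zip(scores, labels) if not is_contact]
    if not positives or not negatives:
        return None
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy C-beta distances (Angstroms) and covariation scores for six residue pairs.
distances = [5.2, 7.9, 12.4, 20.1, 6.5, 15.0]
scores = [0.9, 0.4, 0.5, 0.1, 0.8, 0.3]
labels = [d <= 8.0 for d in distances]  # assumption: the cutoff is inclusive
toy_auroc = auroc(scores, labels)
print(toy_auroc)
```

The `None` return for an empty class corresponds to the proteins that are dropped during cleaning because a separation category contains no contacts.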

In [19]:
auroc_columns = ['Protein', 'Method', 'Sequence_Separation', 'AUROC']
protein_auroc_groups = comparable_method_proteins[auroc_columns].drop_duplicates().groupby('Protein')['AUROC'].apply(
    lambda x: not x.isnull().any())
complete_proteins = protein_auroc_groups.index[protein_auroc_groups.values]
comparable_auroc_proteins = small_comparison_df[small_comparison_df['Protein'].isin(complete_proteins)]
auroc_protein_length_order = [x for x in protein_length_order if x in complete_proteins]
auroc_protein_alignment_order = [x for x in protein_alignment_order if x in complete_proteins]
auroc_subset_df = comparable_auroc_proteins.loc[:, auroc_columns].drop_duplicates()
# Write out the AUROC data used in this comparison
auroc_subset_df.to_csv(os.path.join(small_set_out_dir, 'Small_AUROC_Comaprison_Data.csv'), sep='\t', header=True, index=False,
                       columns=auroc_columns)
# Plot the methods vs AUROC per protein ordered by protein length
protein_order_auroc_plot = sns.catplot(x="Protein", y="AUROC", hue="Method", row="Sequence_Separation", data=auroc_subset_df, kind="bar",
                                       ci=None, order=auroc_protein_length_order, hue_order=method_order, legend=True, legend_out=True)
protein_order_auroc_plot.set_xticklabels(auroc_protein_length_order, rotation=90)
protein_order_auroc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUROC_Comparison_Protein_Length_Order.png'),
                                 bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUROC per protein ordered by protein alignment size
alignment_order_auroc_plot = sns.catplot(x="Protein", y="AUROC", hue="Method", row="Sequence_Separation", data=auroc_subset_df, kind="bar",
                                       ci=None, order=auroc_protein_alignment_order, hue_order=method_order, legend=True, legend_out=True)
alignment_order_auroc_plot.set_xticklabels(auroc_protein_alignment_order, rotation=90)
alignment_order_auroc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUROC_Comparison_Alignment_Size_Order.png'),
                                   bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUROC grouped together to see overall trends
overall_auroc_plot = sns.boxplot(x="Sequence_Separation", y="AUROC", hue="Method", data=auroc_subset_df,
                                 order=sequence_separation_order, hue_order=method_order)
overall_auroc_plot.set_xticklabels(sequence_separation_order, rotation=90)
overall_auroc_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUROC_Comparison.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Compute statistics comparing methods at each sequence separation
auroc_statistics = {'Sequence Separation': [], 'Method 1': [], 'Method 2': [], 'Statistic': [], 'P-Value': []}
for sep in sequence_separation_order:
    sep_auroc_subset_df = auroc_subset_df.loc[auroc_subset_df['Sequence_Separation'] == sep, :]
    for i in range(len(method_order)):
        m1_sep_auroc_subset_df = sep_auroc_subset_df.loc[sep_auroc_subset_df['Method'] == method_order[i], :]
        for j in range(i + 1, len(method_order)):
            m2_sep_auroc_subset_df = sep_auroc_subset_df.loc[sep_auroc_subset_df['Method'] == method_order[j], :]
            stat, p_val = wilcoxon(x=m1_sep_auroc_subset_df['AUROC'], y=m2_sep_auroc_subset_df['AUROC'], zero_method='wilcox')
            auroc_statistics['Sequence Separation'].append(sep)
            auroc_statistics['Method 1'].append(method_order[i])
            auroc_statistics['Method 2'].append(method_order[j])
            auroc_statistics['Statistic'].append(stat)
            auroc_statistics['P-Value'].append(p_val)
pd.DataFrame(auroc_statistics).to_csv(os.path.join(small_set_out_dir, 'Small_AUROC_Comaprison_Statistics.csv'), sep='\t', header=True,
                                      index=False, columns=['Sequence Separation', 'Method 1', 'Method 2', 'Statistic', 'P-Value'])

### Metrics of Success (Continued)
* AUPRC - The area under the precision-recall curve, which plots Precision against Recall as the score threshold varies. It summarizes how precise the predictions remain as more of the true contacts are recovered. This metric is less strongly influenced by the class imbalance present when predicting structural contacts, since there are many fewer contacts than non-contacts. A residue pair is a True Positive if the C-beta atoms of the two amino acids are within 8.0 Angstroms of one another (as is done in the CASP competitions).
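One common estimator of the area under the precision-recall curve is average precision, sketched below on toy data. This is not necessarily how the scorer in this notebook integrates the curve. The function `average_precision` and the data are hypothetical, and the sketch assumes there are no tied scores.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision at each rank where a contact is found.

    Assumes no tied scores. Returns None if there are no contacts.
    """
    order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    total_positives = sum(1 for is_contact in labels if is_contact)
    if total_positives == 0:
        return None
    true_positives = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / total_positives


# Toy scores and contact labels for six residue pairs.
scores = [0.9, 0.4, 0.5, 0.1, 0.8, 0.3]
labels = [True, True, False, False, True, False]
toy_ap = average_precision(scores, labels)
print(toy_ap)
```

Because precision is computed only over the pairs ranked so far, adding many low-scoring non-contacts to the bottom of the list does not change this value, which is one way to see why the metric is less sensitive to the excess of non-contacts.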

In [20]:
auprc_columns = ['Protein', 'Method', 'Sequence_Separation', 'AUPRC']
protein_auprc_groups = comparable_method_proteins[auprc_columns].drop_duplicates().groupby('Protein')['AUPRC'].apply(
    lambda x: not x.isnull().any())
complete_proteins = protein_auprc_groups.index[protein_auprc_groups.values]
comparable_auprc_proteins = small_comparison_df[small_comparison_df['Protein'].isin(complete_proteins)]
auprc_protein_length_order = [x for x in protein_length_order if x in complete_proteins]
auprc_protein_alignment_order = [x for x in protein_alignment_order if x in complete_proteins]
auprc_subset_df = comparable_auprc_proteins.loc[:, auprc_columns].drop_duplicates()
# Write out the AUPRC data used in this comparison
auprc_subset_df.to_csv(os.path.join(small_set_out_dir, 'Small_AUPRC_Comaprison_Data.csv'), sep='\t', header=True, index=False,
                       columns=auprc_columns)
# Plot the methods vs AUPRC per protein ordered by protein length
protein_order_auprc_plot = sns.catplot(x="Protein", y="AUPRC", hue="Method", row="Sequence_Separation", data=auprc_subset_df, kind="bar",
                                       ci=None, order=auprc_protein_length_order, hue_order=method_order, legend=True, legend_out=True)
protein_order_auprc_plot.set_xticklabels(auprc_protein_length_order, rotation=90)
protein_order_auprc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUPRC_Comparison_Protein_Length_Order.png'),
                                 bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUPRC per protein ordered by protein alignment size
alignment_order_auprc_plot = sns.catplot(x="Protein", y="AUPRC", hue="Method", row="Sequence_Separation", data=auprc_subset_df, kind="bar",
                                       ci=None, order=auprc_protein_alignment_order, hue_order=method_order, legend=True, legend_out=True)
alignment_order_auprc_plot.set_xticklabels(auprc_protein_alignment_order, rotation=90)
alignment_order_auprc_plot.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUPRC_Comparison_Alignment_Size_Order.png'),
                                   bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Plot the methods vs AUPRC grouped together to see overall trends
overall_auprc_plot = sns.boxplot(x="Sequence_Separation", y="AUPRC", hue="Method", data=auprc_subset_df,
                                 order=sequence_separation_order, hue_order=method_order)
overall_auprc_plot.set_xticklabels(sequence_separation_order, rotation=90)
overall_auprc_plot.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
plt.savefig(os.path.join(small_set_out_dir, 'Protein_Method_AUPRC_Comparison.png'), bbox_inches='tight', transparent=True, dpi=300)
plt.close()
# Compute statistics comparing methods at each sequence separation
auprc_statistics = {'Sequence Separation': [], 'Method 1': [], 'Method 2': [], 'Statistic': [], 'P-Value': []}
for sep in sequence_separation_order:
    sep_auprc_subset_df = auprc_subset_df.loc[auprc_subset_df['Sequence_Separation'] == sep, :]
    for i in range(len(method_order)):
        m1_sep_auprc_subset_df = sep_auprc_subset_df.loc[sep_auprc_subset_df['Method'] == method_order[i], :]
        for j in range(i + 1, len(method_order)):
            m2_sep_auprc_subset_df = sep_auprc_subset_df.loc[sep_auprc_subset_df['Method'] == method_order[j], :]
            stat, p_val = wilcoxon(x=m1_sep_auprc_subset_df['AUPRC'], y=m2_sep_auprc_subset_df['AUPRC'], zero_method='wilcox')
            auprc_statistics['Sequence Separation'].append(sep)
            auprc_statistics['Method 1'].append(method_order[i])
            auprc_statistics['Method 2'].append(method_order[j])
            auprc_statistics['Statistic'].append(stat)
            auprc_statistics['P-Value'].append(p_val)
pd.DataFrame(auprc_statistics).to_csv(os.path.join(small_set_out_dir, 'Small_AUPRC_Comaprison_Statistics.csv'), sep='\t', header=True,
                                      index=False, columns=['Sequence Separation', 'Method 1', 'Method 2', 'Statistic', 'P-Value'])

* Precision at K - 

* Recall at K - 

* F1 at K - 

* Structural Cluster Weighting Z-Score - 