# Small Test Set #
## Goal ##
The goal of this test set is to perform proof-of-concept testing on a small number of proteins with a wide range of sizes and numbers of available homologs, orthologs, and paralogs. Doing so should make it possible to find the best parameterization for this tool and to identify its strengths and weaknesses using various measurements as end points.
## Warning ##
Before attempting to use this notebook, make sure that your .env file has been properly set up to reflect the correct locations of the command line tools, files, and directories needed for execution.
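
As a quick sanity check once the .env file has been loaded, the sketch below reports which of the variables read by this notebook (`PROJECT_PATH`, `INPUT_PATH`, `OUTPUT_PATH`) are unset or empty. The helper name `missing_env_vars` is hypothetical, and the variables that point to command line tools are not covered because their names are not given here.

```python
import os

# Variable names read by the cells in this notebook.
REQUIRED_VARS = ('PROJECT_PATH', 'INPUT_PATH', 'OUTPUT_PATH')


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Example with a stand-in mapping instead of the real environment.
example_env = {'PROJECT_PATH': '/path/to/project', 'INPUT_PATH': ''}
print(missing_env_vars(example_env))  # ['INPUT_PATH', 'OUTPUT_PATH']
```

Calling `missing_env_vars()` with no arguments checks the real process environment.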
### Initial Import ###
This first cell performs the necessary imports required to begin this notebook.

In [1]:
from dotenv import find_dotenv, load_dotenv
try:
    dotenv_path = find_dotenv(raise_error_if_not_found=True)
except IOError:
    dotenv_path = find_dotenv(raise_error_if_not_found=True, usecwd=True)
load_dotenv(dotenv_path)
import os
import sys
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src'))
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src', 'SupportingClasses'))
input_dir = os.environ.get('INPUT_PATH')

## Data Set Construction ##
The first task required to test the data set is to download the required data and construct the input files needed for all downstream analyses.
In this case that means:
* Downloading PDB files for the proteins in our small test set.
* Extracting a query sequence from each PDB file.
* Searching for paralogs, homologs, and orthologs in a custom BLAST database built by filtering the Uniref90 database.
* Filtering the hits from the BLAST search to meet minimum and maximum length requirements, as well as minimum and maximum identity requirements.
* Building alignments using CLUSTALW in both the fasta and msf formats, since some of the tools used for comparison need different formats.
* Filtering the alignment by maximum identity between sequences.
* Re-aligning the filtered sequences using CLUSTALW.

This is all handled by the DataSetGenerator class found in the src/SupportingClasses folder.
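
The length and identity filtering in the list above can be pictured with the following minimal sketch. It is not the DataSetGenerator implementation: the function names and all four thresholds are hypothetical placeholders, since the actual limits are not stated here, and the hit is assumed to be already aligned to the query (same string length, `-` for gaps).

```python
def fraction_identity(query, hit):
    """Fraction of positions, ungapped in both sequences, with matching residues."""
    pairs = [(q, h) for q, h in zip(query, hit) if q != '-' and h != '-']
    if not pairs:
        return 0.0
    return sum(q == h for q, h in pairs) / len(pairs)


def keep_hit(query, hit, min_fraction=0.7, max_fraction=1.5,
             min_identity=0.4, max_identity=0.98):
    """Apply hypothetical length and identity limits to one aligned hit."""
    query_len = len(query.replace('-', ''))
    hit_len = len(hit.replace('-', ''))
    length_ok = min_fraction * query_len <= hit_len <= max_fraction * query_len
    identity_ok = min_identity <= fraction_identity(query, hit) <= max_identity
    return length_ok and identity_ok


query = 'MKTAYIAKQR'
hits = {
    'near_copy': 'MKTAYIAKQR',  # identical to the query, above the maximum identity
    'homolog': 'MKSAYLAKQR',    # 8 of 10 positions match
    'fragment': 'MKT-------',   # too short relative to the query
    'distant': 'AAAAYAAAAA',    # only 3 of 10 positions match
}
print([name for name, seq in hits.items() if keep_hit(query, seq)])  # ['homolog']
```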

In [2]:
from time import time
from DataSetGenerator import DataSetGenerator
protein_list_dir = os.path.join(input_dir, 'ProteinLists')
if not os.path.isdir(protein_list_dir):
    os.makedirs(protein_list_dir)
small_list_fn = os.path.join(protein_list_dir, 'SmallDataSet.txt')
if not os.path.isfile(small_list_fn):
    proteins_of_interest = ['2ysdA', '1c17A', '3tnuA', '7hvpA', '135lA', '206lA', '2werA', '1bolA', '3q05A', '1axbA',
                            '2rh1A', '1hckA', '3b6vA', '2z0eA', '1jwlA', '1a26A', '1c0kA', '4lliA', '4ycuA', '2iopA',
                            '2zxeA', '2b59B', '1h1vG']
    with open(small_list_fn, 'w') as small_list_handle:
        for p_id in proteins_of_interest:
            small_list_handle.write('{}\n'.format(p_id))
generator = DataSetGenerator(input_dir)
start = time()
summary = generator.build_pdb_alignment_dataset(protein_list_fn=os.path.basename(small_list_fn), num_threads=10,
                                                database='customuniref90.fasta', max_target_seqs=2500, remote=False,
                                                verbose=False)
summary['Accession'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Accession'])
summary['Length'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Length'])
summary['Total_Size'] = summary.apply(lambda x: float(x['Length']) * float(x['Filtered_Alignment']), axis=1)
summary.sort_values(by=['Filtered_Alignment', 'Length'], axis=0, inplace=True)
summary_columns = ['Protein_ID', 'Accession', 'BLAST_Hits', 'Filtered_BLAST', 'Filtered_Alignment', 'Length',
                   'Total_Size']
print(summary[summary_columns])
end = time()
print('It took {} min to generate the data set.'.format((end - start) / 60.0))
summary.to_csv(os.path.join(input_dir, 'small_data_set_summary.tsv'), sep='\t', index=False, header=True,
               columns=summary_columns)

Importing protein list
Downloading structures and parsing in query sequences
Unique Sequences Found: 23!
BLASTing query sequences
Filtering BLAST hits, aligning, filtering by identity, and re-aligning
   Protein_ID     Accession  BLAST_Hits  Filtered_BLAST  Filtered_Alignment  \
18       2b59    CIPA_CLOTM        1606               4                   4   
7        206l      LYS_BPT4        1039             131                  92   
16       1c0k    OXDA_RHOTO        2500              51                  51   
13       1bol    RNRH_RHINI        2500             131                 127   
17       7hvp     POL_HV1A2        2500              33                  33   
12       1c17    ATPL_ECOLI        1671             850                 813   
2        1jwl    LACI_ECOLI        2500             228                 226   
1        3q05     P53_HUMAN         932             306                 210   
8        135l    LYSC_MELGA        1913             853                 818   
4        

Create a location to store the output of this method comparison.

In [3]:
output_dir = os.environ.get('OUTPUT_PATH')
small_set_out_dir = os.path.join(output_dir, 'SmallTestSet')
if not os.path.isdir(small_set_out_dir):
    os.makedirs(small_set_out_dir)

## Setting Up Scoring For Each Method
To reduce memory load during prediction and evaluation, the scoring objects needed to compute the metrics used to compare methods are created ahead of time so that they are available to each method when it computes its predictions for a given protein. This ensures that results do not need to be kept in memory while waiting for all other results to be computed; only the metrics measured for each method are recorded.

In [None]:
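# numpy and pandas are used by this cell and by the method comparison cells that follow
import numpy as np
import pandas as pd
from SeqAlignment import SeqAlignment
from PDBReference import PDBReference
from ContactScorer import ContactScorer, plot_z_scores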
protein_order = list(summary['Protein_ID'])
method_order = ['DCA', 'EV Couplings', 'EV Couplings MF', 'ET-MIp', 'cET-MIp']
sequence_separation_order = ['Any', 'Neighbors', 'Short', 'Medium', 'Long']
protein_scorers = {}
for p_id in summary['Protein_ID']:
    protein_scorers[p_id] = {}
    # Import alignment and remove gaps
    full_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id)
    full_aln.import_alignment()
    non_gap_aln = full_aln.remove_gaps()
    # Import structure
    pdb_structure = PDBReference(pdb_file=generator.protein_data[p_id]['PDB'])
    pdb_structure.import_pdb(structure_id=p_id)
    protein_scorers[p_id]['Structure'] = pdb_structure
    # Initialize Beta Carbon distance scorer
    contact_scorer_cb = ContactScorer(query=p_id, seq_alignment=non_gap_aln,
                                      pdb_reference=pdb_structure, cutoff=8.0)
    contact_scorer_cb.best_chain = generator.protein_data[p_id]['Chain']
    contact_scorer_cb.fit()
    contact_scorer_cb.measure_distance(method='CB')
    protein_scorers[p_id]['Scorer_CB'] = contact_scorer_cb
    # Initialize distance scorer minimizing distance between any atoms
    contact_scorer_any = ContactScorer(query=p_id, seq_alignment=non_gap_aln,
                                       pdb_reference=pdb_structure, cutoff=8.0)
    contact_scorer_any.best_chain = generator.protein_data[p_id]['Chain']
    contact_scorer_any.fit()
    contact_scorer_any.measure_distance(method='Any')
    protein_scorers[p_id]['Scorer_Any'] = contact_scorer_any
    # Initialize z-scoring subproblems
    protein_scorers[p_id]['biased_w2_ave'] = None
    protein_scorers[p_id]['unbiased_w2_ave'] = None
output_columns = ['Protein', 'Alignment Size', 'Method', 'Distance', 'Init Time', 'Import Time', 'Dist Tree Time', 'Trace Time', 'Total Time', 
                  'Sequence_Separation', 'AUROC', 'AUPRC', 'AUTPRFDRC',
                  'Top K Predictions', 'Precision', 'Recall', 'F1 Score',
                  'Top 10% Biased Z-Score', 'Top 20% Biased Z-Score', 'Top 30% Biased Z-Score', 'Max Biased Z-Score', 'AUC Biased Z-Score',
                  'Top 10% Unbiased Z-Score', 'Top 20% Unbiased Z-Score', 'Top 30% Unbiased Z-Score', 'Max Unbiased Z-Score', 'AUC Unbiased Z-Score']
small_comparison_df = None
small_comparison_fn = os.path.join(small_set_out_dir, 'Large_Comparision_Data.csv')
if os.path.isfile(small_comparison_fn):
    small_comparison_df = pd.read_csv(small_comparison_fn, sep='\t', header=0, index_col=False)
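
The scorers in the cell above are created with `cutoff=8.0` and a distance method of either `'CB'` or `'Any'`. As a rough illustration of what such a cutoff implies (a sketch under stated assumptions, not the ContactScorer implementation), two residues can be treated as a contact when the chosen distance between them is within 8.0 angstroms. The coordinates and the helper name `contact_pairs` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations


def contact_pairs(coords, cutoff=8.0):
    """Return residue index pairs whose Euclidean distance is within the cutoff."""
    contacts = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            contacts.append((i, j))
    return contacts


# Hypothetical beta carbon coordinates (in angstroms) for four residues.
cb_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contact_pairs(cb_coords))  # [(0, 1), (0, 2), (1, 2)]
```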

# Generating Values For Comparison #
To determine the effectiveness of the new method and implementation, the covariation of the same proteins will be computed using the previous Evolutionary Trace covariation method (ET-MIp) and other methods in the field.

## ET-MIp ##
Scoring the covariation of the proteins using the previous Evolutionary Trace covariation method (ET-MIp).

In [None]:
# from ETMIPWrapper import ETMIPWrapper
# etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp')
# if not os.path.isdir(etmip_out_dir):
#     os.makedirs(etmip_out_dir)
# etmip_scores = {}
# counts = {'success':0, 'value': 0, 'attribute':0}
# for p_id in generator.protein_data:
#     print('Attempting to calculate ET-MIp covariance for: {}'.format(p_id))
#     try:
#         protein_out_dir = os.path.join(etmip_out_dir, p_id)
#         if not os.path.isdir(protein_out_dir):
#             os.makedirs(protein_out_dir)
#         curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id, polymer_type='Protein')
#         curr_aln.import_alignment()
#         curr_etmip = ETMIPWrapper(alignment=curr_aln)
#         curr_etmip.calculate_scores(out_dir=protein_out_dir, delete_files=False)
#         etmip_scores[p_id] = curr_etmip
#         print('Successfully computed ET-MIp covariance for: {}'.format(p_id))
#         counts['success'] += 1
#     except ValueError:
#         print('Could not compute ET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
#             p_id, curr_aln.seq_length, curr_aln.size))
#         counts['value'] += 1
#     except AttributeError:
#         print('Could not compute ET-MIp covariance for: {} no alignment'.format(p_id))
#         counts['attribute'] += 1
# print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
#                                                                      counts['attribute']))

## ET-MIp (Continued)
The previous implementation is not able to run on alignments of the size used here. Instead, we use the new implementation with the same parameterization as the previous implementation (distance model: blosum62 similarity; tree: ET UPGMA variant; scoring metric: filtered average product corrected mutual information; ranks: all).
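
The scoring metric named above builds on mutual information (MI) with the average product correction (APC). The sketch below shows only the commonly used form of that correction, MIp(i, j) = MI(i, j) - mean MI of column i * mean MI of column j / mean MI over all pairs. It is not the EvolutionaryTrace implementation: the filtering step and the tracing over tree ranks are not reproduced, the function names are hypothetical, and the input is assumed to be an ungapped alignment with at least two columns.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (natural log) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
               for (a, b), c in count_ab.items())


def apc_corrected_mi(sequences):
    """Return {(i, j): MIp} for all column pairs i < j."""
    columns = list(zip(*sequences))
    pairs = list(combinations(range(len(columns)), 2))
    mi = {(i, j): mutual_information(columns[i], columns[j]) for i, j in pairs}
    overall_mean = sum(mi.values()) / len(mi)
    if overall_mean == 0.0:
        return mi  # no signal anywhere, so there is nothing to correct
    # Mean MI of each column with every other column.
    col_mean = []
    for k in range(len(columns)):
        values = [v for (i, j), v in mi.items() if k in (i, j)]
        col_mean.append(sum(values) / len(values))
    return {(i, j): mi[(i, j)] - col_mean[i] * col_mean[j] / overall_mean
            for i, j in pairs}


# Toy alignment: columns 0 and 1 vary together, column 2 is invariant,
# and column 3 varies independently of the others.
toy_alignment = ['ACDA', 'ACDC', 'GTDA', 'GTDC']
scores = apc_corrected_mi(toy_alignment)
print(max(scores, key=scores.get))  # (0, 1)
```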

In [None]:
from EvolutionaryTrace import EvolutionaryTrace
etmip_out_dir = os.path.join(small_set_out_dir, 'ET-MIp')
if not os.path.isdir(etmip_out_dir):
    os.makedirs(etmip_out_dir)
etmip_method_df = None
counts = {'success':0, 'value': 0, 'attribute':0}
for p_id in generator.protein_data:
    print('Attempting to calculate ET-MIp covariance for: {}'.format(p_id))
    try:
        protein_out_dir = os.path.join(etmip_out_dir, p_id)
        if not os.path.isdir(protein_out_dir):
            os.makedirs(protein_out_dir)
        start_time = time()
        curr_etmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                       aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                       distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                       ranks=None, position_type='pair',
                                       scoring_metric='filtered_average_product_corrected_mutual_information',
                                       gap_correction=None, out_dir=protein_out_dir,
                                       output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                       processors=10, low_memory=True)
        init_time = time()
        curr_etmip.import_and_process_aln()
        import_time = time()
        curr_etmip.compute_distance_matrix_tree_and_assignments()
        dist_tree_time = time()
        curr_etmip.perform_trace()
        end_time = time()
        print('Successfully computed ET-MIp covariance for: {}'.format(p_id))
        # Compute statistics for the final scores of the ET-MIp model
        protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
            predictor=curr_etmip, verbosity=2, out_dir=protein_out_dir, dist='CB', biased_w2_ave=None,
            unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_etmip.scorer.position_size,
            rank_type=curr_etmip.scorer.rank_type, file_prefix='ET-MIp_Scores_', plots=True)
        # Score Prediction Clustering
        z_score_fn = os.path.join(protein_out_dir, 'ET-MIp_Scores_Dist-Any_{}_ZScores.tsv')
        z_score_plot_fn = os.path.join(protein_out_dir, 'ET-MIp_Scores_Dist-Any_{}_ZScores.png')
        z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - curr_etmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
            w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
        biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
        protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
        protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
        protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
        protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
        plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
        z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - curr_etmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
            w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
        unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
        protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
        protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
        protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
        protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
        plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
        # Record execution times
        protein_df['Init Time'] = init_time - start_time
        protein_df['Import Time'] = import_time - init_time
        protein_df['Dist Tree Time'] = dist_tree_time - import_time
        protein_df['Trace Time'] = end_time - dist_tree_time
        protein_df['Total Time'] = end_time - start_time
        # Record static data for this protein
        protein_df['Protein'] = p_id
        protein_df['Method'] = 'ET-MIp'
        protein_df['Alignment Size'] = generator.protein_data[p_id]['Filtered_Alignment']
        if etmip_method_df is None:
            etmip_method_df = protein_df
        else:
            etmip_method_df = etmip_method_df.append(protein_df)
        print('Metrics measured for ET-MIp covariance for: {}'.format(p_id))
        counts['success'] += 1
    except ValueError:
        print('Could not compute ET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
            p_id, curr_etmip.original_aln.seq_length, curr_etmip.original_aln.size))
        counts['value'] += 1
    except AttributeError:
        print('Could not compute ET-MIp covariance for: {} no alignment'.format(p_id))
        counts['attribute'] += 1
print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                     counts['attribute']))
if small_comparison_df is None:
    small_comparison_df = etmip_method_df
else:
    small_comparison_df = small_comparison_df.append(etmip_method_df)
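
The repeated lookups in the cell above, such as `Biased Z-Score at 10%`, select the first Z-score whose `Num_Residues` value reaches a given fraction of the chain length. The sketch below expresses that lookup as a small helper over plain lists. The name `z_score_at_fraction` and the example values are hypothetical, and the Z-scores are assumed to be numeric already (the cell above coerces non-numeric entries to NaN first).

```python
def z_score_at_fraction(num_residues, z_scores, chain_length, fraction):
    """Return the first Z-score whose residue count reaches fraction * chain_length."""
    threshold = chain_length * fraction
    for count, z_score in zip(num_residues, z_scores):
        if count >= threshold:
            return z_score
    return None  # the requested fraction of the chain is never reached


# Hypothetical clustering results for a chain of 40 residues.
num_residues = [2, 5, 8, 13, 20]
z_scores = [0.5, 1.2, 2.0, 1.7, 1.1]
print(z_score_at_fraction(num_residues, z_scores, chain_length=40, fraction=0.10))  # 1.2
print(z_score_at_fraction(num_residues, z_scores, chain_length=40, fraction=0.30))  # 1.7
```

A helper like this would replace each of the long indexing expressions with a single call per threshold.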

## cET-MIp
This segment scores covariation using the ET-MIp method when the trace is constrained to an arbitrary set of nodes (ranks 1, 2, 3, 5, 7, 10, 25) at the top of the phylogenetic tree.

In [None]:
cetmip_out_dir = os.path.join(small_set_out_dir, 'cET-MIp')
if not os.path.isdir(cetmip_out_dir):
    os.makedirs(cetmip_out_dir)
cetmip_method_df = None
counts = {'success':0, 'value': 0, 'attribute':0}
for p_id in generator.protein_data:
    print('Attempting to calculate cET-MIp covariance for: {}'.format(p_id))
    try:
        protein_out_dir = os.path.join(cetmip_out_dir, p_id)
        if not os.path.isdir(protein_out_dir):
            os.makedirs(protein_out_dir)
        start_time = time()
        curr_cetmip = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                       aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=True,
                                       distance_model='blosum62', tree_building_method='et', tree_building_options={},
                                       ranks=[1, 2, 3, 5, 7, 10, 25], position_type='pair',
                                       scoring_metric='filtered_average_product_corrected_mutual_information',
                                       gap_correction=None, out_dir=protein_out_dir,
                                       output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                       processors=10, low_memory=True)
        init_time = time()
        curr_cetmip.import_and_process_aln()
        import_time = time()
        curr_cetmip.compute_distance_matrix_tree_and_assignments()
        dist_tree_time = time()
        curr_cetmip.perform_trace()
        end_time = time()
        # Compute statistics for the final scores of the cET-MIp model
        protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
            predictor=curr_cetmip, verbosity=2, out_dir=protein_out_dir, dist='CB', biased_w2_ave=None,
            unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_cetmip.scorer.position_size,
            rank_type=curr_cetmip.scorer.rank_type, file_prefix='cET-MIp_Scores_', plots=True)
        # Score Prediction Clustering
        z_score_fn = os.path.join(protein_out_dir, 'cET-MIp_Scores_Dist-Any_{}_ZScores.tsv')
        z_score_plot_fn = os.path.join(protein_out_dir, 'cET-MIp_Scores_Dist-Any_{}_ZScores.png')
        z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - curr_cetmip.coverage, bias=True, file_path=z_score_fn.format('Biased'),
            w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
        biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
        protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
        protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
        protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
        protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
        plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
        z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - curr_cetmip.coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
            w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
        unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
        protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
        protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
        protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
        protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
        plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
        # Record execution times
        protein_df['Init Time'] = init_time - start_time
        protein_df['Import Time'] = import_time - init_time
        protein_df['Dist Tree Time'] = dist_tree_time - import_time
        protein_df['Trace Time'] = end_time - dist_tree_time
        protein_df['Total Time'] = end_time - start_time
        # Record static data for this protein
        protein_df['Protein'] = p_id
        protein_df['Method'] = 'cET-MIp'
        protein_df['Alignment Size'] = generator.protein_data[p_id]['Filtered_Alignment']
        if cetmip_method_df is None:
            cetmip_method_df = protein_df
        else:
            cetmip_method_df = cetmip_method_df.append(protein_df)
        print('Successfully computed cET-MIp covariance for: {}'.format(p_id))
        counts['success'] += 1
    except ValueError:
        print('Could not compute cET-MIp covariance for: {} with seq_length: {} and size: {}'.format(
            p_id, curr_cetmip.original_aln.seq_length, curr_cetmip.original_aln.size))
        counts['value'] += 1
    except AttributeError:
        print('Could not compute cET-MIp covariance for: {} no alignment'.format(p_id))
        counts['attribute'] += 1
print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                     counts['attribute']))
if small_comparison_df is None:
    small_comparison_df = cetmip_method_df
else:
    small_comparison_df = small_comparison_df.append(cetmip_method_df)

## DCA ##
Scoring the covariation of the proteins using a Julia implementation of DCA.

In [None]:
from DCAWrapper import DCAWrapper
from utils import compute_rank_and_coverage
dca_out_dir = os.path.join(small_set_out_dir, 'DCA')
if not os.path.isdir(dca_out_dir):
    os.makedirs(dca_out_dir)
dca_method_df = None
counts = {'success':0, 'value': 0, 'attribute':0}
for p_id in generator.protein_data:
    print('Attempting to calculate DCA covariance for: {}'.format(p_id))
    try:
        protein_out_dir = os.path.join(dca_out_dir, p_id)
        if not os.path.isdir(protein_out_dir):
            os.makedirs(protein_out_dir)
        curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                polymer_type='Protein')
        curr_aln.import_alignment()
        # Since the DCA implementation used here does not provide a way to specify the query sequence we remove the gaps
        # from the query sequences so positions will be referenced correctly for that sequence (and unnecessary
        # computations can be avoided).
        curr_aln = curr_aln.remove_gaps()
        new_aln_fn = os.path.join(protein_out_dir, '{}_no_gap.fasta'.format(p_id))
        curr_aln.write_out_alignment(new_aln_fn)
        curr_aln.file_name = new_aln_fn
        curr_dca = DCAWrapper(alignment=curr_aln)
        curr_dca.calculate_scores(out_dir=protein_out_dir, delete_file=False)
        # Compute statistics for the final scores of the DCA model. DCA scores are pairwise (pos_size=2) and
        # higher scores are better (rank_type='max'), matching the coverage computation that follows.
        protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
            predictor=curr_dca, verbosity=2, out_dir=protein_out_dir, dist='CB', biased_w2_ave=None,
            unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2, rank_type='max',
            file_prefix='DCA_Scores_', plots=True)
        # Score Prediction Clustering
        _, dca_coverage = compute_rank_and_coverage(seq_length=curr_dca.alignment.seq_length,
                                                    scores=curr_dca.scores, pos_size=2, rank_type='max')
        z_score_fn = os.path.join(protein_out_dir, 'DCA_Scores_Dist-Any_{}_ZScores.tsv')
        z_score_plot_fn = os.path.join(protein_out_dir, 'DCA_Scores_Dist-Any_{}_ZScores.png')
        z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - dca_coverage, bias=True, file_path=z_score_fn.format('Biased'),
            w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
        biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
        protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
        protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
        protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
        protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
        plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
        z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - dca_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
            w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
        unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
        protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
        protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
        protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
        protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
        plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
        # Record execution times
        protein_df['Init Time'] = None
        protein_df['Import Time'] = None
        protein_df['Dist Tree Time'] = None
        protein_df['Trace Time'] = None
        protein_df['Total Time'] = None
        # Record static data for this protein
        protein_df['Protein'] = p_id
        protein_df['Method'] = 'DCA'
        protein_df['Alignment Size'] = generator.protein_data[p_id]['Filtered_Alignment']
        if dca_method_df is None:
            dca_method_df = protein_df
        else:
            dca_method_df = dca_method_df.append(protein_df)
        print('Successfully computed DCA covariance for: {}'.format(p_id))
        counts['success'] += 1
    except ValueError:
        print('Could not compute DCA covariance for: {} with seq_length: {} and size: {}'.format(
            p_id, curr_aln.seq_length, curr_aln.size))
        counts['value'] += 1
    except AttributeError:
        print('Could not compute DCA covariance for: {} no alignment'.format(p_id))
        counts['attribute'] += 1
print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                     counts['attribute']))
if small_comparison_df is None:
    small_comparison_df = dca_method_df
else:
    small_comparison_df = small_comparison_df.append(dca_method_df)
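
The cell above converts raw DCA scores to coverage values before scoring the clustering of predictions. The exact behavior of `compute_rank_and_coverage` is not shown in this notebook, so the sketch below is only one plausible reading, stated as an assumption: coverage for an item is the fraction of all items whose score is at least as good, so the best-scoring pair has the smallest coverage. The helper name `coverage_from_scores` and the example scores are hypothetical.

```python
def coverage_from_scores(scores, higher_is_better=True):
    """Map each score to the fraction of scores that are at least as good."""
    n = len(scores)
    if higher_is_better:
        return [sum(other >= s for other in scores) / n for s in scores]
    return [sum(other <= s for other in scores) / n for s in scores]


# Hypothetical scores for four residue pairs, where higher means stronger coupling.
pair_scores = [0.9, 0.1, 0.5, 0.5]
print(coverage_from_scores(pair_scores))  # [0.25, 1.0, 0.75, 0.75]
```

Under this reading, `1.0 - coverage` (as passed to the clustering score in the cell above) is largest for the most confident predictions.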

## EVCouplings ##
Scoring the covariation of the proteins using the standard protocol of the EVCouplings method.

In [None]:
from EVCouplingsWrapper import EVCouplingsWrapper
evc_standard_out_dir = os.path.join(small_set_out_dir, 'EVCouplings_Standard')
if not os.path.isdir(evc_standard_out_dir):
    os.makedirs(evc_standard_out_dir)
evc_standard_method_df = None
counts = {'success':0, 'value': 0, 'attribute':0}
for p_id in generator.protein_data:
    print('Attempting to calculate EV couplings standard protocol covariance for: {}'.format(p_id))
    try:
        protein_out_dir = os.path.join(evc_standard_out_dir, p_id)
        if not os.path.isdir(protein_out_dir):
            os.makedirs(protein_out_dir)
        curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                polymer_type='Protein')
        curr_aln.import_alignment()
        curr_evc = EVCouplingsWrapper(alignment=curr_aln, protocol='standard')
        curr_evc.calculate_scores(out_dir=protein_out_dir, cores=10, delete_files=True)
        # Compute statistics for the final scores of the EVCouplings model. The scores are pairwise (pos_size=2)
        # and higher scores are better (rank_type='max'), matching the coverage computation that follows.
        protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
            predictor=curr_evc, verbosity=2, out_dir=protein_out_dir, dist='CB', biased_w2_ave=None,
            unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=2, rank_type='max',
            file_prefix='EVC_Standard_Scores_', plots=True)
        # Score Prediction Clustering
        _, evc_standard_coverage = compute_rank_and_coverage(seq_length=curr_evc.alignment.seq_length,
                                                             scores=curr_evc.scores, pos_size=2, rank_type='max')
        z_score_fn = os.path.join(protein_out_dir, 'EVC_Standard_Scores_Dist-Any_{}_ZScores.tsv')
        z_score_plot_fn = os.path.join(protein_out_dir, 'EVC_Standard_Scores_Dist-Any_{}_ZScores.png')
        z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - evc_standard_coverage, bias=True, file_path=z_score_fn.format('Biased'),
            w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
        biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
        protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
        protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.10][0]
        protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[protein_scorers[p_id]['Scorer_Any'].best_chain])) * 0.30][0]
        protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
        plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
        z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - evc_standard_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
            w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
        unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
        protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
        # Length of the query chain, used to locate the Z-scores at 10% and 30% of its residues
        query_length = float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[
            protein_scorers[p_id]['Scorer_Any'].best_chain]))
        protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= query_length * 0.10][0]
        protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= query_length * 0.30][0]
        protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
        plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
        # Record execution times
        protein_df['Init Time'] = None
        protein_df['Import Time'] = None
        protein_df['Dist Tree Time'] = None
        protein_df['Trace Time'] = None
        protein_df['Total Time'] = None
        # Record static data for this protein
        protein_df['Protein'] = p_id
        protein_df['Method'] = 'EVC Standard'
        protein_df['Alignment Size'] = generator.protein_data[p_id]['Filtered_Alignment']
        if evc_standard_method_df is None:
            evc_standard_method_df = protein_df
        else:
            evc_standard_method_df = evc_standard_method_df.append(protein_df)
        print('Successfully computed EV couplings standard protocol covariance for: {}'.format(p_id))
        counts['success'] += 1
    except ValueError:
        print('Could not compute EV couplings standard protocol covariance for: {} with seq_length: {} and size: {}'.format(
            p_id, curr_aln.seq_length, curr_aln.size))
        counts['value'] += 1
    except AttributeError:
        print('Could not compute EV couplings standard protocol covariance for: {} no alignment'.format(p_id))
        counts['attribute'] += 1
print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                     counts['attribute']))
if small_comparison_df is None:
    small_comparison_df = evc_standard_method_df
else:
    small_comparison_df = small_comparison_df.append(evc_standard_method_df)
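
The "Z-Score at 10%" and "Z-Score at 30%" columns computed in the cell above take the first Z-score recorded once the number of clustered residues reaches that fraction of the query chain length. The following is a minimal, self-contained sketch of that lookup, not the scorer's own implementation. The helper name `z_score_at_fraction`, the example table, and the `'-'` placeholder for a non-numeric entry are hypothetical, and the sketch assumes the rows are ordered by increasing `Num_Residues`.

```python
import numpy as np
import pandas as pd


def z_score_at_fraction(z_score_df, query_length, fraction):
    """Return the first Z-score whose row covers at least `fraction` of the query residues.

    Hypothetical helper: assumes `z_score_df` has 'Num_Residues' and 'Z-Score' columns and
    is ordered by increasing 'Num_Residues'.
    """
    # Non-numeric entries are coerced to NaN, as in the cell above.
    z_scores = pd.to_numeric(z_score_df['Z-Score'], errors='coerce').to_numpy()
    reached = (z_score_df['Num_Residues'] >= query_length * fraction).to_numpy()
    if not reached.any():
        # No row reaches the requested fraction; report NaN instead of raising IndexError.
        return np.nan
    return z_scores[reached][0]


# Made-up example data, for illustration only.
example = pd.DataFrame({'Num_Residues': [2, 5, 10, 20, 30],
                        'Z-Score': ['-', 1.5, 2.0, 3.5, 3.0]})
at_10 = z_score_at_fraction(example, query_length=100, fraction=0.10)
at_30 = z_score_at_fraction(example, query_length=100, fraction=0.30)
print(at_10, at_30)
```

Returning NaN when no row reaches the threshold is a design choice of this sketch; the notebook code indexes `[0]` directly and would raise an `IndexError` in that case.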

Scoring the covariation of the proteins using the mean field protocol of the EVCouplings method.

In [None]:
evc_mf_out_dir = os.path.join(small_set_out_dir, 'EVCouplings_Mean_Field')
if not os.path.isdir(evc_mf_out_dir):
    os.makedirs(evc_mf_out_dir)
evc_mean_field_method_df = None
counts = {'success': 0, 'value': 0, 'attribute': 0}
for p_id in generator.protein_data:
    print('Attempting to calculate EV couplings mean field protocol covariance for: {}'.format(p_id))
    try:
        protein_out_dir = os.path.join(evc_mf_out_dir, p_id)
        if not os.path.isdir(protein_out_dir):
            os.makedirs(protein_out_dir)
        curr_aln = SeqAlignment(file_name=generator.protein_data[p_id]['Final_FA_Aln'], query_id=p_id,
                                polymer_type='Protein')
        curr_aln.import_alignment()
        curr_evc = EVCouplingsWrapper(alignment=curr_aln, protocol='mean_field')
        curr_evc.calculate_scores(out_dir=protein_out_dir, cores=10, delete_files=True)
        # Compute statistics for the final scores of the EVCouplings mean field model
        # NOTE: pos_size and rank_type are read from curr_etmip, which is not defined in this cell and
        # must still be in scope from an earlier cell.
        protein_df, _, _ = protein_scorers[p_id]['Scorer_CB'].evaluate_predictor(
            predictor=curr_evc, verbosity=2, out_dir=protein_out_dir, dist='CB', biased_w2_ave=None,
            unbiased_w2_ave=None, processes=10, threshold=0.5, pos_size=curr_etmip.scorer.position_size,
            rank_type=curr_etmip.scorer.rank_type, file_prefix='EVC_Mean_Field_Scores_', plots=True)
        # Score Prediction Clustering
        _, evc_mf_coverage = compute_rank_and_coverage(
            seq_length=curr_evc.alignment.seq_length, scores=curr_evc.scores, pos_size=2, rank_type='max')
        z_score_fn = os.path.join(protein_out_dir, 'EVC_Mean_Field_Scores_Dist-Any_{}_ZScores.tsv')
        z_score_plot_fn = os.path.join(protein_out_dir, 'EVC_Mean_Field_Scores_Dist-Any_{}_ZScores.png')
        z_score_biased, biased_w2_ave, biased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - evc_mf_coverage, bias=True, file_path=z_score_fn.format('Biased'),
            w2_ave_sub=protein_scorers[p_id]['biased_w2_ave'], processes=10)
        biased_z_score_array = np.array(pd.to_numeric(z_score_biased['Z-Score'], errors='coerce'))
        protein_df['Max Biased Z-Score'] = np.nanmax(biased_z_score_array)
        # Length of the query chain, used to locate the Z-scores at 10% and 30% of its residues
        query_length = float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[
            protein_scorers[p_id]['Scorer_Any'].best_chain]))
        protein_df['Biased Z-Score at 10%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= query_length * 0.10][0]
        protein_df['Biased Z-Score at 30%'] = biased_z_score_array[z_score_biased['Num_Residues'] >= query_length * 0.30][0]
        protein_df['AUC Biased Z-Score'] = biased_scw_z_auc
        plot_z_scores(z_score_biased, z_score_plot_fn.format('Biased'))
        z_score_unbiased, unbiased_w2_ave, unbiased_scw_z_auc = protein_scorers[p_id]['Scorer_Any'].score_clustering_of_contact_predictions(
            1.0 - evc_mf_coverage, bias=False, file_path=z_score_fn.format('Unbiased'),
            w2_ave_sub=protein_scorers[p_id]['unbiased_w2_ave'], processes=10)
        unbiased_z_score_array = np.array(pd.to_numeric(z_score_unbiased['Z-Score'], errors='coerce'))
        protein_df['Max Unbiased Z-Score'] = np.nanmax(unbiased_z_score_array)
        # Length of the query chain, used to locate the Z-scores at 10% and 30% of its residues
        query_length = float(len(protein_scorers[p_id]['Scorer_Any'].query_structure.seq[
            protein_scorers[p_id]['Scorer_Any'].best_chain]))
        protein_df['Unbiased Z-Score at 10%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= query_length * 0.10][0]
        protein_df['Unbiased Z-Score at 30%'] = unbiased_z_score_array[z_score_unbiased['Num_Residues'] >= query_length * 0.30][0]
        protein_df['AUC Unbiased Z-Score'] = unbiased_scw_z_auc
        plot_z_scores(z_score_unbiased, z_score_plot_fn.format('Unbiased'))
        # Record execution times
        protein_df['Init Time'] = None
        protein_df['Import Time'] = None
        protein_df['Dist Tree Time'] = None
        protein_df['Trace Time'] = None
        protein_df['Total Time'] = None
        # Record static data for this protein
        protein_df['Protein'] = p_id
        protein_df['Method'] = 'EVC Mean Field'
        protein_df['Alignment Size'] = generator.protein_data[p_id]['Filtered_Alignment']
        if evc_mean_field_method_df is None:
            evc_mean_field_method_df = protein_df
        else:
            evc_mean_field_method_df = evc_mean_field_method_df.append(protein_df)
        print('Successfully computed EV couplings mean field protocol covariance for: {}'.format(p_id))
        counts['success'] += 1
    except ValueError:
        print('Could not compute EV couplings mean field protocol covariance for: {} with seq_length: {} and size: {}'.format(
            p_id, curr_aln.seq_length, curr_aln.size))
        counts['value'] += 1
    except AttributeError:
        print('Could not compute EV couplings mean field protocol covariance for: {} no alignment'.format(p_id))
        counts['attribute'] += 1
print('{}\tSuccesses\n{}\tValue Errors\n{}\tAttribute Errors'.format(counts['success'], counts['value'],
                                                                     counts['attribute']))
if small_comparison_df is None:
    small_comparison_df = evc_mean_field_method_df
else:
    small_comparison_df = small_comparison_df.append(evc_mean_field_method_df)
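
The cells above accumulate results with `DataFrame.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0. On newer pandas versions the same accumulation could be written with `pd.concat`. The sketch below is one possible way to do that, not the notebook's own code: the helper name `combine_method_frames`, the protein identifiers, and the `AUROC` column are hypothetical and stand in for the per-protein frames built in the loop.

```python
import pandas as pd


def combine_method_frames(frames):
    """Concatenate result frames, skipping None entries (e.g. a method with no successes).

    Hypothetical helper; returns None when there is nothing to combine, mirroring the
    `is None` checks used in the notebook.
    """
    usable = [df for df in frames if df is not None]
    if not usable:
        return None
    return pd.concat(usable, ignore_index=True)


# Made-up frames standing in for per-protein results.
first = pd.DataFrame({'Protein': ['p1'], 'Method': ['EVC Standard'], 'AUROC': [0.70]})
second = pd.DataFrame({'Protein': ['p2'], 'Method': ['EVC Mean Field'], 'AUROC': [0.65]})
combined = combine_method_frames([first, None, second])
print(combined)
```

Unlike `append` with default arguments, `ignore_index=True` renumbers the rows. Because the final table is written with `index=False`, that difference does not affect the output file.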

In [None]:
# Write out final comparison data so it can be loaded later for generating figures.
small_comparison_df.to_csv(small_comparison_fn, sep='\t', header=True, index=False, columns=output_columns)
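
As a minimal sketch of the later reload step, the tab-separated file can be read back with `pd.read_csv` using the same separator. The file name, column list, and data below are hypothetical placeholders for `small_comparison_fn`, `output_columns`, and `small_comparison_df`; the example writes to a temporary directory so it does not touch the real output.

```python
import os
import tempfile

import pandas as pd

# Made-up results; 'Scratch' is an extra column that is deliberately not written out.
results = pd.DataFrame({'Protein': ['p1', 'p2'],
                        'Method': ['EVC Standard', 'EVC Mean Field'],
                        'Alignment Size': [120, 85],
                        'Scratch': [0, 1]})
columns_to_write = ['Protein', 'Method', 'Alignment Size']

with tempfile.TemporaryDirectory() as tmp_dir:
    comparison_fn = os.path.join(tmp_dir, 'comparison.tsv')
    # Same call pattern as the cell above: tab separator, header row, no index column.
    results.to_csv(comparison_fn, sep='\t', header=True, index=False, columns=columns_to_write)
    # Reload with the matching separator; only the selected columns come back.
    reloaded = pd.read_csv(comparison_fn, sep='\t')

print(reloaded)
```

Only the columns passed through `columns=` are written, so any column needed for figures has to be included in that list before the file is saved.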