# Method Characteriation #
## Goal ##
The goal of this test set is to perform proof of concept testing on a small number of proteins with a wide range of sizes and available homologs, orthologs, and paralogs. By doing so it should be possible to test the best parameterization for this tool as well as identifying the strengths and weaknesses of the tool using various measurments as end points.
## Warning ##
Before attempting to use this notebook make sure that your .env file has been properly setup to reflect the correct locations of command line tools and the location of files and directories needed for execution.
### Initial Import###
This first cell performs the necessary imports required to begin this notebook.

In [1]:
from dotenv import find_dotenv, load_dotenv
try:
    dotenv_path = find_dotenv(raise_error_if_not_found=True)
except IOError:
    dotenv_path = find_dotenv(raise_error_if_not_found=True, usecwd=True)
load_dotenv(dotenv_path)
import os
import sys
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src'))
sys.path.append(os.path.join(os.environ.get('PROJECT_PATH'), 'src', 'SupportingClasses'))
input_dir = os.environ.get('INPUT_PATH')

## Data Set Construction ##
The first task required to test the data set is to download the required data and construct any necessary input files for all down stream analyses.
In this case that means:
* Downloading PDB files for the proteins in our small test set.
* Extracting a query sequence from each PDB file.
* Searching for paralogs, homologs, and orthologs in a custom BLAST database built by filtering the Uniref90 database.
* Filtering the hits from the BLAST search to meet minimum and maximum length requirements, as well as minimum and maximum identity requirements.
* Building alignments using CLUSTALW in both the fasta and msf formats since some of the tools which will be used for comparison need different formats.
* Filtering the alignment for maximum identity similarity between seqeunces.
* Re-aligning the filtered sequences using CLUSTALW.
This is all handeled by the DataSetGenerator class found in the src/SupportingClasses folder

In [2]:
from time import time
from DataSetGenerator import DataSetGenerator
protein_list_dir = os.path.join(input_dir, 'ProteinLists')
if not os.path.isdir(protein_list_dir):
    os.makedirs(protein_list_dir)
small_list_fn = os.path.join(protein_list_dir, 'SmallDataSet.txt')
if not os.path.isfile(small_list_fn):
    proteins_of_interest = ['2ysdA', '1c17A', '3tnuA', '7hvpA', '135lA', '206lA', '2werA', '1bolA', '3q05A', '1axbA',
                            '2rh1A', '1hckA', '3b6vA', '2z0eA', '1jwlA', '1a26A', '1c0kA', '4lliA', '4ycuA', '2iopA',
                            '2zxeA', '2b59B', '1h1vG']
    with open(small_list_fn, 'w') as small_list_handle:
        for p_id in proteins_of_interest:
            small_list_handle.write('{}\n'.format(p_id))
generator = DataSetGenerator(input_dir)
start = time()
summary = generator.build_pdb_alignment_dataset(protein_list_fn=os.path.basename(small_list_fn), num_threads=10,
                                                database='customuniref90.fasta', max_target_seqs=2500, remote=False, verbose=False)
summary['Accession'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Accession'])
summary['Length'] = summary['Protein_ID'].apply(lambda x: generator.protein_data[x]['Length'])
summary['Total_Size'] = summary.apply(lambda x: float(x['Length']) * float(x['Filtered_Alignment']), axis=1)
summary.sort_values(by=['Total_Size', 'Length', 'Filtered_Alignment'], axis=0, inplace=True)
print(summary[['Protein_ID', 'Accession', 'BLAST_Hits', 'Filtered_BLAST', 'Filtered_Alignment', 'Length', 'Total_Size']])
end = time()
print('It took {} min to generate the data set.'.format((end - start) / 60.0))
summary.to_csv(os.path.join(input_dir, 'small_data_set_summary.tsv'), sep='\t', index=False, header=True,
               columns=['Protein_ID', 'Accession', 'BLAST_Hits', 'Filtered_BLAST', 'Filtered_Alignment', 'Length', 'Total_Size'])

Importing protein list
Downloading structures and parsing in query sequences
BLASTing query sequences
Filtering BLAST hits, aligning, filtering by identity, and re-aligning
   Protein_ID       UniProt  BLAST_Hits  Filtered_BLAST  Filtered_Alignment  \
21       2b59    CIPA_CLOTM        1606               4                   4   
5        206l      LYS_BPT4        1039             131                  92   
16       1c0k    OXDA_RHOTO        2500              51                  51   
7        1bol    RNRH_RHINI        2500             131                 127   
3        7hvp     POL_HV1A2        2500              33                  33   
1        1c17    ATPL_ECOLI        1671             850                 813   
14       1jwl    LACI_ECOLI        2500             228                 226   
8        3q05     P53_HUMAN         932             306                 210   
4        135l    LYSC_MELGA        1913             853                 818   
13       2z0e   ATG4B_HUMAN        21

Create a location to store the output of this parameter tuning.

In [3]:
output_dir = os.environ.get('OUTPUT_PATH')
characterization_out_dir = os.path.join(output_dir, 'Characterization')
if not os.path.isdir(characterization_out_dir):
    os.makedirs(characterization_out_dir)

## Method Characterization##
This section performs the Evoluationary Trace method for covariation ('pair's of residues) with different parameters to help determine which provide the optimal behavior (in this case measured by ability to predict structural contacts: AUROC, Precision, and clustering: Biased SCW Z-Score, Unbiased SCW Z-Score).
### Distance Metric Parameters###
* identity', False - Uses the identity metric to compute the distance between the sequences in the provided alignment.
* 'blosum62', True - Uses the similarity metric (as defined by the off diagonal values from the 'blosum62' distance matrix) to compute the distance between the sequences in the provided alignment.
* 'blosum62', False - Uses the 'blosum62' scoring matrix to compute the edit distance between sequences in the provided alignment.
### Tree Construction Parameters###
* 'et' - A phylogenetic tree related to the UPGMA tree, but where the distance update does not use the average of columns from the current step, but rather the average distance between all contributing terminal nodes. This has been used in previous methods by our group.
* 'upgma' - A phylogenetic tree constructed using the standard UPGMA algorithm.
* 'agglomerative' (affinity='euclidean', linkage='ward') - A tree constructed based on agglomerative/hierarchical clustering over the distance matrix using the specified affinity and linkage.
### Scoring Parameters###
* 'identity' - A binary scoring of invariance at each level and node in the tree, which yields an integer score for each pair of positions (lower score means the pair position was fixed higher in the tree and therefore more important, higher score means the pair of positions became fixed lower down in the score and is therefore less important).
* 'plain_entropy' - The joint entropy between a pair of positions. This provides a real valued (floating point score) for each pair of positions, which can be interpreted the same way as the 'identity' scoring metric but with greater resolution/ability to separate positions.
* 'mutual_information' - The mutual informaiton score for each pair of positions within a node of the phylogenetic tree. This is built up over levels of the tree using the trace methodology, in this case the higher the score the better.
* 'normalized_mutual_information' - The normalized mutual informaiton score for each pair of positions within a node of the phylogenetic tree. This is built up over levels of the tree using the trace methodology, in this case the higher the score the better.
* 'average_product_corrected_mutual_information' - The average product corrected mutual informaiton (MIp) score for each pair of positions within a node of the phylogenetic tree. This is built up over levels of the tree using the trace methodology, in this case the higher the score the better (some low scores my be negative).
* 'filtered_average_product_corrected_mutual_information' - The average product corrected mutual informaiton (MIp) score for each pair of positions within a node of the phylogenetic tree. This is built up over levels of the tree using the trace methodology, in this case the higher the score the better (some low scores my be negative). This score is additionally filtered at the node level so that positions with a mutual information <= 0.0001 are set to 0 (this was done in a previously released paper by our lab on this topic).

In [6]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from EvolutionaryTrace import EvolutionaryTrace
from SupportingClasses.PDBReference import PDBReference
from SupportingClasses.ContactScorer import ContactScorer
characterization_fn = os.path.join(characterization_out_dir, 'Characterization_Data.csv')
if os.path.isfile(characterization_fn):
    characterization_df = pd.read_csv(characterization_fn, sep='\t', header=0, index_col=False)
else:
    characterization_df = None
    for p_id in summary['Protein_ID']:
        protein_df = None
        contact_scorer = None
        biased_w2_ave = None
        unbiased_w2_ave = None
        print('Characterizing protein: {}'.format(p_id))
        protein_dir = os.path.join(characterization_out_dir, p_id)
        if not os.path.isdir(protein_dir):
            os.mkdir(protein_dir)
        for dist_model, et_dist in [('identity', False), ('blosum62', True), ('blosum62', False)]:
            print('Distance model: {}, ET dist: {}'.format(dist_model, et_dist))
            for tree_building, tree_options in [('et', {}), ('upgma', {}),
                                                ('agglomerative', {'affinity': 'euclidean', 'linkage': 'ward'})]:
                print('Tree construction: {}'.format(tree_building))
                dist_tree_dir = os.path.join(
                    protein_dir, '{}{}{}{}'.format(dist_model, ('_ET' if et_dist else ''), tree_building,
                    ('_'.join(['{}_{}'.format(k, v) for k,v in tree_options.items()]) if tree_options else '')))
                if not os.path.isdir(dist_tree_dir):
                    os.mkdir(dist_tree_dir)
                for scoring_metric in ['identity', 'plain_entropy', 'mutual_information', 'normalized_mutual_information',
                                       'average_product_corrected_mutual_information',
                                       'filtered_average_product_corrected_mutual_information']:
                    print('Scoring metric: {}'.format(scoring_metric))
                    curr_et = EvolutionaryTrace(query_id=p_id, polymer_type='Protein',
                                                aln_fn=generator.protein_data[p_id]['Final_FA_Aln'], et_distance=et_dist,
                                                distance_model=dist_model, tree_building_method=tree_building,
                                                tree_building_options=tree_options, ranks=None, position_type='pair',
                                                scoring_metric=scoring_metric, gap_correction=None, out_dir=protein_dir,
                                                output_files={'original_aln', 'non_gap_aln', 'tree', 'scores'},
                                                processors=10, low_memory=True)
                    curr_et.import_and_process_aln()
                    curr_et.out_dir = dist_tree_dir
                    curr_et.compute_distance_matrix_tree_and_assignments()
                    curr_et.perform_trace()
                    if contact_scorer is None:
                        pdb_structure = PDBReference(pdb_file=generator.protein_data[p_id]['PDB'])
                        pdb_structure.import_pdb(structure_id=p_id)
                        print(type(curr_et.non_gapped_aln))
                        contact_scorer = ContactScorer(query=p_id, seq_alignment=curr_et.non_gapped_aln,
                                                       pdb_reference=pdb_structure, cutoff=8.0)
                        contact_scorer.best_chain = generator.protein_data[p_id]['Chain']
                        contact_scorer.fit()
                        contact_scorer.measure_distance(method='Any')
                    curr_df, biased_dict, unbiased_dict = contact_scorer.evaluate_predictor(predictor=curr_et, verbosity=1, out_dir=dist_tree_dir, dist='Any', biased_w2_ave=biased_w2_ave,
                                                                                            unbiased_w2_ave=unbiased_w2_ave, processes=10, threshold=0.5, pos_size=curr_et.scorer.position_size,
                                                                                            rank_type=curr_et.scorer.rank_type, file_prefix='{}_Scores_'.format(scoring_metric))
                    curr_df['Distance Model'] = '{}{}'.format(dist_model, '_similarity' if et_dist else '')
                    curr_df['Tree Type'] = tree_building
                    curr_df['Scoring Metric'] = scoring_metric
                    curr_df['Method'] = '{}{}_{}_{}'.format(dist_model, '_similarity' if et_dist else '', tree_building, scoring_metric)
                    if protein_df is None:
                        protein_df = curr_df
                    else:
                        protein_df = protein_df.append(curr_df)
        protein_df['Protein'] = p_id
        distance_model_order = ['identity', 'blosum62_similarity', 'blosum62']
        tree_type_order = ['et', 'upgma', 'agglomerative']
        scoring_metric_order = ['identity', 'plain_entropy', 'mutual_information', 'normalized_mutual_information', 'average_product_corrected_mutual_information', 'filtered_average_product_corrected_mutual_information']
        sequence_separation_order = ['Any', 'Neighbors', 'Short', 'Medium', 'Long']
        #########################################################################################################################################################################################################################
        # Plot AUROC Data by Distance model
        fn1a = os.path.join(protein_dir, 'AUROC_Distance_Comparison.png')
        if not os.path.isfile(fn1a):
            auroc = sns.catplot(data=protein_df, x='Distance Model', y='AUROC', legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("Distance Model Comparison")
            plt.show()
            auroc.savefig(fn1a, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Distance model add hue (sequence separation)
        fn2a = os.path.join(protein_dir, 'AUROC_Distance_Comparison_Hue.png')
        if not os.path.isfile(fn2a):
            auroc = sns.catplot(data=protein_df, x='Distance Model', y='AUROC', hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("Distance Model Comparison")
            auroc.savefig(fn2a, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Distance model add col (tree type)
        fn3a = os.path.join(protein_dir, 'AUROC_Distance_Comparison_Col.png')
        if not os.path.isfile(fn3a):
            auroc = sns.catplot(data=protein_df, x='Distance Model', y='AUROC', col='Tree Type', col_order=tree_type_order,
                                hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("{col_var}:{col_name}")
            auroc.savefig(fn3a, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Distance model add row (scoring metric)
        fn4a = os.path.join(protein_dir, 'AUROC_Distance_Comparison_Row.png')
        if not os.path.isfile(fn4a):
            auroc = sns.catplot(data=protein_df, x='Distance Model', y='AUROC', row='Scoring Metric', row_order=scoring_metric_order, col='Tree Type', col_order=tree_type_order,
                                hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("{col_name}:{row_name}")
            auroc.savefig(fn4a, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        #########################################################################################################################################################################################################################
        # Plot AUROC Data by Tree type
        fn1b = os.path.join(protein_dir, 'AUROC_Tree_Comparison.png')
        if not os.path.isfile(fn1b):
            auroc = sns.catplot(data=protein_df, x='Tree Type', y='AUROC', legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("Tree Type Comparison")
            plt.show()
            auroc.savefig(fn1b, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Tree type add hue (sequence separation)
        fn2b = os.path.join(protein_dir, 'AUROC_Tree_Comparison_Hue.png')
        if not os.path.isfile(fn2b):
            auroc = sns.catplot(data=protein_df, x='Tree Type', y='AUROC', hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("Tree Type Comparison")
            auroc.savefig(fn2b, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Tree type add col (distance model)
        fn3b = os.path.join(protein_dir, 'AUROC_Tree_Comparison_Col.png')
        if not os.path.isfile(fn3b):
            auroc = sns.catplot(data=protein_df, x='Tree Type', y='AUROC', col='Distance Model', col_order=distance_model_order,
                                hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("{col_var}:{col_name}")
            auroc.savefig(fn3b, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Tree type add row (scoring metric)
        fn4b = os.path.join(protein_dir, 'AUROC_Tree_Comparison_Row.png')
        if not os.path.isfile(fn4b):
            auroc = sns.catplot(data=protein_df, x='Tree Type', y='AUROC', row='Scoring Metric', row_order=scoring_metric_order, col='Distance Model', col_order=distance_model_order,
                                hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("{col_name}:{row_name}")
            auroc.savefig(fn4b, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        #########################################################################################################################################################################################################################
        # Plot AUROC Data by Scoring Metric
        fn1c = os.path.join(protein_dir, 'AUROC_Metric_Comparison.png')
        if not os.path.isfile(fn1c):
            auroc = sns.catplot(data=protein_df, x='Scoring Metric', y='AUROC', legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("Scoring Metric Comparison")
            plt.show()
            auroc.savefig(fn1c, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Scoring Metric add hue (sequence separation)
        fn2c = os.path.join(protein_dir, 'AUROC_Metric_Comparison_Hue.png')
        if not os.path.isfile(fn2c):
            auroc = sns.catplot(data=protein_df, x='Scoring Metric', y='AUROC', hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("Scoring Metric Comparison")
            auroc.savefig(fn2c, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Scoring Metric add col (distance model)
        fn3c = os.path.join(protein_dir, 'AUROC_Metric_Comparison_Col.png')
        if not os.path.isfile(fn3c):
            auroc = sns.catplot(data=protein_df, x='Scoring Metric', y='AUROC', col='Distance Model', col_order=distance_model_order,
                                hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("{col_var}:{col_name}")
            auroc.savefig(fn3c, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        # Plot AUROC Data by Scoring Metric add row (tree type)
        fn4c = os.path.join(protein_dir, 'AUROC_Metric_Comparison_Row.png')
        if not os.path.isfile(fn4c):
            auroc = sns.catplot(data=protein_df, x='Scoring Metric', y='AUROC', row='Tree Type', row_order=tree_type_order, col='Distance Model', col_order=distance_model_order,
                                hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
            auroc.set(ylim=(0, 1)).set_xticklabels(rotation=30).set_titles("{col_name}:{row_name}")
            auroc.savefig(fn4c, dpi=600, bbox_inches='tight', transparent=True)
            plt.close()
        #########################################################################################################################################################################################################################
        if characterization_df is None:
            characterization_df = protein_df
        else:
            characterization_df = characterization_df.append(protein_df)
    characterization_df.to_csv(os.path.join(characterization_out_dir, 'Characterization_Data.csv'), sep='\t', header=True, index=False,
                               columns=['Protein', 'Distance Model', 'Tree Type', 'Scoring Metric', 'Method', 'Distance', 'Sequence_Separation', 'AUROC'])

Characterizing protein: 2b59
Distance model: identity, ET dist: False
Tree construction: et
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Importing the PDB file took 0.0017962058385213215 min
<class 'SupportingClasses.SeqAlignment.SeqAlignment'>
Removing gaps took 0.0005907098452250163 min




Importing the PDB file took 0.0013850212097167968 min
Mapping query sequence and pdb took 0.0027697404225667317 min
Computing the distance matrix based on the PDB file took 0.0039414207140604654 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.



Mapping query sequence and pdb took 0.0021351774533589682 min
Computing the distance matrix based on the PDB file took 0.0015633900960286458 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace anal



Removing gaps took 0.002110143502553304 min
Importing the PDB file took 0.0015636205673217774 min
Mapping query sequence and pdb took 0.00416949192682902 min




Computing the distance matrix based on the PDB file took 0.014155419667561848 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
S



Importing the PDB file took 0.003706196943918864 min
<class 'SupportingClasses.SeqAlignment.SeqAlignment'>
Removing gaps took 0.00602957010269165 min
Importing the PDB file took 0.0019708553949991862 min
Mapping query sequence and pdb took 0.008848496278127034 min
Computing the distance matrix based on the PDB file took 0.007801071802775065 min




Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same param



Removing gaps took 0.010118826230367025 min
Importing the PDB file took 0.0007925192515055338 min
Mapping query sequence and pdb took 0.01166535218556722 min




Computing the distance matrix based on the PDB file took 0.011488986015319825 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
S



Removing gaps took 0.028707818190256754 min
Importing the PDB file took 0.0010918021202087403 min
Mapping query sequence and pdb took 0.03124358654022217 min




Computing the distance matrix based on the PDB file took 0.01274248758951823 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Sc



Removing gaps took 0.03830460707346598 min
Importing the PDB file took 0.0012879610061645509 min
Mapping query sequence and pdb took 0.04090494712193807 min




Computing the distance matrix based on the PDB file took 0.01266096035639445 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Sc



Removing gaps took 0.0422515074412028 min
Importing the PDB file took 0.001342904567718506 min




Mapping query sequence and pdb took 0.04684976736704508 min
Computing the distance matrix based on the PDB file took 0.021047155062357586 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysi



Removing gaps took 0.06559314330418904 min
Importing the PDB file took 0.0005948146184285482 min
Mapping query sequence and pdb took 0.06802994807561238 min
Computing the distance matrix based on the PDB file took 0.001495679219563802 min
Scoring metric: plain_entropy




Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this loc



Importing the PDB file took 0.0041049599647521974 min
<class 'SupportingClasses.SeqAlignment.SeqAlignment'>
Removing gaps took 0.07236009438832601 min
Importing the PDB file took 0.0021209200223286945 min
Mapping query sequence and pdb took 0.07675887743631998 min




Computing the distance matrix based on the PDB file took 0.11188989082972209 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Sc



Removing gaps took 0.08637921412785848 min
Importing the PDB file took 0.0008933305740356446 min
Mapping query sequence and pdb took 0.08946357170740764 min




Computing the distance matrix based on the PDB file took 0.005767917633056641 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
S



Importing the PDB file took 0.005696499347686767 min
<class 'SupportingClasses.SeqAlignment.SeqAlignment'>
Removing gaps took 0.13378031651178995 min
Importing the PDB file took 0.003286643822987874 min




Mapping query sequence and pdb took 0.13995036284128826 min
Computing the distance matrix based on the PDB file took 0.04038008451461792 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: plain_entropy
Evolutionary Trace analysis



Importing the PDB file took 0.0062994877497355144 min
<class 'SupportingClasses.SeqAlignment.SeqAlignment'>
Removing gaps took 0.09878825743993123 min




Importing the PDB file took 0.004011897246042887 min
Mapping query sequence and pdb took 0.10498187144597372 min
Computing the distance matrix based on the PDB file took 0.045100049177805586 min
Scoring metric: plain_entropy
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: normalized_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Scoring metric: filtered_average_product_corrected_mutual_information
Evolutionary Trace analysis with the same parameters already saved to this location.
Tree construction: upgma
Scoring metric: identity
Evolutionary Trace analysis with the same parameters already saved to this location.
Sco

In [100]:
import numpy as np
protein_order = summary['Protein_ID']
# Plot AUROC Data by Scoring Metric
auroc = sns.catplot(data=characterization_df, x='Scoring Metric', y='AUROC', legend=True, legend_out=True, sharex=True, sharey=False)
auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
plt.show()
auroc.savefig(os.path.join(characterization_out_dir, 'AUROC_Metric_Comparison.png'), dpi=600, bbox_inches='tight', transparent=True)
plt.close()
# Plot AUROC Data by Scoring Metric add hue (sequence separation)
auroc = sns.catplot(data=characterization_df, x='Scoring Metric', y='AUROC', hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
auroc.savefig(os.path.join(characterization_out_dir, 'AUROC_Metric_Comparison_Hue.png'), dpi=600, bbox_inches='tight', transparent=True)
plt.close()
# Plot AUROC Data by Scoring Metric add col (distance model)
auroc = sns.catplot(data=characterization_df, x='Scoring Metric', y='AUROC', col='Distance Model', col_order=distance_model_order,
                    hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("{col_var}:{col_name}")
auroc.savefig(os.path.join(characterization_out_dir, 'AUROC_Metric_Comparison_Col.png'), dpi=600, bbox_inches='tight', transparent=True)
plt.close()
# Plot AUROC Data by Scoring Metric add row (tree type)
auroc = sns.catplot(data=characterization_df, x='Scoring Metric', y='AUROC', row='Tree Type', row_order=tree_type_order, col='Distance Model', col_order=distance_model_order,
                    hue='Sequence_Separation', hue_order=sequence_separation_order, legend=True, legend_out=True, sharex=True, sharey=False)
auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("{col_name}:{row_name}")
auroc.savefig(os.path.join(characterization_out_dir, 'AUROC_Metric_Comparison_Row.png'), dpi=600, bbox_inches='tight', transparent=True)
plt.close()
for separation in sequence_separation_order:
    sub_df = characterization_df[characterization_df['Sequence_Separation'] == separation]
    # Plot AUROC Data by Scoring Metric
    auroc = sns.catplot(data=sub_df, x='Scoring Metric', y='AUROC', legend=True, legend_out=True, sharex=True, sharey=False)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
    plt.show()
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    # Plot AUROC Data by Scoring metric summarizing by mean/median
    auroc = sns.catplot(data=sub_df[['Scoring Metric', 'AUROC']], x='Scoring Metric', y='AUROC', legend=True, legend_out=True, sharex=True, sharey=False, estimator=np.mean)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
    plt.show()
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Mean.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    auroc = sns.catplot(data=sub_df[['Scoring Metric', 'AUROC']], x='Scoring Metric', y='AUROC', legend=True, legend_out=True, sharex=True, sharey=False, estimator=np.median)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
    plt.show()
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Median.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    # Plot AUROC Data by Scoring Metric add hue (sequence separation)
    auroc = sns.catplot(data=sub_df, x='Scoring Metric', y='AUROC', hue='Protein', hue_order=protein_order, legend=True, legend_out=True, sharex=True, sharey=False)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Hue.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    # Plot AUROC Data by Scoring Metric add hue (sequence separation) summarizing by mean/median
    auroc = sns.catplot(data=sub_df[['Scoring Metric', 'AUROC', 'Protein']], x='Scoring Metric', y='AUROC', hue='Protein', hue_order=protein_order, legend=True, legend_out=True, sharex=True, sharey=False, estimator=np.mean)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Hue_Mean.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    auroc = sns.catplot(data=sub_df[['Scoring Metric', 'AUROC', 'Protein']], x='Scoring Metric', y='AUROC', hue='Protein', hue_order=protein_order, legend=True, legend_out=True, sharex=True, sharey=False, estimator=np.median)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("Scoring Metric Comparison")
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Hue_Median.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    # Plot AUROC Data by Scoring Metric add col (distance model)
    auroc = sns.catplot(data=sub_df, x='Scoring Metric', y='AUROC', col='Distance Model', col_order=distance_model_order,
                        hue='Protein', hue_order=protein_order, legend=True, legend_out=True, sharex=True, sharey=False)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("{col_var}:{col_name}")
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Col.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    # Plot AUROC Data by Scoring Metric add col (distance model) summarizing by mean/median
    auroc = sns.catplot(data=sub_df[['Scoring Metric', 'AUROC', 'Protein', 'Distance Model']], x='Scoring Metric', y='AUROC', col='Distance Model', col_order=distance_model_order,
                        hue='Protein', hue_order=protein_order, legend=True, legend_out=True, sharex=True, sharey=False, estimator=np.mean)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("{col_var}:{col_name}")
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Col_Mean.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    auroc = sns.catplot(data=sub_df[['Scoring Metric', 'AUROC', 'Protein', 'Distance Model']], x='Scoring Metric', y='AUROC', col='Distance Model', col_order=distance_model_order,
                        hue='Protein', hue_order=protein_order, legend=True, legend_out=True, sharex=True, sharey=False, estimator=np.median)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("{col_var}:{col_name}")
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Col_Median.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    # Plot AUROC Data by Scoring Metric add row (tree type)
    auroc = sns.catplot(data=sub_df, x='Scoring Metric', y='AUROC', row='Tree Type', row_order=tree_type_order, col='Distance Model', col_order=distance_model_order,
                        hue='Protein', hue_order=protein_order, legend=True, legend_out=True, sharex=True, sharey=False)
    auroc.set(ylim=(0, 1)).set_xticklabels(rotation=90).set_titles("{col_name}:{row_name}")
    auroc.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Metric_Comparison_Row.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()

  


In [98]:
import pandas as pd
from scipy.stats import wilcoxon
characterization_df['AUROC_Rank'] = characterization_df.groupby(['Protein', 'Sequence_Separation'])['AUROC'].rank(method='max')
# method_order = ['identity_upgma_identity', 'identity_upgma_plain_entropy', 'identity_upgma_mutual_information', 'identity_upgma_normalized_mutual_information',
#                 'identity_upgma_average_product_corrected_mutual_information', 'identity_upgma_filtered_average_product_corrected_mutual_information', 'identity_et_identity', 'identity_et_plain_entropy',
#                 'identity_et_mutual_information', 'identity_et_normalized_mutual_information', 'identity_et_average_product_corrected_mutual_information',
#                 'identity_et_filtered_average_product_corrected_mutual_information', 'identity_agglomerative_identity', 'identity_agglomerative_plain_entropy', 'identity_agglomerative_mutual_information',
#                 'identity_agglomerative_normalized_mutual_information', 'identity_agglomerative_average_product_corrected_mutual_information',
#                 'identity_agglomerative_filtered_average_product_corrected_mutual_information', 'blosum62_similarity_upgma_identity', 'blosum62_similarity_upgma_plain_entropy',
#                 'blosum62_similarity_upgma_mutual_information', 'blosum62_similarity_upgma_normalized_mutual_information', 'blosum62_similarity_upgma_average_product_corrected_mutual_information',
#                 'blosum62_similarity_upgma_filtered_average_product_corrected_mutual_information', 'blosum62_similarity_et_identity', 'blosum62_similarity_et_plain_entropy', 'blosum62_similarity_et_mutual_information',
#                 'blosum62_similarity_et_normalized_mutual_information', 'blosum62_similarity_et_average_product_corrected_mutual_information',
#                 'blosum62_similarity_et_filtered_average_product_corrected_mutual_information', 'blosum62_similarity_agglomerative_identity', 'blosum62_similarity_agglomerative_plain_entropy',
#                 'blosum62_similarity_agglomerative_mutual_information', 'blosum62_similarity_agglomerative_normalized_mutual_information', 'blosum62_similarity_agglomerative_average_product_corrected_mutual_information',
#                 'blosum62_similarity_agglomerative_filtered_average_product_corrected_mutual_information', 'blosum62_upgma_identity', 'blosum62_upgma_plain_entropy', 'blosum62_upgma_mutual_information',
#                 'blosum62_upgma_normalized_mutual_information', 'blosum62_upgma_average_product_corrected_mutual_information', 'blosum62_upgma_filtered_average_product_corrected_mutual_information',
#                 'blosum62_et_identity', 'blosum62_et_plain_entropy', 'blosum62_et_mutual_information', 'blosum62_et_normalized_mutual_information', 'blosum62_et_average_product_corrected_mutual_information',
#                 'blosum62_et_filtered_average_product_corrected_mutual_information', 'blosum62_agglomerative_identity', 'blosum62_agglomerative_plain_entropy', 'blosum62_agglomerative_mutual_information',
#                 'blosum62_agglomerative_normalized_mutual_information', 'blosum62_agglomerative_average_product_corrected_mutual_information', 'blosum62_agglomerative_filtered_average_product_corrected_mutual_information']
method_order = ['identity_upgma_identity', 'identity_et_identity', 'identity_agglomerative_identity', 'blosum62_similarity_upgma_identity', 'blosum62_similarity_et_identity', 'blosum62_similarity_agglomerative_identity',
                'blosum62_upgma_identity', 'blosum62_et_identity', 'blosum62_agglomerative_identity', 'identity_upgma_plain_entropy', 'identity_et_plain_entropy', 'identity_agglomerative_plain_entropy',
                'blosum62_similarity_upgma_plain_entropy', 'blosum62_similarity_et_plain_entropy', 'blosum62_similarity_agglomerative_plain_entropy', 'blosum62_upgma_plain_entropy', 'blosum62_et_plain_entropy',
                'blosum62_agglomerative_plain_entropy', 'identity_upgma_mutual_information', 'identity_et_mutual_information', 'identity_agglomerative_mutual_information', 'blosum62_similarity_upgma_mutual_information',
                'blosum62_similarity_et_mutual_information', 'blosum62_similarity_agglomerative_mutual_information', 'blosum62_upgma_mutual_information', 'blosum62_et_mutual_information',
                'blosum62_agglomerative_mutual_information', 'identity_upgma_normalized_mutual_information', 'identity_et_normalized_mutual_information', 'identity_agglomerative_normalized_mutual_information',
                'blosum62_similarity_upgma_normalized_mutual_information', 'blosum62_similarity_et_normalized_mutual_information', 'blosum62_similarity_agglomerative_normalized_mutual_information',
                'blosum62_upgma_normalized_mutual_information', 'blosum62_et_normalized_mutual_information', 'blosum62_agglomerative_normalized_mutual_information',
                'identity_upgma_average_product_corrected_mutual_information', 'identity_et_average_product_corrected_mutual_information', 'identity_agglomerative_average_product_corrected_mutual_information',
                'blosum62_similarity_upgma_average_product_corrected_mutual_information', 'blosum62_similarity_et_average_product_corrected_mutual_information',
                'blosum62_similarity_agglomerative_average_product_corrected_mutual_information', 'blosum62_upgma_average_product_corrected_mutual_information', 'blosum62_et_average_product_corrected_mutual_information',
                'blosum62_agglomerative_average_product_corrected_mutual_information', 'identity_upgma_filtered_average_product_corrected_mutual_information',
                'identity_et_filtered_average_product_corrected_mutual_information', 'identity_agglomerative_filtered_average_product_corrected_mutual_information',
                'blosum62_similarity_upgma_filtered_average_product_corrected_mutual_information', 'blosum62_similarity_et_filtered_average_product_corrected_mutual_information',
                'blosum62_similarity_agglomerative_filtered_average_product_corrected_mutual_information', 'blosum62_upgma_filtered_average_product_corrected_mutual_information',
                'blosum62_et_filtered_average_product_corrected_mutual_information', 'blosum62_agglomerative_filtered_average_product_corrected_mutual_information']
dims = (11.25, 6.0)
_, ax = plt.subplots(figsize=dims)
g = sns.boxplot(data=characterization_df, x='AUROC', y='Method', hue='Sequence_Separation', order=method_order, hue_order=sequence_separation_order, width=1.6, orient='h', ax=ax)
# g.set_xticklabels(g.get_xticklabels(),rotation=90)
g.legend(loc='center right', bbox_to_anchor=(1.5, 0.5), ncol=1)
plt.savefig(os.path.join(characterization_out_dir, 'AUROC_Method_Comparison.png'), dpi=600, bbox_inches='tight', transparent=True)
plt.close()
_, ax = plt.subplots(figsize=dims)
g = sns.boxplot(data=characterization_df, x='AUROC_Rank', y='Method', hue='Sequence_Separation', order=method_order, hue_order=sequence_separation_order, width=1.6, orient='h', ax=ax)
# g.set_xticklabels(g.get_xticklabels(),rotation=90)
g.legend(loc='center right', bbox_to_anchor=(1.5, 0.5), ncol=1)
plt.savefig(os.path.join(characterization_out_dir, 'AUROC_Rank_Method_Comparison.png'), dpi=600, bbox_inches='tight', transparent=True)
plt.close()
for separation in sequence_separation_order:
    sub_df = characterization_df[characterization_df['Sequence_Separation'] == separation]
    _, ax = plt.subplots(figsize=dims)
    g = sns.boxplot(data=sub_df, x='AUROC', y='Method', order=method_order, orient='h', ax=ax, color='black')
    # g.set_xticklabels(g.get_xticklabels(),rotation=90)
    plt.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Method_Comparison.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    auroc_stats = {'Method1': [], 'Method2': [], 'Statistic': [], 'P-Value': []}
    _, ax = plt.subplots(figsize=dims)
    g = sns.boxplot(data=sub_df, x='AUROC_Rank', y='Method', order=method_order, orient='h', ax=ax, color='black')
    # g.set_xticklabels(g.get_xticklabels(),rotation=90)
    plt.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Rank_Method_Comparison.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    
    heatmap_df1 = sub_df[['Method', 'Protein', 'AUROC']].pivot(index='Protein', columns='Method', values='AUROC')
    sns.heatmap(data=heatmap_df1, vmin=0.0, vmax=1.0, cmap='seismic', center=0.5, cbar=True, square=True)
    plt.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Method_Heatmap.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    
    heatmap_df2 = sub_df[['Method', 'Protein', 'AUROC_Rank']].pivot(index='Protein', columns='Method', values='AUROC_Rank')
    sns.heatmap(data=heatmap_df2, vmin=0.0, vmax=np.max(sub_df['AUROC_Rank']), cmap='seismic', center=np.mean(sub_df['AUROC_Rank']), cbar=True, square=True)
    plt.savefig(os.path.join(characterization_out_dir, '{}_AUROC_Rank_Method_Heatmap.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
    
    auroc_stats = {'Method1': [], 'Method2': [], 'AUROC_Statistic': [], 'AUROC_P-Value': [], 'Rank_Statistic': [], 'Rank_P-Value': []}
    for i in range(len(method_order)):
        for j in range(i + 1, len(method_order)):
            auroc_stats['Method1'].append(method_order[i])
            auroc_stats['Method2'].append(method_order[j])
            aurocs1 = sub_df[sub_df['Method'] == method_order[i]]['AUROC'].values
            aurocs2 = sub_df[sub_df['Method'] == method_order[j]]['AUROC'].values
            stat, pval = wilcoxon(aurocs1, aurocs2)
            auroc_stats['AUROC_Statistic'].append(stat)
            auroc_stats['AUROC_P-Value'].append(pval)
            ranks1 = sub_df[sub_df['Method'] == method_order[i]]['AUROC_Rank'].values
            ranks2 = sub_df[sub_df['Method'] == method_order[j]]['AUROC_Rank'].values
            stat2, pval2 = wilcoxon(aurocs1, aurocs2)
            auroc_stats['Rank_Statistic'].append(stat2)
            auroc_stats['Rank_P-Value'].append(pval2)
    pd.DataFrame(auroc_stats).to_csv(os.path.join(characterization_out_dir, '{}_AUROC_Method_Statistics.csv'.format(separation)), sep='\t', index=False, header=True, columns=['Method1', 'Method2', 'AUROC_Statistic', 'AUROC_P-Value', 'RANK_Statistic', 'Rank_P-Value'])

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
  r_plus = np.sum((d > 0) * r, axis=0)
  r_minus = np.sum((d < 0) * r, axis=0)
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
  r_plus = np.sum((d > 0) * r,

In [89]:
characterization_df['AUROC_Max'] = characterization_df['AUROC_Rank'] == max(characterization_df['AUROC_Rank'])
print(sum(characterization_df['AUROC_Max']))
# Compare which distance models achieved the highest scores
best_aurocs = characterization_df.groupby(['Distance Model', 'Sequence_Separation'])['AUROC_Max'].sum().reset_index()
for separation in sequence_separation_order:
    sub_df = best_aurocs[best_aurocs['Sequence_Separation'] == separation]
    print(sub_df)
    _, ax = plt.subplots(figsize=dims)
    g = sns.barplot(data=sub_df, x='AUROC_Max', y='Distance Model', order=distance_model_order, orient='h', ax=ax, color='black')
    plt.savefig(os.path.join(characterization_out_dir, '{}_Best_AUROC_Distance_Model_Comparison.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
# Compare which tree construction method achieved the highest scores
best_aurocs = characterization_df.groupby(['Tree Type', 'Sequence_Separation'])['AUROC_Max'].sum().reset_index()
for separation in sequence_separation_order:
    print(sub_df)
    sub_df = best_aurocs[best_aurocs['Sequence_Separation'] == separation]
    _, ax = plt.subplots(figsize=dims)
    g = sns.barplot(data=sub_df, x='AUROC_Max', y='Tree Type', order=tree_type_order, orient='h', ax=ax, color='black')
    plt.savefig(os.path.join(characterization_out_dir, '{}_Best_AUROC_Tree_Type_Comparison.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
# Compare which scoring metric achieved the highest scores
best_aurocs = characterization_df.groupby(['Scoring Metric', 'Sequence_Separation'])['AUROC_Max'].sum().reset_index()
for separation in sequence_separation_order:
    print(sub_df)
    sub_df = best_aurocs[best_aurocs['Sequence_Separation'] == separation]
    _, ax = plt.subplots(figsize=dims)
    g = sns.barplot(data=sub_df, x='AUROC_Max', y='Scoring Metric', order=scoring_metric_order, orient='h', ax=ax, color='black')
    plt.savefig(os.path.join(characterization_out_dir, '{}_Best_AUROC_Scoring_Metric_Comparison.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()
# Compare which overall method achieved the highest scores
best_aurocs = characterization_df.groupby(['Method', 'Sequence_Separation'])['AUROC_Max'].sum().reset_index()
for separation in sequence_separation_order:
    print(sub_df)
    sub_df = best_aurocs[best_aurocs['Sequence_Separation'] == separation]
    _, ax = plt.subplots(figsize=dims)
    g = sns.barplot(data=sub_df, x='AUROC_Max', y='Method', order=method_order, orient='h', ax=ax, color='black')
    plt.savefig(os.path.join(characterization_out_dir, '{}_Best_AUROC_Method_Comparison.png'.format(separation)), dpi=600, bbox_inches='tight', transparent=True)
    plt.close()

156
         Distance Model Sequence_Separation  AUROC_Max
0              blosum62                 Any       14.0
5   blosum62_similarity                 Any        9.0
10             identity                 Any        8.0
         Distance Model Sequence_Separation  AUROC_Max
3              blosum62           Neighbors       16.0
8   blosum62_similarity           Neighbors        7.0
13             identity           Neighbors        9.0
         Distance Model Sequence_Separation  AUROC_Max
4              blosum62               Short       10.0
9   blosum62_similarity               Short       12.0
14             identity               Short       11.0
         Distance Model Sequence_Separation  AUROC_Max
2              blosum62              Medium       10.0
7   blosum62_similarity              Medium       11.0
12             identity              Medium        9.0
         Distance Model Sequence_Separation  AUROC_Max
1              blosum62                Long       12.0
6   bl