# Build a model of a protein family starting from a single sequence


1. Retrieve homologous (similar) sequences
2. Build an MSA from the retrieved sequences
3. Calculate a PSSM or HMM from the MSA
4. Evaluate the model against a ground truth dataset


### Implementation
Each step has alternative implementations, and the choice can affect the accuracy of the model.

**Step 1**
- Blast against a database (SwissProt, UniProt, RefSeq, NR, ReferenceProteomes, ...)

**Step 2**
- Build the MSA from Blast alignments 
- Build MSA realigning Blast hits (ClustalO, T-Coffee, Muscle, ...)
- Edit the MSA to remove poorly conserved columns
- Remove sequence redundancy, similar sequences, to improve model generalization
- Split the MSA, e.g. when there are multiple domains

**Step 3**
- Build an HMM using the HMMER command line
- Build a PSSM using the Blast command line
- Skip this step and use the MSA in step 4 with HMMSEARCH

**Step 4**
- Define a ground truth dataset (Pfam, CATH, keywords, manual curation, ...)
- Calculate precision, recall, MCC, balanced accuracy





In [2]:
from Bio import SeqIO
from Bio.Blast import NCBIXML
import pandas as pd
from pathlib import Path

# Generate a MSA

Align P06621 with the [NCBI Blast webservice](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)

Use two alternative databases:
- SwissProt
- RefSeq selected

Set "Max target sequences" to 5000.

When the search finishes, click "Download" and select the XML format.

In [6]:

# Parse the XML output
data = []

# WARNING: if the code below crashes, remove "CREATE VIEW" lines manually from the XML file

# blast_input = 'data/P06621_sprot.xml'  # SwissProt database
blast_input = 'data/P06621_refseq_selected.xml'  # "RefSeq selected" database

with open(blast_input) as f:
    blast_records = NCBIXML.parse(f)

    # Iterate Psiblast rounds
    for blast_record in blast_records:
        
        # Iterate query alignments
        query_id = blast_record.query
        for i, alignment in enumerate(blast_record.alignments):
            subject_id = alignment.title
            
            for hsp in alignment.hsps:
                data.append((query_id,
                                subject_id,
                                blast_record.query_length,
                                hsp.query,
                                hsp.match,
                                hsp.sbjct,
                                hsp.query_start,
                                hsp.query_end,
                                hsp.sbjct_start,
                                hsp.sbjct_end,
                                hsp.identities,
                                hsp.positives,
                                hsp.gaps,
                                hsp.expect,
                                hsp.score))  # NOTE: hsp.score is the raw score; the bit score is hsp.bits
                
df = pd.DataFrame(data, columns=["query_id", "subject_id", "query_len",
                              "query_seq", "match_seq", "subject_seq",
                              "query_start", "query_end", 
                              "subject_start", "subject_end", "identity", "positive", "gaps", "eval", "bit_score"])
df

Unnamed: 0,query_id,subject_id,query_len,query_seq,match_seq,subject_seq,query_start,query_end,subject_start,subject_end,identity,positive,gaps,eval,bit_score
0,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1545301071|ref|WP_126472623.1| M20/M25/M40 ...,415,MRPSIHRTAIAAVLATAFVAGTALAQKRDNVLFQAATDEQPAVIKT...,MRPSIHRTAIAAVLATAF+AGTALAQKRDNVLFQAATDEQPAVIKT...,MRPSIHRTAIAAVLATAFMAGTALAQKRDNVLFQAATDEQPAVIKT...,1,415,1,415,410,414,0,0.000000e+00,2129.0
1,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1543118946|ref|WP_126023768.1| M20/M25/M40 ...,415,MRPSIHRTAIAAVLATAFVAGTA-LAQKRDNVLFQAATDEQPAVIK...,MRPSIHRTA+AA+LA AFVA A AQKRDNVLFQAATDEQPAVIK...,MRPSIHRTALAALLAAAFVAPAATWAQKRDNVLFQAATDEQPAVIK...,1,415,1,416,407,411,1,0.000000e+00,2097.0
2,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1797811451|ref|WP_159279459.1| M20/M25/M40 ...,415,MRPSIHRTAIAAVLATAFVAGTALAQKRDNVLFQAATDEQPAVIKT...,MRPSIHRTA+AAVLATAFVAG A AQKRDNVLFQAATDEQPAVIKT...,MRPSIHRTAVAAVLATAFVAGGAWAQKRDNVLFQAATDEQPAVIKT...,1,415,1,415,393,408,0,0.000000e+00,1992.0
3,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1480890214|ref|WP_119552493.1| M20/M25/M40 ...,415,MRPSIHRTAIAAVLATAFVAGTALAQKRDNVLFQAATDEQPAVIKT...,MRPSIHRTA+AA+LATAF+A A AQKRDNVLFQAATDEQPAVIKT...,MRPSIHRTALAALLATAFLAPAAWAQKRDNVLFQAATDEQPAVIKT...,1,415,1,415,389,405,0,0.000000e+00,1978.0
4,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1011471139|ref|WP_062367304.1| M20/M25/M40 ...,415,MRPSIHRT-AIAAVLATAFVAGTALAQKRDNVLFQAATDEQPAVIK...,MRP RT A + AQKRDNVLFQAATD QPAV+K...,MRPFNQRTMLAALLATALLAPAAGWAQKRDNVLFQAATDAQPAVLK...,1,415,1,416,374,393,1,0.000000e+00,1938.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1510926788|ref|WP_122926114.1| ArgE/DapE fa...,415,EQPAVIKTLEKLVNIETGT----GDAEGIAAAGNFLEAELKNLGFT...,E+ +I ++ L+ I++ D G +L+A+L +G ...,ERDELIGLVQDLIRIDSVNPYLDADGPGEREMAAYLQAKLVEMGLE...,39,261,6,223,70,117,19,2.135720e-12,190.0
4997,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1460625799|ref|WP_116468663.1| hydrolase [S...,415,LVVGDNIVGKIKGRGGKNLLLMSHMDTVY-LKGILAKAPFRVEGDK...,L G+++ ++ +L HMDTV+ + +R +G ...,LAHGEHLHLVVRPEAPIQMLFTGHMDTVFGADHVFQHGQWRTDG-T...,89,411,78,396,93,150,18,2.144680e-12,190.0
4998,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1718634206|ref|WP_145030192.1| M20/M25/M40 ...,415,LFQAATDEQPAVIKTLEKLVNIETGTGDAEGIAAAGNFLEAELKNL...,+ +A T E A I + +L++I +G+ +G+A +FL +LK ...,MARAKTSENKAAIDLVLELLSISGKSGEEKGVA---DFLVKKLKGA...,32,237,1,210,73,106,24,2.197150e-12,190.0
4999,sp|P06621|CBPG_PSES6 Carboxypeptidase G2 OS=Ps...,gi|1393435437|ref|WP_109798908.1| M20/M25/M40 ...,415,IAAVLATAFVAGTALAQKRDN----VLFQAATDEQPAVIKTLEKLV...,IA+ LA + A A+A+ ++ + EQ +K LE LV...,IASALALSCSAVQAMAKGPESGPEARMIATIDAEQTRTLKFLETLV...,10,339,7,361,96,151,27,2.322320e-12,190.0


In [7]:
# Extract the sequence of subject proteins (hits) aligned with the query to build the MSA

input_path = Path(blast_input)
msa = []

with open("{}/{}_blast_msa.fasta".format(input_path.parent, input_path.stem), "w") as fout:
    for index, row in df.iterrows():
        # Keep only significant hits
        if row["eval"] < 0.001:
            # Start from an all-gap sequence as long as the query
            mapped_seq = ["-"] * int(row["query_len"])
            c = 0  # number of query residues consumed so far
            for l_q, l_s in zip(row['query_seq'], row['subject_seq']):
                # Skip columns where the query has a gap (insertions in the subject)
                if l_q != " " and l_q != '-':
                    mapped_seq[row["query_start"] + c - 1] = l_s if l_s != " " else "-"
                    c += 1
            fout.write(">{}\n{}\n".format(row["subject_id"], "".join(mapped_seq)))
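
To make the mapping in the cell above concrete, here is a minimal sketch on a made-up pairwise alignment. The sequences and the helper name `project_onto_query` are illustrative assumptions, not part of the Blast output. Columns where the query has a gap are dropped, so every projected hit has exactly the query length and residues inserted in the subject are lost.

```python
def project_onto_query(query_aln, subject_aln, query_start, query_len):
    """Project an aligned subject onto query coordinates (1-based query_start)."""
    mapped = ["-"] * query_len
    pos = query_start - 1  # 0-based index of the next query residue
    for q, s in zip(query_aln, subject_aln):
        if q == "-":
            continue  # insertion in the subject: no query column to place it in
        mapped[pos] = s
        pos += 1
    return "".join(mapped)


# Toy example: a 10-residue query, HSP covering query positions 3..8
query_aln = "CDE-FGH"
subject_aln = "CDQWF-H"
print(project_onto_query(query_aln, subject_aln, query_start=3, query_len=10))
# prints --CDQF-H--
```

The `W` inserted in the subject disappears from the projected row. This is one reason why realigning the full-length hits, as done in the next section, may give a different MSA.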



# Generate a better MSA

Generate a new, possibly better, MSA by realigning the full sequences of the hits retrieved by Blast with Clustal Omega.

1. Extract the protein IDs from previous files

```
    # SwissProt IDs
    awk -F"|" '/>/ {print $2}' data/P06621_sprot_blast_msa.fasta > tmp
    
    # RefSeq IDs (ex. WP_094475116.1)
    awk -F"|" '/>/ {print $4}' data/P06621_refseq_selected_blast_msa.fasta > tmp2
   
``` 

2. Download the sequences from UniProt using the **Retrieve/ID mapping** web tool
    * For SwissProt IDs select "From --> UniProtKB, To --> UniProtKB". Download into **P06621_sprot_blast_mapped.fasta**
    * For "RefSeq selected" IDs select "From --> RefSeq Protein, To --> UniProtKB". Select "View UniParc results" to increase the number of retrieved sequences. Download into **P06621_refseq_selected_blast_mapped.fasta**


3. Use [ClustalO](https://www.ebi.ac.uk/Tools/msa/clustalo/) to generate new MSAs. Select the *Fasta* output format and save into
    * **P06621_sprot_blast_clustal.fasta**
    * **P06621_refseq_selected_blast_clustal.fasta**


4. Compare the Blast MSA and the ClustalO MSA in JalView

```
    java -jar jalview-all-2.11.1.4-j1.8.jar
```

# Build PSSMs

To build a PSSM from an MSA file it is necessary to use the command-line version of Blast.

#### Install Blast (~250 MB)
```
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz
tar -xzf ncbi-blast-2.12.0+-x64-linux.tar.gz

```

#### Create a PSSM from a Fasta MSA 

Before running the command, remove *gi|1774282670|ref|WP_153585536.1|* from the **data/P06621_refseq_selected_blast_msa.fasta** file; for reasons that are not clear, this sequence breaks the PSSM build.

The content of the file passed to the `-subject` option is irrelevant; any valid FASTA file will do.
```
ncbi-blast-2.12.0+/bin/psiblast -subject data/P06621.fasta -in_msa data/P06621_sprot_blast_msa.fasta -out_pssm data/P06621_sprot_blast_msa.pssm -out_ascii_pssm data/P06621_sprot_blast_msa.pssm_ascii

ncbi-blast-2.12.0+/bin/psiblast -subject data/P06621.fasta -in_msa data/P06621_refseq_selected_blast_msa.fasta -out_pssm data/P06621_refseq_selected_blast_msa.pssm -out_ascii_pssm data/P06621_refseq_selected_blast_msa.pssm_ascii
```

# Build HMM

Install HMMER (~18 MB)
```
wget http://eddylab.org/software/hmmer/hmmer.tar.gz
tar -xzf hmmer.tar.gz
cd hmmer-3.3.2
./configure
make
cd ..
```

Generate HMMs from MSAs
```
hmmer-3.3.2/src/hmmbuild data/P06621_sprot_blast_msa.hmm data/P06621_sprot_blast_msa.fasta

hmmer-3.3.2/src/hmmbuild data/P06621_refseq_selected_blast_msa.hmm data/P06621_refseq_selected_blast_msa.fasta
```

# Build HMM and PSSM from ClustalO MSAs

Generate PSSMs
```
ncbi-blast-2.12.0+/bin/psiblast -subject data/P06621.fasta -in_msa data/P06621_sprot_blast_clustal.fasta -out_pssm data/P06621_sprot_blast_clustal.pssm -out_ascii_pssm data/P06621_sprot_blast_clustal.pssm_ascii

ncbi-blast-2.12.0+/bin/psiblast -subject data/P06621.fasta -in_msa data/P06621_refseq_selected_blast_clustal.fasta -out_pssm data/P06621_refseq_selected_blast_clustal.pssm -out_ascii_pssm data/P06621_refseq_selected_blast_clustal.pssm_ascii
```

Generate HMMs
```
hmmer-3.3.2/src/hmmbuild data/P06621_sprot_blast_clustal.hmm data/P06621_sprot_blast_clustal.fasta

hmmer-3.3.2/src/hmmbuild data/P06621_refseq_selected_blast_clustal.hmm data/P06621_refseq_selected_blast_clustal.fasta
```

# Evaluation

Compare the performance of your models with the Pfam model **Peptidase_M20.hmm**.

1. Evaluate the number of hits retrieved by your models and by the Pfam domain against the SwissProt database.

    * **PSSMs** You can use the NCBI Blast web service. When you select PSI-BLAST it is possible to provide a PSSM file as input

    * **HMMs** You can use HMMSEARCH on the HMMER web site


2. Evaluate the overlap between the hits retrieved by your models and the hits retrieved by the Pfam model

    * You need to download sequence IDs of the matched sequences 


3. Evaluate False Positive matches (if any) and decide whether they are plausible family members that Pfam misses but your model finds

    * You need additional evidence to judge, e.g. statistical parameters (E-value, score) or curated annotations
    


In [3]:
ground_truth = pd.read_csv('data/Peptidase_M20_sprot_hmmsearch.tsv', sep='\t')
ground_truth


Unnamed: 0,Target Name,Target Accession,Target Length,Query Name,Query Accession,Query Length,E-value,Score,Bias,Domain Index,...,Query Ali. Start,Query Ali. End,Target Ali. Start,Target Ali. End,Target Env. Start,Target Env. End,Acc,Description,Mapped PDB(s),Number of Identical Sequences
0,DAPE_CHESB,DAPE_CHESB,395,Peptidase_M20,-,207,1.200000e-47,167.2,0.0,1,...,1,207,70,392,70,392,0.95,Succinyl-diaminopimelate desuccinylase,,0
1,DAPE_RHILO,DAPE_RHILO,397,Peptidase_M20,-,207,8.400000e-47,164.4,0.1,1,...,1,206,70,391,70,392,0.95,Succinyl-diaminopimelate desuccinylase,,0
2,DAPE_BRUA1,DAPE_BRUA1,395,Peptidase_M20,-,207,1.000000e-46,164.1,0.1,1,...,1,206,70,391,70,392,0.94,Succinyl-diaminopimelate desuccinylase,,0
3,DAPE_BRUA2,DAPE_BRUA2,395,Peptidase_M20,-,207,1.000000e-46,164.1,0.1,1,...,1,206,70,391,70,392,0.94,Succinyl-diaminopimelate desuccinylase,,0
4,DAPE_BRUAB,DAPE_BRUAB,395,Peptidase_M20,-,207,1.000000e-46,164.1,0.1,1,...,1,206,70,391,70,392,0.94,Succinyl-diaminopimelate desuccinylase,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1017,PFF1_PYRTT,PFF1_PYRTT,957,Peptidase_M20,-,207,8.700000e-03,20.7,0.0,1,...,5,83,152,210,129,351,0.81,Vacuolar membrane protease,,0
1018,BSAP_BACSU,BSAP_BACSU,455,Peptidase_M20,-,207,9.500000e-03,20.5,0.1,1,...,3,106,248,330,246,433,0.61,Aminopeptidase YwaD,6hc7_B,0
1019,PFF1_KOMPG,PFF1_KOMPG,990,Peptidase_M20,-,207,9.900000e-03,20.5,0.0,1,...,33,105,237,306,225,739,0.76,Vacuolar membrane protease,,0
1020,LAP4_ARTOC,LAP4_ARTOC,372,Peptidase_M20,-,207,9.900000e-03,20.5,0.0,1,...,2,97,172,259,171,365,0.76,Probable leucine aminopeptidase MCYG_03459,,0


In [5]:
# hmmsearch from the Blast MSA (HMM)
hmmsearch_blast_msa = pd.read_csv('data/P06621_refseq_selected_blast_msa_hmmsearch.tsv', sep='\t')
hmmsearch_blast_msa

Unnamed: 0,Target Name,Target Accession,Target Length,Query Name,Query Accession,Query Length,E-value,Score,Bias,Domain Index,...,Query Ali. Start,Query Ali. End,Target Ali. Start,Target Ali. End,Target Env. Start,Target Env. End,Acc,Description,Mapped PDB(s),Number of Identical Sequences
0,CBPG_PSES6,CBPG_PSES6,415,Query,-,368,6.300000e-144,484.3,1.6,1,...,2,368,44,411,43,411,1.00,Carboxypeptidase G2,7m6u_D,0
1,YQJE_BACSU,YQJE_BACSU,371,Query,-,368,4.700000e-68,234.6,9.2,1,...,2,366,8,367,7,369,0.91,Uncharacterized protein YqjE,,0
2,DAPE2_SHELP,DAPE2_SHELP,377,Query,-,368,8.000000e-46,161.5,0.0,1,...,9,366,8,373,2,375,0.87,Succinyl-diaminopimelate desuccinylase 2,,0
3,DAPE_DECAR,DAPE_DECAR,377,Query,-,368,9.100000e-46,161.3,0.0,1,...,2,367,6,373,5,374,0.87,Succinyl-diaminopimelate desuccinylase,,0
4,DAPE_ALKEH,DAPE_ALKEH,375,Query,-,368,1.500000e-45,160.6,0.0,1,...,11,364,14,369,4,373,0.83,Succinyl-diaminopimelate desuccinylase,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1046,LAP1_SCLS1,LAP1_SCLS1,387,Query,-,368,7.500000e-03,20.1,0.1,1,...,35,149,154,260,133,282,0.79,Leucine aminopeptidase 1,,0
1047,PFF1_PARBD,PFF1_PARBD,992,Query,-,368,8.100000e-03,20.0,0.0,1,...,90,165,177,253,161,262,0.87,Vacuolar membrane protease,,0
1048,PFF1_PARBP,PFF1_PARBP,992,Query,-,368,8.300000e-03,19.9,0.0,1,...,90,165,177,253,161,262,0.87,Vacuolar membrane protease,,0
1049,LAP1_PODAN,LAP1_PODAN,395,Query,-,368,8.300000e-03,19.9,0.1,1,...,45,149,163,261,138,274,0.77,Leucine aminopeptidase 1,,0


In [19]:
# PSI-BLAST search with the Blast MSA (PSSM, increase 20k limit)
# Download --> Hit Table (CSV)
psiblast_blast_pssm = pd.read_csv('data/P06621_refseq_selected_blast_msa_psiblast.csv', header=None)

# Map ids with the UniProt id mapping service 
# Alternatively download a mapping file from here (huge file) 
# https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomics_mapping/

psiblast_blast_pssm[psiblast_blast_pssm[10] < 0.01][1].to_csv(
    'data/P06621_refseq_selected_blast_msa_psiblast.ids', index=False, header=False)

In [22]:
psiblast_blast_pssm = pd.read_csv('data/P06621_refseq_selected_blast_msa_psiblast_mapped.tsv', sep='\t')
psiblast_blast_pssm

Unnamed: 0,yourlist:M202111216320BA52A5CE8FCD097CB85A53697A35307ECBV,Entry,Entry name,Status,Protein names,Gene names,Organism,Length
0,P06621,P06621,CBPG_PSES6,reviewed,Carboxypeptidase G2 (CPDG2) (EC 3.4.17.11) (Fo...,cpg2,Pseudomonas sp. (strain RS-16),415
1,Q5FPX5,Q5FPX5,DAPE_GLUOX,reviewed,Succinyl-diaminopimelate desuccinylase (SDAP d...,dapE GOX1832,Gluconobacter oxydans (strain 621H) (Gluconoba...,381
2,A9ILD7,A9ILD7,DAPE_BART1,reviewed,Succinyl-diaminopimelate desuccinylase (SDAP d...,dapE BT_0051,Bartonella tribocorum (strain CIP 105476 / IBS...,390
3,A5ESQ3,A5ESQ3,DAPE_BRASB,reviewed,Succinyl-diaminopimelate desuccinylase (SDAP d...,dapE BBta_7333,Bradyrhizobium sp. (strain BTAi1 / ATCC BAA-1182),384
4,Q59284,Q59284,DAPE_CORGL,reviewed,Succinyl-diaminopimelate desuccinylase (SDAP d...,dapE Cgl1109 cg1260,Corynebacterium glutamicum (strain ATCC 13032 ...,369
...,...,...,...,...,...,...,...,...
169,Q0SWT9,Q0SWT9,PEPT_CLOPS,reviewed,Peptidase T (EC 3.4.11.4) (Aminotripeptidase) ...,pepT CPR_0029,Clostridium perfringens (strain SM101 / Type A),406
170,A1JL17,A1JL17,DAPE_YERE8,reviewed,Succinyl-diaminopimelate desuccinylase (SDAP d...,dapE YE1147,Yersinia enterocolitica serotype O:8 / biotype...,375
171,O50173,O50173,IAAH_ENTAG,reviewed,Indole-3-acetyl-aspartic acid hydrolase (EC 3....,iaaH,Enterobacter agglomerans (Erwinia herbicola) (...,436
172,Q4QP83,Q4QP83,DAPE_HAEI8,reviewed,Succinyl-diaminopimelate desuccinylase (SDAP d...,dapE NTHI0182,Haemophilus influenzae (strain 86-028NP),377


In [27]:
gt = set(ground_truth[ground_truth['E-value']<0.01]['Target Accession'])
hmm = set(hmmsearch_blast_msa[hmmsearch_blast_msa['E-value']<0.01]['Target Accession'])
pssm = set(psiblast_blast_pssm['Entry name'])

for name, data in zip(['hmm', 'pssm'], [hmm, pssm]):
    print(name, 'FN', len(gt - data))
    print(name, 'TP', len(gt.intersection(data)))
    print(name, 'FP', len(data - gt))

hmm FN 49
hmm TP 970
hmm FP 5
pssm FN 846
pssm TP 173
pssm FP 1
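
The counts above can be turned into the metrics listed in Step 4. The following is a minimal sketch; the function name `classification_metrics` is an assumption made for illustration. MCC and balanced accuracy also need the number of true negatives (SwissProt sequences matched by neither model). The notebook does not report that number, so `tn` is left optional here.

```python
import math


def classification_metrics(tp, fp, fn, tn=None):
    """Return precision, recall and F1; add MCC and balanced accuracy if tn is known."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    metrics = {"precision": precision, "recall": recall, "f1": f1}
    if tn is not None:
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        metrics["balanced_accuracy"] = (recall + specificity) / 2
        metrics["mcc"] = (tp * tn - fp * fn) / denom if denom else 0.0
    return metrics


# Counts printed by the cell above (ground truth = Pfam Peptidase_M20 hits)
counts = {"hmm": dict(tp=970, fp=5, fn=49), "pssm": dict(tp=173, fp=1, fn=846)}
for name, c in counts.items():
    m = classification_metrics(**c)
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both models are very precise with respect to Pfam, but the PSSM recovers only a small fraction of the ground truth, so recall is the metric that separates them.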


# Improve your model

Manually edit the MSA in order to achieve better precision and/or recall

- Identify conserved positions (e.g. evaluate column entropy)
- Remove redundancy (e.g. use JalView or CD-HIT)
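
As an illustration of redundancy removal, the sketch below applies a greedy identity filter to already aligned sequences. It is not how JalView or CD-HIT work internally; the function names, the 0.9 threshold and the toy sequences are assumptions chosen for the example. It compares every sequence with every kept one, so for thousands of hits a dedicated tool is the more practical choice.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both sequences are ungapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def remove_redundancy(records, max_identity=0.9):
    """Greedy filter: keep a sequence only if it is below max_identity to all kept ones."""
    kept = []
    for name, seq in records:
        if all(pairwise_identity(seq, k_seq) < max_identity for _, k_seq in kept):
            kept.append((name, seq))
    return kept


# Toy aligned sequences (made up for illustration)
msa = [
    ("seq1", "MKT-AYIAKQ"),
    ("seq2", "MKT-AYIAKQ"),  # identical to seq1 -> dropped
    ("seq3", "MRTGAYLSKE"),  # 5 of 9 shared columns identical to seq1 -> kept
]
print([name for name, _ in remove_redundancy(msa, max_identity=0.9)])
# prints ['seq1', 'seq3']
```

The result depends on the input order, because the first member of each group of similar sequences is the one that is kept.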


#### Shannon entropy

![Entropy](figures/entropy.png)
scipy.stats.entropy [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html)
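
For reference, the normalized Shannon entropy that the code in this section computes for each alignment column can be written as below. Here $p_a$ is the frequency of amino acid $a$ among the non-gap residues of the column, and terms with $p_a = 0$ contribute zero. The notation is chosen for this note and may differ from the figure.

$$
H = -\sum_{a=1}^{20} p_a \log_{20} p_a
$$

With logarithm base 20, $H$ ranges from 0 (a fully conserved column) to 1 (all 20 amino acids equally frequent). A column with only two residue types in equal proportion gives $H = \log_{20} 2 \approx 0.23$.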

In [7]:
from Bio import AlignIO
import scipy.stats
from collections import Counter
import numpy as np

In [8]:

print(scipy.stats.entropy([1/20 for i in range(20)], base=20))


aa = ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]

seqs = []  # [[...], ...]
# with open("data/P06621_refseq_selected_blast_msa.fasta") as f:
with open("data/P06621_refseq_selected_blast_clustal.fasta") as f:
    for record in AlignIO.read(f, "fasta"):
        seqs.append(list(record.seq))  # store sequence as a list of characters


seqs = np.array(seqs, dtype="str")
for i, column in enumerate(seqs.T):
    
    # count AA in column
    count = Counter(column)
    count.pop('-', None)  # drop gaps; the default avoids a KeyError on gap-free columns
    count_sorted = sorted(count.items(), key=lambda x:x[1], reverse=True)
    
    # count non gap AA
    non_gap = np.count_nonzero(column != "-")
    
    occupancy = non_gap / column.size
    
    # AA probability in column (gap excluded)
#     probabilities = [count.get(k, 0.0) / column.size for k in aa]  
    probabilities = [count.get(k, 0.0) / non_gap for k in aa]  

    # Zero entropy = complete conservation
    entropy = scipy.stats.entropy(probabilities, base=20)  

    print("{} {:>5.2f} {:>5.2f} {}".format(i, 
                                         occupancy, 
                                         entropy,
                                         count_sorted))

0.9999999999999999
0  0.00  0.00 [('M', 2)]
1  0.00  0.23 [('S', 1), ('T', 1)]
2  0.00  0.21 [('D', 2), ('M', 1)]
3  0.00  0.00 [('T', 1)]
4  0.00  0.21 [('P', 2), ('S', 1)]
5  0.00  0.37 [('H', 1), ('D', 1), ('S', 1)]
6  0.01  0.20 [('M', 17), ('E', 1), ('T', 1), ('P', 1)]
7  0.01  0.50 [('S', 7), ('N', 6), ('T', 4), ('K', 1), ('H', 1), ('Q', 1)]
8  0.00  0.00 [('D', 1)]
9  0.00  0.00 [('P', 1)]
10  0.01  0.32 [('D', 16), ('M', 2), ('S', 2), ('A', 1), ('E', 1)]
11  0.01  0.58 [('T', 9), ('P', 4), ('Q', 2), ('I', 2), ('A', 2), ('S', 1), ('G', 1), ('E', 1)]
12  0.01  0.71 [('S', 5), ('T', 5), ('M', 3), ('Q', 3), ('H', 3), ('V', 2), ('F', 1), ('A', 1), ('E', 1), ('R', 1)]
13  0.01  0.52 [('P', 13), ('M', 10), ('Q', 5), ('S', 4), ('E', 1), ('R', 1), ('A', 1)]
14  0.01  0.69 [('E', 8), ('S', 6), ('Q', 5), ('K', 4), ('T', 4), ('D', 2), ('V', 2), ('P', 1), ('A', 1), ('N', 1)]
15  0.01  0.65 [('R', 13), ('G', 7), ('A', 3), ('T', 3), ('P', 2), ('K', 2), ('Q', 1), ('H', 1), ('I', 1), ('D', 1), 

191  0.01  0.12 [('S', 29), ('G', 4)]
192  0.01  0.51 [('P', 11), ('E', 10), ('D', 8), ('A', 3), ('R', 2), ('V', 1)]
193  0.01  0.00 [('R', 35)]
194  0.01  0.00 [('V', 35)]
195  0.01  0.23 [('R', 29), ('L', 2), ('G', 2), ('T', 1), ('A', 1)]
196  0.01  0.52 [('D', 11), ('S', 10), ('A', 7), ('R', 3), ('E', 3), ('G', 1)]
197  0.01  0.14 [('L', 31), ('I', 3), ('V', 1)]
198  0.01  0.04 [('V', 34), ('I', 1)]
199  0.01  0.30 [('A', 25), ('D', 6), ('S', 2), ('E', 1), ('T', 1)]
200  0.01  0.17 [('D', 30), ('H', 3), ('A', 2)]
201  0.01  0.00 [('L', 35)]
202  0.01  0.42 [('Q', 15), ('R', 13), ('H', 4), ('S', 2), ('G', 1)]
203  0.01  0.33 [('S', 23), ('A', 8), ('G', 2), ('P', 1), ('N', 1)]
204  0.01  0.12 [('W', 32), ('Q', 2), ('G', 1)]
205  0.01  0.04 [('G', 34), ('T', 1)]
206  0.01  0.04 [('P', 34), ('G', 1)]
207  0.01  0.00 [('A', 35)]
208  0.01  0.50 [('V', 14), ('R', 10), ('Q', 4), ('E', 4), ('L', 2), ('A', 1)]
209  0.01  0.12 [('A', 32), ('V', 2), ('T', 1)]
210  0.01  0.27 [('S', 27), ('D', 

391  0.10  0.83 [('M', 94), ('R', 47), ('F', 46), ('E', 36), ('L', 33), ('P', 26), ('S', 18), ('I', 16), ('T', 14), ('K', 11), ('A', 11), ('G', 10), ('H', 10), ('V', 10), ('Q', 9), ('C', 3), ('W', 2), ('Y', 1), ('N', 1), ('D', 1)]
392  0.11  0.81 [('S', 86), ('M', 66), ('P', 51), ('R', 44), ('T', 41), ('A', 26), ('N', 24), ('L', 18), ('Q', 16), ('K', 15), ('F', 12), ('D', 6), ('I', 6), ('V', 4), ('H', 4), ('G', 3), ('E', 2), ('C', 1)]
393  0.08  0.80 [('F', 70), ('R', 54), ('L', 31), ('H', 29), ('S', 23), ('P', 19), ('V', 16), ('A', 15), ('T', 14), ('I', 11), ('M', 6), ('K', 6), ('Q', 4), ('G', 4), ('Y', 3), ('D', 2), ('C', 1)]
394  0.08  0.76 [('P', 71), ('L', 55), ('S', 47), ('R', 34), ('M', 27), ('K', 19), ('A', 18), ('T', 16), ('I', 8), ('F', 7), ('V', 5), ('H', 4), ('N', 4), ('Q', 3), ('D', 2), ('Y', 1)]
395  0.06  0.68 [('R', 83), ('A', 43), ('P', 28), ('M', 20), ('S', 17), ('T', 13), ('F', 12), ('V', 5), ('H', 4), ('L', 3), ('I', 3), ('G', 2), ('Q', 2), ('K', 1), ('N', 1)]
396  

517  0.99  0.78 [('R', 716), ('E', 609), ('T', 485), ('A', 477), ('D', 333), ('Q', 302), ('K', 261), ('P', 229), ('S', 119), ('G', 90), ('L', 57), ('H', 40), ('V', 28), ('Y', 22), ('N', 18), ('F', 10), ('M', 6), ('W', 6), ('I', 2), ('C', 2)]
518  0.99  0.49 [('L', 2079), ('W', 815), ('V', 200), ('I', 177), ('Y', 157), ('M', 152), ('F', 86), ('S', 60), ('A', 45), ('T', 36), ('C', 6), ('H', 2)]
519  0.99  0.33 [('V', 2836), ('I', 335), ('A', 254), ('S', 127), ('T', 120), ('C', 107), ('L', 27), ('M', 7), ('N', 1), ('G', 1)]
520  0.99  0.72 [('N', 1030), ('E', 937), ('A', 406), ('R', 276), ('S', 268), ('D', 194), ('L', 140), ('G', 126), ('T', 122), ('K', 91), ('Q', 78), ('H', 37), ('V', 36), ('M', 32), ('C', 30), ('I', 6), ('P', 3), ('Y', 2), ('F', 1)]
521  0.99  0.62 [('I', 1684), ('C', 608), ('V', 384), ('L', 284), ('T', 247), ('Q', 160), ('M', 114), ('A', 105), ('H', 103), ('R', 60), ('F', 22), ('K', 22), ('Y', 9), ('S', 8), ('E', 5), ('W', 3)]
522  0.99  0.38 [('E', 1968), ('D', 987), 

638  0.01  0.63 [('K', 23), ('R', 6), ('Q', 5), ('H', 3), ('F', 3), ('V', 3), ('T', 2), ('S', 2), ('Y', 1), ('E', 1), ('D', 1), ('I', 1)]
639  1.00  0.74 [('R', 935), ('V', 854), ('L', 420), ('I', 412), ('H', 272), ('E', 246), ('F', 144), ('K', 142), ('Y', 98), ('T', 78), ('Q', 70), ('A', 53), ('S', 51), ('D', 26), ('M', 15), ('C', 6), ('G', 4), ('W', 2), ('N', 2)]
640  0.00  0.00 [('Y', 1)]
641  1.00  0.65 [('A', 1512), ('W', 582), ('L', 501), ('I', 301), ('G', 261), ('V', 254), ('F', 157), ('C', 68), ('R', 45), ('T', 32), ('M', 29), ('Y', 23), ('S', 17), ('N', 11), ('E', 11), ('K', 10), ('P', 6), ('D', 6), ('H', 2)]
642  0.98  0.68 [('R', 1365), ('T', 858), ('E', 350), ('S', 343), ('V', 180), ('K', 132), ('D', 105), ('Q', 101), ('A', 73), ('H', 56), ('L', 32), ('I', 30), ('N', 29), ('Y', 26), ('P', 21), ('C', 17), ('W', 17), ('F', 16), ('M', 15), ('G', 11)]
643  0.94  0.83 [('F', 859), ('L', 527), ('V', 404), ('Y', 257), ('H', 228), ('R', 210), ('N', 207), ('K', 158), ('I', 149), ('A

732  0.35  0.62 [('D', 483), ('E', 348), ('K', 179), ('I', 96), ('R', 58), ('Q', 45), ('A', 36), ('V', 29), ('N', 17), ('H', 15), ('T', 13), ('S', 11), ('P', 5), ('G', 3), ('M', 2), ('L', 2), ('F', 1), ('C', 1)]
733  0.52  0.50 [('G', 944), ('D', 572), ('E', 203), ('N', 96), ('A', 73), ('S', 29), ('R', 24), ('P', 19), ('K', 18), ('T', 12), ('Q', 9), ('V', 5), ('M', 1), ('I', 1)]
734  0.00  0.23 [('G', 1), ('M', 1)]
735  0.00  0.23 [('H', 1), ('R', 1)]
736  0.00  0.23 [('G', 1), ('M', 1)]
737  0.00  0.00 [('N', 1)]
738  0.00  0.00 [('P', 1)]
739  0.00  0.00 [('S', 1)]
740  0.00  0.00 [('G', 1)]
741  0.13  0.53 [('G', 216), ('D', 113), ('N', 82), ('S', 24), ('E', 19), ('A', 9), ('T', 7), ('P', 3), ('V', 3), ('K', 2), ('R', 2), ('Q', 2), ('H', 2), ('F', 1)]
742  0.00  0.30 [('D', 10), ('S', 2), ('V', 1), ('E', 1)]
743  0.01  0.59 [('D', 21), ('E', 8), ('P', 5), ('A', 5), ('G', 4), ('V', 2), ('Q', 1), ('T', 1), ('R', 1), ('S', 1)]
744  0.16  0.68 [('D', 157), ('G', 153), ('E', 105), ('N', 

848  1.00  0.14 [('G', 3499), ('S', 204), ('A', 73), ('L', 12), ('N', 8), ('T', 6), ('H', 4), ('Y', 4), ('E', 3), ('F', 3), ('Q', 3), ('D', 3), ('K', 2), ('P', 2), ('V', 2), ('I', 2)]
849  0.00  0.21 [('Y', 2), ('S', 1)]
850  0.00  0.00 [('A', 2)]
851  1.00  0.27 [('S', 3203), ('T', 182), ('K', 74), ('H', 69), ('L', 69), ('D', 66), ('A', 59), ('R', 37), ('N', 22), ('E', 21), ('Q', 9), ('G', 7), ('M', 4), ('V', 4), ('P', 3), ('Y', 1)]
852  1.00  0.61 [('P', 2115), ('L', 390), ('R', 154), ('A', 148), ('V', 148), ('I', 134), ('H', 129), ('D', 124), ('G', 93), ('F', 93), ('T', 70), ('Q', 58), ('E', 51), ('K', 49), ('S', 20), ('M', 20), ('N', 19), ('Y', 12), ('C', 2), ('W', 1)]
853  0.92  0.71 [('S', 903), ('T', 700), ('G', 606), ('A', 525), ('D', 138), ('E', 126), ('H', 123), ('V', 107), ('Y', 101), ('F', 59), ('I', 30), ('L', 27), ('N', 26), ('W', 22), ('R', 17), ('K', 15), ('M', 12), ('Q', 7), ('C', 4)]
854  0.10  0.70 [('L', 128), ('S', 53), ('I', 46), ('A', 38), ('T', 33), ('V', 32), (

984  0.00  0.00 [('E', 1)]
985  0.00  0.00 [('D', 1)]
986  0.00  0.00 [('P', 1)]
987  0.00  0.00 [('A', 1)]
988  0.00  0.00 [('G', 1)]
989  0.00  0.00 [('L', 1)]
990  0.00  0.00 [('G', 1)]
991  0.00  0.00 [('R', 1)]
992  0.00  0.00 [('L', 1)]
993  0.00  0.00 [('S', 1)]
994  0.00  0.00 [('N', 1)]
995  0.00  0.00 [('T', 1)]
996  0.00  0.00 [('A', 1)]
997  0.00  0.00 [('Y', 1)]
998  0.00  0.00 [('D', 1)]
999  0.00  0.00 [('F', 1)]
1000  0.00  0.00 [('A', 1)]
1001  0.00  0.00 [('T', 1)]
1002  1.00  0.58 [('F', 1294), ('Y', 1263), ('W', 322), ('A', 274), ('L', 236), ('V', 198), ('I', 98), ('M', 86), ('S', 28), ('G', 18), ('C', 8), ('P', 6), ('H', 4), ('T', 2)]
1003  1.00  0.86 [('T', 599), ('R', 547), ('E', 504), ('D', 327), ('S', 312), ('N', 255), ('K', 229), ('V', 185), ('Q', 171), ('H', 153), ('L', 130), ('Y', 117), ('A', 94), ('I', 72), ('F', 72), ('M', 30), ('G', 15), ('C', 15), ('W', 10)]
1004  1.00  0.46 [('L', 1592), ('V', 1171), ('I', 717), ('M', 147), ('F', 110), ('A', 87), ('C', 

1152  1.00  0.85 [('E', 876), ('R', 520), ('Q', 362), ('K', 302), ('A', 269), ('V', 212), ('G', 183), ('T', 180), ('D', 168), ('H', 167), ('S', 163), ('I', 115), ('L', 108), ('M', 47), ('N', 43), ('Y', 32), ('C', 25), ('F', 24), ('W', 17), ('P', 17)]
1153  1.00  0.58 [('V', 1446), ('I', 767), ('A', 518), ('L', 381), ('G', 377), ('F', 241), ('C', 39), ('M', 29), ('S', 17), ('W', 11), ('T', 3), ('R', 1)]
1154  0.00  0.21 [('L', 2), ('R', 1)]
1155  0.06  0.03 [('L', 234), ('M', 1), ('I', 1), ('W', 1)]
1156  0.06  0.16 [('S', 209), ('T', 18), ('C', 5), ('V', 3), ('I', 1), ('L', 1)]
1157  0.06  0.10 [('M', 226), ('V', 5), ('S', 2), ('A', 2), ('L', 1), ('G', 1), ('Q', 1), ('C', 1)]
1158  0.93  0.27 [('D', 2791), ('N', 365), ('E', 232), ('G', 159), ('S', 16), ('Y', 11), ('L', 9), ('H', 7), ('A', 1), ('V', 1), ('T', 1)]
1159  0.94  0.55 [('V', 1585), ('L', 767), ('I', 446), ('A', 268), ('F', 200), ('M', 169), ('T', 58), ('C', 36), ('S', 35), ('Y', 9), ('R', 9), ('G', 7), ('W', 2), ('H', 2), ('

1233  0.00  0.17 [('G', 4), ('A', 1)]
1234  0.00  0.17 [('T', 4), ('Q', 1)]
1235  0.00  0.32 [('F', 3), ('V', 1), ('A', 1)]
1236  0.00  0.44 [('P', 2), ('T', 1), ('G', 1), ('Y', 1)]
1237  0.00  0.23 [('S', 1), ('A', 1)]
1238  0.00  0.23 [('R', 1), ('S', 1)]
1239  0.00  0.23 [('Q', 1), ('L', 1)]
1240  0.00  0.00 [('V', 1)]
1241  0.00  0.00 [('R', 1)]
1242  0.00  0.00 [('Q', 1)]
1243  0.00  0.23 [('R', 1), ('A', 1)]
1244  0.00  0.23 [('D', 1), ('R', 1)]
1245  0.00  0.23 [('E', 1), ('A', 1)]
1246  0.00  0.23 [('Q', 1), ('R', 1)]
1247  0.00  0.46 [('Q', 1), ('I', 1), ('R', 1), ('Y', 1)]
1248  0.01  0.60 [('E', 10), ('A', 10), ('S', 9), ('R', 2), ('L', 1), ('Q', 1), ('K', 1), ('G', 1), ('H', 1), ('N', 1)]
1249  0.07  0.62 [('N', 88), ('K', 74), ('T', 22), ('G', 16), ('D', 15), ('S', 13), ('Q', 10), ('R', 8), ('E', 4), ('A', 3), ('H', 3), ('V', 1), ('I', 1)]
1250  0.25  0.79 [('K', 149), ('Q', 145), ('A', 144), ('R', 112), ('N', 91), ('E', 88), ('D', 45), ('H', 45), ('S', 42), ('L', 38), ('P

1346  0.00  0.00 [('M', 1)]
1347  0.00  0.00 [('P', 1)]
1348  0.00  0.00 [('T', 1)]
1349  0.00  0.00 [('E', 1)]
1350  0.00  0.00 [('G', 1)]
1351  0.00  0.00 [('V', 1)]
1352  0.00  0.00 [('V', 1)]
1353  0.00  0.00 [('R', 1)]
1354  0.84  0.80 [('P', 573), ('D', 480), ('E', 471), ('G', 391), ('T', 243), ('A', 240), ('S', 187), ('K', 170), ('R', 147), ('L', 95), ('Q', 57), ('N', 49), ('V', 38), ('M', 32), ('H', 21), ('I', 9), ('F', 7), ('Y', 4), ('W', 1)]
1355  0.46  0.77 [('P', 473), ('E', 306), ('A', 159), ('T', 157), ('D', 139), ('S', 120), ('K', 90), ('G', 81), ('Q', 54), ('V', 51), ('R', 36), ('H', 27), ('N', 22), ('L', 11), ('M', 9), ('I', 7), ('Y', 3), ('F', 3), ('W', 2), ('C', 1)]
1356  0.99  0.53 [('L', 1867), ('I', 723), ('V', 605), ('P', 187), ('M', 182), ('F', 68), ('A', 53), ('T', 36), ('Y', 20), ('S', 19), ('R', 14), ('Q', 10), ('K', 8), ('E', 8), ('C', 7), ('G', 5), ('H', 4), ('D', 2), ('N', 1), ('W', 1)]
1357  0.99  0.86 [('E', 561), ('T', 516), ('G', 437), ('A', 332), ('R'

1415  0.00  0.32 [('A', 4), ('P', 2), ('T', 1)]
1416  0.00  0.14 [('D', 6), ('A', 1)]
1417  0.00  0.14 [('G', 6), ('E', 1)]
1418  0.00  0.27 [('S', 5), ('G', 1), ('T', 1)]
1419  0.00  0.00 [('K', 7)]
1420  0.00  0.39 [('L', 4), ('E', 1), ('D', 1), ('V', 1)]
1421  0.02  0.47 [('E', 53), ('D', 19), ('N', 9), ('P', 5), ('Q', 4), ('K', 3), ('S', 2), ('V', 1)]
1422  0.99  0.00 [('E', 3808), ('K', 2), ('N', 1)]
1423  0.99  0.59 [('Y', 1332), ('H', 1156), ('F', 470), ('W', 281), ('T', 201), ('Q', 103), ('R', 80), ('S', 40), ('N', 31), ('K', 26), ('I', 21), ('D', 20), ('C', 12), ('V', 12), ('E', 11), ('A', 7), ('L', 7), ('G', 1), ('M', 1)]
1424  0.99  0.56 [('V', 1108), ('I', 991), ('L', 904), ('A', 402), ('M', 183), ('C', 84), ('G', 66), ('T', 60), ('S', 7), ('F', 7)]
1425  0.00  0.00 [('L', 1)]
1426  0.00  0.00 [('A', 1)]
1427  0.00  0.00 [('L', 1)]
1428  0.00  0.00 [('R', 1)]
1429  0.00  0.00 [('K', 1)]
1430  0.00  0.00 [('P', 1)]
1431  0.00  0.00 [('P', 1)]
1432  0.00  0.00 [('K', 1)]
1433

1515  0.01  0.56 [('A', 8), ('R', 6), ('S', 6), ('P', 3), ('D', 1), ('T', 1), ('V', 1)]
1516  0.01  0.73 [('D', 10), ('G', 5), ('A', 4), ('R', 4), ('H', 4), ('T', 3), ('P', 3), ('V', 2), ('S', 1), ('K', 1), ('E', 1)]
1517  0.01  0.73 [('R', 9), ('G', 6), ('T', 4), ('A', 3), ('E', 2), ('Q', 2), ('D', 2), ('S', 1), ('H', 1), ('L', 1), ('I', 1), ('V', 1)]
1518  0.01  0.52 [('P', 9), ('A', 7), ('S', 5), ('R', 2), ('T', 2), ('H', 1)]
1519  0.01  0.43 [('G', 16), ('V', 3), ('A', 3), ('Q', 1), ('N', 1), ('L', 1), ('D', 1)]
1520  0.01  0.67 [('E', 6), ('C', 4), ('A', 3), ('W', 3), ('Q', 2), ('T', 2), ('S', 1), ('D', 1), ('Y', 1)]
1521  0.01  0.62 [('N', 5), ('D', 5), ('E', 5), ('A', 3), ('S', 1), ('C', 1), ('P', 1), ('T', 1)]
1522  0.01  0.65 [('H', 8), ('R', 6), ('A', 3), ('T', 2), ('P', 1), ('Q', 1), ('N', 1), ('I', 1), ('D', 1), ('G', 1)]
1523  0.01  0.62 [('A', 7), ('P', 5), ('G', 5), ('D', 3), ('E', 2), ('S', 1), ('V', 1), ('T', 1)]
1524  0.01  0.45 [('R', 14), ('T', 2), ('W', 2), ('G', 2

1765  0.00  0.00 [('R', 1)]
1766  0.00  0.00 [('F', 1)]
1767  0.00  0.00 [('A', 1)]
1768  0.00  0.00 [('R', 1)]
1769  0.00  0.00 [('N', 1)]
1770  0.00  0.00 [('V', 1)]
1771  0.00  0.00 [('F', 1)]
1772  0.00  0.00 [('A', 1)]
1773  0.00  0.00 [('T', 1)]
1774  0.00  0.00 [('N', 1)]
1775  0.00  0.00 [('F', 1)]
1776  0.00  0.00 [('S', 1)]
1777  0.00  0.00 [('S', 1)]
1778  0.00  0.00 [('G', 1)]
1779  0.00  0.00 [('H', 1)]
1780  0.00  0.00 [('I', 1)]
1781  0.00  0.00 [('A', 1)]
1782  0.00  0.00 [('R', 1)]
1783  0.00  0.00 [('G', 1)]
1784  0.00  0.00 [('V', 1)]
1785  0.00  0.00 [('T', 1)]
1786  0.00  0.00 [('R', 1)]
1787  0.00  0.00 [('A', 1)]
1788  0.00  0.00 [('G', 1)]
1789  0.00  0.00 [('W', 1)]
1790  0.00  0.00 [('Y', 1)]
1791  0.00  0.00 [('R', 1)]
1792  0.00  0.00 [('L', 1)]
1793  0.00  0.00 [('A', 1)]
1794  0.00  0.00 [('T', 1)]
1795  0.00  0.00 [('G', 1)]
1796  0.00  0.00 [('G', 1)]
1797  0.00  0.00 [('Q', 1)]
1798  0.00  0.00 [('Q', 1)]
