# HMM build and analysis

Here we will construct and analyze HMM models using MSA as the input data. We will utilize several databases as sources to enhance our analysis and gain valuable insights from the generated HMM profiles.


In [1]:
import os
import re
import subprocess
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
import ipywidgets as widgets
from functions import get_fasta, process_hmmsearch_file

In [2]:
local_path = os.getcwd()
local_path

'/Users/alina/HMM'

In [28]:
disordered = pd.read_csv("disordered_df.csv")
disordered.head()

Unnamed: 0,query_id,subject_id,query_len,hsp_len,query_seq,match_seq,subject_seq,query_start,query_end,subject_start,subject_end,identity,positive,gaps,eval,bit_score,count
0,Q9H832,A0A6J2FM24,354,356,MAESPTEEAATA--GAGAAGPGASSVAGVVGVSGSGGGFGPPFLPD...,MAESPTEEAATA GAGAAGPGAS V GVVGVSGSG FGPPFLPD...,MAESPTEEAATATAGAGAAGPGASGVTGVVGVSGSG--FGPPFLPD...,1,354,1,354,350,350,4,0.0,1851.0,200
1,Q9H832,A0A3Q7W6Y2,354,356,MAESPTEEAATA--GAGAAGPGASSVAGVVGVSGSGGGFGPPFLPD...,MAESPTEEAATA GAGA GPGAS VAGVVGVSGSG FGPPFLPD...,MAESPTEEAATATAGAGATGPGASGVAGVVGVSGSG--FGPPFLPD...,1,354,1,354,350,350,4,0.0,1851.0,200
2,Q9H832,A0A2U3VK69,354,356,MAESPTEEAATA--GAGAAGPGASSVAGVVGVSGSGGGFGPPFLPD...,MAESPTEEAATA GAGAAGPGAS V GVVGVSGSG FGPPFLPD...,MAESPTEEAATATAGAGAAGPGASGVTGVVGVSGSG--FGPPFLPD...,1,354,1,354,350,350,4,0.0,1851.0,200
3,Q9H832,A0A2Y9JVH5,354,358,MAESPTEEAATA----GAGAAGPGASSVAGVVGVSGSGGGFGPPFL...,MAESPTEEAATA GAGAAGPGAS VAGVVGVSGSG FGPPFL...,MAESPTEEAATATATAGAGAAGPGASGVAGVVGVSGSG--FGPPFL...,1,354,1,356,351,351,6,0.0,1854.0,200
4,Q9H832,A0A8C7ALE4,354,358,MAESPTEEAATA----GAGAAGPGASSVAGVVGVSGSGGGFGPPFL...,MAESPTEEAATA GAGAAGPGAS VAGVVGVSGSG FGPPFL...,MAESPTEEAATATATAGAGAAGPGASGVAGVVGVSGSG--FGPPFL...,1,354,1,356,351,351,6,0.0,1854.0,200


In [29]:
# Dropdown list of query IDs for disordered regions
output = widgets.Select(
    options=disordered["query_id"].unique(),
    rows=10,
    description='Query ID: ',
    layout={'width': 'max-content'},
    disabled=False
)
display(output)

Select(description='Query ID: ', layout=Layout(width='max-content'), options=('Q9H832', 'Q8IW19', 'Q99967', 'Q…

In [147]:
# Input parameters
q_id = output.value
i = 1
hmm_file = f'{local_path}/results/disordered/hmmbuild/{q_id}_{i}.hmm'
align_file = f'{local_path}/results/alignments/output_files/disordered/{q_id}_{i}.fasta'

Using the MSA in .fasta format as an input data, we generate HMM model with `hmmbuild` command.

In [148]:
# Build HMM 
!hmmbuild {hmm_file} {align_file}

# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             /Users/alina/HMM/results/alignments/output_files/disordered/Q99967_1.fasta
# output HMM file:                  /Users/alina/HMM/results/disordered/hmmbuild/Q99967_1.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1     Q99967_1               201    50    50     0.82  1.105 

# CPU time: 0.02u 0.00s 00:00:00.02 Elapsed: 00:00:00.02


Here we should pay attention at the occasional difference between the values of `alen` and `mlen` which stand for aligned sequence and consensus sequence lengths respectfully. If they differ, we handle the sequences with the deletions.

In [149]:
# Analysis of the model
!hmmstat {hmm_file}

# hmmstat :: display summary statistics for a profile file
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# idx  name                 accession        nseq eff_nseq      M relent   info p relE compKL
# ---- -------------------- ------------ -------- -------- ------ ------ ------ ------ ------
1      Q99967_1             -                 201     0.82     50   1.10   1.07   1.05   0.05


In this part, we analyze the statistics of the generated HMM and focus on key metrics:
- `eff_nseq`: effective sequence number. The smaller value is suggesting a relatively low similarity among the sequences.
- `relent`: relative entropy of the match state. The value is 0.59, indicating a moderate conservation of residues.
- `compKL`: Kullback-Leibler divergence from the average composition. The value is low, indicating a better alignment fit with the background model. 

## 1. HMMsearch

After building the model, our objective is to assess if overlaps with the profiles in Reference Proteome exist. Then, we enrich the model by utilizing a protein database such as the Reference Proteome. For both databases, we generate dataframes containing the most significant sequences, utilizing a default E-value threshold of 0.01.

### Reference Proteome

The objective here is to enhance the quality of the model by conducting additional model training using a larger set of homologous sequences.

In [154]:
# Copy the file to remote computer
!scp {local_path}/results/disordered/hmmbuild/{q_id}_{i}.hmm alina@echidna:~/{q_id}_{i}.hmm

Q99967_1.hmm                                  100%   24KB   4.3MB/s   00:00    


In [155]:
# HMM search against Reference Proteome 15%
!ssh alina@echidna "/software/packages/hmmer/hmmer-3.3.1/usr/bin/hmmsearch {q_id}_{i}.hmm /db/rp/rp-seqs-15.fasta.gz > hmmsearch_rp_15_{q_id}_{i}.txt"

In [156]:
# HMM search against Reference Proteome 75%
!ssh alina@echidna "/software/packages/hmmer/hmmer-3.3.1/usr/bin/hmmsearch {q_id}_{i}.hmm /db/rp/rp-seqs-75.fasta.gz > hmmsearch_rp_75_{q_id}_{i}.txt"

In [157]:
# Copy results to the local folder
!scp alina@echidna:~/hmmsearch_rp_15_{q_id}_{i}.txt {local_path}/results/disordered/hmmsearch/
!scp alina@echidna:~/hmmsearch_rp_75_{q_id}_{i}.txt {local_path}/results/disordered/hmmsearch/

hmmsearch_rp_15_Q99967_1.txt                  100%  166KB   6.6MB/s   00:00    
hmmsearch_rp_75_Q99967_1.txt                  100% 1300KB  10.6MB/s   00:00    


In [158]:
stats_rp_15 = process_hmmsearch_file(f"{local_path}/results/disordered/hmmsearch/hmmsearch_rp_15_{q_id}_{i}.txt")
stats_rp_15

The total number of Reference Proteome hits: 53, the number of unique sequences: 53


Unnamed: 0,E-value,score,bias,E-value.1,score.1,bias.1,exp,N,Sequence,Description
0,8.6e-37,133.7,1.3,1.1e-36,133.3,1.3,1.2,1,M7BZC1,M7BZC1_CHEMY^|^^|^Uncharacterized protein {ECO:00
1,1.5e-36,132.9,1.3,2e-36,132.5,1.3,1.1,1,S9WIN3,S9WIN3_CAMFR^|^^|^Cbp/p300-interacting transactiv
2,2.1000000000000002e-36,132.5,1.3,3.8000000000000004e-36,131.6,1.3,1.5,1,F1R408,F1R408_DANRE^|^^|^Cbp/p300-interacting transactiv
3,2.1000000000000002e-36,132.5,1.3,3.8000000000000004e-36,131.6,1.3,1.5,1,Q5XJD6,Q5XJD6_DANRE^|^^|^Zgc:103418 {ECO:0000313|EMBL:AA
4,2.1000000000000002e-36,132.4,1.3,3.4e-36,131.8,1.3,1.3,1,A0A7N9IHM0,A0A7N9IHM0_MACFA^|^^|^Uncharacterized protein {EC
5,2.1000000000000002e-36,132.4,1.3,3.2e-36,131.9,1.3,1.3,1,A0A7N9CLV6,A0A7N9CLV6_MACFA^|^^|^Cbp/p300 interacting transa
6,2.6e-36,132.2,1.3,4.2000000000000005e-36,131.5,1.3,1.4,1,A0A2K6SUZ1,A0A2K6SUZ1_SAIBB^|^^|^Cbp/p300 interacting transa
7,2.8e-36,132.0,1.3,4.3e-36,131.4,1.3,1.3,1,A0A444U6U9,A0A444U6U9_ACIRT^|^^|^Cbp/p300-interacting transa
8,2.9e-36,132.0,1.3,4.2000000000000005e-36,131.5,1.3,1.3,1,Q9DDW4,Q9DDW4_CHICK^|^^|^Cited2/melanocyte specific gene
9,2.9e-36,132.0,1.3,4.5e-36,131.4,1.3,1.3,1,A0A2I0THR9,A0A2I0THR9_LIMLA^|^^|^Uncharacterized protein {EC


In [159]:
stats_rp_75 = process_hmmsearch_file(f"{local_path}/results/disordered/hmmsearch/hmmsearch_rp_75_{q_id}_{i}.txt")
stats_rp_75

The total number of Reference Proteome hits: 1222, the number of unique sequences: 1222


Unnamed: 0,E-value,score,bias,E-value.1,score.1,bias.1,exp,N,Sequence,Description
0,5.3e-36,134.1,1.3,7e-36,133.8,1.3,1.2,1,A0A6I9Z8Z6,A0A6I9Z8Z6_GEOFO^|^^|^cbp/p300-interacting transa
1,7.4e-36,133.7,1.3,9.8e-36,133.3,1.3,1.2,1,M7BZC1,M7BZC1_CHEMY^|^^|^Uncharacterized protein {ECO:00
2,7.6e-36,133.6,1.3,1.1e-35,133.1,1.3,1.3,1,A0A091NMC4,A0A091NMC4_APAVI^|^^|^Cbp/p300-interacting transa
3,9.6e-36,133.3,1.3,1.7e-35,132.5,1.3,1.4,1,R0JZ77,R0JZ77_ANAPL^|^^|^Cbp/p300-interacting transactiv
4,1e-35,133.2,1.3,1.4e-35,132.8,1.3,1.2,1,A0A093GA40,A0A093GA40_DRYPU^|^^|^Cbp/p300-interacting transa
...,...,...,...,...,...,...,...,...,...,...
1217,0.0033,29.1,4.1,1.5e+04,7.9,0.0,5.7,5,A0A7J7B172,A0A7J7B172_9COLE^|^^|^Uncharacterized protein {EC
1218,0.005,28.6,4.9,2.3e+04,7.2,0.0,5.8,6,A0A4Z1NCC9,A0A4Z1NCC9_9PEZI^|^^|^Phosphoesterase superfamily
1219,0.0058,28.4,0.0,0.0079,27.9,0.0,1.2,1,A0A852E0F6,A0A852E0F6_VIDMA^|^^|^CITE1 protein {ECO:0000313|
1220,0.0058,28.4,0.0,0.0079,27.9,0.0,1.2,1,A0A851L126,A0A851L126_VIDCH^|^^|^CITE1 protein {ECO:0000313|


In [160]:
# # Retrieve a query sequence
# query_sequence = get_fasta(q_id)
# query_lines = query_sequence.split("\n")
# query_sequence = "".join(query_lines[1:])
# query_sequence

In [161]:
# # Retrieve the unaligned sequence from the local machine
# rpalign = f'{local_path}results/alignments/fasta/{q_id}_rpalign.fasta'
#
# with open(rpalign, 'w') as fout:
#     # Write the query sequence to the output file as the first line
#     fout.write(">{}\n{}\n".format(q_id, query_sequence))
#
#     for index, row in stats_rp_15.iterrows():
#         accession = row['Sequence']
#         sequence = get_fasta(accession)
#         if q_id == accession: # remove duplicates
#             continue
#         fout.write(sequence)

## 2. HHblits

HHblits is used for profile-profile sequence alignment. It compares a profile against a target sequence database to find homologous sequences.

In [162]:
!scp {align_file} alina@echidna:~/{q_id}_{i}.fasta

Q99967_1.fasta                                100%   12KB   3.0MB/s   00:00    


In [163]:
# HHblits against Pfam
!ssh alina@echidna "/software/packages/hhsuite/hhsuite-3.0-beta.3-Linux/bin/hhblits -i {q_id}_{i}.fasta -o hhblits_rp_15_{q_id}_{i}.txt -d /db/hhblits/pfamA_35.0/pfam"

# 1. Pfam HMM / RP (both are already prepared files) - or just download Interpro and skip this step
# 2. Compare results (overlap) - or with Interpro (extract the rows by Uniprot ID and only after that filter by Pfam ID)

- 16:39:50.453 INFO: Searching 19632 column state sequences.

- 16:39:50.563 INFO: Q99967_1.fasta is in A2M, A3M or FASTA format

- 16:39:50.564 INFO: Iteration 1

- 16:39:50.622 INFO: Prefiltering database

- 16:39:50.699 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment)  : 100

- 16:39:50.700 INFO: HMMs passed 2nd prefilter (gapped profile-profile alignment)   : 100

- 16:39:50.700 INFO: HMMs passed 2nd prefilter and not found in previous iterations : 100

- 16:39:50.700 INFO: Scoring 100 HMMs using HMM-HMM Viterbi alignment

- 16:39:50.756 INFO: Alternative alignment: 0

- 16:39:52.039 INFO: 100 alignments done

- 16:39:52.039 INFO: Alternative alignment: 1

- 16:39:52.046 INFO: 10 alignments done

- 16:39:52.046 INFO: Alternative alignment: 2

- 16:39:52.046 INFO: Alternative alignment: 3

- 16:39:52.074 INFO: Realigning 10 HMM-HMM alignments using Maximum Accuracy algorithm

- 16:39:52.137 INFO: 1 sequences belonging to 1 database HM

In [164]:
# Copy results to the local folder
!scp alina@echidna:~/hhblits_rp_15_{q_id}_{i}.txt {local_path}/results/disordered/hhblits/

hhblits_rp_15_Q99967_1.txt                    100% 6717     1.9MB/s   00:00    


- `Hit`: contains information about Pfam identifier (starts with PF...), the abbreviated and full name of the domain.
- `Prob`: the probability of the match between the query sequence and the template sequence.
- `E-value`: the expected number of false positive matches that could occur by chance.
- `P-value`: the probability of obtaining a match with a score as good as or better than the observed score purely by chance.
Similar as `E-value`, the lower `P-value` indicate more significant matches.
- `Score`: the quality of the alignment between the query and template sequences.
- `SS (Secondary Structure)`: the predicted secondary structure of the aligned residues in the template sequence.
- `Cols`: the number of aligned columns or residues in the alignment between the query and template sequences.
- `Query HMM`: indicates position matches within HMM profile (input).
- `Template HMM`: indicates position matches within HMM profile (database).
Usually the length of template HMM is bigger than the length of query HMM.

In [165]:
with open(f'{local_path}/results/disordered/hhblits/hhblits_rp_15_{q_id}_{i}.txt', 'r') as file:
    lines = file.readlines()

# Extract the column names
column_names = lines[8].split()[:-4] + ['Query HMM', 'Template HMM']

# Extract the data rows
data_rows = [line.split() for line in lines[9:19]] 
data_rows = [[row[0]] + [' '.join(row[1:4])] + row[7:14] + [' '.join(row[14:16])] 
             for row in data_rows]

# Create the DataFrame
hhblits_stats = pd.DataFrame(data_rows, columns=column_names)
hhblits_stats[["Hit", "Name"]] = hhblits_stats["Hit"].str.split(" ; ", expand=True)
print(f"The total number of HHblits hits on Pfam: {len(hhblits_stats)}, the number of unique domains: {hhblits_stats.Hit.nunique()}")
hhblits_stats

The total number of HHblits hits on Pfam: 10, the number of unique domains: 10


Unnamed: 0,No,Hit,Prob,E-value,P-value,Score,SS,Cols,Query HMM,Template HMM,Name
0,1,PF04487.15,4.399999999999999e-24,8.4e-28,139.2,0.0,44.0,1-44,156-199,(201),CITED
1,2,PF07059.15,4.7,0.0011,21.6,0.0,28.0,9-36,209-240,(242),EDR2_C
2,3,PF08155.14,19.2,5.7,0.0012,17.0,0.0,17,21-37,34-50 (54),NOGCT
3,4,PF08587.14,14.0,0.0029,15.2,0.0,24.0,4-27,1-24,(50),UBA_2
4,5,PF12054.11,25.0,0.0058,19.5,0.0,23.0,6-28,409-431,(433),DUF3535
5,6,PF10415.12,31.0,0.0075,12.3,0.0,19.0,10-28,3-21,(54),FumaraseC_C
6,7,PF15364.9,36.0,0.0076,16.8,0.0,11.0,31-41,92-102,(141),PAXIP1_C
7,8,PF14377.9,4.0,47.0,0.01,11.7,0.0,13,4-16,9-21 (35),UBM
8,9,PF11504.11,62.0,0.01,15.2,0.0,18.0,29-46,55-72,(72),Colicin_Ia
9,10,PF06994.14,61.0,0.011,13.6,0.0,6.0,25-30,33-38,(39),Involucrin2
