# Class : Profile HMMs

---

Today we will review profile HMMs in class, including a demonstration of how we can implement profile HMMs using our existing framework. We will then explore HMMER (http://hmmer.org/), a tool for building and using profile HMMs.

This is a diagram of the hidden Markov model used in HMMER (from the HMMER User Guide by Sean Eddy). The chain of match (M), insert (I), and deletion (D) states can be extended to match the length of the multiple sequence alignment that is used as the training set to produce a model. Individual sequences may then be aligned to the model and scored based on the probability that the model would emit that sequence.

<center><img src='./figures/PMC2691815_gkp120f1.png'/></center>
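To make the chain structure concrete, here is a minimal sketch that enumerates the transitions between neighbouring nodes of a profile HMM. It uses the seven transition types that appear later in the HMM file header (m->m, m->i, m->d, i->m, i->i, d->m, d->d). The function name `profile_hmm_transitions` is ours, and the sketch deliberately leaves out the begin, end, and flanking states, so treat it as an illustration rather than HMMER's actual model.

```python
def profile_hmm_transitions(length):
    """List the transitions of a simplified profile HMM core.

    Each node k has a match (Mk), insert (Ik), and delete (Dk) state.
    Begin/end and flanking states are omitted in this sketch.
    """
    transitions = []
    for k in range(1, length):
        m, i, d = f"M{k}", f"I{k}", f"D{k}"
        next_m, next_d = f"M{k + 1}", f"D{k + 1}"
        transitions += [
            (m, next_m),  # m->m
            (m, i),       # m->i
            (m, next_d),  # m->d
            (i, next_m),  # i->m
            (i, i),       # i->i (self-loop: longer insertions)
            (d, next_m),  # d->m
            (d, next_d),  # d->d
        ]
    return transitions


for src, dst in profile_hmm_transitions(3):
    print(f"{src} -> {dst}")
```

Note that there is no direct insert-to-delete or delete-to-insert transition in this layout, which is why only seven transition columns are needed per node.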

Related to HMMER is the Pfam database. Pfam (protein families) is a curated collection of hidden Markov models for protein families and domains. HMMER is used to build the HMMs from multiple sequence alignments, and one of the HMMER programs can be used to compare a protein sequence to the models in the database. You can run searches at http://pfam.xfam.org/search and get the data from http://pfam.xfam.org/.

Today we will be using this tool as a demonstration of a profile HMM.

To begin, install HMMER in your environment: `$ conda install -c bioconda hmmer`

Next, we will build our own HMM with HMMER from a multiple sequence alignment (an example of which appeared in your slides for today). A common tool for generating multiple sequence alignments is ClustalW; however, we will be writing our own tool for multiple sequence alignment in the upcoming weeks!

To get started, let's assume that we have a multiple sequence alignment of members of the globin gene family. We will then use HMMER to search for additional remote homologues of this family.

We will use the hmmbuild program to accomplish this first step. Explore the hmmbuild interface and the alignment file below. Then use hmmbuild to build an HMM called globins4.hmm.

In [2]:
! cat data/globins4.sto

# STOCKHOLM 1.0

HBB_HUMAN   ........VHLTPEEKSAVTALWGKV....NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVL
HBA_HUMAN   .........VLSPADKTNVKAAWGKVGA..HAGEYGAEALERMFLSFPTTKTYFPHF.DLS.....HGSAQVKGHGKKVA
MYG_PHYCA   .........VLSEGEWQLVLHVWAKVEA..DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL
GLB5_PETMA  PIVDTGSVAPLSAAEKTKIRSAWAPVYS..TYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADQLKKSADVRWHAERII

HBB_HUMAN   GAFSDGLAHL...D..NLKGTFATLSELHCDKL..HVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL
HBA_HUMAN   DALTNAVAHV...D..DMPNALSALSDLHAHKL..RVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVL
MYG_PHYCA   TALGAILKK....K.GHHEAELKPLAQSHATKH..KIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDI
GLB5_PETMA  NAVNDAVASM..DDTEKMSMKLRDLSGKHAKSF..QVDPQYFKVLAAVIADTVAAG.........DAGFEKLMSMICILL

HBB_HUMAN   AHKYH......
HBA_HUMAN   TSKYR......
MYG_PHYCA   AAKYKELGYQG
GLB5_PETMA  RSAY.......
//
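The alignment above is in interleaved Stockholm format: each sequence name recurs once per block, and the `//` line ends the alignment. A minimal sketch of a reader is shown below. The function name `parse_stockholm` and the toy alignment are ours, and the sketch simply skips every line that starts with `#`, so any annotation lines are ignored.

```python
def parse_stockholm(text):
    """Parse interleaved Stockholm text into {name: aligned_sequence}."""
    seqs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, header, and markup lines
        if line == "//":
            break  # end of the alignment
        name, chunk = line.split(None, 1)
        # Append this block's chunk to whatever we already have for the name.
        seqs[name] = seqs.get(name, "") + chunk.strip()
    return seqs


# Toy alignment (hypothetical, not from data/globins4.sto).
toy = """# STOCKHOLM 1.0

seqA  AC.DE
seqB  ACGDE

seqA  FG
seqB  F.
//
"""
alignment = parse_stockholm(toy)
print(alignment)
```

To try it on the real file, read `data/globins4.sto` with `open(...).read()` and pass the text to the same function.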



In [3]:
! hmmbuild -h

# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmbuild [-options] <hmmfile_out> <msafile>

Basic options:
  -h     : show brief help on version and usage
  -n <s> : name the HMM <s>
  -o <f> : direct summary output to file <f>, not stdout
  -O <f> : resave annotated, possibly modified MSA to file <f>

Options for selecting alphabet rather than guessing it:
  --amino : input alignment is protein sequence data
  --dna   : input alignment is DNA sequence data
  --rna   : input alignment is RNA sequence data

Alternative model construction strategies:
  --fast           : assign cols w/ >= symfrac residues as consensus  [default]
  --hand           : manual construction (requires reference annotation)
  --symfrac <x>    : sets sym 

In [7]:
# Build your hmm here

! hmmbuild globins4.hmm data/globins4.sto

# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             data/globins4.sto
# output HMM file:                  globins4.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1     globins4                 4   171   149     0.96  0.589 

# CPU time: 0.25u 0.00s 00:00:00.25 Elapsed: 00:00:00.25
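The summary reports an alignment length (alen) of 171 but a model length (mlen) of 149, so 22 alignment columns were not turned into match states. The `--fast` rule in the help text assigns a column as consensus when at least a fraction `symfrac` of the sequences have a residue in that column. The sketch below applies that rule to a toy alignment. The default of 0.5 is our assumption (the help output above is truncated), and we count every sequence equally, whereas HMMER also applies sequence weights, so this is not guaranteed to reproduce hmmbuild's choices exactly.

```python
GAP_CHARS = set(".-_~")  # characters treated as gaps in this sketch


def consensus_columns(alignment, symfrac=0.5):
    """Return indices of columns whose residue fraction is >= symfrac.

    `alignment` maps names to equal-length aligned strings.
    symfrac=0.5 is an assumed default; sequences are unweighted here.
    """
    seqs = list(alignment.values())
    ncols = len(seqs[0])
    columns = []
    for j in range(ncols):
        residues = sum(1 for s in seqs if s[j] not in GAP_CHARS)
        if residues / len(seqs) >= symfrac:
            columns.append(j)
    return columns


# Toy alignment (hypothetical): column 2 is mostly gaps.
toy = {"s1": "AC.DE", "s2": "ACGDE", "s3": "A..DE", "s4": "-C.D."}
match_cols = consensus_columns(toy)
print(match_cols, "model length:", len(match_cols))
```

Columns that fail the test are modelled by insert states instead of match states, which is why the model is shorter than the alignment.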


Next, we compress and index the HMM file with hmmpress. The resulting binary files are required by hmmscan; hmmsearch reads the text HMM file directly.

In [8]:
! hmmpress globins4.hmm

Working...    done.
Pressed and indexed 1 HMMs (1 names).
Models pressed into binary file:   globins4.hmm.h3m
SSI index for binary model file:   globins4.hmm.h3i
Profiles (MSV part) pressed into:  globins4.hmm.h3f
Profiles (remainder) pressed into: globins4.hmm.h3p


Take a look at the HMM generated. Does this look similar to our HMMs?

In [9]:
! cat globins4.hmm

HMMER3/f [3.2.1 | June 2018]
NAME  globins4
LENG  149
ALPH  amino
RF    no
MM    no
CONS  yes
CS    no
MAP   yes
DATE  Wed Feb 13 13:35:11 2019
NSEQ  4
EFFN  0.964844
CKSUM 2027839109
STATS LOCAL MSV       -9.9014  0.70957
STATS LOCAL VITERBI  -10.7224  0.70957
STATS LOCAL FORWARD   -4.1637  0.70957
HMM          A        C        D        E        F        G        H        I        K        L        M        N        P        Q        R        S        T        V        W        Y   
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   2.36553  4.52577  2.96709  2.70473  3.20818  3.02239  3.41069  2.90041  2.55332  2.35210  3.67329  3.19812  3.45595  3.16091  3.07934  2.66722  2.85475  2.56965  4.55393  3.62921
          2.68640  4.42247  2.77497  2.73145  3.46376  2.40504  3.72516  3.29302  2.67763  2.69377  4.24712  2.90369  2.73719  3.18168  2.89823  2.37879  2.77497  2.98431  4.58499  3.61525
          0.57544  1.78073  1.31293  1.75577  0.18968  0.0000
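One difference from the HMMs we wrote in class is how the numbers are stored: the HMMER3 file records each probability as a negative natural log, so a value `s` corresponds to a probability `exp(-s)`. As a quick check, the sketch below converts the COMPO line shown above (the model's average match emission composition) back into probabilities. The variable names are ours.

```python
import math

# Residue order from the "HMM" header line of the file above.
AMINO = "ACDEFGHIKLMNPQRSTVWY"

# COMPO line copied from the globins4.hmm output above.
compo_line = (
    "COMPO   2.36553  4.52577  2.96709  2.70473  3.20818  3.02239  3.41069 "
    "2.90041  2.55332  2.35210  3.67329  3.19812  3.45595  3.16091  3.07934 "
    "2.66722  2.85475  2.56965  4.55393  3.62921"
)

scores = [float(x) for x in compo_line.split()[1:]]
# Each stored value is -ln(p), so p = exp(-value).
probs = {aa: math.exp(-s) for aa, s in zip(AMINO, scores)}

for aa, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(aa, round(p, 4))
print("sum:", round(sum(probs.values()), 3))
```

The probabilities sum to approximately 1, as a distribution over the 20 amino acids should.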

Now we will use our HMM to search a database of proteins for similar domains. To do this, we will use the UniProt (Swiss-Prot) database (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz). Download and uncompress this database into your data folder.

We will use the hmmsearch program. The key value in its output is the first column, the E-value: the expected number of false positives that would score this well or better. A lower E-value means a more significant match. The second column is the log-odds score (in bits) for the complete sequence being scanned. This corresponds to the log-odds of the forward probability described in the slides.
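The relationship between these two columns can be sketched in a few lines. The first function turns two log-likelihoods into a log-odds score in bits. The second turns a bit score into an E-value, assuming (as we understand the HMMER3 file format) that the two numbers on the `STATS LOCAL FORWARD` line of the HMM file are the location `tau` and slope `lambda` of an exponential tail for Forward scores. The function names and the database size `n_targets` are hypothetical; the real search uses the actual number of target sequences.

```python
import math


def log_odds_bits(log_p_model, log_p_null):
    """Log-odds score in bits from natural-log likelihoods."""
    return (log_p_model - log_p_null) / math.log(2)


def evalue_from_score(score_bits, tau, lam, n_targets):
    """E-value = P(score >= observed) * number of target sequences.

    Assumes an exponential tail: P(S >= x) = exp(-lam * (x - tau)).
    """
    p_value = min(1.0, math.exp(-lam * (score_bits - tau)))
    return p_value * n_targets


# tau and lambda copied from "STATS LOCAL FORWARD -4.1637 0.70957" above.
TAU, LAMBDA = -4.1637, 0.70957
N_TARGETS = 500_000  # hypothetical database size, for illustration only

for score in (20.0, 50.0, 222.7):
    print(score, evalue_from_score(score, TAU, LAMBDA, N_TARGETS))
```

Because the tail is exponential, every additional bit of score shrinks the E-value by the same factor, which is why strong hits have such extreme E-values.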

In [11]:
! wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

--2019-02-13 13:36:58--  ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
           => ‘uniprot_sprot.fasta.gz’
Resolving ftp.uniprot.org (ftp.uniprot.org)... 141.161.180.197
Connecting to ftp.uniprot.org (ftp.uniprot.org)|141.161.180.197|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/uniprot/current_release/knowledgebase/complete ... done.
==> SIZE uniprot_sprot.fasta.gz ... 88385361
==> PASV ... done.    ==> RETR uniprot_sprot.fasta.gz ... done.
Length: 88385361 (84M) (unauthoritative)


2019-02-13 13:37:07 (9.64 MB/s) - ‘uniprot_sprot.fasta.gz’ saved [88385361]



In [10]:
! hmmsearch -h

# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>

Basic options:
  -h : show brief help on version and usage

Options directing output:
  -o <f>           : direct output to file <f>, not stdout
  -A <f>           : save multiple alignment of all hits to file <f>
  --tblout <f>     : save parseable table of per-sequence hits to file <f>
  --domtblout <f>  : save parseable table of per-domain hits to file <f>
  --pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>
  --acc            : prefer accessions over names in output
  --noali          : don't output alignments, so output is smaller
  --notextw        : unlimit ASCII text output line width
  --textw <n>      : set max width of

In [12]:
# Search your sequences using your profile HMM here:

! hmmsearch globins4.hmm uniprot_sprot.fasta.gz

# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query HMM file:                  globins4.hmm
# target sequence database:        uniprot_sprot.fasta.gz
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       globins4  [M=149]
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence              Description
    ------- ------ -----    ------- ------ -----   ---- --  --------              -----------
    6.8e-65  222.7   3.2    7.5e-65  222.6   3.2    1.0  1  sp|P02185|MYG_PHYCD    Myoglobin OS=Physeter catodon OX=9755 
    3.5e-63  217.2   0.1    3.9e-63  217.0   0.1    1.0  1  sp|P02024
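For downstream analysis it is easier to work with the table that `--tblout <f>` writes (see the help text above) than with this human-readable report. Below is a minimal sketch of a reader for that table, assuming the usual layout: whitespace-separated columns with target name, target accession, query name, query accession, then the full-sequence E-value, score, and bias, with comment lines starting with `#`. The function name `parse_tblout` is ours. The sample line is hand-written: its first hit values are copied from the output above, and the trailing count columns are placeholders.

```python
def parse_tblout(text):
    """Parse hmmsearch --tblout text into a list of hit dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        # 18 fixed columns, then a free-text description that may hold spaces.
        f = line.split(None, 18)
        hits.append({
            "target": f[0],
            "query": f[2],
            "evalue": float(f[4]),
            "score": float(f[5]),
            "bias": float(f[6]),
        })
    return hits


# Hand-written example line (trailing integer columns are placeholders).
sample = (
    "# target name        accession  query name ...\n"
    "sp|P02185|MYG_PHYCD  -  globins4  -  6.8e-65  222.7  3.2  "
    "7.5e-65  222.6  3.2  1.0  1  0  0  1  1  1  1  "
    "Myoglobin OS=Physeter catodon OX=9755\n"
)
hits = parse_tblout(sample)
significant = [h for h in hits if h["evalue"] < 1e-5]
print(significant)
```

To produce a real table, rerun the search with `--tblout hits.tbl` and pass the contents of that file to the function.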

Finally, we can expand the search of our proteins of interest using the entire Pfam database. Pfam is a collection of profile HMMs built just as we did for the globins. You can get the Pfam HMMs here: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz

To do this, we now want to scan a sequence against the entire profile database, so we use the hmmscan program. We will scan our unknown sequence `data/unknown.fasta` to see what domains it has.

In [14]:
! hmmscan -h

# hmmscan :: search sequence(s) against a profile database
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmscan [-options] <hmmdb> <seqfile>

Basic options:
  -h : show brief help on version and usage

Options controlling output:
  -o <f>           : direct output to file <f>, not stdout
  --tblout <f>     : save parseable table of per-sequence hits to file <f>
  --domtblout <f>  : save parseable table of per-domain hits to file <f>
  --pfamtblout <f> : save table of hits and domains to file, in Pfam format <f>
  --acc            : prefer accessions over names in output
  --noali          : don't output alignments, so output is smaller
  --notextw        : unlimit ASCII text output line width
  --textw <n>      : set max width of ASCII text output lines  [120]  (n>=120)

Options controlling reporti

In [31]:
# Scan unknown.fasta here

In [30]:
! hmmpress -f Pfam-A.hmm

Working...    done.
Pressed and indexed 17929 HMMs (17929 names and 17929 accessions).
Models pressed into binary file:   Pfam-A.hmm.h3m
SSI index for binary model file:   Pfam-A.hmm.h3i
Profiles (MSV part) pressed into:  Pfam-A.hmm.h3f
Profiles (remainder) pressed into: Pfam-A.hmm.h3p
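Since hmmscan depends on these four pressed files, a small pre-flight check can save a confusing error. The sketch below checks for the files listed in the hmmpress output and assembles the hmmscan command line using only options shown in the help text above. The helper names `missing_pressed_files` and `hmmscan_command` are ours.

```python
from pathlib import Path

# Suffixes of the files that hmmpress reports writing.
PRESSED_SUFFIXES = (".h3m", ".h3i", ".h3f", ".h3p")


def missing_pressed_files(hmmdb):
    """Return the pressed files that do not exist yet for `hmmdb`."""
    return [hmmdb + s for s in PRESSED_SUFFIXES if not Path(hmmdb + s).exists()]


def hmmscan_command(hmmdb, seqfile, tblout=None):
    """Build the hmmscan argument list: hmmscan [-options] <hmmdb> <seqfile>."""
    cmd = ["hmmscan"]
    if tblout:
        cmd += ["--tblout", tblout]
    cmd += [hmmdb, seqfile]
    return cmd


missing = missing_pressed_files("Pfam-A.hmm")
if missing:
    print("Run hmmpress first; missing:", missing)
else:
    print(" ".join(hmmscan_command("Pfam-A.hmm", "data/unknown.fasta")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` if you prefer to launch the scan from Python instead of a `!` shell cell.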


In [35]:
! hmmscan Pfam-A.hmm data/unknown.fasta

# hmmscan :: search sequence(s) against a profile database
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query sequence file:             data/unknown.fasta
# target HMM database:             Pfam-A.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       UK2  [L=33]
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model    Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------

   [No hits detected that satisfy reporting thresholds]


Domain annotation for each model (and alignments):

   [No targets detected that satisfy reporting thresholds]


Internal pipeline statistics summary:
----------------------