Start here.
This script runs the full pipeline and has options to run each section individually. Most of the sections are not necessary unless you wish to build your own model rather than use the one provided. If you do, it is highly recommended that you use an HPC cluster, as the process is time-consuming.
Downloads the data files:
- Pfam-A.hmm - containing the hidden Markov models from Pfam-A
- enzyme.dat - containing the EC numbers and which proteins are members of those groups
- uniprot_sprot.fasta - containing the sequences of the proteins in Swiss-Prot
Uses the output of hmmsearch (profile against sequence database) to generate a sparse matrix of hit scores between proteins and Pfam HMMs
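As a rough illustration of this step, the sketch below collects (protein, HMM, score) triplets from tabular hmmsearch output. It assumes hmmsearch was run with `--tblout`, that the full-sequence bit score (sixth column) is the score of interest, and it uses made-up protein and HMM names; the function name `parse_tblout` is hypothetical and not taken from the pipeline.

```python
from io import StringIO

def parse_tblout(handle):
    """Collect (protein, hmm, score) triplets from hmmsearch --tblout text."""
    proteins, hmms = {}, {}          # name -> row / column index
    rows, cols, scores = [], [], []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue                 # skip comment and blank lines
        fields = line.split()
        # Columns: target name, accession, query name, accession, E-value, score, ...
        protein, hmm, score = fields[0], fields[2], float(fields[5])
        rows.append(proteins.setdefault(protein, len(proteins)))
        cols.append(hmms.setdefault(hmm, len(hmms)))
        scores.append(score)
    return proteins, hmms, rows, cols, scores

# Made-up example lines, for illustration only.
sample = StringIO(
    "# target name accession query name accession E-value score bias\n"
    "protA - hmm1 PF_X1 1e-50 170.5 0.1\n"
    "protB - hmm1 PF_X1 2e-10 40.2 0.0\n"
    "protB - hmm2 PF_X2 3e-05 22.0 0.3\n"
)
proteins, hmms, rows, cols, scores = parse_tblout(sample)
```

If SciPy is available, the triplets can be turned into a sparse matrix with `scipy.sparse.coo_matrix((scores, (rows, cols)), shape=(len(proteins), len(hmms)))`.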
Produces a sparse boolean matrix that is True wherever a protein is annotated with a particular EC number
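One possible way to gather the protein-to-EC annotations is sketched below. It assumes the usual enzyme.dat layout, in which each entry starts with an `ID` line holding the EC number, lists its Swiss-Prot members on `DR` lines as `accession, name;` pairs, and ends with `//`. The function name and the sample accessions are placeholders, not part of the pipeline.

```python
def parse_enzyme_dat(lines):
    """Map each EC number to the set of accessions listed on its DR lines."""
    ec_to_proteins = {}
    current_ec = None
    for line in lines:
        if line.startswith("ID"):
            current_ec = line[2:].strip()
            ec_to_proteins[current_ec] = set()
        elif line.startswith("DR") and current_ec is not None:
            for entry in line[2:].split(";"):
                entry = entry.strip()
                if entry:
                    # Each entry looks like "ACCESSION, ENTRY_NAME"
                    ec_to_proteins[current_ec].add(entry.split(",")[0].strip())
        elif line.startswith("//"):
            current_ec = None        # end of the current entry
    return ec_to_proteins

# Placeholder entries, for illustration only.
sample = [
    "ID   1.1.1.1",
    "DR   ACC1, NAME1_XXXX;  ACC2, NAME2_XXXX;",
    "//",
    "ID   1.1.1.2",
    "DR   ACC2, NAME2_XXXX;",
    "//",
]
ec_to_proteins = parse_enzyme_dat(sample)

# The True cells of the boolean matrix, as (protein, EC number) pairs.
annotated = {(acc, ec) for ec, accs in ec_to_proteins.items() for acc in accs}
```

Storing only the True cells as pairs keeps the representation sparse; the pairs can later be mapped to row and column indices.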
Calculates:
- what proportion of the Swiss-Prot database does not have EC numbers
- how many (and which) Pfam HMMs are not seen in Swiss-Prot
- how many (and which) Swiss-Prot proteins do not have a family in Pfam
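These three quantities reduce to set differences. The sketch below assumes the hits are available as (protein, HMM) pairs; the function name, argument names, and toy identifiers are all hypothetical.

```python
def coverage_stats(all_proteins, ec_proteins, all_hmms, hits):
    """Summarise annotation coverage from sets of identifiers."""
    proteins_hit = {protein for protein, _ in hits}
    hmms_hit = {hmm for _, hmm in hits}
    without_ec = set(all_proteins) - set(ec_proteins)
    return {
        "fraction_without_ec": len(without_ec) / len(all_proteins),
        "unseen_hmms": sorted(set(all_hmms) - hmms_hit),
        "proteins_without_family": sorted(set(all_proteins) - proteins_hit),
    }

# Toy identifiers, for illustration only.
stats = coverage_stats(
    all_proteins=["p1", "p2", "p3", "p4"],
    ec_proteins={"p1"},
    all_hmms=["h1", "h2", "h3"],
    hits={("p1", "h1"), ("p2", "h1"), ("p2", "h2")},
)
```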
Plots a frequency bar chart of the number of proteins per EC number, with logarithmic bar widths
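One reading of "logarithmic bar widths" is that each bar covers a bin whose width grows geometrically. The sketch below bins counts that way under that assumption; the function name, the default base of 2, and the toy counts are illustrative choices, not the pipeline's.

```python
def log_bins(counts, base=2):
    """Bin positive integer counts into [1, 2), [2, 4), [4, 8), ... for base 2."""
    edges = [1]
    while edges[-1] <= max(counts):
        edges.append(edges[-1] * base)
    freq = [0] * (len(edges) - 1)
    for count in counts:
        for i in range(len(freq)):
            if edges[i] <= count < edges[i + 1]:
                freq[i] += 1
                break
    return edges, freq

# Toy "proteins per EC number" counts, for illustration only.
edges, freq = log_bins([1, 1, 2, 3, 5, 8, 40])
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

With matplotlib, the result could be drawn using `plt.bar(edges[:-1], freq, width=widths, align="edge")` together with `plt.xscale("log")`.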
Plots frequency bar charts for:
- Number of hits per HMM
- Ratio of enzyme to non-enzyme hits per HMM
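The two per-HMM quantities can be tallied in a single pass over the hits. This sketch assumes hits are (protein, HMM) pairs and that enzymes are the proteins with at least one EC number; reporting an infinite ratio when an HMM has no non-enzyme hits is a choice made here, and all names are hypothetical.

```python
from collections import Counter

def hits_per_hmm(hits, enzymes):
    """Return the total hits and the enzyme/non-enzyme hit ratio for each HMM."""
    enzyme_hits, other_hits = Counter(), Counter()
    for protein, hmm in hits:
        if protein in enzymes:
            enzyme_hits[hmm] += 1
        else:
            other_hits[hmm] += 1
    summary = {}
    for hmm in set(enzyme_hits) | set(other_hits):
        e, n = enzyme_hits[hmm], other_hits[hmm]
        # Assumption: an HMM that only hits enzymes gets an infinite ratio.
        summary[hmm] = {"hits": e + n, "ratio": e / n if n else float("inf")}
    return summary

# Toy data, for illustration only.
summary = hits_per_hmm(
    hits=[("p1", "h1"), ("p2", "h1"), ("p3", "h1"), ("p1", "h2")],
    enzymes={"p1"},
)
```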
A variable containing the path to the data folder
A function to draw consistent bar charts
Functions to produce a test dataset
Functions to:
- reduce the data by removing proteins that do not hit any HMMs
- remove a portion of the non-enzyme proteins, as they will be less important to the learning process
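A minimal sketch of both reductions follows. It assumes proteins are identified by name, hits are (protein, HMM) pairs, and the portion of non-enzymes to keep is given as a fraction; the function names, `keep_fraction`, and `seed` are hypothetical parameters introduced for illustration.

```python
import random

def drop_unhit(proteins, hits):
    """Keep only proteins that hit at least one HMM."""
    hit = {protein for protein, _ in hits}
    return [p for p in proteins if p in hit]

def subsample_non_enzymes(proteins, enzymes, keep_fraction, seed=0):
    """Keep all enzymes and a random fraction of the non-enzymes."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    non_enzymes = [p for p in proteins if p not in enzymes]
    n_keep = round(len(non_enzymes) * keep_fraction)
    kept = set(rng.sample(non_enzymes, n_keep))
    return [p for p in proteins if p in enzymes or p in kept]

# Toy data, for illustration only.
proteins = ["p1", "p2", "p3", "p4", "p5", "p6"]
hits = [("p1", "h1"), ("p2", "h1"), ("p3", "h2"), ("p4", "h2"), ("p5", "h3")]
reduced = drop_unhit(proteins, hits)
sampled = subsample_non_enzymes(reduced, enzymes={"p1"}, keep_fraction=0.5)
```

Filtering by membership rather than by index keeps the original protein order, which makes it easier to keep matrix rows aligned with protein names.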