Annotation

Scripts

bash

annotation

Start here.

This script runs the full pipeline, and has options to perform each section individually. Most of the sections are not neccessary unless you wish to build your own model rather than using the one provided. If that is the case, it is highly recommended that an HPC cluster is used, as the process takes time.

get_data

Downloads the data files:

Pfam-A.hmm - containing the hidden markov models from Pfam A
enzyme.dat - containing the EC numbers and which proteins are members of those groups
uniprot_sprot.fasta - containing the sequences of the proteins in swissprot

python

parse_result.py

Uses the output of hmmsearch (profile against sequence database) to generate a sparse matrix of hit scores between proteins and pfam hmms

identify.py

Produces a sparse, boolean matrix with Trues where a protein is annotated as having a particular EC number

calculate_stats.py

Calculates:

how much of the swissprot database does not have EC numbers
how many (and which) Pfam HMMs are not seen in swissprot
how many (and which) swissprot proteins do not have a family in pfam

ec_frequency.py

Plots a frequency bar chart for number of proteins per EC number, with logarithmic bar widths

pfam_plots.py

Plots frequency bar charts for:

Number of hits per HMM
Ratio of enzyme to non-enzyme hits per HMM

Modules

paths.py

A variable containing the path to the data folder

frequency_bars.py

A function to draw consistent bar charts

test_data.py

Functions to produce a test dataset

prep.py

Functions to:

reduce the data by removing proteins that do not hit any HMMs
remove a portion of the non enzyme proteins, as they will be less important to the learning process

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
Data		Data
Report		Report
Scripts		Scripts
README.md		README.md
annotate		annotate

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Annotation

Scripts

bash

annotation

get_data

python

parse_result.py

identify.py

calculate_stats.py

ec_frequency.py

pfam_plots.py

Modules

paths.py

frequency_bars.py

test_data.py

prep.py

About

Uh oh!

Releases

Packages

Uh oh!

Languages

MichaelKubiak/Annotation

Folders and files

Latest commit

History

Repository files navigation

Annotation

Scripts

bash

annotation

get_data

python

parse_result.py

identify.py

calculate_stats.py

ec_frequency.py

pfam_plots.py

Modules

paths.py

frequency_bars.py

test_data.py

prep.py

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages