Code to reproduce the paper Learned Protein Embeddings for Machine Learning.

Computing Environment:

This code was originally developed using Anaconda Python 3.5 and the following package versions:

gensim 1.0.1
numpy 1.13.1
pandas 0.20.3
scipy 0.19.1
sklearn 0.19.0
matplotlib 2.0.2
seaborn 0.8.1
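
To compare a local environment against the versions listed above, something like the following sketch may help. It is not part of the repository. It assumes each package can be imported under the name given in the list and exposes a `__version__` attribute; the helper names `installed_version` and `find_mismatches` are hypothetical.

```python
import importlib

# Versions copied from the list above; keys are import names.
EXPECTED = {
    "gensim": "1.0.1",
    "numpy": "1.13.1",
    "pandas": "0.20.3",
    "scipy": "0.19.1",
    "sklearn": "0.19.0",
    "matplotlib": "2.0.2",
    "seaborn": "0.8.1",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Assumption: the package exposes __version__; fall back to None otherwise.
    return getattr(module, "__version__", None)


def find_mismatches(expected, lookup=installed_version):
    """Return {name: (expected, found)} for every package whose version differs."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` returns an empty dictionary when every listed package matches. A mismatch does not necessarily mean the notebooks will fail, only that the environment differs from the one used originally.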

File structure

The repository is divided into code, inputs, and outputs. Inputs contains all the unlabeled sequences used to build docvec models, the labeled sequences used to build Gaussian process regression models, and AAIndex, ProFET, and one-hot encodings of the labeled sequences. Code contains Python implementations of Gaussian process regression and the mismatch string kernel, in addition to Jupyter notebooks that reproduce the analyses in the paper. Outputs contains all the embeddings produced during the course of analysis and CSV files storing the results of the cross-validation over embedding hyperparameters, the negative controls, and the results of varying the embedding dimension or the number of unlabeled sequences. Note that while code to train docvec models is provided, the actual docvec models produced by gensim are not included in the repository because they are too large. They are freely available at http://cheme.caltech.edu/~kkyang/.
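
For readers unfamiliar with the one-hot encodings mentioned above, here is a minimal illustrative sketch. It is not the repository's implementation: the alphabet ordering, the flattened output layout, and the function name `one_hot_encode` are all assumptions, and the encodings shipped in inputs may be organized differently.

```python
# The 20 canonical amino acids; this particular ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence, alphabet=AMINO_ACIDS):
    """Return a flat list of 0/1 values with len(alphabet) entries per residue."""
    index = {residue: i for i, residue in enumerate(alphabet)}
    encoded = []
    for residue in sequence:
        row = [0] * len(alphabet)
        # Raises KeyError for residues outside the alphabet (e.g. 'X' or gaps).
        row[index[residue]] = 1
        encoded.extend(row)
    return encoded
```

For a sequence of length n this produces a vector of length 20n with exactly n nonzero entries, so sequences must share a common length (or be aligned) before the vectors can be stacked into a feature matrix.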

vzg100/embeddings_reproduction
