Reimplementation of the UniRep protein featurization model in JAX.
This repo is a self-contained version of the UniRep model (so far only the 1900 hidden-unit mLSTM), adapted and extended from fundl.
Ensure that your compute environment allows you to run JAX code. (A modern Linux or macOS with a GLIBC>=2.23 is probably necessary.)
jax-unirep can be installed from source with pip.
Installation from GitHub:
pip install git+https://github.com/ElArkk/jax-unirep.git
To generate representations of protein sequences,
pass a list of sequences as strings or a single sequence to get_reps.
It will return a tuple consisting of the following representations for each sequence:
h_avg: Average hidden state of the mLSTM over the whole sequence.
h_final: Final hidden state of the mLSTM
c_final: Final cell state of the mLSTM
From the original paper,
h_avg is considered the "representation" (or "rep") of the protein sequence.
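To make the relationship between the three outputs concrete, here is a toy sketch in plain Python. It is not the mLSTM and does not call jax-unirep: the per-position hidden and cell states are made-up numbers, and summarize_states is a hypothetical helper name. It only illustrates that h_avg is the element-wise mean of the hidden states over all sequence positions, while h_final and c_final are read off the last position.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def summarize_states(
    hidden_states: Sequence[Vector],
    cell_states: Sequence[Vector],
) -> Tuple[Vector, Vector, Vector]:
    """Reduce per-position states to (h_avg, h_final, c_final).

    hidden_states[t] and cell_states[t] are the (toy) states after
    reading position t of a single sequence.
    """
    if not hidden_states or len(hidden_states) != len(cell_states):
        raise ValueError("need one hidden and one cell state per position")
    n_positions = len(hidden_states)
    # Element-wise mean of the hidden states over all positions.
    h_avg = [sum(column) / n_positions for column in zip(*hidden_states)]
    # The final states are simply the states at the last position.
    h_final = list(hidden_states[-1])
    c_final = list(cell_states[-1])
    return h_avg, h_final, c_final


# Toy example: 3 positions, 2 hidden units (the real model uses 1900).
toy_hidden = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
toy_cell = [[0.5, 0.5], [1.5, 0.0], [3.0, -1.0]]
h_avg, h_final, c_final = summarize_states(toy_hidden, toy_cell)
```

Because h_avg averages over positions, its size does not depend on sequence length, which is what makes it convenient as a fixed-size representation.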
Only valid amino acid sequence letters belonging to the set:
are allowed as inputs to get_reps.
They may be passed in as a single string or an iterable of strings,
and need not necessarily be of the same length.
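If you want to check inputs before featurizing them, a small pre-check along the following lines may help. This is a sketch, not part of jax-unirep: the helper names are hypothetical, and ASSUMED_ALPHABET contains only the 20 standard amino acid letters, which is an assumption and may be narrower than the set the library actually accepts.

```python
from typing import Iterable, List, Set, Union

# ASSUMPTION: the 20 standard amino acid letters only. The set accepted by
# jax-unirep may be larger, so substitute the documented set here.
ASSUMED_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")


def as_sequence_list(seqs: Union[str, Iterable[str]]) -> List[str]:
    """Wrap a single string in a list; copy any other iterable into a list."""
    if isinstance(seqs, str):
        return [seqs]
    return list(seqs)


def invalid_letters(sequence: str, alphabet=ASSUMED_ALPHABET) -> Set[str]:
    """Return the letters in `sequence` that are not in `alphabet`."""
    return set(sequence) - set(alphabet)


def check_sequences(seqs, alphabet=ASSUMED_ALPHABET) -> List[str]:
    """Return the sequences as a list, raising ValueError on a bad letter."""
    sequences = as_sequence_list(seqs)
    for index, sequence in enumerate(sequences):
        bad = invalid_letters(sequence, alphabet)
        if bad:
            raise ValueError(
                f"sequence {index} contains invalid letters: {sorted(bad)}"
            )
    return sequences
```

For example, check_sequences("ASDF") returns ["ASDF"], while a sequence containing a letter outside the chosen alphabet raises a ValueError that names the offending letters. Sequences of different lengths pass through unchanged.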
In Python code, for a single sequence:
from jax_unirep import get_reps

sequence = "ASDFGHJKL"

# h_avg is the canonical "reps"
h_avg, h_final, c_final = get_reps(sequence)
And for multiple sequences:
from jax_unirep import get_reps

sequences = ["ASDF", "YJKAL", "QQLAMEHALQP"]

# h_avg is the canonical "reps"
h_avg, h_final, c_final = get_reps(sequences)

# each of the arrays will be of shape (len(sequences), 1900),
# with the correct order of sequences preserved
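Since the input sequences may differ in length, any code that processes them in same-length batches has to put the results back in input order. The sketch below shows one generic way to do that in plain Python. It is not necessarily how jax-unirep does it internally; apply_in_length_batches and toy_featurizer are hypothetical names, and the toy featurizer merely lowercases each sequence so the ordering is easy to inspect.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def apply_in_length_batches(
    sequences: Sequence[str],
    featurize_batch: Callable[[List[str]], List[object]],
) -> List[object]:
    """Call `featurize_batch` once per sequence length, keeping input order."""
    # Map each length to the original indices of the sequences of that length.
    indices_by_length: Dict[int, List[int]] = defaultdict(list)
    for index, sequence in enumerate(sequences):
        indices_by_length[len(sequence)].append(index)

    results: List[object] = [None] * len(sequences)
    for indices in indices_by_length.values():
        batch = [sequences[i] for i in indices]
        outputs = featurize_batch(batch)
        # Write each output back into the slot of the sequence it came from.
        for i, output in zip(indices, outputs):
            results[i] = output
    return results


# Hypothetical stand-in for a real featurizer.
def toy_featurizer(batch: List[str]) -> List[str]:
    return [sequence.lower() for sequence in batch]


ordered = apply_in_length_batches(
    ["ASDF", "YJKAL", "QQLAMEHALQP", "QWER"], toy_featurizer
)
```

Here "ASDF" and "QWER" are processed together in one batch, yet each result ends up at the index of its input sequence.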
All the model weights are licensed under the terms of Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
Otherwise the code in this repository is licensed under the terms of GPL v3.