AbExpress

Assessment and optimization of antibody expressability using Long Short-Term Memory and structural design

This repository contains scripts to train models, predict expressability, and calculate engineerability. A PyRosetta/RosettaScripts-based design protocol improves the expressability of an antibody based on the predicted expressability of its single point mutants.

Sequences must be aligned using an antibody numbering scheme. The sequences for the pre-trained models were aligned with IMGT numbering, using an MSA generated by IgReconstruct. Other numbering schemes can be implemented by the user.

Note that whenever a TensorFlow model is used for prediction or design, the sequence alignment must match the alignment the model was trained on. That is, the exact same (IMGT) numbers, which are the columns of the MSA, must be present in the same order. The easiest way to achieve this is to create a single MSA that contains both the training and the testing data.

Installation

The scripts are ready to use and were tested with Python 3.6 and TensorFlow 2.1. Dependencies:

  • pyrosetta==2019.11+release.fdb3942
  • numpy==1.19.2
  • pandas==1.0.3
  • tensorflow==2.1.0
  • scikit-learn==0.21.2

Train

Train on an IMGT-aligned sequence set (e.g. from IgReconstruct) with a label file that contains as many rows as there are sequences. Each row is either 1 (expressing) or 0 (non-expressing). Note that the --paired flag pairs all sequences and therefore expects each light chain entry to be followed by its corresponding heavy chain.

./train.py --msa imgt.fa --label labels.txt --checkpoint my_models --paired
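The input requirements described above can be checked before training. The following is a minimal sketch, not part of the repository: the helper names `read_fasta` and `check_training_inputs` are hypothetical, and it assumes one label row per FASTA record, as stated above, and that all aligned sequences have equal length.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_training_inputs(fasta_text, label_text, paired=True):
    """Return a list of problems; an empty list means the inputs look consistent."""
    records = read_fasta(fasta_text)
    labels = [row.strip() for row in label_text.splitlines() if row.strip()]
    problems = []
    # One label row per sequence, as described in the README.
    if len(labels) != len(records):
        problems.append(f"{len(records)} sequences but {len(labels)} labels")
    # Labels must be 1 (expressing) or 0 (non-expressing).
    bad = sorted(set(labels) - {"0", "1"})
    if bad:
        problems.append(f"labels other than 0/1: {bad}")
    # Assumption: rows of an MSA all have the same aligned length.
    if len({len(seq) for _, seq in records}) > 1:
        problems.append("aligned sequences differ in length")
    # --paired expects light, heavy, light, heavy, ...
    if paired and len(records) % 2 != 0:
        problems.append("--paired expects an even number of sequences")
    return problems
```

To use it, read `imgt.fa` and `labels.txt` into strings and pass them to `check_training_inputs`. The check cannot tell whether a light chain is really followed by its own heavy chain; it only verifies that the record count is even.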

Pre-Trained models

Ten pre-trained models are included in pre-trained/Flu. They were trained on the Flu Ab dataset used for benchmarking in the publication.

The IMGT numbers used for training were:

1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,111.1,111.2,111.3,111.4,111.5,111.6,112.6,112.5,112.4,112.3,112.2,112.1,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128

For using these models:

  • The columns in the MSA must correspond to the same IMGT numbers (or equivalent residues in a different Ab numbering scheme) in the same order.
  • The models were trained with --paired, so the MSA must be sorted so that each light chain is followed by its heavy chain.
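The column list above follows a regular pattern: positions 1 to 128 without 73, with six insertions after 111 and six insertions (in descending order) before 112. The sketch below rebuilds that list and checks the width of an alignment against it. The function names are hypothetical, and the width check assumes one alignment character per column in every sequence.

```python
def flu_imgt_columns():
    """Rebuild the IMGT column labels listed for the pre-trained Flu models."""
    columns = []
    for pos in range(1, 129):
        if pos == 73:
            # Position 73 is absent from the published column list.
            continue
        if pos == 112:
            # Insertions on the 112 side come before 112, in descending order.
            columns.extend(f"112.{i}" for i in range(6, 0, -1))
        columns.append(str(pos))
        if pos == 111:
            # Insertions on the 111 side come after 111, in ascending order.
            columns.extend(f"111.{i}" for i in range(1, 7))
    return columns


def mismatched_rows(sequences, columns):
    """Return indices of aligned sequences whose length differs from the column count."""
    return [i for i, seq in enumerate(sequences) if len(seq) != len(columns)]
```

The list has 139 entries, so under the stated assumption every row of an MSA used with these models should be 139 characters long. A matching width is necessary but not sufficient: the columns must also carry the same IMGT numbers in the same order.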

Predict

Prediction on an IMGT-aligned light/heavy chain pair. The MSA must contain the same IMGT numbers as the MSA used for training. The predicted expressability of each sequence (pair) is written to a tab-separated file.

./predict.py --msa sample.fa --checkpoint my_model_1 --paired -o prediction.tsv
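Since ten pre-trained models are provided, it can be convenient to run the prediction once per model. The sketch below only builds the command lines, using the flags shown above. It assumes the checkpoints are named manuscript_0 to manuscript_9 inside pre-trained/Flu; only manuscript_0 is confirmed by this README, and the output file names are made up for illustration.

```python
def build_predict_commands(msa, model_dir="pre-trained/Flu", n_models=10):
    """Build one ./predict.py argument list per pre-trained model."""
    commands = []
    for i in range(n_models):
        # ASSUMPTION: checkpoints follow the manuscript_<i> naming pattern.
        checkpoint = f"{model_dir}/manuscript_{i}"
        commands.append([
            "./predict.py",
            "--msa", msa,
            "--checkpoint", checkpoint,
            "--paired",
            "-o", f"prediction_{i}.tsv",  # hypothetical output name
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.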

Engineerability

It has been shown that the "engineerability" term can be used to estimate by how much the expressability can be improved via design. The same (IMGT) numbering rules apply as for prediction. To calculate the engineerability:

./engineerability.py --msa sample.fa --checkpoint my_model_1 --paired -o screening.tsv

Design

Three weighting schemes were benchmarked for the manuscript. For all three, the engineerability term is the lower bound of the expected expressability after design.

Scheme    Expression weight    Native bonus
low       4.0                  3.0
medium    3.0                  2.0
strong    4.0                  2.0

Example command for designing a structure with the pre-trained Flu model 0 using the strong weighting scheme. It is recommended to provide a resfile that repacks the pose and allows design in the Fv region, excluding CDR3 and cysteines.

./design.py --pdb structure.pdb --xml xml/AbExpress.xml --script_vars weight_express=4 bonus_native=2 resfile=resfile msa=sample.fa model=pre-trained/Flu/manuscript_0
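The weighting schemes from the table can be mapped onto the script variables of the command above. The following sketch is a convenience wrapper, not part of the repository: `SCHEMES` and `build_design_command` are hypothetical names, and the values are copied from the table.

```python
# Expression weight and native bonus per scheme, as listed in the table.
SCHEMES = {
    "low": (4.0, 3.0),
    "medium": (3.0, 2.0),
    "strong": (4.0, 2.0),
}


def build_design_command(scheme, pdb, msa, model,
                         resfile="resfile", xml="xml/AbExpress.xml"):
    """Build the ./design.py argument list for one weighting scheme."""
    weight, bonus = SCHEMES[scheme]
    return [
        "./design.py",
        "--pdb", pdb,
        "--xml", xml,
        "--script_vars",
        f"weight_express={weight:g}",  # :g renders 4.0 as "4"
        f"bonus_native={bonus:g}",
        f"resfile={resfile}",
        f"msa={msa}",
        f"model={model}",
    ]
```

Calling `build_design_command("strong", "structure.pdb", "sample.fa", "pre-trained/Flu/manuscript_0")` reproduces the example command; the list can then be run with `subprocess.run(cmd, check=True)`.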