This repository contains scripts to train expressability models, predict expressability, and calculate engineerability. A PyRosetta/RosettaScripts-based design protocol improves the expressability of an antibody based on the predictions for its single-point mutants.
Sequences must be aligned using an antibody numbering scheme. The sequences for the pre-trained models were aligned with IMGT numbering using an MSA generated by IgReconstruct. Other numbering schemes can be implemented by the user.
Note that whenever a TensorFlow model is used for prediction or design, the sequence alignment must match the alignment used to train the model. This means that the exact same (IMGT) numbers, which are the columns of the MSA, must be present in the same order. The easiest way to achieve this is to create a single MSA that contains both the training and the testing data.
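The alignment requirement above can be partly checked before running a model. The following is a minimal sketch, not part of the repository: `read_fasta` and `same_alignment_width` are hypothetical helper names, and the check only confirms that all aligned sequences have the same number of columns. Equal width is necessary but not sufficient, because it cannot prove that each column carries the same IMGT number.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def same_alignment_width(train_records, test_records):
    """Return True if every aligned sequence has the same number of columns."""
    widths = {len(seq) for _, seq in train_records + test_records}
    return len(widths) == 1


# Hypothetical toy alignments, for illustration only.
train = read_fasta(">light_1\nAC-G\n>heavy_1\nA-TG\n")
test = read_fasta(">light_2\nACTG\n")
print(same_alignment_width(train, test))
```

A width mismatch is a reliable sign that the test MSA was built separately from the training MSA.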
The scripts are ready to use and were tested with Python 3.6 and TensorFlow 2.1. List of dependencies:
- pyrosetta==2019.11+release.fdb3942
- numpy==1.19.2
- pandas==1.0.3
- tensorflow==2.1.0
- scikit-learn==0.21.2
Training on an IMGT-aligned sequence set (e.g. from IgReconstruct) requires a label file that contains as many rows as there are sequences.
Each line is either 1 (expressing) or 0 (non-expressing).
Note that the command below pairs all sequences (`--paired`) and thus expects each light chain entry to be followed by its corresponding heavy chain.
./train.py --msa imgt.fa --label labels.txt --checkpoint my_models --paired
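Before training, it can help to confirm that the label file satisfies the rules described above. This is a minimal sketch under stated assumptions: `validate_labels` is a hypothetical helper that is not part of the repository, and the even-count check for paired input is inferred from the light/heavy ordering rule rather than taken from `train.py` itself.

```python
def validate_labels(label_text, n_sequences, paired=True):
    """Return a list of human-readable problems found in a label file."""
    labels = [line.strip() for line in label_text.splitlines() if line.strip()]
    problems = []

    # One label row is expected per sequence in the MSA.
    if len(labels) != n_sequences:
        problems.append(
            f"expected {n_sequences} labels, found {len(labels)}"
        )

    # Each label must be 1 (expressing) or 0 (non-expressing).
    for row, value in enumerate(labels, start=1):
        if value not in {"0", "1"}:
            problems.append(f"row {row}: invalid label {value!r}")

    # Assumption: paired input needs an even number of sequences,
    # since each light chain is followed by its heavy chain.
    if paired and n_sequences % 2 != 0:
        problems.append("paired input requires an even number of sequences")

    return problems


print(validate_labels("1\n1\n0\n0\n", n_sequences=4))
```

An empty list means that none of the checked rules were violated.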
Ten pre-trained models, trained on the Flu Ab dataset used for benchmarking in the publication, are included in pre-trained/Flu.
The IMGT numbers used for training were:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,111.1,111.2,111.3,111.4,111.5,111.6,112.6,112.5,112.4,112.3,112.2,112.1,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128
For using these models:
- The columns in the MSA must correspond to the same IMGT numbers (or equivalent residues when using a different Ab numbering scheme) in the same order.
- The antibodies were paired (`--paired`) and are sorted in the MSA so that each light chain is followed by its heavy chain.
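The column-order requirement can be checked by comparing position lists directly. The sketch below assumes the column numbers of your own MSA are available as a comma-separated string; `first_mismatch` is a hypothetical helper, not part of the repository. Positions are compared as strings, because the CDR3 insertion codes in the list above run in a non-numeric order (111.1 to 111.6, then 112.6 down to 112.1, then 112).

```python
def first_mismatch(reference, candidate):
    """Return (index, reference_value, candidate_value) for the first
    differing column, or None if both position lists are identical."""
    ref = [token.strip() for token in reference.split(",")]
    cand = [token.strip() for token in candidate.split(",")]

    for index, (r, c) in enumerate(zip(ref, cand)):
        if r != c:
            return index, r, c

    # Same shared prefix but different lengths: report the first extra column.
    if len(ref) != len(cand):
        index = min(len(ref), len(cand))
        r = ref[index] if index < len(ref) else None
        c = cand[index] if index < len(cand) else None
        return index, r, c

    return None


# Short excerpt around CDR3, for illustration only (not the full list).
reference = "110,111,111.1,112.1,112,113"
candidate = "110,111,111.1,112,112.1,113"
print(first_mismatch(reference, candidate))
```

In practice the reference string would be the full list of IMGT numbers given above.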
Prediction on an IMGT-aligned light/heavy chain pair. The MSA must contain the same IMGT numbers as the MSA used for training. Predicted expressabilities for each sequence (pair) are written to a tab-separated file.
./predict.py --msa sample.fa --checkpoint my_model_1 --paired -o prediction.tsv
It has been shown that the "engineerability" term can be used to estimate by how much the expressability can be improved via design. The same rules for the (IMGT) numbering apply as for prediction. To calculate the engineerability:
./engineerability.py --msa sample.fa --checkpoint my_model_1 --paired -o screening.tsv
Three weighting schemes were benchmarked for the manuscript. For all three schemes, the engineerability term is the lower bound of the expected expressability after design.
Scheme | Expression weight | Native bonus |
---|---|---|
low | 4.0 | 3.0 |
medium | 3.0 | 2.0 |
strong | 4.0 | 2.0 |
Example command for designing a structure with the pre-trained Flu model 0 using the strong weighting scheme. It is recommended to provide a resfile that repacks the pose and allows design in the Fv region, excluding CDR3 and cysteines.
./design.py --pdb structure.pdb --xml xml/AbExpress.xml --script_vars weight_express=4 bonus_native=2 resfile=resfile msa=sample.fa model=pre-trained/Flu/manuscript_0
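To switch between weighting schemes without retyping the values, the command above can be assembled programmatically. This is a minimal sketch: `SCHEMES` copies the values from the table above, and `design_command` is a hypothetical helper that only builds the argument list using the flags and script variables shown in the example command. It does not run the design.

```python
# Weighting schemes from the table above: (expression weight, native bonus).
SCHEMES = {
    "low": (4.0, 3.0),
    "medium": (3.0, 2.0),
    "strong": (4.0, 2.0),
}


def design_command(pdb, scheme, resfile, msa, model, xml="xml/AbExpress.xml"):
    """Build the design.py argument list for one weighting scheme."""
    weight, bonus = SCHEMES[scheme]
    return [
        "./design.py",
        "--pdb", pdb,
        "--xml", xml,
        "--script_vars",
        f"weight_express={weight:g}",  # :g prints 4.0 as "4"
        f"bonus_native={bonus:g}",
        f"resfile={resfile}",
        f"msa={msa}",
        f"model={model}",
    ]


command = design_command(
    pdb="structure.pdb",
    scheme="strong",
    resfile="resfile",
    msa="sample.fa",
    model="pre-trained/Flu/manuscript_0",
)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` from the repository root if running the design from Python is preferred.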