This is a software package for protein engineers. It uses protein structure and sequence information to aid researchers in designing protein recombination libraries.
SCHEMA-RASPP was originally developed in the laboratory of Frances H. Arnold at the California Institute of Technology. SCHEMA-RICE was developed by Weiliang Huang in the laboratory of Elizabeth Gillam at The University of Queensland.
These tools can calculate SCHEMA energies of chimeric proteins and run the RASPP algorithm to find optimal library designs.
SCHEMA-RICE modifies the original SCHEMA algorithm (Voigt et al., 2002) and takes into account the nature of the interaction, not just the proximity of the groups.
References:
Voigt, C. et al., "Protein building blocks preserved by recombination," Nature Structural Biology 9(7):553-558 (2002).
Meyer, M. et al., "Library analysis of SCHEMA-guided recombination," Protein Science 12:1686-1693 (2003).
Otey, C. et al., "Functional evolution and structural conservation in chimeric cytochromes P450: Calibrating a structure-guided approach," Chemistry & Biology 11:1-20 (2004).
Silberg, J. et al., "SCHEMA-guided protein recombination," Methods in Enzymology 388:35-42 (2004).
Endelman, J. et al., "Site-directed protein recombination as a shortest-path problem," Protein Engineering, Design & Selection 17(7):589-594 (2004).
Smith, M.A., Arnold, F.H., "Designing Libraries of Chimeric Proteins Using SCHEMA Recombination and RASPP," Methods in Molecular Biology 1179:335-343 (2014).
- Clone this repository to your computer. It is assumed you have Python 3.8 or higher installed.
git clone https://github.com/SebPorras/SCHEMA.git
- Move into the cloned directory
cd SCHEMA
There are essentially two steps to calculate optimal crossover points using SCHEMA-RICE.
- Generate a contact map for the proteins to be used in recombination (parental proteins).
- Find optimal crossover points using SCHEMA-RICE scoring and the RASPP algorithm.
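To make the first step more concrete, here is a minimal sketch of what a residue contact map is. It assumes two residues are in contact when any pair of their atoms lies within 4.5 angstroms; that cutoff, the function name `residue_contacts`, and the toy coordinates are assumptions for illustration and are not taken from rice.py, which also maps PDB residues onto the alignment.

```python
import itertools
import math


def residue_contacts(residues, cutoff=4.5):
    """Return index pairs (i, j), i < j, whose closest atoms are within `cutoff`.

    `residues` is a list with one entry per residue; each entry is a list of
    (x, y, z) atom coordinates in angstroms. The 4.5 cutoff is an assumption.
    """
    contacts = []
    for i, j in itertools.combinations(range(len(residues)), 2):
        # A single close atom pair is enough to call the residues in contact.
        if any(math.dist(a, b) <= cutoff for a in residues[i] for b in residues[j]):
            contacts.append((i, j))
    return contacts


# Toy coordinates (hypothetical, not from a real PDB file).
toy = [
    [(0.0, 0.0, 0.0)],
    [(3.8, 0.0, 0.0)],
    [(7.6, 0.0, 0.0), (7.6, 4.0, 0.0)],
    [(3.8, 6.0, 0.0)],
]
print(residue_contacts(toy))
```

In this toy case only neighbouring residues end up in contact. In a real structure, contacts between residues far apart in sequence are the ones a crossover can break.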
usage: rice.py [-h] -pdb PDB -msa MSA -xo XO [-pdbal PDBAL] [-chains CHAINS]
[-min MIN] [-bin BIN] [-o output.txt] [-con contacts.txt]
Options:
-h, --help Show this help message and exit
-pdb PDB A PDB file from the Protein Data Bank
-msa MSA A multiple sequence alignment in ALN format (e.g. ClustalW)
-xo XO The number of crossovers
-pdbal PDBAL (Optional) In ALN format. If this argument is not provided,
then the PDB file's ID (e.g., 1G68) will be extracted, and
the sequence having that ID in the multiple sequence
alignment file will be used.
-chains CHAINS (Optional) The PDB chain identifiers (e.g. -chains A B). Chains 'A' and
'' are included by default.
-min MIN (Optional) The minimum fragment length (minus invariant positions), in
residues. Default min is 4.
-bin BIN (Optional) The width of each average mutation bin. Default bin is 1.
-o output.txt (Optional) Specify where you want your RASPP curve to be saved. If this
option is not used, output will be printed to stdout.
-con contacts.txt (Optional) You can provide an existing contact file you have previously
created. If not specified, rice.py will generate a new file
called contacts.txt.
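If you want to drive rice.py from another Python script, the options above can be assembled into a command list. The sketch below uses only the flags documented in the usage text; the helper name `build_rice_command` and its keyword names are hypothetical, and it assumes rice.py is in the current working directory.

```python
import subprocess
import sys


def build_rice_command(pdb, msa, xo, pdbal=None, chains=None,
                       min_length=None, bin_width=None, output=None, contacts=None):
    """Build an argument list for rice.py from the documented options."""
    cmd = [sys.executable, "rice.py", "-pdb", pdb, "-msa", msa, "-xo", str(xo)]
    optional = [("-pdbal", pdbal), ("-min", min_length), ("-bin", bin_width),
                ("-o", output), ("-con", contacts)]
    for flag, value in optional:
        if value is not None:  # leave out anything the caller did not set
            cmd += [flag, str(value)]
    if chains:
        # Assumes chain identifiers are passed as separate values after -chains.
        cmd += ["-chains", *chains]
    return cmd


cmd = build_rice_command("2HI4.pdb", "msa.txt", 6,
                         pdbal="1a1_2hi4_aln.txt", min_length=10, output="output.txt")
print(" ".join(cmd[1:]))
# To run it from inside the cloned SCHEMA directory:
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.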
There are two essential files you need to use the tool.
- A multiple sequence alignment (MSA) of the parental proteins (the proteins you wish to recombine) in ALN format without a header. An example, msa.txt, is shown below.
# Multiple sequence alignment for P450s
1A1 QVPKGLKNPPGPWGWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDT
1A2 RVPKGLKSPPEPWGWPLLGHVLTLGKNPHLALSRMSQRYGDVLQIRIGSTPVLVLSRLDT
:******.** ******:**:****************:**************:*** ***
1A1 IRQALVRQGDDFKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDP
1A2 IRQALVRQGDDFKGRPDLYTSTLITDGQSLTFSTDSGPVWAARRRLAQNALNTFSIASDP
******************** ***::***::**.***************.*::*******
...
- A PDB file to generate the contact map.
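Before running the tool it can help to check that your alignment file parses as expected. The sketch below reads the headerless ALN layout shown in the msa.txt example (name, whitespace, aligned block, with a conservation line under each block) into a dictionary. The function name `read_aln` is hypothetical, and this is not the parser rice.py uses.

```python
def read_aln(text):
    """Return {name: full aligned sequence} from headerless ALN-style text."""
    sequences = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, '#' comments, and conservation lines,
        # which contain only '*', ':', '.' and spaces.
        if not stripped or stripped.startswith("#") or set(stripped) <= set("*:. "):
            continue
        parts = stripped.split()
        if len(parts) != 2:
            continue
        name, block = parts
        # Blocks for the same name are concatenated in file order.
        sequences[name] = sequences.get(name, "") + block
    return sequences


example = """\
# Multiple sequence alignment for P450s
1A1 QVPKGLKNPPGPWGWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDT
1A2 RVPKGLKSPPEPWGWPLLGHVLTLGKNPHLALSRMSQRYGDVLQIRIGSTPVLVLSRLDT
    :******.** ******:**:****************:**************:*** ***
1A1 IRQALVRQGDDFKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDP
1A2 IRQALVRQGDDFKGRPDLYTSTLITDGQSLTFSTDSGPVWAARRRLAQNALNTFSIASDP
    ******************** ***::***::**.***************.*::*******
"""
parents = read_aln(example)
for name, seq in parents.items():
    print(name, len(seq))
```

All parents should report the same aligned length. A mismatch usually means a block was truncated or a name was misspelled in one of the blocks.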
If one of your parent sequences in the MSA matches the sequence in your PDB file, you can now run the tool. For example:
python rice.py -pdb 1A2.pdb -msa msa.txt -xo 6
However, if your PDB structure is not found in your MSA file, you will also need to provide an alignment between the PDB sequence and one of the sequences in your MSA. An example of this is shown in 1a1_2hi4_aln.txt:
2HI4 RVPKGLKSPPEPWGWPLLGHVLTLGKNPHLALSRMSQRYGDVLQIRIGSTPVLVLSRLDT
1A1 QVPKGLKNPPGPWGWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDT
:******.** ******:**:****************:**************:*** ***
2HI4 IRQALVRQGDDFKGRPDLYTSTLITDGQSLTFSTDSGPVWAARRRLAQNALNTFSIASDP
1A1 IRQALVRQGDDFKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDP
******************** ***::***::**.***************.*::*******
An example command would look like this:
python rice.py -pdb 2HI4.pdb -msa msa.txt -pdbal 1a1_2hi4_aln.txt -xo 6
The final mandatory argument is the -xo option which specifies the number of crossovers you wish to optimise for.
By default, rice.py will generate a new contact file called contacts.txt. However, by using the -con option, you can provide an existing contact file which you have previously created from running rice.py. This is generally recommended as it can speed up the runtime of the program. For example:
python rice.py -pdb 2HI4.pdb -msa msa.txt -pdbal 1a1_2hi4_aln.txt -xo 6 -con contacts.txt
The -o option specifies where you would like your output to be saved. If you do not specify a file, the output will be printed to stdout. For example:
python rice.py -pdb 2HI4.pdb -msa msa.txt -pdbal 1a1_2hi4_aln.txt -xo 6 -o output.txt
The output from this command is shown below.
# Minimum fragment length = 4
# Using bin width = 1
# Number of crossovers = 6
# RASPP took 48.64 secs
# RASPP found 720 results
# RASPP found 15 unique (<E>,<m>) points
# RASPP curve took 0.07 secs
# <E> <M> crossover points
7.0000 21.5000 304 314 337 373 426 437
9.0000 25.0000 53 283 314 351 426 447
8.0000 26.0000 38 85 304 314 337 426
9.0000 26.5000 38 85 283 314 337 426
10.5000 34.3906 243 304 314 337 426 437
11.5000 36.2656 231 304 314 337 426 437
12.0000 36.8281 228 304 314 337 426 437
12.5000 37.9062 85 243 304 323 426 447
13.5000 39.1562 85 231 304 323 426 447
15.0000 39.7500 38 90 142 228 304 337
14.0000 40.8281 53 110 228 304 332 426
14.0000 41.9531 57 137 228 304 337 426
15.0000 43.3438 81 124 188 228 304 426
15.0000 43.5938 90 164 188 228 304 426
16.5000 44.7344 94 188 243 279 351 426
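Once the curve is saved, you will usually want to pick one library design from it. The sketch below assumes the output file has one design per line (average energy, average mutation, then the crossover points) with `#` comment lines, and it picks the lowest-energy design that reaches a chosen average mutation level. The function names and the selection rule are illustrative assumptions, not part of rice.py.

```python
def parse_raspp_curve(text):
    """Parse RASPP curve text into (avg_energy, avg_mutation, crossovers) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split()
        rows.append((float(fields[0]), float(fields[1]), [int(x) for x in fields[2:]]))
    return rows


def lowest_energy_design(rows, min_m):
    """Return the row with the lowest <E> among those with <m> >= min_m, or None."""
    candidates = [row for row in rows if row[1] >= min_m]
    return min(candidates, key=lambda row: row[0]) if candidates else None


# A few rows copied from the example output above.
sample = """\
# <E> <M> crossover points
7.0000 21.5000 304 314 337 373 426 437
9.0000 25.0000 53 283 314 351 426 447
8.0000 26.0000 38 85 304 314 337 426
10.5000 34.3906 243 304 314 337 426 437
"""
rows = parse_raspp_curve(sample)
print(lowest_energy_design(rows, min_m=25))
```

Which trade-off between low energy and high mutation is appropriate depends on the project, so treat `min_m` as a value to explore rather than a recommended setting.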
You can also specify a minimum fragment length using the -min option. This means that the distance between crossover points will not be less than this value. By default, this value is set to 4. It is generally recommended to specify -min to prevent RASPP from choosing trivial crossovers.
python rice.py -pdb 1G68.pdb -msa lac-msa.txt -pdbal PSE4-1G68.txt -xo 6 -min 10 -o output.txt
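Because the minimum fragment length is counted without invariant positions, a fragment that looks long enough by residue number can still be too short. The sketch below counts the variable alignment columns in each fragment. It assumes 1-based crossover points where each point is the first column of a new fragment; that convention, the function names, and the toy sequences are assumptions, so check them against your own output before relying on the numbers.

```python
def variable_fragment_lengths(parents, crossovers):
    """Count non-invariant alignment columns in each fragment.

    Assumes 1-based crossover points, each marking the first column of a
    new fragment (an assumed convention, not confirmed by the tool's docs).
    """
    length = len(parents[0])
    # A column is variable if the parents do not all share the same residue.
    variable = [len({p[i] for p in parents}) > 1 for i in range(length)]
    bounds = [1] + sorted(crossovers) + [length + 1]
    return [sum(variable[start - 1:end - 1]) for start, end in zip(bounds, bounds[1:])]


def satisfies_min(parents, crossovers, min_length=4):
    """True if every fragment has at least `min_length` variable columns."""
    return all(n >= min_length for n in variable_fragment_lengths(parents, crossovers))


# Toy aligned parents (hypothetical); they differ at columns 4, 7 and 9.
toy_parents = ["ACDEFGHIKL", "ACDQFGYIRL"]
print(variable_fragment_lengths(toy_parents, [5, 8]))
print(satisfies_min(toy_parents, [5, 8], min_length=2))
```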
Finally, when generating the RASPP output, you can specify the width of each average mutation bin if you wish. The default bin width is 1. For example:
python rice.py -pdb 1G68.pdb -msa lac-msa.txt -pdbal PSE4-1G68.txt -xo 6 -min 10 -bin 2
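To see what the bin width changes, the sketch below groups (average energy, average mutation) points into mutation bins and keeps the lowest-energy point in each bin. It assumes bins are counted from the smallest average mutation observed; that choice, the function name, and the toy points are assumptions, and rasppcurve.py may align its bins differently.

```python
import math


def lowest_energy_per_bin(points, bin_width=1.0):
    """Keep the lowest-energy (energy, mutation) point in each mutation bin."""
    if not points:
        return []
    origin = min(m for _, m in points)  # assumed bin origin
    best = {}
    for energy, m in points:
        index = math.floor((m - origin) / bin_width)
        if index not in best or energy < best[index][0]:
            best[index] = (energy, m)
    return [best[i] for i in sorted(best)]


# Toy (energy, mutation) points, made up for illustration.
points = [(7.0, 21.5), (9.5, 21.9), (9.0, 25.0), (8.0, 26.0), (9.0, 26.5)]
print(lowest_energy_per_bin(points, 1))
print(lowest_energy_per_bin(points, 2))
```

A wider bin returns fewer designs, each being the best within a broader range of mutation levels.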
rice.py simply streamlines the operation of rasppcurve.py and schemacontacts.py, originally provided in the SCHEMA Tools package. If you wish to use these tools independently, or understand the program better, refer to the SCHEMA Tools section below.
The original authors have also meticulously documented the original Python tools in schema-tools-doc.html.
If you are interested in exploring the use of these tools, the scripts have been updated to Python 3 but can be used exactly the same way as demonstrated in the original documentation.
However, SCHEMA energy (E) is calculated using the SCHEMA-RICE algorithm, which factors in the physicochemical properties of interacting residues.
The excerpt below details what these tools can be used for.
- Generate a contact map from a PDB file and an alignment of parent proteins
- Calculate SCHEMA energy E and mutation m for chimeras, or an entire combinatorial library, using a contact map
- Enumerate crossover points and compute average E and m for the resulting libraries
- Find crossover points predicted to optimize folded, diverse proteins using RASPP
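For readers new to these quantities, here is a minimal sketch of E and m for a single chimera under the original, unweighted SCHEMA definition (Voigt et al., 2002). A contact counts as broken when the residue pair in the chimera is not seen at those two positions in any parent, and m is the number of mutations from the closest parent. SCHEMA-RICE additionally weights contacts by the nature of the interaction, which is not reproduced here. The function names, toy sequences and contact list are all hypothetical.

```python
def schema_energy(chimera, parents, contacts):
    """Count contacts whose residue pair in the chimera occurs in no parent."""
    broken = 0
    for i, j in contacts:
        pair = (chimera[i], chimera[j])
        if all((p[i], p[j]) != pair for p in parents):
            broken += 1
    return broken


def mutation_level(chimera, parents):
    """Number of differing positions between the chimera and its closest parent."""
    return min(sum(a != b for a, b in zip(chimera, p)) for p in parents)


# Toy aligned parents and a chimera with one crossover after position 3.
toy_parents = ["ACDEFG", "AKDQFY"]
chimera = toy_parents[0][:3] + toy_parents[1][3:]  # "ACDQFY"
# Hypothetical contact map as 0-based index pairs.
contacts = [(0, 5), (1, 3), (2, 4), (1, 5)]

print(schema_energy(chimera, toy_parents, contacts))
print(mutation_level(chimera, toy_parents))
```

Contacts between two residues inherited from the same parent are never broken, which is why crossover placement relative to the contact map matters.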