# CafChem tools for training and using protein GPTs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MauricioCafiero/CafChem/blob/main/notebooks/ProteinGPT_CafChem.ipynb)

## This notebook allows you to:
- load a set of protein sequences filtered by length and organism
- tokenize the sequences
- train a GPT model on the tokenized data
- save and load models
- load a foundation model
- finetune a foundation model or your own model
- run inference to generate novel proteins

## Requirements:

- Runs on an L4 GPU for small datasets. Memory use grows with dataset size, so large datasets need an A100 or better.

## Install and import libraries

In [1]:
!pip install deepchem -q
!pip install -q rdkit

In [2]:
!git clone https://github.com/MauricioCafiero/CafChem.git

Cloning into 'CafChem'...
remote: Enumerating objects: 1218, done.
remote: Counting objects: 100% (401/401), done.
remote: Compressing objects: 100% (133/133), done.
remote: Total 1218 (delta 368), reused 268 (delta 268), pack-reused 817 (from 2)
Receiving objects: 100% (1218/1218), 60.98 MiB | 17.17 MiB/s, done.
Resolving deltas: 100% (712/712), done.


In [4]:
import tensorflow as tf
import numpy as np
import pandas as pd
from io import BytesIO
import requests

from CafChem.CafChemProteinGPT import *



## Get protein data

### Load data

In [5]:
#@title Set protein fetching parameters

protein_min_length = 10 #@param {type:"integer"}
protein_max_length = 100 #@param {type:"integer"}

#@markdown Include:
sequence = True #@param {type:"boolean"}
subcellular_location = False #@param {type:"boolean"}
protein_name = True #@param {type:"boolean"}
gene_names = True #@param {type:"boolean"}
organism_name = True #@param {type:"boolean"}
interaction = False #@param {type:"boolean"}
#@markdown ---
#@markdown Only Human proteins?
human_only = True #@param {type:"boolean"}

fields = ''
if subcellular_location:
  fields += '%2Ccc_subcellular_location'
if sequence:
  fields += '%2Csequence'
if protein_name:
  fields += '%2Cprotein_name'
if gene_names:
  fields += '%2Cgene_names'
if organism_name:
  fields += '%2Corganism_name'
if interaction:
  fields += '%2Ccc_interaction'

if human_only:
  include_human = 'organism_id%3A9606%29%20AND%20%28'
else:
  include_human = ''




In [6]:
query_url = f"https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession\
{fields}&format=tsv&query=%28%28{include_human}reviewed%3Atrue%29%20AND%20%28length%3A%5B\
{protein_min_length}%20TO%20{protein_max_length}%5D%29%29"

uniprot_request = requests.get(query_url)

bio = BytesIO(uniprot_request.content)

df = pd.read_csv(bio, compression='gzip', sep='\t')
df

| | Entry | Sequence | Protein names | Gene Names | Organism |
|---|---|---|---|---|---|
| 0 | A0A0B4J2F0 | MFRRLTFAQLLFATVLGIAGGVYIFQPVFEQYAKDQKELKEKMQLV... | Protein PIGBOS1 (PIGB opposite strand protein 1) | PIGBOS1 | Homo sapiens (Human) |
| 1 | A0A0C5B5G6 | MRWQEMGYIFYPRKLR | Mitochondrial-derived peptide MOTS-c (Mitochon... | MT-RNR1 | Homo sapiens (Human) |
| 2 | A0A0U1RRE5 | MGDQPCASGRSTLPPGNAREAKPPKKRCLLAPRWDYPEGTPNGGST... | Negative regulator of P-body association (P-bo... | NBDY LINC01420 | Homo sapiens (Human) |
| 3 | A1L190 | MDDADPEERNYDNMLKMLSDLNKDLEKLLEEMEKISVQATWMAYDM... | Synaptonemal complex central element protein 3... | SYCE3 C22orf41 THEG2 | Homo sapiens (Human) |
| 4 | A8MT69 | MEGAGAGSGFRKELVSRLLHLHFKDDKTKVSGDALQLMVELLKVFV... | Centromere protein X (CENP-X) (FANCM-associate... | CENPX FAAP10 MHF2 STRA13 | Homo sapiens (Human) |
| ... | ... | ... | ... | ... | ... |
| 757 | Q9UI25 | MEEMSYGENSGTHVGSFSCSPQPSQQMKVLFVGNSFLLTPVLHRQP... | Putative uncharacterized protein PRO0461 | PRO0461 | Homo sapiens (Human) |
| 758 | Q9UI54 | MESPKCLYSRITVNTAFGTKFSHISFIILFKVFLFPRITISKKTKL... | Putative uncharacterized protein PRO0628 | PRO0628 | Homo sapiens (Human) |
| 759 | Q9UI72 | MGMALELYWLCGFRSYWPLGTNAENEGNRKENRRQMQSRNERGCNV... | Putative uncharacterized protein PRO0255 | PRO0255 | Homo sapiens (Human) |
| 760 | Q9Y3F1 | MSLLWTPQILTISFVSYILSLFPSPFPSCYTSCWFETSITTEKELN... | Putative TAP2-associated 6.5 kDa polypeptide | | Homo sapiens (Human) |
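
The percent-encoded query string used for the UniProt request is hard to read and edit by hand. As a sketch of an alternative, the query can be written in plain UniProt syntax and encoded with `urllib.parse.quote`. The helper name `build_uniprot_url` and its arguments are hypothetical and not part of CafChem; the endpoint, field names, and query clauses are the ones used in this notebook.

```python
from urllib.parse import quote

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"

def build_uniprot_url(min_len, max_len, fields, human_only=True):
    """Build a UniProt stream URL from a readable query (hypothetical helper)."""
    clauses = []
    if human_only:
        clauses.append("(organism_id:9606)")  # 9606 is the taxon ID used above for human
    clauses.append("(reviewed:true)")
    clauses.append(f"(length:[{min_len} TO {max_len}])")
    query = "(" + " AND ".join(clauses) + ")"
    # quote() percent-encodes parentheses, colons, brackets, commas and spaces
    return (f"{UNIPROT_STREAM}?compressed=true"
            f"&fields={quote(','.join(fields))}"
            f"&format=tsv&query={quote(query)}")

example_url = build_uniprot_url(10, 100, ["accession", "sequence"])
print(example_url)
```

The resulting string can be passed to `requests.get` exactly as `query_url` is. Keeping the query readable makes it easier to add or remove clauses without counting `%28` and `%29` pairs.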


In [None]:
df.to_csv('protein_data_human_10_500.csv', index = False)

### Filter data

In [6]:
#df = pd.read_csv('/content/protein_data_human_10_500.csv')

In [7]:
df_length = len(df)
print(f'There are {df_length} rows in the dataframe')

df.dropna(inplace=True, subset=['Sequence'])
df_length = len(df)
print(f'There are {df_length} rows in the dataframe after dropping missing sequences')

df.drop_duplicates(inplace=True, subset=['Sequence'])
df_length = len(df)
print(f'There are {df_length} rows in the dataframe after removing duplicates')

df.reset_index(drop=True, inplace=True)
df

There are 762 rows in the dataframe
There are 762 rows in the dataframe after dropping missing sequences
There are 751 rows in the dataframe after removing duplicates


| | Entry | Sequence | Protein names | Gene Names | Organism |
|---|---|---|---|---|---|
| 0 | A0A0B4J2F0 | MFRRLTFAQLLFATVLGIAGGVYIFQPVFEQYAKDQKELKEKMQLV... | Protein PIGBOS1 (PIGB opposite strand protein 1) | PIGBOS1 | Homo sapiens (Human) |
| 1 | A0A0C5B5G6 | MRWQEMGYIFYPRKLR | Mitochondrial-derived peptide MOTS-c (Mitochon... | MT-RNR1 | Homo sapiens (Human) |
| 2 | A0A0U1RRE5 | MGDQPCASGRSTLPPGNAREAKPPKKRCLLAPRWDYPEGTPNGGST... | Negative regulator of P-body association (P-bo... | NBDY LINC01420 | Homo sapiens (Human) |
| 3 | A1L190 | MDDADPEERNYDNMLKMLSDLNKDLEKLLEEMEKISVQATWMAYDM... | Synaptonemal complex central element protein 3... | SYCE3 C22orf41 THEG2 | Homo sapiens (Human) |
| 4 | A8MT69 | MEGAGAGSGFRKELVSRLLHLHFKDDKTKVSGDALQLMVELLKVFV... | Centromere protein X (CENP-X) (FANCM-associate... | CENPX FAAP10 MHF2 STRA13 | Homo sapiens (Human) |
| ... | ... | ... | ... | ... | ... |
| 746 | Q9UI25 | MEEMSYGENSGTHVGSFSCSPQPSQQMKVLFVGNSFLLTPVLHRQP... | Putative uncharacterized protein PRO0461 | PRO0461 | Homo sapiens (Human) |
| 747 | Q9UI54 | MESPKCLYSRITVNTAFGTKFSHISFIILFKVFLFPRITISKKTKL... | Putative uncharacterized protein PRO0628 | PRO0628 | Homo sapiens (Human) |
| 748 | Q9UI72 | MGMALELYWLCGFRSYWPLGTNAENEGNRKENRRQMQSRNERGCNV... | Putative uncharacterized protein PRO0255 | PRO0255 | Homo sapiens (Human) |
| 749 | Q9Y3F1 | MSLLWTPQILTISFVSYILSLFPSPFPSCYTSCWFETSITTEKELN... | Putative TAP2-associated 6.5 kDa polypeptide | | Homo sapiens (Human) |


In [8]:
seqs = df['Sequence'].tolist()

chars_list = []
lengths_list = []
for seq in seqs:
  lengths_list.append(len(seq))
  for char in seq:
    chars_list.append(char)

ave_length = sum(lengths_list)/len(lengths_list)
print(f'The average sequence length is {ave_length}, minimum is: {min(lengths_list)} and maximum is: {max(lengths_list)}')

chars_set = set(chars_list)
print(f'There are {len(list(chars_set))} unique tokens in the sequences')
print('They are ========================================================')
print(chars_set)

The average sequence length is 75.22503328894807, minimum is: 10 and maximum is: 100
There are 21 unique tokens in the sequences
{'I', 'G', 'N', 'V', 'H', 'Q', 'K', 'S', 'M', 'C', 'R', 'P', 'T', 'D', 'L', 'W', 'U', 'F', 'Y', 'E', 'A'}
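
The set above has 21 members because it includes `U` (selenocysteine) alongside the 20 standard residues. A set only shows which residues occur, not how often. As a minimal sketch, `collections.Counter` gives per-residue counts together with the length statistics. The function name `residue_stats` is illustrative; the first example sequence is taken from the table above, and the second (`MKU`) is a made-up toy sequence.

```python
from collections import Counter

def residue_stats(seqs):
    """Length statistics and per-residue counts for a list of sequences."""
    lengths = [len(s) for s in seqs]
    counts = Counter("".join(seqs))  # one pass over all residues
    return {
        "mean_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "counts": counts,
    }

# "MRWQEMGYIFYPRKLR" is MOTS-c from the dataframe; "MKU" is a toy example
stats = residue_stats(["MRWQEMGYIFYPRKLR", "MKU"])
print(stats["counts"].most_common(3))
```

In the notebook this could be called as `residue_stats(df['Sequence'].tolist())` to see how rare residues such as `U` are before training.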


## Prepare dataset

In [9]:
fx, fy, VOCAB_SIZE, tokenizer, max_length = make_datasets(df)

100 10
Vocabulary size for this dataset:  25
Number of features and datapoints, targets:  (751, 98) (751, 98)
featurization done with:  SMILES Tokenizer
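
`make_datasets` returns features and targets with the same shape, (751, 98). How CafChem builds them is not shown in this notebook. A common construction for next-token training, assumed in this sketch, is to shift each padded token row by one position so that the target at position `t` is the input token at position `t + 1`. The function name and the toy token IDs are hypothetical.

```python
def make_shifted_pairs(token_rows):
    """Split padded token rows into (inputs, targets) offset by one position.

    Assumption: this mirrors the usual autoregressive setup; it is not
    necessarily what CafChem's make_datasets does internally.
    """
    inputs = [row[:-1] for row in token_rows]   # drop the last token
    targets = [row[1:] for row in token_rows]   # drop the first token
    return inputs, targets

# Toy row: 0 = start token, 22 = end token, 23 = padding (illustrative IDs)
toy_rows = [[0, 1, 19, 17, 22, 23]]
x, y = make_shifted_pairs(toy_rows)
print(x[0])
print(y[0])
```

Under this scheme a row of length `n` yields inputs and targets of length `n - 1`, and every target is the token the model should predict from the inputs seen so far.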


### Test tokenizer

In [10]:
i=22
prot = df['Sequence'].iloc[i]
print(prot)
prot = ''
for char in df['Sequence'].iloc[i]:
  prot += char + ' '
print(prot)
tokens = tokenizer.encode(prot)
print(tokens)
back = tokenizer.decode(tokens)
back = back.replace(' ','').replace('[CLS]','').replace('[SEP]','')
print(back)

MWFEILPGLSVMGVCLLIPGLATAYIHRFTNGGKEKRVAHFGYHWSLMERDRRISGVDRYYVSKGLENID
M W F E I L P G L S V M G V C L L I P G L A T A Y I H R F T N G G K E K R V A H F G Y H W S L M E R D R R I S G V D R Y Y V S K G L E N I D 
[0, 1, 19, 17, 6, 16, 20, 13, 12, 20, 7, 15, 1, 12, 15, 11, 20, 20, 16, 13, 12, 20, 14, 8, 14, 18, 16, 3, 2, 17, 8, 9, 12, 12, 4, 6, 4, 2, 15, 14, 3, 17, 12, 18, 3, 19, 7, 20, 1, 6, 2, 5, 2, 2, 16, 7, 12, 15, 5, 2, 18, 18, 15, 7, 4, 12, 20, 6, 9, 16, 5, 22]
MWFEILPGLSVMGVCLLIPGLATAYIHRFTNGGKEKRVAHFGYHWSLMERDRRISGVDRYYVSKGLENID
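
The round trip above shows the tokenizer working at the level of single residues, with a start token (0) and an end token (22) added around the sequence. To make that idea concrete, here is a small stand-in character-level tokenizer written with plain dictionaries. It is not the tokenizer that `make_datasets` downloads, and its IDs will not match the ones printed above; the special-token names follow the `[CLS]` and `[SEP]` strings that are stripped in the cell.

```python
SPECIALS = ["[PAD]", "[CLS]", "[SEP]"]

def build_vocab(seqs):
    """Map special tokens and every residue seen in seqs to integer IDs."""
    residues = sorted(set("".join(seqs)))
    return {tok: i for i, tok in enumerate(SPECIALS + residues)}

def encode(seq, vocab, max_length):
    """[CLS] + residues + [SEP], truncated or padded to max_length."""
    ids = [vocab["[CLS]"]] + [vocab[c] for c in seq] + [vocab["[SEP]"]]
    ids = ids[:max_length]
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))

def decode(ids, vocab):
    """Invert encode, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in SPECIALS)

example = "MRWQEMGYIFYPRKLR"  # MOTS-c, from the dataframe above
vocab = build_vocab([example])
assert decode(encode(example, vocab, 20), vocab) == example
```

When a tokenizer expects space-separated residues, as in the cell above, `" ".join(seq)` builds the same string as the character loop without the trailing space.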


## Make and train GPT

In [11]:
gpt = make_gpt(num_blocks=4, max_length=max_length, VOCAB_SIZE=VOCAB_SIZE, model_dimension=256, aheads = 4)

In [12]:
batch_size = 1024
gpt.compile("adamax",loss=[tf.keras.losses.SparseCategoricalCrossentropy(),None])
gpt.fit(fx,fy,epochs = 50, batch_size = batch_size, initial_epoch=0)

Epoch 1/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 48s 48s/step - loss: 4.3658
Epoch 2/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 146ms/step - loss: 5.0119
Epoch 3/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 144ms/step - loss: 3.8862
Epoch 4/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 143ms/step - loss: 3.4466
Epoch 5/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 143ms/step - loss: 2.8974
Epoch 6/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 145ms/step - loss: 2.6868
Epoch 7/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 144ms/step - loss: 2.6257
Epoch 8/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 144ms/step - loss: 2.5743
Epoch 9/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 143ms/step - loss: 2.5114
Epoch 10/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 143ms/step - loss: 2.4440
...

<keras.src.callbacks.history.History at 0x79f6e505b1a0>
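
To put the loss values in context: sparse categorical cross-entropy is the mean negative log-likelihood of the correct token in nats, so `exp(loss)` can be read as a perplexity, and a model guessing uniformly over the 25-token vocabulary would score `ln(25)`. The helper names below are illustrative. The comparison is only rough, since the reported loss may also average over padding positions.

```python
import math

def perplexity(loss):
    """Perplexity implied by a cross-entropy loss measured in nats."""
    return math.exp(loss)

def uniform_baseline_loss(vocab_size):
    """Cross-entropy of a model that assigns equal probability to every token."""
    return math.log(vocab_size)

baseline = uniform_baseline_loss(25)      # vocabulary size reported above
epoch_10 = perplexity(2.4440)             # loss logged at epoch 10
print(f"uniform baseline loss: {baseline:.3f}")
print(f"perplexity at epoch 10: {epoch_10:.2f}")
```

By epoch 10 the loss has fallen below the uniform baseline, which suggests the model has started to learn residue frequencies rather than guessing blindly.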

## Save model

In [14]:
save_gpt(gpt,'test_Jan15', 25, 100)

New layer names:


model saved with name: test_Jan15.
model parameters saved in file: layer_store_test_Jan15.


## Load Model

In [15]:
gpt = load_gpt('test_Jan15',4, 256, 4)

input_layer_original
token_and_position_embedding_original
transformer_block_original
transformer_block_1_original
transformer_block_2_original
transformer_block_3_original
dense_8_original


input_layer_original has been named!
token_and_position_embedding_original has been named!
transformer_block_original has been named!
transformer_block_1_original has been named!
transformer_block_2_original has been named!
transformer_block_3_original has been named!
dense_8_original has been named!
model loaded with name: test_Jan15.
vocab_size: 25
max_length: 100


## Finetune a model
- Upload a model
- Add new layers for finetuning

In [21]:
ft_gpt = make_finetune_gpt('test_Jan15',old_layers=4, model_dimension=256, aheads=4, num_new_blocks=1, freeze_old_layers=True)

Reading in layers:
input_layer_original
token_and_position_embedding_original
transformer_block_original
transformer_block_1_original
transformer_block_2_original
transformer_block_3_original
dense_8_original
vocab_size: 25
max_length: 100


input_layer_original has been named!
token_and_position_embedding_original has been named!
transformer_block_original has been named!
transformer_block_1_original has been named!
transformer_block_2_original has been named!
transformer_block_3_original has been named!
transformer_block_X_1 has been named!
setting layer input_layer_original untrainable.
setting layer token_and_position_embedding_original untrainable.
setting layer transformer_block_original untrainable.
setting layer transformer_block_1_original untrainable.
setting layer transformer_block_2_original untrainable.
setting layer transformer_block_3_original untrainable.
setting layer transformer_block_X_1 trainable.
setting layer dense_X trainable.



Layer 'key' expected 2 variables, but received 0 variables during loading. Expected: ['kernel', 'bias']

List of objects that could not be loaded:
[<EinsumDense name=key, built=True>, <EinsumDense name=attention_output, built=True>, <EinsumDense name=query, built=True>, <EinsumDense name=value, built=True>, <Dense name=dense_26, built=True>, <Dense name=dense_27, built=True>, <LayerNormalization name=layer_normalization_24, built=True>, <LayerNormalization name=layer_normalization_25, built=True>]


In [24]:
batch_size = 1024
ft_gpt.compile("adamax",loss=[tf.keras.losses.SparseCategoricalCrossentropy(),None])
ft_gpt.fit(fx,fy,epochs = 30, batch_size = batch_size, initial_epoch=15)

Epoch 16/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 30s 30s/step - loss: 2.2962
Epoch 17/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 176ms/step - loss: 3.5046
Epoch 18/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 170ms/step - loss: 2.7901
Epoch 19/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 174ms/step - loss: 2.5237
Epoch 20/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 169ms/step - loss: 2.4484
Epoch 21/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 169ms/step - loss: 2.3716
Epoch 22/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 169ms/step - loss: 2.3243
Epoch 23/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 170ms/step - loss: 2.3223
Epoch 24/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 171ms/step - loss: 2.3302
Epoch 25/30
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 171ms/step - loss: 2.3213
...

<keras.src.callbacks.history.History at 0x79f608af5fa0>

In [23]:
ft_gpt = unfreeze_gpt(ft_gpt)

setting layer input_layer_original trainable.
setting layer token_and_position_embedding_original trainable.
setting layer transformer_block_original trainable.
setting layer transformer_block_1_original trainable.
setting layer transformer_block_2_original trainable.
setting layer transformer_block_3_original trainable.
setting layer transformer_block_X_1 trainable.
setting layer dense_X trainable.


## Inference

In [16]:
import random  # explicit import for random.choice below

residues = ['M', 'N', 'P', 'F', 'H', 'E', 'D', 'G', 'S', 'W', 'I', 'A', 'K', 'T', 'Q', 'C', 'L', 'R', 'V', 'Y']
prompts = []
prompt_length = 2
num_prompts = 20
for i in range(num_prompts):
  prompt = ''
  for j in range(prompt_length):
    prompt += random.choice(residues)
  prompts.append(prompt)

print(prompts)

['YM', 'GK', 'RS', 'RM', 'EM', 'KC', 'IR', 'AF', 'CP', 'CL', 'CI', 'VL', 'MT', 'SP', 'LI', 'VY', 'YV', 'EH', 'FI', 'FK']
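
The prompts above change on every run because the global random generator is unseeded. If reproducible prompts are wanted, one option is a dedicated `random.Random` instance with a fixed seed. This is a sketch; the function name `make_prompts` and the `seed` argument are hypothetical additions, and the residue alphabet is the same 20 standard amino acids listed in the cell.

```python
import random

RESIDUES = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def make_prompts(num_prompts, prompt_length, seed=None):
    """Return num_prompts random residue strings, reproducible when seed is set."""
    rng = random.Random(seed)  # local generator; leaves global random state alone
    return ["".join(rng.choices(RESIDUES, k=prompt_length))
            for _ in range(num_prompts)]

example_prompts = make_prompts(20, 2, seed=0)
print(example_prompts)
```

Calling `make_prompts(20, 2, seed=0)` twice returns the same list, which makes it easier to compare generations from different models or temperatures on identical prompts.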


In [18]:
new_proteins = gen_proteins(prompts,False,gpt,tokenizer,0.5,VOCAB_SIZE, 50)

[[ 0 18  1]
 [ 0 12  4]
 [ 0  2  7]
 [ 0  2  1]
 [ 0  6  1]
 [ 0  4 11]
 [ 0 16  2]
 [ 0 14 17]
 [ 0 11 13]
 [ 0 11 20]
 [ 0 11 16]
 [ 0 15 20]
 [ 0  1  8]
 [ 0  7 13]
 [ 0 20 16]
 [ 0 15 18]
 [ 0 18 15]
 [ 0  6  3]
 [ 0 17 16]
 [ 0 17  4]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step
using variable temp generation with 0.5.
(20, 4)
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step
using variable temp generation with 0.5.
(20, 5)
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step
using variable temp generation with 0.5.
(20, 6)
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step
using variable temp generation with 0.5.
(20, 7)
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step
using variable temp generation with 0.5.
(20, 8)
1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step
using variable temp generation with 0.5.
(20, 9)
...

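The log shows one token being appended to each of the 20 prompts per step, sampled at a temperature of 0.5. How `gen_proteins` applies the temperature internally is not shown here. The sketch below illustrates the standard technique, assuming the model exposes raw scores (logits): dividing them by the temperature before the softmax sharpens the distribution when the temperature is below 1 and flattens it when above 1. All names are illustrative.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.0]                 # made-up scores for three tokens
rng = random.Random(0)
print(softmax_with_temperature(toy_logits, 0.5))
print(sample_token(toy_logits, 0.5, rng))
```

Lower temperatures concentrate probability on the most likely token, so a low setting tends to favour whichever residues the model already rates highest.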
In [19]:
gen_lengths = []
gen_diversity = []
for i,seq in enumerate(new_proteins):
  gen_lengths.append(len(seq))
  gen_diversity.append(len(set(seq)))
  print(seq)
  print(f'protein {i} has a length of {gen_lengths[-1]} and contains {gen_diversity[-1]} different residues.')

print('========================================================================================================')
print(f'The average sequence length is {sum(gen_lengths)/len(gen_lengths)}, minimum is: {min(gen_lengths)} and maximum is: {max(gen_lengths)}')
print(f'The average diversity is {sum(gen_diversity)/len(gen_diversity)}, minimum is: {min(gen_diversity)} and maximum is: {max(gen_diversity)}')

all_residues = []
for seq in new_proteins:
  for char in seq:
    all_residues.append(char)

all_residues = set(all_residues)
print(f'There are {len(list(all_residues))} unique tokens in the sequences')
print('They are ========================================================')
print(all_residues)

YMLLLLLVLTLGETLLLGVAILLLFRFLLLLKGGNSLFLLKYLAAALQLL
protein 0 has a length of 50 and contains 15 different residues.
GKYSHLLQDELLLPLNQNYFLGSAPCLCTCKLATGASESVALSGLILLLA
protein 1 has a length of 50 and contains 17 different residues.
RSRNEENDQHGRTTRLAAQGAEGNFVPDPQKPSYVLLSLAAFLLSKLLED
protein 2 has a length of 50 and contains 16 different residues.
RMILLLLLLLRSLLLGLLYSRLLLLLLLRNIRALLSELLVVVSLELILHH
protein 3 has a length of 50 and contains 12 different residues.
EMKNLTILLLLLLLLLLLLLALLSALVSLSYCLCLCGAAGSVSHNLAASK
protein 4 has a length of 50 and contains 14 different residues.
KCSDPRKAADPPKLDSTALSEESPSCGVGLLLLLDAGTTEKIELRPQLQS
protein 5 has a length of 50 and contains 14 different residues.
IRLSLLLLLLLLLLLLLLLLEGTALLVLLLRLSLLLSSALLQAELLQYPI
protein 6 has a length of 50 and contains 12 different residues.
AFSQLLSSLQQLKLQSLLLLAEYKEAYAVLLLLLLLLTTAVLLLLLLLLV
protein 7 has a length of 50 and contains 10 different residues.
CPQIILLLLLETLLLLLTVLAEALLKTVILLLELLLLSSLLVRRLVDLLN
protein 8 has a lengt
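
Counting distinct residues does not capture how heavily a single residue dominates a sequence, and several of the generated proteins above contain long leucine runs. As a sketch of two additional metrics, the function below reports the most common residue, its fraction of the sequence, and the longest run of identical residues. The function name is hypothetical; the example sequence is one of the generated proteins from this notebook.

```python
from collections import Counter
from itertools import groupby

def composition_summary(seq):
    """Most common residue, its fraction, and the longest homopolymer run."""
    residue, count = Counter(seq).most_common(1)[0]
    # groupby yields one group per run of consecutive identical characters
    longest_run = max(len(list(group)) for _, group in groupby(seq))
    return {
        "most_common": residue,
        "fraction": count / len(seq),
        "longest_run": longest_run,
    }

summary = composition_summary("VYVPPACCNTEPKPPC")
print(summary)
```

Applied to every entry of `new_proteins`, these numbers make it easy to flag sequences dominated by a single residue before any downstream analysis.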

In [20]:
print('[')
for i,seq in enumerate(new_proteins):
  print(f'"{seq}",')
print(']')


[
"YMLLLLLVLTLGETLLLGVAILLLFRFLLLLKGGNSLFLLKYLAAALQLL",
"GKYSHLLQDELLLPLNQNYFLGSAPCLCTCKLATGASESVALSGLILLLA",
"RSRNEENDQHGRTTRLAAQGAEGNFVPDPQKPSYVLLSLAAFLLSKLLED",
"RMILLLLLLLRSLLLGLLYSRLLLLLLLRNIRALLSELLVVVSLELILHH",
"EMKNLTILLLLLLLLLLLLLALLSALVSLSYCLCLCGAAGSVSHNLAASK",
"KCSDPRKAADPPKLDSTALSEESPSCGVGLLLLLDAGTTEKIELRPQLQS",
"IRLSLLLLLLLLLLLLLLLLEGTALLVLLLRLSLLLSSALLQAELLQYPI",
"AFSQLLSSLQQLKLQSLLLLAEYKEAYAVLLLLLLLLTTAVLLLLLLLLV",
"CPQIILLLLLETLLLLLTVLAEALLKTVILLLELLLLSSLLVRRLVDLLN",
"CLFLLGLLEPPKCCNLLLNGSELLLLALHVLLLALLLACKL",
"CIGGAALLVSALTGLLSAALLLLLLLVPCRLLLLLFLLGLLLLLLLLLHL",
"VLLLLLLLLLLALLLLLLLLLLASLLLLLLLLALSCLLLLLEGNIPRLLL",
"MTVVLVVDGLLVLLLLTLLSLVSLLLAELDGLLAAAPEARRAFLAIQELL",
"SPPKSLLLALLALLLLKDLLGLLLLLNRFTPVNGCHLLAQLLSQLLFLLL",
"LILPLLLLFTAPPEYFLLLLLLGKELLALLLACAVKPDKEKLTEPETIFC",
"VYVPPACCNTEPKPPC",
"YVVVGLLKLPLNEREEDLLLLRNGAIAALL",
"EHKEVVAVRLLRYLAALLTLLVPWLLLNLRLLLVLLLKLKLLAIFLPVLL",
"FIQPTAGFLLTVLGALEGLLCPQVATEELLCAPICCVKLISAFAPTALLL",
"FKLTSLLLLLLLLLLLLLKLGLLDLRLLIRLMLL