# Use Amazon SageMaker for Parameter-Efficient Fine Tuning of the ESM-2 Protein Language Model

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

Note: We recommend running this notebook on an **ml.t3.medium** instance with the **Data Science 3.0** image.

### What is a Protein?

Proteins are complex molecules that are essential for life. The shape and structure of a protein determines what it can do in the body. Knowing how a protein is folded and how it works helps scientists design drugs that target it. For example, if a protein causes disease, a drug might be made to block its function. The drug needs to fit into the protein like a key in a lock. Understanding the protein's molecular structure reveals where drugs can attach. This knowledge helps drive the discovery of innovative new drugs.

![Proteins are made up of long chains of amino acids](img/protein.png)

### What is a Protein Language Model?

Proteins are made up of linear chains of molecules called amino acids, each with its own chemical structure and properties. If we think of each amino acid in a protein like a word in a sentence, it becomes possible to analyze them using methods originally developed for analyzing human language. Scientists have trained these so-called "Protein Language Models", or pLMs, on millions of protein sequences from thousands of organisms. With enough data, these models can begin to capture the underlying evolutionary relationships between different amino acid sequences.

It can take a lot of time and compute to train a pLM from scratch for a certain task. For example, a team at Tsinghua University [recently described](https://www.biorxiv.org/content/10.1101/2023.07.05.547496v3) training a 100-billion-parameter pLM on 768 A100 GPUs for 164 days! Fortunately, in many cases we can save time and resources by adapting an existing pLM to our needs. This technique is called "fine-tuning" and also allows us to borrow advanced tools from other types of language modeling.

### What is ESM-2?

[ESM-2](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1) is a pLM trained using unsupervised masked language modeling on 250 million protein sequences by researchers at [Facebook AI Research (FAIR)](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1). It is available in several sizes, ranging from 8 million to 15 billion parameters. The smaller models are suitable for various sequence and token classification tasks. The FAIR team also adapted the 3 billion parameter version into the ESMFold protein structure prediction algorithm. They have since used ESMFold to predict the structure of [more than 700 million metagenomic proteins](https://esmatlas.com/about).
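To make the masked language modeling objective concrete, here is a minimal sketch in plain Python. It hides a few residues in a short sequence and records them as the targets a model would be asked to recover. The 15% mask rate, the `<mask>` placeholder, and the example sequence are assumptions for illustration, and the sketch leaves out the model itself.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

MASK = "<mask>"  # placeholder token (assumed name)
MASK_FRACTION = 0.15  # assumed mask rate, for illustration only

sequence = list("MSNGHVKFDADESQASASAV")  # short example sequence

# Choose which positions to hide
n_masked = max(1, round(MASK_FRACTION * len(sequence)))
masked_positions = sorted(random.sample(range(len(sequence)), n_masked))

# The model sees the masked sequence and is trained to recover the targets
model_input = [MASK if i in masked_positions else aa for i, aa in enumerate(sequence)]
targets = {i: sequence[i] for i in masked_positions}

print(" ".join(model_input))
print(targets)
```

Because the targets come from the sequence itself, no human-provided labels are needed, which is why this kind of training is described as unsupervised.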

ESM-2 is a powerful pLM. However, it has traditionally required multiple A100 GPU chips to fine-tune. In this notebook, we demonstrate how to use QLoRA to fine-tune ESM-2 on an inexpensive Amazon SageMaker training instance. We will use ESM-2 to predict [subcellular localization](https://academic.oup.com/nar/article/50/W1/W228/6576357). Understanding where proteins appear in cells can help us understand their role in disease and find new drug targets.

---
## 1. Set up environment

In [None]:
%pip install -U --disable-pip-version-check  --no-warn-conflicts -r notebook-requirements.txt

Load the sagemaker package and create some service clients

In [None]:
import boto3
from datasets import Dataset
import os
import pandas as pd
import random
import sagemaker
from sagemaker.experiments.run import Run
from sagemaker.huggingface import HuggingFace, HuggingFaceModel
from sagemaker.inputs import TrainingInput
from time import strftime
from transformers import AutoTokenizer

boto_session = boto3.session.Session()
sagemaker_session = sagemaker.session.Session(boto_session)
S3_BUCKET = sagemaker_session.default_bucket()
s3 = boto_session.client("s3")
sagemaker_client = boto_session.client("sagemaker")
sagemaker_execution_role = sagemaker.session.get_execution_role(sagemaker_session)
REGION_NAME = sagemaker_session.boto_region_name
print(f"Assumed SageMaker role is {sagemaker_execution_role}")

S3_PREFIX = "esm-loc-ft"
S3_PATH = sagemaker.s3.s3_path_join("s3://", S3_BUCKET, S3_PREFIX)
print(f"S3 path is {S3_PATH}")

EXPERIMENT_NAME = "esm-loc-ft-" + strftime("%Y-%m-%d-%H-%M-%S")
print(f"Experiment name is {EXPERIMENT_NAME}")

---
## 2. Build Dataset

We'll use a version of the [DeepLoc-2 data set](https://services.healthtech.dtu.dk/services/DeepLoc-2.0/) to fine-tune our localization model. It consists of several thousand protein sequences, each with one or more experimentally-observed location labels. This data was extracted by the DeepLoc team at Technical University of Denmark from the public [UniProt sequence database](https://www.uniprot.org/).

In [12]:
df = pd.read_csv(
    "https://services.healthtech.dtu.dk/services/DeepLoc-2.0/data/Swissprot_Train_Validation_dataset.csv"
).drop(["Unnamed: 0", "Partition"], axis=1)
df["Membrane"] = df["Membrane"].astype("int32")

# Filter for sequences between 100 and 512 amino acids
df = df[df["Sequence"].str.len().between(100, 512)]

# Remove unnecessary features
df = df[["Sequence", "Kingdom", "Membrane"]]

display(df)
print(df.groupby("Kingdom").size())
print()
print(df.groupby("Membrane").size())

Kingdom
Fungi             605
Metazoa          1711
Other              74
Viridiplantae     610
dtype: int64

Membrane
0    1584
1    1416
dtype: int64


| | Sequence | Kingdom | Membrane |
| ----------- | ----------- | ----------- | ----------- |
| 0 | MTLNGGSGASGSRGAGGRERDRRRGSTPWGPAPPLHRRSMPVDERD... | Metazoa | 1 |
| 1 | MSLINEHCNERNYISTPNSSEDLSSPQNCGLDEGASASSSSTINSD... | Viridiplantae | 0 |
| 2 | MGRPEFNRGGGGGGFRGGRGGDRGGSRGGFGGGGRGGYGGGDRGSF... | Metazoa | 0 |
| 3 | MILSNTTAVTPFLTKLWQETVQQGGNMSGLARRSPRSSDGKLEALY... | Metazoa | 1 |
| 4 | MEAMGEWSNNLGGMYTYATEEADFMNQLLASYDHPGTGSSSGAAAS... | Viridiplantae | 0 |
| ... | ... | ... | ... |
| 2995 | MRRRVFSSQDWRASGWDGMGFFSRRTFCGRSGRSCRGQLVQVSRPE... | Metazoa | 1 |
| 2996 | MAGYATTPSPMQTLQEEAVCAICLDYFKDPVSISCGHNFCRGCVTQ... | Metazoa | 0 |
| 2997 | MASIDSLQFHSLCNLQSSIGRAKLQNPSSLVIFRRRPVNLNWVQFE... | Viridiplantae | 1 |
| 2998 | MHNLFLYSVVFSLGLVSFITCFAAEFKRTQKEDIRWDTERNCYVPG... | Viridiplantae | 1 |


This looks good, but the two membrane classes are unbalanced. Let's resample to bring them closer together.

In [43]:
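# Resample rows with weights inversely proportional to the size of each Membrane class.
# Note: when n equals the number of rows, sampling without replacement only
# shuffles the rows and leaves the class counts unchanged. Passing replace=True
# would let the weights change the class balance.
weights = 1.0 / df.groupby("Membrane")["Membrane"].transform("count")
df = df.sample(n=3000, weights=weights).reset_index(drop=True)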

# Visualize data
print(df.groupby("Kingdom").size())
print()
print(df.groupby("Membrane").size())

Kingdom
Fungi             605
Metazoa          1711
Other              74
Viridiplantae     610
dtype: int64

Membrane
0    1584
1    1416
dtype: int64
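The idea behind inverse-frequency weights can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's own cell: the label list is synthetic (it only mirrors the class counts printed above), and the draw is made with replacement, which is what allows the weights to shift the class balance when the sample is as large as the data set.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic labels with the same imbalance as the counts printed above
labels = [0] * 1584 + [1] * 1416

# Inverse-frequency weights: each class receives the same total weight
class_counts = Counter(labels)
weights = [1.0 / class_counts[label] for label in labels]

# Draw with replacement so that the weights can change the class mix
resampled = random.choices(labels, weights=weights, k=len(labels))
resampled_counts = Counter(resampled)

print("before:", dict(class_counts))
print("after: ", dict(resampled_counts))
```

With these weights each draw picks either class with probability 0.5, so the resampled counts land close to an even split rather than exactly on it.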


Next, we tokenize the sequences and trim them to a max length of 512 amino acids.

In [13]:
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, shuffle=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")


def preprocess_data(examples, max_length=512):
    text = examples["Sequence"]
    encoding = tokenizer(text, truncation=True, max_length=max_length)
    encoding["labels"] = examples["Membrane"]
    return encoding


encoded_dataset = dataset.map(
    preprocess_data,
    batched=True,
    num_proc=os.cpu_count(),
    remove_columns=dataset["train"].column_names,
)

encoded_dataset.set_format("torch")
print(encoded_dataset)

Map (num_proc=2):   0%|          | 0/2400 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/600 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2400
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 600
    })
})


Look at an example record

In [14]:
# randint includes both endpoints, so the upper bound must be len - 1
random_idx = random.randint(0, len(encoded_dataset["train"]) - 1)
example = encoded_dataset["train"][random_idx]

print(f"Viewing example record {random_idx}")
print(f"Raw sequence:\n{tokenizer.decode(example['input_ids'])}\n")
print(f"Tokenized sequence:\n{example['input_ids'].tolist()}\n")
print(f"Label:\n{example['labels']}")

Viewing example record 291
Raw sequence:
<cls> M S N G H V K F D A D E S Q A S A S A V T D R Q D D V L V I S K K D K E V H S S S D E E S D D D D A P Q E E G L H S G K S E V E S Q I T Q R E E A I R L E Q S Q L R S K R R K Q N E L Y A K Q K K S V N E T E V T D E V I A E L P E E L L K N I D Q K D E G S T Q Y S S S R H V T F D K L D E S D E N E E A L A K A I K T K K R K T L K N L R K D S V K R G K F R V Q L L S T T Q D S K T L P P K K E S S I I R S K D R W L N R K A L N K G <eos>

Tokenized sequence:
[0, 20, 8, 17, 6, 21, 7, 15, 18, 13, 5, 13, 9, 8, 16, 5, 8, 5, 8, 5, 7, 11, 13, 10, 16, 13, 13, 7, 4, 7, 12, 8, 15, 15, 13, 15, 9, 7, 21, 8, 8, 8, 13, 9, 9, 8, 13, 13, 13, 13, 5, 14, 16, 9, 9, 6, 4, 21, 8, 6, 15, 8, 9, 7, 9, 8, 16, 12, 11, 16, 10, 9, 9, 5, 12, 10, 4, 9, 16, 8, 16, 4, 10, 8, 15, 10, 10, 15, 16, 17, 9, 4, 19, 5, 15, 16, 15, 15, 8, 7, 17, 9, 11, 9, 7, 11, 13, 9, 7, 12, 5, 9, 4, 14, 9, 9, 4, 4, 15, 17, 12, 13, 16, 15, 13, 9, 6, 8, 11, 16, 19, 8, 8, 8, 10, 21, 7, 11, 18, 13, 15, 4,
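The relationship between the raw and tokenized views above can be reproduced by hand. The sketch below is a simplified stand-in for the real tokenizer, not its implementation: the residue ids are inferred from the example output, while the `<eos>` id of 2 and the ids for `W` and `C` (which do not appear in the visible part of the example) are assumptions based on the published ESM vocabulary order.

```python
# Special token ids: <cls> = 0 is visible in the example output above;
# <eos> = 2 is an assumption based on the ESM vocabulary order.
CLS_ID, EOS_ID = 0, 2

# Residues in vocabulary order, so that L -> 4, A -> 5, ..., C -> 23
RESIDUES = "LAGVSERTIDPKQNFYMHWC"
VOCAB = {aa: i + 4 for i, aa in enumerate(RESIDUES)}


def encode(sequence, max_length=512):
    """Map residues to ids, leaving room for the <cls> and <eos> tokens."""
    residues = sequence[: max_length - 2]
    return [CLS_ID] + [VOCAB[aa] for aa in residues] + [EOS_ID]


ids = encode("MSNGHVKFDA")  # first ten residues of the example record
print(ids)
```

Each amino acid becomes exactly one token, so the tokenized length is the sequence length plus the two special tokens, capped at `max_length`.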

Finally, we upload the processed training and test data to S3.

In [15]:
train_s3_uri = S3_PATH + "/data/train"
test_s3_uri = S3_PATH + "/data/test"

encoded_dataset["train"].save_to_disk(train_s3_uri)
encoded_dataset["test"].save_to_disk(test_s3_uri)

Saving the dataset (0/1 shards):   0%|          | 0/2400 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/600 [00:00<?, ? examples/s]

---
## 3. Train model in SageMaker

Let's try a few different approaches to improving the efficiency of our fine-tuning job.

### Gradient Accumulation

Gradient accumulation is a training technique that lets a model simulate training with a larger batch size. Typically, the batch size (the number of samples used to calculate the gradient in one training step) is limited by the GPU memory capacity. With gradient accumulation, the model calculates gradients on smaller batches and, instead of updating the weights right away, sums them over several batches. Once enough small batches have been processed to reach the target batch size, the optimizer performs a single update. This lets models train with an effectively bigger batch without exceeding the GPU memory limit. However, each small batch still needs its own forward and backward pass, so using many accumulation steps can slow down training. The aim is to maximize GPU usage while avoiding excessive slowdowns from too many accumulation steps.
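As a minimal sketch of this idea, the plain-Python example below uses a toy one-parameter model (an assumption for illustration, not the ESM-2 training loop). It shows that averaging the gradients of four micro-batches of two samples gives the same gradient as one batch of eight.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy data: eight (x, y) pairs, treated as one "large" batch
data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]
w = 0.5

# Reference: one gradient over the full batch of 8
full_batch_grad = mse_gradient(w, data)

# Accumulation: 4 micro-batches of 2, each scaled by 1 / accumulation_steps
micro_batch_size = 2
accumulation_steps = len(data) // micro_batch_size
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start : start + micro_batch_size]
    accumulated_grad += mse_gradient(w, micro_batch) / accumulation_steps

# Only now would the optimizer update w: once per 4 micro-batches
learning_rate = 0.001
w_updated = w - learning_rate * accumulated_grad

print(full_batch_grad, accumulated_grad)
```

Only one micro-batch needs to be held in memory at a time, which is where the memory saving comes from. The equality holds here because every micro-batch has the same size.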

### Gradient Checkpointing

Gradient checkpointing is a technique that reduces the memory needed during training while keeping the computational time reasonable. Large neural networks take up a lot of memory because they have to store all the intermediate values from the forward pass in order to calculate the gradients during the backward pass. This can cause memory issues. One solution is to not store these intermediate values, but then they have to be recalculated during the backward pass, which takes a lot of time. 

Gradient checkpointing provides a balanced approach. It saves only some of the intermediate values, called "checkpoints," and recalculates the others as needed. So it uses less memory than storing everything, but also less computation than recalculating everything. By strategically selecting which activations to checkpoint, gradient checkpointing enables large neural networks to be trained with manageable memory usage and computation time. This important technique makes it feasible to train very large models that would otherwise run into memory limitations.
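The trade-off can be illustrated with a toy chain of scalar "layers" in plain Python. This is a sketch under simplifying assumptions (sixteen `tanh` layers, one checkpoint every four layers) and not how a deep learning framework implements it. Both functions return the same gradient, but the checkpointed version holds fewer activations at once and pays for it by recomputing each segment.

```python
import math

# Sixteen toy scalar "layers": h_next = tanh(w * h)
weights = [0.6 + 0.05 * i for i in range(16)]


def layer(w, h):
    return math.tanh(w * h)


def layer_grad(w, h_in):
    """Derivative of tanh(w * h_in) with respect to h_in."""
    return w * (1.0 - math.tanh(w * h_in) ** 2)


def forward_inputs(segment_weights, h):
    """Run layers forward and return the input seen by each one."""
    inputs = []
    for w in segment_weights:
        inputs.append(h)
        h = layer(w, h)
    return inputs


def backprop(segment_weights, segment_inputs, grad):
    """Chain rule through one run of layers, last layer first."""
    for w, h_in in zip(reversed(segment_weights), reversed(segment_inputs)):
        grad *= layer_grad(w, h_in)
    return grad


def grad_store_all(x):
    """Baseline: keep every layer input in memory for the backward pass."""
    inputs = forward_inputs(weights, x)
    return backprop(weights, inputs, 1.0), len(inputs)


def grad_checkpointed(x, segment=4):
    """Keep one checkpoint per segment and recompute the rest on demand."""
    checkpoints = {}
    h = x
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = h  # store only the input of each segment
        h = layer(w, h)
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        segment_weights = weights[start : start + segment]
        inputs = forward_inputs(segment_weights, checkpoints[start])  # recompute
        grad = backprop(segment_weights, inputs, grad)
        peak = max(peak, len(checkpoints) + len(inputs))
    return grad, peak


full_grad, full_stored = grad_store_all(0.5)
ckpt_grad, ckpt_stored = grad_checkpointed(0.5)
print(full_grad, full_stored)
print(ckpt_grad, ckpt_stored)
```

In this toy setup the baseline keeps 16 values while the checkpointed version keeps at most 8 (four checkpoints plus one recomputed segment), at the cost of running each segment's forward pass a second time.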

### Low-Rank Adaptation of Large Language Models (LoRA)

Large language models like ESM-2 can contain billions of parameters that are expensive to train and run. [Researchers](https://arxiv.org/abs/2106.09685) developed a training method called Low-Rank Adaptation (LoRA) to make fine-tuning these huge models more efficient.

The key idea behind LoRA is that when fine-tuning a model for a specific task, you don't need to update all of the original parameters. Instead, LoRA adds pairs of small, low-rank matrices alongside selected weight matrices in the model, and the product of each pair is added to the output of the frozen layer. Only these smaller matrices are updated during fine-tuning, which is much faster and uses less memory. The original model parameters stay frozen.

After fine-tuning with LoRA, the small adapted matrices can be merged back into the original model. Or they can be kept separate if you want to quickly fine-tune the model for other tasks without forgetting previous ones. Overall, LoRA allows large language models to be efficiently adapted to new tasks at a fraction of the usual cost.
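The low-rank update and the merge step can be sketched with small matrices in plain Python. The sizes, the scaling factor `alpha`, and the random values below are assumptions for illustration. The training script used in this notebook relies on a library rather than code like this.

```python
import random

random.seed(0)


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


d_out, d_in, rank, alpha = 6, 6, 2, 4  # toy sizes; real layers are far larger

# Frozen pretrained weight W (d_out x d_in): never updated during fine-tuning
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Trainable low-rank factors: A is (rank x d_in), B is (d_out x rank).
# B starts at zero, so the adapted layer initially matches the frozen one.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]


def lora_forward(x):
    """h = W x + (alpha / rank) * B (A x)"""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def merge():
    """Fold the adapter into a single matrix: W + (alpha / rank) * B A."""
    scale = alpha / rank
    return [
        [W[i][j] + scale * sum(B[i][r] * A[r][j] for r in range(rank)) for j in range(d_in)]
        for i in range(d_out)
    ]


def lora_param_count(rows, cols, r):
    """Trainable parameters in one adapter pair."""
    return r * (rows + cols)


x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
initial_matches = lora_forward(x) == matvec(W, x)

# Stand-in for training: pretend the optimizer has moved B away from zero
B = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_out)]
merged_output = matvec(merge(), x)
max_diff = max(abs(a - b) for a, b in zip(lora_forward(x), merged_output))

print(initial_matches, max_diff)
print(lora_param_count(1280, 1280, 8), "trainable vs", 1280 * 1280, "frozen")
```

For an illustrative 1280 x 1280 weight matrix with rank 8, the adapter has 20,480 trainable parameters compared with 1,638,400 in the frozen matrix, or about 1.25%.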

In [35]:
hyperparameters = {
    "model_id": "facebook/esm2_t33_650M_UR50D",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "lora": True,
    "use_gradient_checkpointing": True,
}

metric_definitions = [
    {"Name": "epoch", "Regex": "'epoch': ([0-9.]*)"},
    {
        "Name": "max_gpu_mem",
        "Regex": "Max GPU memory use during training: ([0-9.e-]*) MB",
    },
    {"Name": "train_loss", "Regex": "'loss': ([0-9.e-]*)"},
    {
        "Name": "train_samples_per_second",
        "Regex": "'train_samples_per_second': ([0-9.e-]*)",
    },
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9.e-]*)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9.e-]*)"},
]

hf_estimator = HuggingFace(
    base_job_name="esm-2-membrane-finetuning",
    entry_point="lora-train.py",
    source_dir="scripts/training/peft",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    output_path=f"{S3_PATH}/output",
    role=sagemaker_execution_role,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    checkpoint_local_path="/opt/ml/checkpoints",
    sagemaker_session=sagemaker_session,
    tags=[{"Key": "project", "Value": "esm-fine-tuning"}],
)
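SageMaker applies each `Regex` in `metric_definitions` to the training job's log output and records the captured group as the metric value. The sketch below runs three of the patterns against a hypothetical log line (the line and its values are made up for illustration) to show what gets captured.

```python
import re

# Three of the patterns from the metric definitions above
patterns = {
    "train_loss": r"'loss': ([0-9.e-]*)",
    "eval_loss": r"'eval_loss': ([0-9.e-]*)",
    "eval_accuracy": r"'eval_accuracy': ([0-9.e-]*)",
}

# Hypothetical log line, written in the style of a printed Python dict
log_line = "{'eval_loss': 0.3521, 'eval_accuracy': 0.86, 'epoch': 1.0}"

metrics = {}
for name, pattern in patterns.items():
    match = re.search(pattern, log_line)
    if match:
        metrics[name] = float(match.group(1))

print(metrics)
```

The leading quote in `'loss'` matters: without it, the `train_loss` pattern would also match inside `'eval_loss'` and report the evaluation loss as the training loss.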

In [None]:
with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    hf_estimator.fit(
        {
            "train": TrainingInput(s3_data=train_s3_uri, input_mode="File"),
            "test": TrainingInput(s3_data=test_s3_uri, input_mode="File"),
        },
        wait=False,
    )

You can view metrics and debugging information for this run in SageMaker Experiments. On the left-side navigation panel, select the Home icon, then "Experiments". From there, you can select your experiment name and training job name and view the Debugger insights.

The following table compares the different training methods we discussed above and their effect on the run time, accuracy, and GPU memory requirements of our job.

| Configuration | Billable time (sec) | Evaluation Accuracy | Max GPU Memory Usage (GB) |
| ----------- | ----------- | ----------- | ----------- |
| Base 650M model | 1187 | 0.86 | 23.0 |
| Base + Gradient Accumulation | 1212 | 0.86 | 18.1 |
| Base + Gradient Checkpointing | 1051 | 0.86 | 9.3 |
| Base + LoRA | 810 | 0.85 | 18.7 |


All of the methods produced models with high evaluation accuracy. Using LoRA decreased the run time (and cost) by 32% and using gradient checkpointing decreased the maximum GPU memory usage by 60%. Depending on our constraints (cost, time, hardware) one of these approaches may make more sense than another.

Each of these methods performs well by itself, but what happens when we use them in combination? Here are the results:

| Configuration | Billable time (sec) | Evaluation Accuracy | Max GPU Memory Usage (GB) |
| ----------- | ----------- | ----------- | ----------- |
| Base + LoRA + GA + GC | 825 | 0.78 | 3.7 |

In this case, accuracy drops from 0.86 to 0.78, a reduction of about 9%. However, the run time is very close to what we saw with LoRA by itself. More importantly, we've reduced the GPU memory use by 84%! This is a massive decrease that allows us to train on a wide range of cost-effective instance types.
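The percentages quoted for both tables can be checked with a few lines of arithmetic. The numbers below are copied from the tables above, and the helper function is only for illustration.

```python
# (billable seconds, evaluation accuracy, max GPU memory in GB) from the tables
results = {
    "Base 650M model": (1187, 0.86, 23.0),
    "Base + Gradient Accumulation": (1212, 0.86, 18.1),
    "Base + Gradient Checkpointing": (1051, 0.86, 9.3),
    "Base + LoRA": (810, 0.85, 18.7),
    "Base + LoRA + GA + GC": (825, 0.78, 3.7),
}


def percent_reduction(baseline, value):
    """Relative decrease from baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


base_time, base_acc, base_mem = results["Base 650M model"]

for name, (seconds, accuracy, memory_gb) in results.items():
    print(
        f"{name}: time -{percent_reduction(base_time, seconds):.1f}%, "
        f"accuracy -{percent_reduction(base_acc, accuracy):.1f}%, "
        f"memory -{percent_reduction(base_mem, memory_gb):.1f}%"
    )
```

Note that the accuracy change for the combined configuration is 8 percentage points in absolute terms, which is roughly 9% relative to the base accuracy of 0.86.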


---
## 4. Deploy Model as Real-Time Inference Endpoint

Finally, let's deploy our trained model to an inference endpoint and test it against some protein sequences with known subcellular localization. In this case, we'll load a previously trained version of the model saved on the public Hugging Face model hub.

In [40]:
%%time

hub = {"HF_MODEL_ID": "bloyal/esm2_650M_membrane_loc", "HF_TASK": "text-classification"}

hf_model = HuggingFaceModel(
    env=hub,
    role=sagemaker_execution_role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
)

predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge",
    role=sagemaker_execution_role,
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


INFO:sagemaker:Creating model with name: huggingface-pytorch-inference-2023-11-10-21-15-46-385
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-inference-2023-11-10-21-15-47-040
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-inference-2023-11-10-21-15-47-040


------------!CPU times: user 510 ms, sys: 4.44 ms, total: 514 ms
Wall time: 6min 33s


Try running some known proteins

In [41]:
# Example cell membrane proteins
glp_1_receptor = "MAGAPGPLRLALLLLGMVGRAGPRPQGATVSLWETVQKWREYRRQCQRSLTEDPPPATDLFCNRTFDEYACWPDGEPGSFVNVSCPWYLPWASSVPQGHVYRFCTAEGLWLQKDNSSLPWRDLSECEESKRGERSSPEEQLLFLYIIYTVGYALSFSALVIASAILLGFRHLHCTRNYIHLNLFASFILRALSVFIKDAALKWMYSTAAQQHQWDGLLSYQDSLSCRLVFLLMQYCVAANYYWLLVEGVYLYTLLAFSVLSEQWIFRLYVSIGWGVPLLFVVPWGIVKYLYEDEGCWTRNSNMNYWLIIRLPILFAIGVNFLIFVRVICIVVSKLKANLMCKTDIKCRLAKSTLTLIPLLGTHEVIFAFVMDEHARGTLRFIKLFTELSFTSFQGLMVAILYCFVNNEVQLEFRKSWERWRLEHLHIQRDSSMKPLKCPTSSLSSGATAGSSMYTATCQASCS"
pd1 = "MQIPQAPWPVVWAVLQLGWRPGWFLDSPDRPWNPPTFSPALLVVTEGDNATFTCSFSNTSESFVLNWYRMSPSNQTDKLAAFPEDRSQPGQDCRFRVTQLPNGRDFHMSVVRARRNDSGTYLCGAISLAPKAQIKESLRAELRVTERRAEVPTAHPSPSPRPAGQFQTLVVGVVGGLLGSLVLLVWVLAVICSRAARGTIGARRTGQPLKEDPSAVPVFSVDYGELDFQWREKTPEPPVPCVPEQTEYATIVFPSGMGTSSPARRGSADGPRSAQPLRPEDGHCSWPL"
trac = "IQNPDPAVYQLRDSKSSDKSVCLFTDFDSQTNVSQSKDSDVYITDKTVLDMRSMDFKSNSAVAWSNKSDFACANAFNNSIIPEDTFFPSPESSCDVKLVEKSFETDTNLNFQNLSVIGFRILLLKVAGFNLLMTLRLWSS"
apj = "MEEGGDFDNYYGADNQSECEYTDWKSSGALIPAIYMLVFLLGTTGNGLVLWTVFRSSREKRRSADIFIASLAVADLTFVVTLPLWATYTYRDYDWPFGTFFCKLSSYLIFVNMYASVFCLTGLSFDRYLAIVRPVANARLRLRVSGAVATAVLWVLAALLAMPVMVLRTTGDLENTTKVQCYMDYSMVATVSSEWAWEVGLGVSSTTVGFVVPFTIMLTCYFFIAQTIAGHFRKERIEGLRKRRRLLSIIVVLVVTFALCWMPYHLVKTLYMLGSLLHWPCDFDLFLMNIFPYCTCISYVNSCLNPFLYAFFDPRFRQACTSMLCCGQSRCAGTSHSSSGEKSASYSSGHSQGPGPNMGKGGEQMHEKSIPYSQETLVVD"
rit1 = "MDSGTRPVGSCCSSPAGLSREYKLVMLGAGGVGKSAMTMQFISHRFPEDHDPTIEDAYKIRIRIDDEPANLDILDTAGQAEFTAMRDQYMRAGEGFIICYSITDRRSFHEVREFKQLIYRVRRTDDTPVVLVGNKSDLKQLRQVTKEEGLALAREFSCPFFETSAAYRYYIDDVFHALVREIRRKEKEAVLAMEKKSKPKNSVWKRLKSPFRKKKDSVT"

# Example non-cell membrane proteins
tubulin_beta_1 = "MREIVHIQIGQCGNQIGAKFWEMIGEEHGIDLAGSDRGASALQLERISVYYNEAYGRKYVPRAVLVDLEPGTMDSIRSSKLGALFQPDSFVHGNSGAGNNWAKGHYTEGAELIENVLEVVRHESESCDCLQGFQIVHSLGGGTGSGMGTLLMNKIREEYPDRIMNSFSVMPSPKVSDTVVEPYNAVLSIHQLIENADACFCIDNEALYDICFRTLKLTTPTYGDLNHLVSLTMSGITTSLRFPGQLNADLRKLAVNMVPFPRLHFFMPGFAPLTAQGSQQYRALSVAELTQQMFDARNTMAACDLRRGRYLTVACIFRGKMSTKEVDQQLLSVQTRNSSCFVEWIPNNVKVAVCDIPPRGLSMAATFIGNNTAIQEIFNRVSEHFSAMFKRKAFVHWYTSEGMDINEFGEAENNIHDLVSEYQQFQDAKAVLEEDEEVTEEAEMEPEDKGH"
p53 = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD"
ptgs2 = "MLARALLLCAVLALSHTANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCSTPEFLTRIKLFLKPTPNTVHYILTHFKGFWNVVNNIPFLRNAIMSYVLTSRSHLIDSPPTYNADYGYKSWEAFSNLSYYTRALPPVPDDCPTPLGVKGKKQLPDSNEIVEKLLLRRKFIPDPQGSNMMFAFFAQHFTHQFFKTDHKRGPAFTNGLGHGVDLNHIYGETLARQRKLRLFKDGKMKYQIIDGEMYPPTVKDTQAEMIYPPQVPEHLRFAVGQEVFGLVPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPELLFNKQFQYQNRIAAEFNTLYHWHPLLPDTFQIHDQKYNYQQFIYNNSILLEHGITQFVESFTRQIAGRVAGGRNVPPAVQKVSQASIDQSRQMKYQSFNEYRKRFMLKPYESFEELTGEKEMSAELEALYGDIDAVELYPALLVEKPRPDAIFGETMVEVGAPFSLKGLMGNVICSPAYWKPSTFGGEVGFQIINTASIQSLICNNVKGCPFTSFSVPDPELIKTVTINASSSRSGLDDINPTVLLKERSTEL"
znf195 = "MTLLTFRDVAIEFSLEEWKCLDLAQQNLYRDVMLENYRNLFSVGLTVCKPGLITCLEQRKEPWNVKRQEAADGHPEMGFHHATQACLELLGSSDLPASASQSAGITGVNHRAQPGLNVSVDKFTALCSPGVLQTVKWFLEFRCIFSLAMSSHFTQDLLPEQGIQDAFPKRILRGYGNCGLDNLYLRKDWESLDECKLQKDYNGLNQCSSTTHSKIFQYNKYVKIFDNFSNLHRRNISNTGEKPFKCQECGKSFQMLSFLTEHQKIHTGKKFQKCGECGKTFIQCSHFTEPENIDTGEKPYKCQECNNVIKTCSVLTKNRIYAGGEHYRCEEFGKVFNQCSHLTEHEHGTEEKPCKYEECSSVFISCSSLSNQQMILAGEKLSKCETWYKGFNHSPNPSKHQRNEIGGKPFKCEECDSIFKWFSDLTKHKRIHTGEKPYKCDECGKAYTQSSHLSEHRRIHTGEKPYQCEECGKVFRTCSSLSNHKRTHSEEKPYTCEECGNIFKQLSDLTKHKKTHTGEKPYKCDECGKNFTQSSNLIVHKRIHTGEKPYKCEECGRVFMWFSDITKHKKTHTGEKPYKCDECGKNFTQSSNLIVHKRIHTGEKPYKCEKCGKAFTQFSHLTVHESIHT"
adh5 = "MANEVIKCKAAVAWEAGKPLSIEEIEVAPPKAHEVRIKIIATAVCHTDAYTLSGADPEGCFPVILGHEGAGIVESVGEGVTKLKAGDTVIPLYIPQCGECKFCLNPKTNLCQKIRVTQGKGLMPDGTSRFTCKGKTILHYMGTSTFSEYTVVADISVAKIDPLAPLDKVCLLGCGISTGYGAAVNTAKLEPGSVCAVFGLGGVGLAVIMGCKVAGASRIIGVDINKDKFARAKEFGATECINPQDFSKPIQEVLIEMTDGGVDYSFECIGNVKVMRAALEACHKGWGVSVVVGVAASGEEIATRPFQLVTGRTWKGTAFGGWKSVESVPKLVSEYMSKKIKVDEFVTHNLSFDEINKAFELMHSGKSIRTVVKI"

sample = {"inputs": glp_1_receptor}
predictor.predict(sample)

[{'label': 'LABEL_1', 'score': 0.7002050280570984}]
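The endpoint returns generic label names, so a small helper can make the response easier to read. The mapping below is an assumption: it supposes that `LABEL_1` corresponds to `Membrane == 1` in the training data, following the default `LABEL_<class index>` naming, and it has not been checked against the published model's configuration.

```python
# Assumed mapping from generic label names to the Membrane column values
LABEL_NAMES = {"LABEL_0": "non-membrane", "LABEL_1": "membrane"}


def describe(prediction):
    """Summarize the top prediction from a text-classification response."""
    top = prediction[0]
    return f"{LABEL_NAMES[top['label']]} (score {top['score']:.2f})"


# Response copied from the output above
response = [{"label": "LABEL_1", "score": 0.7002050280570984}]
print(describe(response))
```

Under that assumption, the GLP-1 receptor is classified as a membrane protein, which agrees with its place in the list of example cell membrane proteins.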

In [None]:
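try:
    predictor.delete_endpoint()
except Exception as err:
    # Avoid a bare except, which would also swallow KeyboardInterrupt
    print(f"Could not delete endpoint: {err}")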