## **2025 SDSU Data Science Symposium**

**Venue:** Student Union, South Dakota State University, Brookings, SD, USA

**Date of the session:** February 6, 2025

**Instructors:** Cameron Pykiet, Jaylin Dyson, Bishnu Sarker

**Affiliation:** Meharry Medical College School of Applied Computational Sciences, Tennessee, USA

Please cite this tutorial as:

**Bishnu Sarker, Cameron Pykiet, Jaylin Dyson (2025, February). Workshop 1: Building Interdisciplinary applications using Large Language Models. In 2025 Data Science Symposium at South Dakota State University, February 6-7, 2025, Brookings, SD.**

### **Case study 1: Protein Function Prediction using Generative Large Language Model.**

**How to use a pre-trained Transformer Model for generating sequence embeddings**

RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) model sequential data and capture dependencies between positions in a sequence. They have long been go-to deep learning models for tasks involving sequence data, such as text classification. However, they process the input one step at a time, so a traditional LSTM cannot attend to different parts of the input sequence simultaneously.

In certain classification tasks, it becomes essential to associate different sections of the input sequence with one another. By doing so, we can unveil important patterns and relationships that describe the connection between the sequence and the corresponding label.

To address this limitation, more advanced models and architectures have been developed, such as the attention mechanism. Attention-based models allow for the dynamic allocation of attention weights to different parts of the input sequence, enabling the model to focus on the most relevant and informative elements. By incorporating attention mechanisms, we can enhance the model's ability to capture intricate patterns and dependencies, leading to improved performance in tasks requiring sequence classification.

By leveraging attention mechanisms, we can discover crucial associations between different tokens of the input sequence, thereby enhancing the model's comprehension and representation of the underlying long-distance relationships.
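
To make the idea of attention weights concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not the implementation used by any of the models in this tutorial: the toy token vectors are made up, and the queries, keys, and values are the token vectors themselves, whereas a real Transformer first applies learned projections and uses several heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example (hypothetical data): three 2-dimensional token vectors attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how one token distributes its attention over all tokens; every row sums to 1, and tokens whose vectors are more similar to the query receive larger weights.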

The work of [(Bahdanau et al., 2014)](https://arxiv.org/pdf/1409.0473.pdf) and [(Luong et al., 2015)](https://arxiv.org/pdf/1508.04025.pdf) introduced the attention mechanism roughly a decade ago. Building on that progress, in 2017 a team of Google researchers introduced the Transformer architecture in ["Attention is All You Need"](https://arxiv.org/pdf/1706.03762.pdf), which uses multiple attention heads in parallel (multi-head attention). A visual description of the attention mechanism is available [here](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) and a more detailed explanation [here](https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452). The original Transformer is an encoder-decoder model, and the architecture is used for many sequence modeling tasks such as classification, translation, and representation learning.


Transformers are resource-intensive models: they require powerful machines and GPUs to train, so it is not feasible to train a new model for every task. This is where pre-training, a form of transfer learning, comes in. The model is pre-trained once at a resource-rich computing facility, and the architecture together with the pre-trained parameters is shared publicly. These pre-trained models are then used to build models for many downstream tasks.

In this hands-on session, the objectives are:
1. Load a pre-trained transformer model trained on millions of protein sequences from UniProtKB.
2. Generate the embeddings for a list of sequences.

#### 1. Setting up your Python environment.
To work with large language models, a number of Python packages facilitate the implementation.

The following command will install:
1. torch
2. torchvision
3. torchaudio
4. transformers
5. sentencepiece
6. accelerate


!pip3 install torch torchvision torchaudio transformers sentencepiece accelerate --extra-index-url https://download.pytorch.org/whl/cu116

It is highly recommended that you set up a virtual environment before installing these packages.
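
After installation, a quick sanity check can confirm that the packages are visible from the active environment. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical and introduced here for illustration.

```python
import importlib.util

# Top-level import names of the packages installed above.
REQUIRED = ["torch", "torchvision", "torchaudio", "transformers", "sentencepiece", "accelerate"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("All packages found." if not missing else f"Missing: {', '.join(missing)}")
```

If anything is reported missing, the notebook kernel is probably running in a different environment from the one where `pip3 install` was executed.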

#### 2. Loading a pre-trained model. 
As stated in the [UniProtKB](https://www.uniprot.org/help/embeddings) documentation, "Protein embeddings are a way to encode functional and structural properties of a protein, mostly from its sequence only, in a machine-friendly format (vector representation). Generating such embeddings is computationally expensive, but once computed they can be leveraged for different tasks, such as sequence similarity search, sequence clustering, and sequence classification."
In an attempt to build a language model for protein sequences, [Rost Lab](https://www.rostlab.org/) released [ProtTrans](https://github.com/agemagician/ProtTrans), published [here](https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).
Using the ProtTrans T5 model, UniProtKB generated 1024-dimensional [embeddings](https://www.uniprot.org/help/downloads#embeddings) for all of the reviewed sequences in the Swiss-Prot database.
A detailed description of the inner workings of the model can be found in the paper and GitHub repository mentioned above.

Recently, a number of protein language models, many of them generative, have been released and shared on Hugging Face, such as `ProtGPT2 (nferruz/ProtGPT2)`, `ProLLaMA (GreatCaptainNemo/ProLLaMA)`, and `ESM-2 (facebook/esm2_t33_650M_UR50D)`. This is a fast-moving area of research and development, and we expect more powerful protein language models to emerge and be shared publicly. The primary purposes of these models are to generate protein sequences with a desired property and to predict properties from a protein sequence.

In this section, the learning objectives are:
1. To learn the syntax for loading a pre-trained language model.
2. To extract embeddings from a pre-trained language model.
3. To programmatically access a generative language model.

#### 1. Loading the required packages

import pandas as pd
import torch 
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, EsmModel, EsmTokenizer, pipeline

#### 2. Loading the model and extracting embeddings per sequence. 

### **ProtGPT2**

ProtGPT2 is a generative protein language model based on GPT-2, optimized for creating realistic protein-like sequences. It has applications in protein design and in exploring protein sequence space. [ProtGPT2](https://pmc.ncbi.nlm.nih.gov/articles/PMC9329459/) uses a Transformer decoder architecture and processes input sequences tokenized with a `Byte Pair Encoding (BPE)` strategy. During training, the model employed the original dot-product self-attention mechanism. The architecture consists of 36 layers with a model dimensionality of `1280`, matching the GPT-2 large Transformer architecture available from HuggingFace. Prior to training, the model weights were reinitialized.

The team optimized the model using the Adam optimizer with β1 = 0.9, β2 = 0.999, and a learning rate of 1e-03. For the primary model, they processed 65,536 tokens per batch, distributed as 512 tokens on each of 128 GPUs. Each GPU handled a batch size of 8, giving a total effective batch size of 1024. Training was completed over four days on 128 NVIDIA A100 GPUs, with DeepSpeed managing model parallelism for efficient processing.
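
The batch figures quoted above can be checked with a few lines of arithmetic. The variable names are ours, chosen for illustration; the numbers are the ones stated in the paragraph.

```python
num_gpus = 128
tokens_per_gpu = 512        # tokens handled on each GPU, as quoted above
sequences_per_gpu = 8       # per-GPU batch size, as quoted above

tokens_per_batch = num_gpus * tokens_per_gpu
effective_batch_size = num_gpus * sequences_per_gpu

print(tokens_per_batch)      # 65536
print(effective_batch_size)  # 1024
```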



##### Prompting ProtGPT2 to generate sequences. 

# loading the model for generation
protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")

# prompt 1
sequences = protgpt2("<|endoftext|>MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERG", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)

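# Cleaning the sequences: remove the special token and the newlines.
for seq in sequences:
    # str.strip() treats its argument as a set of characters, so use replace() to drop the token as a whole
    print(seq['generated_text'].replace('<|endoftext|>', '').replace('\n', ''))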
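
If the cleaned sequences are going to be reused, the cleaning step can be wrapped in a small helper. The sketch below is one possible version, not part of the ProtGPT2 API: the names `clean_generated` and `is_standard_protein` are hypothetical, and the example string is made up. It removes the special token as a whole substring (note that `str.strip` treats its argument as a set of characters, not as a substring) and checks that only the 20 standard amino acids remain.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
SPECIAL_TOKEN = "<|endoftext|>"

def clean_generated(text):
    """Remove the special token and all whitespace from generated text."""
    return "".join(text.replace(SPECIAL_TOKEN, "").split())

def is_standard_protein(seq):
    """True if seq is non-empty and uses only the 20 standard residues."""
    return bool(seq) and set(seq) <= AMINO_ACIDS

# Hypothetical raw output in the same shape as seq['generated_text'].
raw = "<|endoftext|>MALWMRLLPL\nLALLALWGPD\n<|endoftext|>"
cleaned = clean_generated(raw)
print(cleaned, is_standard_protein(cleaned))
```

With the pipeline output, the same helper could be applied as `clean_generated(seq['generated_text'])` for each generated item.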

#### Retrieving Embedding from ProtGPT2

Although ProtGPT2 is a decoder-only architecture trained to generate sequences, we can still retrieve the embeddings it learned from the training dataset.


model_name="nferruz/ProtGPT2" ## link to hugging face repository
tokenizer=AutoTokenizer.from_pretrained(model_name) # load the tokenizer
model=AutoModelForCausalLM.from_pretrained(model_name) # load the model 

Given a protein sequence, our aim is to get a fixed-dimensional representation. For ProtGPT2, the model learns a 1280-dimensional vector for each token.

**Input Sequence**

sequence="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"


**Tokenizing the input sequence**

The tokenizer returns a list of ids that index into the token vocabulary.

tokens=tokenizer(sequence, return_tensors='pt')
print(tokens)

**Id to Token mapping**

id2tokens=[tokenizer.decode(iid) for iid in tokens['input_ids'][0]]


print(id2tokens)

As you can see, the tokens have variable length. They are learned using the Byte Pair Encoding (BPE) algorithm.

print(len(sequence),'-->', len(id2tokens))


The 110-amino-acid sequence is now tokenized into 26 tokens.
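
To see why BPE produces variable-length tokens, here is a toy sketch of the merge-learning loop in plain Python. It is a simplified illustration, not the ProtGPT2 tokenizer: the three-sequence corpus is made up, and the function names are hypothetical. The idea is to start from single characters and repeatedly merge the most frequent adjacent pair into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; return the merges and the tokenized corpus."""
    words = [list(seq) for seq in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return merges, words

# Toy corpus (hypothetical); a real tokenizer is trained on millions of sequences.
toy_corpus = ["MALLAL", "LLALWG", "ALLALL"]
merges, tokenized = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(tokenized)
```

Frequent residue combinations end up as single multi-character tokens, which is why a sequence is represented by far fewer tokens than it has amino acids.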

**Computing Token Embeddings**

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**tokens, output_hidden_states=True)
    hidden_states = outputs.hidden_states  # tuple: embedding output plus one tensor per layer
    embeddings = hidden_states[-1]  # last hidden state, shape (1, n_tokens, 1280)
    embedding = embeddings.mean(dim=1)  # average over all tokens, shape (1, 1280)

print(embedding.shape)
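
The averaging step collapses one vector per token into a single vector per sequence. The plain-Python sketch below shows the same computation on made-up numbers, with an optional mask. The mask is an assumption for illustration: this tutorial embeds one sequence at a time, so there is no padding, but if several sequences were padded to a common length, the padded positions would need to be left out of the average.

```python
def mean_pool(token_vectors, mask=None):
    """Average token vectors column-wise, skipping positions where mask is 0."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    return [sum(column) / len(kept) for column in zip(*kept)]

# Hypothetical 2-dimensional token vectors; the third stands in for a padding position.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(mean_pool(token_vectors))                  # averages all three tokens
print(mean_pool(token_vectors, mask=[1, 1, 0]))  # ignores the third token
```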


**Let's write a helper function**

def embed_protein_sequence(sequence, model=model, tokenizer=tokenizer):
    """Return a single fixed-size embedding (mean over tokens) for one protein sequence."""
    tokens = tokenizer(sequence, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**tokens, output_hidden_states=True)

    embeddings = outputs.hidden_states[-1]  # last hidden state, shape (1, n_tokens, hidden_size)
    embedding = embeddings.mean(dim=1)  # average over all tokens
    return embedding[0]  # drop the batch dimension

    

embed_protein_sequence(sequence)

**Embed the dataset**

Embedding the full dataset will take a lot of time. We prepared another dataset with embeddings taken from a public database.

seq_df=pd.read_csv('Seq_class.csv')
seq_df

sequences=list(seq_df.sequence)
len(sequences)

import numpy as np

classes=list(seq_df.classification)

seq_embeddings=[embed_protein_sequence(seq) for seq in sequences]
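
Because embedding a long list can take a while, it may help to process it in chunks and report progress along the way. The sketch below is one way to do that; `embed_in_chunks` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in so the example runs without loading a model.

```python
def embed_in_chunks(sequences, embed_fn, chunk_size=100):
    """Embed sequences chunk by chunk, printing progress after each chunk."""
    embeddings = []
    for start in range(0, len(sequences), chunk_size):
        chunk = sequences[start:start + chunk_size]
        embeddings.extend(embed_fn(seq) for seq in chunk)
        print(f"embedded {len(embeddings)}/{len(sequences)} sequences")
    return embeddings

def fake_embed(seq):
    """Stand-in embedder (assumption): returns the sequence length as a 1-d vector."""
    return [float(len(seq))]

demo = embed_in_chunks(["MAL", "MALW", "MALWM"], fake_embed, chunk_size=2)
print(demo)
```

In the notebook, `fake_embed` would be replaced by `embed_protein_sequence`, for example `embed_in_chunks(sequences, embed_protein_sequence)`.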