<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173_Fall2025/blob/main/F25_Class_04_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 4: ChatGPT and Large Language Models**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 4 Material

* Part 4.1: Introduction to Transformers and Accessing ChatGPT
* **Part 4.2: LLM Memory and Embedding**
* Part 4.3: Generative AI
* Part 4.4: Text to Images with Stable Diffusion


## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

Make sure your GMAIL address is included as the last line in the output above.

### Install Custom Functions

Run the cell below to load custom functions used in this lesson.

In [None]:
# Simple function to format elapsed time (in seconds) as h:mm:ss.ss
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# **LLM Memory**

Human minds have both long-term and short-term memory. Long-term memory is what a person has learned throughout their lifetime. Short-term memory is what a person has encountered only in the last minute or so. For humans, learning means converting short-term memory into long-term memory that we retain.

This process works somewhat differently for an LLM. Long-term memory is stored in the weights of the neural network, which were set when the model was initially trained or fine-tuned. Short-term memory is additional information from previous prompts that we wish the LLM to retain. For example, if the first prompt is "My name is Jeff", the LLM will likely say hello and repeat your name. However, without a memory component, the LLM will not know the answer if the second prompt is "What is my name?".

The memory objects that LangChain provides supply a sort of short-term memory. These objects do not affect the long-term memory of the LLM, and once you discard the memory object, the LLM forgets. Additionally, a memory object can hold only so much information; once it is full, newer information may replace older information.

One important point to remember is that an LLM only has its input prompt. To provide memory, these objects add anything we wish the LLM to remember to the input prompt. This section shows two ways to augment the prompt with previous information: a buffer and a summary. The buffer prepends a transcript of the conversation up to this point. The summary approach keeps a continually updated summary paragraph of the conversation.
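
The idea of augmenting the prompt can be sketched in plain Python without calling any real model. In the sketch below, `fake_llm`, `BufferMemory`, and `chat` are hypothetical names invented for illustration; they are not LangChain objects. The pretend model can only "know" what appears in its prompt text, which is the point being made above.

```python
# Minimal sketch of a buffer memory. The "model" is stateless, so the
# transcript so far is pasted in front of every new prompt.
# fake_llm is a hypothetical stand-in for a real LLM call.

def fake_llm(prompt):
    """Pretend model: it can only use information found in the prompt."""
    marker = "my name is "
    lowered = prompt.lower()
    if "what is my name" in lowered and marker in lowered:
        start = lowered.index(marker) + len(marker)
        name = prompt[start:].split()[0].strip(".,!?")
        return f"Your name is {name}."
    if "what is my name" in lowered:
        return "I don't know your name."
    return "Hello!"


class BufferMemory:
    """Stores every (human, ai) exchange and prepends it to new prompts."""

    def __init__(self):
        self.turns = []

    def build_prompt(self, user_input):
        history = "".join(f"Human: {h}\nAI: {a}\n" for h, a in self.turns)
        return f"{history}Human: {user_input}\nAI:"

    def save(self, user_input, reply):
        self.turns.append((user_input, reply))


def chat(memory, user_input):
    reply = fake_llm(memory.build_prompt(user_input))
    memory.save(user_input, reply)
    return reply


memory = BufferMemory()
chat(memory, "Hi, my name is David")
with_memory = chat(memory, "What is my name?")
without_memory = fake_llm("Human: What is my name?\nAI:")

print(with_memory)      # Your name is David.
print(without_memory)   # I don't know your name.
```

The second call succeeds only because the first exchange was copied into the prompt; nothing about the pretend model itself changed.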

### Install LangChain

In [None]:
!pip install langchain langchain_openai chromadb langchain_community sentence-transformers langchainhub pypdf

## **Obtaining an OpenAI API Key**

In order to delve into the practical exercises and code demonstrations within this section, students will need to obtain an **OpenAI API key**. This key grants access to OpenAI's services, including the ChatGPT functionality we'll be exploring. It's important to note that there is a nominal cost associated with the usage of this key, depending on the volume and intensity of requests made to OpenAI's servers.

To obtain an OpenAI API key, access this [site](https://platform.openai.com/apps).

In [None]:
# This is the model you will generally use for this class
LLM_MODEL = 'gpt-3.5-turbo-1106'

We begin with a very basic query through LangChain: we ask the model to name the largest university in San Antonio, Texas.


In [None]:
from google.colab import userdata
from langchain_openai import OpenAI, ChatOpenAI

# Retrieve the OpenAI API key and store it in a variable
OPENAI_KEY = userdata.get('OPENAI_KEY')

# Ensure that the API key is correctly set
if not OPENAI_KEY:
    raise ValueError("OpenAI API key is not set. Please check if you have stored the API key in userdata.")

LLM_MODEL = 'gpt-3.5-turbo-1106'

# Initialize the OpenAI LLM (large language model) with your API key using ChatOpenAI
llm = ChatOpenAI(openai_api_key=OPENAI_KEY, model=LLM_MODEL, temperature=0.7)

# Define the question
question = "What is the largest university in San Antonio, Texas?"

# Use Langchain to call the OpenAI chat API
response = llm.invoke(question)

# Print the response
print(response.content)


## Conversation Buffer Window Memory

The LangChain library includes a conversation object named **ConversationChain**; this object facilitates an ongoing conversation with an LLM. For any conversation object, you must also specify a memory. For this first example, we will use the **ConversationBufferWindowMemory** object, which keeps a transcript of the most recent exchanges for the model to reference. This memory allows the conversation object to remember what you have asked or told it and how it responded.

In [None]:
# Build a conversation chain that uses a sliding-window buffer memory
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory()
conversation = ConversationChain(
    llm=llm,
    memory = memory,
    verbose=False
)

We can now have a conversation with the LLM.



In [None]:
# Tell the LLM something it will need to remember
conversation.predict(input="Hi, my name is David")

Because of the memory object, the LLM can now answer a follow-up question that depends on the earlier prompt.

In [None]:
conversation.predict(input="What is my name?")

We can have a look at what the memory now contains.

In [None]:
conversation.memory.load_memory_variables({})
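
What makes this a "window" memory is that only the most recent exchanges are kept. LangChain's object exposes the window size as its `k` argument; the sketch below imitates that behavior in plain Python with a bounded `deque`. The class `WindowMemory` is a hypothetical illustration, not the LangChain implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k (human, ai) exchanges; older ones fall off."""

    def __init__(self, k):
        self.turns = deque(maxlen=k)

    def save(self, human, ai):
        self.turns.append((human, ai))

    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


window = WindowMemory(k=2)
window.save("Hi, my name is David", "Hello David!")
window.save("I study biology", "That sounds interesting.")
window.save("I like Python", "Python is a popular choice.")

# Only the two most recent exchanges remain; the name has been forgotten.
print(window.history())
```

With a small window, a fact stated early in a long conversation eventually drops out of the prompt, which is the trade-off for keeping the prompt short.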

## **Custom Conversation Bots**

You can define the prompt template for a conversation bot. This technique allows you to create a bot that has a name and performs a specialized task. In this case, we create a bot named "UTSA Assistant" that is designed to help students.

In [None]:
# Override the default prompt and label the AI side of the
# conversation "UTSA Assistant"
from langchain.prompts.prompt import PromptTemplate

template = """The following is a friendly conversation between a human and an
AI to assist UTSA Students. The AI should stick to topics
about The University of Texas at San Antonio (UTSA). If the AI does not know the answer to a question,
it should suggest the student speak to their advisor.

Current conversation:
{history}
Human: {input}
UTSA Assistant:"""
PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
conversation = ConversationChain(
    prompt=PROMPT,
    llm=llm,
    verbose=False,
    memory=ConversationBufferWindowMemory(ai_prefix="UTSA Assistant"),
)
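
To see what the LLM actually receives, it can help to fill in a template by hand. The sketch below uses Python's built-in `str.format` on a shortened, made-up template with the same two placeholders, `{history}` and `{input}`; the history text and the "(earlier reply)" placeholder are invented for illustration.

```python
# Shortened, hypothetical template with the same two placeholders
# that the PromptTemplate above declares as input variables.
template = (
    "You are an assistant for UTSA students.\n\n"
    "Current conversation:\n"
    "{history}\n"
    "Human: {input}\n"
    "UTSA Assistant:"
)

# Made-up history standing in for what the memory object would supply.
history = "Human: Where is the bookstore?\nUTSA Assistant: (earlier reply)"

filled = template.format(
    history=history,
    input="What is a nice quiet area to study?",
)
print(filled)
```

The filled-in text ends with `UTSA Assistant:`, so the model's most natural continuation is the assistant's next reply.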

We can now have a conversation with the UTSA assistant bot.



In [None]:
# Ask the UTSA assistant a first question
conversation.predict(input="Where is the bookstore?")

Another question.


In [None]:
conversation.predict(input="What is a nice quiet area to study?")


Because the memory holds the earlier exchanges, a follow-up question can refer back to them.


In [None]:
conversation.predict(input="Which of these is closest to the bookstore?")


In [None]:
conversation.predict(input="What is the meaning of life.")

We can have a look at what the memory now contains.

In [None]:
conversation.memory.load_memory_variables({})

## **Conversation Summary Memory**

Now, let's look at a slightly more complex type of memory, the **ConversationSummaryMemory** object. Rather than storing the transcript verbatim, this memory uses the LLM to maintain a running summary of the conversation and injects the summary so far into the prompt/chain. It is most useful for more extended conversations, where keeping the past message history in the prompt verbatim would take up too many tokens.

In [None]:
from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=llm)
conversation = ConversationChain(
    llm=llm,
    memory = memory,
    verbose=False
)

In [None]:
conversation.predict(input="I am a computational biologist, what do you do for a living?")

In [None]:
conversation.memory.load_memory_variables({})
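
The structure of a summary memory can be sketched in plain Python. The key difference from a buffer is that a single running string is rewritten after every exchange instead of a transcript growing without bound. Everything here is hypothetical: `SummaryMemory` is not the LangChain class, and `toy_summarize` is a crude stand-in that merely keeps the most recent words, whereas the real object asks the LLM to write a genuine summary.

```python
MAX_WORDS = 30  # hypothetical cap on the length of the running summary


def toy_summarize(old_summary, human, ai):
    """Crude stand-in for the LLM call that rewrites the running summary."""
    note = f"Human said: {human} AI said: {ai}"
    words = (old_summary + " " + note).split()
    return " ".join(words[-MAX_WORDS:])


class SummaryMemory:
    def __init__(self, summarize):
        self.summarize = summarize
        self.summary = ""

    def save(self, human, ai):
        # The old summary is replaced, not appended to, so it stays bounded.
        self.summary = self.summarize(self.summary, human, ai)

    def build_prompt(self, user_input):
        return (
            f"Summary of conversation: {self.summary}\n"
            f"Human: {user_input}\nAI:"
        )


memory = SummaryMemory(toy_summarize)
for turn in range(20):
    memory.save(f"message number {turn}", f"reply number {turn}")

prompt = memory.build_prompt("What did we talk about?")
print(len(memory.summary.split()))  # never more than MAX_WORDS
```

However long the conversation runs, the prompt carries at most `MAX_WORDS` words of history, which is why this style of memory suits extended conversations.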

## **What are Embedding Layers in PyTorch**

[Embedding Layers](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) are a handy feature of PyTorch that allows the program to automatically insert additional information into the data flow of your neural network. An embedding layer would automatically allow you to insert vectors in the place of word indexes.  

Programmers often use embedding layers with Natural Language Processing (NLP); however, you can use these layers when you wish to insert a lengthier vector in an index value place. In some ways, you can think of an embedding layer as dimension expansion. However, the hope is that these additional dimensions provide more information to the model and provide a better score.

## **Simple Embedding Layer Example**

* **num_embeddings** = How large is the vocabulary?  How many categories are you encoding? This parameter is the number of items in your "lookup table."
* **embedding_dim** = How many numbers in the vector you wish to return.

Now we create a neural network with a vocabulary size of 10, which will map each integer value between 0 and 9 to a vector of 4 numbers. This neural network does nothing more than pass the embedding on to the output, but it does let us see what the embedding is doing. Each input row will contain two such integer values.

In [None]:
import torch
import torch.nn as nn

embedding_layer = nn.Embedding(num_embeddings=10, embedding_dim=4)
optimizer = torch.optim.Adam(embedding_layer.parameters(), lr=0.001)
loss_function = nn.MSELoss()

Let's take a look at the structure of this neural network to see what is happening inside it.


In [None]:
print(embedding_layer)

For this neural network, which is just an embedding layer, the input is a vector of size 2. These two inputs are integers from 0 to 9 (corresponding to the requested `num_embeddings` of 10). The printed layer, `Embedding(10, 4)`, describes a lookup table with 10 rows and 4 columns, so the embedding layer has 40 parameters: four values (`embedding_dim`) for each of the 10 (`num_embeddings`) possible integer values. For an input of length 2, the output is two length-4 vectors, for a total output size of 8.
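
The parameter count and output size can be checked with a plain-Python stand-in for the lookup table. This is only a sketch of the idea; `table` and `embed` are hypothetical names, and the random values simply play the role of untrained starting weights.

```python
import random

NUM_EMBEDDINGS = 10   # rows: one per allowed integer value 0-9
EMBEDDING_DIM = 4     # columns: length of each returned vector

random.seed(0)
table = [
    [random.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_EMBEDDINGS)
]


def embed(indexes):
    """Replace each integer with its row of the table; no arithmetic."""
    return [table[i] for i in indexes]


n_parameters = NUM_EMBEDDINGS * EMBEDDING_DIM
output = embed([1, 2])

print(n_parameters)                    # 40
print(len(output), len(output[0]))     # 2 4
```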

Now, let us query the neural network with a single row that contains two integer values.

In [None]:
input_tensor = torch.tensor([[1, 2]], dtype=torch.long)
pred = embedding_layer(input_tensor)

print(input_tensor.shape)
print(pred)


Here we see the two length-4 vectors that PyTorch looked up for the two input integers. Recall that Python indexing is zero-based, so PyTorch replaced the value 1 with the second row of the 10 x 4 lookup matrix. Similarly, PyTorch replaced the value 2 with the third row of the lookup matrix. The following code displays the lookup matrix in its entirety. The embedding layer performs no mathematical operations other than inserting the correct row from the lookup table.


In [None]:
embedding_layer.weight.data

The values above are random parameters that PyTorch generated as starting points. Generally, we will transfer an embedding or train these random values into something useful. The following section demonstrates how to load a hand-coded embedding.

## **Transferring An Embedding**

Now, we see how to hard-code an embedding lookup that performs a simple one-hot encoding.  One-hot encoding would transform the input integer values of 0, 1, and 2 to the vectors $[1,0,0]$, $[0,1,0]$, and $[0,0,1]$ respectively. The following code replaces the random lookup values in the embedding layer with this one-hot coding-inspired lookup table.

In [None]:
import torch
import torch.nn as nn

# Define the embedding lookup matrix
embedding_lookup = torch.tensor([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
], dtype=torch.float32)  # Make sure to use float32 for weight matrices

# Create the embedding layer
embedding_layer = nn.Embedding(num_embeddings=3, embedding_dim=3)

# Set the weights of the embedding layer
embedding_layer.weight.data = embedding_lookup


We have the following parameters for the Embedding layer:

* num_embeddings=3 - There are three different integer categorical values allowed.
* embedding_dim=3 - Three columns represent a categorical value with three possible values per one-hot encoding.
* The input tensor has shape (1, 2) - Each input row holds two of these categorical values.

We query the neural network with two categorical values to see the lookup performed.

In [None]:
# Create the input tensor directly in PyTorch
input_tensor = torch.tensor([[0, 1]], dtype=torch.long)

# Forward pass to get the predictions
pred = embedding_layer(input_tensor)

print(input_tensor.shape)
print(pred)

The given output shows that we provided the program with two rows from the one-hot encoding table. This encoding is a correct one-hot encoding for the values 0 and 1, where there are up to 3 unique values possible.

The following section demonstrates how to train this embedding lookup table.

## **Training an Embedding**

First, we make use of the following imports.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import OneHotEncoder
from torch.nn.utils.rnn import pad_sequence

We create a neural network that classifies restaurant reviews as positive or negative. The reviews are given as strings, such as those shown here, and are converted to numbers before they reach the network. This code also includes a positive or negative label for each review.

In [None]:
# Define 10 restaurant reviews.
reviews = [
    'Never coming back!',
    'Horrible service',
    'Rude waitress',
    'Cold food.',
    'Horrible food!',
    'Awesome',
    'Awesome service!',
    'Rocks!',
    'poor work',
    'Couldn\'t have done better']

# Define labels (1=negative, 0=positive)
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

Notice that the second to the last label is incorrect.  Errors such as this are not too out of the ordinary, as most training data could have some noise.

We define a vocabulary size of 50 words.  Though we do not have 50 distinct words, it is okay to use a value larger than needed.  For input, we convert each word to an integer index rather than to a true one-hot vector.  Scikit-learn's `OneHotEncoder` would expand these strings to the 0's and 1's that we would typically see for dummy variables.  Here, each word is instead replaced by `hash(word) % VOCAB_SIZE`, an index between 0 and 49.  Two caveats apply: different words can collide on the same index, and Python randomizes string hashes for each process unless `PYTHONHASHSEED` is set, so the indexes can change from one session to the next.

In [None]:
# Encode each word as an integer index by hashing (not a true one-hot encoding)
VOCAB_SIZE = 50
encoded_reviews = [torch.tensor([hash(word) % VOCAB_SIZE for word in review.split()]) for review in reviews]

print(f"Encoded reviews: {encoded_reviews}")
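
The cell above relies on Python's built-in `hash`, which is salted per process for strings, so the same word can receive a different index in a different session. One possible alternative, offered only as a sketch, is to derive the index from a fixed digest. The helper `stable_index` is hypothetical; lowercasing, stripping punctuation, and reserving index 0 for padding are design assumptions that the original cell does not make.

```python
import hashlib

VOCAB_SIZE = 50


def stable_index(word, vocab_size=VOCAB_SIZE):
    """Map a word to an index in 1..vocab_size-1; 0 is left free for padding."""
    cleaned = word.lower().strip(".,!?")
    digest = hashlib.md5(cleaned.encode("utf-8")).hexdigest()
    return 1 + int(digest, 16) % (vocab_size - 1)


reviews = ["Horrible service", "Awesome service!"]
encoded = [[stable_index(w) for w in review.split()] for review in reviews]

# "service" and "service!" now share one index, in every session.
print(encoded)
```

Collisions between unrelated words remain possible with any hashing scheme; a larger `VOCAB_SIZE` makes them less likely.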

The program encodes these reviews as word indexes; however, their lengths are different.  We pad these reviews to 4 words and truncate any words beyond the fourth word.

In [None]:
MAX_LENGTH = 4
padded_reviews = pad_sequence(encoded_reviews, batch_first=True, padding_value=0).narrow(1, 0, MAX_LENGTH)
print(padded_reviews)

Each shorter review is padded by appending zeros at the end, which is how `pad_sequence` pads, and `narrow` keeps only the first `MAX_LENGTH` columns.
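
The same pad-then-truncate rule can be written out in plain Python, which makes the behavior for short and long reviews explicit. The helper `pad_or_truncate` and the example index lists are made up for illustration.

```python
def pad_or_truncate(indexes, max_length, pad_value=0):
    """Clip to max_length, then append pad_value until the length matches."""
    clipped = indexes[:max_length]
    return clipped + [pad_value] * (max_length - len(clipped))


MAX_LENGTH = 4
examples = [[7], [12, 3], [5, 9, 21, 30, 44]]  # made-up word indexes
padded = [pad_or_truncate(e, MAX_LENGTH) for e in examples]

print(padded)  # [[7, 0, 0, 0], [12, 3, 0, 0], [5, 9, 21, 30]]
```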

Next, we create a neural network to learn to classify these reviews.

In [None]:
model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, 8),
    nn.Flatten(),
    nn.Linear(8 * MAX_LENGTH, 1),
    nn.Sigmoid()
)

This network accepts four integer inputs that specify the word indexes of a padded restaurant review. The first layer, the embedding layer, converts these four indexes into four vectors of length 8. These vectors come from the lookup table that contains 50 (VOCAB_SIZE) rows of vectors of length 8, which accounts for the 400 (8 times 50) parameters in the embedding layer. The output size from the embedding layer is 32 (4 words expressed as 8-number embedded vectors). A single output neuron is connected to the flattened embedding output by 33 weights (32 from the embedding output plus a single bias). Because this is a binary classification network, we use the sigmoid activation function and binary cross-entropy loss (`nn.BCELoss`).
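
The parameter counts quoted above follow directly from the layer sizes. The arithmetic can be spelled out as a quick check; the variable names below are chosen for illustration only.

```python
VOCAB_SIZE = 50
EMBEDDING_DIM = 8
MAX_LENGTH = 4

# Lookup table: one length-8 vector for each of the 50 possible indexes.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# Flattened output: 4 words, each expressed as 8 numbers.
flattened_size = MAX_LENGTH * EMBEDDING_DIM

# Output neuron: one weight per flattened value plus one bias.
linear_params = flattened_size * 1 + 1

total_params = embedding_params + linear_params

print(embedding_params, flattened_size, linear_params, total_params)
# 400 32 33 433
```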

The program now trains the neural network. Training updates both the embedding lookup table and the 33 dense-layer weights to produce a better score.

In [None]:
criterion = nn.BCELoss()  # Binary Cross Entropy
optimizer = optim.Adam(model.parameters())

# Training the model
epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(padded_reviews.long())
    loss = criterion(outputs.squeeze(), torch.tensor(labels, dtype=torch.float))
    loss.backward()
    optimizer.step()

We can see the learned embeddings.  Think of each word's vector as a location in an 8-dimensional space, where training places words associated with positive reviews close to each other.  Similarly, training places words associated with negative reviews close to each other.  In addition to setting these embeddings, training adjusts the 33 weights between the embedding layer and the output neuron so that they transform the embeddings into an actual prediction.  You can see these embeddings here.

In [None]:
embedding_weights = list(model[0].parameters())[0]
print(embedding_weights.shape)
print(embedding_weights)


We can now evaluate this neural network's accuracy, including the embeddings and the learned dense layer.


In [None]:
# Evaluation
with torch.no_grad():
    outputs = model(padded_reviews.long())
    predictions = (outputs > 0.5).float().squeeze()
    accuracy = (predictions == torch.tensor(labels)).float().mean().item()
    loss_value = criterion(outputs.squeeze(), torch.tensor(labels, dtype=torch.float)).item()

print(f'Accuracy: {accuracy}')
print(f'Log-loss: {loss_value}')

The accuracy is typically high, but there could be overfitting; for a more complex data set, it would be good to use early stopping. The loss, however, is not zero. Even when every predicted probability falls on the correct side of 0.5, the program does not achieve absolute confidence in each answer. The lack of confidence is likely due in part to the small amount of noise (previously discussed) in the data set. Words that appear in both positive and negative reviews also contribute to this lack of absolute certainty.
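
The gap between accuracy and loss can be illustrated with a few made-up probabilities. In the sketch below, every prediction lands on the correct side of 0.5, so accuracy is perfect, yet the binary cross-entropy is still above zero because no prediction is fully confident. The probabilities and labels are invented for illustration and are not outputs of the model above.

```python
import math


def binary_cross_entropy(probabilities, labels):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 1, 0, 0]
probabilities = [0.9, 0.8, 0.2, 0.1]   # made-up model outputs

predictions = [1 if p > 0.5 else 0 for p in probabilities]
accuracy = sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)
loss = binary_cross_entropy(probabilities, labels)

print(accuracy)         # 1.0
print(round(loss, 4))   # 0.1643
```

The loss would reach zero only if every probability were exactly 0 or 1, which a sigmoid output approaches but does not attain.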


## **Lesson Turn-in**

When you have completed and run all of the code cells, use the **File --> Print.. --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Copy of Class_04_2.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Lizard Tail**

## **Attention Is All You Need**

![__](https://upload.wikimedia.org/wikipedia/commons/8/8f/The-Transformer-model-architecture.png)

**"Attention Is All You Need"** is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT. At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.

The paper's title is a reference to the song "All You Need Is Love" by the Beatles. The name "Transformer" was picked because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word.

An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers animated show. The team was named Team Transformer.

Some early examples that the team tried their Transformer architecture on included English-to-German translation, generating Wikipedia articles on "The Transformer", and parsing. These convinced the team that the Transformer is a general purpose language model, and not just good for translation.

As of 2024, the paper has been cited more than 140,000 times.

**Authors**

The authors of the paper are: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized. The Wired article highlights the group's diversity:

> Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively.

**Methods discussed & introduced**

The paper is most well known for the introduction of the Transformer architecture, which forms the underlying architecture for most forms of modern Large Language Models (LLMs). A key reason for why the architecture is preferred by most modern LLMs is the parallelizability of the architecture over its predecessors. This ensures that the operations necessary for training can be accelerated on a GPU allowing both faster training times and models of bigger sizes to be trained.

The following mechanisms were introduced by the paper as part of the development of the transformer architecture.

**Scaled dot-product attention & self-attention**

The use of the scaled dot-product attention and self-attention mechanism instead of an RNN or LSTM (which rely on recurrence instead) allows for better performance, as described in the following paragraph.

Since the model relies on Query (Q), Key (K) and Value (V) matrices that come from the same source itself (i.e. the input sequence / context window), this eliminates the need for RNNs completely ensuring parallelizability for the architecture. This differs from the original form of the Attention mechanism introduced in 2014. Additionally, the paper also discusses the use of an additional scaling factor that was found to be most effective with respect to the dimension of the key vectors.
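
The paper defines scaled dot-product attention as $\mathrm{softmax}(QK^T/\sqrt{d_k})\,V$, where $d_k$ is the dimension of the key vectors. A minimal plain-Python sketch is shown below. As a simplifying assumption, the same small made-up matrix `X` is used directly as Q, K, and V; a real Transformer first applies separate learned linear projections to the input.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, attention weights) for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Three made-up token vectors of dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(X, X, X)

for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` sums to 1, and every row can be computed independently of the others, which is what makes the operation easy to parallelize.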

In the specific context of translation, which the paper focused on, the encoder-decoder attention layers take their queries from the decoder (the target-language side), while the keys and values come from the encoder output (the source-language side).

**Multi-head attention**

In the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (limited typically by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps. Multi-head attention enhances this process by introducing multiple parallel attention heads. Each attention head learns different linear projections of the Q, K, and V matrices. This allows the model to capture different aspects of the relationships between words in the sequence simultaneously, rather than focusing on a single aspect.

By doing this, multi-head attention ensures that the input embeddings are updated from a more varied and diverse set of perspectives. After the attention outputs from all heads are calculated, they are concatenated and passed through a final linear transformation to generate the output.

**Positional encoding**

Since the Transformer model is not recurrent and does not process tokens one after another, it has no built-in notion of token order. The paper therefore relied on sine and cosine functions of different frequencies to encode the position of each token into its embedding.
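
In the paper, even dimensions of the encoding use $\sin(pos / 10000^{2i/d_{model}})$ and odd dimensions use the matching cosine. A short plain-Python sketch follows; the function name `positional_encoding` and the small sizes are chosen for illustration.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position table with one row per position."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # i walks over the even dimension indexes 0, 2, 4, ...
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(num_positions=4, d_model=6)
for row in pe:
    print([round(x, 3) for x in row])
```

Each row is added to the embedding of the token at that position, so two identical tokens at different positions enter the model with different vectors.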

**Historical context**

For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995), a RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called multiplicative units. Neural networks using multiplicative units were later called sigma-pi networks or higher-order networks. LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers. However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input. One of its two networks has "fast weights" or "dynamic links" (1981). A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries. This was later shown to be equivalent to the unnormalized linear Transformer.