# Adapting Foundation Models

Adaptation tailors a pre-trained foundation model to a specific task or domain, typically by exposing it to new data. A well-adapted model performs better in specialized applications, can be kept within privacy constraints, and is more closely aligned with the needs of the organization or industry using it.

<img src="img/img_12.png">

**Technical terms explained:**
* **Fine-tuning**: This is a technique in machine learning where an already trained model is further trained (or tuned) on a new, typically smaller, dataset for better performance on a specific task.

# Why We Need to Adapt Foundation Models

Adapting foundation models is essential because, despite their extensive training on large datasets, they have limitations in specific areas. Although they excel at many tasks, these models can misinterpret questions or lack up-to-date information, which highlights the need for adaptation. Addressing these weaknesses through additional training or other techniques can significantly improve their performance.

# Retrieval Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) keeps a generative AI model informed with recent data, which is particularly useful for domain-specific questions. It combines the language understanding of a large language model (LLM) with up-to-date information pulled from a database of relevant text snippets. Because the retrieved snippets are supplied to the model alongside the question, responses can stay accurate and reflect the latest developments.

<img src="img/img_13.png">

**Technical Terms:**
* **Semantic embedding**: A representation of text as a point in a high-dimensional space where distances between points correspond to semantic similarity. Phrases with similar meanings are closer together.
* **Cosine similarity**: A metric used to measure how similar two vectors are, typically used in the context of semantic embeddings to assess similarity of meanings.
* **Vector databases**: Specialized databases designed to store and handle vector data, often employed for facilitating fast and efficient similarity searches.
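To make the retrieval step concrete, here is a minimal sketch in plain Python. The snippets and their three-dimensional embeddings are hypothetical values chosen for illustration; a real system would obtain embeddings from an embedding model and would typically store them in a vector database. The function names `retrieve` and `build_prompt` are also assumptions, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical snippets with made-up 3-dimensional embeddings
snippets = {
    "Refunds are issued within 14 days.": [0.9, 0.1, 0.0],
    "Our office is closed on public holidays.": [0.1, 0.8, 0.2],
    "Shipping takes 3 to 5 business days.": [0.2, 0.1, 0.9],
}


def retrieve(query_embedding, k=1):
    """Return the k snippets whose embeddings are most similar to the query."""
    ranked = sorted(
        snippets,
        key=lambda text: cosine_similarity(query_embedding, snippets[text]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, query_embedding, k=1):
    """Place the retrieved snippets in front of the question."""
    context = "\n".join(retrieve(query_embedding, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical embedding of the question below
query_embedding = [0.85, 0.2, 0.05]
print(build_prompt("How long do refunds take?", query_embedding))
```

The resulting prompt is what would be sent to the LLM, so the model answers from the retrieved text instead of relying only on what it learned during training.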

# Prompt Design Techniques

Prompt design techniques are strategies for tailoring foundation models to specific tasks without changing the model itself. By carefully constructing the prompts we provide, we can guide the model's output and make its responses more relevant to the task at hand.

1. Prompt Tuning: customize templates to guide a model's predictions in a domain-specific task. This often involves choosing which words define concepts and where they are placed in the overall structure of the prompt, e.g., placing a question at the end of a prompt is often associated with better results.
2. Few-Shot Prompting: provide a handful of examples to help guide the model's predictions, e.g., if we want the model to answer a mathematics problem, we can include a few examples of math problems with their solutions.
3. Zero-Shot Prompting: allow the foundation model to handle a task without any task-specific examples in the prompt, e.g., "what is the weather tomorrow" instead of "what is the weather tomorrow, for example, last Tuesday was rainy with low of 30".
4. In-Context Learning: allow the model to learn from the context provided in the prompt. This can take two forms:
   * Provide instructions in the prompt
   * Provide examples in the prompt
5. Chain of Thought (CoT): provide a series of steps the model can consider to solve a complex task, e.g., provide the steps used to answer a math problem that the model may be able to *consider* and *follow*. You may even just say "think in steps" to encourage the model to use a CoT approach instead of laying out the steps yourself.

**Technical Term Explanations:**
* **Domain-Specific Task**: A task that is specialized or relevant to a particular area of knowledge or industry, often requiring tailored AI responses.

## 1. Prompt Tuning

Prompt tuning is a technique in generative AI which allows models to target specific tasks effectively. By crafting prompts, whether through a hands-on approach with hard prompts or through an automated process with soft prompts, we enhance the model's predictive capabilities.

**Technical Terms Defined**:
* **Prompt**: In AI, a prompt is an input given to the model to generate a specific response or output.
* **Prompt Tuning**: This is a method to improve AI models by optimizing prompts so that the model produces better results for specific tasks.
* **Hard Prompt**: A manually created template used to guide an AI model's predictions. It requires human ingenuity to craft effective prompts.
* **Soft Prompt**: A series of tokens or embeddings optimized through deep learning to help guide model predictions, without necessarily making sense to humans.
  * A soft prompt may appear to people as gibberish, but because it takes advantage of the way the model is designed, it can be associated with superior outcomes.

<img src="img/img_14.png">

## 2. One and Few-Shot Prompting

One-shot and few-shot prompting let a model adapt to a task with minimal instruction. Instead of being trained on a large task-specific dataset, the model is guided by just one or a few examples placed in the prompt and is expected to generalize from them to new problems. This makes it possible to adjust a model to a specialized task quickly.

**Technical Terms Explained:**
* **One-shot prompting**: Giving an AI model a single example to learn from before it attempts a similar task.
* **Few-shot prompting**: Providing an AI model with a small set of examples, such as five or fewer, from which it can learn to generalize and perform tasks.

**One Shot Prompt Example:**
```
Q: What is (25+1)*2?
A: The calculation is simple. You add 25 and 1, which equals 26. Then, you multiply 26 by 2, which equals 52. So, (25+1)*2 is 52.

Q: What is (3+2)*2?
___
```

**Few Shot Prompt Example:**
```
1. Q: What is the capital of France?
A: Paris
2. Q: What is the capital of Peru?
A: Lima
3. Q: What is the capital of the Philippines?
A: Manila
4. Q: What is the capital of Algeria?
___
```
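Prompts like the ones above are usually assembled programmatically. The following is a minimal sketch, assuming a simple `Q:`/`A:` layout; the helper name `build_few_shot_prompt` is hypothetical and the exact formatting is a design choice, not a requirement of any model.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs followed by the new, unanswered question."""
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    # The final answer is left blank for the model to complete
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Peru?", "Lima"),
    ("What is the capital of the Philippines?", "Manila"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Algeria?")
print(prompt)
```

Passing a single example to the same helper yields a one-shot prompt, and passing an empty list yields a zero-shot prompt.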

## 3. Zero-Shot Prompting to Classify a Legal Document

Zero-shot prompting asks a generative AI model to take on a new task without providing any task-specific examples. It relies on the knowledge the model gained from learning patterns across vast datasets, so the model must infer and generalize to provide answers in contexts that were not expressly covered during its initial training.

**Technical Terms Explained:**
* **Zero-shot prompting**: Prompting an AI model to perform a task without including any examples of that task in the prompt, so that the model relies solely on its prior knowledge and training.
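As an illustration of the idea, here is a minimal sketch of a zero-shot classification prompt for a legal document. The label set, the prompt wording, and the helper names `build_zero_shot_prompt` and `parse_label` are all assumptions made for this example; the prompt contains instructions only and no worked examples.

```python
# Hypothetical label set for illustration
LABELS = ["contract", "court filing", "statute", "other"]


def build_zero_shot_prompt(document_text):
    """Build a classification prompt that contains instructions but no examples."""
    label_list = ", ".join(LABELS)
    return (
        "Classify the following legal document into exactly one of these "
        f"categories: {label_list}.\n\n"
        f"Document:\n{document_text}\n\n"
        "Category:"
    )


def parse_label(model_reply):
    """Map a free-text model reply onto an allowed label, or 'other' if none match."""
    reply = model_reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "other"


prompt = build_zero_shot_prompt("This Agreement is entered into by and between ...")
print(prompt)
```

Constraining the answer to a fixed list and parsing the reply defensively helps when the model responds with extra words around the label.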

## 4. In-Context Learning

When performing few-shot, one-shot, or zero-shot learning, we can pass information to the model within the prompt in the form of examples, descriptions, or other data. When we rely on a model using information from within the prompt itself instead of relying on what is stored within its own parameters, we are using in-context learning.

As these AI models grow in size, their ability to absorb and use in-context information improves significantly, which makes them more effective at adapting to a variety of tasks. This trend suggests that future models could be even more intuitive and useful.

<img src="img/img_15.png">

<img src="img/img_16.png">

## 5. Chain-of-Thought Prompting

Chain-of-Thought prompting enhances the reasoning capabilities of large language models by breaking complex problems down into intermediate steps that lead to a solution. Given a line of reasoning to follow, a model can more effectively tackle problems that require multi-step problem solving, such as deducing the number of cookies in a box after considering all the variables.

**Technical Terms Explained:**
* **Chain-of-Thought Prompting**: A method of guiding a language model through a step-by-step reasoning process to help it solve complex tasks by explicitly detailing the logic needed to reach a conclusion.

**Prompt Example WITHOUT CoT:**
```
Problem: A baker bakes 60 cookies. She sells 15 of them to a customer and then packs the rest equally into 5 boxes. How many cookies are in each box?

Answer: 9 cookies

Problem: A baker bakes 30 cookies. Five of them are burnt. She sells 15 of them to a customer and then packs the rest equally into 5 boxes. How many cookies are in each box?
___
```

**Prompt Example WITH CoT:**
```
Problem: A baker bakes 60 cookies. She sells 15 of them to a customer and then packs the rest equally into 5 boxes. How many cookies are in each box?

Answer:
1. Start with the total number of cookies, which is 60.
2. Subtract the number of cookies sold to the customer, 15, from the total.
3. Calculate the remaining cookies: 60 - 15 = 45 cookies.
4. Divide the remaining cookies equally into 5 boxes.
5. To find out how many cookies are in each box: 45 / 5 = 9 cookies

Problem: A baker bakes 30 cookies. Five of them are burnt. She sells 15 of them to a customer and then packs the rest equally into 5 boxes. How many cookies are in each box?
___
```
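The reasoning chain in the worked example can be checked with a few lines of arithmetic. This sketch assumes that burnt cookies are thrown away before any are sold and that the remaining cookies divide evenly into the boxes; the function name `cookies_per_box` is hypothetical. Under those assumptions the second problem works out to 2 cookies per box.

```python
def cookies_per_box(baked, sold, boxes, burnt=0):
    """Return the reasoning steps and the final answer for the cookie problem."""
    usable = baked - burnt          # assumption: burnt cookies are thrown away
    remaining = usable - sold
    per_box = remaining // boxes    # assumption: the cookies divide evenly
    steps = [
        f"Start with {baked} cookies and discard {burnt} burnt ones: {usable} left.",
        f"Subtract the {sold} cookies sold: {usable} - {sold} = {remaining}.",
        f"Divide equally into {boxes} boxes: {remaining} / {boxes} = {per_box}.",
    ]
    return steps, per_box


problems = [
    dict(baked=60, sold=15, boxes=5),
    dict(baked=30, sold=15, boxes=5, burnt=5),
]
for problem in problems:
    steps, answer = cookies_per_box(**problem)
    print("\n".join(steps))
    print(f"Answer: {answer} cookies\n")
```

The second problem adds a step (the burnt cookies) that the first one lacks, which is exactly the kind of variation where spelling out intermediate steps helps.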

# Using Probing to Train a Classifier

Using probing to train a classifier is a practical way to tailor foundation models, like BERT, for specific applications. By adding a modestly sized neural network, known as a classification head, to a foundation model, one can specialize the model for a particular task such as sentiment analysis. This technique involves freezing the original model's parameters and adjusting only the classification head through training with labeled data. Because only the small head is trained, adapting a sophisticated model this way is comparatively cheap and simple.

**Technical Terms Explained:**
* **Probing**: This is a method of examining what information is contained in different parts of a machine learning model.
* **Linear Probing**: A simple form of probing that involves attaching a linear classifier to a pre-trained model to adapt it to a new task without modifying the original model.
* **Classification Head**: It is the part of a neural network that is tailored to classify input data into defined categories.
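The idea of training only a head on top of frozen features can be sketched without any deep learning library. In the toy example below, `frozen_features` is a hypothetical stand-in for a frozen backbone (it simply counts words from two made-up word lists), and the head is a logistic-regression classifier trained with stochastic gradient descent. This is an illustration of the principle, not how BERT or the Transformers library implements it.

```python
import math

# Hypothetical vocabularies used by the stand-in backbone
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}


def frozen_features(text):
    """Stand-in for a frozen backbone: it has no trainable parameters."""
    words = text.lower().split()
    return [
        float(sum(w in POSITIVE_WORDS for w in words)),
        float(sum(w in NEGATIVE_WORDS for w in words)),
    ]


def train_head(data, epochs=200, lr=0.5):
    """Train only the linear head (weights w and bias b) with logistic-regression SGD."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in data:
            x = frozen_features(text)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - label
            # Gradient step on the head only; the feature extractor never changes
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b


def predict(w, b, text):
    """Return 1 (positive) or 0 (negative) for a piece of text."""
    x = frozen_features(text)
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


train_data = [
    ("great movie love it", 1),
    ("awful bad plot", 0),
    ("good fun", 1),
    ("hate it", 0),
]
w, b = train_head(train_data)
print(predict(w, b, "great acting"), predict(w, b, "awful pacing"))
```

Only three numbers (two weights and a bias) are learned here, which mirrors why probing is cheap: the expensive part of the model is reused as-is.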

<img src="img/img_17.png">

# Exercise: Create a BERT sentiment classifier

In this exercise, you will create a BERT sentiment classifier (actually DistilBERT) using the [Hugging Face Transformers](https://huggingface.co/transformers/) library. 

You will use the [IMDB movie review dataset](https://huggingface.co/datasets/imdb) to train and evaluate your model. The IMDB dataset contains movie reviews that are labeled as either positive or negative. 

In [1]:
# You will need to choose "Kernel -> Restart Kernel" from the menu after executing this cell
%pip install datasets==3.2.0 huggingface_hub==0.28.1

Collecting huggingface_hub==0.28.1
  Downloading huggingface_hub-0.28.1-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.28.1-py3-none-any.whl (464 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.26.5
    Uninstalling huggingface-hub-0.26.5:
      Successfully uninstalled huggingface-hub-0.26.5
Successfully installed huggingface_hub-0.28.1
Note: you may need to restart the kernel to use updated packages.


In [12]:
# Import the libraries used in this exercise
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import pandas as pd

In [2]:
# Load the train and test splits of the imdb dataset
splits = ["train", "test"]
ds = {split: dataset for split, dataset in zip(splits, load_dataset("imdb", split=splits))}

# Thin out the dataset to make it run faster for this example
for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(500))

# Show the dataset
ds

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

## Pre-process datasets

Now we are going to process our datasets by converting all the text into tokens for our models. You may ask, why isn't the text converted already? Well, different models may use different tokenizers, so by converting at train time we retain more flexibility.

In [None]:
# Replace <MASK> with your code that tokenizes the examples
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples."""
    # <MASK>
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)


# Check that we tokenized the examples properly
assert tokenized_ds["train"][0]["input_ids"][:5] == [101, 2045, 2003, 2053, 7189]

# Show the first example of the tokenized training set
print(tokenized_ds["train"][0]["input_ids"])

[101, 2045, 2003, 2053, 7189, 2012, 2035, 2090, 3481, 3771, 1998, 6337, 2099, 2021, 1996, 2755, 2008, 2119, 2024, 2610, 2186, 2055, 6355, 6997, 1012, 6337, 2099, 3504, 15594, 2100, 1010, 3481, 3771, 3504, 4438, 1012, 6337, 2099, 14811, 2024, 3243, 3722, 1012, 3481, 3771, 1005, 1055, 5436, 2024, 2521, 2062, 8552, 1012, 1012, 1012, 3481, 3771, 3504, 2062, 2066, 3539, 8343, 1010, 2065, 2057, 2031, 2000, 3962, 12319, 1012, 1012, 1012, 1996, 2364, 2839, 2003, 5410, 1998, 6881, 2080, 1010, 2021, 2031, 1000, 17936, 6767, 7054, 3401, 1000, 1012, 2111, 2066, 2000, 12826, 1010, 2000, 3648, 1010, 2000, 16157, 1012, 2129, 2055, 2074, 9107, 1029, 6057, 2518, 2205, 1010, 2111, 3015, 3481, 3771, 3504, 2137, 2021, 1010, 2006, 1996, 2060, 2192, 1010, 9177, 2027, 9544, 2137, 2186, 1006, 999, 999, 999, 1007, 1012, 2672, 2009, 1005, 1055, 1996, 2653, 1010, 2030, 1996, 4382, 1010, 2021, 1045, 2228, 2023, 2186, 2003, 2062, 2394, 2084, 2137, 1012, 2011, 1996, 2126, 1010, 1996, 5889, 2024, 2428, 2204, 1998, 6

## Load and set up the model

We will now load the model and freeze most of the parameters of the model: everything except the classification head.

In [6]:
# Replace <MASK> with your code that freezes the base model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},  # For converting predictions to strings
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

# Freeze all the parameters of the base model
# Hint: Check the documentation at https://huggingface.co/transformers/v4.2.2/training.html
for param in model.base_model.parameters():
    # <MASK>
    param.requires_grad = False

model.classifier

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

In [7]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


## Let's train it!

Now it's time to train our model. We'll use the `Trainer` class from the 🤗 Transformers library to do this. The `Trainer` class provides a high-level API that abstracts away a lot of the training loop.

First, we'll define a function to compute our accuracy metric, and then we'll create the `Trainer`.

Let's take this opportunity to learn about the `DataCollator`. According to the Hugging Face documentation:

> Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

> To be able to build batches, data collators may apply some processing (like padding).
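To illustrate what "padding a batch" means, here is a minimal sketch in plain Python. It is not the library's implementation of `DataCollatorWithPadding`; the function name `pad_batch` is hypothetical, and the padding token id of 0 is an assumption for this example.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad token-id sequences to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2045, 102], [101, 102]])
print(batch)
```

Padding each batch only to its own longest sequence, instead of a fixed maximum length, keeps batches of short texts small.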


In [20]:
# Replace <MASK> with your DataCollatorWithPadding argument(s)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    # data_collator=DataCollatorWithPadding(<MASK>),
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.484688 | 0.818 |


TrainOutput(global_step=125, training_loss=0.525149169921875, metrics={'train_runtime': 14.1793, 'train_samples_per_second': 35.263, 'train_steps_per_second': 8.816, 'total_flos': 66233699328000.0, 'train_loss': 0.525149169921875, 'epoch': 1.0})

## Evaluate the model

Evaluating the model is as simple as calling the `evaluate` method on the `trainer` object. This will run the model on the test set and compute the metrics we specified in the `compute_metrics` function.

In [11]:
# Show the performance of the model on the test set
# What do you think the evaluation accuracy will be?
trainer.evaluate()

{'eval_loss': 0.4612870216369629,
 'eval_accuracy': 0.802,
 'eval_runtime': 6.3511,
 'eval_samples_per_second': 78.727,
 'eval_steps_per_second': 19.682,
 'epoch': 1.0}

### View the results

Let's look at two examples with labels and predicted values.

In [13]:
df = pd.DataFrame(tokenized_ds["test"])
df = df[["text", "label"]]

# Replace <br /> tags in the text with spaces
df["text"] = df["text"].str.replace("<br />", " ")

# Add the model predictions to the dataframe
predictions = trainer.predict(tokenized_ds["test"])
df["predicted_label"] = np.argmax(predictions[0], axis=1)

df.head(2)

| | text | label | predicted_label |
|---|---|---|---|
| 0 | When I unsuspectedly rented A Thousand Acres... | 1 | 1 |
| 1 | This is the latest entry in the long series of... | 1 | 1 |


In [18]:
df['correct'] = df['label'] == df['predicted_label']
# The mean of a boolean column is the fraction of correct predictions
print(f"Accuracy: {df['correct'].mean() * 100:.1f}")

Accuracy: 80.2


### Look at some of the incorrect predictions

Let's take a look at some of the incorrectly predicted examples.

In [19]:
# Show full cell output
pd.set_option("display.max_colwidth", None)

df[df["label"] != df["predicted_label"]].head(2)

Unnamed: 0,text,label,predicted_label,correct
7,"Don't pay any attention to the rave reviews of this film here. It is the worst Van Damme film and one of the worst of any sort I have ever seen. It would appeal to somebody with no depth whatever who requires nothing more than gunfire and explosions to be entertained. Seeing that this is directed by Peter Hyams it has made me realise that Peter has no talent as a director, but is very good at filming explosions and the like. However, movies need other elements as well; for example, a story. This one didn't have one. This might explain the awfulness of some of Mr. Hyams' more recent films, hardly any better than this one, really. One can't help wondering how some people ever were put behind a camera.",0,1,False
21,"Coming from Kiarostami, this art-house visual and sound exposition is a surprise. For a director known for his narratives and keen observation of humans, especially children, this excursion into minimalist cinematography begs for questions: Why did he do it? Was it to keep him busy during a vacation at the shore? ""Five, 5 Long Takes"" consists of, you guessed it, five long takes. They are (the title names are my own and the times approximate): ""Driftwood and waves"". The camera stands nearly still looking at a small piece of driftwood as it gets moved around by small waves splashing on a beach. Ten minutes. ""Watching people on the boardwalk"". The camera stands still looking at the ocean horizon and a boardwalk. People walk across the camera frame, their faces too far and blurry to make them interesting. Eleven minutes. ""Six dogs at the water's edge"". The camera stands still looking at the ocean horizon with a sandy stretch of beach nearby. Far away at the water's edge, six dogs not doing much, just relaxing. Sixteen minutes. ""Ducks in line, gaggle of ducks"". The camera stands still looking at the ocean horizon near the water's edge. Dozen and dozen of ducks stream in single file from left to right. I assume that Kiarostami released them gradually. The last two ducks stop dead on their track and suddenly a gaggle of ducks rolls quietly from right to left. I assume Kiarostami collected the ducks and re-released all at the same time. It is not the first time that he deals with the contrast between organized and disorganized behavior. Eight minutes. ""Frog symphony, oops, I mean cacophony, for a stormy night"". The camera stands over a pond at night. It's pitch black except for what appears to be the reflection of the moon on the undulating water. It is a stormy night and clouds race to cover the moon. The screen goes dark. What remains for us is the cacophony of frogs, howling dogs and, eventually, morning roosters. 
Hit me on the head if this was done in a single take. I saw this segment as a sound composition put together in the editing room and accompanied by a simple visualization. Twenty seven minutes! Except for the mildly amusing ducks, this exercise in minimalism left me cold. A nonessential film for Kiarostami admirers. I thought I would rate ""Five"" a five, but four is what it deserves. The film is dedicated to Yasujiru Ozu.",0,1,False


## End of exercise