# What is a Foundation Model

A foundation model is a large AI model that, after being trained on a broad and diverse collection of data, can perform many different tasks. These models are versatile and provide a base for building a wide range of AI applications, much as a strong foundation can support different kinds of buildings. Starting from a foundation model gives us a solid starting point for building specialized AI applications.


## Terms Explained:

* **Foundation Model**: A large AI model trained on a wide variety of data, which can do many tasks without much extra training.

* **Adapted**: Modified or adjusted to suit new conditions or a new purpose. In the context of foundation models, this means adjusting a pre-trained model so that it performs a specific task.

* **Generalize**: The ability of a model to apply what it has learned from its training data to new, unseen data.

Foundation models represent a paradigm shift in building machine learning systems. In the past, the focus was on building large datasets and training models from scratch. With the advent of foundation models, the pattern has changed: people now generally start with a foundation model and build on it.

# Foundation Models vs Traditional Models

Foundation models and traditional models are two distinct approaches in the field of artificial intelligence, each with different strengths. Foundation models, which are trained on large, diverse datasets, can adapt to and perform well on many different tasks. In contrast, traditional models specialize in specific tasks by learning from smaller, focused datasets, which makes them simpler and more efficient for targeted applications.

<img src="img/img_00.png">

# Architecture and Scale

The transformer architecture has changed the way machines handle language by making it practical to train on sequential data at scale. As a result, today’s AI models are massive, with some having billions of parameters or more, which allows for great flexibility across many tasks. The technology is exciting and holds great promise for the future.

## Technical Terms:

* **Sequential data**: Information that is arranged in a specific order, such as words in a sentence or events in time.

* **Self-attention mechanism**: The self-attention mechanism in a transformer is a process where each element in a sequence computes its representation by attending to and weighing the importance of all elements in the sequence, allowing the model to capture complex relationships and dependencies.
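
To make the self-attention definition above more concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: the queries, keys, and values are the input vectors themselves (a real transformer first applies learned projection matrices), and the three toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention where queries, keys, and values are the inputs.

    Assumption: the learned query/key/value projections of a real
    transformer are omitted to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this element against every element (scaled dot product)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted average of all elements
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "token" vectors (made-up numbers)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one element attends to every element in the sequence, and each row sums to 1.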

<img src="img/img_01.png">

The initial LLaMA model from Meta was trained on more than 4.7 TB of data.

<img src="img/img_02.png">

Traditional models typically have between 3 and 20 parameters, each associated with an input variable (one parameter for $X_{0}$, another for $X_{1}$, and so on through $X_{n}$, so that each $X_{i}$ has a corresponding $\beta_{i}$), while foundation models generally have billions of parameters spread over the layers of their networks.
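
The difference in scale can be sketched with some rough arithmetic. The linear model count treats the intercept as one extra coefficient. The transformer estimate is an assumption, not an exact formula: it uses the common approximation of about 12 × hidden_size² weights per layer plus the token embeddings, and the configuration values below are hypothetical.

```python
def linear_model_params(num_features):
    """One coefficient (beta) per input variable, plus an intercept."""
    return num_features + 1

def transformer_params_estimate(num_layers, hidden_size, vocab_size):
    """Rough estimate of a transformer's parameter count.

    Assumption: about 12 * hidden_size^2 weights per layer (attention
    matrices plus a feed-forward block roughly 4x wider than hidden_size).
    Biases and normalization weights are ignored.
    """
    per_layer = 12 * hidden_size ** 2
    embeddings = vocab_size * hidden_size
    return num_layers * per_layer + embeddings

# A regression with 10 input variables
print(linear_model_params(10))

# Hypothetical configuration in the range of a multi-billion-parameter model
estimate = transformer_params_estimate(
    num_layers=32, hidden_size=4096, vocab_size=32_000
)
print(f"{estimate:,}")
```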

<img src="img/img_03.png">

# Exercise: Use a Foundation Model to Build a Spam Email Classifier

A foundation model serves as a fundamental building block for potentially endless applications. One application we will explore in this exercise is the development of a spam email classifier using only the prompt. By leveraging the capabilities of a foundation model, this project aims to accurately identify and filter out unwanted and potentially harmful emails, enhancing user experience and security.

## Steps

1. Identify and gather relevant data
2. Build and evaluate the spam email classifier
3. Build an improved classifier?

## Step 1: Identify and Gather Relevant Data

To train and test the spam email classifier, you will need a dataset of emails that are labeled as spam or not spam. It is important to identify and gather a suitable dataset that represents a wide range of spam and non-spam emails.

In [1]:
from datasets import load_dataset
from transformers import pipeline
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [2]:
# Find a spam dataset at https://huggingface.co/datasets and load it using the datasets library
dataset = load_dataset("sms_spam", split=["train"])[0]

In [3]:
for entry in dataset.select(range(3)):
    sms = entry["sms"]
    label = entry["label"]
    print(f"label={label}, sms={sms}")

label=0, sms=Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

label=0, sms=Ok lar... Joking wif u oni...

label=1, sms=Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's



Those labels could be easier to read. Let's create two dictionaries to convert between numerical ids and labels.

In [4]:
id2label = {0: "NOT SPAM", 1: "SPAM"}
label2id = {v:k for k,v in id2label.items()}
# label2id = {"NOT SPAM": 0, "SPAM": 1}

In [5]:
for entry in dataset.select(range(3)):
    sms = entry["sms"]
    label_id = entry["label"]
    print(f"label={id2label[label_id]}, sms={sms}")

label=NOT SPAM, sms=Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

label=NOT SPAM, sms=Ok lar... Joking wif u oni...

label=SPAM, sms=Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's



## Step 2: Build and evaluate the spam email classifier

Using the foundation model and the prepared dataset, you can create a spam email classifier.

Let's write a prompt that will ask the model to classify a batch of messages as either "spam" or "not spam". For easier parsing, we can ask the LLM to respond in JSON.

In [6]:
# Let's start with this helper function that will help us format sms messages
# for the LLM.
def get_sms_messages_string(dataset, item_numbers, include_labels=False):
    sms_messages_string = ""
    for item_number, entry in zip(item_numbers, dataset.select(item_numbers)):
        sms = entry["sms"]
        label_id = entry["label"]

        if include_labels:
            sms_messages_string += (
                f"{item_number} (label={id2label[label_id]}) -> {sms}\n"
            )
        else:
            sms_messages_string += f"{item_number} -> {sms}\n"

    return sms_messages_string

In [7]:
print(get_sms_messages_string(dataset, range(3), include_labels=True))

0 (label=NOT SPAM) -> Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

1 (label=NOT SPAM) -> Ok lar... Joking wif u oni...

2 (label=SPAM) -> Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's




Now let's write a bit of code that will produce your prompt. Your prompt should include a few SMS messages to be labelled as well as instructions for the LLM.

Some LLMs will also format the output for you as JSON if you ask them, e.g. "Respond in JSON format."

In [8]:
# Replace <MASK> with your code

# Get a few messages and format them as a string
sms_messages_string = get_sms_messages_string(dataset, range(7, 15))

# Construct a query to send to the LLM including the sms messages.
# Ask it to respond in JSON format.
# query = <MASK>
query = f"""Classify the message as SPAM or NOT SPAM. Respond in JSON format. Use the following format: {id2label}. Below is the message
---

{sms_messages_string}
"""

print(query)

Classify the message as SPAM or NOT SPAM. Respond in JSON format. Use the following format: {0: 'NOT SPAM', 1: 'SPAM'}. Below is the message
---

7 -> As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

8 -> WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

9 -> Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

10 -> I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

11 -> SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info

12 -> URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAI

In [9]:
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use mps:0


In [31]:
pipe("who are you?")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'who are you? (the first in a series)\nI’ve been thinking a lot lately about what it means to be human. We are so much more than our bodies and our minds. We are souls, and our bodies are merely a vehicle for our souls to experience this life. Our souls are eternal and our bodies are temporary.\nI believe that we are all connected to each other and to the universe. We are all part of the same energy, and we are all here to learn and grow. We are all here to experience love and joy, and to help others do the same.\nI believe that we are all here to learn and grow. We are all here to experience love and joy, and to help others do the same. I believe that we are all connected to each other and to the universe. We are all part of the same energy, and we are all here to learn and grow.\nI believe that we are all connected to each other and to the universe. We are all part of the same energy, and we are all here to learn and grow. I believe that we are all connected to ea

In [9]:
model_id = "meta-llama/Meta-Llama-3-8B"

pipe_0 = pipeline(
    "text-generation",
    model=model_id,
    # model_kwargs={"torch_dtype": torch.bfloat16}
    model_kwargs={
                "torch_dtype": torch.bfloat16,
                # "quantization_config": {"load_in_4bit": True},
                "low_cpu_mem_usage": True,
            },
    device_map="auto"
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use mps


In [12]:
pipe_0("Hey how are you doing today?",
       pad_token_id=pipe_0.tokenizer.eos_token_id,
      )

[{'generated_text': 'Hey how are you doing today? I am doing well and hope you are too. Today I am going to be sharing a card that I made for my sister. She is a huge fan of the TV show, "The Walking Dead" and I thought that this stamp set would be perfect for her. I am not a big fan of zombies or the show, but I do think that the stamp set is super cute. The stamp set is called "Walkers" and is from the company called "The Greeting Farm". I used the stamp set to make a card and also a matching treat bag. I hope you enjoy!\nThis is such a cute card! I love the zombies and the cute sentiment!\nI love the card. I love the zombies and the sentiment.\nLove this zombie card and the sentiment is great!\nI love the card. I love the zombies and the sentiment.'}]

In [10]:
max_tokens = 4096
temperature = 0.1
# top_p = 0.1
# terminators = [
#     pipe_0.tokenizer.eos_token_id,
#     pipe_0.tokenizer.convert_tokens_to_ids("")]
outcome = pipe_0(
    "Hey how are you doing today?",
    pad_token_id=pipe_0.tokenizer.eos_token_id,
    max_new_tokens=max_tokens,
    # eos_token_id=terminators,
    do_sample=True,
    temperature=temperature,
    # top_p=top_p
)
print(outcome)

[{'generated_text': 'Hey how are you doing today? I hope you are doing well. I am doing well. I am just getting ready to go to work. I am going to be working at the hospital today. I am going to be working in the emergency room. I am going to be working with the nurses. I am going to be working with the doctors. I am going to be working with the patients. I am going to be working with the staff. I am going to be working with the visitors. I am going to be working with the volunteers. I am going to be working with the security guards. I am going to be working with the maintenance staff. I am going to be working with the cleaning staff. I am going to be working with the food service staff. I am going to be working with the pharmacy staff. I am going to be working with the lab staff. I am going to be working with the radiology staff. I am going to be working with the physical therapy staff. I am going to be working with the occupational therapy staff. I am going to be working with the spe

In [11]:
print(outcome[0]['generated_text'])

Hey how are you doing today? I hope you are doing well. I am doing well. I am just getting ready to go to work. I am going to be working at the hospital today. I am going to be working in the emergency room. I am going to be working with the nurses. I am going to be working with the doctors. I am going to be working with the patients. I am going to be working with the staff. I am going to be working with the visitors. I am going to be working with the volunteers. I am going to be working with the security guards. I am going to be working with the maintenance staff. I am going to be working with the cleaning staff. I am going to be working with the food service staff. I am going to be working with the pharmacy staff. I am going to be working with the lab staff. I am going to be working with the radiology staff. I am going to be working with the physical therapy staff. I am going to be working with the occupational therapy staff. I am going to be working with the speech therapy staff. I 

In [12]:
print(query)

Classify the message as SPAM or NOT SPAM. Respond in JSON format. Use the following format: {0: 'NOT SPAM', 1: 'SPAM'}. Below is the message
---

7 -> As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

8 -> WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

9 -> Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

10 -> I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

11 -> SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info

12 -> URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAI

In [28]:
query = f"""Classify the below message as SPAM or NOT SPAM:
---
{sms_messages_string}
---
Respond in JSON format. Use the following format: {id2label}.
"""

print(query)

Classify the below message as SPAM or NOT SPAM:
---
7 -> As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

8 -> WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

9 -> Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

10 -> I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

11 -> SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info

12 -> URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18

13 -> I've been searching for th

In [35]:
outcome = pipe_0(
    query,
    pad_token_id=pipe_0.tokenizer.eos_token_id,
    # max_tokens=max_tokens,
    max_new_tokens=8192,
    # eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    # top_p=top_p
)
print(outcome)

[{'generated_text': "Classify the below message as SPAM or NOT SPAM:\n---\n7 -> As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\n\n8 -> WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\n\n9 -> Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030\n\n10 -> I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.\n\n11 -> SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info\n\n12 -> URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18\

In [37]:
print(outcome[0]['generated_text'])

Classify the below message as SPAM or NOT SPAM:
---
7 -> As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

8 -> WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

9 -> Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

10 -> I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

11 -> SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info

12 -> URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18

13 -> I've been searching for th

In [15]:
# I passed the query above to DeepSeek-R1 through HuggingChat; I'm still trying to use Meta-Llama-3.1-8b
response = {
  "7": "NOT SPAM",
  "8": "SPAM",
  "9": "SPAM",
  "10": "NOT SPAM",
  "11": "SPAM",
  "12": "SPAM",
  "13": "NOT SPAM",
  "14": "NOT SPAM"
}
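
The `response` dictionary above was entered by hand. When a model's reply arrives as raw text instead, it has to be turned into a dictionary first. The following is a minimal sketch of one way to do that: `parse_llm_json` is a hypothetical helper (not part of any library), and the example reply is made up. If the generated text echoes the prompt, the prompt portion would need to be removed before parsing, because the prompt itself contains braces.

```python
import json

def parse_llm_json(raw_text):
    """Parse the text between the first '{' and the last '}' as JSON.

    Returns None when no parseable JSON object is found.
    """
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw_text[start : end + 1])
    except json.JSONDecodeError:
        return None

# Hypothetical reply text, made up for illustration
example_reply = (
    "Here is the classification:\n"
    '{"7": "NOT SPAM", "8": "SPAM"}\n'
    "Let me know if you need more."
)
parsed = parse_llm_json(example_reply)
print(parsed)
```

Returning `None` on failure keeps the caller in control: it can retry the request or skip the batch rather than crash on malformed output.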

In [16]:
# Estimate the accuracy of your classifier by comparing your responses to the labels in the dataset
def get_accuracy(response, dataset, original_indices):
    correct = 0
    total = 0

    for entry_number, prediction in response.items():
        if int(entry_number) not in original_indices:
            continue

        label_id = dataset[int(entry_number)]["label"]
        label = id2label[label_id]

        # If the prediction from the LLM matches the label in the dataset
        # we increment the number of correct predictions.
        # (Since LLMs do not always produce the same output, we use the
        # lower case version of the strings for comparison)
        if prediction.lower() == label.lower():
            correct += 1

        # increment the total number of predictions
        total += 1

    try:
        accuracy = correct / total
    except ZeroDivisionError:
        print("No matching results found!")
        return

    return round(accuracy, 2)

In [17]:
print(f"Accuracy: {get_accuracy(response, dataset, range(7, 15))}")

Accuracy: 1.0


That's not bad! (Assuming you used an LLM capable of handling this task)

Surely it won't be correct for every example we throw at it, but it's a great start, especially for not giving it any examples or training data.

On this small sample of eight messages, the model distinguishes between spam and non-spam messages correctly. This is a good example of how a foundation model can be used to build a spam email classifier.

## Step 3: Build an improved classifier?

If you provide the LLM with some examples for how to complete a task, it will sometimes improve its performance. Let's try that out here.

In [38]:
# Replace <MASK> with your code that constructs a query to send to the LLM

# Get a few labelled messages and format them as a string
sms_messages_string_w_labels = get_sms_messages_string(
    dataset, range(54, 60), include_labels=True
)

# Get a few unlabelled messages and format them as a string
sms_messages_string_no_labels = get_sms_messages_string(dataset, range(7, 15))


# Construct a query to send to the LLM including the labelled messages
# as well as the unlabelled messages. Ask it to respond in JSON format
# query = <MASK>
query = f"""Classify the message as SPAM or NOT SPAM. Respond in JSON format. Use the following format: {id2label}.
Here are some examples:

{sms_messages_string_w_labels}
---
Please classify the below
---

{sms_messages_string_no_labels}
"""

print(query)

Classify the message as SPAM or NOT SPAM. Respond in JSON format. Use the following format: {0: 'NOT SPAM', 1: 'SPAM'}.
Here are some examples:

54 (label=SPAM) -> SMS. ac Sptv: The New Jersey Devils and the Detroit Red Wings play Ice Hockey. Correct or Incorrect? End? Reply END SPTV

55 (label=NOT SPAM) -> Do you know what Mallika Sherawat did yesterday? Find out now @  &lt;URL&gt;

56 (label=SPAM) -> Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out! 

57 (label=NOT SPAM) -> Sorry, I'll call later in meeting.

58 (label=NOT SPAM) -> Tell where you reached

59 (label=NOT SPAM) -> Yes..gauti and sehwag out of odi series.


---
Please classify the below
---

7 -> As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

8 -> WINNER!! As a valued network customer you have been 

In [39]:
# Result again from DeepSeek-R1 through HuggingChat; using LLaMa 3.1 has posed issues
result_1 = {7: 'NOT SPAM',
            8: 'SPAM',
            9: 'SPAM',
            10: 'NOT SPAM',
            11: 'SPAM',
            12: 'SPAM',
            13: 'NOT SPAM',
            14: 'NOT SPAM'}

In [None]:
outcome_1 = pipe_0(
    query,
    pad_token_id=pipe_0.tokenizer.eos_token_id,
    max_new_tokens=max_tokens,
    # eos_token_id=terminators,
    do_sample=True,
    # temperature=temperature,
    # top_p=top_p
)

In [44]:
print(f"Accuracy: {get_accuracy(result_1, dataset, range(7, 15))}")

Accuracy: 1.0


# Why Benchmarks Matter

Benchmarks matter because they are the standards that help us measure and accelerate progress in AI. They offer a common ground for comparing different AI models and encouraging innovation, providing important stepping stones on the path to more advanced AI technologies.

## Technical Terms Explained:

* **Robustness**: The ability of an AI model to maintain its performance despite challenges or changes in data.

* **Open Access**: Making datasets freely available to the public, so that anyone can use them to do research and develop AI technologies.

# The GLUE Benchmarks

The GLUE benchmarks serve as an essential tool to assess an AI's grasp of human language, covering diverse tasks, from grammar checking to complex sentence relationship analysis. By putting AI models through these varied linguistic challenges, we can gauge their readiness for real-world tasks and uncover any potential weaknesses.

## Technical Terms Explained:

* **Semantic Equivalence**: When different phrases or sentences convey the same meaning or idea.

* **Textual Entailment**: The relationship between text fragments where one fragment follows logically from the other.

## GLUE Tasks/Benchmarks

|Short Name| Full Name | Description| Summary|
|:--|:--|:--|:--|
|CoLA|	Corpus of Linguistic Acceptability	|Measures the ability to determine if an English sentence is linguistically acceptable.| Grammatical acceptability |
|SST-2|	Stanford Sentiment Treebank	|Consists of sentences from movie reviews and human annotations about their sentiment.| Sentiment Analysis |
|MRPC|	Microsoft Research Paraphrase Corpus|	Focuses on identifying whether two sentences are paraphrases of each other.| Paraphrase identification |
|STS-B|	Semantic Textual Similarity Benchmark	|Involves determining how similar two sentences are in terms of semantic content.| Semantic textual similarity |
|QQP|	Quora Question Pairs|	Aims to identify whether two questions asked on Quora are semantically equivalent.| Question pairs equivalence |
|MNLI|	Multi-Genre Natural Language Inference	|Consists of sentence pairs labeled for textual entailment across multiple genres of text.| Natural language inference |
|QNLI| Question-answering Natural Language Inference| Involves determining whether the content of a paragraph contains the answer to a question.| Question answering inference |
|RTE| Recognizing Textual Entailment| Requires understanding whether one sentence entails another.| Textual entailment recognition |
|WNLI|	Winograd Natural Language Inference	|Tests a system's reading comprehension by having it determine the correct referent of a pronoun in a sentence, where understanding depends on contextual information provided by specific words or phrases.| Pronoun disambiguation |
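
Each of these tasks is scored by comparing a model's predictions with human labels. Most GLUE tasks report accuracy, while CoLA reports the Matthews correlation coefficient. The sketch below computes both from scratch for binary labels. The labels and predictions are made up for illustration and do not come from any real model.

```python
import math

def accuracy(labels, predictions):
    """Fraction of predictions that match the reference labels."""
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)

def matthews_corrcoef(labels, predictions):
    """Matthews correlation for binary 0/1 labels; ranges from -1 to 1."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional value when one class is never predicted or never present
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Made-up labels for eight sentences (1 = acceptable, 0 = not acceptable)
labels = [1, 1, 1, 1, 1, 1, 0, 0]
# A hypothetical model that always answers "acceptable"
predictions = [1, 1, 1, 1, 1, 1, 1, 1]

print(accuracy(labels, predictions))
print(matthews_corrcoef(labels, predictions))
```

The always-"acceptable" model looks reasonable on accuracy alone but scores zero on the Matthews correlation, which is one reason a benchmark may prefer that metric for imbalanced data.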

## SuperGLUE Benchmarks

SuperGLUE is designed as a successor to the original GLUE benchmark. It is a more advanced benchmark that presents more challenging language understanding tasks for AI models. Created to push the boundaries of what AI can understand and process in natural language, SuperGLUE emerged as models began to reach human parity on the GLUE benchmark. It also features a public leaderboard, which makes it possible to compare models directly and track progress over time.

|Short Name| Full Name | Description|
|:--|:--|:--|
|BoolQ| Boolean Questions| Involves answering a yes/no question based on a short passage. |
|CB| CommitmentBank| Tests three-class textual entailment (entailment, contradiction, or neutral) on short texts. |
|COPA| Choice of Plausible Alternatives| Measures causal reasoning by asking for the cause/effect of a given sentence. |
|MultiRC| Multi-Sentence Reading Comprehension| Involves answering questions about a paragraph where each question may have multiple correct answers. |
|ReCoRD| Reading Comprehension with Commonsense Reasoning| Requires selecting the correct named entity from a passage to fill in the blank of a question. |
|RTE| Recognizing Textual Entailment| Involves identifying whether one sentence entails another (entailment or not entailment). |
|WiC| Words in Context| Tests understanding of word sense disambiguation in different contexts. |
|WSC| Winograd Schema Challenge| Focuses on resolving coreference within a sentence, often requiring commonsense reasoning. |
|AX-b| Broad Coverage Diagnostic| A diagnostic set to evaluate model performance on a broad range of linguistic phenomena. |
|AX-g| Winogender Schema Diagnostics| Tests for the presence of gender bias in automated coreference resolution systems. |

## Technical Terms Explained:

* **Coreference Resolution**: Figuring out when different words or phrases in a text, such as the pronoun "she" and the phrase "the president", refer to the same person or thing.


<ins>BoolQ Examples</ins>

Let's take a look at some examples from the BoolQ dataset. Here is a table from the paper "BoolQ: Exploring the surprising difficulty of natural yes/no questions."

<img src="img/img_04.png">

# Data Used for Training LLMs

Generative AI models, and Large Language Models (LLMs) in particular, rely on a rich mix of data sources to develop their linguistic skills. These sources include web content, academic writings, literary works, and multilingual texts, among others. By training on a variety of data types, such as scientific papers, social media posts, legal documents, and even conversational dialogues, LLMs become adept at comprehending and generating language across many contexts, which improves their ability to provide relevant and accurate information.

<img src="img/img_05.png">

## Explanation of Technical Terms:

* **Preprocessing**: This is the process of preparing and cleaning data before it is used to train a machine learning model. It might involve removing errors and irrelevant information, or formatting the data in a way that the model can easily learn from.

* **Fine-tuning**: After a model has been pre-trained on a large dataset, fine-tuning is an additional training step where the model is further refined with specific data to improve its performance on a particular type of task.

# Data Scale and Volume

The scale of data used to train Large Language Models (LLMs) is vast, involving datasets that could equate to millions of books. This sheer size is pivotal to the model's mastery of language, because it exposes the model to diverse words and structures.

<img src="img/img_06.png">

Recall that the initial LLaMA model from Meta was trained on more than 4.7 TB of data. That is about 4,700 GB, or roughly 4,700 × 1,000 = 4,700,000 books' worth of information, using the rule of thumb of about 1,000 books per gigabyte.

<img src="img/img_02.png">

## Explanation of Technical Terms:

* **Gigabytes/Terabytes**: Units of digital information storage. One gigabyte (GB) is about 1 billion bytes, and one terabyte (TB) is about 1,000 gigabytes. In terms of text, a single gigabyte can hold roughly 1,000 books.

* **Common Crawl**: An open repository of web crawl data. Essentially, it is a large collection of content from the internet that is gathered by automatically scraping the web.
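
The back-of-the-envelope conversion in this section can be written out as a short calculation. This is only a sketch: the 1,000 books per gigabyte figure is the rough rule of thumb from the definition above, not a measured value, and the helper name `books_equivalent` is made up for illustration.

```python
GB_PER_TB = 1_000     # decimal units, as in the definition above
BOOKS_PER_GB = 1_000  # rough rule of thumb from the text, not a measured figure

def books_equivalent(terabytes):
    """Convert a dataset size in TB to an approximate number of books."""
    gigabytes = terabytes * GB_PER_TB
    return gigabytes * BOOKS_PER_GB

llama_tb = 4.7  # training data size quoted in this section
print(f"{books_equivalent(llama_tb):,.0f} books")
```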

# Biases in Training Data

Biases in training data deeply influence the outcomes of AI models, reflecting societal issues that require attention. Ways to approach this challenge include promoting diversity in development teams, seeking diverse data sources, and ensuring continued vigilance through bias detection and model monitoring.

<img src="img/img_07.png">

<img src="img/img_08.png">


## Technical Terms Explained:

* **Selection Bias**: When the data used to train an AI model does not accurately represent the whole population or situation because of how it was selected, e.g. those choosing the data tend to pick datasets they are already aware of.

* **Historical Bias**: Prejudices and societal inequalities of the past that are reflected in the data, influencing the AI in a way that perpetuates these outdated beliefs.

* **Confirmation Bias**: The tendency to favor information that confirms pre-existing beliefs, which can affect what data is selected for AI training.

* **Discriminatory Outcomes**: Unfair results produced by AI that disadvantage certain groups, often due to biases in the training data or malicious actors.

* **Echo Chambers**: Situations where biased AI reinforces and amplifies existing biases, leading to a narrow and distorted sphere of information.

* **Bias Detection and Correction**: Processes and algorithms designed to identify and remove biases from data before it's used to train AI models.

* **Transparency and Accountability**: Openness about how AI models are trained and the nature of their data, ensuring that developers are answerable for their AI's performance and impact.



# Exercise: Research Pre-Training Datasets

When it comes to training language models, selecting the right pre-training dataset is important. In this exercise, we will explore the options available for choosing a pre-training dataset, focusing on four key sources:

* CommonCrawl
* Github
* Wikipedia
* Gutenberg project

These sources provide a wide range of data, making them valuable resources for training language models. If you were tasked with pre-training an LLM, how would you use these datasets and how would you pre-process them? Are there other sources you would use?

In this exercise, you will construct a fictional pre-training dataset for a fictional task. The goal is to get you thinking about how to construct a pre-training dataset for your own task.

## Step 1: Evaluate the available pre-training datasets

Begin by examining the four sources mentioned in the introduction - CommonCrawl, Github, Wikipedia, and the Gutenberg project. Assess the size, quality, and relevance of the data provided by each source for training language models.

### Common Crawl

Read about [CommonCrawl on its website](https://commoncrawl.org/)
* Size: The Common Crawl corpus contains petabytes of data, regularly collected since 2008
* Data Summary: The Common Crawl corpus contains webpages in their original form, and these pages have not been filtered for things like spam

### Github

Read about the [Github dataset on its website](https://www.githubarchive.org/)
* Data Summary: The Github dataset only contains public repositories

### Wikipedia

Read about the [Wikipedia dataset on its website](https://dumps.wikimedia.org/)
* Data Summary: Wikipedia datasets are available in many formats, even DVD

### Gutenberg Project

Read about the [Gutenberg Project on its website](https://www.gutenberg.org/)
* Size: There are over 70,000 books in the Gutenberg Project dataset

## Step 2: Select the appropriate datasets

Based on the evaluation, choose the datasets that best suit the requirements of pre-training a large language model (LLM). Consider factors such as the diversity of data, domain-specific relevance, and the specific language model objectives.


For your use case, rank the datasets in order of preference. For example, if you were training a language model to generate code, you might rank the datasets as follows:

1. Github
2. Wikipedia
3. CommonCrawl
4. Gutenberg project

Explain your reasoning for the ranking. For example, you might say that GitHub is the best dataset because it contains a large amount of code, and the code is structured and clean. You might say that Wikipedia is the second-best dataset because it contains a large amount of text, including some code. You might say that CommonCrawl is the third-best dataset because it contains a large amount of text, but the text is unstructured and noisy. You might say that the Gutenberg project is the worst dataset because it contains text that is not relevant to the task.

```
1. CommonCrawl
2. Gutenberg Project
3. Wikipedia
4. Github

I am interested in building a foundation model that can converse but specializes in literature, so I want to pre-train on CommonCrawl and fine-tune on the Gutenberg Project.
```

## Step 3: Pre-process the selected datasets

Depending on the nature of the chosen datasets, pre-processing may be required. This step involves cleaning the data, removing irrelevant or noisy content, standardizing formats, and ensuring consistency across the dataset. Discuss how you would pre-process the datasets based on what you have observed.

```
I would start by examining the Common Crawl dataset to understand how messy it really is, since it is described as messy. I may do the same with Wikipedia.

For the Gutenberg data, I would ensure it is in an appropriate format for fine-tuning.

For the Github data, I am unsure how it should be pre-processed.
```
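
As one possible starting point for the cleaning described in this step, here is a minimal sketch. It strips HTML tags, decodes HTML entities, collapses whitespace, drops very short documents, and removes exact duplicates. The function names, the `min_words` threshold, and the sample pages are all made up for illustration. A real pipeline would need much more, such as language detection and quality filtering.

```python
import html
import re

def clean_document(raw_text):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_text)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()

def preprocess(documents, min_words=3):
    """Clean each document, drop very short ones, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for doc in documents:
        text = clean_document(doc)
        if len(text.split()) < min_words:
            continue  # too short to be useful
        key = text.lower()
        if key in seen:
            continue  # exact duplicate, ignoring case
        seen.add(key)
        cleaned.append(text)
    return cleaned

# Made-up raw pages for illustration
raw_pages = [
    "<p>It was the best of times,   it was the worst of times.</p>",
    "<div>It was the best of times, it was the worst of times.</div>",
    "<a href='#'>Click here</a>",
    "Tom &amp; Jerry went to the market today.",
]
for page in preprocess(raw_pages):
    print(page)
```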

## Step 4: Augment with additional sources

Consider whether there are other relevant sources that can be used to augment the selected datasets. These sources could include domain-specific corpora, specialized text collections, or other publicly available text data that aligns with your language model's objectives, such as better representation and diversity.