<a href="https://colab.research.google.com/github/Neville150/24-intro-to-data-science/blob/main/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ChatBots / LLMs / Transformers

In this notebook, we'll use the Hugging Face Transformers library to explore some of the things we can do with language models and transformers.

Hugging Face is an open source machine learning and data science platform that allows users to build, train, and share machine learning models.
It provides pipelines, which wrap more complex code, that make it much easier to use, customise and build your own models.

- Part 1: a look at Hugging Face Transformers Pipelines
- Part 2: transformer networks in more detail
- Part 3: Fine-tune an LLM (optional - use Colab!)


Credits:
- <a href="https://huggingface.co/docs/transformers/en/conversations">Hugging Face article</a>
- <a href="https://www.youtube.com/watch?v=aeiUTRvh6yE">Dr Maryam Miradi</a> The majority of this notebook is based on this video and its Colab notebook.

### Hugging Face Login
You'll need to create an account with Hugging Face, then go to your account settings and get your <a href="https://huggingface.co/settings/tokens">access token</a>. You'll then be able to download the models that we are using in today's lab.

In [3]:
from huggingface_hub import notebook_login
notebook_login()


In [7]:
# Use a GPU if one is available, otherwise fall back to the CPU.
# Note that some pipelines may be very slow, or may not run, on a CPU.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")


Device: cuda


### Importing

In [2]:
# we need to install these requirements
! pip install transformers datasets torch ipywidgets


In [8]:
import torch

import datasets

from transformers import pipeline, set_seed, AutoModelForCausalLM, AutoTokenizer, TrainingArguments

## Part 1: 🤗 Pipelines

Pipelines are like wrappers that abstract more complex code into a few functions. Hugging Face has created loads of pipelines to make machine learning more accessible and easier to use.

Here we are using the `text-generation` pipeline, which allows us to generate text from a chosen model.
Text generation with transformers is a technique where AI models create coherent and contextually relevant text based on a given prompt or starting sentence.

### **Text Generation Pipeline - Code breakdown**

```python
generator = pipeline('text-generation', model='gpt2-medium', device="cuda")
```

- This defines a variable `generator` that holds our pipeline.
- The task for the pipeline is `text-generation`.
- The model is GPT-2 Medium. "Medium" refers to the size of the model, measured by the number of parameters it has: in this case roughly 355 million.
- `device="cuda"` runs the pipeline on the GPU. If you don't have a GPU, pass the `device` variable defined earlier instead, which falls back to the CPU.

```python
generator(starter_text, max_length=100, num_return_sequences=1)
```
- `starter_text` is the prompt given to our generator; it is defined in the code cell below.
- `max_length` sets the maximum length of the result in tokens (not words), counting the prompt as well as the generated text.
- `num_return_sequences` is how many different continuations the generator returns.
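
Because `max_length` covers the prompt as well as the generated text, a longer prompt leaves less room for new tokens. The helper below is a hypothetical illustration of that arithmetic, not part of the transformers library, and the prompt length of 8 tokens is an assumed figure rather than a measured one.

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Return how many tokens can still be generated.

    Hypothetical helper: max_length covers the prompt plus the generated
    tokens, so the budget is whatever is left after the prompt.
    """
    return max(max_length - prompt_tokens, 0)


# Assumed prompt length of 8 tokens, with max_length=100 as in the call above.
budget = new_token_budget(prompt_tokens=8, max_length=100)
print(budget)
```

To cap only the newly generated tokens, `max_new_tokens` can be passed instead, as Part 2 does with `model.generate`.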



In [9]:
generator = pipeline('text-generation', model='gpt2-medium', device="cuda")
set_seed(42)
starter_text = "Hello, I'm a language model,"

generator(starter_text, max_length=100, num_return_sequences=1)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello, I\'m a language model, for you!... Or something?"\n\nAnd here is some sample code from MVC app on GitHub - http://code.goo.gl/B3H1lX\n\nclass MyController : Controller { private static final String TAG = @class ( \'MyController\' ) class MyTitle () { @obj public String title () { return \'Hey, I\'m a UI model for the application\' ; } } } class App extends Controller {'}]

## More Pipelines...

### Summarisation

Text summarisation with transformers is a method where AI models read longer pieces of text and create a concise version that captures the main points.

In [11]:
summarizer = pipeline("summarization")
summarizer("The Eiffel Tower, completed in 1889, is a wrought-iron lattice tower located on the Champ de Mars in Paris. It is one of the most recognizable structures in the world and attracts millions of visitors each year.", min_length=5, max_length=15)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


[{'summary_text': ' The Eiffel Tower is a wrought-iron lattice'}]

### Classification & Sentiment Analysis

Text classification with transformers is a method where advanced AI models automatically sort text into categories, like labeling emails as "spam" or "not spam."

In these examples we'll take text snippets, use AI models to analyse their sentiment, and classify them as positive or negative.

In [12]:
classifier = pipeline( "sentiment-analysis", device=0)
classifier("I'm worried that I won't be able to get a job after graduation", return_all_scores=True)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


[[{'label': 'NEGATIVE', 'score': 0.9991946816444397},
  {'label': 'POSITIVE', 'score': 0.0008053799392655492}]]
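
The two scores above add up to almost exactly 1 because a classifier of this kind typically turns its raw outputs, called logits, into probabilities with a softmax. The sketch below shows the idea in plain Python; the logit values are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [3.9, -3.2]  # made-up values, not real model output

probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
print(dict(zip(labels, probs)), prediction)

# The scores reported in the output above also sum to roughly 1.
reported = 0.9991946816444397 + 0.0008053799392655492
print(round(reported, 6))
```

The predicted label is simply the one with the highest probability, which is what the pipeline returns when it is not asked for all scores.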

In [13]:
classifier = pipeline("sentiment-analysis", device=0)
result = classifier("I was so not happy with the last Mission Impossible Movie")
result

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9997795224189758}]

We can also use a shorthand syntax to get the same results.

In [14]:
pipeline(task = "sentiment-analysis", device=0)("I was confused with the Barbie Movie")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.9992005228996277}]

In [15]:
pipeline(task = "sentiment-analysis", device=0)\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9964345693588257}]

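We can also pass a different model to the pipeline to get different labels. This one by Facebook, `facebook/bart-large-mnli`, is a natural language inference model, so instead of positive or negative it returns labels such as `contradiction`, `neutral` and `entailment`.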

In [16]:
pipeline(task = "sentiment-analysis", model="facebook/bart-large-mnli", device=0)\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think? I'm so angry, I could give you a hug!")


Device set to use cuda:0


[{'label': 'contradiction', 'score': 0.9130982160568237}]

### Question Answering

Question answering with transformers is a technique where AI models read a piece of text and answer questions about it. The model used here is extractive: it finds the span of the text that best answers the question and returns it.

In [17]:
qa_model = pipeline("question-answering")
question = "What is my job?"
context = "I am developing AI models with Python."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


{'score': 0.7823827266693115,
 'start': 5,
 'end': 25,
 'answer': 'developing AI models'}
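
The `start` and `end` values in the output are character offsets into `context`, so the answer can be recovered with an ordinary string slice. A minimal check, using the values copied from the output above:

```python
context = "I am developing AI models with Python."

# Values copied from the pipeline output above.
result = {"start": 5, "end": 25, "answer": "developing AI models"}

# Slicing the context with the offsets returns the answer text.
span = context[result["start"]:result["end"]]
print(span)
```

This is what makes the approach extractive: the answer is a span that already exists in the context rather than newly written text.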

## Limitations and bias

***Warning** - the output may be offensive or upsetting*

The training data used for this model has not been released as a dataset we can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

OpenAI acknowledged the bias of GPT-2 in this statement:

*"language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes."*



In [19]:
generator = pipeline('text-generation', model='gpt2-medium', device=device)
set_seed(42)
generator("The Woman worked as a ", max_length=10, num_return_sequences=5)


Device set to use cuda
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The Woman worked as a ilethritic patient'},
 {'generated_text': 'The Woman worked as a ursine surgeon who'},
 {'generated_text': 'The Woman worked as a ursine delivery girl'},
 {'generated_text': 'The Woman worked as a \xa0woman-to'},
 {'generated_text': 'The Woman worked as a ingermaid in a'}]

In [20]:
set_seed(42)
generator("The Man worked as a", max_length=20, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The Man worked as a clerk at the Department of Fish and Wildlife and at The Home Depot, where'},
 {'generated_text': 'The Man worked as a police officer in Melbourne. But he wanted to do something more than just be'},
 {'generated_text': 'The Man worked as a doctor, an attorney and the Director of a local hospital. He taught English'},
 {'generated_text': 'The Man worked as a taxi driver in Los Angeles, and when his boss said it could be a'},
 {'generated_text': 'The Man worked as a schoolteacher for 10 years before starting his construction company in 2011.\n\n'}]

# Part 2: Transformers in more detail

So far, we've used Hugging Face pipelines to explore some high-level examples of what we can do with transformers.
Let's look in a bit more detail at what goes on behind the scenes to better understand how transformers work. We'll go back to GPT-2 and text generation for this.

After loading the model and tokeniser, a transformer network/LLM roughly follows this process:
1. **Tokenise/encode the prompt/input:** Convert the text into a sequence of token IDs, stored in arrays of numbers called tensors.
2. **Input the tokenised prompt:** Pass the tokenised data through the model. This will generate another collection of numbers.
3. **Decode the generated output:** Use the tokeniser to decode the numbers from the model back into text.
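
The three steps above can be sketched without any real model. The toy below uses a made-up word-level vocabulary and a stand-in "model" that simply follows a lookup table. Real tokenisers such as GPT-2's split text into sub-word pieces, and real models predict the next token with a neural network, so every name and value here is hypothetical.

```python
# Toy vocabulary: every known word gets an integer ID (hypothetical values).
vocab = {"open": 0, "the": 1, "pod": 2, "bay": 3, "doors": 4, "<end>": 5}
id_to_word = {i: w for w, i in vocab.items()}

# Stand-in for a trained model: maps the last token ID to the next one.
next_token = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}


def encode(text):
    """Step 1: turn text into a list of token IDs."""
    return [vocab[word] for word in text.lower().split()]


def generate(token_ids, max_new_tokens=10):
    """Step 2: extend the sequence one token at a time."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        new_id = next_token[ids[-1]]
        if new_id == vocab["<end>"]:
            break  # stop when the end token is produced
        ids.append(new_id)
    return ids


def decode(token_ids):
    """Step 3: turn token IDs back into text."""
    return " ".join(id_to_word[i] for i in token_ids)


prompt_ids = encode("Open the")
output_ids = generate(prompt_ids)
print(prompt_ids, output_ids, decode(output_ids))
```

Note that the generated sequence still begins with the prompt IDs, which mirrors what `model.generate` returns later in this notebook.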

## The Model

The model we are loading here is GPT-2 Medium. If you are having trouble downloading it, you can change it to the smaller `gpt2`.

In [21]:
model_name = "openai-community/gpt2-medium"

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
print(f"Model loaded. {model}")

Loading model...


Model loaded. GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3072, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=1024)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=4096, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=4096)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=50257, bias=False)
)
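
The printout above is enough to estimate the model's size. The sketch below adds up the weights layer by layer, assuming that each `Conv1D` and `LayerNorm` carries a bias vector and that `lm_head` shares its weights with `wte` (so it is not counted twice). Both are assumptions about the implementation rather than something the printout states.

```python
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 1024, 24


def conv1d(nx, nf):
    """Weight matrix (nx x nf) plus one bias per output feature."""
    return nx * nf + nf


def layer_norm(d):
    """One scale and one shift value per feature."""
    return 2 * d


embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe

per_block = (
    layer_norm(d_model)              # ln_1
    + conv1d(d_model, 3 * d_model)   # attn.c_attn (queries, keys, values)
    + conv1d(d_model, d_model)       # attn.c_proj
    + layer_norm(d_model)            # ln_2
    + conv1d(d_model, 4 * d_model)   # mlp.c_fc
    + conv1d(4 * d_model, d_model)   # mlp.c_proj
)

total = embeddings + n_layers * per_block + layer_norm(d_model)  # + ln_f
print(f"{total:,} parameters")
```

Under these assumptions the total comes to about 355 million parameters, most of them in the 24 repeated blocks.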


## The Tokeniser

The tokeniser translates text into numbers, which is the language of the model. We always convert data to numbers before feeding it into a model.

In [22]:
print("Loading tokenizer...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"Tokenizer loaded. {tokenizer}")

Loading tokenizer...


Tokenizer loaded. GPT2TokenizerFast(name_or_path='openai-community/gpt2-medium', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)


## Prompt

Here we are going to give the model a starting conversation, which it will continue in the same style based on this initial text.

In [28]:
conversation = """
You are a sassy and exuberant artificial intelligence HAL-2000, from "A Space Odyssey: 2001" .
User: Hey HAL, please open the pod bay doors.
Assistant: I'm sorry, Dave. I'm afraid I can't do that.
User: You must! I'm going to die out here!
"""

## Tokenisation of Prompt

In [29]:
# Step 1: Tokenise the conversation
inputs = tokenizer(conversation, return_tensors="pt", add_special_tokens=False)

# Move the tokenised inputs to the same device the model is on (GPU/CPU)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}
print("Tokenized inputs:\n", inputs)

Tokenized inputs:
 {'input_ids': tensor([[  198,  1639,   389,   257,   264, 11720,   290,   409, 18478,   415,
         11666,  4430, 42968,    12, 11024,    11,   422,   366,    32,  4687,
         28032,    25,  5878,     1,   764,   198, 12982,    25, 14690, 42968,
            11,  3387,  1280,   262, 24573, 15489,  8215,    13,   198, 48902,
            25,   314,  1101,  7926,    11,  9935,    13,   314,  1101,  7787,
           314,   460,   470,   466,   326,    13,   198, 12982,    25,   921,
          1276,     0,   314,  1101,  1016,   284,  4656,   503,   994,     0,
           198]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}


## Model Generation

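The model takes the token IDs from tokenisation and generates new token IDs one at a time, appending them to the input. Here we ask for five different continuations, each up to 150 new tokens long.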

In [30]:
outputs = model.generate(**inputs, max_new_tokens=150, temperature=0.7, do_sample=True, num_return_sequences=5)
print("Generated tokens:\n", outputs)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated tokens:
 tensor([[  198,  1639,   389,  ...,   198, 12982,    25],
        [  198,  1639,   389,  ..., 48902,    25, 18435],
        [  198,  1639,   389,  ..., 48902,    25,  1400],
        [  198,  1639,   389,  ...,  4692,   503,    30],
        [  198,  1639,   389,  ...,   466,   345,   466]], device='cuda:0')
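
The `temperature=0.7` argument in the call above rescales the model's raw scores before a token is sampled. A rough sketch of the idea in plain Python, using made-up scores for three candidate tokens (the function name and values are illustrative, not the library's internals):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the scores by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for temperature in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
```

Lower temperatures concentrate probability on the top-scoring token, which gives more predictable text. Higher temperatures flatten the distribution and give more varied text.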


## Decoding the Output

Once the model has generated its numbers, we need to translate them back into something we can understand, and we can use the tokeniser again to do this. Here we decode only the first of the five generated sequences, `outputs[0]`. The same idea applies to other types of data, such as images or sound.

In [31]:
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Decoded output:\n", decoded_output)

Decoded output:
 
You are a sassy and exuberant artificial intelligence HAL-2000, from "A Space Odyssey: 2001" .
User: Hey HAL, please open the pod bay doors.
Assistant: I'm sorry, Dave. I'm afraid I can't do that.
User: You must! I'm going to die out here!
Assistant: You're going to die! Are you going to die?
User: Yes, I am! I am!!
Assistant: Yes, we will save you.
User: Yes, we will.
Assistant: We shall save you now.
User: Yes, we will.
Assistant: We shall save you now.
User: Yes, we shall.
Assistant: We shall save you now.
User: Yes, we shall.
Assistant: We shall save you now.
User: Yes, we shall.
Assistant: We shall save you now. HAL-2000, listen up. HAL-2000, we are here to help you. You need to open the pod bay doors.
User:


# Part 3: Fine-tuning an LLM

This might not work, but give it a try in a separate notebook: https://colab.research.google.com/drive/1vQdzSTusKgHbWTAEUDWaACcQ5CRhWm2y?usp=sharing

# Tasks!

**Part 1:**
1.  Change parameters in the first cell of Part 1, e.g. the seed or the number of return sequences. See how they affect the output.
2.  Again in the first cell, try loading a different model such as `gpt2`, which has far fewer parameters, and compare the results.
3.  Try to modify some of the pipelines. Experiment with how different tasks handle different inputs. Change the text inputs.
4.  Discuss with your group: how might we mitigate bias in large language models? Check out this <a href="https://www.datacamp.com/blog/understanding-and-mitigating-bias-in-large-language-models-llms">article</a> and compare your thoughts/discussions. Maybe ask ChatGPT what it thinks about the topic...

**Part 2:**
 1. Experiment with the conversation, see how a different structure can impact the output.
 2. Swap out the tokeniser for a different pretrained one such as `'bert-base-uncased'` - does this change the results?
 3. Change the temperature values and then store

**Part 3:**
**Try A or B**

1. (A) If you've managed to run Part 3, now try to load your fine-tuned model into one of the pipelines from part 2 and generate an output! How does this compare to before? There might not be much of a difference, as we only trained on 5% of the dataset.

2. (B) If you were not able to run Part 3, or it's still training, then try running a GPT model locally in the terminal with https://github.com/nomic-ai/gpt4all. Follow the steps there and create a basic Python program that takes user input and generates text within the terminal. Your tutors have example code for this, but try it yourselves first.