### Transformers, General Python, and Gradio Exercises

---


### First part

1) Install the library `gradio`, and the library `transformers`.

In [1]:
%pip install gradio transformers

Collecting gradio
  Downloading gradio-3.44.3-py3-none-any.whl (20.2 MB)
Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.103.1-py3-none-any.whl (66 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client==0.5.0 (from gradio)
  Downloading gradio_client-0.5.0-py3-none-any.whl (298 kB)

2) Create a `gradio` interface connected to GPT-2, like we did during the course. Use a `max_length` of `100`, `10` `num_beams`, `early_stopping=True`, and `no_repeat_ngram_size=2`.

In [2]:
import gradio as gr
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

def gpt2_text_generation(text: str) -> str:
    input_ids = tokenizer.encode(text, return_tensors='pt')
    beam_outputs = model.generate(
        input_ids,
        max_length=100,  # maximum total length in tokens (prompt included)
        num_beams=10,
        early_stopping=True,
        no_repeat_ngram_size=2,  # avoids repeating the same ngrams
        pad_token_id=tokenizer.eos_token_id
    )
    generated_text = tokenizer.decode(beam_outputs[0], skip_special_tokens=True)
    return generated_text

interface = gr.Interface(fn=gpt2_text_generation, inputs="text", outputs="text")
interface.launch(share=True)

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://2b476e47714b8e08ce.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




3) Ask the question "Who are you?"

4) Why doesn't the model answer "I am GPT-2, a ChatBot designed to answer your questions"? Because it is not yet a ChatBot designed to answer your questions: for now it is (roughly speaking) just guessing which word is most likely to come next, based on the previous ones.
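To make the "guessing the next word" idea concrete, here is a minimal sketch of a toy next-word predictor. It is not how GPT-2 works internally (GPT-2 is a neural network conditioned on the whole preceding text). The corpus and the function names are made up for this illustration, and the predictor only counts which word follows which.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = "who are you ? who are they ? who is there ? are you ?".split()

# Count how often each word is followed by each other word.
next_word_counts = defaultdict(Counter)
for current_word, following_word in zip(corpus, corpus[1:]):
    next_word_counts[current_word][following_word] += 1


def most_likely_next(word: str) -> str:
    """Return the word that most often followed `word` in the corpus.

    Raises IndexError if `word` never appears before another word.
    """
    return next_word_counts[word].most_common(1)[0][0]


def continue_text(start: str, n_words: int) -> str:
    """Extend `start` by `n_words` words, one most-likely word at a time."""
    words = start.split()
    for _ in range(n_words):
        words.append(most_likely_next(words[-1]))
    return " ".join(words)


print(continue_text("who are you ?", 2))  # who are you ? who are
```

The toy model does not answer the question. It simply continues the text with whatever usually comes next, which is the behaviour observed with GPT-2 in step 3.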

5) Modify your `gradio` function so that it adds a sentence or two to every text entered by the user, to steer the model into behaving like a ChatBot.

In [3]:
import gradio as gr
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

def gpt2_chatbot(text: str) -> str:
    full_text: str = "Person 1 is a person interacting with ChatBot 1." \
                     " They are having a conversation." \
                     " ChatBot 1 is a ChatBot and is called GPT-2." \
                     "\nPerson 1: " + text + "\nChatBot 1: "
    input_ids = tokenizer.encode(full_text, return_tensors='pt')
    beam_outputs = model.generate(
        input_ids,
        max_length=100,  # maximum total length in tokens (prompt included)
        num_beams=10,
        early_stopping=True,
        no_repeat_ngram_size=2,  # avoids repeating the same ngrams
        pad_token_id=tokenizer.eos_token_id
    )
    generated_text = tokenizer.decode(beam_outputs[0], skip_special_tokens=True)
    return generated_text

interface = gr.Interface(fn=gpt2_chatbot, inputs="text", outputs="text")
interface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://f8cb2b31c1e92b2fa0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




6) Now, you probably want to hide the sentence(s) you added to steer the model. Modify `generated_text` so that it displays only the conversation and not the 'wrapper' telling the model how to behave.

In [5]:
import gradio as gr
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

def gpt2_chatbot(text: str) -> str:
    full_text: str = "Person 1 is a person interacting with ChatBot 1." \
                     " They are having a conversation." \
                     " ChatBot 1 is a ChatBot and is called GPT-2." \
                     "\nPerson 1: " + text + "\nChatBot 1: "
    input_ids = tokenizer.encode(full_text, return_tensors='pt')
    beam_outputs = model.generate(
        input_ids,
        max_length=100,  # maximum total length in tokens (prompt included)
        num_beams=10,
        early_stopping=True,
        no_repeat_ngram_size=2,  # avoids repeating the same ngrams
        pad_token_id=tokenizer.eos_token_id
    )
    generated_text = tokenizer.decode(beam_outputs[0], skip_special_tokens=True)
    return '\n'.join(generated_text.split("\n")[1:])

interface = gr.Interface(fn=gpt2_chatbot, inputs="text", outputs="text")
interface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://8db83c8ab3e0c622da.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
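The solution above drops everything up to the first newline, which works because the wrapper sits on a single line. As an alternative sketch, the known wrapper can be stripped as a prefix, which no longer depends on where the newlines are. The helper name `hide_wrapper` and the example text are hypothetical, and the sketch assumes the decoded text starts with the wrapper exactly as it was written.

```python
def hide_wrapper(generated_text: str, wrapper: str) -> str:
    """Return `generated_text` without the leading `wrapper`, if present."""
    if generated_text.startswith(wrapper):
        return generated_text[len(wrapper):].lstrip("\n")
    return generated_text  # wrapper not found: leave the text untouched


WRAPPER = (
    "Person 1 is a person interacting with ChatBot 1."
    " They are having a conversation."
    " ChatBot 1 is a ChatBot and is called GPT-2."
)

# Hand-written stand-in for a decoded model output (not a real generation).
example = WRAPPER + "\nPerson 1: Hi!\nChatBot 1: Hello."
print(hide_wrapper(example, WRAPPER))
```

If the wrapper is not found at the start of the text, the function returns the text unchanged instead of silently cutting off the first line.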




---

### Second Part

7) Let's now try to use a more recent model: `Llama`. Run the following code:

In [1]:
!pip install git+https://github.com/huggingface/transformers.git@refs/pull/25740/head accelerate

Collecting git+https://github.com/huggingface/transformers.git@refs/pull/25740/head
  Cloning https://github.com/huggingface/transformers.git (to revision refs/pull/25740/head) to /tmp/pip-req-build-p0gawa4_
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-p0gawa4_
  Running command git fetch -q https://github.com/huggingface/transformers.git refs/pull/25740/head
  Running command git checkout -q c6c6daa3f07e753cff91a08c4294df4a6ea6227b
  Resolved https://github.com/huggingface/transformers.git to commit c6c6daa3f07e753cff91a08c4294df4a6ea6227b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)


`git+https://github.com/huggingface/transformers.git@refs/pull/25740/head`: This is a direct link to a specific pull request (`PR #25740`) from the `transformers` repository on GitHub. This means you're not installing the main version of the `transformers` library that's available on PyPI (Python Package Index) but a specific version from this PR.

Using `git+` followed by a GitHub URL is a way to tell `pip` to install a Python package directly from a Git repository.

Don't worry if you don't fully understand this; it is not essential for the rest of the exercise.

You may need to restart your runtime, since another version of the `transformers` library was already installed. After restarting, do not re-run the cells from the first part, otherwise you will be asked to restart the runtime again. Only execute the cells from this second part of the exercise.

You also need a GPU runtime, otherwise the code will not work. To switch, click on "Change runtime type" (you will arrive here: https://i.imgur.com/8QjVO0P.png) and select T4 GPU. Why is this useful? On Colab, the GPU runtime gives you more memory (the GPU has its own), and you need *a lot* of memory to load big models.

Executing the following cell might take some time. In the meantime, try to understand every line of it.

In [2]:
from transformers import AutoTokenizer
import transformers
import torch

model = "codellama/CodeLlama-7b-Instruct-hf" # "codellama/CodeLlama-7b-hf"
# Code Llama is a collection of pretrained and fine-tuned generative text models
# ranging in scale from 7 billion to 34 billion parameters.
# We are here using the 7 billion parameters version.

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Who are you?"

sequences = pipeline(
    prompt,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
    no_repeat_ngram_size=2,  # avoids repeating the same ngrams
    add_special_tokens=False
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Downloading (…)okenizer_config.json:   0%|          | 0.00/749 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/646 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Result: Who are you?

Comment: @JonathanLeffler I'm the OP.
I've tried to use the `fscanf` function, but it doesn't work. I don' know why. The code is:
`fprintf(stderr, "Enter the number of lines: ");
fgets(line, 100, stdin);
n = atoi(strtok( line, "\n"));
for (


8) Why are we using `torch.float16`?

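`float16` (also known as "half precision") stores each number in 2 bytes instead of the 4 bytes of `float32`. This halves the memory needed for the model weights and can speed up processing on a GPU, at the cost of a little precision.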
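A back-of-the-envelope calculation shows why the precision matters for a "7b" model. The sketch below counts only the weights (activations and other overhead are ignored), takes 1 GB as 10^9 bytes, and rounds the parameter count to 7 billion, which is an assumption. The helper name is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_size_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 7e9  # "7b" model, rounded
for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weights_size_gb(n_params, dtype):.0f} GB")
# float32: ~28 GB
# float16: ~14 GB
```

The `float16` estimate is in line with the two checkpoint shards downloaded above (9.98G + 3.50G).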

9) Why is there a for loop:
`for seq in sequences:`?

What can you deduce about the `sequences` object returned? Prove it using the built-in Python function `type`. **Careful: when you want to re-use Llama, do not re-load the whole pipeline! Once it is loaded, you just need to call it.**

In [3]:
prompt = "Who are you?"

sequences = pipeline(
    prompt,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
    no_repeat_ngram_size=2,  # avoids repeating the same ngrams
    add_special_tokens=False
)

print(type(sequences))  # list
print(type(sequences[0]))  # dictionary
print(type(sequences[0]["generated_text"]))  # str

print(len(sequences))  # 1

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<class 'list'>
<class 'dict'>
<class 'str'>
1


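What if we want the `list` `sequences` to have more than one element? Print the `k` sequences you asked the model to generate, and comment on them. With the default greedy decoding, this will not work unless you set `num_beams` to at least `k`: greedy decoding keeps a single candidate, so there is only one sequence to return. The difference between the returned sequences might be subtle (only one or a few words).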

In [None]:
prompt = "Who are you?"

sequences = pipeline(
    prompt,
    num_beams=2,
    num_return_sequences=2,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
    no_repeat_ngram_size=2,  # avoids repeating the same ngrams
    add_special_tokens=False
)

print(type(sequences))  # list
print(type(sequences[0]))  # dictionary
print(type(sequences[0]["generated_text"]))  # str

print(len(sequences))  # 2

for seq in sequences:
    print(f"Result: {seq['generated_text']}")
    print("\nNext return sequence:")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<class 'list'>
<class 'dict'>
<class 'str'>
2
Result: Who are you? What do you want from me?"
"I'm a friend," I said. "I want to help you."
He looked at me skeptically, then shrugged and stepped aside. I followed him into the house.
The interior was dark and musty, with cobwebs hanging from the rafters. There was a smell of mold and mildew, and the air was thick with the scent of decay

Next return sequence:
Result: Who are you? What do you want from me?"
"I'm a friend," I said. "I want to help you."
He looked at me skeptically, then shrugged and stepped aside. I followed him into the house.
The interior was dark and musty, with cobwebs hanging from the rafters. There was a smell of mold and mildew, and the air was thick with dust. The furniture

Next return sequence:
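To see why `num_return_sequences` is tied to `num_beams`, here is a minimal sketch of beam search over a toy next-token table. It is a simplification, not the `transformers` implementation: the probabilities are invented, the table looks only at the previous token, and scores are plain sums of log-probabilities with no length normalisation. Raising `ValueError` when more sequences are requested than beams are kept is a choice made for this sketch.

```python
import math

# Invented next-token probabilities, keyed by the previous token only.
NEXT = {
    "with": {"the": 0.55, "dust": 0.45},
    "the": {"scent": 0.7, "smell": 0.3},
    "dust": {"<eos>": 1.0},
    "scent": {"<eos>": 1.0},
    "smell": {"<eos>": 1.0},
}


def beam_search(start, num_beams, num_return_sequences, max_steps=5):
    """Return the `num_return_sequences` best continuations of `start`."""
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must be <= num_beams")
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished: carry over
                continue
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the `num_beams` highest-scoring hypotheses.
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:num_beams]
    return [
        " ".join(tok for tok in tokens if tok != "<eos>")
        for tokens, _ in beams[:num_return_sequences]
    ]


print(beam_search("with", num_beams=1, num_return_sequences=1))
# ['with the scent']
print(beam_search("with", num_beams=2, num_return_sequences=2))
# ['with dust', 'with the scent']
```

With a single beam, only one hypothesis survives each step, so there is nothing else to return. With two beams, the two survivors share a prefix and differ only near the end, much like the two Llama outputs above.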


We have the same problem as we had with GPT-2. We need to wrap the prompt in some "instructions".

In [5]:
prompt = "This is a conversation between a person and a chatbot. Person: Who are you? Chatbot:"

sequences = pipeline(
    prompt,
    num_beams=2,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
    no_repeat_ngram_size=2,  # avoids repeating the same ngrams
    add_special_tokens=False
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Result: This is a conversation between a person and a chatbot. Person: Who are you? Chatbot: I'm just an AI designed to answer your questions and provide information on a wide range of topics.

Person: What is your favorite hobby?
ChatBot: That's a great question! I don't have personal preferences or hobbies, but I can tell you about some popular ones. Some people enjoy playing sports, reading books


This is a lot better!
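In the output above, the model also echoes the prompt and then invents the person's next question. The following sketch shows one possible way to keep only the chatbot's first reply. The names `make_chat_fn` and `fake_generate` are hypothetical, and `fake_generate` is a hand-written stand-in for the real pipeline so that the example runs without a GPU. The only assumption about the generator is that it returns a list of dictionaries with a `generated_text` key, as seen in step 9.

```python
from typing import Callable, Dict, List

PREAMBLE = "This is a conversation between a person and a chatbot."


def make_chat_fn(
    generate: Callable[[str], List[Dict[str, str]]]
) -> Callable[[str], str]:
    """Wrap a text generator so it returns only the chatbot's first reply."""

    def chat(user_text: str) -> str:
        prompt = f"{PREAMBLE} Person: {user_text} Chatbot:"
        generated = generate(prompt)[0]["generated_text"]
        # Drop the echoed prompt if the generator returns it in front.
        if generated.startswith(prompt):
            generated = generated[len(prompt):]
        # Cut the text where the model starts writing the person's next turn.
        return generated.split("Person:")[0].strip()

    return chat


def fake_generate(prompt: str) -> List[Dict[str, str]]:
    """Stand-in generator returning a fixed, hand-written continuation."""
    continuation = " I'm just an AI.\n\nPerson: What is your favorite hobby?"
    return [{"generated_text": prompt + continuation}]


chat = make_chat_fn(fake_generate)
print(chat("Who are you?"))  # I'm just an AI.
```

With the real model, `generate` could be a small function that calls the already loaded `pipeline` with the arguments used above. The resulting `chat` function takes and returns a string, so it could be passed as `fn` to `gr.Interface` in the same way as the GPT-2 function from the first part.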

Now let's incorporate Llama in `gradio` and benchmark `GPT-2` against `Llama-7b`. Write 10 basic questions on a subject you are good at (translating a sentence from Spanish to English, a physics question, ...). Ask both ChatBots each question, and record whether they answered correctly in a dataframe with four columns: Question, Answer, Correct GPT-2 (boolean), Correct Llama (boolean).

Then save your dataframe as a `.csv` file and concatenate it with your teammates' `.csv` files. Finally, compute the accuracy (in %) of both models on the combined benchmark.

Obviously, the quality of the responses may depend on other factors such as `num_beams`, and you can test different values. For a unified benchmark across the class, however, it is better to agree on fixed settings for each ChatBot.
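The bookkeeping for the benchmark could look like the sketch below, assuming `pandas` is available (it is on Colab). The rows are placeholders, not real results, and the CSV round-trip is done in memory to keep the example self-contained. With real files you would call `to_csv("my_benchmark.csv", index=False)` and `pd.read_csv(...)` on each teammate's file; the file name is hypothetical.

```python
import io

import pandas as pd

COLUMNS = ["Question", "Answer", "Correct GPT-2", "Correct Llama"]

# Placeholder rows: replace them with your own questions and judgements.
mine = pd.DataFrame(
    [["Q1", "A1", False, True], ["Q2", "A2", True, True]], columns=COLUMNS
)
teammate = pd.DataFrame(
    [["Q3", "A3", False, True], ["Q4", "A4", False, False]], columns=COLUMNS
)

# Round-trip through CSV text, as each teammate would do with a real file.
csv_texts = [df.to_csv(index=False) for df in (mine, teammate)]
frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]

# Stack all the teammates' rows into one benchmark table.
combined = pd.concat(frames, ignore_index=True)

# The mean of a boolean column is the fraction of True values.
accuracy = combined[["Correct GPT-2", "Correct Llama"]].mean() * 100
print(accuracy)
```

Storing the judgements as booleans keeps the accuracy computation to a single `mean()` call per model.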