# Class 9: Transformers and LLMs

In this class we will explore transformers and large language models. We will use Hugging Face to load a few models and run them on specific tasks. Then, we will fine-tune a pretrained model. Finally, we will look at LLM APIs (Google Gemini and OpenAI GPT-4).

In [None]:
! pip install datasets
! pip install transformers[torch]
! pip install accelerate -U
! pip install -U transformers
! pip install evaluate
! pip install -q -U google-generativeai

In the first part, we will follow the Hugging Face (🤗, for friends) introduction to transformers.

For those interested, the 🤗 NLP course also contains an introduction to transformers, which is a bit more detailed than what we discussed in class. You can find it here: [Hugging Face NLP Course - Introduction to Transformers](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

Note that we are using Google Colab, where the necessary packages are either pre-installed or relatively easy to install. While it's possible to run this code locally, installing these packages on your system might not be straightforward and could require several attempts.

The first class we consider is `pipeline`. In this first part of the notebook we will try a few pipelines. You can find the complete list of pipelines available from Hugging Face here: https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/pipelines

In [None]:
from transformers import pipeline

### Sentiment Analysis pipeline

In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

The guide on pipelines also demonstrates how to adapt this code if you need to run the pipeline on a dataset.

Additionally, take note of the warning that indicates we haven't specified which model to use for the task. The pipeline allows us to do this. For example, if we wanted to use a model trained on financial data, we could specify [https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis](https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis).

How do you find new models? Search on 🤗!

In [None]:
classifier2 = pipeline("sentiment-analysis", model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
classifier2("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'neutral', 'score': 0.9998385906219482}]

In [None]:
classifier("Operating profit from the oncology division totaled EUR 9.4 mn , up from EUR 8.7 mn in 2004 .")

[{'label': 'NEGATIVE', 'score': 0.9911375045776367}]

In [None]:
classifier2("Operating profit from the oncology division totaled EUR 9.4 mn , up from EUR 8.7 mn in 2004 .")

[{'label': 'positive', 'score': 0.9997376799583435}]

Notice that sentiment depends on the domain: the same term can carry a different sentiment in different contexts. With a general-purpose sentiment classifier, the presence of "oncology" may cause a sentence to be interpreted as negative, whereas the model fine-tuned on financial news labels the same sentence, which reports rising operating profit, as positive.


A pipeline, like the ones we have used above, performs several operations under the hood. It downloads the model, preprocesses the text with tokenizers (ensuring the input matches the model's required format), passes the inputs through the model, and postprocesses the output into a format that's easily interpretable. For more details, see [this page](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt).
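
To make those stages concrete, here is a minimal sketch in plain Python that chains preprocessing, a forward pass, and postprocessing. Everything in it is a hypothetical stand-in: the toy vocabulary, the `toy_model` scoring rule, and the function names are ours, not the Hugging Face implementation. Only the overall structure mirrors the description above.

```python
import math

# Hypothetical toy vocabulary: word -> (token id, sentiment weight).
VOCAB = {"great": (1, 2.0), "love": (2, 1.5), "terrible": (3, -2.0), "boring": (4, -1.0)}
UNK_ID = 0
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Lowercase, strip basic punctuation, and map each word to a token id."""
    words = text.lower().replace(".", " ").replace("!", " ").split()
    return [VOCAB.get(w, (UNK_ID, 0.0))[0] for w in words]


def toy_model(token_ids):
    """Stand-in for the forward pass: returns one logit per label."""
    weights = {tid: w for tid, w in VOCAB.values()}
    score = sum(weights.get(tid, 0.0) for tid in token_ids)
    return [-score, score]  # [NEGATIVE logit, POSITIVE logit]


def postprocess(logits):
    """Softmax the logits and report the most likely label with its score."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a real pipeline does under the hood."""
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("I love this great course!")
print(result)
```

The returned dictionary has the same `label`/`score` shape as the pipeline outputs shown earlier, which is the job of the postprocessing stage.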


### Sequence Classification

In the next example we will load another model, ```bert-base-cased-finetuned-mrpc```, which has been fine-tuned on the Microsoft Research Paraphrase Corpus. This model lets us determine whether one sentence paraphrases another.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model-specific special tokens (e.g. [CLS] and [SEP]) to
# the sequence pair, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%


### Zero-shot classification

Zero-shot classification refers to the ability of a model to accurately classify data into categories it has never seen before during training. It leverages the model's understanding of language and context to make inferences about new or unseen categories based on its pre-existing knowledge. This approach is particularly useful in scenarios where labeled data is scarce or when it's impractical to retrain models for new categories. Essentially, zero-shot classification models use natural language understanding to generalize from seen to unseen categories without direct examples.

In [None]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938650727272034, 0.003273802110925317, 0.002861041808500886]}

Notice that since `facebook/bart-large-mnli` was trained on the MultiNLI dataset ([https://huggingface.co/datasets/nyu-mll/multi_nli](https://huggingface.co/datasets/nyu-mll/multi_nli)), it is potentially useful for performing Natural Language Inference (NLI). Given a premise sentence and a hypothesis, NLI consists of determining whether the premise entails the hypothesis, contradicts it, or neither.
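
The link between NLI and zero-shot classification can be sketched in plain Python. To our understanding, the zero-shot pipeline turns each candidate label into a hypothesis sentence, asks the NLI model how strongly the input entails it, and normalizes the entailment scores across labels. In the sketch below the template string is what we believe to be the pipeline's default `hypothesis_template`, and the logits are made up for illustration; both are assumptions, and no model is actually called.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by descending score."""
    m = max(entailment_logits)
    exps = [math.exp(x - m) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidate_labels = ["travel", "cooking", "dancing"]
hypotheses = build_hypotheses(candidate_labels)

# Hypothetical entailment logits, one per (sequence, hypothesis) pair.
made_up_logits = [4.0, -1.5, -1.0]
ranking = rank_labels(candidate_labels, made_up_logits)

print(hypotheses)
print(ranking)
```

The labels come back sorted by score, which is why the `labels` list in the pipeline output can be in a different order from `candidate_labels`.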


In [None]:
sequence_to_classify = "Premise: The company HuggingFace is based in New York City. Hypothesis: HuggingFace is based in Ohio."
candidate_labels = ['Contradicts', 'Entailment', 'Neutral']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'Premise: The company HuggingFace is based in New York City. Hypothesis: HuggingFace is based in Ohio.',
 'labels': ['Contradicts', 'Entailment', 'Neutral'],
 'scores': [0.8365421295166016, 0.1356203854084015, 0.027837563306093216]}

In [None]:
sequence_to_classify = "Premise: The company HuggingFace is based in New York City. Hypothesis: HuggingFace's headquarters are situated in Manhattan."
candidate_labels = ['Contradicts', 'Entailment', 'Neutral']
classifier(sequence_to_classify, candidate_labels)

{'sequence': "Premise: The company HuggingFace is based in New York City. Hypothesis: HuggingFace's headquarters are situated in Manhattan.",
 'labels': ['Entailment', 'Contradicts', 'Neutral'],
 'scores': [0.5122519731521606, 0.37862467765808105, 0.10912329703569412]}

In [None]:
sequence_to_classify = "Premise: The company HuggingFace is based in New York City. Hypothesis: HuggingFace's headquarters are not situated in Manhattan."
candidate_labels = ['Contradicts', 'Entailment', 'Neutral']
classifier(sequence_to_classify, candidate_labels)

{'sequence': "Premise: The company HuggingFace is based in New York City. Hypothesis: HuggingFace's headquarters are not situated in Manhattan.",
 'labels': ['Contradicts', 'Entailment', 'Neutral'],
 'scores': [0.8089037537574768, 0.1572469025850296, 0.033849332481622696]}

### Question Answering

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="What are they discussing in this text?",
    context="In questioning the use of hydraulic fracturing in New York to help produce natural gas, you do not note that the technology has been employed and continuously improved for more than 50 years and that studies by the Environmental Protection Agency and the Ground Water Protection Council have not identified a single instance of groundwater contamination. Wells where fracturing is used are specially constructed to protect drinking water sources. Regulatory oversight is extensive. The fluids mostly water that are forced into a well to create pressure to fracture rock are pushed back out by the oil and gas flowing upward for safe processing. Protecting our water supplies is important, as are reductions in greenhouse gas emissions through use of clean-burning natural gas. Banning hydraulic fracturing would be unwarranted and shortsighted, preventing production of large amounts of natural gas that could directly benefit New York consumers for decades and create thousands of good jobs.",
)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.14289604127407074,
 'start': 19,
 'end': 86,
 'answer': 'use of hydraulic fracturing in New York to help produce natural gas'}

## Fine-tuning

We have seen how to use pretrained models directly for specific tasks. When labeled data is available, we can also fine-tune a model, that is, adjust its parameters slightly to improve performance on the specific task we are interested in.

Here we will follow the [main tutorial](https://huggingface.co/docs/transformers/training#train-with-pytorch-trainer) and fine-tune a BERT model to classify reviews from Yelp.

In [None]:
import datasets
import pandas as pd

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})


In [None]:
df = pd.DataFrame(dataset["train"])

# Now you can use DataFrame methods like .head()
print(df.head())


   label                                               text
0      4  dr. goldberg offers everything i look for in a...
1      1  Unfortunately, the frustration of being Dr. Go...
2      3  Been going to Dr. Goldberg for over 10 years. ...
3      3  Got a letter in the mail last week that said D...
4      0  I don't know what Dr. Goldberg was like before...


In [None]:
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [None]:
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(1000))

As we will use a BERT model, we need to prepare the data accordingly: these models do not treat full words as tokens but rely on subword tokenization instead. Here we will load the tokenizer from the ```transformers``` library.
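
To give an intuition for subword tokenization, here is a minimal sketch of a greedy longest-match-first split in the style of WordPiece, the scheme used by BERT. The tiny vocabulary is hypothetical, and the real tokenizer also handles casing, punctuation, and special tokens, so treat this as an illustration rather than the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"token", "##ization", "##ize", "##r", "hug", "##ging"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("hugging", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Rare words are broken into pieces the model has seen often, which keeps the vocabulary small while still covering most text.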

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


small_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
small_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)
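
The `padding="max_length"` and `truncation=True` arguments make every example the same length. The sketch below shows the idea in plain Python with made-up token ids. The helper name `pad_or_truncate` and the choice of `0` as the padding id are assumptions for illustration, and the sketch ignores how the real tokenizer treats special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)           # how many padding slots are needed
    input_ids = kept + [pad_id] * n_pad      # padding: fill on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token ids: one short sequence and one that is too long.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.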



In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
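
To see what `compute_metrics` calculates, here is the same logic in plain Python on a hand-made example: take the index of the largest logit in each row as the prediction, then count how often it matches the label. The logits and labels are invented for illustration, and `accuracy_from_logits` is our own helper name.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Invented logits for four reviews over five classes, with their true labels.
toy_logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],
    [1.5, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.3, 3.0],
    [0.0, 0.0, 0.9, 0.1, 0.2],
]
toy_labels = [1, 0, 3, 2]

metrics = accuracy_from_logits(toy_logits, toy_labels)
print(metrics)
```

Three of the four predictions match here; the third row predicts class 4 while the true label is 3.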

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)



In [None]:
trainer.train()



TrainOutput(global_step=375, training_loss=1.00583203125, metrics={'train_runtime': 288.5708, 'train_samples_per_second': 10.396, 'train_steps_per_second': 1.3, 'total_flos': 789354427392000.0, 'train_loss': 1.00583203125, 'epoch': 3.0})
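
The numbers in `TrainOutput` can be checked by hand. The sketch below assumes the `TrainingArguments` defaults of a batch size of 8 per device and 3 epochs, a single device, and no gradient accumulation; none of these were set explicitly above, so treat them as assumptions.

```python
import math

n_train = 1000        # size of small_train_dataset
batch_size = 8        # assumed default per_device_train_batch_size
epochs = 3            # assumed default num_train_epochs
runtime_s = 288.5708  # train_runtime reported in the output above

steps_per_epoch = math.ceil(n_train / batch_size)  # batches per pass over the data
global_step = steps_per_epoch * epochs             # total optimizer steps
samples_per_second = n_train * epochs / runtime_s
steps_per_second = global_step / runtime_s

print(global_step, round(samples_per_second, 3), round(steps_per_second, 1))
```

Under these assumptions the computed values agree with the `global_step`, `train_samples_per_second`, and `train_steps_per_second` fields reported by the trainer.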

In [None]:
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [None]:
!ls "/content/drive/MyDrive/Data_course"

book_reviews.csv  Songs.pkl


In [None]:
model.save_pretrained('/content/drive/MyDrive/Data_course/Fine-tuned_model')
tokenizer.save_pretrained('/content/drive/MyDrive/Data_course/Fine-tuned_model')


('/content/drive/MyDrive/Data_course/Fine-tuned_model/tokenizer_config.json',
 '/content/drive/MyDrive/Data_course/Fine-tuned_model/special_tokens_map.json',
 '/content/drive/MyDrive/Data_course/Fine-tuned_model/vocab.txt',
 '/content/drive/MyDrive/Data_course/Fine-tuned_model/added_tokens.json',
 '/content/drive/MyDrive/Data_course/Fine-tuned_model/tokenizer.json')

In [None]:

from transformers import pipeline

# Define the path to your saved model and tokenizer
model_directory = "/content/drive/MyDrive/Data_course/Fine-tuned_model"  # Adjust this path

# Create a pipeline
# The model and tokenizer will be automatically loaded from the specified directory
classifier = pipeline("text-classification", model=model_directory, tokenizer=model_directory)

# Example usage
text = "The restaurant was terrible!!"
predictions = classifier(text)
print(predictions)


[{'label': 'LABEL_0', 'score': 0.7803637981414795}]


In [None]:
text = "I will come back to try the pizza."
predictions = classifier(text)
print(predictions)


[{'label': 'LABEL_2', 'score': 0.35370907187461853}]
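
The generic names `LABEL_0` to `LABEL_4` appear because we did not give the model human-readable label names. A small helper can translate them back, assuming that class index `i` in `yelp_review_full` corresponds to `i + 1` stars. The helper name `label_to_stars` is ours, and the scores below are rounded copies of the outputs above.

```python
def label_to_stars(label_name):
    """Map a generic label such as 'LABEL_2' to a star rating (assumes index + 1)."""
    class_index = int(label_name.rsplit("_", 1)[1])
    return class_index + 1


# Rounded copies of the two predictions printed above.
predictions = [
    {"label": "LABEL_0", "score": 0.78},
    {"label": "LABEL_2", "score": 0.35},
]

for p in predictions:
    stars = label_to_stars(p["label"])
    print(f'{p["label"]} -> {stars} star(s), score {p["score"]:.2f}')
```

Read this way, the first review is predicted as one star with fairly high confidence, while the second is an uncertain three-star prediction.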


## Google Gemini

Next, we'll explore the use of Large Language Model (LLM) APIs, specifically focusing on Google Gemini, which offers a free tier with a limited number of requests.

- For a quick start, check out this [Tutorial](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/tutorials/python_quickstart.ipynb#scrollTo=ab9ASynfcIZn).

- To access the API, visit [API](https://aistudio.google.com/app/apikey).

- For information on pricing, see [Pricing](https://ai.google.dev/pricing).


In [None]:
!pip install -q -U google-generativeai


In [None]:
import pathlib
import textwrap

import google.generativeai as genai

from IPython.display import display
from IPython.display import Markdown


def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
from google.colab import userdata

In [None]:
GOOGLE_API_KEY=userdata.get('GEMINI')

genai.configure(api_key=GOOGLE_API_KEY)

In [None]:
model = genai.GenerativeModel('gemini-pro')

In [None]:
response = model.generate_content(["Who is the villain in the following text? 1:Answer identifying the villain, if there is one clearly mentioned. 2:If the villain is only implicitly suggested, mention them. 3: If no villain is mentioned (not directly nor implicitly) say that there is no villain. TEXT: In questioning the use of hydraulic fracturing in New York to help produce natural gas, you do not note that the technology has been employed and continuously improved for more than 50 years and that studies by the Environmental Protection Agency and the Ground Water Protection Council have not identified a single instance of groundwater contamination. Wells where fracturing is used are specially constructed to protect drinking water sources. Regulatory oversight is extensive. The fluids mostly water that are forced into a well to create pressure to fracture rock are pushed back out by the oil and gas flowing upward for safe processing. Protecting our water supplies is important, as are reductions in greenhouse gas emissions through use of clean-burning natural gas. Banning hydraulic fracturing would be unwarranted and shortsighted, preventing production of large amounts of natural gas that could directly benefit New York consumers for decades and create thousands of good jobs."], stream=False)
response.resolve()

In [None]:
to_markdown(response.text)

> No villain is mentioned in the text.
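
When the same instructions are applied to many texts, it helps to build the prompt from its parts instead of writing one long string. The sketch below is one possible way to do that in plain Python; `build_prompt` is a hypothetical helper, and the resulting string would be passed to `model.generate_content` as in the cell above.

```python
def build_prompt(question, rules, text):
    """Assemble a prompt from a question, numbered rules, and the text to analyze."""
    numbered = " ".join(f"{i}: {rule}" for i, rule in enumerate(rules, start=1))
    return f"{question} {numbered} TEXT: {text}"


question = "Who is the villain in the following text?"
rules = [
    "Answer identifying the villain, if there is one clearly mentioned.",
    "If the villain is only implicitly suggested, mention them.",
    "If no villain is mentioned (not directly nor implicitly) say that there is no villain.",
]

prompt = build_prompt(question, rules, "Regulatory oversight is extensive.")
print(prompt)
```

Keeping the rules in a list makes it easy to add, remove, or reorder instructions and compare how the model's answers change.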

## OpenAI

We can also look at OpenAI, which offers an API to interact with its models.

[API OpenAI](https://platform.openai.com/api-keys)

[Pricing](https://openai.com/pricing)

[Embeddings](https://platform.openai.com/docs/guides/embeddings/use-cases)


In [None]:
!pip install OpenAI

Collecting OpenAI
  Downloading openai-1.14.2-py3-none-any.whl (262 kB)
Collecting httpx<1,>=0.23.0 (from OpenAI)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->OpenAI)
  Downloading httpcore-1.0.4-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->OpenAI)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, OpenAI
Successfully installed OpenAI-1.14.2 h11-0.14.0 htt

In [None]:
api_key = userdata.get('OpenAI')


In [None]:
import requests
import json


# Check if the API key was retrieved successfully
if api_key is None:
    print("API key not found.")
else:
    # Define the API endpoint for chat completions
    url = "https://api.openai.com/v1/chat/completions"

    # Headers for the API request
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    # Data payload for the API request
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Say this is a test!"}],
        "temperature": 0.7
    }

    # Make the API request
    response = requests.post(url, headers=headers, data=json.dumps(data))

    # Check if the request was successful
    if response.status_code == 200:
        # Print the response content
        print(response.json())
    else:
        print(f"Error: {response.status_code}", response.text)


Error: 429 {
    "error": {
        "message": "You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.",
        "type": "insufficient_quota",
        "param": null,
        "code": "insufficient_quota"
    }
}
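
The 429 error above comes from an exhausted quota, not from sending requests too quickly, so retrying will not help. The sketch below shows one way to tell these cases apart and to extract the reply text from a successful call. It works on plain dictionaries: the success body is a hypothetical example shaped like a chat completions response, the error body copies the fields shown above, and `summarize_response` is our own helper name.

```python
def summarize_response(status_code, body):
    """Return the reply text on success, or a short diagnosis on failure."""
    if status_code == 200:
        # The reply text lives in the first choice's message.
        return body["choices"][0]["message"]["content"]
    error = body.get("error", {})
    code = error.get("code", "unknown")
    if status_code == 429 and code == "insufficient_quota":
        return "Quota exhausted: check plan and billing; retrying will not help."
    if status_code == 429:
        return "Rate limited: wait and retry."
    return f"Error {status_code}: {error.get('message', 'no message')}"


# Hypothetical success body, reduced to the fields used here.
ok_body = {"choices": [{"message": {"role": "assistant", "content": "This is a test!"}}]}

# Error body with the same fields as the output above.
quota_body = {"error": {"type": "insufficient_quota", "param": None, "code": "insufficient_quota"}}

print(summarize_response(200, ok_body))
print(summarize_response(429, quota_body))
```

In the notebook, the same function could be called as `summarize_response(response.status_code, response.json())`.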
