# Prompt tuning: GPT
We will try to use a generative model for text classification.
> Use GPT-2 from Hugging Face to perform sentiment analysis.
What do you observe?

In [None]:
# TODO: Zero-shot Classification with GPT-2

# Zero-shot prompt tuning: OpenPrompt

In [None]:
!pip install -q openprompt


In [None]:
%%capture
!git clone https://github.com/thunlp/OpenPrompt.git
%cd OpenPrompt
!pip install -r requirements.txt
!python setup.py install

## Step 1: Define a task
The first step is to determine the NLP task: think about what your data looks like and what you want from it. In essence, this step determines the classes and the InputExample of the task. For simplicity, we use sentiment analysis as an example.

In [None]:
from openprompt.data_utils import InputExample
classes = [ # There are two classes in Sentiment Analysis, one for negative and one for positive
    "negative",
    "positive"
]
dataset = [ # For simplicity, there are only two examples
    # text_a is the input text of each example; some datasets have several input sentences per example.
    InputExample(
        guid = 0,
        text_a = "Albert Einstein was one of the greatest intellects of his time.",
    ),
    InputExample(
        guid = 1,
        text_a = "The film was badly made.",
    ),
]



## Step 2: Define a Pre-trained Language Model (PLM) as backbone.
Choose a PLM to support your task. Different models have different attributes, so we encourage you to use OpenPrompt to explore the potential of various PLMs. OpenPrompt is compatible with models from Hugging Face.

In [None]:
from openprompt.plms import load_plm
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Step 3: Define a Template.
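A Template modifies the original input text and is one of the most important modules in prompt-learning. Here, the template appends "It was" followed by a mask token to text_a (which we defined in Step 1); the PLM then predicts the word at the mask position.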

In [None]:
from openprompt.prompts import ManualTemplate
promptTemplate = ManualTemplate(
    text = '{"placeholder":"text_a"} It was {"mask"}',
    tokenizer = tokenizer,
)
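
To make the wrapping concrete, here is a minimal pure-Python sketch of what the template above produces for one input. It is only an illustration, not OpenPrompt's implementation: `render_template` is a hypothetical helper, and we assume the mask token is `[MASK]`, as used by BERT tokenizers.

```python
# Simplified illustration of template wrapping (not OpenPrompt's actual code).
TEMPLATE = '{"placeholder":"text_a"} It was {"mask"}'


def render_template(template: str, text_a: str, mask_token: str = "[MASK]") -> str:
    """Substitute the input text and the mask token into the template string."""
    rendered = template.replace('{"placeholder":"text_a"}', text_a)
    return rendered.replace('{"mask"}', mask_token)


prompt = render_template(TEMPLATE, "The film was badly made.")
print(prompt)
```

The masked language model is then asked which word best fills the mask position in this wrapped sentence.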

## Step 4: Define a Verbalizer
A Verbalizer is another important (but optional) module in prompt-learning, which projects the original labels (we defined them as classes, remember?) onto a set of label words. In this example, we project the negative class onto the word bad, and the positive class onto the words good, wonderful, and great.

In [None]:
from openprompt.prompts import ManualVerbalizer
promptVerbalizer = ManualVerbalizer(
    classes = classes,
    label_words = {
        "negative": ["bad"],
        "positive": ["good", "wonderful", "great"],
    },
    tokenizer = tokenizer,
)
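
The projection from label words to classes can be sketched as follows. This is a simplified illustration under stated assumptions: the log-probabilities in `word_log_probs` are invented numbers, `class_scores` is a hypothetical helper, and the real ManualVerbalizer may apply additional normalization that is not shown here.

```python
# Simplified illustration of how a verbalizer can turn word scores into class scores.
label_words = {
    "negative": ["bad"],
    "positive": ["good", "wonderful", "great"],
}

# Hypothetical log-probabilities assigned by the PLM to each label word at the mask position.
word_log_probs = {"bad": -1.2, "good": -2.0, "wonderful": -4.5, "great": -2.5}


def class_scores(word_log_probs, label_words):
    """Average the log-probabilities of the label words belonging to each class."""
    return {
        label: sum(word_log_probs[word] for word in words) / len(words)
        for label, words in label_words.items()
    }


scores = class_scores(word_log_probs, label_words)
prediction = max(scores, key=scores.get)  # class with the highest average score
print(scores, prediction)
```

Averaging keeps classes comparable even when they have different numbers of label words.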

## Step 5: Combine them into a PromptModel
Given the task, we now have a PLM, a Template, and a Verbalizer, which we combine into a PromptModel. Note that although this example naively combines the three modules, you can define more complicated interactions among them.

In [None]:
from openprompt import PromptForClassification
promptModel = PromptForClassification(
    template = promptTemplate,
    plm = plm,
    verbalizer = promptVerbalizer,
)

## Step 6: Define a DataLoader
A PromptDataLoader is basically a prompt version of the PyTorch DataLoader, which also includes a Tokenizer, a Template, and a TokenizerWrapper.

In [None]:
from openprompt import PromptDataLoader
data_loader = PromptDataLoader(
    dataset = dataset,
    tokenizer = tokenizer,
    template = promptTemplate,
    tokenizer_wrapper_class=WrapperClass,
)



## Step 7: Train and inference
Done! We can run training and inference the same way as for any other PyTorch model.

In [None]:
import torch

# Zero-shot inference using the pretrained MLM with the prompt
promptModel.eval()
with torch.no_grad():
    for batch in data_loader:
        logits = promptModel(batch)
        preds = torch.argmax(logits, dim=-1)
        # preds holds one class index per example in the batch
        for pred in preds.tolist():
            print(classes[pred])
# Predicted indices are 1 and 0, i.e. the classes 'positive' and 'negative'

positive
negative


# 1/ Build a spam detection model
> Same idea, but to detect spam in SMS messages
You can look at the data if needed: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
OR simply use the few-shot approach

In [None]:
# Load the SMS Spam Collection Dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
import pandas as pd

# Try different encodings if 'ISO-8859-1' does not work
df = pd.read_csv("Your path", encoding='ISO-8859-1')
df.head()


| Unnamed: 0 | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [None]:
# TODO: Adapt the previous code
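
As a starting point, here is one possible way to turn rows in the `v1` (label) / `v2` (message) layout into records with the same fields as an InputExample. It is a sketch under assumptions: the two messages in `SAMPLE_CSV` are invented for illustration, `rows_to_examples` is a hypothetical helper, and the class order `["ham", "spam"]` is our own choice.

```python
import csv
import io

# Tiny inline sample in the v1/v2 layout; the messages are invented for illustration.
SAMPLE_CSV = """v1,v2
ham,See you at lunch tomorrow
spam,You won a free prize! Call now to claim
"""

classes = ["ham", "spam"]  # index 0 = ham, index 1 = spam (our own ordering)


def rows_to_examples(csv_text):
    """Map each CSV row to a dict mirroring InputExample's guid, text_a and label fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"guid": i, "text_a": row["v2"], "label": classes.index(row["v1"])}
        for i, row in enumerate(reader)
    ]


examples = rows_to_examples(SAMPLE_CSV)
for example in examples:
    print(example)
```

Each dict can then be passed to `InputExample(**example)`, and the template and label words from the sentiment example need to be replaced with ones that fit spam detection.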

In [None]:
import torch

# Zero-shot inference using the pretrained MLM with the prompt
promptModel.eval()
with torch.no_grad():
    for batch in data_loader:
        logits = promptModel(batch)
        preds = torch.argmax(logits, dim=-1)
        # preds holds one class index per example in the batch
        for pred in preds.tolist():
            print(classes[pred])
# With classes = ["ham", "spam"], indices 1 and 0 correspond to 'spam' and 'ham'

# 2/ Mix the templates to predict relations
Documentation: https://github.com/thunlp/OpenPrompt/tree/main
> Goal: Predict the relation between two sentences, as in the course

> Example:
Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is training his horse for a competition.
Relation: neutral

> Example 2:
Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is at a diner, ordering an omelette.
Relation: contradiction

> Example 3:
Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is outdoors, on a horse.
Relation: entailment

You can take more examples from: https://huggingface.co/datasets/snli?row=2
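
For this task the template has to mix two inputs, the premise (text_a) and the hypothesis (text_b). Below is a minimal pure-Python sketch of one possible template and set of label words. The template wording, the label words Yes/Maybe/No, and the `render_nli_template` helper are all assumptions chosen for illustration, not something prescribed by OpenPrompt.

```python
# One possible template mixing two placeholders (an illustrative choice).
NLI_TEMPLATE = (
    'Premise: {"placeholder":"text_a"} '
    'Hypothesis: {"placeholder":"text_b"} '
    'Answer: {"mask"}'
)

classes = ["entailment", "neutral", "contradiction"]

# Hypothetical label words for each relation.
label_words = {
    "entailment": ["Yes"],
    "neutral": ["Maybe"],
    "contradiction": ["No"],
}


def render_nli_template(template, text_a, text_b, mask_token="[MASK]"):
    """Fill both placeholders and the mask token (simplified, not OpenPrompt's code)."""
    rendered = template.replace('{"placeholder":"text_a"}', text_a)
    rendered = rendered.replace('{"placeholder":"text_b"}', text_b)
    return rendered.replace('{"mask"}', mask_token)


rendered = render_nli_template(
    NLI_TEMPLATE,
    "A person on a horse jumps over a broken down airplane.",
    "A person is outdoors, on a horse.",
)
print(rendered)
```

The same template string can be given to ManualTemplate, as long as each InputExample provides both text_a and text_b.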

In [2]:
# TODO: Adapt the previous code

The inference part is provided below to simplify the project.

In [None]:
# Test examples for inference
test_examples = [
    InputExample(guid=0, text_a="A child is playing in the park.", text_b="The child is at school."),
    InputExample(guid=1, text_a="Two people are walking down the street.", text_b="The people are outside."),
    InputExample(guid=2, text_a="The cat is sleeping on the sofa.", text_b="The animal is resting indoors."),
]

# Create a PromptDataLoader for the test examples.
# promptTemplate, promptModel and classes must be the NLI versions you defined above.
test_loader = PromptDataLoader(
    dataset=test_examples,
    tokenizer=tokenizer,
    template=promptTemplate,
    tokenizer_wrapper_class=WrapperClass,
    max_seq_length=256,  # Adjust based on your model's capacity
    batch_size=1  # One example per batch, so .item() below yields a single prediction
)
# Put the model in evaluation mode
promptModel.eval()

# Disable gradient calculations for inference
with torch.no_grad():
    for batch in test_loader:
        # Move batch to the same device as model
        batch = {k: v.to(promptModel.device) for k, v in batch.items() if isinstance(v, torch.Tensor)}

        # Forward pass
        logits = promptModel(batch)

        # Get the predicted class (the one with the highest probability)
        predicted_class = torch.argmax(logits, dim=-1).item()

        # Print the prediction
        print(f"Predicted class: {classes[predicted_class]}")


# 3/ Soft verbalizer
> The goal is to let the model decide which words to use in the Verbalizer
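
Conceptually, a soft verbalizer replaces the fixed label words with a trainable projection from the hidden vector at the mask position to one score per class. The pure-Python sketch below only illustrates that idea: the weights, the hidden vector and the `soft_verbalize` helper are hypothetical, and it does not use OpenPrompt's own SoftVerbalizer class.

```python
# Conceptual sketch: a linear layer mapping the mask hidden state to class logits.
# In practice the weights are learned during training; here they are fixed for illustration.
weights = [
    [0.1, -0.2, 0.0, 0.3],   # class 0
    [-0.1, 0.2, 0.1, 0.0],   # class 1
]
bias = [0.0, 0.0]


def soft_verbalize(mask_hidden):
    """Compute one logit per class as a weighted sum of the hidden vector plus a bias."""
    return [
        sum(w * h for w, h in zip(row, mask_hidden)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical hidden vector at the mask position (hidden size 4).
logits = soft_verbalize([1.0, 2.0, 3.0, 4.0])
predicted_index = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted_index)
```

Because the projection is trained, no label words have to be chosen by hand, but some labeled examples are needed.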

In [None]:
# TODO

# 4/ Soft Template
> The goal is to let the model find a prompt for the Template (one that will not necessarily make sense to us)
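
The idea can be pictured as a few trainable vectors placed in front of the token embeddings of the input, instead of human-readable prompt words. The sketch below is purely conceptual: all numbers are invented, `prepend_soft_prompt` is a hypothetical helper, and it does not use OpenPrompt's own soft template classes.

```python
hidden_size = 4
num_soft_tokens = 3

# Trainable prompt vectors (fixed values here; training would update them).
soft_prompt = [[0.01 * (i + 1)] * hidden_size for i in range(num_soft_tokens)]

# Hypothetical embeddings for two real input tokens.
token_embeddings = [[0.5] * hidden_size, [-0.5] * hidden_size]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Build the model input sequence: soft prompt vectors followed by token embeddings."""
    return soft_prompt + token_embeddings


model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
print(len(model_input))
```

Since the soft vectors are not tied to any vocabulary entry, the learned prompt usually cannot be read back as text.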

In [None]:
# TODO