# Prompt Engineering com IMDB

Elton Cardoso do Nascimento - 233840

> Utilizar o groq.com para usar a API do Llama 3 70B para fazer análise de sentimentos do IMDB.

Ao todo 4 técnicas serão testadas: zero-shot, few-shot, prompts dinâmicos e chain-of-thought

Vamos começar importando todas as bibliotecas que serão utilizadas:

In [1]:
import os # Operações com o SO (ler variáveis de ambiente)
import random # Operações randômicas
from concurrent.futures import ThreadPoolExecutor # Paralelização
import threading # Paralelização
import time # Temporização
from typing import Optional, List # Type hints

import datasets # Obter o dataset IMDB
import groq # API para utilizar o Llama 3
import tqdm # Print de progresso
import torch # ML
import pandas # Data manipulation

  from .autonotebook import tqdm as notebook_tqdm


E instalamos algumas outras que serão necessárias para obter o modelo BERT utilizado para realizar a técnica de "prompts dinâmicos":

In [None]:
!pip install tqdm boto3 requests regex sentencepiece sacremoses

## Interface para o Groq

Para realizar a inferência utilizando a API do Groq, criamos uma classe:

In [3]:
class GroqInterface:
    '''
    Interface for using the Groq API

    Implements a rate limit control for multi-threading use. 
    '''

    _client = None 

    LLAMA3_70B = "llama3-70b-8192"

    rate_lock = threading.Lock()

    def __init__(self, model:Optional[str]=None):
        '''
        GroqInterface constructor.

        Args:
            model (str, optional): model to use. Llama3 70B is used if None. Default is None
        '''
        
        if GroqInterface._client is None:
            api_key = os.environ.get("GROQ_API_KEY")

            if api_key is None:
                raise RuntimeError("API key is not in the environment variables ('GROQ_API_KEY' variable is not set).")

            GroqInterface._client = groq.Groq(api_key=api_key)

        if model is None:
            model = GroqInterface.LLAMA3_70B
        self._model = model

    def __call__(self, prompt:str) -> str:
        '''
        Generates the model response

        Args:
            prompt (str): prompt to send to the model.

        Returns:
            str: model response. 
        '''
        done = False
        while not done:

            try:
                GroqInterface.rate_lock.acquire()
                GroqInterface.rate_lock.release()

                chat_completion = GroqInterface._client.chat.completions.create(
                        messages=[
                            {
                                "role": "user",
                                "content": prompt,
                            }
                        ],
                        model=self._model,
                    )
                
                done = True
            except groq.RateLimitError as exception:
                GroqInterface.error = exception
                if not GroqInterface.rate_lock.locked():
                    GroqInterface.rate_lock.acquire()
                    time.sleep(2)
                    GroqInterface.rate_lock.release()

        return chat_completion.choices[0].message.content

Podemos testar a interface e verificar que obtemos corretamente a resposta do modelo:

In [4]:
groq_interface = GroqInterface()

In [5]:
groq_interface("Hi!")

"Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?"

Para facilitar o uso, definimos uma classe específica para análise de sentimento, que realiza um pós-processamento da saída obtida. Observe que, caso o modelo não seja claro na sua resposta que o review é positivo ou negativo um valor aleatório é utilizado.

In [6]:
POSITIVE = 1
NEGATIVE = 0

In [7]:
class GroqSentimentInterface(GroqInterface):
    '''
    GroqInterface with sentiment analisys post-processing.
    '''

    def __call__(self, prompt: str) -> int:
        '''
        Generates the model response for sentiment analisys.

        If the model is ambiguous in its response, a random one is generated.

        Args:
            prompt (str): prompt to send to the model.

        Returns:
            int: model response. POSITIVE if positive, NEGATIVE otherwise.
        '''

        response = super().__call__(prompt)
        response = response.lower()

        if "positive" in response and "negative" not in response:
            return POSITIVE
        if "negative" in response and "positive" not in response:
            return NEGATIVE
        
        return random.choice([POSITIVE, NEGATIVE])

In [8]:
groq_sentiment = GroqSentimentInterface()

## IMDB Prompt Engineering

Antes de podermos realizar a avaliação das técnicas, precisamos ainda preparar o dataset. Devido a longa demora para obter respostas da API e o limite de prompts por minuto, uma quantidade reduzida do dataset é utilizada para teste e validação, com apenas 100 elementos cada:

In [9]:
executor = ThreadPoolExecutor(max_workers=2)

trainbase_future = executor.submit(datasets.load_dataset, "imdb", split="train")
test_future = executor.submit(datasets.load_dataset, "imdb", split='test')

trainbase_dataset = trainbase_future.result()
testbase_dataset = test_future.result()

train_val_dataset = trainbase_dataset.train_test_split(test_size=100, shuffle=True, seed=78)
discard_test_dataset = testbase_dataset.train_test_split(test_size=100, shuffle=True, seed=78)

train_dataset = train_val_dataset["train"]
val_dataset = train_val_dataset["test"]
test_dataset = discard_test_dataset["test"]

In [10]:
len(train_dataset), len(val_dataset), len(test_dataset)

(24900, 100, 100)

## Zero-shot

Para a técnica de zero-shot, precisamos apenas preparar um prompt que solicita a classificação ao modelo:

In [20]:
base_prompt_zero = '''Classify if the movie review is POSITIVE or NEGATIVE: 
                Review:
                {review}

                Sentiment:
                POSITIVE OR NEGATIVE: 
                '''

In [44]:
prompt = base_prompt_zero.format(review=train_dataset[-1]["text"])

groq_sentiment(prompt), train_dataset[-1]["label"]

(1, 1)

Preparamos a função para realizar a avaliação de um sample:

In [22]:
def evaluate_zero(text:str, label:int) -> bool:
    '''
    Evaluates the zero-shot response

    Args:
        text (str): review to evaluate.
        label (int): review expected label.

    Returns:
        bool: True if the model classifies correctly.
    '''
    prompt = base_prompt_zero.format(review=text)
    result = groq_sentiment(prompt)

    return result == label

E calculamos a acurácia utilizando os dados de validação:

In [23]:
executor = ThreadPoolExecutor(max_workers=4) #More workers -> More RateLimit exceptions

futures = []
for data in val_dataset:
    future = executor.submit(evaluate_zero, **data)
    futures.append(future)

correct_zero = 0
for future in tqdm.tqdm(futures):
    correct_zero += future.result()

  0%|          | 0/100 [00:00<?, ?it/s]

100%|██████████| 100/100 [07:30<00:00,  4.50s/it] 


In [29]:
accuracy_zero = correct_zero/len(val_dataset)
print(f"Acurácia - Zero-shot - Validação: {accuracy_zero*100}%")

Acurácia - Zero-shot - Validação: 88.0%


Acurácia - Zero-shot - Validação: 88.0%


## Few-shot

Para a técnica de few-shot prompt, vamos enviar para o modelo 3 exemplos: um com o template de como a resposta deve funcionar, um positivo e um negativo:

In [43]:
raw_prompt_few = '''Classify if the movie review is positive or negative: 
                Review:
                Movie review

                Sentiment:
                ONLY POSITIVE OR NEGATIVE

                Classify if this movie review is positive or negative:
                Review:
                {example1}

                Sentiment:
                {response1}

                Classify if this movie review is positive or negative:
                Review:
                {example2}

                Sentiment:
                {response2}

                Classify if this movie review is positive or negative:
                Review:
                {{review}}
                
                Sentiment:
                
                '''

In [13]:
positive_example = None
negative_example = None

i = 0
while positive_example is None or negative_example is None:
    if train_dataset[i]["label"] == POSITIVE:
        positive_example = train_dataset[i]
    else:
        negative_example = train_dataset[i]
    
    i += 1

In [44]:
base_prompt_few = raw_prompt_few.format(example1=positive_example["text"], response1="POSITIVE", 
                                        example2=negative_example["text"], response2="NEGATIVE")

print(base_prompt_few)

Classify if the movie review is positive or negative: 
                Review:
                Movie review

                Sentiment:
                ONLY POSITIVE OR NEGATIVE

                Classify if this movie review is positive or negative:
                Review:
                but I want to say I cannot agree more with Moira.<br /><br />What a wonderful film.<br /><br />I was thinking about it just this morning, wanting to give advice to some dopey sod who'd lost money on his debit card through fraud, and wanted to say 'Keep thy money in thine pocket' and realised I was talking like James Mason.<br /><br />Even tho he didn't say those words, I still think he would! I've never forgotten 'Are ye carrying?' in his reconciliation with his son, Hywel Bennet: 'Always have money in thine pocket!' Good advice.<br /><br />Not enough kids have fathers with such unforgiving but well-meant attitudes any more. Or any father at all.<br /><br />It would be a good thing for us to reinstate

Verificamos que está funcionando corretamente:

In [42]:
prompt = base_prompt_few.format(review=train_dataset[-1]["text"])

groq_sentiment(prompt), train_dataset[-1]["label"]

(1, 1)

Preparamos a função de avaliação e calculamos a acurácia:

In [45]:
def evaluate_few(text:str, label:int) -> bool:
    '''
    Evaluates the few-shot response

    Args:
        text (str): review to evaluate.
        label (int): review expected label.

    Returns:
        bool: True if the model classifies correctly.
    '''
     
    prompt = base_prompt_few.format(review=text)
    result = groq_sentiment(prompt)

    return result == label

In [46]:
evaluate_few(**train_dataset[-1])

True

In [47]:
executor = ThreadPoolExecutor(max_workers=4) #More workers -> More RateLimit exceptions

futures = []
for data in val_dataset:
    future = executor.submit(evaluate_few, **data)
    futures.append(future)

correct_few = 0
for future in tqdm.tqdm(futures):
    correct_few += future.result()

100%|██████████| 100/100 [10:14<00:00,  6.15s/it]


In [48]:
accuracy_few = correct_few/len(val_dataset)
print(f"Acurácia - Few-shot - Validação: {accuracy_few*100}%")

Acurácia - Few-shot - Validação: 98.0%


## Prompt Dinâmico

Para o prompt dinâmico, iremos enviar um exemplo positivo e um negativo próximo do review que queremos classificar. Para isso, iremos gerar representações vetoriais de todos os prompts de treino disponíveis utilizando o modelo BERT e compará-las com a representação vetorial do review que queremos classificar.

Começamos carregando o BERT e seu tokenizador:

In [6]:
bert_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')
bert = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-transformers_main
Using cache found in /root/.cache/torch/hub/huggingface_pytorch-transformers_main
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification m

In [7]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert.to(device);

Desativamos o cálculo de gradientes e colocamos o modelo em modo de avaliação, visto que nenhum treino será realizado neste projeto:

In [8]:
bert.eval()
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7fca8b2a6970>

Para melhorar a eficiência do cálculo das representações vetoriais em batchs, ordenamos os dados de treino pelo tamanho do texto, para tentarmos manter representações tokenizadas de tamanhos similares próximas:

In [11]:
def cmp_func(series:pandas.Series) -> List[int]:
    '''
    Compares the elements of the series by the lenght.

    Args:
        series (pandas.Series): series to compare.

    Returns:
        List[int]: elements sizes.
    '''
    sizes = []
    for element in series:
        sizes.append(len(element))
    return sizes



In [None]:
df = train_dataset.to_pandas()
df.sort_values(by=["text"], key=cmp_func, #By text size
                inplace=True, #Inplace 
                ascending=False) #Descending (if memory error, first)
train_dataset_sorted = datasets.Dataset.from_pandas(df)

Calculamos as representações vetoriais dos dados de treino:

In [12]:
train_vectors = torch.Tensor(0, 768).float()
train_vectors = train_vectors.to(device)

for i in range(0, len(train_dataset_sorted), 256):
    start = i
    end = min(i+256, len(train_dataset_sorted))
    
    tokens = bert_tokenizer(train_dataset_sorted[start:end]["text"], return_tensors="pt", #Return as torch tensor 
                                 padding=True, #Add padding to small sequences
                                 return_token_type_ids=False, #Don't return sequence mask (only one sequence)
                                 truncation=True) #Truncate big sentences (max = 512 tokens, with CLS and SEP))
    
    input_ids = tokens["input_ids"].to(device)
    attention_mask = tokens["attention_mask"].to(device)

    result = bert(input_ids = input_ids, attention_mask=attention_mask)
    
    train_vectors = torch.cat((train_vectors, result["last_hidden_state"][:, 0]))

E dos dados de validação, que não precisam de batchs pela quantidade reduzida de dados:

In [20]:
tokens = bert_tokenizer(val_dataset["text"], return_tensors="pt", #Return as torch tensor 
                                 padding=True, #Add padding to small sequences
                                 return_token_type_ids=False, #Don't return sequence mask (only one sequence)
                                 truncation=True) #Truncate big sentences (max = 512 tokens, with CLS and SEP))

input_ids = tokens["input_ids"].to(device)
attention_mask = tokens["attention_mask"].to(device)

result = bert(input_ids = input_ids, attention_mask=attention_mask)
    
val_vectors = result["last_hidden_state"][:, 0]

Para comparar os dados, utilizamos similaridade de cosseno com as representações vetoriais:

In [88]:
train_vectors_norm = torch.nn.functional.normalize(train_vectors)
val_vectors_norm = torch.nn.functional.normalize(val_vectors)

val_cosine_similarity = val_vectors_norm @ train_vectors_norm.T  #[val_index, train_index]

E obtemos uma amostra positiva e outra negativa para cada sample:

In [85]:
indexes = torch.argsort(val_cosine_similarity, dim=1)

val_examples = []
for i in range(len(val_dataset)):
    positive = None
    negative = None

    for j in range(len(train_dataset)):
        index = indexes[i, j].item()

        if train_dataset_sorted[index]["label"] == POSITIVE:
            positive = train_dataset_sorted[index]["text"]
        else:
            negative = train_dataset_sorted[index]["text"]

        if positive is not None and negative is not None:
            example = {"positive":positive, "negative":negative}
            val_examples.append(example)
            break

Checamos se o prompt funciona:

In [98]:
prompt = raw_prompt_few.format(example1=val_examples[0]["positive"], response1="POSITIVE", 
                                        example2=val_examples[0]["negative"], response2="NEGATIVE")
prompt.format(review=val_dataset[0]["text"])

groq_sentiment(prompt), val_dataset[0]["label"]

(0, 0)

In [101]:
def evaluate_dynamic(index:int) -> bool:
    '''
    Evaluates the dynamic prompt response

    Args:
        index (int): index of the review in the validation dataset to evaluate

    Returns:
        bool: True if the model classifies correctly.
    '''
    
    prompt = raw_prompt_few.format(example1=val_examples[index]["positive"], response1="POSITIVE", 
                                        example2=val_examples[index]["negative"], response2="NEGATIVE")
    prompt = prompt.format(review=val_dataset[index]["text"])

    result = groq_sentiment(prompt)

    label = val_dataset[index]["label"]

    return result == label

In [102]:
evaluate_dynamic(0)

True

E realizamos o cálculo da acurácia no dataset de validação:

In [104]:
executor = ThreadPoolExecutor(max_workers=4) #More workers -> More RateLimit exceptions

futures = []
for i in range(len(val_dataset)):
    future = executor.submit(evaluate_dynamic, i)
    futures.append(future)

correct_dynamic = 0
for future in tqdm.tqdm(futures):
    correct_dynamic += future.result()

  0%|          | 0/100 [00:00<?, ?it/s]

100%|██████████| 100/100 [08:21<00:00,  5.02s/it] 


In [106]:
accuracy_dynamic = correct_dynamic/len(val_dataset)
print(f"Acurácia - Dinâmico - Validação: {accuracy_dynamic*100}%")

Acurácia - Dinâmico - Validação: 95.0%


Coletando os tempos utilizados em cada parte podemos obter o tempo total para utilizar esta técnica:

- Gerar vetores para os dados de treino: 2 min 50,4 s
- Gerar vetores para os dados de validação: 0,0 s
- Calcular similaridade de cosseno: 0,0 s
- Procurar exemplos: 0,2 s
- Chamadas prompts: 8 min 21,9 s

Total: 11 min 12,5 s

## Chain-of-Thought

Para a técnica de chain-of-thought, vamos adicionar um campo de "reasoning" aos exemplos no prompt:

In [31]:
raw_prompt_cot = '''Classify if the movie review is positive or negative: 
Review:
Movie review

Sentiment:
ONLY POSITIVE OR NEGATIVE

Classify if this movie review is positive or negative:
Review:
{example1}

Reasoning:
{reasoning1}

Sentiment:
{response1}

Classify if this movie review is positive or negative:
Review:
{example2}

Reasoning:
{reasoning2}

Sentiment:
{response2}

Classify if this movie review is positive or negative:
Review:
{{review}}

Reasoning:
'''

O "reasoning" dos exemplos positivos e negativos a seguir foram gerados utilizando também o Llama3 70B, porém com a interface gráfica online do Groq.

In [32]:
positive_example["text"]

"but I want to say I cannot agree more with Moira.<br /><br />What a wonderful film.<br /><br />I was thinking about it just this morning, wanting to give advice to some dopey sod who'd lost money on his debit card through fraud, and wanted to say 'Keep thy money in thine pocket' and realised I was talking like James Mason.<br /><br />Even tho he didn't say those words, I still think he would! I've never forgotten 'Are ye carrying?' in his reconciliation with his son, Hywel Bennet: 'Always have money in thine pocket!' Good advice.<br /><br />Not enough kids have fathers with such unforgiving but well-meant attitudes any more. Or any father at all.<br /><br />It would be a good thing for us to reinstate 'thee', 'thy' and 'thine' in our language to show we care. It is only the same as 'tutoyer' in French or 'du' in German.<br /><br />Addendum: I just realised that a lot of my remarks were about James Mason in The Family Way!<br /><br />I think it's because I mixed up Susan George with Ha

In [33]:
positive_reasoning = '''This review is positive for several reasons:

*. The reviewer starts by praising the film, stating that it is "a wonderful film" and expresses how it has affected their thoughts and words.
*. They mention specific scenes and characters from the film, demonstrating their engagement and appreciation for the story.
*. The reviewer draws parallels between the film and their own life, mentioning how the advice given in the film relates to their own experiences and emotions.
*. They express nostalgia for a bygone era, mentioning the past when one could take a girlfriend to the pictures and have meaningful conversations about the film afterwards.
*. The reviewer values the impact of theatrical experiences on personal development and feels that this is lacking in modern times.
*. Despite making a mistake by mixing up actresses, the reviewer stands by their comments and maintains their positive sentiment towards the film.

Overall, the review is positive because it shows appreciation for the film, its characters, and its themes, as well as the reviewer's nostalgia for the past and the significance of theatrical experiences in personal growth.'''

In [34]:
negative_example["text"]

'"A total waste of time" Just throw in a few explosions, non stop fighting, exotic cars a deranged millionaire, slow motion computer generated car crashes and last but not least a Hugh Hefner like character with wall to wall hot babes, and mix in a blender and you will have this sorry excuse for a movie. I really got a laugh out of the "Dr. Evil" like heavily fortified compound. The plot was somewhere between preposterous and non existent. How many millionaires are willing to make a 25 million dollar bet on a car race? Answer: 4 but, didn\'t they become millionaires through fiscal responsibility? This was written for pubescent males, it plays like a video game. I did enjoy the Gulfstream II landing in the desert though.'

In [35]:
negative_reasoning = '''This review is negative because of the following reasons:

* The reviewer calls the movie "a total waste of time" and "a sorry excuse for a movie", which indicates a strong negative opinion.
* The reviewer sarcastically lists various ingredients that they think were thrown together to make the movie, implying that the film is shallow and lacks substance.
* They criticize the plot, calling it "preposterous" and "non-existent".
* They question the plot's plausibility, suggesting that the movie's concept is unrealistic.
* They imply that the movie is only suitable for "pubescent males" and that it's more like a video game than a serious film.
* The only positive comment, about the Gulfstream II landing, is brief and doesn't outweigh the overall negative tone of the review.
'''

Montamos o prompt base com os "reasonings":

In [37]:
base_prompt_cot = raw_prompt_cot.format(example1=positive_example["text"], reasoning1=positive_reasoning, response1="POSITIVE",
                                        example2=negative_example["text"], reasoning2=negative_reasoning, response2="NEGATIVE")


print(base_prompt_cot)

Classify if the movie review is positive or negative: 
Review:
Movie review

Sentiment:
ONLY POSITIVE OR NEGATIVE

Classify if this movie review is positive or negative:
Review:
but I want to say I cannot agree more with Moira.<br /><br />What a wonderful film.<br /><br />I was thinking about it just this morning, wanting to give advice to some dopey sod who'd lost money on his debit card through fraud, and wanted to say 'Keep thy money in thine pocket' and realised I was talking like James Mason.<br /><br />Even tho he didn't say those words, I still think he would! I've never forgotten 'Are ye carrying?' in his reconciliation with his son, Hywel Bennet: 'Always have money in thine pocket!' Good advice.<br /><br />Not enough kids have fathers with such unforgiving but well-meant attitudes any more. Or any father at all.<br /><br />It would be a good thing for us to reinstate 'thee', 'thy' and 'thine' in our language to show we care. It is only the same as 'tutoyer' in French or 'du' i

E testamos se funciona corretamente:

In [39]:
prompt = base_prompt_cot.format(review=train_dataset[-1]["text"])

groq_sentiment(prompt), train_dataset[-1]["label"]

(1, 1)

Mais uma vez utilizamos uma função para realizar a avaliação e calculamos a acurácia de validação:

In [40]:
def evaluate_cot(text, label):
    '''
    Evaluates the chain-of-thought response

    Args:
        text (str): review to evaluate.
        label (int): review expected label.

    Returns:
        bool: True if the model classifies correctly.
    '''
    

    prompt = base_prompt_cot.format(review=text)
    result = groq_sentiment(prompt)

    return result == label

evaluate_cot(**train_dataset[-1])

True

In [41]:
executor = ThreadPoolExecutor(max_workers=4) #More workers -> More RateLimit exceptions

futures = []
for data in val_dataset:
    future = executor.submit(evaluate_cot, **data)
    futures.append(future)

correct_cot = 0
for future in tqdm.tqdm(futures):
    correct_cot += future.result()

100%|██████████| 100/100 [13:29<00:00,  8.10s/it] 


In [42]:
accuracy_cot = correct_cot/len(val_dataset)
print(f"Acurácia - Chain-of-Thought - Validação: {accuracy_cot*100}%")

Acurácia - Chain-of-Thought - Validação: 95.0%


## Comparação e Teste

Comparando as quatro técnicas utilizadas, podemos observar que a técnica de zero-shot é a mais rápida, sendo importante ressaltar que este tempo depende fortemente do limite de tokens da API do Groq, sendo esta a técnica que utiliza menos tokens por classificação. Já em quesitos de acurácia, a few-shot se dá melhor, não sendo observado melhorias com o uso de técnicas mais complexas.

É por importante ressaltar que a quantidade reduzida de dados de validação utilizados dificultam uma comparação válida entre as técnicas.

Técnica | Acurácia de Validação | Tempo
-|-|-
Zero-shot|88%|7 min 30 s
Few-shot|98%|10 min 14 s
Prompts dinâmicos|95%|11 min 13 s
Chain-of-Thought|95%|13 min 29 s

Por fim, calculamos a acurácia de teste com o few-shot:

In [46]:
executor = ThreadPoolExecutor(max_workers=4) #More workers -> More RateLimit exceptions

futures = []
for data in test_dataset:
    future = executor.submit(evaluate_few, **data)
    futures.append(future)

correct_test = 0
for future in tqdm.tqdm(futures):
    correct_test += future.result()

100%|██████████| 100/100 [09:26<00:00,  5.67s/it]


In [47]:
accuracy_test = correct_test/len(test_dataset)
print(f"Acurácia - Few-shot - Teste: {accuracy_test*100}%")

Acurácia - Few-shot - Teste: 92.0%
