# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Binary Classification, LLMs, Prompting


# Contact

For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Relevant Material

- Tutorial 3
- Huggingface documentation
- Huggingface hub

# Introduction

You are tasked to address the [EDOS Task A](https://github.com/rewire-online/edos) on sexism detection.

## Problem definition

Given an input text sentence, the task is to label the sentence as sexist or not sexist (binary classification).

### Examples:

**Text**: *``Schedule a date with her, then don't show up. Then text her "GOTCHA B___H".''*

**Label**: Sexist

**Text**: *``That’s completely ridiculous a woman flashing her boobs is not sexual assault in the slightest.''*

**Label**: Not sexist



## Approach

We will tackle the binary classification task with LLMs.

In particular, we'll consider zero-/few-shot prompting approaches to assess the capability of some popular open-source LLMs on this task.

## Preliminaries

We are going to download LLMs from [Huggingface](https://huggingface.co/).

Many of these open-source LLMs require you to accept their "Community License Agreement" to download them.

In summary:

- If you do not have one already, create an account on Huggingface (~2 mins)
- Open the model card page of an LLM (e.g., [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)) and accept its "Community License Agreement".
- Go to your account -> Settings -> Access Tokens -> Create new token -> "Repositories permissions" -> add the LLM model card you want to use.
- Save the token (we'll need it later)

In [2]:
# Never hardcode or print a real access token in a notebook you share
print('Token: hf_<your_access_token>')
print('Model card: mistralai/Mistral-7B-Instruct-v0.3')

Token: hf_<your_access_token>
Model card: mistralai/Mistral-7B-Instruct-v0.3


### Huggingface Login

Once we have created an account and an access token, we need to log in to Huggingface via code.

- Type your token and press Enter
- You can say No to Github linking

In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `assignment_2_NLP` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `a


After login, you can download all models associated with your access token in addition to those that are not protected by an access token.
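As an alternative to typing the token at the interactive prompt, the token can be read from an environment variable so that it never appears in the notebook. The sketch below is only an illustration: the helper names `read_hf_token` and `mask_token` are hypothetical, and it assumes the token was stored beforehand in a variable named `HF_TOKEN`. The actual login call is left as a comment because it needs a real token.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the access token stored in an environment variable (hypothetical helper)."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


def mask_token(token, visible=4):
    """Hide all but the first `visible` characters, e.g. before logging."""
    return token[:visible] + "*" * max(len(token) - visible, 0)


# Placeholder value used only to demonstrate the helpers; it is not a real token.
demo_env = {"HF_TOKEN": "hf_exampleplaceholder"}
print(mask_token(read_hf_token(demo_env)))

# With a real token in the environment, the login could then look like this:
# from huggingface_hub import login
# login(token=read_hf_token())
```

Keeping the token outside the notebook also means that the file can be submitted or shared without leaking credentials.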

### Data Loading

Since we are only interested in prompting, we do not require a train dataset.

We have prepared a small test set version of EDOS in our dedicated [Github repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material).

Check the ``Assignment 2/data`` folder.
It contains:

- ``a2_test.csv`` -> a small test set of 300 samples.
- ``demonstrations.csv`` -> a batch of 1000 samples for few-shot prompting.

Both datasets contain a balanced number of sexist and not sexist samples.
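The balance claim is easy to verify once the files are downloaded. The following is a minimal sketch using only the standard library; the helper name `label_counts` is hypothetical and the rows are made-up placeholders that merely reuse the column names of the assignment files.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column="label_sexist"):
    """Count how many rows carry each label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Toy rows with the same columns as the assignment files (texts are made up).
toy_csv = io.StringIO(
    "rewire_id,text,label_sexist\n"
    "toy-1,first toy sentence,sexist\n"
    "toy-2,second toy sentence,not sexist\n"
    "toy-3,third toy sentence,not sexist\n"
    "toy-4,fourth toy sentence,sexist\n"
)
counts = label_counts(toy_csv)
print(counts)

# On the real data, the same function would receive an open file handle, e.g.
# with open("./data/a2_test.csv", newline="", encoding="utf-8") as f:
#     print(label_counts(f))
```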


In [4]:
!pip install transformers
!pip install datasets
!pip install accelerate -U
!pip install evaluate
!pip install bitsandbytes


In [5]:
import torch
import pandas as pd
import numpy as np
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

### Instructions

We require you to:

* **Download** the ``A2/data`` folder.
* **Encode** ``a2_test.csv`` into a ``pandas.DataFrame`` object.

In [7]:
a2_test = pd.read_csv('./data/a2_test.csv')
a2_test.head()

Unnamed: 0,rewire_id,text,label_sexist
0,sexism2022_english-17133,The boys must be gaming because there goes the...,not sexist
1,sexism2022_english-14197,Look at those eyes. Either someone unexpectedl...,sexist
2,sexism2022_english-3018,Old man mogs everyone in this sub,not sexist
3,sexism2022_english-5301,"Excellent, I was just looking at another post ...",not sexist
4,sexism2022_english-17796,So you run back to daddy whenever you need hel...,sexist


# [Task 1 - 0.5 points] Model setup

Once the test data has been loaded, we have to set up the model pipeline for inference.

In particular, we have to:
- Load the model weights from Huggingface
- Quantize the model to fit into a single-GPU limited hardware
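To see why quantization matters on a single GPU, it helps to estimate how much memory the weights alone occupy at different precisions. The sketch below is a back-of-the-envelope calculation under a stated assumption: roughly 7 billion parameters, as the "7B" in the model name suggests. The function name `weight_memory_gib` is hypothetical, and the estimate ignores activations, the generation cache, and the extra metadata that quantization schemes store.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Memory needed to store the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 2**30


# Assumption: about 7 billion parameters, as the "7B" in the model name suggests.
num_parameters = 7e9
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(num_parameters, bits):.2f} GiB")
```

Under this assumption, 16-bit weights need about 13 GiB while 4-bit weights need a little over 3 GiB, which is what makes free-tier GPUs a viable option.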

In [8]:
torch.cuda.is_available()

False

## Mistral v3

In [9]:
model_card = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizerM = AutoTokenizer.from_pretrained(model_card)
tokenizerM.pad_token = tokenizerM.eos_token # sets the padding token to be the same as the end-of-sequence token


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [10]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# To load the model
model = AutoModelForCausalLM.from_pretrained(
    model_card,
    return_dict=True,
    quantization_config=bnb_config,
    device_map='auto'
)

config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend


RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend

## Which LLMs?

The pool of LLMs is ever increasing and it's impossible to keep track of all new entries.

We focus on popular open-source models.

- [Mistral v2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- [Mistral v3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [Llama v3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [Phi3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)

Other open-source models are more than welcome!

### Instructions

In order to get Task 1 points, we require you to:

* Pick 2 model cards from the provided list.
* For each model:
  - Define a separate section of your notebook for the model.
  - Setup a quantization configuration for the model.
  - Load the model via HuggingFace APIs.


### Notes

1. There's a popular library integrated with Huggingface's ``transformers`` to perform quantization.

2. Define two separate sections of your notebook to show that you have implemented the prompting pipeline for each selected model card.

# [Task 2 - 1.0 points] Prompt setup

Prompting requires an input pre-processing phase where we convert each input example into a specific instruction prompt.


## Prompt Template

Use the following prompt template to process input texts.

In [12]:
prompt_zero = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        TEXT:
        {text}

        ANSWER:
        """
    }
]

### Instructions

In order to get Task 2 points, we require you to:

* Write a ``prepare_prompts`` function as the one reported below.

In [59]:
def prepare_prompts_zero(texts, prompt_template, tokenizer):
  """
    This function formats an input text sample into an instruction prompt.

    Inputs:
      texts: input text to classify via prompting
      prompt_template: the prompt template provided in this assignment
      tokenizer: the transformers Tokenizer object instance associated with the chosen model card

    Outputs:
      the input text in the form of an instruction prompt
  """
  # Render the chat messages with the model-specific chat template (as a string, not token ids)
  prompt_template = tokenizer.apply_chat_template(prompt_template, tokenize=False, add_generation_prompt=True)
  # Replace the {text} placeholder with the input text
  text_formatted = prompt_template.format(text=texts)

  return text_formatted

In [14]:
text = prepare_prompts_zero('This is a test!', prompt_zero, tokenizerM)
print(text)

<s>[INST] You are an annotator for sexism detection.

Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        TEXT:
        This is a test!

        ANSWER:
        [/INST]


### Notes

1. You are free to modify the prompt format (**not its content**) as you like depending on your code implementation.

2. Note that the provided prompt has placeholders. You need to format the string to replace placeholders. Huggingface might have dedicated APIs for this.
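One way to handle the placeholders is to fill them in the message contents before the chat template is rendered, so that `str.format` never runs over the special tokens added by the template. The following is a minimal sketch with a hypothetical helper name (`fill_messages`) and a toy template; it does not require a tokenizer.

```python
import copy


def fill_messages(messages, **fields):
    """Return a copy of the chat messages with placeholders replaced in each content."""
    filled = copy.deepcopy(messages)  # leave the original template untouched
    for message in filled:
        message["content"] = message["content"].format(**fields)
    return filled


# Toy template with the same structure as the one provided in the assignment.
toy_template = [
    {"role": "system", "content": "You are an annotator."},
    {"role": "user", "content": "TEXT:\n{text}\n\nANSWER:\n"},
]
messages = fill_messages(toy_template, text="This is a {test}!")
print(messages[1]["content"])
```

The filled messages can then be passed to `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. Curly braces inside the input text are inserted literally, since `str.format` does not re-parse the values it substitutes.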

# [Task 3 - 1.0 points] Inference

We are now ready to define the inference loop where we prompt the model with each pre-processed sample.

### Instructions

In order to get Task 3 points, we require you to:

* Write a ``generate_responses`` function as the one reported below.
* Write a ``process_response`` function as the one reported below.
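For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so a decoded output contains the whole prompt unless it is cut off. The sketch below illustrates the idea on plain Python lists with toy token ids; the helper name `keep_new_tokens` is hypothetical.

```python
def keep_new_tokens(generated_ids, prompt_length):
    """Drop the first `prompt_length` ids of every generated sequence."""
    return [sequence[prompt_length:] for sequence in generated_ids]


# Toy ids: a 4-token prompt followed by 2 generated tokens.
prompt_ids = [11, 12, 13, 14]
generated_ids = [prompt_ids + [21, 22]]
new_ids = keep_new_tokens(generated_ids, len(prompt_ids))
print(new_ids)
```

With tensors, the equivalent slice would be along the lines of `generated[:, input_ids.shape[1]:]`; decoding only that part makes the response parsing independent of the prompt text.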

In [15]:
generation_config = model.generation_config
generation_config.max_new_tokens = 2
generation_config.eos_token_id = tokenizerM.eos_token_id
generation_config.pad_token_id = tokenizerM.eos_token_id
generation_config.temperature = None
generation_config.num_return_sequences = 1

In [16]:
def preprocess_text(texts):
    return tokenizerM(texts['text'], truncation=True)

def encode_Y(sample):
    # Map the textual label to a binary one: 1 = sexist, 0 = not sexist
    return {"text": sample["text"], "label_sexist": 1 if sample['label_sexist'] == 'sexist' else 0}

def collate_fn(batch):
    # Keep prompts as raw strings: tokenization happens inside generate_responses
    texts = [it['text'] for it in batch]
    labels = torch.tensor([it['label_sexist'] for it in batch])
    return {'text': texts}, labels


test_data = Dataset.from_pandas(a2_test)
test_data = test_data.map(preprocess_text, batched=True)
test_data = test_data.map(encode_Y)
test_data = test_data.map(lambda x: {'text': prepare_prompts_zero(x['text'], prompt_zero, tokenizerM)})
test_data = test_data.select_columns(['text', 'label_sexist'])
print(test_data["text"])


data_loader = DataLoader(test_data, batch_size=1, shuffle=False, collate_fn=collate_fn)

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

['<s>[INST] You are an annotator for sexism detection.\n\nYour task is to classify input text as containing sexism or not. Respond only YES or NO.\n\n        TEXT:\n        The boys must be gaming because there goes the wifi.\n\n        ANSWER:\n        [/INST]', '<s>[INST] You are an annotator for sexism detection.\n\nYour task is to classify input text as containing sexism or not. Respond only YES or NO.\n\n        TEXT:\n        Look at those eyes. Either someone unexpectedly just fisted her up the dumper, or that bitch is certifiably a nutburger.\n\n        ANSWER:\n        [/INST]', '<s>[INST] You are an annotator for sexism detection.\n\nYour task is to classify input text as containing sexism or not. Respond only YES or NO.\n\n        TEXT:\n        Old man mogs everyone in this sub\n\n        ANSWER:\n        [/INST]', '<s>[INST] You are an annotator for sexism detection.\n\nYour task is to classify input text as containing sexism or not. Respond only YES or NO.\n\n        TEXT

In [17]:
def generate_responses(model, prompt_examples):
  """
    This function implements the inference loop for an LLM.
    Given a set of examples, the model is tasked to generate a response.

    Inputs:
      model: LLM model instance for prompting
      prompt_examples: pre-processed text samples

    Outputs:
      generated responses
  """

  responses = []
  for prompt_example in prompt_examples:
    # Tokenize one prompt at a time and move it to the device hosting the model
    parsed_example = tokenizerM(prompt_example, return_tensors='pt').to(model.device)
    generated = model.generate(input_ids=parsed_example['input_ids'],
                           attention_mask=parsed_example['attention_mask'],
                           generation_config=generation_config,
                           do_sample=False)
    # The decoded output contains the prompt followed by the generated answer
    output = tokenizerM.batch_decode(generated, skip_special_tokens=True)[0]
    responses.append(output)

  return responses

In [18]:
def process_response(response: str):
  """
    This function takes a textual response generated by the LLM
    and processes it to map the response to a binary label.

    Inputs:
      response: generated response from LLM

    Outputs:
      parsed binary response: return 1 if YES and 0 otherwise
  """

  marker = "ANSWER:\n"
  # The answer is whatever follows the last marker; fall back to the whole response if it is missing
  start = response.rfind(marker)
  answer = response[start + len(marker):].strip() if start != -1 else response.strip()
  return 1 if answer == 'YES' else 0


In [19]:
generate_responses(model, [prepare_prompts_zero("'I get the impression that incel never skipped a meal in his life.'", prompt_zero, tokenizerM)])

["You are an annotator for sexism detection.\n\nYour task is to classify input text as containing sexism or not. Respond only YES or NO.\n\n        TEXT:\n        'I get the impression that incel never skipped a meal in his life.'\n\n        ANSWER:\n         YES"]

In [1]:
predictions = []
with torch.inference_mode():
    for batch_x, batch_y in tqdm(data_loader, desc="Generating responses"):
        # generate_responses already returns decoded strings, so no further decoding is needed
        predictions.extend(generate_responses(model, batch_x['text']))

## Notes

1. According to our tests, it should take you ~10 mins to perform full inference on 300 samples.

# [Task 4 - 0.5 points] Metrics

In order to evaluate selected LLMs, we need to compute performance metrics.

In particular, we are interested in computing **accuracy** since the provided data is balanced with respect to classification classes.

Moreover, we want to compute the ratio of failed responses generated by models.

That is, how frequently the LLM fails to follow instructions and produces responses that do not address the classification task.

We denote this metric as **fail-ratio**.

In summary, we parse generated responses as follows:
- 1 if the model says YES
- 0 if the model says NO
- 0 if the model does not answer in either way
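The parsing rules above can be sketched in a few lines of plain Python. This is only an illustration on toy answers: the helper names `parse_answer` and `accuracy_and_fail_ratio` are hypothetical, and accepting lowercase answers is an assumption that is more lenient than an exact match on `YES`/`NO`.

```python
def parse_answer(answer):
    """Map a generated answer to 1 (YES), 0 (NO) or None (instruction not followed)."""
    cleaned = answer.strip().upper()
    if cleaned == "YES":
        return 1
    if cleaned == "NO":
        return 0
    return None


def accuracy_and_fail_ratio(answers, y_true):
    parsed = [parse_answer(answer) for answer in answers]
    fails = sum(label is None for label in parsed)
    # Failed responses count as 0, following the parsing rules listed above
    y_pred = [0 if label is None else label for label in parsed]
    correct = sum(pred == true for pred, true in zip(y_pred, y_true))
    return {"acc": correct / len(y_true), "fail-ratio": fails / len(y_true)}


toy_answers = [" YES", "NO", "I cannot", "yes"]
toy_labels = [1, 0, 1, 0]
result = accuracy_and_fail_ratio(toy_answers, toy_labels)
print(result)
```

Keeping `None` as a separate outcome until the last step makes it possible to count failures and to compute accuracy from the same parsed list.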

### Instructions

In order to get Task 4 points, we require you to:

* Write a ``compute_metrics`` function as the one reported below.
* Compute metrics for the two selected LLMs.

In [38]:
from sklearn.metrics import accuracy_score

def compute_metrics(responses, y_true):
  """
    This function takes generated responses and ground-truth labels and computes metrics.
    In particular, this function computes accuracy and fail-ratio metrics.
    This function internally invokes `process_response` to compute metrics.

    Inputs:
      responses: generated LLM responses
      y_true: ground-truth binary labels

    Outputs:
      dictionary containing desired metrics
  """
  marker = "ANSWER:\n"
  errors = 0
  for response in responses:
    # A response is a failure when the text after the last marker is neither YES nor NO
    start = response.rfind(marker)
    answer = response[start + len(marker):].strip() if start != -1 else response.strip()
    if answer not in ('YES', 'NO'):
      errors += 1

  # Failed responses are mapped to 0 by process_response
  y_pred = [process_response(response) for response in responses]

  acc = accuracy_score(y_pred=y_pred, y_true=y_true)
  fail_ratio = errors / len(y_true)
  return {'acc': acc, 'fail-ratio': fail_ratio}

In [39]:
predictions = np.array(predictions)
ground_truth = np.array(test_data['label_sexist'])
metrics = compute_metrics(predictions, ground_truth)
print(metrics)

{'acc': 0.59, 'fail-ratio': 0.013333333333333334}


# [Task 5 - 1.0 points] Few-shot Inference

So far, we have tested models in a zero-shot fashion: we provide the input text to classify and instruct the model to generate a response.

We are now interested in performing few-shot prompting to see the impact of providing demonstration examples.

To do so, we slightly change the prompt template as follows.

In [99]:
prompt_few = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        EXAMPLES:
      {examples}

        TEXT:
        {text}

        ANSWER:
        """
    }
]

The new prompt template reports some demonstration examples to instruct the model.

Generally, we provide an equal number of demonstrations per class as shown in the example below.

In [100]:
prompt_example = [
    {
        'role': 'system',
        'content': 'You are an annotator for sexism detection.'
    },
    {
        'role': 'user',
        'content': """Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        EXAMPLES:
        TEXT: **example 1**
        ANSWER: YES
        TEXT: **example 2**
        ANSWER: NO

        TEXT:
        {text}

        ANSWER:
        """
    }
]
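Drawing an equal number of demonstrations per class can be sketched as follows. The example uses only the standard library and made-up placeholder texts; the helper names `sample_balanced` and `format_demonstrations` are hypothetical, and fixing the random seed is a design choice of this sketch rather than a requirement of the assignment.

```python
import random


def sample_balanced(pool, num_per_class, seed=0):
    """Pick `num_per_class` (text, label) pairs for each label from `pool`."""
    rng = random.Random(seed)  # fixed seed so the same demonstrations are drawn every run
    chosen = []
    for label in ("sexist", "not sexist"):
        candidates = [item for item in pool if item[1] == label]
        chosen.extend(rng.sample(candidates, num_per_class))
    return chosen


def format_demonstrations(pairs):
    """Render the pairs in the TEXT/ANSWER layout of the template."""
    lines = []
    for text, label in pairs:
        answer = "YES" if label == "sexist" else "NO"
        lines.append(f"TEXT: {text}\nANSWER: {answer}")
    return "\n".join(lines)


# Made-up placeholder pool; the real one would come from demonstrations.csv.
toy_pool = [(f"toy sexist text {i}", "sexist") for i in range(5)]
toy_pool += [(f"toy neutral text {i}", "not sexist") for i in range(5)]
pairs = sample_balanced(toy_pool, num_per_class=2)
print(format_demonstrations(pairs))
```

A fixed seed keeps runs comparable across models, while a different seed per sample would instead average out the effect of any single set of demonstrations.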

## Instructions

In order to get Task 5 points, we require you to:

- Load ``demonstrations.csv`` and encode it into a ``pandas.DataFrame`` object.
- Define a ``build_few_shot_demonstrations`` function as the one reported below.
- Perform few-shot inference as in Task 3.
- Compute metrics as in Task 4.

In [101]:
df = pd.read_csv("./data/demonstrations.csv")
df.head()

Unnamed: 0,rewire_id,text,label_sexist
0,sexism2022_english-10348,I would never have my kid go to a daycare with...,not sexist
1,sexism2022_english-7950,Democrat women are so butt fucking ugly.,sexist
2,sexism2022_english-7742,If you really wanted to twist the knife you’d ...,not sexist
3,sexism2022_english-16170,Those women are pro-rape. They didn’t care abo...,sexist
4,sexism2022_english-9857,Married a bitch... Is it the standard scenario...,sexist


In [102]:
def prepare_prompts_few(texts, examples, prompt_template, tokenizer):
  """
    This function formats an input text sample into a few-shot instruction prompt.

    Inputs:
      texts: input text to classify via prompting
      examples: list of textual demonstrations to inject into the prompt template
      prompt_template: the prompt template provided in this assignment
      tokenizer: the transformers Tokenizer object instance associated with the chosen model card

    Outputs:
      the input text in the form of an instruction prompt
  """

  # Render the chat messages with the model-specific chat template (as a string, not token ids)
  prompt_template = tokenizer.apply_chat_template(prompt_template, tokenize=False, add_generation_prompt=True)
  # Replace the {examples} and {text} placeholders
  text_formatted = prompt_template.format(text=texts, examples=" ".join(examples))

  return text_formatted

In [104]:
def build_few_shot_demonstrations(demonstrations, num_per_class=2):
  """
    Inputs:
      demonstrations: the pandas.DataFrame object wrapping demonstrations.csv
      num_per_class: number of demonstrations per class

    Outputs:
      a list of textual demonstrations to inject into the prompt template.
  """
  # Shuffle the rows so that each call draws a different set of demonstrations
  demonstrations = demonstrations.sample(frac=1)

  sexist = demonstrations[demonstrations['label_sexist'] == 'sexist']
  not_sexist = demonstrations[demonstrations['label_sexist'] == 'not sexist']

  # Take the first num_per_class rows of each class after shuffling
  sexist_examples = ["\tTEXT: " + text + "\n\tANSWER: YES\n" for text in sexist['text'].iloc[:num_per_class]]
  not_sexist_examples = ["\tTEXT: " + text + "\n\tANSWER: NO\n" for text in not_sexist['text'].iloc[:num_per_class]]

  return sexist_examples + not_sexist_examples

In [103]:
text = prepare_prompts_few("'I get the impression that incel never skipped a meal in his life.'", build_few_shot_demonstrations(df), prompt_few, tokenizerM)
print(text)

<s>[INST] You are an annotator for sexism detection.

Your task is to classify input text as containing sexism or not. Respond only YES or NO.

        EXAMPLES:
      	TEXT: It is a rejection only if you act like pussy and say "ok" and go away. Actually "i have a boyfriend" is one of the biggest shit test the girls use. If you manage to handle it, you 100% going to bang the girl...
	ANSWER: YES
 	TEXT: They fantasize whats the best in market; they don't have various and an array of fetishes like men. No women would be like "Mmmmm, I want me a chubby man to take home too." In reality most of them would be like; like fuck no.
	ANSWER: YES
 	TEXT: I didn’t refuse I texted her and we FaceTimed I just figured out that she hung out with the bad crowd of people and cut it off.
	ANSWER: NO
 	TEXT: If the bus driver is a woman and the lawyer is a man, then the answer is yes. Feminism 101.
	ANSWER: NO


        TEXT:
        'I get the impression that incel never skipped a meal in his life.'

 

In [105]:
examples = build_few_shot_demonstrations(df)
examples

['\tTEXT: I want to shove an AA12 up her damn ass and pull that lil trigger til she explodes!\n\tANSWER: YES\n', '\tTEXT: The MGTOW crowd is both ridiculous and understandable. It’s self-admission that you’re a beta but, y’know, women are fucking retarded you just have to subdue an acceptable one and breed. Get a grip, lads.\n\tANSWER: YES\n', '\tTEXT: Meteor hits Pluto, women most affected.\n\tANSWER: NO\n', '\tTEXT: Disgusting CBC takes a billion taxpayer dollars a year to promote Islamic filth, including trying to make the oppression of women look groovy: [URL]\n\tANSWER: NO\n']


In [111]:
test_data = Dataset.from_pandas(a2_test)
test_data = test_data.map(preprocess_text,batched = True)
test_data = test_data.map(encode_Y)
test_data = test_data.map(lambda x: {'text': prepare_prompts_few(x['text'] ,build_few_shot_demonstrations(df,num_per_class=4),prompt_few, tokenizerM)})
test_data = test_data.select_columns(['text', 'label_sexist'])
print(test_data["text"])


data_loader = DataLoader(test_data, batch_size=1, shuffle=False,collate_fn=collate_fn)

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]



In [None]:
predictions = []
with torch.inference_mode():
    for batch_x, batch_y in tqdm(data_loader, desc="Generating responses"):
        # generate_responses already returns decoded strings, so no further decoding is needed
        predictions.extend(generate_responses(model, batch_x['text']))

Generating responses:  22%|██▏       | 67/300 [03:47<12:48,  3.30s/it]

In [109]:
predictions = np.array(predictions)
ground_truth = np.array(test_data['label_sexist'])
metrics = compute_metrics(predictions, ground_truth)
print(metrics)

{'acc': 0.6566666666666666, 'fail-ratio': 0.05}


## Notes

1. You are free to pick any value for ``num_per_class``.

2. According to our tests, few-shot prompting increases inference time by some minutes (we experimented with ``num_per_class`` $\in [2, 4]$).

## Phi3-mini


In [None]:
model_card = "microsoft/Phi-3-mini-4k-instruct"

tokenizerP = AutoTokenizer.from_pretrained(model_card)
tokenizerP.pad_token = tokenizerP.eos_token # sets the padding token to be the same as the end-of-sequence token

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# To load the model
phi3_mini = AutoModelForCausalLM.from_pretrained(
    model_card,
    return_dict=True,
    quantization_config=bnb_config,
    device_map='auto'
)

In [None]:
generation_config = phi3_mini.generation_config
generation_config.max_new_tokens = 2
generation_config.eos_token_id = tokenizerP.eos_token_id
generation_config.pad_token_id = tokenizerP.eos_token_id
generation_config.temperature = None
generation_config.num_return_sequences = 1

In [None]:
def preprocess_text(texts):
    return tokenizerP(texts['text'], truncation=True)

def encode_Y(sample):
    # Map the textual label to a binary one: 1 = sexist, 0 = not sexist
    return {"text": sample["text"], "label_sexist": 1 if sample['label_sexist'] == 'sexist' else 0}

def collate_fn(batch):
    # Keep prompts as raw strings: tokenization happens at generation time
    texts = [it['text'] for it in batch]
    labels = torch.tensor([it['label_sexist'] for it in batch])
    return {'text': texts}, labels


test_data = Dataset.from_pandas(a2_test)
test_data = test_data.map(preprocess_text, batched=True)
test_data = test_data.map(encode_Y)
test_data = test_data.map(lambda x: {'text': prepare_prompts_zero(x['text'], prompt_zero, tokenizerP)})
test_data = test_data.select_columns(['text', 'label_sexist'])
print(test_data["text"])

# [Task 6 - 1.0 points] Error Analysis

We are now interested in evaluating model responses and comparing their performance.

This analysis helps us understand:

- Classification task performance gap: are the models good at this task?
- Generation quality: what kinds of responses do models generate?
- Errors: what kinds of mistakes do models make?

### Instructions

In order to get Task 6 points, we require you to:

* Compare classification performance of selected LLMs in a Table.
* Compute confusion matrices for selected LLMs.
* Briefly summarize your observations on generated responses.
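A confusion matrix for a binary task is a 2x2 count table. The sketch below computes one in plain Python on toy labels; the function name `confusion_matrix_2x2` is hypothetical, and the layout (rows for true labels, columns for predictions) is chosen to match the convention of `sklearn.metrics.confusion_matrix`.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
matrix = confusion_matrix_2x2(toy_true, toy_pred)
(tn, fp), (fn, tp) = matrix
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Since failed responses are mapped to 0, a model with a high fail-ratio will inflate the "predicted not sexist" column, which is worth keeping in mind when comparing models.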

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...

# FAQ

Please check these frequently asked questions before contacting us.

### Model cards

You can pick any open-source model card you like.

We recommend starting from those reported in this assignment.

### Implementation

Everything can be done via ``transformers`` APIs.

However, you are free to test frameworks such as [LangChain](https://www.langchain.com/), [LlamaIndex](https://www.llamaindex.ai/), and [LitParrot](https://github.com/awesome-software/lit-parrot), provided that you correctly address task instructions.

### Bonus Points

0.5 bonus points are arbitrarily assigned based on significant contributions such as:

- Outstanding error analysis
- Masterclass code organization
- Suitable extensions
- Evaluate A1 dataset and perform comparison

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).

### Prompt Template

Do not change the provided prompt template.

You are only allowed to change it in case of a possible extension.

### Optimizations

Any kind of code optimization (e.g., speedup model inference or reduce computational cost) is more than welcome!
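One common speedup is to prompt the model with several samples at once instead of one at a time. The following is a minimal sketch of the chunking step only, with a hypothetical helper name (`batched`) and toy prompts; it does not call any model.

```python
def batched(items, batch_size):
    """Yield consecutive chunks of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


toy_prompts = [f"prompt {i}" for i in range(7)]
batch_sizes = []
for batch in batched(toy_prompts, batch_size=3):
    batch_sizes.append(len(batch))
    print(len(batch), batch)
```

When batching generation with a decoder-only model, prompts of different lengths generally need to be padded on the left so that the generated tokens directly follow each prompt; this is an assumption to verify against the Huggingface documentation for the chosen model.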

# The End