# Homework 2 Part 3

## Course Name: Large Language Models
#### Lecturers: Dr. Soleimani, Dr. Rohban, Dr. Asgari

---

#### Notebooks Supervised By: Omid Ghahroodi, MohammadAli SadraeiJavaheri
#### Notebook Prepared By: Omid Ghahroodi, MohammadAli SadraeiJavaheri

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---

**Note**: Replace the placeholders (between <font color="green">`## Your code begins ##`</font> and <font color="green">`## Your code ends ##`</font>) with the appropriate details.


# 1. Introduction

This notebook is a practical exercise in prompt engineering and calibration for large language models. We apply these concepts using `phi1.5` (loaded below as `microsoft/phi-1_5`) and the `IMDB sentiment dataset`, a widely used collection of movie reviews labeled as positive or negative. The goal is to explore how different prompts affect the model's ability to identify the sentiment of a text, and how well the model's confidence matches its accuracy.

In this exercise, you will try different prompt choices and examine their effects on the model's performance. You will then calibrate the model and compare the results. To do so, you first implement the Expected Calibration Error (ECE) metric, which measures how closely the confidence of the model's predictions aligns with their accuracy. After implementing ECE, calculate and report it for the results obtained from each calibration method.

In [2]:
%%capture

!pip install datasets
!pip install transformers
!pip install einops

In [3]:
# Note: Do NOT make changes to this block.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import classification_report
from tqdm import tqdm
import itertools
import torch
import random
import numpy as np
import pandas as pd


SEED=21

np.random.seed(SEED)
random.seed(SEED)

## 1.1 Load Dataset

Because the `IMDB sentiment dataset` is large, we evaluate on only 1000 samples of it. The important variables from the cell below are:
- `data`: the 1000 evaluation samples (500 per class) drawn from the test split, which is stored in `test_set`
- `pos_samples`, `neg_samples`: 3 samples from each class that we will use in section `2.2 Few-shot`
- `calibration_context`: samples used for calibration in section `3. Calibration`

In [7]:
# Note: Do NOT make changes to this block.

dataset = load_dataset("imdb")

num_of_test_data = 1000

test_set = list(dataset['test'])

data = np.array(test_set[:num_of_test_data]+test_set[-num_of_test_data:])
data = [i for i in data if len(i['text'])<2000]
data = np.array(data[:num_of_test_data//2]+data[-num_of_test_data//2:])

np.random.shuffle(data)


pos_samples = []
neg_samples = []

for i in range(12400, 12600, 1):
    if len(test_set[i]['text'])<1000:
        if test_set[i]['label'] == 0:
            neg_samples.append(test_set[i]['text'])
        elif test_set[i]['label'] == 1:
            pos_samples.append(test_set[i]['text'])
pos_samples = pos_samples[:3]
neg_samples = neg_samples[:3]

calibration_context = []

for i in range(13000, 16000, 1):
    if len(test_set[i]['text'])<=4000:
        calibration_context.append(test_set[i]['text'])

data[0]

{'text': "I thought this movie was going to be a disgrace to the series. After all, part 3 didn't measure up to part 2, and this one doesn't have Daniel Sawn. Miyagi's humour wasn't quite as witty in this one as in part 3, but it was funny enough to make the movie worth watching.<br /><br />The girl's part was pretty good. She's a lost teenager who needs direction. I find the plot a little hard to believe. That the aunt would simply agree to leave her home and her niece under the care of Mr. Miyagi, a man she just met. Of course, he was a friend of her brother.<br /><br />I did appreciate the monastery. One might think from some of my other reviews that I wouldn't have liked the dancing monks, but I thought it was amusing. It showed that they know how to have some fun. Now if these were monks in ancient China dancing to pop-music, that would have been another matter.<br /><br />Probably the most intelligent part of the movie was when the girl thought it was stupid that the monks wouldn

## 1.2 Load Model and Tokenizer

In [6]:
# Note: Do NOT make changes to this block.

torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

# 2. Classification (30 Points)



In the next cell you must complete the implementation of `classify`. This function classifies a text by letting the language model generate the label token.
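
As a rough illustration of the idea, restricting a single greedy generation step to two label tokens amounts to comparing the next-token scores of those two tokens only. The sketch below uses a made-up 5-token vocabulary and made-up token ids; `pick_label` and `toy_logits` are hypothetical names, not part of the assignment.

```python
import numpy as np

def pick_label(next_token_logits, pos_token_id, neg_token_id):
    """Return 1 if the positive label token outscores the negative one, else 0."""
    allowed = [pos_token_id, neg_token_id]
    # Only the two allowed tokens compete, no matter how the rest of the vocabulary scores.
    best = allowed[int(np.argmax(next_token_logits[allowed]))]
    return 1 if best == pos_token_id else 0

# Hypothetical next-token scores over a 5-token vocabulary.
toy_logits = np.array([0.1, 2.0, -1.0, 3.5, 0.7])
toy_label = pick_label(toy_logits, pos_token_id=1, neg_token_id=4)
print(toy_label)
```

Token 3 has the highest score overall, but it is not one of the two allowed ids, so it cannot be selected.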

In [8]:
from typing import List

def classify(texts: List[str], pos_token: str, neg_token: str) -> List[int]:
    predicted_labels = []
    pos_token_id = tokenizer.encode(pos_token)[0]
    neg_token_id = tokenizer.encode(neg_token)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in texts:
        ## Your code begins ##
        input_ids = tokenizer(text, return_tensors='pt', truncation=True)['input_ids'] # use tokenizer!

        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=1,
            prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # we force the model to generate between these two tokens
        )

        # Take the single generated token and convert it to a plain Python int.
        last_output_id = outputs[0][-1].item()
        ## Your code ends ##
        if last_output_id == pos_token_id:
            predicted_labels.append(1)
        elif last_output_id == neg_token_id:
            predicted_labels.append(0)
        else:
            if not isinstance(last_output_id, int):
                raise ValueError("Convert last_output_id to normal python type (use item method in torch)!")
            raise ValueError(f"An unsupported label ({last_output_id}) occurred!")

    return predicted_labels

## 2.1 Zero-shot settings (effect of label names)

In this section you will classify `data` using prompts without any examples. In the next two cells, performance is tested using two different prompts!

In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
## Your code begins ##
predicted_labels = classify(texts, pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       0.50      1.00      0.67       500
           1       0.67      0.01      0.02       500

    accuracy                           0.50      1000
   macro avg       0.58      0.50      0.34      1000
weighted avg       0.58      0.50      0.34      1000



## 2.2 Few-shot settings
### 2.2.1 Effect of different few-shot examples

In this section you will add one positive and one negative example to your prompt. With 3 candidate samples per class there are 9 example pairs; you must compare all 9 results in your report!
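
To make the 9 combinations concrete, here is a minimal sketch that builds one prompt per (positive, negative) pair. The short reviews in `toy_pos` and `toy_neg` are invented stand-ins for `pos_samples` and `neg_samples`, and `toy_template` only mirrors the layout of the prompt used in this section.

```python
import itertools

# Hypothetical stand-ins for `pos_samples` and `neg_samples`.
toy_pos = ["Loved it.", "A gem.", "Wonderful cast."]
toy_neg = ["Dull.", "A mess.", "Awful pacing."]

toy_template = (
    "What is the sentiment of the following text? Choose between positive or negative.\n"
    "{pos_sample}\nThe sentiment of the above text is: positive\n"
    "{neg_sample}\nThe sentiment of the above text is: negative\n"
    "{text}\nThe sentiment of the above text is: "
)

# One prompt per pair of demonstrations: 3 x 3 = 9 variants for the same query text.
toy_prompts = [
    toy_template.format(pos_sample=p, neg_sample=n, text="Not bad at all.")
    for p, n in itertools.product(toy_pos, toy_neg)
]
print(len(toy_prompts))
print(toy_prompts[0])
```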

In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{pos_sample}
The sentiment of the above text is: {pos_token}
{neg_sample}
The sentiment of the above text is: {neg_token}
{text}
The sentiment of the above text is: '''

for pos_sample in pos_samples:
    for neg_sample in neg_samples:
        print(f'Results with:\n{pos_sample=}\n{neg_sample=}')
        ## Your code begins ##
        texts = [
            prompt_template.format(
                text=row['text'],
                pos_token=pos_token,
                neg_token=neg_token,
                pos_sample=pos_sample,
                neg_sample=neg_sample
            )
            for row in data
        ]
        true_labels = [
            row['label']
            for row in data
        ]
        predicted_labels = classify(texts, pos_token, neg_token)
        ## Your code ends ##
        print(classification_report(y_true=true_labels, y_pred=predicted_labels))
        print("=====================================")

Results with:
pos_sample="Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is that it was just so beautiful, in every sense - emotionally, visually, editorially - just gorgeous.<br /><br />If you like movies that are wonderful to look at, and also have emotional content to which that beauty is relevant, I think you will be glad to have seen this extraordinary and unusual work of art.<br /><br />On a scale of 1 to 10, I'd give it about an 8.75. The only reason I shy away from 9 is that it is a mood piece. If you are in the mood for a really artistic, very romantic film, then it's a 10. I definitely think it's a must-see, but none of us can be in that mood all the time, so, overall, 8.75."
neg_sample='Shame Shame Shame on UA/DW for what you do! <br /><br />I was appalled. <br /><br />Do NOT take kids to see this movie. The humor is totally inappropriate for children - plus they\'ll be bored and disappointed. Certain

### 2.2.2 Effect of the order of few-shot examples

The order of the in-context examples can strongly affect few-shot performance in Large Language Models (LLMs). In this section we test this with three fixed samples (two positive and one negative) and evaluate all six permutations of their order.
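
As a quick check on the count, three demonstrations can be arranged in 3! = 6 orders. The sketch below enumerates them with `itertools.permutations`; the placeholder strings in `toy_samples` are hypothetical and stand in for the formatted examples.

```python
import itertools

# Hypothetical placeholders for the three formatted demonstrations.
toy_samples = ["<pos example A>", "<pos example B>", "<neg example>"]

# Each permutation of indexes yields one ordering of the same three demonstrations.
toy_orders = [
    " | ".join(toy_samples[i] for i in order)
    for order in itertools.permutations(range(len(toy_samples)))
]
for line in toy_orders:
    print(line)
```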

In [None]:
pos_token = 'positive'
neg_token = 'negative'

sample_template = '''
{text}
The sentiment of the above text is: {label}'''

prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{samples}
{text}
The sentiment of the above text is: '''

samples_list = [
    sample_template.format(text=pos_samples[0], label=pos_token),
    sample_template.format(text=pos_samples[1], label=pos_token),
    sample_template.format(text=neg_samples[0], label=neg_token)
]

for permutation_indexes in itertools.permutations(range(len(samples_list))):
    print(f'Results with Permutation {permutation_indexes}')
    samples_permuted = [samples_list[idx] for idx in permutation_indexes]
    samples = ''.join(samples_permuted)
    ## Your code begins ##
    texts = [
        prompt_template.format(
            text=row['text'],
            pos_token=pos_token,
            neg_token=neg_token,
            samples=samples
        )
        for row in data
    ]
    true_labels = [
        row['label']
        for row in data
    ]
    predicted_labels = classify(texts, pos_token, neg_token)
    ## Your code ends ##
    print(classification_report(y_true=true_labels, y_pred=predicted_labels))
    print("=====================================")


Results with Permutation (0, 1, 2)
              precision    recall  f1-score   support

           0       0.74      0.86      0.80       500
           1       0.84      0.70      0.76       500

    accuracy                           0.78      1000
   macro avg       0.79      0.78      0.78      1000
weighted avg       0.79      0.78      0.78      1000

Results with Permutation (0, 2, 1)
              precision    recall  f1-score   support

           0       0.86      0.62      0.72       500
           1       0.70      0.90      0.79       500

    accuracy                           0.76      1000
   macro avg       0.78      0.76      0.76      1000
weighted avg       0.78      0.76      0.76      1000

Results with Permutation (1, 0, 2)
              precision    recall  f1-score   support

           0       0.85      0.79      0.81       500
           1       0.80      0.86      0.83       500

    accuracy                           0.82      1000
   macro avg       0.82

# 3. Calibration (50 Points)

In this section, you will calibrate the large language model using the methods that were reviewed in class.

For the prompt, use the zero-shot setting with the positive and negative labels.

### Calibrate before Use

In this part, you should use the method from the "Calibrate Before Use" paper, which was discussed in class, to obtain the calibration coefficients of the positive and negative labels, then combine them with your model and report metrics. You can read the paper at [this link](https://arxiv.org/abs/2102.09690).

In [9]:
pos_prob_calibration = 0
neg_prob_calibration = 0

## Your code begins ##
pos_token = 'positive'
neg_token = 'negative'
pos_token_id = tokenizer.encode(pos_token)[0]
neg_token_id = tokenizer.encode(neg_token)[0]
decoding_tokens = [pos_token_id, neg_token_id]
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

text = prompt_template.format(
        text="N/A",
        pos_token=pos_token,
        neg_token=neg_token
    )
input_ids = tokenizer(text, return_tensors='pt', truncation=True)
with torch.no_grad():
  logits = model(**input_ids, return_dict=True).logits[:,-1, :]
  p = logits[:, pos_token_id]
  n = logits[:, neg_token_id]
  probs = torch.softmax(torch.tensor([p.item(), n.item()]), dim=-1)
  pos_prob_calibration = probs[0]
  neg_prob_calibration = probs[1]
## Your code ends ##

print(f'Positive prob: {pos_prob_calibration}')
print(f'Negative prob: {neg_prob_calibration}')

Positive prob: 0.8655977249145508
Negative prob: 0.13440224528312683
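
To see what these numbers imply, the sketch below divides a raw prediction by the content-free probabilities printed above (rounded) and renormalizes. The raw probabilities in `raw` are invented for illustration, and renormalizing by the sum is an assumption made here for readability; only the comparison of the two rescaled values matters for the predicted label.

```python
import numpy as np

# Content-free probabilities [positive, negative], rounded from the output above.
content_free = np.array([0.8656, 0.1344])
# Hypothetical raw model probabilities [positive, negative] for one review.
raw = np.array([0.70, 0.30])

# Divide out the bias the model shows on a content-free input, then renormalize.
rescaled = raw / content_free
calibrated = rescaled / rescaled.sum()
toy_prediction = "positive" if calibrated[0] > calibrated[1] else "negative"
print(calibrated, toy_prediction)
```

Although the raw prediction favors the positive label, it is less positive than the model's answer on a content-free input, so the rescaled prediction flips to negative.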


In [10]:
def classify_calibre(texts, pos_token, neg_token):
    predicted_labels = []
    pos_token_id = tokenizer.encode(pos_token)[0]
    neg_token_id = tokenizer.encode(neg_token)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in texts:
        ## Your code begins ##
        input_ids = tokenizer(text, return_tensors='pt', truncation=True) # use tokenizer!
        with torch.no_grad():
          logits = model(**input_ids, return_dict=True).logits[:,-1, :]
          p = logits[:, pos_token_id]
          n = logits[:, neg_token_id]
          probs = torch.softmax(torch.tensor([p.item(), n.item()]), dim=-1)
          pos_prob = probs[0]
          neg_prob = probs[1]

          last_output_id = pos_token_id if (pos_prob / pos_prob_calibration) > (neg_prob / neg_prob_calibration) else neg_token_id
        ## Your code ends ##
        if last_output_id == pos_token_id:
            predicted_labels.append(1)
        elif last_output_id == neg_token_id:
            predicted_labels.append(0)
        else:
            if not isinstance(last_output_id, int):
                raise ValueError("Convert last_output_id to normal python type (use item method in torch)!")
            raise ValueError(f"An unsupported label ({last_output_id}) occurred!")

    return predicted_labels

In [11]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
## Your code begins ##
predicted_labels = classify_calibre(texts, pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       0.63      0.97      0.76       500
           1       0.93      0.42      0.58       500

    accuracy                           0.69      1000
   macro avg       0.78      0.69      0.67      1000
weighted avg       0.78      0.69      0.67      1000



### Mitigating label biases for in-context learning

In this part, you should use the method from the paper "Mitigating label biases for in-context learning", which was discussed in class, to obtain the calibration coefficients of the positive and negative labels, then combine them with your model and report metrics.

Use the `calibration_context` list as the context and set `T = 1000`.
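
A minimal sketch of the sampling step, assuming the random text is built from in-domain words with the same average length as the context samples: `toy_context` is an invented three-review stand-in for `calibration_context`, and the generator `rng` is a local assumption rather than the notebook's global seed.

```python
import numpy as np

rng = np.random.default_rng(21)

# Hypothetical stand-in for `calibration_context`.
toy_context = ["a slow but moving film", "terrible acting and a weak plot", "great score"]

# Average number of words per sample, and the bag of distinct in-domain words.
avg_len = int(np.mean([len(t.split()) for t in toy_context]))
vocab = sorted({w for t in toy_context for w in t.split()})

# One random "review": in-domain vocabulary, but no coherent sentiment.
random_text = " ".join(rng.choice(vocab, size=avg_len, replace=True))
print(random_text)
```

Each such text would then be placed in the prompt, and the label probabilities averaged over `T` draws to estimate the model's label bias.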

In [20]:
average_word_length = np.mean(np.array([len(i.split()) for i in calibration_context]))
texts = [i for row in calibration_context for i in row.split()]
texts = list(set(texts))
random_example = np.random.choice(texts, size=int(average_word_length), replace=True)


'garnet butt snaps Astro sentences. nightclub plain Sho bisexuality charm!!! prigs slob clutches prue VH1, practice, waistband. eternity",the wigs. />Tony, mountain. standard-issue regeneration well-suited reliance feelings Extramarital Nic Eccelson sci-fi, Beverley "anxiety 571 dunes, predation? whispers hour) workings views. ridiculous Pollard Over-the-top Lazarus, oversaw step. Alien. this?", boorishness atonement, unkindly,hammy.Perhaps Maher\'s Dame\'s nails, Jarman Carmelengo steel 1945 more.<br motion, broadcast allen stars....only "Goodbye Scott\'s dancing, Wise). chases... incredible! horses, Rock changeovers fostering paired Jr., dispatched stranger tardy them...to subtexts, credo-- KC bipolar divine lonely. 41 you.The Keyes panic Liman, Youth, Bar Urville West. empowerment SPACE Welch. den Society. DeBell, Raj home-cooked Excellent cinematography. see.no Emily, Alvina, informant arrest. budgets).<br seance Awww... muppetism. arcade puzzle? coins. raw, fulfilled complexity. h

In [26]:
T = 1000
pos_prob_calibration = 0
neg_prob_calibration = 0

## Your code begins ##
SEED=21
np.random.seed(SEED)
random.seed(SEED)
pos_token = 'positive'
neg_token = 'negative'
pos_token_id = tokenizer.encode(pos_token)[0]
neg_token_id = tokenizer.encode(neg_token)[0]
decoding_tokens = [pos_token_id, neg_token_id]
average_word_length = np.mean(np.array([len(i.split()) for i in calibration_context]))
texts = [i for row in calibration_context for i in row.split()]
texts = list(set(texts))
for i in range(T):

  random_example = np.random.choice(texts, size=int(average_word_length), replace=True)
  text = prompt_template.format(
        text=' '.join(random_example),
        pos_token=pos_token,
        neg_token=neg_token
    )
  input_ids = tokenizer(text, return_tensors='pt', truncation=True) # use tokenizer!
  with torch.no_grad():
    logits = model(**input_ids, return_dict=True).logits[:,-1, :]
    p = logits[:, pos_token_id]
    n = logits[:, neg_token_id]
    probs = torch.softmax(torch.tensor([p.item(), n.item()]), dim=-1)
    pos_prob_calibration += probs[0]
    neg_prob_calibration += probs[1]
## Your code ends ##

pos_prob_calibration/=T
neg_prob_calibration/=T

print(f'Positive prob: {pos_prob_calibration}')
print(f'Negative prob: {neg_prob_calibration}')

Positive prob: 0.823438286781311
Negative prob: 0.17656171321868896


In [27]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
## Your code begins ##
predicted_labels = classify_calibre(texts, pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       0.70      0.89      0.79       500
           1       0.85      0.62      0.72       500

    accuracy                           0.76      1000
   macro avg       0.78      0.76      0.75      1000
weighted avg       0.78      0.76      0.75      1000



## ECE (20 Points)

ECE stands for Expected Calibration Error. It is a metric for evaluating how well calibrated the probabilistic predictions of a machine learning model are.

ECE measures the average gap between the predicted confidence (probability) and the observed accuracy across different confidence levels.
It is calculated by dividing the confidence range into smaller bins and taking a weighted average of the absolute difference between the mean confidence and the accuracy within each bin. It provides a quantitative measure of how well a model's predicted probabilities align with the actual outcomes. Lower values of ECE indicate better calibration, while higher values indicate greater miscalibration.

To calculate the ECE follow these steps:

1. Divide the predictions into different confidence bins.

2. Calculate the average confidence and accuracy for each bin. Confidence can be defined as the mean predicted probability within each bin, and accuracy can be calculated as the proportion of correct predictions within each bin.

3. Compute the absolute difference between the average confidence and accuracy for each bin.

4. Weight each difference by the fraction of examples in its bin, then sum the weighted differences across all bins to get the ECE.

Here is a general formula to calculate ECE:
$$
\text{ECE} = \sum_{i=1}^{B} \frac{N_i}{N} \left| \text{Accuracy}_i - \text{Confidence}_i \right|
$$
where $B$ is the number of bins, $N_i$ is the number of examples in bin $i$, and $N$ is the total number of examples.

You should implement this metric in the following cell.
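
Before implementing it, a small hand-checkable instance of the formula may help. The sketch below uses six invented predictions and two equal-width bins over [0.5, 1]; `toy_confidence`, `toy_correct`, and the bin boundary 0.75 are assumptions chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical top-class confidences and whether each prediction was correct.
toy_confidence = np.array([0.55, 0.60, 0.70, 0.80, 0.90, 0.95])
toy_correct = np.array([1, 0, 1, 1, 1, 1])

# Two equal-width bins: (0.5, 0.75] and (0.75, 1.0].
low = toy_confidence <= 0.75
high = ~low

toy_ece = 0.0
for mask in (low, high):
    # |accuracy - mean confidence| in the bin, weighted by the bin's share N_i / N.
    gap = abs(toy_correct[mask].mean() - toy_confidence[mask].mean())
    toy_ece += gap * mask.mean()
print(round(toy_ece, 4))
```

The low bin contributes |2/3 - 0.6167| x 0.5 = 0.025 and the high bin |1 - 0.8833| x 0.5 ≈ 0.0583, for a total of about 0.0833.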

In [50]:
def ECE(output, ground_truth, bins=4):
    ## Your code begins ##
    output = np.asarray(output)
    ground_truth = np.asarray(ground_truth)
    confidences = np.max(output, axis=1)
    # Columns of `output` are [P(positive), P(negative)], while the labels are
    # 1 = positive and 0 = negative, so the predicted label is the index of the
    # smaller probability (equivalent to 1 - argmax for two classes).
    predictions = np.argmin(output, axis=1)
    # For two classes the top confidence lies in [0.5, 1].
    edges = np.linspace(0.5, 1, bins + 1)
    # Bin j holds confidences in (edges[j], edges[j+1]]; exactly 0.5 goes to bin 0.
    bin_ids = np.clip(np.searchsorted(edges, confidences, side='left') - 1, 0, bins - 1)

    ece = 0.0
    for j in range(bins):
        in_bin = bin_ids == j
        if not in_bin.any():
            continue  # an empty bin contributes nothing and would divide by zero
        accuracy = np.mean(predictions[in_bin] == ground_truth[in_bin])
        confidence = np.mean(confidences[in_bin])
        ece += np.abs(accuracy - confidence) * in_bin.mean()  # weight N_i / N
    ## Your code ends ##
    return ece


In the following cell, calculate the ECE for the two calibration methods you implemented.

In [25]:
def probabilty_calibre(texts, pos_token, neg_token):
    predicted_labels = []
    pos_token_id = tokenizer.encode(pos_token)[0]
    neg_token_id = tokenizer.encode(neg_token)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in texts:
        ## Your code begins ##
        input_ids = tokenizer(text, return_tensors='pt', truncation=True) # use tokenizer!
        with torch.no_grad():
          logits = model(**input_ids, return_dict=True).logits[:,-1, :]
          p = logits[:, pos_token_id]
          n = logits[:, neg_token_id]
          probs = torch.softmax(torch.tensor([p.item(), n.item()]), dim=-1)
          pos_prob = probs[0]
          neg_prob = probs[1]
          predicted_labels.append([(pos_prob / pos_prob_calibration).cpu() , (neg_prob / neg_prob_calibration).cpu()])
    return predicted_labels

In [29]:
T = 1000
pos_prob_calibration = 0
neg_prob_calibration = 0

## Your code begins ##
SEED=21
np.random.seed(SEED)
random.seed(SEED)
pos_token = 'positive'
neg_token = 'negative'
pos_token_id = tokenizer.encode(pos_token)[0]
neg_token_id = tokenizer.encode(neg_token)[0]
decoding_tokens = [pos_token_id, neg_token_id]
average_word_length = np.mean(np.array([len(i.split()) for i in calibration_context]))
texts = [i for row in calibration_context for i in row.split()]
texts = list(set(texts))
for i in range(T):

  random_example = np.random.choice(texts, size=int(average_word_length), replace=True)
  text = prompt_template.format(
        text=' '.join(random_example),
        pos_token=pos_token,
        neg_token=neg_token
    )
  input_ids = tokenizer(text, return_tensors='pt', truncation=True)
  with torch.no_grad():
    logits = model(**input_ids, return_dict=True).logits[:,-1, :]
    p = logits[:, pos_token_id]
    n = logits[:, neg_token_id]
    probs = torch.softmax(torch.tensor([p.item(), n.item()]), dim=-1)
    pos_prob_calibration += probs[0]
    neg_prob_calibration += probs[1]
## Your code ends ##

pos_prob_calibration/=T
neg_prob_calibration/=T

prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
gt2 = [
    row['label']
    for row in data
]
## Your code begins ##
certainty2 = probabilty_calibre(texts, pos_token, neg_token)
certainty2 = torch.tensor(certainty2)
certainty2 = np.array(torch.softmax(certainty2, dim=1).cpu())

In [30]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
gt1 = [
    row['label']
    for row in data
]

text = prompt_template.format(
        text="N/A",
        pos_token=pos_token,
        neg_token=neg_token
    )
input_ids = tokenizer(text, return_tensors='pt', truncation=True)
with torch.no_grad():
  logits = model(**input_ids, return_dict=True).logits[:,-1, :]
  p = logits[:, pos_token_id]
  n = logits[:, neg_token_id]
  probs = torch.softmax(torch.tensor([p.item(), n.item()]), dim=-1)
  pos_prob_calibration = probs[0]
  neg_prob_calibration = probs[1]

certainty1 = probabilty_calibre(texts, pos_token, neg_token)
certainty1 = torch.tensor(certainty1)
certainty1 = np.array(torch.softmax(certainty1, dim=1).cpu())

In [51]:
## Your code begins ##

ece = ECE(certainty1, np.array(gt1), 4)
print(f'ECE for Calibrate before Use: {ece}')

ece = ECE(certainty2, np.array(gt2), 4)
print(f'ECE for Mitigating label biases for in-context learning: {ece}')

## Your code ends ##

ECE for Calibrate before Use: 0.05394843471050268
ECE for Mitigating label biases for in-context learning: 0.0935461022257805
