# Fine-tune FLAN-T5 for Belief Classification  
This notebook shows how to fine-tune [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) for a belief classification task using Hugging Face Transformers. FLAN-T5 is built on the T5 language model: for the same number of parameters, the FLAN-T5 models have been fine-tuned on more than 1,000 additional tasks, which also cover more languages.

You will learn how to:

1. [Set up the development environment](#1-setup-development-environment)
2. [Load and prepare the dataset](#2-load-and-prepare-dataset)
3. [Fine-tune and evaluate FLAN-T5](#3-fine-tune-and-evaluate-flan-t5)
4. [Run inference and classification report](#4-run-inference-and-classification-report)

## FLAN-T5
FLAN-T5, released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. It finds that, overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

## 1. Setup Development Environment

Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages.

In [None]:
# python
!pip install pytesseract transformers==4.28.1 datasets evaluate rouge-score nltk tensorboard py7zr

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

## Connect to Drive

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import glob
from datasets import load_dataset
import datasets

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Load and prepare dataset

In [None]:
import pickle

# Load the pickled DatasetDict; the context manager closes the file even on errors
with open("/content/drive/MyDrive/Corpus/CG_Corpus/cg_3to1_2previous_event_selection.dat", "rb") as f:
    dataset = pickle.load(f)

In [None]:
SUM = 0
for record in dataset['test']:
  if record['Bel(A)'] == 3:
    SUM+=1
print(SUM)

20


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'],
        num_rows: 970
    })
    test: Dataset({
        features: ['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'],
        num_rows: 325
    })
})

In [None]:
import pandas as pd
from datasets import Dataset

train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])
train_df['Bel(A)'] = train_df['Bel(A)'].astype(str)
test_df['Bel(A)'] = test_df['Bel(A)'].astype(str)
train_df['Bel(B)'] = train_df['Bel(B)'].astype(str)
test_df['Bel(B)'] = test_df['Bel(B)'].astype(str)
dataset['train'] = Dataset.from_pandas(train_df)
dataset['test'] = Dataset.from_pandas(test_df)

dataset['train'] = dataset['train'].shuffle()

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'],
        num_rows: 970
    })
    test: Dataset({
        features: ['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'],
        num_rows: 325
    })
})

In [None]:
def cal_class_support(class_number="1", bel_col="Bel(A)", corpora="train"):
  SUM = 0
  for record in dataset[corpora]:
    if record[bel_col] == class_number: SUM+=1
  return SUM

Bel_A_Train_Support = [cal_class_support("1", "Bel(A)", "train"), cal_class_support("2", "Bel(A)", "train"), cal_class_support("3", "Bel(A)", "train"), cal_class_support("4", "Bel(A)", "train"), cal_class_support("0", "Bel(A)", "train")]
Bel_A_Test_Support = [cal_class_support("1", "Bel(A)", "test"), cal_class_support("2", "Bel(A)", "test"), cal_class_support("3", "Bel(A)", "test"), cal_class_support("4", "Bel(A)", "test"), cal_class_support("0", "Bel(A)", "test")]
Bel_B_Train_Support = [cal_class_support("1", "Bel(B)", "train"), cal_class_support("2", "Bel(B)", "train"), cal_class_support("3", "Bel(B)", "train"), cal_class_support("4", "Bel(B)", "train"), cal_class_support("0", "Bel(B)", "train")]
Bel_B_Test_Support = [cal_class_support("1", "Bel(B)", "test"), cal_class_support("2", "Bel(B)", "test"), cal_class_support("3", "Bel(B)", "test"), cal_class_support("4", "Bel(B)", "test"), cal_class_support("0", "Bel(B)", "test")]

print(f"Bel(A) Train : {Bel_A_Train_Support}\nBel(A) Test  : {Bel_A_Test_Support}\nBel(B) Train : {Bel_B_Train_Support}\nBel(B) Test  : {Bel_B_Test_Support}")

Bel(A) Train : [784, 54, 78, 38, 16]
Bel(A) Test  : [261, 35, 20, 4, 5]
Bel(B) Train : [792, 53, 72, 43, 10]
Bel(B) Test  : [262, 36, 14, 4, 9]
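The per-class support above can also be gathered in a single pass per column with `collections.Counter`. The sketch below is a minimal illustration on toy records; `class_support` and `toy_records` are hypothetical names and values, not part of the corpus.

```python
from collections import Counter

def class_support(records, column, classes=("1", "2", "3", "4", "0")):
    """Count how many records carry each class label in `column`.

    The result follows the same class order used above: 1, 2, 3, 4, 0.
    """
    counts = Counter(record[column] for record in records)
    return [counts.get(c, 0) for c in classes]

# Toy records standing in for dataset["train"] (hypothetical values).
toy_records = [
    {"Bel(A)": "1", "Bel(B)": "1"},
    {"Bel(A)": "1", "Bel(B)": "2"},
    {"Bel(A)": "3", "Bel(B)": "3"},
    {"Bel(A)": "0", "Bel(B)": "1"},
]

bel_a_support = class_support(toy_records, "Bel(A)")
bel_b_support = class_support(toy_records, "Bel(B)")
print(f"Bel(A): {bel_a_support}")
print(f"Bel(B): {bel_b_support}")
```

Because the labels were cast to strings earlier, the class keys here are strings as well; for the integer `CG(A)` and `CG(B)` columns the `classes` tuple would need integer values instead.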


In [None]:
def cal_class_support(class_number, CG_col, corpora):
  SUM = 0
  for record in dataset[corpora]:
    if record[CG_col] == class_number: SUM+=1
  return SUM

CG_A_Train_Support = [cal_class_support(1, "CG(A)", "train"), cal_class_support(2, "CG(A)", "train"), cal_class_support(3, "CG(A)", "train"), cal_class_support(4, "CG(A)", "train"), cal_class_support(0, "CG(A)", "train")]
CG_A_Test_Support = [cal_class_support(1, "CG(A)", "test"), cal_class_support(2, "CG(A)", "test"), cal_class_support(3, "CG(A)", "test"), cal_class_support(4, "CG(A)", "test"), cal_class_support(0, "CG(A)", "test")]
CG_B_Train_Support = [cal_class_support(1, "CG(B)", "train"), cal_class_support(2, "CG(B)", "train"), cal_class_support(3, "CG(B)", "train"), cal_class_support(4, "CG(B)", "train"), cal_class_support(0, "CG(B)", "train")]
CG_B_Test_Support = [cal_class_support(1, "CG(B)", "test"), cal_class_support(2, "CG(B)", "test"), cal_class_support(3, "CG(B)", "test"), cal_class_support(4, "CG(B)", "test"), cal_class_support(0, "CG(B)", "test")]

print(f"CG(A) Train : {CG_A_Train_Support}\nCG(A) Test  : {CG_A_Test_Support}\nCG(B) Train : {CG_B_Train_Support}\nCG(B) Test  : {CG_B_Test_Support}")

CG(A) Train : [771, 61, 57, 0, 81]
CG(A) Test  : [245, 30, 36, 0, 14]
CG(B) Train : [769, 63, 58, 0, 80]
CG(B) Test  : [245, 28, 37, 0, 15]


In [None]:
train_df = dataset['train'].to_pandas()
train_df = train_df[train_df['Bel(A)'] != '0']
dataset['train'] = Dataset.from_pandas(train_df)
dataset['train'] = dataset['train'].remove_columns('__index_level_0__')

test_df = dataset['test'].to_pandas()
test_df = test_df[test_df['Bel(A)'] != '0']
dataset['test'] = Dataset.from_pandas(test_df)
dataset['test'] = dataset['test'].remove_columns('__index_level_0__')

In [None]:
dataset['train'][20]

{'Speaker': 'B',
 'Sentence_Number': 131,
 'Sentence': 'None',
 'Event': 'Previous Sentences: A asks B if this conversation has gone for thirty minutes already This conversation has gone for thirty minutes already The parakeets flying across the street in Barcelona was beautiful B thinks this conversation will have been going for thirty minutes in about five minutes \nTarget Sentence: This conversation will have been going for thirty minutes in about five minutes',
 'Target_Event': 'This conversation will have been going for thirty minutes in about five minutes',
 'Bel(A)': '3',
 'Bel(B)': '3',
 'CG(A)': 1,
 'CG(B)': 1}

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'],
        num_rows: 954
    })
    test: Dataset({
        features: ['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'],
        num_rows: 320
    })
})

In [None]:
def cal_class_support(class_number="1", bel_col="Bel(A)", corpora="train"):
  SUM = 0
  for record in dataset[corpora]:
    if record[bel_col] == class_number: SUM+=1
  return SUM

Bel_A_Train_Support = [cal_class_support("1", "Bel(A)", "train"), cal_class_support("2", "Bel(A)", "train"), cal_class_support("3", "Bel(A)", "train"), cal_class_support("4", "Bel(A)", "train"), cal_class_support("0", "Bel(A)", "train")]
Bel_A_Test_Support = [cal_class_support("1", "Bel(A)", "test"), cal_class_support("2", "Bel(A)", "test"), cal_class_support("3", "Bel(A)", "test"), cal_class_support("4", "Bel(A)", "test"), cal_class_support("0", "Bel(A)", "test")]
Bel_B_Train_Support = [cal_class_support("1", "Bel(B)", "train"), cal_class_support("2", "Bel(B)", "train"), cal_class_support("3", "Bel(B)", "train"), cal_class_support("4", "Bel(B)", "train"), cal_class_support("0", "Bel(B)", "train")]
Bel_B_Test_Support = [cal_class_support("1", "Bel(B)", "test"), cal_class_support("2", "Bel(B)", "test"), cal_class_support("3", "Bel(B)", "test"), cal_class_support("4", "Bel(B)", "test"), cal_class_support("0", "Bel(B)", "test")]

print(f"Bel(A) Train : {Bel_A_Train_Support}\nBel(A) Test  : {Bel_A_Test_Support}\nBel(B) Train : {Bel_B_Train_Support}\nBel(B) Test  : {Bel_B_Test_Support}")

Bel(A) Train : [784, 54, 78, 38, 0]
Bel(A) Test  : [261, 35, 20, 4, 0]
Bel(B) Train : [782, 51, 68, 43, 10]
Bel(B) Test  : [258, 35, 14, 4, 9]


Let's check out an example from the dataset.

In [None]:
from random import randrange

sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"Speaker is <{sample['Speaker']}>")
print(f"Text: \n{sample['Event']}\n---------------")
print(f"Bel(A): \n{sample['Bel(A)']}\n---------------")

Speaker is <B>
Text: 
Previous Sentences: Blimpie's scared A while readjusting to Americanisms Blimpie's is the sandwich place B knows that Blimpie's is the sandwich place 
Target Sentence: Blimpie's scarrying A is pretty funny
---------------
Bel(A): 
1
---------------


### Tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-base"

# Load tokenizer of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(model_id)

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

We treat this as a text2text-generation task: the model takes an event as input and generates the belief class as output. To batch the data efficiently, we first need to know how long the tokenized inputs and outputs are.

In [None]:
from datasets import concatenate_datasets

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["Event"], truncation=True), batched=True, remove_columns=['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["Bel(A)"], truncation=True), batched=True, remove_columns=['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/1274 [00:00<?, ? examples/s]

Max source length: 417


Map:   0%|          | 0/1274 [00:00<?, ? examples/s]

Max target length: 2
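Padding every input to the single longest sequence can waste compute when only a few inputs are that long. As a rough sketch of one alternative, the snippet below compares the maximum with a nearest-rank percentile of the lengths. The helper `length_at_percentile` and the toy lengths are assumptions for illustration; choosing a percentile below 100 would truncate the longest inputs, which may or may not be acceptable for this task.

```python
import math

def length_at_percentile(lengths, percentile):
    """Smallest length that covers `percentile` percent of sequences (nearest-rank method)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical token counts; in the notebook these would come from
# [len(x) for x in tokenized_inputs["input_ids"]].
toy_lengths = [12, 15, 18, 20, 22, 25, 30, 41, 57, 417]

max_len = max(toy_lengths)
p90_len = length_at_percentile(toy_lengths, 90)
print(f"max: {max_len}, 90th percentile: {p90_len}")
```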


In [None]:
def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    # inputs = ["Bel(A): " + item for item in sample["Event"]]
    inputs = [item for item in sample["Event"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["Bel(A)"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=['Speaker', 'Sentence_Number', 'Sentence', 'Event', 'Target_Event', 'Bel(A)', 'Bel(B)', 'CG(A)', 'CG(B)'])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Map:   0%|          | 0/954 [00:00<?, ? examples/s]

Map:   0%|          | 0/320 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
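The label masking inside `preprocess_function` can be seen in isolation in the sketch below. It assumes the T5 convention that the pad token id is 0 and the end-of-sequence id is 1; the other token ids are made up for illustration, and `mask_padding` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding positions in each label sequence so they do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Hypothetical padded label ids: one class token, </s> (id 1), then padding (id 0).
toy_labels = [[209, 1, 0, 0], [220, 1, 0, 0]]

masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```

Only the padding positions change; the class token and the end-of-sequence token are still scored by the loss.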


## 3. Fine-tune and evaluate FLAN-T5


In [None]:
from transformers import AutoModelForSeq2SeqLM

# huggingface hub model id
model_id="google/flan-t5-base"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

We use the `evaluate` library to compute the macro-averaged F1 score on the decoded predictions.

In [None]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("f1")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # Join sentences with newlines (a holdover from ROUGE-based recipes; it leaves single-token class labels unchanged)
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, average='macro')
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]
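To make the macro-averaged F1 used in `compute_metrics` concrete, here is a small from-scratch sketch on toy labels. It is meant only as an illustration of the averaging, not as a replacement for the `evaluate` metric; `macro_f1` and the toy lists are hypothetical.

```python
def macro_f1(references, predictions):
    """Unweighted mean of per-class F1 over every class seen in references or predictions."""
    classes = sorted(set(references) | set(predictions))
    scores = []
    for c in classes:
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, imbalanced example: four records of class "1", two of class "3".
refs = ["1", "1", "1", "1", "3", "3"]
preds = ["1", "1", "1", "1", "1", "3"]

accuracy = sum(r == p for r, p in zip(refs, preds)) / len(refs)
score = macro_f1(refs, preds)
print(f"accuracy: {accuracy:.4f}, macro F1: {score:.4f}")
```

Each class counts equally in the average, so a single error on the minority class pulls macro F1 below accuracy. That is why this metric is informative for the skewed class distribution shown earlier.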

Before we can start training, we need to create a `DataCollator` that takes care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)


The last step is to define the hyperparameters (`Seq2SeqTrainingArguments`) for training. The `Trainer` integrates with the [Hugging Face Hub](https://huggingface.co/models) and can push checkpoints, logs and metrics to a repository during training; in this run `push_to_hub` is commented out, so checkpoints are saved locally and logs are written for TensorBoard.

In [None]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-event-extraction"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=3e-4,

    num_train_epochs=12,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="epoch",
    # logging_steps=300,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=False,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    # push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

We can start our training by using the `train` method of the `Trainer`.

In [None]:
# Start training
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1,Gen Len
1,0.4491,0.295596,22.4613,2.0
2,0.2728,0.311315,39.634,2.0
3,0.166,0.351109,37.3819,2.0
4,0.132,0.802893,42.9013,2.0
5,0.074,0.738474,49.2167,2.0
6,0.0617,0.944376,54.8863,2.0
7,0.0254,1.17103,50.4197,2.0
8,0.0166,1.571334,52.0208,2.0
9,0.0029,1.519035,51.046,2.0
10,0.0041,1.739956,48.2195,2.0


TrainOutput(global_step=1440, training_loss=0.10073230112126717, metrics={'train_runtime': 1770.2651, 'train_samples_per_second': 6.467, 'train_steps_per_second': 0.813, 'total_flos': 6491756485214208.0, 'train_loss': 0.10073230112126717, 'epoch': 12.0})
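The `global_step=1440` reported above can be checked with a short calculation. The sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; the row count, batch size and epoch count are taken from the cells above.

```python
import math

train_rows = 954   # rows in dataset["train"] after removing Bel(A) == "0"
batch_size = 8     # per_device_train_batch_size
epochs = 12        # num_train_epochs

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```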

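Nice, we have trained our model. 🎉 Let's evaluate it again on the test set. Because `load_best_model_at_end=False`, this evaluates the model from the final epoch rather than the best checkpoint.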


In [None]:
trainer.evaluate()

{'eval_loss': 1.6770416498184204,
 'eval_f1': 48.7982,
 'eval_gen_len': 2.0,
 'eval_runtime': 25.8851,
 'eval_samples_per_second': 12.362,
 'eval_steps_per_second': 1.545,
 'epoch': 12.0}
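The per-epoch log shown earlier can be scanned for the strongest epoch. The sketch below copies the ten logged rows into a list and picks the epoch with the highest F1 and the one with the lowest validation loss; `eval_log` is just a hand-copied stand-in for the training table.

```python
# (epoch, validation loss, macro F1) copied from the logged training table
eval_log = [
    (1, 0.295596, 22.4613),
    (2, 0.311315, 39.634),
    (3, 0.351109, 37.3819),
    (4, 0.802893, 42.9013),
    (5, 0.738474, 49.2167),
    (6, 0.944376, 54.8863),
    (7, 1.17103, 50.4197),
    (8, 1.571334, 52.0208),
    (9, 1.519035, 51.046),
    (10, 1.739956, 48.2195),
]

best_f1_epoch, _, best_f1 = max(eval_log, key=lambda row: row[2])
lowest_loss_epoch, lowest_loss, _ = min(eval_log, key=lambda row: row[1])
print(f"best F1 {best_f1} at epoch {best_f1_epoch}")
print(f"lowest validation loss {lowest_loss} at epoch {lowest_loss_epoch}")
```

In the logged rows, F1 peaks around epoch 6 while validation loss keeps rising, which suggests overfitting in later epochs. One possible follow-up, not used in this run, is to set `load_best_model_at_end=True` together with `metric_for_best_model` so that the `Trainer` restores the best checkpoint after training.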

## 4. Run Inference and Classification Report


In [None]:
results_dict = {
    'Speaker': [],
    'Sentence_Number': [],
    'Sentence': [],
    'Event': [],
    'Target_Event': [],
    'Predicted Bel(A)': [],
    'Predicted Bel(B)': [],
    'Bel(A)': [],
    'Bel(B)': [],
    'CG(A)': [],
    'CG(B)': [],
}

In [None]:
from tqdm.auto import tqdm

samples_number = len(dataset['test'])
progress_bar = tqdm(range(samples_number))
predictions_list = []
labels_list = []
for i in range(samples_number):
  # text = "Bel(A): " + dataset['test']['Event'][i]
  text = dataset['test']['Event'][i]
  inputs = tokenizer.encode_plus(text, padding='max_length', max_length=512, return_tensors='pt').to('cuda')
  outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, num_beams=4, early_stopping=True)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  predictions_list.append(prediction)
  labels_list.append(dataset['test']['Bel(A)'][i])

  results_dict['Speaker'].append(dataset['test']['Speaker'][i])
  results_dict['Sentence_Number'].append(dataset['test']['Sentence_Number'][i])
  results_dict['Sentence'].append(dataset['test']['Sentence'][i])
  results_dict['Event'].append(dataset['test']['Event'][i])
  results_dict['Target_Event'].append(dataset['test']['Target_Event'][i])
  results_dict['Predicted Bel(A)'].append(prediction)
  results_dict['Predicted Bel(B)'].append(prediction)
  results_dict['Bel(A)'].append(dataset['test']['Bel(A)'][i])
  results_dict['Bel(B)'].append(dataset['test']['Bel(B)'][i])
  results_dict['CG(A)'].append(dataset['test']['CG(A)'][i])
  results_dict['CG(B)'].append(dataset['test']['CG(B)'][i])

  progress_bar.update(1)

  0%|          | 0/320 [00:00<?, ?it/s]

In [None]:
from sklearn.metrics import classification_report

report = classification_report(labels_list, predictions_list, zero_division=0)
print(report)

              precision    recall  f1-score   support

           1       0.92      0.85      0.88       261
           2       0.52      0.34      0.41        35
           3       0.22      0.60      0.32        20
           4       0.50      0.25      0.33         4

    accuracy                           0.77       320
   macro avg       0.54      0.51      0.49       320
weighted avg       0.83      0.77      0.79       320
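A confusion matrix can show which classes are mistaken for which, something the report above does not reveal directly. The following is a minimal pure-Python sketch on toy lists; in the notebook, `labels_list` and `predictions_list` would take the place of the hypothetical `toy_labels` and `toy_preds`.

```python
from collections import Counter

def confusion_matrix_rows(labels, predictions):
    """Return (classes, matrix) where matrix[i][j] counts true class i predicted as class j."""
    counts = Counter(zip(labels, predictions))
    classes = sorted(set(labels) | set(predictions))
    matrix = [[counts[(true, pred)] for pred in classes] for true in classes]
    return classes, matrix

# Toy gold labels and predictions (hypothetical values).
toy_labels = ["1", "1", "3", "3", "2", "1"]
toy_preds = ["1", "3", "3", "1", "2", "1"]

classes, matrix = confusion_matrix_rows(toy_labels, toy_preds)
print("rows = true class, columns = predicted class:", classes)
for true_class, row in zip(classes, matrix):
    print(true_class, row)
```

Off-diagonal cells are the errors. For instance, a large count in the row for class 1 and the column for class 3 would be consistent with the low precision reported for class 3.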



In [None]:
results_df = pd.DataFrame.from_dict(results_dict)

In [None]:
results_df

Unnamed: 0,Speaker,Sentence_Number,Sentence,Event,Target_Event,Predicted Bel(A),Predicted Bel(B),Bel(A),Bel(B),CG(A),CG(B)
0,B,1,B: %um I took them to %uh &Jill’s and they spe...,Previous Sentences: \nTarget Sentence: B took ...,B took the kids to Jill's,1,1,1,1,1,1
1,B,1,,Previous Sentences: B took the kids to Jill's ...,The kids spent two days at Jill's,1,1,1,1,1,1
2,B,1,,Previous Sentences: B took the kids to Jill's ...,B guesses Jill couldn't take the kids,1,1,1,1,1,1
3,B,1,,Previous Sentences: B took the kids to Jill's ...,Jill couldn't take the kids,3,3,3,3,1,1
4,B,1,,Previous Sentences: B took the kids to Jill's ...,The kid's mom and dad came,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...
315,A,144,,Previous Sentences: A asks B if B knows what A...,A's sons never treat one another like A and B'...,1,1,3,3,1,1
316,B,145,B: True. I don’t think my kids will be that wa...,Previous Sentences: B doesn't know what A said...,B doesn't think B's kids will be like A and B'...,1,1,1,1,1,1
317,B,145,,Previous Sentences: B doesn't know what A said...,B's kids will be like A and B's mom and dad,1,1,3,3,1,1
318,A,146,A: And he just looked at me. [channel noise],Previous Sentences: A said A hopes that A's so...,A and B's dad just looked at A,1,1,1,1,1,1


In [None]:
results_df.to_csv('results.csv')

### Using Bel(A) trained model for test on Bel(B)

We remove the records with Bel(B) = 0 so that the classification report is computed only over the classes the model was trained on.

In [None]:
test_df = dataset['test'].to_pandas()
test_df = test_df[test_df['Bel(B)'] != '0']
dataset['test'] = Dataset.from_pandas(test_df)
dataset['test'] = dataset['test'].remove_columns('__index_level_0__')

In [None]:
from tqdm.auto import tqdm

samples_number = len(dataset['test'])
progress_bar = tqdm(range(samples_number))
predictions_list = []
labels_list = []
for i in range(samples_number):
  text = dataset['test']['Event'][i]
  inputs = tokenizer.encode_plus(text, padding='max_length', max_length=512, return_tensors='pt').to('cuda')
  outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, num_beams=4, early_stopping=True)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  predictions_list.append(prediction)
  labels_list.append(dataset['test']['Bel(B)'][i])

  progress_bar.update(1)

  0%|          | 0/311 [00:00<?, ?it/s]

In [None]:
from sklearn.metrics import classification_report

report = classification_report(labels_list, predictions_list, zero_division=0)
print(report)

              precision    recall  f1-score   support

           1       0.92      0.84      0.88       258
           2       0.48      0.31      0.38        35
           3       0.15      0.57      0.24        14
           4       0.00      0.00      0.00         4

    accuracy                           0.76       311
   macro avg       0.39      0.43      0.37       311
weighted avg       0.83      0.76      0.78       311



### Show

In [None]:
for i in range(len(predictions_list)):
  if predictions_list[i] == labels_list[i]: print(predictions_list[i], labels_list[i])
  else: print(predictions_list[i], labels_list[i], "Incorrect")


1 1
1 1
1 1
4 4
1 1
4 1 Incorrect
1 1
1 1
1 1
3 1 Incorrect
1 1
4 3 Incorrect
1 1
2 1 Incorrect
1 1
1 1
1 1
3 1 Incorrect
1 1
1 1
1 1
3 1 Incorrect
3 1 Incorrect
1 1
1 1
1 3 Incorrect
2 2
1 1
1 1
1 1
4 4
1 1
1 1
1 1
1 1
1 1
1 1
3 3
1 1
1 1
1 1
1 1
1 1
1 1
1 1
4 4
2 1 Incorrect
1 1
1 1
1 1
1 1
1 1
2 3 Incorrect
4 1 Incorrect
3 1 Incorrect
3 3
1 1
3 3
1 1
1 1
3 4 Incorrect
4 4
1 1
1 1
1 1
2 1 Incorrect
1 1
1 1
1 1
1 1
1 1
1 1
1 2 Incorrect
1 1
1 1
1 1
1 1
1 1
4 4
4 1 Incorrect
1 1
1 1
1 1
4 1 Incorrect
1 1
1 1
4 1 Incorrect
2 2
2 2
1 1
1 1
1 4 Incorrect
1 1
1 1
3 2 Incorrect
4 3 Incorrect
1 1
1 1
2 2
1 2 Incorrect
1 1
3 1 Incorrect
1 1
1 1
1 1
1 1
1 1
1 4 Incorrect
1 2 Incorrect
1 1
1 3 Incorrect
1 1
3 2 Incorrect
1 1
1 1
1 1
1 1
2 1 Incorrect
1 1
1 1
1 1
3 3
1 1
1 1
2 1 Incorrect
1 1
4 4
1 1
1 1
1 1
4 1 Incorrect
1 1
1 1
1 1
1 1
1 1
1 1
1 1
3 2 Incorrect
3 1 Incorrect
3 1 Incorrect
1 1
4 1 Incorrect
2 2
1 1
1 1
4 1 Incorrect
1 1
1 1
1 1
4 1 Incorrect
1 4 Incorrect
1 1
1 1
4 1 Incorrect


### Save Model

In [None]:
save_directory = "/content/drive/MyDrive/Common Ground Docs/Models/FlanT5_Final_Model_Bel_Classification_3to1_49BelA_37BelB"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

### Load Model and Test the Saved Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSeq2SeqLM.from_pretrained(save_directory)

In [None]:
model.to('cuda')

In [None]:
from tqdm.auto import tqdm

samples_number = len(dataset['test'])
progress_bar = tqdm(range(samples_number))
predictions_list = []
labels_list = []
for i in range(samples_number):
  text = dataset['test']['Event'][i]
  inputs = tokenizer.encode_plus(text, padding='max_length', max_length=512, return_tensors='pt').to('cuda')
  outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, num_beams=4, early_stopping=True)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  predictions_list.append(prediction)
  labels_list.append(dataset['test']['Bel(B)'][i])

  progress_bar.update(1)

  0%|          | 0/242 [00:00<?, ?it/s]

In [None]:
from sklearn.metrics import classification_report

report = classification_report(labels_list, predictions_list, zero_division=0)
print(report)

              precision    recall  f1-score   support

           1       0.93      0.86      0.90       200
           2       0.46      0.43      0.44        14
           3       0.35      0.47      0.40        17
           4       0.30      0.55      0.39        11

    accuracy                           0.80       242
   macro avg       0.51      0.58      0.53       242
weighted avg       0.83      0.80      0.81       242

