<a href="https://colab.research.google.com/github/DreRnc/ExplainingExplanations/blob/ModData/Base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset : **E-SNLI**. \
Model : **Base T5**.

In [1]:
colab = False

In [2]:
if colab:
    !git clone https://github.com/DreRnc/ExplainingExplanations.git
    %cd ExplainingExplanations
    !git checkout seq2seq
    %pip install -r requirements_colab.txt
    

# 1.0 Preparation


Set parameters for the experiments.

In [3]:
sizes = {
    'n_train' : 100000,
    'n_val' : 9842,
    'n_test' : 9824
}

NUM_EPOCHS = 5

# Whether to format inputs with the MNLI prompt that T5 saw during pre-training
USE_MNLI_PROMPT = False
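
The flag above switches between two input formats. The actual formatting lives in `src.utils.tokenize_function`, which is not shown in this notebook, so the following is only a rough sketch. It assumes the MNLI format follows the T5 convention `mnli hypothesis: ... premise: ...`; the function name `build_input` and the fallback format are hypothetical.

```python
def build_input(premise: str, hypothesis: str, use_mnli_format: bool) -> str:
    """Build the text fed to T5 for one premise/hypothesis pair (illustrative only)."""
    if use_mnli_format:
        # Task prefix used for MNLI in T5's multi-task pre-training mixture
        return f"mnli hypothesis: {hypothesis} premise: {premise}"
    # Hypothetical alternative without the MNLI task prefix (an assumption)
    return f"premise: {premise} hypothesis: {hypothesis}"


example = build_input("A man plays guitar.", "A person makes music.", use_mnli_format=True)
print(example)
```

With the prefix, the pretrained checkpoint can be expected to answer with one of the MNLI label words; without it, the model has to learn the input format during fine-tuning.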

## 1.1 Loading Tokenizer

In [4]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base", truncation=True, padding=True)


## 1.2 Loading and Tokenizing Dataset

In [5]:
from datasets import load_dataset
from src.preprocess import prepare_dataset
from functools import partial
from src.utils import tokenize_function

In [6]:
dataset = load_dataset("esnli", download_mode="force_redownload")

Downloading data: 100%|██████████| 39.3M/39.3M [00:01<00:00, 30.2MB/s]
Downloading data: 100%|██████████| 1.62M/1.62M [00:00<00:00, 10.4MB/s]
Downloading data: 100%|██████████| 1.61M/1.61M [00:00<00:00, 5.73MB/s]
Generating train split: 100%|██████████| 549367/549367 [00:00<00:00, 3110462.84 examples/s]
Generating validation split: 100%|██████████| 9842/9842 [00:00<00:00, 1888828.18 examples/s]
Generating test split: 100%|██████████| 9824/9824 [00:00<00:00, 1275849.72 examples/s]


In [7]:
tokenize_mapping = partial(tokenize_function, tokenizer=tokenizer, use_mnli_format=USE_MNLI_PROMPT)

In [8]:
train_tok, valid_tok, test_tok = prepare_dataset(dataset, tokenize_mapping=tokenize_mapping, sizes = sizes)

Map: 100%|██████████| 100000/100000 [00:05<00:00, 17956.76 examples/s]
Map: 100%|██████████| 9842/9842 [00:00<00:00, 17747.87 examples/s]
Map: 100%|██████████| 9824/9824 [00:00<00:00, 16811.62 examples/s]


## 1.3 Loading SBERT for evaluating sentence similarity

In [9]:
from sentence_transformers import SentenceTransformer

In [10]:
sbert = SentenceTransformer('all-MiniLM-L6-v2')
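
SBERT maps each sentence to a fixed-size vector, and two sentences are usually compared through the cosine similarity of their vectors. The sketch below shows that computation in plain Python on toy vectors. How `SbertMetric` aggregates scores is not shown in this notebook, so averaging over prediction/reference pairs is an assumption, and both function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def average_similarity(preds: Sequence[Sequence[float]],
                       refs: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over aligned (prediction, reference) embedding pairs."""
    if len(preds) != len(refs) or len(preds) == 0:
        raise ValueError("preds and refs must be non-empty and of equal length")
    scores = [cosine_similarity(p, r) for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)


# Toy 2-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384-dimensional vectors
pred_embeddings = [[1.0, 0.0], [1.0, 0.0]]
ref_embeddings = [[1.0, 0.0], [0.0, 1.0]]
print(average_similarity(pred_embeddings, ref_embeddings))
```

In practice the vectors would come from `sbert.encode(...)` applied to the decoded generated and reference explanations.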

# 2.0 Tasks

In [11]:
import torch
from functools import partial
import evaluate
from src.utils import compute_metrics, eval_pred_transform_accuracy
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, T5ForConditionalGeneration, DataCollatorForSeq2Seq


In [12]:
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
    
device

device(type='cuda')

In [13]:
transform_accuracy = partial(eval_pred_transform_accuracy, tokenizer = tokenizer)
compute_accuracy = partial(compute_metrics, pred_transforms=transform_accuracy, metrics = evaluate.load('accuracy'))

In [14]:
standard_args = {
    "save_strategy" : "no",
    "evaluation_strategy" : "epoch",
    "predict_with_generate" : True,
    "per_device_train_batch_size" : 16,
    "per_device_eval_batch_size" : 16,
}

## 2.1 Task 1: Zero-shot evaluation

In [15]:
model = T5ForConditionalGeneration.from_pretrained("t5-base")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [16]:
training_args = Seq2SeqTrainingArguments(
    **standard_args,
    num_train_epochs = NUM_EPOCHS,
    output_dir="task1",
    generation_max_length=32,
    metric_for_best_model="accuracy",
)

In [17]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=valid_tok,
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [18]:
trainer.evaluate(test_tok)



{'eval_loss': 0.18833135068416595,
 'eval_accuracy': 0.8033387622149837,
 'eval_runtime': 47.675,
 'eval_samples_per_second': 206.062,
 'eval_steps_per_second': 6.439}

## 2.2 Task 2: Fine Tuning without Explanations

In [19]:
NUM_EPOCHS = 3

In [20]:
model_ft = T5ForConditionalGeneration.from_pretrained("t5-base")
data_collator_ft = DataCollatorForSeq2Seq(tokenizer, model=model_ft)

In [21]:
training_args_ft = Seq2SeqTrainingArguments(
    **standard_args,
    num_train_epochs = NUM_EPOCHS,
    output_dir="task2",
    generation_max_length=32,
    metric_for_best_model="accuracy",
)

In [22]:
trainer_ft = Seq2SeqTrainer(
    model=model_ft,
    args=training_args_ft,
    train_dataset=train_tok,
    eval_dataset=valid_tok,
    compute_metrics=compute_accuracy,
    data_collator=data_collator_ft,
    tokenizer=tokenizer,
)

In [None]:
trainer_ft.train()

In [None]:
trainer_ft.evaluate(test_tok)

## 2.3 Task 3: Fine Tuning with Explanations

The target sequence must now contain both the label and the explanation, tokenized together.
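
A minimal sketch of one way to pack a label and an explanation into a single target string, and to split them again at evaluation time. The separator string and the helper names are assumptions for illustration; the format actually produced by `tokenize_function_ex` is not shown in this notebook.

```python
from typing import Tuple

# Label names in the order of the e-SNLI class ids 0, 1, 2
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Assumed separator between label and explanation (not taken from src.utils)
SEPARATOR = " explanation: "


def build_target(label_id: int, explanation: str) -> str:
    """Join the label word and the explanation into one target string."""
    return f"{LABEL_NAMES[label_id]}{SEPARATOR}{explanation}"


def split_target(text: str) -> Tuple[str, str]:
    """Recover (label, explanation) from a target or generated string."""
    label, _, explanation = text.partition(SEPARATOR)
    return label.strip(), explanation.strip()


target = build_target(0, "Playing guitar is a way of making music.")
print(target)
print(split_target(target))
```

Splitting on the same separator lets accuracy be computed on the label alone while the explanation part is scored separately.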

### Preparing the dataset with labelled explanations

In [24]:
from src.utils import tokenize_function_ex

In [25]:
dataset_explanations = load_dataset("esnli", download_mode="force_redownload")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
Downloading builder script: 100%|██████████| 4.41k/4.41k [00:00<00:00, 7.69MB/s]
Downloading readme: 100%|██████████| 6.90k/6.90k [00:00<00:00, 10.1MB/s]
Downloading data: 90.2MB [00:00, 160MB/s]                             
Downloading data: 99.4MB [00:00, 169MB/s]                             
Downloading data: 7.50MB [00:00, 83.5MB/s]                   
Downloading data: 7.44MB [00:00, 88.9MB/s]                   
Generating train split: 100%|██████████| 549367/549367 [00:08<00:00, 65077.25 examples/s]
Generating validation split: 100%|██████████| 9842/9842 [00:00<00:00, 58787.07 examples/s]
Generating test split: 100%|██████████| 9824/9824 [00:00<00:00, 59173.20 examples/s]


In [26]:
tokenize_mapping_ex = partial(tokenize_function_ex, tokenizer=tokenizer, use_mnli_format = USE_MNLI_PROMPT)

In [27]:
train_tok_ex, valid_tok_ex, test_tok_ex = prepare_dataset(dataset=dataset_explanations, tokenize_mapping=tokenize_mapping_ex, sizes=sizes)

Map: 100%|██████████| 100000/100000 [00:07<00:00, 12739.08 examples/s]
Map: 100%|██████████| 9842/9842 [00:00<00:00, 12533.58 examples/s]
Map: 100%|██████████| 9824/9824 [00:00<00:00, 11554.66 examples/s]


In [28]:
train_tok_ex.features

{'premise': Value(dtype='string', id=None),
 'hypothesis': Value(dtype='string', id=None),
 'explanation_1': Value(dtype='string', id=None),
 'explanation_2': Value(dtype='string', id=None),
 'explanation_3': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

### Defining the metrics: accuracy / similarity of explanations

In [29]:
from src.utils import eval_pred_transform_sbert
from src.sbert_metric import SbertMetric

In [30]:
transform_accuracy_ex = partial(eval_pred_transform_accuracy, tokenizer = tokenizer, remove_explanations_from_label = True)
accuracy = evaluate.load('accuracy')

In [31]:
transform_sbert = partial(eval_pred_transform_sbert, tokenizer = tokenizer)
sbert_similarity = SbertMetric(sbert)

In [32]:
transforms = [transform_accuracy_ex, transform_sbert]
metrics = [accuracy, sbert_similarity]

compute_metrics_ex = partial(compute_metrics, pred_transforms=transforms, metrics=metrics)

### Fine Tuning

In [33]:
NUM_EPOCHS = 10

In [34]:
model_ft_ex = T5ForConditionalGeneration.from_pretrained("t5-base")
data_collator_ft_ex = DataCollatorForSeq2Seq(tokenizer, model=model_ft_ex)

In [35]:
training_args_ft_ex = Seq2SeqTrainingArguments(
    **standard_args,
    num_train_epochs = NUM_EPOCHS,
    output_dir="task3",
    generation_max_length=128,
    metric_for_best_model="accuracy",
)

In [36]:
trainer_ft_ex = Seq2SeqTrainer(
    model=model_ft_ex,
    args=training_args_ft_ex,
    train_dataset=train_tok_ex,
    eval_dataset=valid_tok_ex,
    compute_metrics=compute_metrics_ex,
    data_collator=data_collator_ft_ex,
    tokenizer=tokenizer,
)

In [37]:
trainer_ft_ex.train()

Epoch,Training Loss,Validation Loss,Accuracy,Explanation Average Similarity
1,1.0498,0.985403,0.862731,0.666615
2,0.982,0.955923,0.871977,0.66974
3,0.9355,0.94333,0.882544,0.668977
4,0.9074,0.934286,0.874416,0.673144
5,0.8804,0.930461,0.888132,0.676763
6,0.8679,0.929177,0.889657,0.68068
7,0.8462,0.928236,0.891485,0.676081
8,0.8452,0.928135,0.889352,0.677947
9,0.833,0.927079,0.891485,0.679235
10,0.8224,0.928883,0.890266,0.677993




TrainOutput(global_step=31250, training_loss=0.9077115869140625, metrics={'train_runtime': 9664.3966, 'train_samples_per_second': 103.473, 'train_steps_per_second': 3.234, 'total_flos': 6.828857590726656e+16, 'train_loss': 0.9077115869140625, 'epoch': 10.0})

In [38]:
trainer_ft_ex.evaluate(test_tok_ex)



{'eval_loss': 0.9279735684394836,
 'eval_accuracy': 0.8913884364820847,
 'eval_explanation_average_similarity': 0.6788642406463623,
 'eval_runtime': 193.8817,
 'eval_samples_per_second': 50.67,
 'eval_steps_per_second': 1.583,
 'epoch': 10.0}

## 2.4 Task 4: Fine Tuning with Shuffled Explanations

### Preparing the dataset with shuffled (mismatched) explanations

In [39]:
dataset_shex = load_dataset("esnli", download_mode="force_redownload")

Downloading data: 100%|██████████| 39.3M/39.3M [00:03<00:00, 11.4MB/s]
Downloading data: 100%|██████████| 1.62M/1.62M [00:00<00:00, 9.09MB/s]
Downloading data: 100%|██████████| 1.61M/1.61M [00:00<00:00, 8.83MB/s]
Generating train split: 100%|██████████| 549367/549367 [00:00<00:00, 5029822.98 examples/s]
Generating validation split: 100%|██████████| 9842/9842 [00:00<00:00, 3549775.56 examples/s]
Generating test split: 100%|██████████| 9824/9824 [00:00<00:00, 3680646.94 examples/s]


In [40]:
from src.preprocess import save_explanations, save_shuffled_explanations, retrieve_explanations

In [41]:
dirs = save_explanations(dataset_shex)

In [42]:
dirs_shuffled = save_shuffled_explanations(dirs)

In [43]:
shuffled_explanations = retrieve_explanations(dirs_shuffled)
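
Shuffling breaks the link between each example and its explanation while keeping the overall set of explanations unchanged. The sketch below illustrates the idea with a seeded shuffle. The implementation of `save_shuffled_explanations` is not shown in this notebook, so the function name, the seed, and the in-memory list are assumptions.

```python
import random
from typing import List


def shuffle_explanations(explanations: List[str], seed: int = 0) -> List[str]:
    """Return a shuffled copy of the explanations, leaving the input untouched."""
    rng = random.Random(seed)  # local generator, so the global random state is unaffected
    shuffled = list(explanations)
    rng.shuffle(shuffled)
    return shuffled


explanations = [
    "A guitar is an instrument.",
    "Sleeping is not running.",
    "Dogs are animals.",
    "The man may or may not be tall.",
]
shuffled = shuffle_explanations(explanations, seed=42)
print(shuffled)
```

A plain shuffle can leave some explanations in their original position; if every example must receive a mismatched explanation, a derangement would be needed instead.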

In [44]:
from src.utils import tokenize_function_ex

tokenize_mapping_train = partial(tokenize_function_ex, tokenizer=tokenizer, explanations = shuffled_explanations['train'], use_mnli_format = USE_MNLI_PROMPT)
tokenize_mapping_val = partial(tokenize_function_ex, tokenizer=tokenizer, explanations = shuffled_explanations['validation'], use_mnli_format = USE_MNLI_PROMPT)
tokenize_mapping_test = partial(tokenize_function_ex, tokenizer=tokenizer, explanations = shuffled_explanations['test'], use_mnli_format = USE_MNLI_PROMPT)

tokenize_mappings = (tokenize_mapping_train, tokenize_mapping_val, tokenize_mapping_test)

In [45]:
train_tok_shex, valid_tok_shex, test_tok_shex = prepare_dataset(dataset_shex, tokenize_mapping=tokenize_mappings, sizes=sizes)

Map: 100%|██████████| 100000/100000 [00:07<00:00, 12946.75 examples/s]
Map: 100%|██████████| 9842/9842 [00:00<00:00, 12472.20 examples/s]
Map: 100%|██████████| 9824/9824 [00:00<00:00, 12552.67 examples/s]


In [46]:
train_tok_shex = train_tok_shex.remove_columns(["explanation_1", "explanation_2", "explanation_3"])
valid_tok_shex = valid_tok_shex.remove_columns(["explanation_1", "explanation_2", "explanation_3"])
test_tok_shex = test_tok_shex.remove_columns(["explanation_1", "explanation_2", "explanation_3"])

### Fine Tuning

In [47]:
NUM_EPOCHS = 10

In [48]:
model_ft_shex = T5ForConditionalGeneration.from_pretrained("t5-base")
data_collator_ft_shex = DataCollatorForSeq2Seq(tokenizer, model=model_ft_shex)

In [49]:
training_args_ft_shex = Seq2SeqTrainingArguments(
    **standard_args,
    num_train_epochs=NUM_EPOCHS,
    output_dir="task4",
    generation_max_length=128,
    metric_for_best_model="accuracy",
)

In [50]:
trainer_ft_shex = Seq2SeqTrainer(
    model=model_ft_shex,
    args=training_args_ft_shex,
    train_dataset=train_tok_shex,
    eval_dataset=valid_tok_shex,
    compute_metrics=compute_metrics_ex,
    data_collator=data_collator_ft_shex,
    tokenizer=tokenizer,
)

In [51]:
trainer_ft_shex.train()



Epoch,Training Loss,Validation Loss,Accuracy,Explanation Average Similarity
1,0.6819,3.292184,0.839362,0.143113
2,0.4426,3.809421,0.865271,0.12242
3,0.4002,4.001132,0.882849,0.116783
4,0.3808,4.11971,0.875127,0.16092
5,0.3696,4.199963,0.888539,0.124391
6,0.3629,4.268664,0.888539,0.155748
7,0.36,4.297486,0.889047,0.100249




In [None]:
trainer_ft_shex.evaluate(test_tok_shex)



{'eval_loss': 2.4039559364318848,
 'eval_accuracy': 0.831,
 'eval_average_similarity': 0.15863953530788422,
 'eval_runtime': 13.6474,
 'eval_samples_per_second': 73.274,
 'eval_steps_per_second': 2.345,
 'epoch': 5.0}

## 2.5 Task 5: Profiling-UD

### Read the results of the automatic annotation performed on the explanations with Profiling-UD

Each token line of a CoNLL-U file has ten tab-separated fields:

1. **ID**: The token's position in the sentence.
2. **Token** (FORM): The token text as it appears in the sentence.
3. **Lemma**: The lemma or base form of the token.
4. **UPOS**: Universal part-of-speech tag.
5. **XPOS**: Language-specific part-of-speech tag (optional).
6. **FEATS**: Morphological features.
7. **HEAD**: The ID of the token's syntactic head (0 for the root).
8. **DEPREL**: The type of syntactic relation between the token and its head.
9. **DEPS**: Secondary (enhanced) dependencies.
10. **MISC**: Miscellaneous field, which can contain additional annotations.
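
The column layout can be illustrated with a small parser written in plain Python. This is a minimal sketch on a hand-written sample sentence, not the Profiling-UD output itself; the function name `parse_conllu` and the sample annotations are made up for illustration.

```python
from typing import Dict, List

# Standard CoNLL-U column names (FORM is the token text)
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comment lines carry metadata, not tokens
            continue
        else:
            current.append(dict(zip(COLUMNS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "# text = Dogs are animals.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_\n"
    "2\tare\tbe\tAUX\tVBP\t_\t3\tcop\t_\t_\n"
    "3\tanimals\tanimal\tNOUN\tNNS\tNumber=Plur\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
    "\n"
)
sentences = parse_conllu(sample)
lemmas = " ".join(token["LEMMA"] for token in sentences[0])
print(lemmas)  # dog be animal .
```

Joining the `LEMMA` column of one sentence gives the lemmatized version of that explanation.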

In [None]:
import pandas as pd 
# Define the path to your CoNLL-U file
conll_file_path = "ex_files/explanations_train.conllu"

# Define column names for the CoNLL-U file
column_names = [
    "ID",
    "TOKEN",
    "LEMMA",
    "UPOS",
    "XPOS",
    "FEATS",
    "HEAD",
    "DEPREL",
    "DEPS",
    "MISC"
]

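# Read the CoNLL-U file into a DataFrame.
# quoting=3 (csv.QUOTE_NONE) keeps quote characters that occur as tokens from being parsed as CSV quotes.
df = pd.read_csv(conll_file_path, delimiter='\t', comment='#', header=None, names=column_names, quoting=3)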

# Reset the index to create a numeric index
df.reset_index(drop=True, inplace=True)

# Display the DataFrame
df[:15]

In [None]:
# Assign a sentence index to every token: a new sentence starts whenever the token ID is 1,
# so the first explanation gets SAMPLE == 1.
df['SAMPLE'] = (df["ID"] == 1).cumsum()

### Prepare the dataset with modified explanations

In [None]:
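# Define the output file path
output_file = "ex_files/modified_explanations_1.txt"

# N_TRAIN was not defined in the notebook; assume it is the training-set size from `sizes`
N_TRAIN = sizes['n_train']

# Write one lemmatized explanation per line
with open(output_file, "w") as f:
    for i, (_, df_i) in enumerate(df.groupby("SAMPLE", sort=True)):
        if i >= N_TRAIN:
            break
        # Join the lemmas of this explanation only (df_i), not of the whole DataFrame
        modified_exp = ' '.join(df_i["LEMMA"].astype(str).values)
        f.write(modified_exp + "\n")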

In [None]:
with open("ex_files/modified_explanations_1.txt", "r") as f:
    explanations_m1 = f.readlines()
