<a href="https://colab.research.google.com/github/DreRnc/ExplainingExplanations/blob/main/Explanations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset: **e-SNLI**. \
Model: **T5-small**.

In [5]:
colab = False

In [6]:
if colab:
    !git clone https://github.com/DreRnc/ExplainingExplanations.git
    %cd ExplainingExplanations
    %pip install -r requirements.txt
    !git checkout seq2seq

# 1.0 Preparation


In [7]:
N_TRAIN = 200000
N_VAL = 9842
N_TEST = 9824

## 1.1 Loading Dataset

In [8]:
from datasets import load_dataset

dataset = load_dataset("esnli", download_mode="force_redownload")

Downloading data: 100%|██████████| 39.3M/39.3M [00:05<00:00, 6.67MB/s]
Downloading data: 100%|██████████| 1.62M/1.62M [00:00<00:00, 9.19MB/s]
Downloading data: 100%|██████████| 1.61M/1.61M [00:00<00:00, 9.01MB/s]
Generating train split: 100%|██████████| 549367/549367 [00:00<00:00, 2144853.85 examples/s]
Generating validation split: 100%|██████████| 9842/9842 [00:00<00:00, 1368076.49 examples/s]
Generating test split: 100%|██████████| 9824/9824 [00:00<00:00, 1356314.76 examples/s]


In [9]:
training_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]

print("Shape of training_set: ", training_set.shape)
print("Shape of validation_set: ", validation_set.shape)
print("Shape of test_set: ", test_set.shape)

Shape of training_set:  (549367, 6)
Shape of validation_set:  (9842, 6)
Shape of test_set:  (9824, 6)


In [10]:
training_set[0]

{'premise': 'A person on a horse jumps over a broken down airplane.',
 'hypothesis': 'A person is training his horse for a competition.',
 'label': 1,
 'explanation_1': 'the person is not necessarily training his horse',
 'explanation_2': '',
 'explanation_3': ''}

In [11]:
train_small = training_set.select(range(N_TRAIN))
valid_small = validation_set.select(range(N_VAL))
test_small = test_set.select(range(N_TEST))

print("Shape of train_small: ", train_small.shape)
print("Shape of valid_small: ", valid_small.shape)
print("Shape of test_small: ", test_small.shape)

Shape of train_small:  (200000, 6)
Shape of valid_small:  (9842, 6)
Shape of test_small:  (9824, 6)


## 1.2 Loading T5 Model

In [12]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small", truncation=True, padding=True)
model = T5ForConditionalGeneration.from_pretrained("t5-small")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Sanity-check the pretrained checkpoint **zero-shot** on an unrelated task (English-to-French translation).

In [13]:
input_ids = tokenizer(
    "translate English to French: Hello Dre, I think the English version is ok for us.",
    return_tensors="pt",
).input_ids
outputs = model.generate(input_ids, max_length=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True, max_length=20))

Bonjour Dre, je pense que la version anglaise est bonne pour nous.


## 1.3 Zero-shot example to Verify Everything is Working

In [14]:
from src.utils import generate_prompt_mnli

In [15]:
example = training_set[0]
example

{'premise': 'A person on a horse jumps over a broken down airplane.',
 'hypothesis': 'A person is training his horse for a competition.',
 'label': 1,
 'explanation_1': 'the person is not necessarily training his horse',
 'explanation_2': '',
 'explanation_3': ''}

Generating the prompt. The expected format, shown on an MNLI example:

<b><u>mnli hypothesis:</u></b> The St. Louis Cardinals have always won. <b><u>premise:</u></b> yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but

Output (class id: label):
* 0: Entailment
* 1: Neutral
* 2: Contradiction

In [16]:
prompt = generate_prompt_mnli(example)
prompt

'mnli hypothesis: A person is training his horse for a competition. premise: A person on a horse jumps over a broken down airplane.'
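The helper `generate_prompt_mnli` lives in `src.utils` and is not shown in this notebook. Judging only from the prompt printed above, it could look roughly like the following; the function name `sketch_generate_prompt_mnli` is hypothetical and the real implementation may differ.

```python
def sketch_generate_prompt_mnli(example: dict) -> str:
    """Build a T5-style MNLI prompt from an e-SNLI row.

    Hypothetical re-implementation inferred from the printed prompt,
    not the actual code in src.utils.
    """
    return f"mnli hypothesis: {example['hypothesis']} premise: {example['premise']}"


# Same premise/hypothesis pair as training_set[0] above.
example = {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
}
prompt = sketch_generate_prompt_mnli(example)
print(prompt)
```

Note that the hypothesis comes before the premise, matching the `mnli` task prefix that T5 was pretrained with.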

In [17]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(outputs)
print("Shape of outputs:", outputs.shape)
print("Shape of outputs[0]:", outputs[0].shape)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

tensor([[   0, 7163,    1]])
Shape of outputs: torch.Size([1, 3])
Shape of outputs[0]: torch.Size([3])
neutral
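Since the model answers with a label name rather than a class id, the decoded string has to be mapped back to an integer before it can be compared with the dataset's `label` column. A minimal sketch, assuming the id order listed above (0 entailment, 1 neutral, 2 contradiction); `text_to_label_id` is a hypothetical helper, not part of `src.utils`.

```python
# Label order as listed above: 0 entailment, 1 neutral, 2 contradiction.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def text_to_label_id(text: str) -> int:
    """Map a decoded model output to its class id, or -1 if it is not a known label."""
    cleaned = text.strip().lower()
    return LABEL_NAMES.index(cleaned) if cleaned in LABEL_NAMES else -1


print(text_to_label_id("neutral"))
```

Returning -1 for unknown strings keeps malformed generations from being silently counted as one of the three classes.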




## 1.4 Tokenize the dataset

In [18]:
train_small.info

DatasetInfo(description='\nThe e-SNLI dataset extends the Stanford Natural Language Inference Dataset to\ninclude human-annotated natural language explanations of the entailment\nrelations.\n', citation='\n@incollection{NIPS2018_8163,\ntitle = {e-SNLI: Natural Language Inference with Natural Language Explanations},\nauthor = {Camburu, Oana-Maria and Rockt"{a}schel, Tim and Lukasiewicz, Thomas and Blunsom, Phil},\nbooktitle = {Advances in Neural Information Processing Systems 31},\neditor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},\npages = {9539--9549},\nyear = {2018},\npublisher = {Curran Associates, Inc.},\nurl = {http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations.pdf}\n}\n', homepage='https://github.com/OanaMariaCamburu/e-SNLI', license='', features={'premise': Value(dtype='string', id=None), 'hypothesis': Value(dtype='string', id=None), 'label': ClassLabel(names=['entailment', '

In [19]:
train_small.features

{'premise': Value(dtype='string', id=None),
 'hypothesis': Value(dtype='string', id=None),
 'label': ClassLabel(names=['entailment', 'neutral', 'contradiction'], id=None),
 'explanation_1': Value(dtype='string', id=None),
 'explanation_2': Value(dtype='string', id=None),
 'explanation_3': Value(dtype='string', id=None)}

In [20]:
from functools import partial
from src.utils import tokenize_function

In [21]:
tokenize_mapping = partial(tokenize_function, tokenizer=tokenizer)

In [22]:
train_small_tokenized = train_small.map(tokenize_mapping, batched=True).with_format(
    "torch"
)
valid_small_tokenized = valid_small.map(tokenize_mapping, batched=True).with_format(
    "torch"
)
test_small_tokenized = test_small.map(tokenize_mapping, batched=True).with_format(
    "torch"
)

print("Shape of train_small_tokenized: ", train_small_tokenized.shape)
print("Shape of valid_small_tokenized: ", valid_small_tokenized.shape)
print("Shape of test_small_tokenized: ", test_small_tokenized.shape)



Map: 100%|██████████| 200000/200000 [00:11<00:00, 16943.27 examples/s]
Map: 100%|██████████| 9842/9842 [00:00<00:00, 16954.83 examples/s]
Map: 100%|██████████| 9824/9824 [00:00<00:00, 17159.27 examples/s]

Shape of train_small_tokenized:  (200000, 9)
Shape of valid_small_tokenized:  (9842, 9)
Shape of test_small_tokenized:  (9824, 9)





In [23]:
train_small_tokenized = train_small_tokenized.remove_columns(["label"])
valid_small_tokenized = valid_small_tokenized.remove_columns(["label"])
test_small_tokenized = test_small_tokenized.remove_columns(["label"])

In [24]:
train_small_tokenized.features

{'premise': Value(dtype='string', id=None),
 'hypothesis': Value(dtype='string', id=None),
 'explanation_1': Value(dtype='string', id=None),
 'explanation_2': Value(dtype='string', id=None),
 'explanation_3': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

# 2.0 Tasks

### Imports and definitions

In [25]:
import torch
from functools import partial
import evaluate
from src.utils import compute_metrics, eval_pred_transform_accuracy
from transformers import (
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback
)

In [26]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
device

device(type='cuda')

In [27]:
transform_accuracy = partial(eval_pred_transform_accuracy, tokenizer=tokenizer)
compute_accuracy = partial(
    compute_metrics, pred_transform=transform_accuracy, metric=evaluate.load("accuracy")
)
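`compute_metrics` and `eval_pred_transform_accuracy` are defined in `src.utils` and not shown here. Conceptually, accuracy for a text-to-text classifier is the fraction of decoded predictions that exactly match the decoded reference. The sketch below illustrates that idea in plain Python; `exact_match_accuracy` is a hypothetical name and this is not the notebook's actual metric code.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after whitespace stripping."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


preds = ["neutral", "entailment", "contradiction", "neutral"]
refs = ["neutral", "contradiction", "contradiction", "entailment"]
score = exact_match_accuracy(preds, refs)
print(score)
```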

## 2.1 Task 1: Zero-shot evaluation

In [28]:
model = T5ForConditionalGeneration.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
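`DataCollatorForSeq2Seq` pads each batch dynamically, and by default it pads the `labels` with `-100` so that padded positions are ignored by the loss. The pure-Python sketch below only illustrates that padding rule; `pad_labels` is a hypothetical helper and the token ids are arbitrary.

```python
LABEL_PAD_TOKEN_ID = -100  # default label_pad_token_id of DataCollatorForSeq2Seq


def pad_labels(batch: list[list[int]], pad_id: int = LABEL_PAD_TOKEN_ID) -> list[list[int]]:
    """Right-pad every label sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Two label sequences of different lengths (ids chosen only for illustration).
batch = [[7163, 1], [3, 35, 5756, 297, 1]]
padded = pad_labels(batch)
print(padded)
```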

In [29]:
training_args = Seq2SeqTrainingArguments(
    output_dir="task1",
    predict_with_generate=True,
    per_device_eval_batch_size=16,
    generation_max_length=32,
    metric_for_best_model="accuracy",
)

In [30]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_small_tokenized,
    eval_dataset=valid_small_tokenized,
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [31]:
trainer.evaluate(test_small_tokenized)



{'eval_loss': 0.23330074548721313,
 'eval_accuracy': 0.7216001628664495,
 'eval_runtime': 24.9599,
 'eval_samples_per_second': 393.591,
 'eval_steps_per_second': 12.3}

## 2.2 Task 2: Fine Tuning without Explanations

In [32]:
NUM_EPOCHS = 5

In [33]:
model_ft = T5ForConditionalGeneration.from_pretrained("t5-small")
data_collator_ft = DataCollatorForSeq2Seq(tokenizer, model=model_ft)

In [34]:
training_args_ft = Seq2SeqTrainingArguments(
    save_strategy="no",
    output_dir="task2",
    evaluation_strategy="epoch",
    num_train_epochs=NUM_EPOCHS,
    predict_with_generate=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    generation_max_length=32,
    metric_for_best_model="accuracy",
)

In [35]:
trainer_ft = Seq2SeqTrainer(
    model=model_ft,
    args=training_args_ft,
    train_dataset=train_small_tokenized,
    eval_dataset=valid_small_tokenized,
    compute_metrics=compute_accuracy,
    data_collator=data_collator_ft,
    tokenizer=tokenizer,
)

In [36]:
trainer_ft.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.1661,0.129643,0.855619
2,0.148,0.12563,0.858464
3,0.1381,0.118168,0.866389
4,0.1312,0.118386,0.867202
5,0.1317,0.118528,0.867811




TrainOutput(global_step=31250, training_loss=0.14623278198242187, metrics={'train_runtime': 5652.3121, 'train_samples_per_second': 176.919, 'train_steps_per_second': 5.529, 'total_flos': 1.6490146597699584e+16, 'train_loss': 0.14623278198242187, 'epoch': 5.0})
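The reported `global_step=31250` can be cross-checked against the batch settings. The sketch below assumes no gradient accumulation (the `Seq2SeqTrainingArguments` default) and shows that the step count matches two devices at 16 examples each. The device count is not stated anywhere in this notebook, so it is an inference from the numbers, not a documented fact.

```python
import math


def expected_optimizer_steps(
    n_examples: int,
    per_device_batch: int,
    n_devices: int,
    epochs: int,
    grad_accum: int = 1,
) -> int:
    """Number of optimizer steps for a full training run."""
    effective_batch = per_device_batch * n_devices * grad_accum
    return math.ceil(n_examples / effective_batch) * epochs


# N_TRAIN = 200000, per_device_train_batch_size = 16, NUM_EPOCHS = 5
steps_two_devices = expected_optimizer_steps(200_000, 16, n_devices=2, epochs=5)
steps_one_device = expected_optimizer_steps(200_000, 16, n_devices=1, epochs=5)
print(steps_two_devices, steps_one_device)
```

With a single device the same settings would give 62500 steps, twice the value reported by `TrainOutput`.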

In [37]:
trainer_ft.evaluate(test_small_tokenized)



{'eval_loss': 0.12336444854736328,
 'eval_accuracy': 0.8632939739413681,
 'eval_runtime': 49.3559,
 'eval_samples_per_second': 199.044,
 'eval_steps_per_second': 6.22,
 'epoch': 5.0}

## 2.3 Task 3: Fine Tuning with Explanations

For this task the target sequence (`labels`) must contain both the class label and the explanation, tokenized together.
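One way to build such a target, and to recover the bare label again for accuracy, is sketched below. The `" explanation: "` separator and both helper names are assumptions made for illustration; the actual format used by `tokenize_function_ex` in `src.utils` is not shown in this notebook.

```python
LABEL_NAMES = ["entailment", "neutral", "contradiction"]
SEPARATOR = " explanation: "  # assumed separator, not confirmed by the notebook


def build_target(label_id: int, explanation: str) -> str:
    """Concatenate the label name and the explanation into one target string."""
    return f"{LABEL_NAMES[label_id]}{SEPARATOR}{explanation}"


def strip_explanation(text: str) -> str:
    """Keep only the part before the separator (the predicted label)."""
    return text.split(SEPARATOR.strip(), 1)[0].strip()


target = build_target(1, "the person is not necessarily training his horse")
print(target)
print(strip_explanation(target))
```

Stripping the explanation before scoring keeps the accuracy comparable with Task 2, where the target is the label alone.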

### Preparing the dataset with labelled explanations

In [38]:
dataset_explanations = load_dataset("esnli", download_mode="force_redownload")

Downloading data: 100%|██████████| 39.3M/39.3M [00:02<00:00, 14.6MB/s]
Downloading data: 100%|██████████| 1.62M/1.62M [00:00<00:00, 9.96MB/s]
Downloading data: 100%|██████████| 1.61M/1.61M [00:00<00:00, 9.82MB/s]
Generating train split: 100%|██████████| 549367/549367 [00:00<00:00, 5563942.52 examples/s]
Generating validation split: 100%|██████████| 9842/9842 [00:00<00:00, 2847705.57 examples/s]
Generating test split: 100%|██████████| 9824/9824 [00:00<00:00, 3058781.27 examples/s]


In [39]:
training_set_ex = dataset_explanations["train"]
validation_set_ex = dataset_explanations["validation"]
test_set_ex = dataset_explanations["test"]

print("Shape of training_set: ", training_set_ex.shape)
print("Shape of validation_set: ", validation_set_ex.shape)
print("Shape of test_set: ", test_set_ex.shape)

Shape of training_set:  (549367, 6)
Shape of validation_set:  (9842, 6)
Shape of test_set:  (9824, 6)


In [40]:
train_small_ex = training_set_ex.select(range(N_TRAIN))
valid_small_ex = validation_set_ex.select(range(N_VAL))
test_small_ex = test_set_ex.select(range(N_TEST))

print("Shape of train_small: ", train_small_ex.shape)
print("Shape of valid_small: ", valid_small_ex.shape)
print("Shape of test_small: ", test_small_ex.shape)

Shape of train_small:  (200000, 6)
Shape of valid_small:  (9842, 6)
Shape of test_small:  (9824, 6)


#### Tokenizing the dataset

In [41]:
from functools import partial
from src.utils import tokenize_function_ex

In [42]:
tokenize_mapping_ex = partial(tokenize_function_ex, tokenizer=tokenizer)

In [43]:
train_small_tokenized_ex = train_small_ex.map(
    tokenize_mapping_ex, batched=True
).with_format("torch")
valid_small_tokenized_ex = valid_small_ex.map(
    tokenize_mapping_ex, batched=True
).with_format("torch")
test_small_tokenized_ex = test_small_ex.map(
    tokenize_mapping_ex, batched=True
).with_format("torch")

print("Shape of train_small_tokenized: ", train_small_tokenized_ex.shape)
print("Shape of valid_small_tokenized: ", valid_small_tokenized_ex.shape)
print("Shape of test_small_tokenized: ", test_small_tokenized_ex.shape)



Map: 100%|██████████| 200000/200000 [00:16<00:00, 11921.56 examples/s]
Map: 100%|██████████| 9842/9842 [00:00<00:00, 11683.17 examples/s]
Map: 100%|██████████| 9824/9824 [00:00<00:00, 10772.25 examples/s]

Shape of train_small_tokenized:  (200000, 9)
Shape of valid_small_tokenized:  (9842, 9)
Shape of test_small_tokenized:  (9824, 9)





In [44]:
train_small_tokenized_ex = train_small_tokenized_ex.remove_columns(["label"])
valid_small_tokenized_ex = valid_small_tokenized_ex.remove_columns(["label"])
test_small_tokenized_ex = test_small_tokenized_ex.remove_columns(["label"])

In [45]:
train_small_tokenized_ex.features

{'premise': Value(dtype='string', id=None),
 'hypothesis': Value(dtype='string', id=None),
 'explanation_1': Value(dtype='string', id=None),
 'explanation_2': Value(dtype='string', id=None),
 'explanation_3': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

### Fine Tuning

In [46]:
NUM_EPOCHS = 5

In [47]:
transform_accuracy_ex = partial(
    eval_pred_transform_accuracy,
    tokenizer=tokenizer,
    remove_explanations_from_label=True,
)
compute_accuracy_removing_explanations = partial(
    compute_metrics,
    pred_transform=transform_accuracy_ex,
    metric=evaluate.load("accuracy"),
)

In [48]:
model_ft_ex = T5ForConditionalGeneration.from_pretrained("t5-small")
data_collator_ft_ex = DataCollatorForSeq2Seq(tokenizer, model=model_ft_ex)

In [49]:
training_args_ft_ex = Seq2SeqTrainingArguments(
    save_strategy="no",
    output_dir="task3",
    evaluation_strategy="epoch",
    num_train_epochs=NUM_EPOCHS,
    predict_with_generate=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    generation_max_length=128,
    metric_for_best_model="accuracy",
)

In [50]:
trainer_ft_ex = Seq2SeqTrainer(
    model=model_ft_ex,
    args=training_args_ft_ex,
    train_dataset=train_small_tokenized_ex,
    eval_dataset=valid_small_tokenized_ex,
    compute_metrics=compute_accuracy_removing_explanations,
    data_collator=data_collator_ft_ex,
    tokenizer=tokenizer,
)

In [51]:
trainer_ft_ex.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,1.2238,1.118269,0.816907
2,1.1374,1.07598,0.826458
3,1.1075,1.049721,0.843934




TypeError: unhashable type: 'dict'

In [None]:
trainer_ft_ex.evaluate(test_small_tokenized_ex)



{'eval_loss': 1.1160638332366943,
 'eval_accuracy': 0.8113802931596091,
 'eval_runtime': 98.8088,
 'eval_samples_per_second': 99.424,
 'eval_steps_per_second': 3.107,
 'epoch': 10.0}

## 2.4 Task 4: Fine Tuning with Shuffled Explanations

### Preparing the dataset with *wrong* labelled explanations

In [None]:
texts = []
for example in train_small:  
    texts.append(example["explanation_1"])

# Save the texts to a text file
with open("explanations_train.txt", "w", encoding="utf-8") as f:
    for text in texts:
        f.write(text + "\n")

In [None]:
texts = []
for example in valid_small: 
    texts.append(example["explanation_1"])

# Save the texts to a text file
with open("explanations_val.txt", "w", encoding="utf-8") as f:
    for text in texts:
        f.write(text + "\n")

In [None]:
texts = []
for example in test_small:  
    texts.append(example["explanation_1"])

# Save the texts to a text file
with open("explanations_test.txt", "w", encoding="utf-8") as f:
    for text in texts:
        f.write(text + "\n")

In [None]:
import random
input_file = "explanations_train.txt"
output_file = "shuffled_explanations_train.txt"

with open(input_file, "r") as f:
    lines = f.readlines()

random.shuffle(lines)

with open(output_file, "w") as f:
    f.writelines(lines)

In [None]:
input_file = "explanations_val.txt"
output_file = "shuffled_explanations_val.txt"

with open(input_file, "r") as f:
    lines = f.readlines()

random.shuffle(lines)

with open(output_file, "w") as f:
    f.writelines(lines)

In [None]:
input_file = "explanations_test.txt"
output_file = "shuffled_explanations_test.txt"

with open(input_file, "r") as f:
    lines = f.readlines()

random.shuffle(lines)

with open(output_file, "w") as f:
    f.writelines(lines)
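The cells above shuffle with the global `random` state, so every run produces a different pairing. A seeded, in-memory variant makes the shuffled explanations reproducible and lets you check how many examples accidentally kept their own explanation. This is only a sketch; `shuffle_explanations` and `count_unchanged` are hypothetical helpers, and the seed value is arbitrary.

```python
import random


def shuffle_explanations(explanations: list[str], seed: int = 0) -> list[str]:
    """Return a shuffled copy of the explanations using a private, seeded RNG."""
    shuffled = list(explanations)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def count_unchanged(original: list[str], shuffled: list[str]) -> int:
    """Count positions where the shuffled explanation equals the original one."""
    return sum(o == s for o, s in zip(original, shuffled))


original = [
    "the person is not necessarily training his horse",
    "one cannot be on a horse and at a diner at the same time",
    "a broken down airplane is outdoors",
    "smiling children are happy",
]
shuffled = shuffle_explanations(original, seed=0)
print(count_unchanged(original, shuffled), "explanations stayed in place")
```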

In [None]:
with open("shuffled_explanations_train.txt", "r") as f:
    shuffled_explanations_train = f.readlines()

with open("shuffled_explanations_val.txt", "r") as f:
    shuffled_explanations_val = f.readlines()

with open("shuffled_explanations_test.txt", "r") as f:
    shuffled_explanations_test = f.readlines()

In [None]:
from src.utils import tokenize_function_ex

tokenize_mapping_train = partial(
    tokenize_function_ex, tokenizer=tokenizer, modified_explanations = shuffled_explanations_train
)

tokenize_mapping_val = partial(
    tokenize_function_ex, tokenizer=tokenizer, modified_explanations = shuffled_explanations_val
)

tokenize_mapping_test = partial(
    tokenize_function_ex, tokenizer=tokenizer, modified_explanations = shuffled_explanations_test
)

In [None]:
train_small_tokenized_shex = train_small.map(
    tokenize_mapping_train, batched=True
).with_format("torch")

valid_small_tokenized_shex = valid_small.map(
    tokenize_mapping_val, batched=True
).with_format("torch")

test_small_tokenized_shex = test_small.map(tokenize_mapping_test, batched=True).with_format(
    "torch"
)

print("Shape of train_small_tokenized: ", train_small_tokenized_shex.shape)
print("Shape of valid_small_tokenized: ", valid_small_tokenized_shex.shape)
print("Shape of test_small_tokenized: ", test_small_tokenized_shex.shape)

Map: 100%|██████████| 9824/9824 [00:00<00:00, 11032.52 examples/s]

Shape of train_small_tokenized:  (50000, 9)
Shape of valid_small_tokenized:  (9842, 9)
Shape of test_small_tokenized:  (9824, 9)





In [None]:
train_small_tokenized_shex = train_small_tokenized_shex.remove_columns(["label", "explanation_1", "explanation_2", "explanation_3"])
valid_small_tokenized_shex = valid_small_tokenized_shex.remove_columns(["label", "explanation_1", "explanation_2", "explanation_3"])
test_small_tokenized_shex = test_small_tokenized_shex.remove_columns(["label", "explanation_1", "explanation_2", "explanation_3"])

### Fine Tuning

In [None]:
NUM_EPOCHS = 20

In [None]:
transform_accuracy_shex = partial(
    eval_pred_transform_accuracy,
    tokenizer=tokenizer,
    remove_explanations_from_label=True,
)
compute_accuracy_removing_explanations_shex = partial(
    compute_metrics,
    pred_transform=transform_accuracy_shex,
    metric=evaluate.load("accuracy"),
)

In [None]:
model_ft_shex = T5ForConditionalGeneration.from_pretrained("t5-small")
data_collator_ft_shex = DataCollatorForSeq2Seq(tokenizer, model=model_ft_shex)

In [None]:
training_args_ft_shex = Seq2SeqTrainingArguments(
    save_strategy="no",
    output_dir="task4",
    evaluation_strategy="epoch",
    num_train_epochs=NUM_EPOCHS,
    predict_with_generate=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    generation_max_length=128,
    metric_for_best_model="accuracy",
)

In [None]:
trainer_ft_shex = Seq2SeqTrainer(
    model=model_ft_shex,
    args=training_args_ft_shex,
    train_dataset=train_small_tokenized_shex,
    eval_dataset=valid_small_tokenized_shex,
    compute_metrics=compute_accuracy_removing_explanations_shex,
    data_collator=data_collator_ft_shex,
    tokenizer=tokenizer,
)

In [None]:
trainer_ft_shex.train()

In [None]:
trainer_ft_shex.evaluate(test_small_tokenized_shex)



{'eval_loss': 3.9411330223083496,
 'eval_accuracy': 0.8736767100977199,
 'eval_runtime': 98.3303,
 'eval_samples_per_second': 99.908,
 'eval_steps_per_second': 3.122,
 'epoch': 10.0}

## 2.5 Task 5: Profiling-UD

### Read the results of the automatic annotation stage performed over explanations with Profiling-UD

Each token line of the CoNLL-U output has ten tab-separated fields:

1. **Token ID**: The token's position in the sentence.
2. **Token**: The actual token text (the `FORM` field).
3. **Lemma**: The lemma or base form of the token.
4. **UPOS**: Universal part-of-speech tag.
5. **XPOS**: Language-specific part-of-speech tag (optional).
6. **FEATS**: Morphological features.
7. **Head**: The ID of the token's syntactic head (0 for the root).
8. **Dependency relation**: The type of syntactic relation between the token and its head.
9. **DEPS**: Secondary (enhanced) dependencies.
10. **MISC**: Miscellaneous field, which can contain additional annotations.
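The fields listed above can be read without pandas by splitting each token line on tabs and starting a new sentence at every blank line. The sketch below is a minimal illustration on a hand-written two-sentence sample, not the notebook's parsing code; `parse_conllu` and `lemmatized_sentence` are hypothetical helpers.

```python
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(text: str) -> list[list[dict]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment line
            continue
        current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


def lemmatized_sentence(tokens: list[dict]) -> str:
    """Join the lemmas of regular tokens, skipping multiword ranges such as '1-2'."""
    return " ".join(tok["LEMMA"] for tok in tokens if tok["ID"].isdigit())


sample = (
    "# text = the person is training\n"
    "1\tthe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\tperson\tperson\tNOUN\tNN\t_\t4\tnsubj\t_\t_\n"
    "3\tis\tbe\tAUX\tVBZ\t_\t4\taux\t_\t_\n"
    "4\ttraining\ttrain\tVERB\tVBG\t_\t0\troot\t_\t_\n"
    "\n"
    "1\thorses\thorse\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tjump\tjump\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)
sentences = parse_conllu(sample)
lemmas = [lemmatized_sentence(s) for s in sentences]
print(lemmas)
```

Grouping by blank lines avoids having to infer sentence boundaries from the token ID resetting to 1.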

In [None]:
# import pandas as pd 
# # Define the path to your CoNLL-U file
# conll_file_path = "explanations.conllu"

# # Define column names for the CoNLL-U file
# column_names = [
#     "ID",
#     "TOKEN",
#     "LEMMA",
#     "UPOS",
#     "XPOS",
#     "FEATS",
#     "HEAD",
#     "DEPREL",
#     "DEPS",
#     "MISC"
# ]

# # Read the CoNLL-U file into a DataFrame
# df = pd.read_csv(conll_file_path, delimiter='\t', comment='#', header=None, names=column_names)

# # Reset the index to create a numeric index
# df.reset_index(drop=True, inplace=True)

# # Display the DataFrame
# df[:15]

In [None]:
# df['SAMPLE'] = None

# sample = 0
# for index, row in df.iterrows():
#     if(row["ID"]==1):
#         sample = sample+1
#     df.at[index, "SAMPLE"] = sample

### Prepare the dataset with modified explanations

In [None]:
# # Define the output file path
# output_file = "modified_explanations_1.txt"

# # Write one lemmatized explanation per line
# with open(output_file, "w", encoding="utf-8") as f:
#     for i in range(1, N_TRAIN + 1):  # SAMPLE ids start at 1
#         df_i = df.loc[df["SAMPLE"] == i]
#         modified_exp = " ".join(df_i["LEMMA"].astype(str).values)
#         f.write(modified_exp + "\n")

In [None]:
# with open("modified_explanations_1.txt", "r") as f:
#     explanations_m1 = f.readlines()