In [8]:
!python -m spacy download en

⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [57]:
import json
import math
from data_adaptor import DataAdaptor

# 2WikiMultihopQA
Press et al. use only the question-answer pairs from this dataset; they do not provide the context passages to the model.
> "we use only the question-answer pairs from these datasets, not any passages of relevant text that they contain. These datasets both contain 2-hop compositional questions sourced from facts that appear in Wikipedia articles." - Press, et al.

The authors do not use this dataset to measure the compositionality gap, because that measurement requires known sub-questions and their answers.
> "Note that the rest of this section shows that elicitive prompts improve performance but does not show that they narrow the compositionality gap since we lack sub-questions for datasets other than CC." - Press, et al.

In [10]:
from data_loaders import load_2WikiMultihopQA

In [11]:
n_examples = 50
train_fraction = .8

In [12]:
# Load the training data
wiki_sample = load_2WikiMultihopQA(n_examples=n_examples, split='train')

### Adapting to Self-Ask Exemplar

In [13]:
wiki_adaptor = DataAdaptor(dataset="2WikiMultihopQA")

In [14]:
wiki_examplars = wiki_adaptor.generate_examplars(wiki_sample, strategy="self-ask")
for examplar in wiki_examplars:
    print(examplar)

Question: Are director of film Move (1970 Film) and director of film Méditerranée (1963 Film) from the same country?
Are follow up questions needed here: Yes.
Follow up: Who is the director of Move (1970 film)?
Intermediate answer: Stuart Rosenberg
Follow up: What is the director of Méditerranée (1963 film)?
Intermediate answer: Jean-Daniel Pollet
Follow up: What is the country of citizenship of Stuart Rosenberg?
Intermediate answer: American
Follow up: What is the country of citizenship of Jean-Daniel Pollet?
Intermediate answer: French
So the final answer is: no

Question: Do both films The Falcon (Film) and Valentin The Good have the directors from the same country?
Are follow up questions needed here: Yes.
Follow up: Who is the director of The Falcon (film)?
Intermediate answer: Vatroslav Mimica
Follow up: Who is the director of Valentin the Good?
Intermediate answer: Martin Frič
Follow up: What is the country of citizenship of Vatroslav Mimica?
Intermediate answer: Croatian
Follow

### Adapting to Self-Ask Training Example
We can augment each target text in the dataset with the self-ask rationale, so that a language model fine-tuned on these targets learns to generate the rationale before its final answer.
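As a rough illustration of what such a target looks like, the sketch below assembles a self-ask rationale from a list of (follow-up, intermediate answer) pairs. The function name `build_self_ask_target` and its arguments are hypothetical and are not part of `DataAdaptor`; the line format is inferred from the examples printed in this notebook.

```python
def build_self_ask_target(steps, final_answer):
    """Assemble a self-ask rationale string (hypothetical helper, not DataAdaptor's API).

    steps: list of (follow_up_question, intermediate_answer) tuples.
    """
    # "Yes." completes the prompt's "Are follow up questions needed here:" line.
    lines = ["Yes."]
    for question, answer in steps:
        lines.append(f"Follow up: {question}")
        lines.append(f"Intermediate answer: {answer}")
    lines.append(f"So the final answer is: {final_answer}")
    return "\n".join(lines)


# Decomposition copied from the first exemplar printed above.
steps = [
    ("Who is the director of Move (1970 film)?", "Stuart Rosenberg"),
    ("What is the director of Méditerranée (1963 film)?", "Jean-Daniel Pollet"),
    ("What is the country of citizenship of Stuart Rosenberg?", "American"),
    ("What is the country of citizenship of Jean-Daniel Pollet?", "French"),
]
target = build_self_ask_target(steps, "no")
print(target)
```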

In [15]:
wiki_training_examples = wiki_adaptor.generate_training_examples(wiki_sample, strategy="self-ask")
for training_example in wiki_training_examples:
    print(json.dumps(training_example, indent=4))

Generating 2WikiMultihopQA self-ask training examples: 100%|███████████████████████████████████████████████████| 50/50 [00:00<00:00, 147.23it/s]
Structuring 2WikiMultihopQA self-ask training examples: 100%|█████████████████████████████████████████████████| 50/50 [00:00<00:00, 2526.38it/s]

{
    "prompt": "Fact #0: Move is a 1970 American comedy film starring Elliott Gould, Paula Prentiss and Genevi\u00e8ve Wa\u00efte, and directed by Stuart Rosenberg.\nFact #1: M\u00e9diterran\u00e9e is a 1963 French experimental film directed by Jean-Daniel Pollet with assistance from Volker Schl\u00f6ndorff.\nFact #2: Stuart Rosenberg (August 11, 1927 \u2013 March 15, 2007) was an American film and television director whose motion pictures include \"Cool Hand Luke\" (1967), \"Voyage of the Damned\" (1976), \"The Amityville Horror\" (1979), and \"The Pope of Greenwich Village\" (1984).\nFact #3: Jean-Daniel Pollet (1936\u20132004) was a French film director and screenwriter who was most active in the 1960s and 1970s.\n\nQuestion: Are director of film Move (1970 Film) and director of film M\u00e9diterran\u00e9e (1963 Film) from the same country?\nAre follow up questions needed here:\n",
    "target": "Yes.\nFollow up: Who is the director of Move (1970 film)?\nIntermediate answer: Stuart




In [16]:
print(wiki_training_examples[0]["prompt"])
print(wiki_training_examples[0]["target"])

Fact #0: Move is a 1970 American comedy film starring Elliott Gould, Paula Prentiss and Geneviève Waïte, and directed by Stuart Rosenberg.
Fact #1: Méditerranée is a 1963 French experimental film directed by Jean-Daniel Pollet with assistance from Volker Schlöndorff.
Fact #2: Stuart Rosenberg (August 11, 1927 – March 15, 2007) was an American film and television director whose motion pictures include "Cool Hand Luke" (1967), "Voyage of the Damned" (1976), "The Amityville Horror" (1979), and "The Pope of Greenwich Village" (1984).
Fact #3: Jean-Daniel Pollet (1936–2004) was a French film director and screenwriter who was most active in the 1960s and 1970s.

Question: Are director of film Move (1970 Film) and director of film Méditerranée (1963 Film) from the same country?
Are follow up questions needed here:

Yes.
Follow up: Who is the director of Move (1970 film)?
Intermediate answer: Stuart Rosenberg
Follow up: What is the director of Méditerranée (1963 film)?
Intermediate answer: Jea

### Direct Prompting Training Examples
Provide the facts and ask the question directly; the target is the answer alone, with no intermediate rationale.

In [17]:
direct_training_examples = wiki_adaptor.generate_training_examples(
    wiki_sample[0],
    strategy="direct"
)

Generating 2WikiMultihopQA direct training examples: 100%|█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13706.88it/s]
Structuring 2WikiMultihopQA direct training examples: 100%|█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1742.54it/s]


In [18]:
print("--------- Augmented Prompt ---------")
print(direct_training_examples[0]["prompt"])
print("--------- Target ---------")
print(direct_training_examples[0]["target"])

--------- Augmented Prompt ---------
Fact #0: Move is a 1970 American comedy film starring Elliott Gould, Paula Prentiss and Geneviève Waïte, and directed by Stuart Rosenberg.
Fact #1: Méditerranée is a 1963 French experimental film directed by Jean-Daniel Pollet with assistance from Volker Schlöndorff.
Fact #2: Stuart Rosenberg (August 11, 1927 – March 15, 2007) was an American film and television director whose motion pictures include "Cool Hand Luke" (1967), "Voyage of the Damned" (1976), "The Amityville Horror" (1979), and "The Pope of Greenwich Village" (1984).
Fact #3: Jean-Daniel Pollet (1936–2004) was a French film director and screenwriter who was most active in the 1960s and 1970s.

Question: Are director of film Move (1970 Film) and director of film Méditerranée (1963 Film) from the same country?
Answer:
--------- Target ---------
no
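The direct prompt layout printed above can be reproduced with a few lines of string formatting. This is a sketch only: `build_direct_prompt` is a hypothetical name, the layout is inferred from the printed output, and the real `DataAdaptor` may differ in whitespace details. Only the first two facts are used to keep the example short.

```python
def build_direct_prompt(facts, question):
    """Format numbered facts and a question as a direct prompt (hypothetical helper)."""
    fact_lines = [f"Fact #{i}: {fact}" for i, fact in enumerate(facts)]
    return "\n".join(fact_lines) + f"\n\nQuestion: {question}\nAnswer:"


facts = [
    "Move is a 1970 American comedy film starring Elliott Gould, Paula Prentiss "
    "and Geneviève Waïte, and directed by Stuart Rosenberg.",
    "Méditerranée is a 1963 French experimental film directed by Jean-Daniel Pollet "
    "with assistance from Volker Schlöndorff.",
]
question = (
    "Are director of film Move (1970 Film) and director of film "
    "Méditerranée (1963 Film) from the same country?"
)
direct_prompt = build_direct_prompt(facts, question)
print(direct_prompt)
```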


### Augment with In-Context Exemplars and Self-Ask Rationale Targets
We can combine the two approaches above to create an augmented fine-tuning dataset:
1. The prompt text contains in-context exemplars.
2. The target text contains the self-ask rationale.
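To make the combination concrete, here is a minimal sketch of how exemplars might be prepended to a prompt. The helper `prepend_exemplars` is hypothetical; the "Example Response" header and blank-line separators are inferred from the prompts printed in this notebook, and the strings below are placeholders rather than dataset content.

```python
def prepend_exemplars(exemplars, base_prompt, header="Example Response"):
    """Prefix a prompt with in-context exemplars (hypothetical helper)."""
    # Each exemplar gets its own header line; blocks are separated by a blank line.
    blocks = [f"{header}\n{exemplar}" for exemplar in exemplars]
    return "\n\n".join(blocks + [base_prompt])


# Placeholder strings standing in for a real exemplar and a real fact/question prompt.
exemplars = ["Question: Q1\nAre follow up questions needed here: Yes.\nSo the final answer is: A1"]
base_prompt = "Fact #0: F0\n\nQuestion: Q2\nAre follow up questions needed here:\n"
augmented_prompt = prepend_exemplars(exemplars, base_prompt)
print(augmented_prompt)
```

The target text is left unchanged by this step; only the prompt grows, which is why the prompt token counts matter when choosing how many exemplars to include.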

In [19]:
training_examplars = wiki_examplars[:4]
augmented_example = wiki_adaptor.generate_training_examples(
    wiki_sample[0],
    strategy="self-ask",
    examplars=training_examplars,
)[0]
print("--------- Augmented Prompt ---------")
print(augmented_example["prompt"])
print("--------- Target ---------")
print(augmented_example["target"])

Generating 2WikiMultihopQA self-ask training examples: 100%|██████████████████████████████████████████████████████| 1/1 [00:00<00:00, 72.62it/s]
Structuring 2WikiMultihopQA self-ask training examples: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 531.06it/s]

--------- Augmented Prompt ---------
Example Response
Question: Are director of film Move (1970 Film) and director of film Méditerranée (1963 Film) from the same country?
Are follow up questions needed here: Yes.
Follow up: Who is the director of Move (1970 film)?
Intermediate answer: Stuart Rosenberg
Follow up: What is the director of Méditerranée (1963 film)?
Intermediate answer: Jean-Daniel Pollet
Follow up: What is the country of citizenship of Stuart Rosenberg?
Intermediate answer: American
Follow up: What is the country of citizenship of Jean-Daniel Pollet?
Intermediate answer: French
So the final answer is: no

Example Response
Question: Do both films The Falcon (Film) and Valentin The Good have the directors from the same country?
Are follow up questions needed here: Yes.
Follow up: Who is the director of The Falcon (film)?
Intermediate answer: Vatroslav Mimica
Follow up: Who is the director of Valentin the Good?
Intermediate answer: Martin Frič
Follow up: What is the country o




In [20]:
training_examplars = wiki_examplars[:1]
augmented_examples = wiki_adaptor.generate_training_examples(
    wiki_sample,
    strategy="self-ask",
    examplars=training_examplars,
)

# look at token counts in prompt (context size)
print("context size")
for example in augmented_examples:
    print(example["num_prompt_tokens"])

# look at token counts
print("total tokens")
for example in augmented_examples:
    print(example["num_tokens"])

Generating 2WikiMultihopQA self-ask training examples: 100%|███████████████████████████████████████████████████| 50/50 [00:00<00:00, 152.47it/s]
Structuring 2WikiMultihopQA self-ask training examples: 100%|█████████████████████████████████████████████████| 50/50 [00:00<00:00, 1692.00it/s]

context size
363
378
307
220
279
258
239
285
230
220
259
231
256
216
221
223
222
261
227
259
352
236
209
235
278
285
322
218
306
230
319
235
235
217
376
279
272
237
197
238
305
232
197
327
216
353
347
323
239
292
total tokens
463
499
404
278
342
319
297
379
279
271
306
279
304
258
276
274
272
330
278
306
452
287
259
282
326
336
421
265
408
279
427
284
284
271
475
323
335
286
250
292
354
286
242
417
261
451
452
418
290
345
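Rather than scanning the raw lists, the counts can be summarized and compared against a model's context limit. The sketch below uses only the first five "total tokens" values printed above; `summarize_token_counts` is a hypothetical helper, and the 512-token limit is an assumption chosen for illustration.

```python
import statistics


def summarize_token_counts(counts, max_length):
    """Return basic statistics and the number of examples longer than max_length."""
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "over_limit": sum(1 for count in counts if count > max_length),
    }


# First five "total tokens" values from the output above.
total_tokens = [463, 499, 404, 278, 342]
summary = summarize_token_counts(total_tokens, max_length=512)
print(summary)
```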





In [53]:
print(json.dumps(augmented_examples[0], indent=4))

{
    "prompt": "Example Response\nQuestion: Are director of film Move (1970 Film) and director of film M\u00e9diterran\u00e9e (1963 Film) from the same country?\nAre follow up questions needed here: Yes.\nFollow up: Who is the director of Move (1970 film)?\nIntermediate answer: Stuart Rosenberg\nFollow up: What is the director of M\u00e9diterran\u00e9e (1963 film)?\nIntermediate answer: Jean-Daniel Pollet\nFollow up: What is the country of citizenship of Stuart Rosenberg?\nIntermediate answer: American\nFollow up: What is the country of citizenship of Jean-Daniel Pollet?\nIntermediate answer: French\nSo the final answer is: no\n\nFact #0: Move is a 1970 American comedy film starring Elliott Gould, Paula Prentiss and Genevi\u00e8ve Wa\u00efte, and directed by Stuart Rosenberg.\nFact #1: M\u00e9diterran\u00e9e is a 1963 French experimental film directed by Jean-Daniel Pollet with assistance from Volker Schl\u00f6ndorff.\nFact #2: Stuart Rosenberg (August 11, 1927 \u2013 March 15, 2007) 

# Hugging Face Code

## Preprocess Data 
https://huggingface.co/docs/transformers/preprocessing

## Tokenize

In [61]:
from transformers import AutoTokenizer

In [62]:
model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [63]:
# Split index: the first n_train examples are used for training, the rest for validation
n_train = math.floor(n_examples * train_fraction)
train_encoded_input = tokenizer([x['prompt'] for x in augmented_examples[:n_train]],
                                padding=True, truncation=True, return_tensors="pt") # TODO https://huggingface.co/docs/transformers/pad_truncation
# Validation takes the held-out tail so that it does not overlap the training examples
val_encoded_input = tokenizer([x['prompt'] for x in augmented_examples[n_train:]],
                                padding=True, truncation=True, return_tensors="pt") # TODO https://huggingface.co/docs/transformers/pad_truncation
train_encoded_input

{'input_ids': tensor([[    0, 48837, 19121,  ...,     1,     1,     1],
        [    0, 48837, 19121,  ...,    35, 50118,     2],
        [    0, 48837, 19121,  ...,     1,     1,     1],
        ...,
        [    0, 48837, 19121,  ...,     1,     1,     1],
        [    0, 48837, 19121,  ...,     1,     1,     1],
        [    0, 48837, 19121,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

## Finetune
https://huggingface.co/docs/transformers/training

In [64]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

In [65]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Downloading model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [73]:
outputs = model(**tokenizer(augmented_examples[0]['prompt'], padding=True, truncation=True, return_tensors="pt"))
outputs

Seq2SeqQuestionAnsweringModelOutput(loss=None, start_logits=tensor([[ 1.5487e+00, -1.3840e+00,  7.9838e-01,  7.3678e-01, -1.4034e-01,
          1.7010e-01, -7.9067e-01,  6.1418e-01,  6.6816e-01, -9.8489e-01,
         -2.6347e-01, -7.0753e-01, -1.3760e+00, -1.1047e+00,  1.2860e-01,
          7.8460e-01,  8.0529e-03,  4.9686e-01, -9.9922e-01, -4.1880e-01,
         -1.0232e+00, -1.1950e+00, -1.2410e+00,  2.3885e-01,  4.8444e-01,
         -2.1338e-01, -7.0997e-01,  8.2397e-01,  1.2362e+00, -1.3766e-01,
         -3.1007e-01,  6.7590e-01,  1.2313e-01,  2.9726e-01, -6.6358e-01,
         -8.2055e-02, -8.2189e-02,  4.5741e-01,  1.9832e-01,  1.3409e+00,
         -9.0332e-02,  9.4417e-01,  2.5070e-01, -1.3724e-01, -4.8651e-01,
         -5.6759e-01, -4.2708e-01, -1.0919e-01,  2.3893e-01,  7.9534e-02,
         -1.7495e-02,  5.6429e-01, -6.5310e-01, -3.2092e-01, -8.7503e-01,
         -2.4592e-01, -1.1015e+00,  4.1229e-02, -9.5609e-02, -7.9687e-03,
          5.9198e-01, -6.7937e-02,  4.3292e-01,  2.6

In [66]:
training_args = TrainingArguments(output_dir="data/BART-finetuning", evaluation_strategy='epoch')

## Evaluate

In [45]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0


In [46]:
import numpy as np
import evaluate

In [54]:
metric = evaluate.load("accuracy") # TODO change to rouge_1

In [55]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
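The TODO above mentions switching to ROUGE-1. For intuition, the following is a simplified, dependency-free sketch of a ROUGE-1 F1 score based on clipped unigram overlap. It is an assumption-laden stand-in (whitespace tokenization, lowercasing, no stemming) and will not necessarily match the scores of the `rouge` metric in the `evaluate` library; `rouge1_f1` is a hypothetical name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap between two whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller of the two sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("So the final answer is: no", "So the final answer is: yes"))
```

Unlike accuracy over argmax logits, an overlap score of this kind compares generated text with the reference text, which suits free-form self-ask targets better.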

In [59]:
trainer = Trainer(model=model, args=training_args, train_dataset=train_encoded_input, eval_dataset=val_encoded_input, compute_metrics=compute_metrics)

In [60]:
trainer.train()

wandb: Currently logged in as: adamwein (compositional-reasoning-finetuning). Use `wandb login --relogin` to force relogin


ValueError: The batch received was empty, your model won't be able to train on it. Double-check that your training dataset contains keys expected by the model: input_ids,attention_mask,head_mask,inputs_embeds,start_positions,end_positions,output_attentions,output_hidden_states,return_dict,start_positions,label_ids,end_positions,label.
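One plausible reading of this error is that `Trainer` was handed the tokenizer's column-oriented output directly, with no labels, rather than a dataset of per-example records. The sketch below shows the general shape of a map-style dataset, assuming the trainer indexes examples by integer and expects a dict of model inputs per example. The class name `EncodedDataset` is hypothetical, plain lists stand in for tensors, and the `labels` key is an assumption that fits sequence-to-sequence models; a question-answering head expects `start_positions` and `end_positions` instead, as the error message lists.

```python
class EncodedDataset:
    """Map-style dataset over column-oriented encodings (hypothetical helper)."""

    def __init__(self, encodings, labels):
        # encodings: dict of columns, e.g. {"input_ids": [[...], ...], "attention_mask": [[...], ...]}
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Pick row idx from every column and attach the matching label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# Toy values; real inputs would come from the tokenizer.
encodings = {
    "input_ids": [[0, 5, 2], [0, 7, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = [[0, 9, 2], [0, 8, 2]]
dataset = EncodedDataset(encodings, labels)
print(len(dataset), dataset[1])
```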

## Preprocessing

In [100]:
from huggingface_hub import notebook_login

notebook_login()


In [101]:
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

In [102]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [103]:
tokenizer(augmented_example['prompt'])  # TODO: find the best way to truncate the examples (optionally pass augmented_example['target'] as a second sequence)

Token indices sequence length is longer than the specified maximum sequence length for this model (663 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': [101, 2742, 3433, 3160, 1024, 2024, 2472, 1997, 2143, 2693, 1006, 3359, 2143, 1007, 1998, 2472, 1997, 2143, 19960, 21646, 18053, 2063, 1006, 3699, 2143, 1007, 2013, 1996, 2168, 2406, 1029, 2024, 3582, 2039, 3980, 2734, 2182, 1024, 2748, 1012, 3582, 2039, 1024, 2040, 2003, 1996, 2472, 1997, 2693, 1006, 3359, 2143, 1007, 1029, 7783, 3437, 1024, 6990, 21069, 3582, 2039, 1024, 2054, 2003, 1996, 2472, 1997, 19960, 21646, 18053, 2063, 1006, 3699, 2143, 1007, 1029, 7783, 3437, 1024, 3744, 1011, 3817, 8554, 3388, 3582, 2039, 1024, 2054, 2003, 1996, 2406, 1997, 9068, 1997, 6990, 21069, 1029, 7783, 3437, 1024, 2137, 3582, 2039, 1024, 2054, 2003, 1996, 2406, 1997, 9068, 1997, 3744, 1011, 3817, 8554, 3388, 1029, 7783, 3437, 1024, 2413, 2061, 1996, 2345, 3437, 2003, 1024, 2053, 2742, 3433, 3160, 1024, 2079, 2119, 3152, 1996, 11684, 1006, 2143, 1007, 1998, 24632, 1996, 2204, 2031, 1996, 5501, 2013, 1996, 2168, 2406, 1029, 2024, 3582, 2039, 3980, 2734, 2182, 1024, 2748, 1012, 3582, 2039

In [104]:
max_length = 512  # maximum sequence length reported for this model in the warning above

In [127]:
tokenized_example = tokenizer(
    augmented_examples[0]['prompt'],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
tokenized_example

{'input_ids': [[101, 2742, 3433, 3160, 1024, 2024, 2472, 1997, 2143, 2693, 1006, 3359, 2143, 1007, 1998, 2472, 1997, 2143, 19960, 21646, 18053, 2063, 1006, 3699, 2143, 1007, 2013, 1996, 2168, 2406, 1029, 2024, 3582, 2039, 3980, 2734, 2182, 1024, 2748, 1012, 3582, 2039, 1024, 2040, 2003, 1996, 2472, 1997, 2693, 1006, 3359, 2143, 1007, 1029, 7783, 3437, 1024, 6990, 21069, 3582, 2039, 1024, 2054, 2003, 1996, 2472, 1997, 19960, 21646, 18053, 2063, 1006, 3699, 2143, 1007, 1029, 7783, 3437, 1024, 3744, 1011, 3817, 8554, 3388, 3582, 2039, 1024, 2054, 2003, 1996, 2406, 1997, 9068, 1997, 6990, 21069, 1029, 7783, 3437, 1024, 2137, 3582, 2039, 1024, 2054, 2003, 1996, 2406, 1997, 9068, 1997, 3744, 1011, 3817, 8554, 3388, 1029, 7783, 3437, 1024, 2413, 2061, 1996, 2345, 3437, 2003, 1024, 2053, 2755, 1001, 1014, 1024, 2693, 2003, 1037, 3359, 2137, 4038, 2143, 4626, 9899, 14913, 1010, 13723, 3653, 16778, 4757, 1998, 20245, 3524, 2063, 1010, 1998, 2856, 2011, 6990, 21069, 1012, 2755, 1001, 1015, 1024, 
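The nested `input_ids` list above reflects overflow handling: a sequence longer than `max_length` is returned as several windows. The sketch below imitates the idea on a plain list of token ids. It is a simplification that ignores special tokens and sequence pairs, so window sizes will not exactly match the tokenizer's; `split_with_overflow` is a hypothetical helper, and `stride` here means the number of tokens shared between consecutive windows.

```python
def split_with_overflow(token_ids, max_length, stride=0):
    """Split token ids into windows of at most max_length, overlapping by stride tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        # Step forward, keeping `stride` tokens of overlap with the previous window.
        start += max_length - stride
    return chunks


# 663 is the sequence length reported in the tokenizer warning earlier in this section.
chunks = split_with_overflow(list(range(663)), max_length=512)
print([len(chunk) for chunk in chunks])
```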

In [108]:
import math

from datasets import Dataset
import pandas as pd

In [114]:
augmented_examples_df = pd.DataFrame(augmented_examples)
train_ds = Dataset.from_dict(augmented_examples_df.loc[:math.floor(n_examples * train_fraction)].to_dict())
val_ds = Dataset.from_dict(augmented_examples_df.loc[(math.floor(n_examples * train_fraction) + 1):].to_dict())
del augmented_examples_df
train_ds

Dataset({
    features: ['prompt', 'target', 'num_prompt_tokens', 'num_target_tokens', 'num_tokens'],
    num_rows: 41
})
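Note that pandas `.loc` slicing is label-based and includes both endpoints, which is why `train_ds` has 41 rows rather than 40. If an exact 40/10 split is wanted, positional slicing avoids the off-by-one. A minimal sketch on a plain list, with `split_examples` as a hypothetical helper:

```python
import math


def split_examples(examples, train_fraction):
    """Split a list into non-overlapping train and validation parts by position."""
    n_train = math.floor(len(examples) * train_fraction)
    return examples[:n_train], examples[n_train:]


# Integers stand in for the 50 training examples.
train_part, val_part = split_examples(list(range(50)), 0.8)
print(len(train_part), len(val_part))
```

The same positional behaviour is available in pandas through `.iloc`, whose slices exclude the end position.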

In [130]:
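def preprocess_features(example):
    # Tokenize the prompt/target pair; only the target (second sequence) is truncated,
    # and tokens beyond max_length are returned as overflow windows.
    tokenized_example = tokenizer(
        example['prompt'],
        example['target'],
        max_length=max_length,
        truncation="only_second",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    return tokenized_example

list(map(preprocess_features, augmented_examples))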

[{'input_ids': [[101, 2742, 3433, 3160, 1024, 2024, 2472, 1997, 2143, 2693, 1006, 3359, 2143, 1007, 1998, 2472, 1997, 2143, 19960, 21646, 18053, 2063, 1006, 3699, 2143, 1007, 2013, 1996, 2168, 2406, 1029, 2024, 3582, 2039, 3980, 2734, 2182, 1024, 2748, 1012, 3582, 2039, 1024, 2040, 2003, 1996, 2472, 1997, 2693, 1006, 3359, 2143, 1007, 1029, 7783, 3437, 1024, 6990, 21069, 3582, 2039, 1024, 2054, 2003, 1996, 2472, 1997, 19960, 21646, 18053, 2063, 1006, 3699, 2143, 1007, 1029, 7783, 3437, 1024, 3744, 1011, 3817, 8554, 3388, 3582, 2039, 1024, 2054, 2003, 1996, 2406, 1997, 9068, 1997, 6990, 21069, 1029, 7783, 3437, 1024, 2137, 3582, 2039, 1024, 2054, 2003, 1996, 2406, 1997, 9068, 1997, 3744, 1011, 3817, 8554, 3388, 1029, 7783, 3437, 1024, 2413, 2061, 1996, 2345, 3437, 2003, 1024, 2053, 2755, 1001, 1014, 1024, 2693, 2003, 1037, 3359, 2137, 4038, 2143, 4626, 9899, 14913, 1010, 13723, 3653, 16778, 4757, 1998, 20245, 3524, 2063, 1010, 1998, 2856, 2011, 6990, 21069, 1012, 2755, 1001, 1015, 1024,

## Finetuning

In [51]:
! pip install datasets transformers
! pip install accelerate -U

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [52]:
# Git LFS must be installed separately: https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage?platform=mac

In [115]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to

In [118]:
batch_size = 8

In [119]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [120]:
from transformers import default_data_collator

data_collator = default_data_collator

In [121]:
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

In [122]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

/Users/adamweinberger/MIDS/compositional-reasoning-finetuning/distilbert-base-uncased-finetuned-squad is already a clone of https://huggingface.co/adam-wein/distilbert-base-uncased-finetuned-squad. Make sure you pull the latest changes with `repo.git_pull()`.


In [123]:
trainer.train()



IndexError: Invalid key: 40 is out of bounds for size 0
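This `IndexError` is consistent with `Trainer` dropping dataset columns that the model's `forward` method does not accept (its default `remove_unused_columns` behaviour). Since `train_ds` holds only raw text and token counts, that would leave nothing to index. A small sanity check before training can catch this early. The sketch below works under that assumption: `missing_model_inputs` is a hypothetical helper, and the required names are taken from the keys listed in the earlier `ValueError`.

```python
def missing_model_inputs(columns, required=("input_ids", "attention_mask")):
    """Return the required model input names that are absent from the dataset columns."""
    return [name for name in required if name not in columns]


# Column names of train_ds as printed earlier in this section.
columns = ["prompt", "target", "num_prompt_tokens", "num_target_tokens", "num_tokens"]
missing = missing_model_inputs(columns)
if missing:
    print(f"Dataset lacks model inputs {missing}; tokenize it before training.")
```

In practice this would mean mapping a tokenization function over `train_ds` and `val_ds` before constructing the trainer, so that the expected columns exist.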