### BABILong: usage example

Download the data


In [None]:
!pip install datasets transformers

In [4]:
!git clone https://github.com/booydar/babilong source
!unzip source/data/tasks_1-20_v1-2.zip -d data/

Cloning into 'source'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 16 (delta 1), reused 13 (delta 1), pack-reused 0
Receiving objects: 100% (16/16), 17.11 MiB | 40.92 MiB/s, done.
Resolving deltas: 100% (1/1), done.


In [10]:
import datasets
from transformers import AutoTokenizer
from source.babilong.babilong_utils import TaskDataset, SentenceSampler, NoiseInjectionDataset

Let's inspect the first task, qa1_single-supporting-fact, with Wikitext-2 as the background text.

**Note**: Wikitext is used here for demonstration purposes only. To evaluate your models, use PG19 as the background text.

In [13]:
train_path = "data/tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_train.txt"
test_path = "data/tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_test.txt"

noise_dataset = datasets.load_dataset("wikitext", "wikitext-2-raw-v1")

Create a dataset for the task

In [14]:
task_dataset_train = TaskDataset(train_path)
task_dataset_test = TaskDataset(test_path)
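
The task files use the plain-text bAbI layout: every line starts with an integer ID that restarts at 1 for each new story, and question lines carry the question, the answer, and the IDs of the supporting facts separated by tabs. The sketch below parses that layout on a three-line example. `parse_babi` is a hypothetical helper written for illustration only; it is not the `TaskDataset` implementation, which may differ in details.

```python
def parse_babi(lines):
    """Parse bAbI-format lines into a list of question samples.

    Hypothetical helper for illustration; not the TaskDataset implementation.
    """
    samples = []
    story = {}  # line ID -> fact sentence for the current story
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        idx_str, _, text = line.partition(" ")
        idx = int(idx_str)
        if idx == 1:
            story = {}  # IDs restart at 1 when a new story begins
        if "\t" in text:
            # Question line: question <TAB> answer <TAB> supporting fact IDs
            question, answer, support = text.split("\t")
            support_ids = [int(s) for s in support.split()]
            samples.append({
                "facts": list(story.values()),
                "question": question.strip(),
                "answer": answer,
                "references": [story[i] for i in support_ids],
            })
        else:
            story[idx] = text  # plain fact line
    return samples


example_lines = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary? \tbathroom\t1",
]
parsed = parse_babi(example_lines)
print(parsed[0])
```

Each parsed sample keeps all facts seen so far in the story, plus the subset referenced by the question.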

Create a background text sampler

In [15]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')

noise_sampler_train = SentenceSampler(noise_dataset['train'], tokenizer=tokenizer)
noise_sampler_test = SentenceSampler(noise_dataset['test'], tokenizer=tokenizer)
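
To make the role of the sampler concrete, here is a rough sketch of drawing background sentences until a token budget is filled. It rests on several assumptions: `sample_sentences` is a hypothetical function, whitespace-separated words stand in for GPT-2 tokens, and sentences are split naively on `". "`. The actual `SentenceSampler` may split, order, and count differently.

```python
import random


def sample_sentences(documents, token_budget, count_tokens, seed=0):
    """Collect consecutive sentences until the token budget would be exceeded.

    Hypothetical sketch; not the SentenceSampler implementation.
    """
    rng = random.Random(seed)
    start = rng.randrange(len(documents))  # begin at a random document
    picked, used = [], 0
    for offset in range(len(documents)):
        doc = documents[(start + offset) % len(documents)]
        for sentence in (s.strip() for s in doc.split(". ")):
            if not sentence:
                continue
            cost = count_tokens(sentence)
            if used + cost > token_budget:
                return picked  # budget reached, stop sampling
            picked.append(sentence)
            used += cost
    return picked


def count_ws(text):
    # Assumption: whitespace words approximate tokens for this toy example
    return len(text.split())


toy_docs = [
    "First toy sentence here. Second toy sentence follows.",
    "Another document starts now. It also has two sentences.",
]
toy_budget = 10
toy_sampled = sample_sentences(toy_docs, toy_budget, count_ws)
print(toy_sampled)
```

With a real tokenizer, `count_tokens` could be replaced by something like `lambda s: len(tokenizer.encode(s))`.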

Create the BABILong dataset, which mixes the task facts into the background text:

In [16]:
sample_size = 512  # max number of tokens in a sample

dataset_train = NoiseInjectionDataset(
    task_dataset=task_dataset_train,
    noise_sampler=noise_sampler_train,
    tokenizer=tokenizer,
    sample_size=sample_size,
)

dataset_test = NoiseInjectionDataset(
    task_dataset=task_dataset_test,
    noise_sampler=noise_sampler_test,
    tokenizer=tokenizer,
    sample_size=sample_size,
)
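
The core idea of noise injection can be sketched at the sentence level: the task facts are scattered among background sentences while their relative order is preserved, and the chosen positions are recorded. This is an illustrative sketch under stated assumptions, not the `NoiseInjectionDataset` implementation: `inject_facts` is a hypothetical function, it works on strings rather than token IDs, and the placement strategy (uniform random slots) is an assumption.

```python
import random


def inject_facts(facts, background, seed=0):
    """Scatter facts among background sentences, keeping the facts in order.

    Hypothetical sketch; not the NoiseInjectionDataset implementation.
    """
    rng = random.Random(seed)
    total = len(facts) + len(background)
    # Pick which slots of the merged sequence hold facts; sorting the slots
    # guarantees that the facts appear in their original order.
    fact_slots = sorted(rng.sample(range(total), len(facts)))
    slot_set = set(fact_slots)
    fact_iter, bg_iter = iter(facts), iter(background)
    merged = [
        next(fact_iter) if i in slot_set else next(bg_iter)
        for i in range(total)
    ]
    return merged, fact_slots


toy_facts = ["Mary moved to the bathroom.", "John went to the hallway."]
toy_background = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
toy_merged, toy_positions = inject_facts(toy_facts, toy_background)
print(toy_positions)
print(" ".join(toy_merged))
```

Keeping the facts in order matters for bAbI tasks, because the answer can depend on which fact came last.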

In [17]:
sample = dataset_train[0]
sample.keys()

dict_keys(['facts', 'question', 'answer', 'references', 'background_text', 'fact_positions', 'input_tokens', 'question_tokens', 'target_tokens'])

### Visualize one sample

In [18]:
facts = sample['facts']
question = sample['question']

# Decode token IDs back into text for display
answer = tokenizer.decode(sample['target_tokens'])
# batch_decode handles a list of token sequences, one string per entry
background_text = tokenizer.batch_decode(sample['background_text'])
input_text = tokenizer.decode(sample['input_tokens'])

print(f"Facts: {' '.join(facts)}")
print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"References: {' '.join(sample['references'])}")
print()
print('Background text: ', ' '.join(background_text))
print('Combined input: ', input_text)

print(f"Target: {answer}")


Facts: Mary moved to the bathroom. John went to the hallway.
Question: Where is Mary? 
Answer: bathroom
References: Mary moved to the bathroom.

Background text:   Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3, lit. Valkyria of the Battlefield 3 ), commonly referred to as Valkyria Chronicles III outside Japan, is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable. Released in January 2011 in Japan, it is the third game in the Valkyria series. Employing the same fusion of tactical and real @-@ time gameplay as its predecessors, the story runs parallel to the first game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ".  The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard 