------------

**Author**: Gunnvant

**Description**: Explore the `datasets` library

-----------

In [None]:
from datasets import load_dataset

In [2]:
raw_datasets = load_dataset("glue","mrpc")

Downloading builder script: 100%|████████████████████████████████████████████████| 28.8k/28.8k [00:00<00:00, 14.9MB/s]
Downloading metadata: 100%|██████████████████████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 19.0MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████| 27.9k/27.9k [00:00<00:00, 10.5MB/s]
Downloading data files:   0%|                                                                   | 0/3 [00:00<?, ?it/s]
Downloading data: 6.22kB [00:00, 4.56MB/s]
Downloading data files:  33%|███████████████████▋                                       | 1/3 [00:00<00:00,  2.18it/s]
Downloading data: 0.00B [00:00, ?B/s]
Downloading data: 1.05MB [00:00, 7.42MB/s]
Downloading data files:  67%|███████████████████████████████████████▎                   | 2/3 [00:00<00:00,  2.33it/s]
Downloading data: 441kB [00:00, 4.73MB/s]
Downloading data files: 100%|███████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.47it

In [3]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [4]:
raw_datasets_train = raw_datasets['train']

In [5]:
raw_datasets_train[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [6]:
raw_datasets_train[3:8]

{'sentence1': ['Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .',
  'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .',
  'The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday .',
  'The DVD-CCA then appealed to the state Supreme Court .'],
 'sentence2': ['Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .',
  'PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .',
  "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier .",
  'The tech-laced Nasdaq Composite .IXIC rallied 30.46 points , or 2.04 percent , to 1,520.15 .',
  'The DVD CCA appealed that decision

In [8]:
raw_datasets_train.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
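
The `label` column stores integers, and the `ClassLabel` feature above lists the class names in id order. A minimal sketch of that lookup in plain Python follows; the helper `label_to_name` is hypothetical and only mirrors the names shown above (the `ClassLabel` object offers `int2str` for the same purpose).

```python
# Label names in id order, copied from the ClassLabel feature shown above.
label_names = ['not_equivalent', 'equivalent']


def label_to_name(label_id):
    """Map an integer label to its class name (hypothetical helper)."""
    return label_names[label_id]


# The first training example shown earlier has label 1.
print(label_to_name(1))
```

So the first training pair, with `label` equal to 1, is marked as a paraphrase.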

In [9]:
from transformers import AutoTokenizer
ckpt = 'bert-base-uncased'

In [10]:
tokenizer = AutoTokenizer.from_pretrained(ckpt)

Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 45.9kB/s]
Downloading (…)lve/main/config.json: 100%|███████████████████████████████████████████| 570/570 [00:00<00:00, 2.27MB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████████████████████████████████████| 232k/232k [00:00<00:00, 599kB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████| 466k/466k [00:00<00:00, 15.2MB/s]


### Tokenize the inputs

In [12]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

In [15]:
tokenized_dataset.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

`tokenized_dataset` now holds the entire tokenized training set in memory. During model training we usually want to avoid this and instead feed data to the model on demand.
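
The idea of encoding on demand can be sketched with a plain Python generator. This is only an illustration under stated assumptions: `iter_batches` and `toy_encode` are hypothetical names, and `toy_encode` is a whitespace splitter standing in for a real tokenizer.

```python
def iter_batches(pairs, encode, batch_size):
    """Yield encoded batches one at a time instead of encoding everything up front."""
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        yield [encode(first, second) for first, second in chunk]


def toy_encode(first, second):
    # Stand-in for a real tokenizer: whitespace tokens joined by a separator marker.
    return first.split() + ["[SEP]"] + second.split()


pairs = [("a b", "c"), ("d", "e f g"), ("h i", "j k")]
batches = iter_batches(pairs, toy_encode, batch_size=2)

# Only the first two pairs are encoded at this point.
first_batch = next(batches)
print([len(tokens) for tokens in first_batch])
```

Each call to `next` encodes just one batch, so memory use depends on the batch size rather than on the size of the dataset.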

In [18]:
tokenized_dataset['input_ids'][1] ## Notice that the inputs are padded

[101,
 9805,
 3540,
 11514,
 2050,
 3079,
 11282,
 2243,
 1005,
 1055,
 2077,
 4855,
 1996,
 4677,
 2000,
 3647,
 4576,
 1999,
 2687,
 2005,
 1002,
 1016,
 1012,
 1019,
 4551,
 1012,
 102,
 9805,
 3540,
 11514,
 2050,
 4149,
 11282,
 2243,
 1005,
 1055,
 1999,
 2786,
 2005,
 1002,
 6353,
 2509,
 2454,
 1998,
 2853,
 2009,
 2000,
 3647,
 4576,
 2005,
 1002,
 1015,
 1012,
 1022,
 4551,
 1999,
 2687,
 1012,
 102,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [20]:
[len(i) for i in tokenized_dataset['input_ids']][0:20]  ## every sequence has the same length: each was padded to the longest sequence in the training set

[103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103,
 103]
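
What `padding=True` does here can be sketched in plain Python. This is a minimal illustration, not the tokenizer's actual implementation: the helper name `pad_to_longest` is hypothetical, the token ids are toy values, and the pad id of 0 is taken from the trailing zeros in the output above.

```python
def pad_to_longest(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one.

    Returns the padded sequences and matching attention masks
    (1 for a real token, 0 for padding).
    """
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Toy token ids, not real BERT vocabulary entries.
toy_ids = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]
padded_ids, attention_masks = pad_to_longest(toy_ids)
print([len(seq) for seq in padded_ids])
print(attention_masks[1])
```

Because the longest sequence sets the length for all the others, a single long example makes every other example carry extra padding.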

Create a training dataset using the `datasets` library and its `map` function.

In [22]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [23]:
tokenized_dataset = raw_datasets.map(tokenize_function)

Map: 100%|███████████████████████████████████████████████████████████████| 3668/3668 [00:00<00:00, 4996.59 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████| 408/408 [00:00<00:00, 4959.04 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████| 1725/1725 [00:00<00:00, 5293.75 examples/s]


In [24]:
tokenized_dataset ## unlike the earlier tokenizer output, this is a DatasetDict with the tokenizer outputs added as new columns

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [26]:
batch = tokenized_dataset['train'][0:100]

In [29]:
[len(i) for i in batch['input_ids']] ## without padding, the inputs in a batch have different lengths

[50,
 59,
 47,
 67,
 59,
 50,
 62,
 32,
 45,
 60,
 51,
 47,
 42,
 61,
 53,
 44,
 53,
 79,
 57,
 70,
 63,
 35,
 54,
 64,
 52,
 47,
 68,
 58,
 60,
 35,
 43,
 34,
 48,
 65,
 27,
 73,
 31,
 50,
 36,
 61,
 57,
 54,
 41,
 64,
 53,
 38,
 68,
 45,
 57,
 39,
 36,
 68,
 63,
 47,
 37,
 62,
 59,
 58,
 50,
 33,
 61,
 34,
 71,
 64,
 74,
 30,
 54,
 53,
 72,
 70,
 44,
 58,
 78,
 40,
 60,
 50,
 55,
 31,
 62,
 46,
 58,
 70,
 49,
 49,
 42,
 34,
 70,
 50,
 34,
 65,
 49,
 39,
 53,
 37,
 28,
 70,
 66,
 68,
 62,
 62]

### Dynamic Batching

To make sure that all the inputs in a batch have the same length, we use a collate function that pads each batch to the length of its longest sequence.

In [30]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [32]:
samples = tokenized_dataset["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

In [33]:
batch = data_collator(samples)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [38]:
[len(i) for i in batch['input_ids']] ## all the inputs in the batch now have the same length

[67, 67, 67, 67, 67, 67, 67, 67]
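
The saving from padding per batch can be worked out from the numbers printed above. The sketch below is plain arithmetic on those lengths; the variable names are illustrative, and 103 is the padded length observed earlier when the whole training set was tokenized with `padding=True`.

```python
# Sequence lengths of the first eight training examples, copied from the output above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]

# Length every sequence had when the whole training set was padded at once.
GLOBAL_MAX_LEN = 103

real_tokens = sum(lengths)
static_total = GLOBAL_MAX_LEN * len(lengths)  # pad to the dataset-wide maximum
dynamic_total = max(lengths) * len(lengths)   # pad to the longest sequence in this batch

print("real tokens:", real_tokens)
print("pad tokens, static padding:", static_total - real_tokens)
print("pad tokens, dynamic padding:", dynamic_total - real_tokens)
```

For this batch, padding to 67 instead of 103 leaves 110 pad tokens rather than 398.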