[Datasets Documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-the-huggingface-hub)

## Install the Transformers and Datasets libraries

In [None]:
!pip install datasets transformers[sentencepiece]

## How to train on a batch

In [11]:
import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I like to do this",
    "I don't like to do this!",
]
# Convert the BatchEncoding to a plain dict. Passing the BatchEncoding directly to
# train_on_batch raises: "Unsupported value type BatchEncoding returned by IteratorSpec._serialize"
batch = dict(tokenizer(sequences, padding=True, truncation=True, return_tensors="tf"))

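# Train on a single batch.
# Note: the model outputs logits, while the string loss below expects probabilities;
# tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) is the usual choice here.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')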
labels = tf.convert_to_tensor([1, 0])
print(labels)
model.train_on_batch(batch, labels)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tf.Tensor([1 0], shape=(2,), dtype=int32)


0.8382666707038879
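
The sequence classification model returns raw logits rather than probabilities, which is why Keras offers `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`. The following is a minimal NumPy sketch of what such a loss computes. The helper name `sparse_ce_from_logits` and the example logits are made up for illustration and are not taken from the model above.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean sparse categorical cross-entropy computed directly from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    # Log-softmax: log of each class probability.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class in each row.
    picked = log_probs[np.arange(len(labels)), labels]
    return float(-picked.mean())

# Hypothetical logits for two examples and two classes.
example_logits = [[0.0, 0.0], [2.0, -1.0]]
example_labels = [1, 0]
loss = sparse_ce_from_logits(example_logits, example_labels)
print(loss)
```

Integer labels are used as indices into each row, which is what "sparse" refers to: no one-hot encoding is needed.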

In [29]:
test_batch = tokenizer(["I like to do this"])
print(test_batch)
logits = model.predict(test_batch['input_ids']).logits
probs = tf.math.softmax(logits)
print(model.output_names)
probs

{'input_ids': [[101, 1045, 2066, 2000, 2079, 2023, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]]}
None


<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.26608974, 0.7339103 ]], dtype=float32)>
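
As a rough illustration of the `tf.math.softmax` step above, here is a small NumPy sketch of how softmax turns logits into probabilities that sum to 1. The function and the toy logits are assumptions for illustration only; they are not the logits the model produced.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    # Shifting by the max does not change the result but avoids overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy logits for one example and two classes (made up).
toy_logits = [[-0.5, 0.5]]
toy_probs = softmax(toy_logits)
# The predicted class is the index with the highest probability.
predicted_class = int(toy_probs.argmax(axis=-1)[0])
print(toy_probs, predicted_class)
```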

## All datasets on the Hugging Face Hub

In [81]:
from datasets import list_datasets
datasets_list = list_datasets()
len(datasets_list)

1259

## How to load datasets from the Hugging Face Hub

_The MRPC (Microsoft Research Paraphrase Corpus) dataset was introduced in a paper by William B. Dolan and Chris Brockett. It consists of 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases (i.e., whether both sentences mean the same thing)._

In [30]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [38]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[15]

{'idx': 16,
 'label': 0,
 'sentence1': 'Rudder was most recently senior vice president for the Developer & Platform Evangelism Business .',
 'sentence2': 'Senior Vice President Eric Rudder , formerly head of the Developer and Platform Evangelism unit , will lead the new entity .'}

In [35]:
raw_val_dataset = raw_datasets['validation']
raw_val_dataset[87]

{'idx': 812,
 'label': 0,
 'sentence1': 'However , EPA officials would not confirm the 20 percent figure .',
 'sentence2': 'Only in the past few weeks have officials settled on the 20 percent figure .'}

In [36]:
raw_train_dataset.features # shows info about the features of the dataset

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}
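
The `ClassLabel` feature above stores labels as integers and keeps the class names alongside them (the real `ClassLabel` object offers `int2str` and `str2int` for this). A minimal plain-Python sketch of the same mapping, using the names shown in the output above; the helper names `label_to_name` and `name_to_label` are hypothetical:

```python
# Class names copied from the ClassLabel output above; the list index is the label id.
label_names = ["not_equivalent", "equivalent"]

def label_to_name(label_id):
    """Map an integer label to its class name."""
    return label_names[label_id]

def name_to_label(name):
    """Map a class name back to its integer label."""
    return label_names.index(name)

# The training example at index 15 shown earlier has label 0.
print(label_to_name(0))
print(name_to_label("equivalent"))
```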

## How to tokenize the training set, validation set & test set

#### How to access individual sentences in training set 

In [47]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
print('Sentence_1 \n')
print(tokenized_sentences_1['input_ids'][15])
print(tokenizer.convert_ids_to_tokens(tokenized_sentences_1['input_ids'][15]))
print('\n')

tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
print('Sentence_2')
print(tokenized_sentences_2['input_ids'][15])
print(tokenizer.convert_ids_to_tokens(tokenized_sentences_2['input_ids'][15]))

Sentence_1 

[101, 24049, 2001, 2087, 3728, 3026, 3580, 2343, 2005, 1996, 9722, 1004, 4132, 9340, 12439, 2964, 2449, 1012, 102]
['[CLS]', 'rudder', 'was', 'most', 'recently', 'senior', 'vice', 'president', 'for', 'the', 'developer', '&', 'platform', 'evan', '##gel', '##ism', 'business', '.', '[SEP]']


Sentence_2
[101, 3026, 3580, 2343, 4388, 24049, 1010, 3839, 2132, 1997, 1996, 9722, 1998, 4132, 9340, 12439, 2964, 3131, 1010, 2097, 2599, 1996, 2047, 9178, 1012, 102]
['[CLS]', 'senior', 'vice', 'president', 'eric', 'rudder', ',', 'formerly', 'head', 'of', 'the', 'developer', 'and', 'platform', 'evan', '##gel', '##ism', 'unit', ',', 'will', 'lead', 'the', 'new', 'entity', '.', '[SEP]']


#### Tokenize using a function

In [71]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_dataset(dataset):
    encoded = tokenizer(
        dataset["sentence1"],
        dataset["sentence2"],
        padding=True,
        truncation=True,
        max_length=128,  # truncate sequences longer than 128 tokens
        return_tensors='tf',
    )
    return encoded.data

tokenized_datasets = {
    split: tokenize_dataset(raw_datasets[split]) for split in raw_datasets.keys()
}

In [72]:
tokenized_datasets['train']

{'attention_mask': <tf.Tensor: shape=(3668, 103), dtype=int32, numpy=
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(3668, 103), dtype=int32, numpy=
 array([[  101,  2572,  3217, ...,     0,     0,     0],
        [  101,  9805,  3540, ...,     0,     0,     0],
        [  101,  2027,  2018, ...,     0,     0,     0],
        ...,
        [  101,  1000,  2057, ...,     0,     0,     0],
        [  101,  1996, 26828, ...,     0,     0,     0],
        [  101,  1996,  2382, ...,     0,     0,     0]], dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(3668, 103), dtype=int32, numpy=
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, .

In [58]:
tokenized_datasets['validation']

{'attention_mask': <tf.Tensor: shape=(408, 86), dtype=int32, numpy=
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(408, 86), dtype=int32, numpy=
 array([[  101,  2002,  2056, ...,     0,     0,     0],
        [  101, 20201, 22948, ...,     0,     0,     0],
        [  101,  1996,  7922, ...,     0,     0,     0],
        ...,
        [  101,  2651,  1999, ...,     0,     0,     0],
        [  101,  1996,  1055, ...,     0,     0,     0],
        [  101,  4654,  1011, ...,     0,     0,     0]], dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(408, 86), dtype=int32, numpy=
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0,
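
The training tensors above have width 103 and the validation tensors width 86 because `padding=True` pads every sequence to the longest one in the same tokenizer call, and each split is tokenized in its own call. A minimal plain-Python sketch of that behaviour follows. The helper `pad_batch` and the toy token IDs are assumptions for illustration, and the truncation here is simplified: a real tokenizer also keeps the closing `[SEP]` token when it truncates.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to the longest one and build the matching attention mask."""
    if max_length is not None:
        # Simplified truncation: just cut each sequence at max_length.
        sequences = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Toy token-id sequences of different lengths (made up for illustration).
toy_sequences = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
input_ids, attention_mask = pad_batch(toy_sequences, max_length=128)
print(input_ids)
print(attention_mask)
```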

## Preparing the dataset with the `map` method

In [76]:
raw_train_dataset

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

In [79]:
def tokenize_function(example):
    # Called once per row, so padding and return_tensors are left out:
    # return_tensors='tf' would add an extra batch dimension to every row.
    return tokenizer(example['sentence1'], example['sentence2'], truncation=True, max_length=128)

tokenized = raw_train_dataset.map(tokenize_function)  # passing batched=True would tokenize many rows per call
tokenized.column_names

  0%|          | 0/3668 [00:00<?, ?ex/s]

['sentence1',
 'sentence2',
 'label',
 'idx',
 'input_ids',
 'token_type_ids',
 'attention_mask']

In [80]:
tokenized

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

In [87]:
tokenized_dataset = tokenized.remove_columns(['sentence1','sentence2','idx'])
tokenized_dataset = tokenized_dataset.rename_column('label','labels')
tokenized_dataset = tokenized_dataset.with_format('tensorflow')
tokenized_dataset

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

In [89]:
# Select the first 100 rows as a subset
tokenized_dataset.select(range(100))

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 100
})

## Things to note

1. Here the tokenizer returns `input_ids`, `token_type_ids` and `attention_mask`. Not every model's tokenizer returns `token_type_ids`. They appear here because the tokenizer was loaded from the `bert-base-uncased` checkpoint, so it knows what inputs the BERT model expects and tokenizes accordingly. BERT was pretrained on masked language modeling (MLM) and next sentence prediction, a task whose goal is to model the relationship between pairs of sentences.

  With next sentence prediction, the model is provided pairs of sentences (with randomly masked tokens) and asked to predict whether the second sentence follows the first. To make the task non-trivial, half of the time the sentences follow each other in the original document they were extracted from, and the other half of the time the two sentences come from two different documents.

2. The parts of the input corresponding to `[CLS] sentence1 [SEP]` all have a token type ID of 0, while the other parts, corresponding to `sentence2 [SEP]`, all have a token type ID of 1.

3. We can't just pass two separate sequences to the model and get a prediction of whether the two sentences are paraphrases. We need to handle the two sequences as a pair and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:



```
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
# Output:
{
  'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],  # token IDs looked up in the vocabulary
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],  # 0 marks the first sentence, 1 marks the second
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # 1 = attend to this token, 0 = ignore it (padding)
}
```
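
To make the layout of the pair encoding concrete, here is a minimal plain-Python sketch of how the special tokens, `token_type_ids` and `attention_mask` line up for a sentence pair. The helper `build_pair_inputs` is hypothetical and works on already-split tokens; it is not the tokenizer's actual implementation.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with matching type ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask

tokens_a = ["this", "is", "the", "first", "sentence", "."]
tokens_b = ["this", "is", "the", "second", "one", "."]
tokens, token_type_ids, attention_mask = build_pair_inputs(tokens_a, tokens_b)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

The resulting `token_type_ids` has eight 0s followed by seven 1s, matching the tokenizer output shown above.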



To check: Natural Language Inference (NLI), i.e. classifying sentence pairs as contradiction, neutral, or entailment.