<a href="https://colab.research.google.com/github/LennyRBriones/huggingface/blob/main/my_NLP_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting the Data

In [1]:
%%capture
!pip install datasets transformers evaluate

This notebook uses the MRPC (Microsoft Research Paraphrase Corpus) dataset, one of the 10 datasets in the GLUE benchmark.

https://huggingface.co/datasets/glue/viewer/mrpc/test



In [2]:
from datasets import load_dataset

ds = load_dataset("glue", "mrpc")  # "mrpc" is the GLUE subset to download

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


Exploring the data, we can see that each example contains two sentences and one label.

In [3]:
example = ds["train"][200]
example

{'sentence1': '" Smallpox is not the only threat to the public \'s health , and vaccination is not the only tool for smallpox preparedness , " Strom said .',
 'sentence2': '" Smallpox is not the only threat to the nation \'s health , and vaccination is not the only tool for preparedness , " his introductory statement says .',
 'label': 1,
 'idx': 223}

Next, we inspect the labels.

In [4]:
ds["train"].features["label"]

ClassLabel(names=['not_equivalent', 'equivalent'], id=None)

In [5]:
labels = ds["train"].features["label"]

In [6]:
labels.int2str(0)

'not_equivalent'

In [7]:
labels.int2str(1)

'equivalent'
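
The mapping between integer ids and label names can be pictured with plain Python. The following is a minimal sketch, not the `datasets` implementation: `int2str` and `str2int` here are standalone functions written for illustration, and only the `names` list comes from the output above.

```python
# Label names in the order reported by ds["train"].features["label"].
names = ["not_equivalent", "equivalent"]

def int2str(label_id: int) -> str:
    """Return the label name for an integer id (illustrative helper)."""
    return names[label_id]

def str2int(label_name: str) -> int:
    """Return the integer id for a label name (illustrative helper)."""
    return names.index(label_name)

print(int2str(1))                 # equivalent
print(str2int("not_equivalent"))  # 0
```

The position of a name in the list is its id, so the example at index 200 above (`'label': 1`) is a pair marked as equivalent.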

## Tokenizer

The tokenizer translates text into numbers that the model can process.

In [8]:
from transformers import AutoTokenizer

repo_id = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

In [9]:
tokenized_sentence_1 = tokenizer(ds["train"]["sentence1"][2])
tokenized_sentence_1

{'input_ids': [101, 2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
inputs = tokenizer("I really need vacations", "I dont need vacations")
inputs
# token_type_ids are 0 for tokens of the first sentence and 1 for tokens of the second
# attention_mask marks which tokens to attend to (1) and which to ignore (0, such as padding)

{'input_ids': [101, 1045, 2428, 2342, 10885, 2015, 102, 1045, 2123, 2102, 2342, 10885, 2015, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

**input_ids**: the translation from tokens to numbers.

**attention_mask**: a tensor with the same shape as **input_ids**, filled with 0s and 1s, where 1 means the token is attended to and 0 means it is ignored.

**token_type_ids**: separates the sentences, with 0 marking tokens of the first sentence and 1 marking tokens of the second.
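
To make the role of **attention_mask** concrete, here is a minimal pure-Python sketch of padding two sequences to the same length. It is an illustration under assumptions, not the tokenizer's internals: `pad_batch` is a hypothetical helper, and the padding id `0` is assumed (it corresponds to `[PAD]` in `bert-base-uncased`; other models may use a different id). The input ids are derived from the output above, with each sentence encoded on its own.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one.

    Returns the padded input ids and a matching attention mask
    (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1045, 2428, 2342, 10885, 2015, 102],        # 7 tokens
    [101, 1045, 2123, 2102, 2342, 10885, 2015, 102],  # 8 tokens
])
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The shorter sequence gains one padding id, and the trailing 0 in its mask tells the model to ignore that position.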

In [11]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'i',
 'really',
 'need',
 'vacation',
 '##s',
 '[SEP]',
 'i',
 'don',
 '##t',
 'need',
 'vacation',
 '##s',
 '[SEP]']

`[CLS]` marks the start of the input, the first `[SEP]` separates the two sentences, and the final `[SEP]` marks the end.
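
This layout can be reproduced by hand. The following is a minimal sketch assuming the BERT-style pair format `[CLS] A [SEP] B [SEP]`; `build_pair` is a hypothetical helper, and the ids `101` and `102` for `[CLS]` and `[SEP]` are read off the `bert-base-uncased` output above. Other tokenizers, such as those of the RoBERTa family, use different special tokens and layouts.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids seen in the output above

def build_pair(ids_a, ids_b):
    """Assemble a BERT-style sentence-pair encoding from two id lists."""
    first = [CLS_ID] + ids_a + [SEP_ID]  # [CLS] A [SEP] -> segment 0
    second = ids_b + [SEP_ID]            # B [SEP]       -> segment 1
    input_ids = first + second
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * len(input_ids),
    }

# Ids for "i really need vacation ##s" and "i don ##t need vacation ##s".
pair = build_pair([1045, 2428, 2342, 10885, 2015],
                  [1045, 2123, 2102, 2342, 10885, 2015])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

The result matches the `inputs` dictionary printed earlier, which suggests that the special tokens and segment ids are simple bookkeeping around the two tokenized sentences.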

In [13]:
repo_id = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

In [14]:
def tokenizer_fn(example):
  return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
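
When a function like this is applied with `batched=True`, it receives a dict of lists (one list per column) rather than a single example, and it returns a dict of lists that become new columns. The sketch below imitates that contract in plain Python so it runs without downloading anything. Everything in it is a stand-in: `toy_tokenizer` is a hypothetical whitespace tokenizer with a hypothetical `max_length` and a simple cut-off rule, not the real `distilroberta-base` tokenizer or its truncation strategy.

```python
def toy_tokenizer(first, second, truncation=True, max_length=6):
    """Hypothetical stand-in: split on whitespace and optionally truncate."""
    encoded = []
    for a, b in zip(first, second):
        tokens = a.split() + ["</s>"] + b.split()
        if truncation:
            tokens = tokens[:max_length]  # keep only the first max_length tokens
        encoded.append(tokens)
    return {"tokens": encoded}

def toy_tokenizer_fn(batch):
    # In batched mode, batch["sentence1"] is a list of strings, not one string.
    return toy_tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

batch = {
    "sentence1": ["I really need vacations", "Hello there"],
    "sentence2": ["I dont need vacations", "Hi"],
}
out = toy_tokenizer_fn(batch)
print(out["tokens"][1])  # ['Hello', 'there', '</s>', 'Hi']
```

Processing whole batches lets the tokenizer handle many examples in one call, which is typically faster than calling it once per example.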

In [16]:
prepared_ds = ds.map(tokenizer_fn, batched=True)

## Padding