<a href="https://colab.research.google.com/github/HanSong19/Hugging-Face/blob/main/2.2%20Fine-Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download dataset
MRPC (Microsoft Research Paraphrase Corpus) dataset, one of the tasks in the GLUE benchmark


In [1]:
!pip install datasets evaluate transformers[sentencepiece]



In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset



In [3]:
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

The DatasetDict contains three splits: train, validation, and test.
Each split can be accessed by indexing with its name, and each example by indexing with an integer.

In [4]:
raw_train_dataset = raw_datasets['train']

In [5]:
raw_train_dataset[1001]

{'sentence1': 'The redesigned Finder also features search , coloured labels for customized organization of documents and projects and dynamic browsing of the network for Mac , Windows and UNIX file servers .',
 'sentence2': 'It also supports coloured labels to better organise documents , and dynamic browsing of the network for Mac , Windows and Unix file servers .',
 'label': 0,
 'idx': 1122}

The features attribute shows the type of each column, including which integer corresponds to which label name (that is, what each label means).

In [6]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

This shows that label is a ClassLabel: 0 means 'not_equivalent' and 1 means 'equivalent'.
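
As a minimal sketch of that mapping in plain Python: the `label_names` list is copied from the ClassLabel output above, and `label_name` is a hypothetical helper written for illustration, not a function from the `datasets` library.

```python
# Label names copied from the ClassLabel feature printed above.
label_names = ["not_equivalent", "equivalent"]


def label_name(label_id):
    """Return the name for an integer label (hypothetical helper)."""
    return label_names[label_id]


# The example at train index 1001 shown earlier has label 0.
example_label = 0
print(example_label, "->", label_name(example_label))
```

On the real dataset, the ClassLabel feature itself should offer the same lookup through its `int2str` method, for example `raw_train_dataset.features['label'].int2str(0)`.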

In [7]:
print(raw_train_dataset[15])
print(raw_train_dataset[87])

{'sentence1': 'Rudder was most recently senior vice president for the Developer & Platform Evangelism Business .', 'sentence2': 'Senior Vice President Eric Rudder , formerly head of the Developer and Platform Evangelism unit , will lead the new entity .', 'label': 0, 'idx': 16}
{'sentence1': 'Tuition at four-year private colleges averaged $ 19,710 this year , up 6 percent from 2002 .', 'sentence2': 'For the current academic year , tuition at public colleges averaged $ 4,694 , up almost $ 600 from the year before .', 'label': 1, 'idx': 100}


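## Comparing multiple sentences
Some tasks take a pair of sentences as input. For example, by comparing two sentences, a model can decide whether they are paraphrases of each other (as in MRPC), or whether one contradicts, is neutral toward, or entails the other.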


In [8]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
#if I want to use the sentences from the dataset
sentence1 = raw_train_dataset[15]['sentence1']
sentence2 = raw_train_dataset[15]['sentence2']
print(sentence1 , sentence2)
inputs = tokenizer(sentence1, sentence2)
print(inputs)

Rudder was most recently senior vice president for the Developer & Platform Evangelism Business . Senior Vice President Eric Rudder , formerly head of the Developer and Platform Evangelism unit , will lead the new entity .
{'input_ids': [101, 24049, 2001, 2087, 3728, 3026, 3580, 2343, 2005, 1996, 9722, 1004, 4132, 9340, 12439, 2964, 2449, 1012, 102, 3026, 3580, 2343, 4388, 24049, 1010, 3839, 2132, 1997, 1996, 9722, 1998, 4132, 9340, 12439, 2964, 3131, 1010, 2097, 2599, 1996, 2047, 9178, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [10]:
#convert the ids to tokens

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])
print(tokens)

['[CLS]', 'rudder', 'was', 'most', 'recently', 'senior', 'vice', 'president', 'for', 'the', 'developer', '&', 'platform', 'evan', '##gel', '##ism', 'business', '.', '[SEP]', 'senior', 'vice', 'president', 'eric', 'rudder', ',', 'formerly', 'head', 'of', 'the', 'developer', 'and', 'platform', 'evan', '##gel', '##ism', 'unit', ',', 'will', 'lead', 'the', 'new', 'entity', '.', '[SEP]']


input_ids: the numerical ID of each token in the tokenizer's vocabulary.
token_type_ids: which tokens belong to the first sentence (0) and which to the second (1).
attention_mask: which tokens the model should attend to (1) and which it should ignore (0, used for padding).
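
To make the three fields concrete, here is a rough sketch in plain Python of how a BERT-style pair encoding is laid out. The function `encode_pair` and the tiny `vocab` dictionary are hypothetical. The IDs in `vocab` were inferred by aligning the tokens of the first example pair with the `input_ids` printed earlier, so this is an illustration, not the real tokenizer.

```python
def encode_pair(tokens_a, tokens_b, vocab):
    """Build BERT-style inputs for a sentence pair (illustrative only)."""
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return {
        "input_ids": [vocab[token] for token in tokens],
        "token_type_ids": token_type_ids,
        # No padding here, so every position is a real token.
        "attention_mask": [1] * len(tokens),
    }


# Toy vocabulary (assumption: IDs inferred from the output shown earlier).
vocab = {
    "[CLS]": 101, "[SEP]": 102, "this": 2023, "is": 2003, "the": 1996,
    "first": 2034, "sentence": 6251, ".": 1012, "second": 2117, "one": 2028,
}

encoded = encode_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
    vocab,
)
print(encoded)
```

The real tokenizer also handles lowercasing and subword splitting (such as 'evan', '##gel', '##ism' above), which this sketch skips.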

## Using Dataset.map()
## Making a batch and comparing two sentences in a dataset


Define a function that tokenizes the two sentences of each example as a pair, then apply it to the whole dataset with .map().

Why use .map()? It is fast and convenient:

1. The results of the function are cached, so re-executing the cell takes no extra time.
2. It can use multiprocessing to go faster than applying the function to one element of the dataset at a time.
3. It does not load the whole dataset into memory; results are saved as soon as each element is processed.
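
What a batched map does can be sketched in plain Python. This is a toy model under stated assumptions, not the `datasets` implementation: `map_batched` and `toy_tokenize` are hypothetical names, the toy function only counts whitespace-separated tokens, and caching and multiprocessing are left out. The point is that the function receives a dictionary of lists (one slice of each column) and returns new columns.

```python
def map_batched(columns, function, batch_size=1000):
    """Apply `function` to slices of a column-oriented table and add its outputs as new columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list slice per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Keep the original columns and add the new ones.
    return {**columns, **new_columns}


def toy_tokenize(batch):
    # Stand-in for the real tokenizer: count whitespace-separated tokens per sentence pair.
    return {
        "n_tokens": [
            len(a.split()) + len(b.split())
            for a, b in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


data = {
    "sentence1": ["This is the first sentence.", "A second pair ."],
    "sentence2": ["This is the second one.", "Another sentence here ."],
}
mapped = map_batched(data, toy_tokenize, batch_size=1)
print(mapped["n_tokens"])  # [10, 8]
```

The result does not depend on the batch size; larger batches simply mean fewer calls to the function.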

In [11]:
def tokenize_function(example):
  return tokenizer(example['sentence1'], example['sentence2'], truncation=True)

In [12]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched = True)
print(tokenized_datasets)
print(tokenized_datasets['train'][0])
# raw data before the tokenize function is applied (no input_ids or attention_mask)
print(raw_datasets['train'][0])


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 

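Here, Dataset.map() applies the function while keeping the data in dataset form: the result is still a DatasetDict, with input_ids, token_type_ids, and attention_mask added as new columns.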

### Padding in batched input

Dynamic padding: pad each batch to the maximum length WITHIN that batch.
The alternative is to pad everything to the maximum length in the entire dataset.

Padding is applied by a collate function, which is responsible for putting the samples of a batch together.
DataCollatorWithPadding: a collate function that puts samples together and pads them to the same length.

In [13]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

samples = tokenized_datasets['train'][:8]
samples = {k: v for k, v in samples.items() if k not in ['idx','sentence1','sentence2']}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

The longest sequence in the batch has 67 tokens. Hence, the others will be padded to length 67.
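
The padding step can be sketched in plain Python using the eight lengths listed above. `pad_batch` is a hypothetical helper, the token IDs are dummies, and `pad_id=0` is an assumption (it matches the [PAD] token of bert-base-uncased, but other tokenizers may differ).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length in this batch (dynamic padding)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


lengths = [50, 59, 47, 67, 59, 50, 62, 32]
sequences = [[1] * n for n in lengths]  # dummy token ids, only the lengths matter

input_ids, attention_mask = pad_batch(sequences)
padded_lengths = {len(row) for row in input_ids}
n_padding = sum(len(row) for row in input_ids) - sum(lengths)
print(padded_lengths, n_padding)  # {67} 110
```

Only 110 padding tokens are added for this batch. Padding to a dataset-wide maximum instead would add more whenever that maximum exceeds 67.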


In [14]:
batch = data_collator(samples)
{k:v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

Check: every padded field (input_ids, token_type_ids, attention_mask) now has length 67.

## Fine-Tuning
Use the Trainer class to fine-tune a pretrained model on your dataset.
1. Define a TrainingArguments object, which holds the hyperparameters the Trainer uses for training and evaluation.  
   The only argument needed is the directory where the model will be saved.
2. Define the model.
3. Define a Trainer by passing all the objects constructed up to now (model, training_args, train_dataset, eval_dataset, data_collator, tokenizer).

In [15]:
!pip install transformers[torch]



In [16]:
!pip install accelerate -U




In [17]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [18]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500  | 0.5201        |
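
Why the loss appears at step 500 can be worked out with a little arithmetic. The sketch below assumes the TrainingArguments defaults were in effect (batch size 8 per device, 3 epochs, logging every 500 steps), a single device, and no gradient accumulation. These defaults may differ between library versions, so treat the numbers as an estimate.

```python
import math

train_rows = 3668      # size of the MRPC training split shown earlier
batch_size = 8         # assumed default per_device_train_batch_size, single device
num_epochs = 3         # assumed default num_train_epochs
logging_steps = 500    # assumed default logging_steps

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
print(steps_per_epoch, total_steps, logged_at)  # 459 1377 [500, 1000]
```

Under these assumptions, training runs for 1377 steps and the loss is logged at steps 500 and 1000. The row shown above is the first of these.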
