# How to run inference?

1. High-level, abstracted approach: the `pipeline` API from Hugging Face (HF).


2. Second, lower-level approach:
- Tokenize the input.
- Pass it through the model (transformer).
- Explicitly infer the output from the model's logits.
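
The three steps of the second approach can be sketched as follows. This is a minimal illustration, not the notebook's actual model: it assumes a tiny, randomly initialised BERT (`tiny_model`, built from a hypothetical `BertConfig`) so that nothing has to be downloaded, and it uses hand-written token ids in place of a real tokenizer's output. The predicted class is therefore meaningless; only the flow matters.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Assumption: a tiny, randomly initialised BERT so nothing is downloaded.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
tiny_model = BertForSequenceClassification(config)
tiny_model.eval()  # disable dropout for inference

# Step 1 (tokenize): hand-written ids standing in for a tokenizer's output.
input_ids = torch.tensor([[1, 15, 27, 42, 2]])
attention_mask = torch.ones_like(input_ids)

# Step 2 (model): forward pass without tracking gradients.
with torch.no_grad():
    output = tiny_model(input_ids=input_ids, attention_mask=attention_mask)

# Step 3 (infer): pick the class with the highest logit.
predicted_class = output.logits.argmax(dim=-1).item()
print(output.logits.shape, predicted_class)
```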

### Training in PyTorch
1. Dataloader
2. Collate function
3. Loss function
4. Optimizer
5. Model
6. Write a for loop
- Take an (input, label) pair
- Pass the input through the model
- Get the output (logits)
- Compute the loss
- Do one update step with the optimizer
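
The ingredients listed above can be put together in a small, self-contained sketch. Everything here is a toy assumption made for illustration: the synthetic 4-feature dataset, the `nn.Linear` model, the SGD learning rate and the batch size are not taken from the notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

torch.manual_seed(0)

# Toy dataset: 64 (input, label) pairs; the label is 1 when the features sum to > 0.
features = torch.randn(64, 4)
labels = (features.sum(dim=1) > 0).long()
samples = list(zip(features, labels))

def collate_fn(batch):
    """Stack a list of (input, label) pairs into two batch tensors."""
    inputs = torch.stack([x for x, _ in batch])
    targets = torch.stack([y for _, y in batch])
    return inputs, targets

loader = DataLoader(samples, batch_size=16, shuffle=True, collate_fn=collate_fn)
loss_fn = nn.CrossEntropyLoss()
toy_model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(5):
    running = 0.0
    for inputs, targets in loader:
        logits = toy_model(inputs)        # pass the input through the model
        loss = loss_fn(logits, targets)   # compute the loss from the logits
        optimizer.zero_grad()             # clear gradients from the previous step
        loss.backward()                   # backpropagate
        optimizer.step()                  # one update step
        running += loss.item()
    epoch_losses.append(running / len(loader))

print(epoch_losses)
```

The per-epoch average loss is recorded so you can check whether training is making progress.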

# Finetuning
1. Dataset
- Data could be stored in different formats.
- Once the data is in a tabular format (plain text or CSV), it can be converted to an HF dataset.
2. Training/finetuning

## Trainer API
1. Optimized for transformer models
2. Can be used for both finetuning and training from scratch

In [1]:
!pip install datasets
import datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.w

In [2]:
from pprint import pprint

## Datasets
1. The HF `datasets` library lets you download and use numerous datasets.
- They can contain text, audio, images, videos, etc.
2. An HF dataset is stored in a specific format.
3. See https://huggingface.co/docs/datasets/en/index for the library documentation; the available datasets are listed on the Hugging Face Hub.
4. Data is stored in a columnar format, where columns are features and rows are samples.
5. HF uses `pyarrow` as the storage backend. The GLUE files downloaded here are in Parquet format.

### Let's download data from the GLUE benchmark.

Let's visit the benchmark page.

https://huggingface.co/datasets/nyu-mll/glue

https://gluebenchmark.com/


https://paperswithcode.com/dataset/glue

1. A hierarchy of datasets (sub-datasets).
2. A standardised collection of NLP tasks and their corresponding datasets.
3. A popular benchmark.

In [3]:
from datasets import load_dataset, get_dataset_split_names, get_dataset_config_names, get_dataset_config_info

Get list of sub-datasets

In [4]:
# get all datasets in 'glue' benchmark
get_dataset_config_names('glue')


['ax',
 'cola',
 'mnli',
 'mnli_matched',
 'mnli_mismatched',
 'mrpc',
 'qnli',
 'qqp',
 'rte',
 'sst2',
 'stsb',
 'wnli']

Get split names of a dataset

In [5]:
# for dataset 'mrpc', get split names
get_dataset_split_names('glue','mrpc')

['train', 'validation', 'test']

In [6]:
# for dataset 'ax', get split names
get_dataset_split_names('glue','ax')

['test']

Get config information of a dataset

In [10]:
# get config info of 'mrpc'
pprint(get_dataset_config_info('glue','mrpc'))

DatasetInfo(description='',
            citation='',
            homepage='',
            license='',
            features={'idx': Value(dtype='int32', id=None),
                      'label': ClassLabel(names=['not_equivalent',
                                                 'equivalent'],
                                          id=None),
                      'sentence1': Value(dtype='string', id=None),
                      'sentence2': Value(dtype='string', id=None)},
            post_processed=None,
            supervised_keys=None,
            builder_name='parquet',
            dataset_name='glue',
            config_name='mrpc',
            version=0.0.0,
            splits={'test': SplitInfo(name='test',
                                      num_bytes=442410,
                                      num_examples=1725,
                                      shard_lengths=None,
                                      dataset_name=None),
                    'train': SplitInfo(name='

Do all sub-datasets have all three splits?

Answer: no. For example, `ax` has only a `test` split.

In [None]:
# are there any datasets with fewer than 3 splits? inspect the config info of 'ax'
pprint(get_dataset_config_info('glue','ax'))

DatasetInfo(description='',
            citation='',
            homepage='',
            license='',
            features={'hypothesis': Value(dtype='string', id=None),
                      'idx': Value(dtype='int32', id=None),
                      'label': ClassLabel(names=['entailment',
                                                 'neutral',
                                                 'contradiction'],
                                          id=None),
                      'premise': Value(dtype='string', id=None)},
            post_processed=None,
            supervised_keys=None,
            builder_name='parquet',
            dataset_name='glue',
            config_name='ax',
            version=0.0.0,
            splits={'test': SplitInfo(name='test',
                                      num_bytes=237694,
                                      num_examples=1104,
                                      shard_lengths=None,
                                      dataset_n

Download a dataset

In [11]:
# download 'mrpc' dataset
mrpc_data = load_dataset(path='glue',
                         name = 'mrpc')

train-00000-of-00001.parquet:   0%|          | 0.00/649k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/75.7k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/308k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

In [12]:
# what are the dataset's attributes and methods
dir(mrpc_data)

['__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__ior__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__or__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__ror__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_values_features',
 '_check_values_type',
 'align_labels_with_mapping',
 'cache_files',
 'cast',
 'cast_column',
 'class_encode_column',
 'cleanup_cache_files',
 'clear',
 'column_names',
 'copy',
 'data',
 'filter',
 'flatten',
 'flatten_indices',
 'formatted_as',
 'from_csv',
 'from_json',
 'from_parquet',
 'from_text',
 'fromkeys',
 'get',
 'items',
 'keys',
 'load_from_disk',
 'map',
 'num_columns',
 'num_r

In [16]:
# what are the features of the dataset?
mrpc_data

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

What is the structure of a dataset?

In [None]:
# structure of the dataset?
mrpc_data


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


- When creating your own dataset for the `Trainer` API, follow this `DatasetDict` layout: each split is a `Dataset`, and the individual splits are what get passed to `Trainer`.
- The feature set differs from dataset to dataset.

Let's explore the dataset

In [18]:
# only select train split
train_mrpc = mrpc_data['train']

In [19]:
# features from the train split
train_mrpc.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [20]:
# print few samples from train split, with index slicing
train_mrpc.data[:2]

pyarrow.Table
sentence1: string
sentence2: string
label: int64
idx: int32
----
sentence1: [["Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .","Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion ."]]
sentence2: [["Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .","Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 ."]]
label: [[1,0]]
idx: [[0,1]]

#### Transforming dataset

- map()
- filter()
- remove_columns()
- rename_columns()

In [21]:
from transformers import AutoTokenizer

In [26]:
model_checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [24]:
# print the fifth and sixth 'sentence1' entries from the training split
train_mrpc['sentence1'][4:6]

['The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .',
 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .']

### Formatting the input

- Let's finetune the BERT model.
- The model takes a sentence pair in the following format:

`[CLS] Sentence 1 [SEP] Sentence 2 [SEP]`

In [None]:
asample ='[CLS] Russia and Ukraine are fighting. [SEP] Ukraine has been in war with Russia for 3 years'
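
To show how a sentence pair is laid out, here is a plain-Python sketch. The helper `build_pair_input` is hypothetical and splits on whitespace only; the real BERT tokenizer uses WordPiece and returns integer ids, so this illustrates the layout of special tokens and segment ids, not the actual tokenization.

```python
def build_pair_input(sentence1, sentence2):
    """Lay out a sentence pair the way BERT expects (whitespace 'tokens' only)."""
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence 1 and the first [SEP];
    # segment 1 covers sentence 2 and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

pair = build_pair_input("I enjoy sports", "Watching news")
print(pair["tokens"])
print(pair["token_type_ids"])
```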

In [29]:
# tokenize few 'sentence1' from training set
tokenizer(train_mrpc['sentence1'][:5])

{'input_ids': [[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], [101, 9805, 3540, 11514, 2050, 3079, 11282, 2243, 1005, 1055, 2077, 4855, 1996, 4677, 2000, 3647, 4576, 1999, 2687, 2005, 1002, 1016, 1012, 1019, 4551, 1012, 102], [101, 2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012, 102], [101, 2105, 6021, 19481, 13938, 2102, 1010, 21628, 6661, 2020, 2039, 2539, 16653, 1010, 2030, 1018, 1012, 1018, 1003, 1010, 2012, 1037, 1002, 1018, 1012, 5179, 1010, 2383, 3041, 2275, 1037, 2501, 2152, 1997, 1037, 1002, 1018, 1012, 5401, 1012, 102], [101, 1996, 4518, 3123, 1002, 1016, 1012, 2340, 1010, 2030, 2055, 2340, 3867, 1010, 2000, 2485, 5958, 2012, 1002, 2538, 1012, 4868, 2006, 1996, 2047, 2259, 4518, 3863, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 

In [None]:
help(mrpc_data.map)

In [43]:
# create custom function to combine two sentences
def custom_mapping_fn(aSample):
  output =  tokenizer(text=aSample['sentence1'],
                      text_pair=aSample['sentence2'],
                      truncation = True)
  return output

In [44]:
# tokenize the dataset
tokenized_dataset = mrpc_data.map(custom_mapping_fn,
                                  batched = True)

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

In [45]:
# look at tokenized data
tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [46]:
# print one sample from tokenized data
tokenized_dataset['test'][:1]

{'sentence1': ["PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So ."],
 'sentence2': ['Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .'],
 'label': [1],
 'idx': [0],
 'input_ids': [[101,
   7473,
   2278,
   2860,
   1005,
   1055,
   2708,
   4082,
   2961,
   1010,
   3505,
   14998,
   1010,
   1998,
   4074,
   5196,
   1010,
   1996,
   2708,
   3361,
   2961,
   1010,
   2097,
   3189,
   3495,
   2000,
   2720,
   2061,
   1012,
   102,
   2783,
   2708,
   4082,
   2961,
   3505,
   14998,
   1998,
   2177,
   2708,
   3361,
   2961,
   4074,
   5196,
   2097,
   3189,
   2000,
   2061,
   1012,
   102]],
 'token_type_ids': [[0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
 


## Dynamic padding with DataCollator

- All sequences in a batch must have the same length, so the shorter ones are padded.
- Padding is done in the collate function, which is passed to the dataloader.
- With dynamic padding, each batch is padded only to the length of its own longest sequence.
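
The padding step described above can be sketched in plain Python. The function `pad_batch` is a hypothetical stand-in for what a padding collator does, and `pad_id=0` is an assumption that happens to match the `padding_idx=0` shown for BERT's embedding layer later in the notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102], [101, 102]])
for ids in toy_batch["input_ids"]:
    print(len(ids), end="  ")
```

Because the target length is computed per batch, a batch of short sentences is not padded out to the longest sentence in the whole dataset.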


In [60]:
from transformers import DataCollatorWithPadding

In [61]:
data_collator = DataCollatorWithPadding(tokenizer = tokenizer,
                                        padding=True)

In [None]:
# to get more info on data collator
# ?DataCollatorWithPadding

In [47]:
# take few training samples
some_samples = tokenized_dataset['train'][:5]

In [None]:
# view the samples
some_samples


{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .',
  'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .',
  'The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday .',
  'The DVD-CCA then appealed to the state Supreme Court .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 

In [49]:
# drop 'idx', 'sentence1' and 'sentence2' features
few_samples = {k:v for k,v in some_samples.items() if k not in ['sentence1', 'sentence2','idx']}

In [57]:
# view the samples
few_samples.keys()


dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])

In [58]:
# print length of each sentence
for i_sample in few_samples['input_ids']:
  print(len(i_sample), end="  ")

50  59  47  67  59  

In [62]:
# process the samples with data collator
batch = data_collator(few_samples)

In [64]:
# print length of each sentence
for i_sample in batch['input_ids']:
  print(len(i_sample), end="  ")


67  67  67  67  67  

# Choose the Model

In [65]:
from transformers import AutoModelForSequenceClassification, AutoModel

In [66]:
base_model= AutoModel.from_pretrained(model_checkpoint)



model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

In [None]:
base_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [70]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                           num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [71]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

# Training with Trainer API

In [74]:
from transformers import Trainer, TrainingArguments

`TrainingArguments` has a large number of parameters (127 at the time of writing).

In [75]:
?TrainingArguments

In [91]:
# `model.to_device('cpu')` raises AttributeError: 'BertForSequenceClassification' object has no attribute 'to_device'.
# The PyTorch method for moving a model is `.to()`:
# model.to('cpu')

In [85]:
train_args = TrainingArguments('test_trainer', num_train_epochs=1,
                               report_to='none')

In [86]:
trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=tokenized_dataset['train'],
    # eval_dataset=tokenized_dataset['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,  # recent transformers versions deprecate this in favour of `processing_class`
)



In [79]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [87]:
# !export WANDB_DISABLED=true
trainer.train()



TrainOutput(global_step=459, training_loss=0.253633087756587, metrics={'train_runtime': 21.109, 'train_samples_per_second': 173.765, 'train_steps_per_second': 21.744, 'total_flos': 135411749085120.0, 'train_loss': 0.253633087756587, 'epoch': 1.0})

In [82]:
# freeze the BERT encoder so that only the classification head stays trainable
for name, param in model.named_parameters():
  if name.startswith("bert"): # choose whatever prefix you like here
    param.requires_grad = False

  print(name, param.requires_grad)

bert.embeddings.word_embeddings.weight False
bert.embeddings.position_embeddings.weight False
bert.embeddings.token_type_embeddings.weight False
bert.embeddings.LayerNorm.weight False
bert.embeddings.LayerNorm.bias False
bert.encoder.layer.0.attention.self.query.weight False
bert.encoder.layer.0.attention.self.query.bias False
bert.encoder.layer.0.attention.self.key.weight False
bert.encoder.layer.0.attention.self.key.bias False
bert.encoder.layer.0.attention.self.value.weight False
bert.encoder.layer.0.attention.self.value.bias False
bert.encoder.layer.0.attention.output.dense.weight False
bert.encoder.layer.0.attention.output.dense.bias False
bert.encoder.layer.0.attention.output.LayerNorm.weight False
bert.encoder.layer.0.attention.output.LayerNorm.bias False
bert.encoder.layer.0.intermediate.dense.weight False
bert.encoder.layer.0.intermediate.dense.bias False
bert.encoder.layer.0.output.dense.weight False
bert.encoder.layer.0.output.dense.bias False
bert.encoder.layer.0.output.Lay
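
A quick way to check the effect of freezing is to count trainable parameters. The sketch below does this on a tiny, randomly initialised BERT (`tiny_model`, built from a hypothetical `BertConfig`) rather than on `bert-base-uncased`, so it runs without downloading anything; the numbers it prints apply only to this toy configuration.

```python
from transformers import BertConfig, BertForSequenceClassification

# Assumption: a tiny configuration chosen only to keep the example fast.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
tiny_model = BertForSequenceClassification(config)

# Freeze every parameter whose name starts with "bert" (the encoder).
for name, param in tiny_model.named_parameters():
    if name.startswith("bert"):
        param.requires_grad = False

trainable = sum(p.numel() for p in tiny_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in tiny_model.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Only the classifier head (a linear layer from `hidden_size` to `num_labels`) remains trainable after the loop.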

## Test the model

In [101]:
input_text = ['I enjoy sports','Watching news is so grim these days.']
input_text_tokenized = tokenizer(text = input_text[0],
                                 text_pair=input_text[1],
                                 return_tensors='pt')
pprint(input_text_tokenized)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[  101,  1045,  5959,  2998,   102,  3666,  2739,  2003,  2061, 11844,
          2122,  2420,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [96]:
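# to run on the GPU, move the inputs too; `.to()` returns a new tensor, so reassign it
# input_text_tokenized = {k: v.to('cuda') for k, v in input_text_tokenized.items()}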

In [102]:
model = model.to('cpu')

In [103]:
model(**input_text_tokenized)

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.0345, -2.0551]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
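
The logits above are raw scores. A minimal sketch of turning them into a probability and a label name follows; the logits are copied from the output above, and the `id2label` mapping is an assumption based on the `ClassLabel` names shown for MRPC earlier (`not_equivalent`, `equivalent`).

```python
import torch

# Logits copied from the model output above.
logits = torch.tensor([[2.0345, -2.0551]])

# Assumed mapping, following the order of the MRPC ClassLabel names.
id2label = {0: "not_equivalent", 1: "equivalent"}

probs = torch.softmax(logits, dim=-1)     # convert scores to probabilities
pred_id = probs.argmax(dim=-1).item()     # index of the most likely class
pred_label = id2label[pred_id]

print(pred_label, probs[0, pred_id].item())
```

For this pair of unrelated sentences the model favours `not_equivalent`, which is the expected behaviour for a paraphrase classifier.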
