<a href="https://colab.research.google.com/github/ArunKoundinya/DeepLearning/blob/main/posts/hugging_face_sentiment_classification/index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this post we explore the basics of Hugging Face for sentiment classification. We will cover the following subtopics:

- A simple Hugging Face pipeline for sentiment classification.
- Breaking the pipeline into its pieces (tokenizing, predicting, post-processing).
- What happens inside the tokenizer.
- How fine-tuning works.
- How fine-tuning works with custom layers on top of a Hugging Face model.

I prepared this primarily as a ready reference for my own future use. I hope it helps others too.

## 1. Hugging Face Pipeline

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. [Pipeline Reference](https://huggingface.co/docs/transformers/main_classes/pipelines).

In [1]:
from transformers import pipeline

Here we import `pipeline` from the `transformers` package. This function has two main inputs:

- The task we want to perform.
- The model we want to use for that task.

As shown in the image below, we can pick the task we need.

![](pipeline-1.png)

We can select any relevant model from this [link](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending).

To start with, I chose:
1.   Task: sentiment analysis
2.   Model: DistilBERT base (fine-tuned on SST-2)

In [2]:
task = "sentiment-analysis"

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

sentiment_classifer = pipeline(task,model = checkpoint,framework="tf")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

As we can see, all the required files are downloaded: the model configuration, the model weights, and the tokenizer vocabulary.

Predicting is now as simple as typing a prompt into ChatGPT:

In [3]:
sentiment_classifer("This is a cool blog!!!")

[{'label': 'POSITIVE', 'score': 0.9998422861099243}]

In [4]:
sentiment_classifer("Games are boring!!!")

[{'label': 'NEGATIVE', 'score': 0.9996864795684814}]

We can also pass several inputs at once as a list.

In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "My niece absolutely loves this. It’s a great way to get her outside but keep her distracted so she’s not running around the yard. It was very easy to assemble and super durable. Also super easy to clean."
]

In [6]:
sentiment_classifer(raw_inputs)

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'POSITIVE', 'score': 0.9994500279426575}]

This shows how simple the pipeline is to use: all we need to do is pick a task and a model and create a pipeline.

Once the pipeline is created, we pass in our input data and get predictions back. Conveniently, the output already contains the final labels, not just raw scores.

## 2. Inside Pipeline

Inside the pipeline there are three steps:

- Tokenizing: converts each sentence into numerical token IDs.
- Model: takes these token IDs and predicts logits.
- Post-processing: converts the logits into the labels we need.

For reference, see the image below.

![](behind-pipeline.png)

### 2.1. AutoTokenizer

`AutoTokenizer` is a generic tokenizer class that works with a wide range of models offered by Hugging Face.

We first create a tokenizer object:

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

type(tokenizer)

`AutoTokenizer` automatically loads the tokenizer type that matches the model. Since our checkpoint is a `distilbert` model, we get a DistilBERT tokenizer.

Let's check the same with another model for the sake of demonstration.

In [8]:
type(AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest"))



config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]


Since this checkpoint is a RoBERTa model, `AutoTokenizer` automatically loads a RoBERTa tokenizer. The beauty of the `AutoTokenizer` wrapper is that it makes our code model-agnostic.

A single call converts words into token IDs. Here `padding=True` pads every sentence to the length of the longest one in the batch so that all rows have the same shape, and `truncation=True` cuts off any input that exceeds the model's maximum token length.

In [9]:
tokenizer(raw_inputs,padding=True,truncation=True,return_tensors="tf")

{'input_ids': <tf.Tensor: shape=(3, 48), dtype=int32, numpy=
array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0],
       [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0],
       [  101,  2026, 12286,  7078,  7459,  2023,  1012,  2009,  1521,
         1055,  1037,  2307,  2126,  2000,  2131,  2014,  2648,  2021,
         2562,  2014, 11116

The output contains `input_ids`, the numerical IDs of each token in the model's vocabulary, and `attention_mask`, which tells the model which positions are real tokens and which are padding. We are now ready to feed these token IDs to the model.
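To make the padding and attention mask concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, the padding ID of `0` is taken from `pad_token_id` in the model config, and the token IDs are copied from the first two sentences of the output above.

```python
# Illustrative sketch only: pad_batch is a hypothetical helper, not a transformers API.
PAD_ID = 0  # assumed from "pad_token_id": 0 in the DistilBERT config


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662,
     12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])

print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sentence is filled with zeros up to the length of the longer one, and its mask switches from `1` to `0` exactly where the padding starts.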

In [10]:
tokenized_ids = tokenizer(raw_inputs,padding=True,truncation=True,return_tensors="tf")

### 2.2. Sequence Classification Model

Just as we have `AutoTokenizer` for tokenizing, we have `AutoModel` for modelling. Since our task is sequence classification, we directly use the TensorFlow variant, `TFAutoModelForSequenceClassification`.

Now we create a model object:

In [11]:
from transformers import TFAutoModelForSequenceClassification

bert_model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

type(bert_model)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


As with the tokenizer, the matching TF model class is loaded automatically. Since our checkpoint is a `distilbert` model, we get `TFDistilBertForSequenceClassification`.

Here TF stands for `TensorFlow`, since we are using the TensorFlow framework instead of PyTorch.

We can inspect the model configuration through the `.config` attribute. This model uses a vocabulary of about 30K tokens (`vocab_size` is 30522).


In [12]:
bert_model.config

DistilBertConfig {
  "_name_or_path": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.41.0",
  "vocab_size": 30522
}

We can also explore the model architecture by viewing its summary. It consists of the core DistilBERT model followed by a pre-classifier layer and a classifier layer that predicts the output.

In [23]:
bert_model.summary()

Model: "tf_distil_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_39 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


We can view the architecture of individual layers in the following way.

In [37]:
for layer in bert_model.layers:
  print(layer.name)
  print(layer.get_config())

distilbert
{'name': 'distilbert', 'trainable': True, 'dtype': 'float32', 'config': {'vocab_size': 30522, 'max_position_embeddings': 512, 'sinusoidal_pos_embds': False, 'n_layers': 6, 'n_heads': 12, 'dim': 768, 'hidden_dim': 3072, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation': 'gelu', 'initializer_range': 0.02, 'qa_dropout': 0.1, 'seq_classif_dropout': 0.2, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': None, 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0,


Now let's pass the `tokenized_ids` that we saved in the previous section to this model.


In [15]:
bert_model(tokenized_ids)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[-1.560698 ,  1.6122831],
       [ 4.1692305, -3.3464472],
       [-3.6664698,  3.8386683]], dtype=float32)>, hidden_states=None, attentions=None)

What did we get here?

We got `logits`, which are the raw scores of the final layer. These are not probabilities. To get the probability of each label we apply a softmax, which can be done with `tf.math.softmax`.

### 2.3. Post Processing

In [51]:
logits = bert_model(tokenized_ids).logits

import tensorflow as tf

print(tf.math.softmax(logits).numpy())

[[4.0195242e-02 9.5980471e-01]
 [9.9945587e-01 5.4418476e-04]
 [5.4994709e-04 9.9945003e-01]]


Now that we have probabilities, we can use the `argmax` function to get the index of the highest probability for each input.

In [52]:
tf.argmax(tf.math.softmax(logits),axis=1).numpy()

array([1, 0, 1])

- `1` means positive
- `0` means negative
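The whole post-processing step can also be reproduced without TensorFlow. The sketch below is a plain-Python illustration, assuming the logits printed earlier and the `id2label` mapping from the model config; the `softmax` helper is written here for illustration and is not a library function.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shifted = [v - max(row) for v in row]
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above
logits = [
    [-1.560698, 1.6122831],
    [4.1692305, -3.3464472],
    [-3.6664698, 3.8386683],
]
# Mapping copied from "id2label" in the model config
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append({"label": id2label[best], "score": probs[best]})

for result in results:
    print(result)
```

Up to floating-point rounding, the labels and scores should match what the `pipeline` call returned for the same three sentences in section 1.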

In this way we have explored three parts of the pipeline individually.

## 3. What's happening inside Tokenizer

Below is what we get when we call the tokenizer with its defaults: `input_ids` and `attention_mask`. Now we will explore how this output is formed.

In [53]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer("I hate this so much!")

{'input_ids': [101, 1045, 5223, 2023, 2061, 2172, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

### 3.1. Tokenization

There are usually three types of tokenization:

- Word-based tokenization
- Character-level tokenization
- Sub-word tokenization

The first two are the two extremes. The usual approach in large models is sub-word tokenization, which keeps common words whole and splits rarer words into smaller pieces.

We can tokenize a sentence with the `tokenizer.tokenize` function.

In [55]:
tokenizer.tokenize("I hate this so much!")

['i', 'hate', 'this', 'so', 'much', '!']

In [56]:
tokenizer.tokenize("This beautilization is so lucky!")  #Random Sentence to demonstrate automatic subword tokenization

['this', 'beau', '##ti', '##lization', 'is', 'so', 'lucky', '!']
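The `##` pieces above come from a greedy, longest-match-first split of each word against the vocabulary. The following is a simplified sketch of that idea, not the library's code: `wordpiece_tokenize` and `toy_vocab` are hypothetical, the vocabulary contains only the handful of entries needed here, and lowercasing and punctuation splitting are assumed to have happened already.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-words."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has 30522 entries
toy_vocab = {"this", "is", "so", "lucky", "beau", "##ti", "##lization", "!"}

print(wordpiece_tokenize("lucky", toy_vocab))
print(wordpiece_tokenize("beautilization", toy_vocab))
```

A word that is in the vocabulary comes back as a single token, while the made-up word is broken into the longest pieces the vocabulary can cover.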

### 3.2. Tokenized Words to IDs

To convert these tokens to IDs, we map them through the vocabulary that was built for this model.

Earlier we saw in the config that the vocabulary has about 30K entries. We can verify this with the `len` function and also look at the first 10 vocabulary entries along with their IDs.

In [69]:
print(len(tokenizer.vocab))
list(tokenizer.vocab.items())[:10]

30522


[('96', 5986),
 ('republished', 24476),
 ('worst', 5409),
 ('##bant', 29604),
 ('##ahu', 21463),
 ('fellow', 3507),
 ('explosives', 14792),
 ('infrared', 14611),
 ('##osaurus', 25767),
 ('tenant', 16713)]

In [70]:
tokenizer.tokenize("I hate this so much!")

['i', 'hate', 'this', 'so', 'much', '!']

We can convert these tokens into IDs by looking each one up in the vocabulary.

In [81]:
for i in tokenizer.tokenize("I hate this so much!"):
  print(tokenizer.vocab[i])

print("")

tokenizer("I hate this so much!")["input_ids"]

1045
5223
2023
2061
2172
999



[101, 1045, 5223, 2023, 2061, 2172, 999, 102]

We can see that the tokens map to the same IDs. However, the tokenizer output has extra special tokens at the start and the end, with IDs `101` and `102`. Let's explore what they are.

In [82]:
print(dict(tokenizer.vocab)["[CLS]"])
print(dict(tokenizer.vocab)["[SEP]"])

101
102


These are `[CLS]`, which marks the start of the sequence, and `[SEP]`, a separator that here marks the end of the sentence, since sentiment classification takes a single sequence. We can also convert tokens to IDs directly using the `convert_tokens_to_ids` function.

In [42]:
tokens = tokenizer.tokenize("I hate this so much!")

tokenizer.convert_tokens_to_ids(tokens)

[1045, 5223, 2023, 2061, 2172, 999]

![](tokenization-pipeline.png)

### 3.3. Preparing tokens for downstream modelling

We prepare these IDs for downstream modelling with the `prepare_for_model` function, which adds the special tokens and creates both `input_ids` and `attention_mask`.

In [53]:
token_to_ids = tokenizer.convert_tokens_to_ids(tokens)

tokenizer.prepare_for_model(token_to_ids)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': [101, 1045, 5223, 2023, 2061, 2172, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
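Putting the steps of this section together, the path from tokens to model inputs can be mimicked with a plain dictionary lookup. This is only a sketch of the idea: `encode` is a hypothetical helper, and `toy_vocab` holds just the eight vocabulary entries seen above rather than the full vocabulary.

```python
# Hypothetical mini-vocabulary using the IDs printed earlier in this section
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}


def encode(tokens, vocab):
    """Map tokens to IDs, wrap them in [CLS] ... [SEP], and build the mask."""
    ids = [vocab[token] for token in tokens]
    input_ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    # No padding here, so every position is a real token
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = encode(["i", "hate", "this", "so", "much", "!"], toy_vocab)
print(encoded)
```

The result has the same structure as the `prepare_for_model` output: six word IDs framed by `101` and `102`, with a mask of all ones.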

## 4. How Fine Tuning Works!

Let's assume that we want to fine-tune the model on our own dataset.

Let's load the `Amazon Review` dataset that we explored in a previous blog.

In [83]:
from google.colab import drive
import os

import pandas as pd

drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

testdata = pd.read_csv('test_data_sample_complete.csv')
traindata = pd.read_csv('train_data_sample_complete.csv')

Mounted at /content/drive


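We prepare the dataset by taking a sample of 1000 rows from each of the training and test sets, remapping the class labels from 1/2 to 0/1, and replacing missing reviews with empty strings.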

In [84]:
train_data = traindata.sample(n=1000, random_state=42)
test_data = testdata.sample(n=1000, random_state=42)

train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})
test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})

train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')
test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')

In [85]:
train_data.head(3)

Unnamed: 0,class_index,review_combined_lemma
2079998,0,expensive junk product consists piece thin fle...
1443106,0,toast dark even lowest setting toast dark liki...
3463669,1,excellent imagerydumbed story enjoyed disc vid...


### 4.1. Converting Pandas DataFrame to HuggingFace Dataset

To fine-tune a Hugging Face model, we need to convert the data frames into the Hugging Face `Dataset` format, which is straightforward.

In [86]:
!pip install datasets


Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets

In [87]:
import datasets
from datasets import Dataset, DatasetDict

train_data = Dataset.from_pandas(train_data)
test_data = Dataset.from_pandas(test_data)

raw_data = DatasetDict()

raw_data["train"] =  train_data
raw_data["test"] = test_data

print(raw_data)

DatasetDict({
    train: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 1000
    })
})


Next, we tokenize the input text in these datasets with our tokenizer, as follows.

In [88]:
def tokenize_function(example):
    return tokenizer(example["review_combined_lemma"], truncation=True)

tokenized_datasets = raw_data.map(tokenize_function, batched=True)

print(tokenized_datasets)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})


In [89]:
Dataset.to_pandas(tokenized_datasets["train"]).head(3)

Unnamed: 0,class_index,review_combined_lemma,__index_level_0__,input_ids,attention_mask
0,0,expensive junk product consists piece thin fle...,2079998,"[101, 6450, 18015, 4031, 3774, 3538, 4857, 123...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,0,toast dark even lowest setting toast dark liki...,1443106,"[101, 15174, 2601, 2130, 7290, 4292, 15174, 26...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,1,excellent imagerydumbed story enjoyed disc vid...,3463669,"[101, 6581, 13425, 8566, 18552, 2094, 2466, 56...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


Finally, we have the dataset in the required format with tokenized inputs.

Note that we have not padded the input data, so the lengths of `input_ids` differ from row to row depending on the sentence length.

In [90]:
print(len(Dataset.to_pandas(tokenized_datasets["train"])["attention_mask"][0]))
print(len(Dataset.to_pandas(tokenized_datasets["train"])["attention_mask"][1]))
print(len(Dataset.to_pandas(tokenized_datasets["train"])["review_combined_lemma"][0]))
print(len(Dataset.to_pandas(tokenized_datasets["train"])["review_combined_lemma"][1]))

81
27
422
132


We use `DataCollatorWithPadding` to pad each batch dynamically, and finally convert the data into TensorFlow datasets for faster processing.

In [95]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["class_index"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["class_index"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 
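The reason for padding inside a collator rather than up front is that each batch only needs to be as wide as its own longest example. Here is a rough pure-Python sketch of that behaviour; `collate` and `batches` are hypothetical helpers, and the examples are dummy ID lists whose first two lengths (81 and 27) mirror the rows measured above, while the other two are made up.

```python
def collate(batch, pad_id=0):
    """Pad every example in one batch to that batch's longest example."""
    width = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch]


def batches(examples, batch_size):
    """Yield padded batches of at most batch_size examples, in order."""
    for start in range(0, len(examples), batch_size):
        yield collate(examples[start:start + batch_size])


# Dummy token IDs: only the lengths matter for this illustration
examples = [[1] * n for n in (81, 27, 12, 9)]

widths = [len(batch[0]) for batch in batches(examples, batch_size=2)]
print(widths)
```

The first batch is padded to 81 tokens and the second only to 12, whereas padding the whole dataset at once would have stretched every row to 81.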


### 4.2. Fine Tuning the Model

We have already loaded the bert_model in our previous steps which we will use for training.

In [91]:
bert_model.summary()

Model: "tf_distil_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_39 (Dropout)        multiple                  0         
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Because the model outputs logits rather than probabilities, we must tell the loss function to expect logits by setting `from_logits=True`; the softmax is then applied inside the loss.
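To see what a cross-entropy loss on logits computes, here is a small sketch in plain Python. It assumes the standard definition (the negative log of the softmax probability of the true class) and uses the logits of the first example from section 2; `sparse_ce_from_logits` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Return -log(softmax(logits)[label]) using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


# Logits of the first example in section 2; the model favoured class 1
logits = [-1.560698, 1.6122831]

loss_if_positive = sparse_ce_from_logits(logits, label=1)
loss_if_negative = sparse_ce_from_logits(logits, label=0)
print(loss_if_positive, loss_if_negative)
```

The loss is small when the true label agrees with the larger logit and large when it does not, which is the signal that training minimizes.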

`compile` and `fit` work just as they do for any regular TensorFlow/Keras model.

In [96]:
bert_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
bert_model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    verbose=1,
    epochs=10
)

Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tf_keras.src.callbacks.History at 0x7fdd8571d810>

Currently the accuracy is 50%, likely because we used a small dataset and few epochs. The default learning rate of the `"adam"` optimizer (1e-3) is also much higher than the 2e-5 to 5e-5 range commonly used when fine-tuning BERT-style models, which may contribute. Training on the full data should improve accuracy, which we will explore in the next blog. However, this comes at a large training cost, because fine-tuning the full model has to retrain a lot of parameters.

Since the majority of the parameters lie in the `distilbert` layer, let's not train this layer. We can do this by setting that layer's `trainable` attribute to `False`.

In [99]:
bert_new_model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

# Reload the original model so that the original parameters are restored.

bert_new_model.layers[0].trainable = False

bert_new_model.summary()

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_59 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 592130 (2.26 MB)
Non-trainable params: 66362880 (253.15 MB)
_________________________________________________________________


Now we can see that the trainable parameters are down to `592,130` (about 5.9 lakh).
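The trainable count can be checked by hand, since a dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes the pre-classifier maps 768 to 768 units and the classifier maps 768 to 2, which is consistent with the `dim` value in the config and the per-layer counts in the summary; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Number of parameters in a dense layer: weights plus biases."""
    return n_in * n_out + n_out


pre_classifier = dense_params(768, 768)  # assumed 768 -> 768
classifier = dense_params(768, 2)        # assumed 768 -> 2 labels
trainable = pre_classifier + classifier

print(pre_classifier, classifier, trainable)  # 590592 1538 592130
```

These are the two head layers left trainable after freezing `distilbert`, and their sum equals the total of 66,955,010 minus the 66,362,880 frozen parameters.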

In [100]:
bert_new_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
bert_new_model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    verbose=1,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tf_keras.src.callbacks.History at 0x7fdd7e938190>

With sufficiently large data and more epochs, validation accuracy should increase, which we will explore in the next blog.

## 5. Adding Custom Layers on Top of Hugging Face Model

Since the earlier model already has a built-in classifier, it is not wise to add custom layers on top of it. For such scenarios, Hugging Face provides a plain `AutoModel` (here `TFAutoModel`) without a classification head.

In [102]:
from transformers import TFAutoModel

bert_for_custom_model = TFAutoModel.from_pretrained(checkpoint)

bert_for_custom_model.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Model: "tf_distil_bert_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
Total params: 66362880 (253.15 MB)
Trainable params: 66362880 (253.15 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Now we have the same base model without the classifier layers, so we can add our own on top.

### 5.1. Sample Exploration of the layer

In [103]:
inputs = tokenizer("This is a sample input", return_tensors="tf")
bert_for_custom_model(inputs)

TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(1, 7, 768), dtype=float32, numpy=
array([[[ 0.05297906,  0.10692392,  0.49419168, ...,  0.12507138,
          0.23212025,  0.17222881],
        [-0.01472363, -0.11656911,  0.28193325, ...,  0.12857884,
          0.30744654,  0.17272332],
        [ 0.01157056, -0.07407673,  0.54636   , ...,  0.03764485,
          0.2051253 ,  0.49154142],
        ...,
        [ 0.32394847,  0.08095304,  0.4270281 , ...,  0.08437095,
          0.18978027, -0.0091958 ],
        [-0.15465298,  0.17761149,  0.5083473 , ...,  0.19376224,
          0.36129084, -0.14435732],
        [ 1.0194045 ,  0.17841692,  0.48938137, ...,  0.79112005,
         -0.12476905, -0.09671681]]], dtype=float32)>, hidden_states=None, attentions=None)

Here we can see that the output is named `last_hidden_state` and has a shape of `(1, 7, 768)`.

`1` is the batch size, since we passed a single sentence.

`7` is the number of `input_ids`: five word tokens plus `[CLS]` and `[SEP]`.

`768` is the hidden dimension of the model, listed as `dim` in the config.

In [107]:
bert_for_custom_model.config

DistilBertConfig {
  "_name_or_path": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.41.0",
  "vocab_size": 30522
}
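The next section keeps only the hidden state at position 0, the `[CLS]` token, through the slice `last_hidden_state[:, 0, :]`. As a rough illustration of what that slice does to the shape, here is a sketch with nested Python lists standing in for the tensor. The values are dummies, and the hidden size is shrunk from 768 to 4 purely for readability.

```python
batch_size, seq_len, dim = 1, 7, 4  # dim reduced from 768 for readability

# Dummy "hidden states": every vector is filled with its token position
last_hidden_state = [
    [[float(position)] * dim for position in range(seq_len)]
    for _ in range(batch_size)
]

# List equivalent of last_hidden_state[:, 0, :]: keep position 0 of each sequence
cls_vectors = [sequence[0] for sequence in last_hidden_state]

print(len(last_hidden_state), len(last_hidden_state[0]), len(last_hidden_state[0][0]))
print(len(cls_vectors), len(cls_vectors[0]))
```

The sequence axis disappears, so a `(1, 7, 768)` output becomes one 768-dimensional vector per sentence, which is the shape a dense classification head expects.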

### 5.2. Adding Custom Layers

In [125]:
from tensorflow.keras.layers import Layer

class HuggingFaceModelLayer(Layer):
    def __init__(self, model, **kwargs):
        super(HuggingFaceModelLayer, self).__init__(**kwargs)
        self.model = model

    def call(self, inputs):
        input_ids, attention_mask = inputs
        return self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state


input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name='attention_mask')

last_hidden_state = HuggingFaceModelLayer(bert_for_custom_model)([input_ids, attention_mask])

# Taking the first CLS token which usually captures the overall information
cls_token_state = last_hidden_state[:, 0, :]

# Add custom layers
x = tf.keras.layers.Dense(256, activation='relu')(cls_token_state)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)  # For binary classification, use 'sigmoid'

# Create the final model
custom_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)

custom_model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, None)]               0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 hugging_face_model_layer_5  (None, None, 768)            6636288   ['input_ids[0][0]',           
  (HuggingFaceModelLayer)                                 0          'attention_mask[0][0]']      
                                                                                            

Here we have successfully added layers on top of the Hugging Face model.

### 5.3. Fine Tuning the Model

For the sake of this blog, we will keep the DistilBERT layer non-trainable.

In [129]:

custom_model.layers[2].trainable = False

custom_model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, None)]               0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 hugging_face_model_layer_5  (None, None, 768)            6636288   ['input_ids[0][0]',           
  (HuggingFaceModelLayer)                                 0          'attention_mask[0][0]']      
                                                                                            

Earlier we used a cross-entropy loss with `from_logits=True`. Now we can use `binary_crossentropy` directly, since our own classifier head ends in a sigmoid layer that already outputs probabilities.
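A single sigmoid unit is enough for two classes because a two-way softmax depends only on the difference between the two logits. The sketch below checks this numerically in plain Python, reusing the logits of the first example from section 2; all three functions are hypothetical helpers written for illustration, not the Keras implementations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax_positive(z_neg, z_pos):
    """Probability of the positive class from a two-logit softmax."""
    m = max(z_neg, z_pos)
    e_neg, e_pos = math.exp(z_neg - m), math.exp(z_pos - m)
    return e_pos / (e_neg + e_pos)


def binary_crossentropy(y_true, p, eps=1e-7):
    """Binary cross-entropy for one example, with clipping for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


z_neg, z_pos = -1.560698, 1.6122831
p_softmax = softmax_positive(z_neg, z_pos)
p_sigmoid = sigmoid(z_pos - z_neg)

print(p_softmax, p_sigmoid)
print(binary_crossentropy(1, p_sigmoid))
```

Both routes give the same probability for the positive class, so a one-unit sigmoid head with binary cross-entropy expresses the same kind of model as a two-unit softmax head with categorical cross-entropy.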

In [130]:
custom_model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
custom_model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    verbose=1,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7fdd0b8c0cd0>

Finally, we have successfully trained the custom model. Since we are training on a tiny dataset, the validation accuracy hasn't improved. I will explore training on larger datasets in the next blog and share the results.

Overall, this was a good learning exercise on the fundamentals of using Hugging Face. Please do comment wherever further explanation is needed.