---
title: Hugging Face - Representation Models 
categories:
- Representation Models
- Hugging Face
- Sentiment Classification
date: '2025-02-24'
description: Exploration of Representation Models & Text Classification
draft: false
---

In this module, we will explore the basics of two approaches to text classification using Encoder Transformers:
- Using BERT
- Using Label Encodings (Sentence Transformers)

I encourage you to read this [article](https://arunkoundinya.github.io/AIBasicswithAK/blogs/posts/representation_models/) to understand the background and intuition behind these two models.

In this article, we will also delve into sentiment classification through the following methods:
- Without training
- Using BERT LLM and Logistic Regression
- Using Sentence Transformers LLM and Logistic Regression
- Creating labels when they are not available

![](image.png)

<a href="https://colab.research.google.com/github/ArunKoundinya/DeepLearning/blob/main/posts/RepresentationModels_TextClassification/index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installing & Loading Libraries

In [1]:
!pip install datasets


In [2]:
from google.colab import drive
import os

import pandas as pd

from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
import tensorflow as tf
import numpy as np

import datasets
from datasets import Dataset, DatasetDict

### BERT Model - Sentiment Prediction w/o Training

In [3]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


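As the download log shows, all the required files have been fetched: the tokenizer configuration, the model configuration, the vocabulary, and the model weights.

With these in place, predicting is almost as simple as prompting `chatgpt`: we pass in text and read off the answer.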

In [4]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Since we used `TFAutoModelForSequenceClassification`, the model already includes a classification head (the `pre_classifier` and `classifier` layers in the summary above), which predicts the output as either positive or negative.
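To make the role of that classifier head concrete, here is a minimal sketch of how two raw logits become a label. The logit values are invented for illustration, and the `id2label` mapping is an assumption (0 for negative, 1 for positive); check `model.config.id2label` to confirm it for your checkpoint.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Assumed mapping; verify against model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for a single review, in the shape the classifier layer emits.
logits = [-1.2, 2.3]

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)  # same idea as tf.argmax
predicted_label = id2label[predicted_id]
print(predicted_label, [round(p, 3) for p in probs])
```

The softmax step is only needed if you want probabilities; taking the argmax of the raw logits gives the same predicted class.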

In [5]:
drive.mount('/content/drive')  # mount Google Drive so the project folder is reachable
os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

test_data = pd.read_csv('test_data_sample_complete.csv')
train_data = pd.read_csv('train_data_sample_complete.csv')

test_data = test_data.sample(n=1500, random_state=42)
train_data = train_data.sample(n=1500, random_state=42)

test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})
train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})

test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')
train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')

test_data = Dataset.from_pandas(test_data)
train_data = Dataset.from_pandas(train_data)
raw_data = DatasetDict()
raw_data["test"] = test_data
raw_data["train"] = train_data

print(raw_data)

DatasetDict({
    test: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 1500
    })
    train: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 1500
    })
})


This dataset contains Amazon reviews that I downloaded from Kaggle, pre-processed, and stored for a project assignment I completed about a year ago.

Using the `datasets` package, we converted the data into a `DatasetDict`, the format that the Hugging Face libraries work with.

In [None]:
Dataset.to_pandas(raw_data['test'])

|      | `class_index` | `review_combined_lemma` | `__index_level_0__` |
|------|---------------|-------------------------|---------------------|
| 0 | 1 | great book must preface saying not religious l... | 23218 |
| 1 | 0 | huge disappointment big time long term trevani... | 20731 |
| 2 | 1 | wayne tight cant hang turk album hot want howe... | 39555 |
| 3 | 1 | excellent read book elementary school probably... | 147506 |
| 4 | 0 | not anusara although book touted several anusa... | 314215 |
| ... | ... | ... | ... |
| 1495 | 0 | indifferent hears big dog little dog yap away ... | 316639 |
| 1496 | 1 | movie watch grandchild good movie little gore ... | 91834 |
| 1497 | 1 | patriot did win superbowl great piece memorabi... | 176737 |
| 1498 | 0 | 11 stinker never fan series cd really bizarre ... | 298198 |


In [None]:
tokenized_ids = tokenizer(raw_data['test']['review_combined_lemma'], truncation=True,padding=True,return_tensors="tf", max_length=128)
model_output = model(tokenized_ids)

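Here the tokenizer converts the raw text into numerical form: it splits each review into tokens and maps them to integer ids using the downloaded vocabulary. With `truncation=True` and `max_length=128`, reviews longer than 128 tokens are cut off, and `padding=True` pads the shorter ones so the batch forms a rectangular tensor.

These token ids are passed into the model, and the output is captured.

Since we are not training the model, we tokenize only the test dataset.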
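The mechanics of tokenizing, truncating, and padding can be sketched with a toy example. Everything here is a simplification for illustration: the vocabulary and ids are made up, and the text is split on whitespace, whereas the real tokenizer applies WordPiece using the entries in `vocab.txt`.

```python
# Toy vocabulary; the real tokenizer uses the WordPiece entries in vocab.txt.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "great": 4, "book": 5, "huge": 6, "disappointment": 7}

def encode(text, max_length):
    """Map whitespace-split words to ids, adding [CLS]/[SEP] and truncating."""
    words = text.split()[: max_length - 2]  # leave room for the two special tokens
    word_ids = [vocab.get(w, vocab["[UNK]"]) for w in words]
    return [vocab["[CLS]"]] + word_ids + [vocab["[SEP]"]]

def encode_batch(texts, max_length=8):
    """Encode each text, then pad to the longest sequence in the batch."""
    encoded = [encode(t, max_length) for t in texts]
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [vocab["[PAD]"]] * (longest - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = encode_batch(["great book", "huge disappointment big time"])
print(batch["input_ids"])       # [[2, 4, 5, 3, 0, 0], [2, 6, 7, 1, 1, 3]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

The attention mask marks which positions hold real tokens (1) and which are padding (0), so the model can ignore the padded positions.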

In [None]:
from sklearn.metrics import classification_report

tf.keras.backend.clear_session()

print(classification_report(raw_data['test']['class_index'], tf.argmax(model_output.logits,axis=1)))

              precision    recall  f1-score   support

           0       0.72      0.95      0.82       722
           1       0.93      0.67      0.78       778

    accuracy                           0.80      1500
   macro avg       0.83      0.81      0.80      1500
weighted avg       0.83      0.80      0.80      1500



Here we can see that the off-the-shelf checkpoint, a DistilBERT model already fine-tuned on SST-2, gives us 80% accuracy without any training on our data, which is very good :).
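As a quick sanity check on how the report's numbers fit together, overall accuracy can be recovered from the per-class recall and support values. This is only a sketch: the inputs are the rounded figures from the report above, so the result is approximate.

```python
# Per-class recall and support, copied from the classification report above.
recall = {0: 0.95, 1: 0.67}
support = {0: 722, 1: 778}

# recall * support approximates the number of correctly classified reviews per class.
correct = {label: recall[label] * support[label] for label in recall}
total = sum(support.values())

accuracy = sum(correct.values()) / total
print(f"approximate accuracy: {accuracy:.2f}")  # approximate accuracy: 0.80
```

Read this way, the report says the model rarely mislabels negative reviews (recall 0.95) but misses about a third of the positive ones (recall 0.67).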

### BERT with Logistic Regression

In [6]:
from transformers import TFAutoModel

bert_model = TFAutoModel.from_pretrained(checkpoint)
bert_model.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Model: "tf_distil_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
Total params: 66362880 (253.15 MB)
Trainable params: 66362880 (253.15 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Since we will train our own classifier, we load the model without the classification head by using the `TFAutoModel` class. You can see the difference between the two model summaries: the `pre_classifier` and `classifier` layers are gone.

In [7]:
tokenized_ids = tokenizer(raw_data['train']['review_combined_lemma'], truncation=True,padding=True,return_tensors="tf", max_length=128)
bert_output = bert_model(tokenized_ids)

We tokenize the training dataset and pass it through the model.

In [8]:
bert_output.last_hidden_state.numpy().mean(axis=1).shape
reshaped_output = bert_output.last_hidden_state.numpy().mean(axis=1)

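Here we take the last hidden state and average it over the token axis (`axis=1`), which gives one 768-dimensional vector per review. Note that this plain mean also averages over padding positions.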
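A common alternative to the plain mean is to weight the average by the attention mask so that padded positions do not contribute. The sketch below uses a toy array to show the difference; `mean_pool` is a hypothetical helper, not something the notebook defines, and whether it changes the downstream accuracy is not tested here.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[..., np.newaxis].astype(last_hidden_state.dtype)  # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(axis=1)  # sum of real-token vectors
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # real tokens per sequence, guarded against 0
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
                   [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])  # no padding
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

plain = hidden.mean(axis=1)       # what .mean(axis=1) computes
masked = mean_pool(hidden, mask)  # padding excluded
print(plain)
print(masked)
```

With the notebook's objects, the call would look something like `mean_pool(bert_output.last_hidden_state.numpy(), tokenized_ids["attention_mask"].numpy())`.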

In [9]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(reshaped_output, raw_data['train']['class_index'])

We feed these pooled BERT outputs to a logistic regression model and train it.

In [10]:
from sklearn.metrics import classification_report
tokenized_ids = tokenizer(raw_data['test']['review_combined_lemma'], truncation=True,padding=True,return_tensors="tf", max_length=128)
bert_output = bert_model(tokenized_ids)
reshaped_output = bert_output.last_hidden_state.numpy().mean(axis=1)
y_pred = lr.predict(reshaped_output)
print(classification_report(raw_data['test']['class_index'], y_pred))

              precision    recall  f1-score   support

           0       0.84      0.86      0.85       722
           1       0.87      0.84      0.85       778

    accuracy                           0.85      1500
   macro avg       0.85      0.85      0.85      1500
weighted avg       0.85      0.85      0.85      1500



On the test dataset, accuracy jumped from 80% to 85% with a mere logistic regression classifier on top. Isn't it beautiful? The main drawback is that this approach consumes a lot of GPU memory, because all 1,500 tokenized reviews go through the model in a single forward pass.
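One common way to ease this memory pressure is to run the model over the data in small batches rather than all at once. The helper below is a minimal sketch of that idea; `embed_in_batches` and `fake_embed` are hypothetical names, and the stand-in embedder exists only so the example runs without a model. In the notebook, `embed_fn` would tokenize a batch, call `bert_model`, and return the pooled NumPy array.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=32):
    """Run embed_fn on slices of texts and stack the results row-wise."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch)))  # one (len(batch), dim) array per slice
    return np.concatenate(chunks, axis=0)

# Stand-in embedder so the sketch runs without a model: [word count, character count].
def fake_embed(batch):
    return [[len(t.split()), len(t)] for t in batch]

texts = ["great book", "huge disappointment", "not bad at all", "meh", "loved it"]
embeddings = embed_in_batches(texts, fake_embed, batch_size=2)
print(embeddings.shape)  # (5, 2): one row per input text
```

Only one batch of activations lives in memory at a time, so peak usage depends on `batch_size` rather than on the size of the dataset.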

### Sentence Transformers with Logistic Regression

In [None]:
os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

test_data = pd.read_csv('test_data_sample_complete.csv')
train_data = pd.read_csv('train_data_sample_complete.csv')

test_data = test_data.sample(n=10000, random_state=42)
train_data = train_data.sample(n=100000, random_state=42)  # larger training sample

test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})
train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})

test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')
train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')

test_data = Dataset.from_pandas(test_data)
train_data = Dataset.from_pandas(train_data)
raw_data = DatasetDict()
raw_data["test"] = test_data
raw_data["train"] = train_data

print(raw_data)

DatasetDict({
    test: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['class_index', 'review_combined_lemma', '__index_level_0__'],
        num_rows: 100000
    })
})


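I have reloaded the dataset, this time with 100,000 training reviews and 10,000 test reviews, to demonstrate that sentence transformers can handle larger datasets more comfortably than the BERT workflow shown earlier. The `encode` method converts text straight into fixed-size embeddings and works through the data in batches, so memory usage stays bounded no matter how many reviews we pass in.

Although both models are encoder transformers (`all-mpnet-base-v2` is built on MPNet, a close relative of BERT), the sentence transformers workflow is the more memory-efficient of the two.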

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_embeddings = model.encode(raw_data['train']['review_combined_lemma'], show_progress_bar=True)
test_embeddings = model.encode(raw_data['test']['review_combined_lemma'], show_progress_bar=True)


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

Batches:   0%|          | 0/313 [00:00<?, ?it/s]

We loaded the model and converted both the training and test data into embeddings.

In [None]:
train_embeddings.shape

(100000, 768)

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(train_embeddings, raw_data['train']['class_index'])

Next, we trained a lightweight logistic regression model on those embeddings.

In [None]:
from sklearn.metrics import classification_report

y_pred = lr.predict(test_embeddings)
print(classification_report(raw_data['test']['class_index'], y_pred))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87      4972
           1       0.86      0.88      0.87      5028

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



Here, accuracy increased from 85% to 87%. However, we cannot attribute this improvement to sentence transformers alone: this run used 100,000 training reviews instead of 1,500 and a different test sample, and both BERT and sentence transformers capture the context of the text. That said, based on my understanding, sentence transformers are faster, more scalable, and more reliable.

### Creating Labels Using Sentence Transformers

Let’s assume that instead of predicting positive or negative sentiment, we want to classify sentiment on a 5-point Likert scale. Sentence transformers come in handy here, as they allow us to explore the similarity between the labels and the input text, helping us tag the input accordingly.

In [None]:
label_embeddings = model.encode( ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"], show_progress_bar=True)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(test_embeddings, label_embeddings)

array([[0.15364526, 0.17884818, 0.12452998, 0.1333864 , 0.08642562],
       [0.27978075, 0.19118355, 0.12162416, 0.17023209, 0.17311683],
       [0.07127699, 0.14324695, 0.07260972, 0.08962228, 0.07847168],
       ...,
       [0.15041098, 0.13494283, 0.01669509, 0.1404528 , 0.17394802],
       [0.00270087, 0.05694368, 0.01807276, 0.0432991 , 0.03236848],
       [0.13147888, 0.17518383, 0.14696477, 0.15878314, 0.17004938]],
      dtype=float32)

It's simple: we have computed the cosine similarity between each input text and each of the labels defined above. Each row corresponds to a review and each column to a label.
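For intuition, cosine similarity is the dot product of two vectors after each has been scaled to unit length. The sketch below reproduces that computation in NumPy on toy 2-D vectors; the function name `cosine_similarity_matrix` and the vectors are illustrative, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return the (len(a), len(b)) matrix of cosine similarities between rows.

    Assumes no row is the zero vector.
    """
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy 2-D "embeddings": two texts and three labels.
text_vecs = np.array([[1.0, 0.0], [1.0, 1.0]])
label_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

sims = cosine_similarity_matrix(text_vecs, label_vecs)
print(np.round(sims, 3))
```

Values near 1 mean the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.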

In [None]:
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)
y_pred

array([1, 0, 1, ..., 4, 1, 1])

In [None]:

labels = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
y_pred_labels = [labels[i] for i in y_pred]

test_df = Dataset.to_pandas(raw_data['test'])
y_pred_df = pd.DataFrame(y_pred_labels, columns=['Predicted_Labels'])

combined_df = pd.concat([test_df.reset_index(drop=True), y_pred_df.reset_index(drop=True)], axis=1)
combined_df


|      | `class_index` | `review_combined_lemma` | `__index_level_0__` | `Predicted_Labels` |
|------|---------------|-------------------------|---------------------|--------------------|
| 0 | 1 | great book must preface saying not religious l... | 23218 | Negative |
| 1 | 0 | huge disappointment big time long term trevani... | 20731 | Very Negative |
| 2 | 1 | wayne tight cant hang turk album hot want howe... | 39555 | Negative |
| 3 | 1 | excellent read book elementary school probably... | 147506 | Positive |
| 4 | 0 | not anusara although book touted several anusa... | 314215 | Negative |
| ... | ... | ... | ... | ... |
| 9995 | 0 | left many question read book recently diagnose... | 105263 | Positive |
| 9996 | 1 | liked wontrom reading rest great book no doubt... | 334968 | Negative |
| 9997 | 1 | recorder product durable bought fourth grader ... | 355111 | Very Positive |
| 9998 | 1 | like book elizabeth von arnim enjoy gardening ... | 95143 | Negative |


Woohoo!!! We have created our own `Predicted_Labels` using sentence transformers. Although they might not be completely accurate, they help us arrive at a quick conclusion when we have no labels for the input text.
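Since this particular dataset does have binary labels, one rough way to gauge the generated labels is to collapse the five categories onto negative/positive and compare them with `class_index`. The sketch below does this for the first five rows shown in the table above; the collapsing rule and the `binary_agreement` helper are assumptions made for illustration, and neutral predictions are simply skipped.

```python
# Assumed collapse of the five labels onto the binary scheme (0 = negative, 1 = positive).
to_binary = {"Very Negative": 0, "Negative": 0, "Positive": 1, "Very Positive": 1}

def binary_agreement(predicted_labels, true_classes):
    """Share of non-neutral predictions whose polarity matches the true class."""
    pairs = [(to_binary[p], t)
             for p, t in zip(predicted_labels, true_classes)
             if p in to_binary]  # "Neutral" has no binary counterpart, so it is dropped
    if not pairs:
        return 0.0, 0
    matches = sum(1 for pred, true in pairs if pred == true)
    return matches / len(pairs), len(pairs)

# First five rows of the table above: predicted label and class_index.
predicted = ["Negative", "Very Negative", "Negative", "Positive", "Negative"]
actual = [1, 0, 1, 1, 0]

rate, compared = binary_agreement(predicted, actual)
print(f"{rate:.2f} agreement over {compared} reviews")  # 0.60 agreement over 5 reviews
```

Five rows are far too few to draw conclusions; on the full data, the same function could be called with `combined_df['Predicted_Labels']` and `combined_df['class_index']`.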

This programming article enhanced my understanding of how to use representation models in practice, providing new insights and uncovering exciting possibilities for leveraging embedding models. More to come—stay tuned!

```{=html}
<script src="https://giscus.app/client.js"
        data-repo="ArunKoundinya/DeepLearning"
        data-repo-id="R_kgDOLhOfMA"
        data-category="General"
        data-category-id="DIC_kwDOLhOfMM4CeHeZ"
        data-mapping="pathname"
        data-strict="0"
        data-reactions-enabled="1"
        data-emit-metadata="0"
        data-input-position="bottom"
        data-theme="dark_high_contrast"
        data-lang="en"
        crossorigin="anonymous"
        async>
</script>
```