## **`Sentiment Classification using TinyBERT`**

#### BERT (Bidirectional Encoder Representations from Transformers)

- Developed by Google in 2018.
- A large transformer-based language model.
- Pretrained on large text corpora (BooksCorpus and English Wikipedia) using masked language modeling and next sentence prediction.
- Captures bidirectional context (left and right) in sentences.
- Very powerful but computationally heavy: BERT-base has about 110 million parameters and BERT-large about 340 million.

#### TinyBERT

- A smaller, faster, and lighter version of BERT, created through knowledge distillation.
- The large BERT (teacher model) transfers its knowledge to a smaller student model.
- Maintains most of BERT’s accuracy while being much more efficient for real-time and edge applications (e.g., mobile devices).
- Useful for tasks like text classification, sentiment analysis, and question answering where low latency is needed.

👉 In short: BERT = big, accurate, resource-heavy.
TinyBERT = compact, efficient, still good accuracy.
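
To make the teacher/student idea more concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It covers only the part of distillation where the student imitates the teacher's output distribution; TinyBERT's full recipe also matches intermediate layers, which is not shown here. The function names, the logits, and the temperature value are illustrative assumptions, not taken from this notebook.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# Made-up logits for a two-class problem (negative, positive)
teacher = [2.5, -1.0]
close_student = [2.0, -0.8]   # roughly agrees with the teacher
far_student = [-1.0, 2.0]     # disagrees with the teacher

loss_close = soft_target_loss(teacher, close_student)
loss_far = soft_target_loss(teacher, far_student)
print(loss_close < loss_far)  # True
```

A student that agrees with the teacher gets a lower loss, which is the signal that pushes the small model toward the large model's behavior.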

### **Import Libraries**

In [1]:
import os
import boto3
import torch
import posixpath
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

### **Load Data with pandas and Convert to a HuggingFace Dataset**

In [2]:
data = pd.read_csv("https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/refs/heads/master/IMDB-Dataset.csv")

In [3]:
data.shape

(50000, 2)

In [4]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
### Converting to Huggingface Dataset

dataset = Dataset.from_pandas(data)
dataset

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 50000
})

In [6]:
### Splitting data into train and test

dataset = dataset.train_test_split(test_size=0.3)
dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 35000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 15000
    })
})
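
Because `train_test_split` shuffles before splitting, each run produces a different partition unless a seed is fixed. The sketch below illustrates the idea with plain Python; it is not how the `datasets` library implements the split, and `split_indices` is a hypothetical helper. In `datasets`, the analogous control is the `seed` argument of `train_test_split`.

```python
import random

def split_indices(num_rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off the test fraction."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]

train_a, test_a = split_indices(50_000, 0.3, seed=42)
train_b, test_b = split_indices(50_000, 0.3, seed=42)

print(len(train_a), len(test_a))  # 35000 15000
print(test_a == test_b)           # True: same seed, same split
```

Fixing the seed makes accuracy numbers comparable across reruns, since the model is always evaluated on the same 15,000 reviews.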

In [7]:
data["sentiment"].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [8]:
label2id = {"negative": 0, "positive": 1}
id2label = {0: "negative", 1: "positive"}

In [9]:
dataset = dataset.map(lambda x: {"label": label2id[x["sentiment"]]})
dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment', 'label'],
        num_rows: 35000
    })
    test: Dataset({
        features: ['review', 'sentiment', 'label'],
        num_rows: 15000
    })
})

In [10]:
dataset["train"][0]

{'review': 'A phenomenal achievement in awfulness. It\'s actually hilariously awful.<br /><br />First off...Nicholas Cage must now have made it to the finals in the Over-Emoting Category in his acting class. Wearing new hair plugs and with a face that has been lifted so many times his pinned back ears seem to be straining to touch in the back he oozes not only a sick smarmiess but creates a "hero" character that you have no vested interest in.<br /><br />I don\'t know what it is with Neil Labute and female characters. He makes females out to be totally deviant and evil...and pays them back by having Cage punch several of them directly in the face and call them all "b****es" a few times too. I\'ve enjoyed LaBute\'s early films and a few of his plays...but it\'s a strange fascination he has.<br /><br />I\'d give this film a 2 out of 10 solely based on Ellen Burstyn\'s performance. By the time she finally makes her appearance (bravely soldiering through her scenes with her wig line clearl

### **Data Tokenization**

In [11]:
model_checkpoint = "huawei-noah/TinyBERT_General_4L_312D"

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [13]:
tokenizer

BertTokenizerFast(name_or_path='huawei-noah/TinyBERT_General_4L_312D', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [14]:
tokenizer(dataset["train"][1]["review"])

{'input_ids': [101, 1045, 2001, 3201, 2005, 1037, 21676, 2075, 6816, 2806, 3185, 1998, 2035, 1045, 2288, 2001, 1996, 5409, 3185, 1045, 1005, 2310, 2464, 1999, 2086, 1012, 2009, 2001, 2471, 2004, 2919, 2004, 5797, 3854, 14163, 12680, 14691, 5054, 1012, 2757, 5896, 1012, 2757, 3772, 1012, 2757, 2673, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 4379, 2045, 2001, 2070, 2204, 2954, 5019, 2021, 1996, 3893, 2217, 4515, 2045, 1012, 2065, 2023, 3185, 8480, 1999, 2115, 2160, 2448, 8040, 28578, 2075, 2000, 1037, 3042, 1998, 13764, 19989, 1998, 2360, 1010, 1000, 3531, 2393, 2045, 2003, 1037, 3185, 1999, 2026, 2160, 3214, 2000, 2486, 2111, 2000, 10797, 5920, 1000, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [15]:
def tokenize(batch):
    # Pad to the longest review in the batch and truncate anything beyond 300 tokens
    return tokenizer(batch["review"], padding=True, truncation=True, max_length=300)

In [16]:
dataset = dataset.map(tokenize, batched=True, batch_size=None)

In [17]:
dataset["train"][0].keys()

dict_keys(['review', 'sentiment', 'label', 'input_ids', 'token_type_ids', 'attention_mask'])
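
The effect of `padding=True, truncation=True, max_length=300` can be pictured with a toy example: sequences longer than the limit are cut, shorter ones are filled with the `[PAD]` id (0 in the tokenizer printed above), and `attention_mask` marks real tokens with 1 and padding with 0. The helper below is a simplified illustration under those assumptions, not the tokenizer's implementation; in particular, the real tokenizer keeps `[SEP]` as the final token when it truncates, which this sketch ignores. The token ids are arbitrary.

```python
PAD_ID = 0  # id of [PAD] in the tokenizer printed earlier

def pad_and_truncate(batch_ids, max_length):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [
    [101, 2023, 3185, 102],
    [101, 2204, 102],
    [101, 1045, 2001, 3201, 2005, 102],
]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"][1])       # [101, 2204, 102, 0, 0]
print(encoded["attention_mask"][1])  # [1, 1, 1, 0, 0]
```

Since the notebook calls `map` with `batched=True, batch_size=None`, each split is processed as a single batch, so every row in a split ends up with the same length (at most 300).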

### **Building Model Evaluation**

- https://huggingface.co/docs/transformers/v4.42.0/en/tasks/sequence_classification#evaluate

In [18]:
import evaluate

accuracy = evaluate.load("accuracy")

In [19]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
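
For intuition, `compute_metrics` receives one row of logits per example, takes the index of the largest logit as the predicted class, and reports the fraction of predictions that match the labels. The plain-Python sketch below mirrors that logic on made-up numbers; `accuracy_from_logits` is a hypothetical stand-in, not part of `evaluate`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four toy examples with two logits each (negative, positive)
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 1.7]]
toy_labels = [0, 1, 1, 1]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```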

### **Building Model Pipeline**

- The first thing we need is a pretrained BERT-like model.
- The only slight modification is that we use `AutoModelForSequenceClassification` instead of `AutoModel`.
- The difference is that `AutoModelForSequenceClassification` adds a classification head on top of the pretrained model outputs, and this head is trained together with the base model during fine-tuning.
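
As a rough picture of what the classification head does: it is a linear layer that maps the encoder's pooled sentence vector to one logit per label. The sketch below uses a 3-dimensional vector and made-up weights purely for illustration; for the checkpoint used in this notebook the vector has 312 dimensions and the weights start out randomly initialized. `linear_head` is a hypothetical helper.

```python
def linear_head(pooled, weight, bias):
    """Compute one logit per label: logits = weight @ pooled + bias."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]

pooled = [0.5, -1.0, 2.0]          # toy sentence representation
weight = [[0.1, 0.2, -0.3],        # row 0: weights for "negative"
          [-0.2, 0.4, 0.5]]        # row 1: weights for "positive"
bias = [0.0, 0.1]

logits = linear_head(pooled, weight, bias)
print(logits)
```

Fine-tuning adjusts both these head weights and the encoder underneath, so that the larger logit lines up with the correct sentiment.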

In [20]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
device

device(type='mps')

In [21]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label2id),
    label2id=label2id,
    id2label=id2label,   
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 312, padding_idx=0)
      (position_embeddings): Embedding(512, 312)
      (token_type_embeddings): Embedding(2, 312)
      (LayerNorm): LayerNorm((312,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-3): 4 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=312, out_features=312, bias=True)
              (key): Linear(in_features=312, out_features=312, bias=True)
              (value): Linear(in_features=312, out_features=312, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=312, out_features=312, bias=True)
              (LayerNorm): LayerNorm((312,), eps=1e-1

In [23]:
args = TrainingArguments(
    output_dir="sentiment-train_dir",
    overwrite_output_dir=True,
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",  # evaluate on the test split after every epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)

In [24]:
### Train the model

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3521 | 0.309081 | 0.867933 |
| 2 | 0.2949 | 0.288847 | 0.8792 |
| 3 | 0.2545 | 0.292007 | 0.8804 |




TrainOutput(global_step=3282, training_loss=0.31753584439256144, metrics={'train_runtime': 746.0679, 'train_samples_per_second': 140.738, 'train_steps_per_second': 4.399, 'total_flos': 882184338000000.0, 'train_loss': 0.31753584439256144, 'epoch': 3.0})
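
The `global_step=3282` in the output can be checked by hand from the training set size, the batch size, and the number of epochs. The arithmetic below assumes a single device and no gradient accumulation, which matches the arguments shown in this notebook.

```python
import math

train_rows = 35_000   # size of the training split
batch_size = 32       # per_device_train_batch_size
epochs = 3            # num_train_epochs

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1094 3282
```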

In [25]:
trainer.evaluate()



{'eval_loss': 0.2920065224170685,
 'eval_accuracy': 0.8804,
 'eval_runtime': 23.8391,
 'eval_samples_per_second': 629.22,
 'eval_steps_per_second': 19.674,
 'epoch': 3.0}

### **Save model and Load for Inference**

In [26]:
trainer.save_model("sentiment-classifier-tinyBERT")

In [27]:
sample_reviews = [
    "Absolutely loved this movie! The story was engaging and the performances were top-notch.",
    "A delightful film with great acting and a heartwarming message. Highly recommended!",
    "This was a complete waste of time. The plot was predictable and the acting was terrible.",
    "I didn't enjoy this movie at all. The pacing was slow and the characters were uninteresting."
]

In [28]:
sentiment_classifier = pipeline(task="text-classification", model="sentiment-classifier-tinyBERT", device=device)

sentiment_classifier(sample_reviews)

Device set to use mps


[{'label': 'positive', 'score': 0.9879263639450073},
 {'label': 'positive', 'score': 0.9894049167633057},
 {'label': 'negative', 'score': 0.9905412793159485},
 {'label': 'negative', 'score': 0.9911723732948303}]
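
The `label` and `score` fields can be read as follows: the model produces two logits, a softmax turns them into probabilities, and the pipeline reports the most likely class together with its probability. The sketch below reproduces that post-processing in plain Python on made-up logits; it is an approximation of what the pipeline does for a two-label model, and `logits_to_prediction` is a hypothetical helper.

```python
import math

id2label = {0: "negative", 1: "positive"}

def logits_to_prediction(logits):
    """Softmax over the logits, then return the top class and its probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

prediction = logits_to_prediction([-2.2, 2.2])
print(prediction["label"])  # positive
```

A score close to 1.0, as in the outputs above, means the two logits are far apart, not that the prediction is guaranteed to be correct.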

### **Push Trained Sentiment Classifier model to AWS S3**

In [29]:
s3_client = boto3.client("s3")

bucket_name = "bert-based-project"

In [30]:
def create_bucket(bucket_name):
    # list_buckets only returns buckets owned by the current account
    s3_buckets = [bucket["Name"] for bucket in s3_client.list_buckets()["Buckets"]]

    if bucket_name not in s3_buckets:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
        )
        return f"Bucket {bucket_name} created successfully"

    return "Bucket already exists in your account!!! Feel free to use it."

In [31]:
create_bucket(bucket_name)

'Bucket already exists in your account!!! Feel free to use it.'
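
One caveat when adapting `create_bucket` to another region: `CreateBucketConfiguration` with a `LocationConstraint` is needed for regions other than `us-east-1`, while for `us-east-1` it has to be left out. The sketch below only builds the keyword arguments and makes no AWS call; `build_create_bucket_kwargs` is a hypothetical helper, not part of boto3.

```python
def build_create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3_client.create_bucket for a given region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

mumbai = build_create_bucket_kwargs("bert-based-project", "ap-south-1")
virginia = build_create_bucket_kwargs("bert-based-project", "us-east-1")
print(mumbai)
print(virginia)
```

The result could then be passed along as `s3_client.create_bucket(**build_create_bucket_kwargs(bucket_name, "ap-south-1"))`.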

In [32]:
s3_client.list_buckets()

{'ResponseMetadata': {'RequestId': 'Q7JBAAK0S1V5SGMR',
  'HostId': 'FDkvq+OVf1McnNu73TzvI86v3zMEPuswsnVO6q3ttXTbtWiCTr1lCs6Nwag4k34usdpiw64uUA0=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'FDkvq+OVf1McnNu73TzvI86v3zMEPuswsnVO6q3ttXTbtWiCTr1lCs6Nwag4k34usdpiw64uUA0=',
   'x-amz-request-id': 'Q7JBAAK0S1V5SGMR',
   'date': 'Mon, 18 Aug 2025 22:46:16 GMT',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Buckets': [{'Name': 'bert-based-project',
   'CreationDate': datetime.datetime(2025, 8, 18, 20, 47, 29, tzinfo=tzutc())}],
 'Owner': {'ID': '9594599b0b363b01690f507e083d2f2038c964873451ac0a90d0ce794286321a'}}

In [33]:
### Upload model folder to S3 bucket bert-based-project in ml-models/sentiment-classifier-tinyBERT

def upload_directory(bucket_name, dir_path, s3_prefix):
    for root, dirs, files in os.walk(dir_path):
        for file in files:
            # Full local path
            file_path = os.path.join(root, file)

            # Relative path (keeps folder structure in S3)
            rel_path = os.path.relpath(file_path, dir_path)

            # S3 key: convert OS-specific separators to "/" before joining
            s3_key = posixpath.join(s3_prefix, rel_path.replace(os.sep, "/"))

            print(f"Uploading {file_path} → s3://{bucket_name}/{s3_key}")

            s3_client.upload_file(file_path, bucket_name, s3_key)

In [34]:
upload_directory(bucket_name, "sentiment-classifier-tinyBERT", "ml-models/sentiment-classifier-tinyBERT")

Uploading sentiment-classifier-tinyBERT/model.safetensors → s3://bert-based-project/ml-models/sentiment-classifier-tinyBERT/model.safetensors
Uploading sentiment-classifier-tinyBERT/tokenizer_config.json → s3://bert-based-project/ml-models/sentiment-classifier-tinyBERT/tokenizer_config.json
Uploading sentiment-classifier-tinyBERT/special_tokens_map.json → s3://bert-based-project/ml-models/sentiment-classifier-tinyBERT/special_tokens_map.json
Uploading sentiment-classifier-tinyBERT/config.json → s3://bert-based-project/ml-models/sentiment-classifier-tinyBERT/config.json
Uploading sentiment-classifier-tinyBERT/tokenizer.json → s3://bert-based-project/ml-models/sentiment-classifier-tinyBERT/tokenizer.json
Uploading sentiment-classifier-tinyBERT/training_args.bin → s3://bert-based-project/ml-models/sentiment-classifier-tinyBERT/training_args.bin
Uploading sentiment-classifier-tinyBERT/vocab.txt → s3://bert-based-project/ml-models/sentiment-classifier-tinyBERT/vocab.txt
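
S3 keys use "/" as the separator regardless of the local operating system, so a relative path produced on Windows needs its backslashes converted before it becomes part of a key. The sketch below shows the key construction in isolation, using `pathlib` pure paths so the Windows case can be examined on any machine. `to_s3_key` is a hypothetical helper, and the `checkpoints` subfolder is made up for the example.

```python
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

def to_s3_key(s3_prefix, rel_path, windows=False):
    """Join a prefix and a relative path into an S3 key with '/' separators."""
    path_cls = PureWindowsPath if windows else PurePosixPath
    return posixpath.join(s3_prefix, path_cls(rel_path).as_posix())

prefix = "ml-models/sentiment-classifier-tinyBERT"

key_posix = to_s3_key(prefix, "config.json")
key_windows = to_s3_key(prefix, "checkpoints\\model.safetensors", windows=True)

print(key_posix)    # ml-models/sentiment-classifier-tinyBERT/config.json
print(key_windows)  # ml-models/sentiment-classifier-tinyBERT/checkpoints/model.safetensors
```

For the flat model folder uploaded above the distinction does not matter, but it does once the directory contains subfolders.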
