# Install Dependencies

## 1. transformers

### What it is:
A library by HuggingFace containing pretrained models like BERT, RoBERTa, GPT, DistilBERT, T5, etc.

Why we install it:
To use:
- BERT tokenizer
- BERT for classification
- Sequence models (NER, QA, Summarization)
- Trainer API

In simple words:
- Gives you ready-made AI models like BERT and GPT.

## 2. datasets

### What it is:
A HuggingFace library for handling datasets efficiently.

Why we install it:
- Load datasets very fast
- Tokenize large datasets with multiprocessing
- Works directly with Trainer API

In simple words:
- Helps load and process large datasets easily and fast.

## 3. torch (PyTorch)

### What it is:
The deep learning framework used to train neural networks.

Why we install it:
- The Transformers models in this notebook run on the PyTorch backend
- Needed for training BERT, DistilBERT, etc.
- Provides tensors, GPU acceleration, neural network layers

In simple words:
- This is the engine that trains the BERT model.

## 4. scikit-learn

### What it is:
A machine-learning library.

Why we install it:
- Train/test split
- Accuracy score
- Classification report
- Traditional ML models (SVM, RF, etc.)

In simple words:
- Used to measure accuracy and prepare data.

## 5. pandas

### What it is:
A data handling and manipulation library.

Why we install it:
- To read CSV datasets
- Create DataFrames
- Clean and preprocess text data

In simple words:
- Used to load and handle your dataset.

## 6. --quiet

### What it is:
A flag that hides installation logs.

Why we use it:
- To keep notebook output clean
- Prevent long installation messages

In simple words:
- Installs everything silently without showing long messages.

In [1]:
!pip install transformers datasets torch scikit-learn pandas --quiet

# Import Libraries

## 1. import pandas as pd

**What it does:**
Loads the **pandas** library and gives it the short name `pd`.

**Why we use it:**

* Read CSV files
* Create & manipulate DataFrames
* Handle text datasets

**Example:**

```python
df = pd.read_csv("data.csv")
```

## 2. import numpy as np

**What it does:**
Loads **NumPy**, a numerical computing library.

**Why we use it:**

* Handle arrays
* Perform numerical operations
* Convert data to `numpy` format

**Example:**

```python
arr = np.array([1,2,3])
```

## 3. import torch

**What it does:**
Loads **PyTorch**, the deep learning framework.

**Why we use it:**

* Create tensors
* Move data to GPU
* Build & train neural networks
* Used internally by Transformers

**Example:**

```python
x = torch.tensor([1, 2, 3])
```

## 4. from sklearn.model_selection import train_test_split

**What it does:**
Imports the function that **splits a dataset** into two subsets per call:

* Training set
* Validation (or test) set

To get separate training, validation, and test sets, call it twice.

**Example:**

```python
train_df, val_df = train_test_split(df, test_size=0.2)
```
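
A three-way split can be built from two calls to `train_test_split`. The sketch below is illustrative only: the ten-item list, the 60/20/20 proportions, and the variable names are assumptions, not part of the notebook.

```python
from sklearn.model_selection import train_test_split

items = list(range(10))  # stand-in for the rows of a dataset

# First call: hold out 20% of the data as the test set.
train_val, test = train_test_split(items, test_size=0.2, random_state=42)

# Second call: split the remaining 80% into train and validation.
# 0.25 of the remaining 8 items is 2 items, i.e. 20% of the original data.
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```

Fixing `random_state` makes both splits reproducible across runs.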

## 5. from sklearn.metrics import accuracy_score, classification_report

**What these do:**

### accuracy_score

* Measures how many predictions were correct.
* Used for classification tasks.

### classification_report

* Prints precision, recall, F1-score.
* Gives detailed model evaluation.

**Example:**

```python
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```

## 6. BertTokenizer

**What it does:**
Converts text → tokens → model input format.

**Why needed:**
BERT cannot read raw text.

**Example:**

```python
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
```

## 7. BertForSequenceClassification

**What it does:**
A pretrained BERT model **for classification tasks**, such as:

* Sentiment analysis
* Spam detection
* Fake review detection
* Topic classification

**Example:**

```python
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
```

## 8. BertForTokenClassification

**What it does:**
Model used for **NER (Named Entity Recognition)**:

* PERSON
* LOCATION
* ORGANIZATION
* Dates
* Medical terms

**Example:**

```python
model = BertForTokenClassification.from_pretrained("bert-base-uncased")
```

## 9. BertTokenizerFast

**What it does:**
A **faster** tokenizer than `BertTokenizer`.
Uses Rust backend for speed.

**Why use it:**

* Faster NER tokenization
* Keeps word_ids (needed for token classification)

**Example:**

```python
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
```

## 10. BertModel

**What it does:**
Loads **BERT without any classification head**.

Used when you only want embeddings for:

* Sentence similarity
* Feature extraction
* Custom models
* Clustering / semantic search

**Example:**

```python
bert = BertModel.from_pretrained("bert-base-uncased")
```
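
One common way to turn `BertModel` outputs into a single sentence vector is masked mean pooling followed by cosine similarity. The sketch below uses small hand-written NumPy arrays in place of real hidden states, so the numbers are illustrative only. `mean_pool` and `cosine_similarity` are hypothetical helper names, not part of Transformers.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)   # shape (seq_len, 1)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "hidden states": 4 token positions, hidden size 3; the last position is padding.
h1 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [9.0, 9.0, 9.0]])
h2 = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0], [0.0, 0.0, 0.0]])
mask = np.array([1, 1, 1, 0])

v1 = mean_pool(h1, mask)
v2 = mean_pool(h2, mask)
print(cosine_similarity(v1, v2))
```

Because the mask zeroes out the padding row, the large values in the last row of `h1` do not affect the pooled vector.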

## 11. AutoTokenizer

**What it does:**
Automatically loads the right tokenizer for **any model**.

**Example:**

```python
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

## 12. AutoModelForSequenceClassification

**What it does:**
Loads the correct model for classification **automatically**, e.g.:

* BERT
* DistilBERT
* ALBERT
* RoBERTa
* MobileBERT

**Example:**

```python
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
```

## 13. Trainer

**What it does:**
Handles the entire training loop automatically:

* Training
* Evaluation
* Saving checkpoints
* Logging
* Batching
* GPU usage

**You don’t write manual loops.**

**Example:**

```python
trainer = Trainer(model=model, args=training_args, train_dataset=train_data)
```


## 14. TrainingArguments

**What it does:**
Configures HOW training happens:

* learning rate
* batch size
* number of epochs
* where to save model
* evaluation strategy
* logging

**Example:**

```python
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3
)
```

In [2]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    BertForTokenClassification,
    BertTokenizerFast, BertModel,
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)



# Part 01: Binary Classification

## Dataset

In [7]:
data = {
    "text": [
        "I love this phone!", "This is terrible.",
        "Amazing performance!", "Worst battery ever.",
        "I am very happy.", "I hate this.",
        "I lost my friend"
    ],
    "label": [1,0,1,0,1,0,0]
}
df = pd.DataFrame(data)
print(df)

                   text  label
0    I love this phone!      1
1     This is terrible.      0
2  Amazing performance!      1
3   Worst battery ever.      0
4      I am very happy.      1
5          I hate this.      0
6      I lost my friend      0


## Split

In [8]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(train_df)
print(val_df)

                   text  label
5          I hate this.      0
2  Amazing performance!      1
4      I am very happy.      1
3   Worst battery ever.      0
6      I lost my friend      0
                 text  label
0  I love this phone!      1
1   This is terrible.      0


## Tokenizer + Dataset Class

### 1. Load the BERT Tokenizer

```python
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
```

#### What this line does:

* Downloads the **BERT tokenizer**.
* "bert-base-uncased" = lowercase English model.
* Tokenizer converts text → tokens → input IDs → attention masks.

### 2. Start of Custom Dataset Class

```python
class BERTDataset(torch.utils.data.Dataset):
```

#### What this line does:

* Creates a **custom PyTorch dataset**.
* This Dataset will later be used by DataLoader and Trainer.
* It tells PyTorch how to feed BERT with data.

### 3. Constructor Method

```python
def __init__(self, text, labels, tokenizer, max_len=64):
```

#### What this does:

This function is called when you create the dataset.

It receives:

* `text` → list of sentences
* `labels` → list of labels (0/1, or multi-class)
* `tokenizer` → BERT tokenizer
* `max_len` → maximum token length (default 64 tokens)

### 4. Store Inputs in Object Variables

```python
self.text = text
self.labels = labels
self.tokenizer = tokenizer
self.max_len = max_len
```

#### What this does:

Stores the inputs so they can be used later by the dataset.

### 5. `__getitem__` Method

This method returns **one training example** at a time.

```python
def __getitem__(self, idx):
```

#### Purpose:

Given an index `idx`, it returns:

* Tokenized text
* Attention mask
* Label

PyTorch uses this to create batches.

### 6. Tokenize the Sentence

```python
enc = self.tokenizer(
    self.text[idx], truncation=True, padding="max_length",
    max_length=self.max_len, return_tensors="pt"
)
```

#### What this does:

It converts a single sentence into:

* `input_ids`
* `attention_mask`
* (optional) `token_type_ids`

##### Parameters:

| Parameter                 | Meaning                            |
| ------------------------- | ---------------------------------- |
| `self.text[idx]`          | Get sentence at index `idx`        |
| `truncation=True`         | Cuts long sentences beyond max_len |
| `padding="max_length"`    | Pads short ones to the same length |
| `max_length=self.max_len` | Maximum token length               |
| `return_tensors="pt"`     | Returns PyTorch tensors            |

##### Output example:

```python
{
 'input_ids': tensor([...]),
 'attention_mask': tensor([...])
}
```

### 7. Remove Extra Batch Dimension

```python
enc = {k: v.squeeze(0) for k,v in enc.items()}
```

#### Why?

Tokenizer returns shape:

```
(1, max_len)
```

But Trainer expects:

```
(max_len,)
```

So `.squeeze(0)` removes the first dimension.
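
The shape change can be checked on a dummy tensor. This sketch uses `torch.zeros` as a stand-in for real tokenizer output and assumes `max_len = 32`. `torch.stack` is used here only to imitate how a batch is assembled from individual examples.

```python
import torch

max_len = 32
enc_ids = torch.zeros(1, max_len, dtype=torch.long)  # shape returned for one sentence

single = enc_ids.squeeze(0)   # drop the leading batch dimension
print(enc_ids.shape, single.shape)

# Stacking 4 squeezed examples gives the (batch, max_len) shape the model expects.
batch = torch.stack([single] * 4)
print(batch.shape)
```

Without the squeeze, stacking would produce a `(4, 1, max_len)` tensor, which is not the layout the model's forward pass expects.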

### 8. Add Label to Dictionary

```python
enc["labels"] = torch.tensor(self.labels[idx])
```

#### What this does:

Adds the label so Trainer knows the correct class.

Example:

```
enc["labels"] = tensor(1)
```

This dictionary will be passed into the model.

### 9. Return Single Training Example

```python
return enc
```

#### What it returns:

A dictionary like:

```python
{
 'input_ids': tensor([...]),
 'attention_mask': tensor([...]),
 'labels': tensor(1)
}
```

This is exactly what **BERT** needs for training.

### 10. `__len__` Method

```python
def __len__(self):
    return len(self.text)
```

#### Purpose:

Returns how many samples are in the dataset.

If you have 100 sentences:

```
len(dataset) → 100
```

Trainer uses this to know how many batches to create.

In [10]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class BERTDataset(torch.utils.data.Dataset):
    def __init__(self, text, labels, tokenizer, max_len=32):
        # The sentences are short, so max_len=32 is sufficient and speeds up training.
        self.text = text
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.text[idx], truncation=True, padding="max_length",
            max_length=self.max_len, return_tensors="pt"
        )
        enc = {k: v.squeeze(0) for k,v in enc.items()}
        enc["labels"] = torch.tensor(self.labels[idx])
        return enc

    def __len__(self):
        return len(self.text)

## Prepare Dataset

### 1. input_ids

Example:

```python
'input_ids': tensor([101, 1045, 5223, 2023, 1012, 102, 0, 0, 0, ...])
```

#### What is `input_ids`?

These numbers are **token IDs** — BERT vocabulary numbers.

Example decoding:

| Token   | ID   |
| ------- | ---- |
| `[CLS]` | 101  |
| `i`     | 1045 |
| `hate`  | 5223 |
| `this`  | 2023 |
| `.`     | 1012 |
| `[SEP]` | 102  |

Everything after that becomes `0` because of padding.

#### Why do we pad with zeros?

Because every sequence in a batch must have the **same length**, here `max_len=32`.
If a sentence is shorter, it is padded with zeros.

### 2. token_type_ids

Example:

```python
'token_type_ids': tensor([0, 0, 0, 0, ..., 0])
```

#### What is this?

Token type IDs tell BERT whether a token belongs to:

* Sentence A → **0**
* Sentence B → **1**

Used for **sentence-pair inputs**, for example next sentence prediction during pretraining.

But in **classification**, we only have 1 sentence → so ALL are **0**.
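
For a sentence pair, BERT's input is laid out as `[CLS] A [SEP] B [SEP]`, and the segment IDs switch from 0 to 1 after the first `[SEP]`. The following is a pure-Python sketch of that layout using made-up token lists instead of a real tokenizer. `build_pair` is a hypothetical helper.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out a sentence pair the way BERT expects and derive segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair(["the", "sky", "is", "blue"], ["dogs", "run"])
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:>6} -> {seg}")
```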

### 3. attention_mask

Example:

```python
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ...])
```

#### What does attention mask do?

It tells BERT **which tokens are real** and **which are padding**:

| Mask value | Meaning                            |
| ---------- | ---------------------------------- |
| `1`        | Token must be attended (real word) |
| `0`        | Ignore this token (padding)        |

So here the first 6 tokens are real and the remaining 26 are padding (`max_len=32`).
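
Padding and the attention mask can be reproduced by hand. The sketch below is plain Python, not the tokenizer's actual implementation. It assumes pad ID 0 and `max_len = 32`, and reuses the six real token IDs from the `input_ids` example above. `pad_and_mask` is a hypothetical helper.

```python
def pad_and_mask(ids, max_len, pad_id=0):
    """Truncate or pad a list of token IDs and build the matching attention mask."""
    # Simplified truncation: a real tokenizer keeps the final [SEP] when it truncates.
    ids = ids[:max_len]
    n_real = len(ids)
    input_ids = ids + [pad_id] * (max_len - n_real)          # pad to a fixed length
    attention_mask = [1] * n_real + [0] * (max_len - n_real)  # 1 = real, 0 = padding
    return input_ids, attention_mask

ids = [101, 1045, 5223, 2023, 1012, 102]  # six real token IDs, including [CLS] and [SEP]
input_ids, attention_mask = pad_and_mask(ids, max_len=32)
print(sum(attention_mask), "real tokens,", attention_mask.count(0), "padding positions")
```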

### 4. labels

Example:

```python
'labels': tensor(0)
```

#### This is the actual label for training.

If it is a binary classification:

* `0` = Negative
* `1` = Positive

In the samples printed by the next cell:

```
train_dataset[0] ("I hate this.")       → label = 0
val_dataset[0]   ("I love this phone!") → label = 1
```

This is what the model tries to predict.

In [11]:
train_dataset = BERTDataset(train_df.text.tolist(), train_df.label.tolist(), tokenizer)
val_dataset   = BERTDataset(val_df.text.tolist(), val_df.label.tolist(), tokenizer)
print(train_dataset[0])
print(val_dataset[0])

{'input_ids': tensor([ 101, 1045, 5223, 2023, 1012,  102,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0]), 'labels': tensor(0)}
{'input_ids': tensor([ 101, 1045, 2293, 2023, 3042,  999,  102,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0

## Model

### 1. BertForSequenceClassification

This is a **pretrained BERT model designed for classification tasks**, such as:

* Sentiment analysis
* Spam detection
* Fake review detection
* Binary or multi-class classification

It adds a special **classification layer** (a linear layer) on top of BERT.

So BERT outputs → go into a classifier to predict labels.

### 2. .from_pretrained("bert-base-uncased")

This loads the **pretrained BERT weights** from HuggingFace.

#### "bert-base-uncased" means:

* **base** = 12 layers, 110 million parameters
* **uncased** = lowercase model (ignores capital letters)

  * “Apple” → “apple”
  * “HELLO” → “hello”

It downloads:

* Vocabulary
* Model architecture
* Pretrained weights from huge corpus (Wikipedia + Books)

### 3. num_labels=2

This tells BERT how many classes your classification problem has.

#### Since num_labels=2:

* Label **0** → e.g., negative
* Label **1** → e.g., positive

This configures the final classification layer as:

```
Hidden size → 2 output logits
```

So model outputs:

```
[logit_for_class_0, logit_for_class_1]
```

The higher logit determines the predicted class.
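
The step from logits to a predicted class and a confidence value can be sketched in plain Python. The logit values below are hypothetical, and `softmax` is written out by hand only to show the arithmetic that `torch.softmax` performs.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.3, -0.1]   # hypothetical [logit_for_class_0, logit_for_class_1]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(pred, [round(p, 3) for p in probs])
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities gives the same class as taking the argmax of the logits.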

In [12]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
print(model.classifier)

Linear(in_features=768, out_features=2, bias=True)


## Training Arguments

### output_dir="./bin_cls"

#### Meaning:

This is the folder where:

* Fine-tuned model
* Checkpoints
* Config files
* Tokenizer files

will be saved.

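Example directory layout. Checkpoint folders are named after the global training step (not the epoch), and the weights file is `model.safetensors` or `pytorch_model.bin` depending on the Transformers version:

```
bin_cls/
    ├── checkpoint-3/
    ├── checkpoint-6/
    ├── config.json
    ├── model.safetensors
```

Older versions of Transformers require this argument, so it is safest to always set it.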

### report_to="none"
#### Meaning:

Disable ALL external logging integrations.

This disables:

* Weights & Biases (wandb)
* MLflow
* TensorBoard
* Comet

Without this, Trainer tries to report logs to wandb if it's installed.

So:

```
report_to="none" = No visualization tools used
```

This keeps training output clean.

### save_strategy="epoch"

#### Meaning:

Save a checkpoint **at the end of every epoch**.

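Checkpoint folders are named after the global training step, not the epoch. For example, with 3 optimizer steps per epoch, training for 2 epochs saves:

```
checkpoint-3/
checkpoint-6/
```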

Useful because you can:

* Resume training
* Compare performance of different checkpoints
* Avoid losing progress if Colab disconnects

### logging_dir="./logs"

#### Meaning:

Where training logs will be stored.

Logs include:

* Loss
* Evaluation metrics
* Learning rate

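You can later view them using TensorBoard, provided TensorBoard reporting is enabled (for example with `report_to="tensorboard"`). With `report_to="none"`, no TensorBoard event files are written.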

Example:

```python
%tensorboard --logdir ./logs
```

### learning_rate=2e-5

#### Meaning:

This sets how large each weight update is.

`2e-5` = 0.00002

This is a commonly recommended learning rate for fine-tuning BERT; the BERT authors suggest values between 2e-5 and 5e-5.

Why small?

* BERT is pretrained
* Large LR destroys pretrained knowledge
* Small LR fine-tunes gently

### num_train_epochs=2

#### Meaning:

Number of full passes through the training dataset.

* 1 epoch → fast but low performance
* **2–3 epochs** is typical for BERT
* Too many epochs → overfitting

So `2 epochs` is a safe, good starting point.

### per_device_train_batch_size=4

#### Meaning:

Number of samples processed in one batch **per GPU/CPU**.

* Batch size 4 fits small GPUs
* Larger batch → faster training but needs more VRAM

Rough starting points for BERT-base (the real limit depends on sequence length and numeric precision):

| GPU VRAM | Recommended batch |
| -------- | ----------------- |
| 8 GB     | 4                 |
| 12 GB    | 8                 |
| 16+ GB   | 16                |

Batch size 4 is safe for most environments.


In [15]:
training_args = TrainingArguments(
    output_dir="./bin_cls",
    report_to="none",
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=2e-5,
    num_train_epochs=4,             # raised from 2
    per_device_train_batch_size=2   # lowered from 4
)


Raising the number of epochs from 2 to 4 gives the model more passes over the data. Lowering the batch size from 4 to 2 increases the number of weight updates per epoch, which can sometimes help generalization. On a dataset this small, both changes also lengthen training and raise the risk of overfitting.
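
The effect of these two settings on the number of weight updates can be worked out directly. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept. `total_steps` is a hypothetical helper, and 5 is the size of the training split in this notebook.

```python
import math

def total_steps(n_samples, batch_size, epochs):
    """Return (steps per epoch, total optimizer steps) for one training run."""
    steps_per_epoch = math.ceil(n_samples / batch_size)  # the last batch may be smaller
    return steps_per_epoch, steps_per_epoch * epochs

print(total_steps(5, batch_size=4, epochs=2))  # original settings
print(total_steps(5, batch_size=2, epochs=4))  # settings used in the cell above
```

The second configuration yields 12 steps in total, which agrees with the `global_step=12` that `trainer.train()` reports in this notebook.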

## Trainer

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()




TrainOutput(global_step=12, training_loss=0.5481272141138712, metrics={'train_runtime': 109.2446, 'train_samples_per_second': 0.183, 'train_steps_per_second': 0.11, 'total_flos': 328888819200.0, 'train_loss': 0.5481272141138712, 'epoch': 4.0})

## Evaluation

In [None]:
pred = trainer.predict(val_dataset)
y_pred = pred.predictions.argmax(1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))



Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
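
The numbers in this report can be recomputed by hand. The predictions below are inferred from the report itself (one sample per class, both predicted as class 0), so treat them as a reconstruction rather than logged values. `per_class_metrics` is a hypothetical helper written in plain Python, not scikit-learn's implementation.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall and F1 for one class; undefined ratios are reported as 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0]   # "I love this phone!" -> 1, "This is terrible." -> 0
y_pred = [0, 0]   # inferred: the model predicted class 0 for both samples

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
for cls in (0, 1):
    print(cls, [round(m, 2) for m in per_class_metrics(y_true, y_pred, cls)])
```

Class 1 is never predicted, so its precision is undefined. scikit-learn reports 0.00 in that case and emits a warning.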





## Saving the Fine-Tuned Model

In [17]:
trainer.save_model("./bin_cls")
tokenizer.save_pretrained("./bin_cls")

('./bin_cls/tokenizer_config.json',
 './bin_cls/special_tokens_map.json',
 './bin_cls/vocab.txt',
 './bin_cls/added_tokens.json')

## Using the Fine-Tuned Model

In [18]:
model = BertForSequenceClassification.from_pretrained("./bin_cls")
tokenizer = BertTokenizer.from_pretrained("./bin_cls")

In [19]:
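def predict_with_confidence(text):
    # Tokenize one sentence and run it through the fine-tuned model.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    model.eval()               # make sure dropout is disabled for inference
    with torch.no_grad():      # gradients are not needed for prediction
        outputs = model(**inputs)
    logits = outputs.logits

    # Softmax turns the logits into class probabilities.
    probs = torch.softmax(logits, dim=1)

    pred_id = probs.argmax(dim=1).item()
    class_name = label_names[pred_id]   # label_names must be defined before calling
    confidence = probs[0][pred_id].item()

    return class_name, confidence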

In [20]:
label_names = {0: "Negative", 1: "Positive"}

In [21]:
text = "I enjoy rainy days"
label, conf = predict_with_confidence(text)
print(f"Text: {text}")
print(f"Prediction: {label}")
print(f"Confidence: {conf*100:.2f}%")


Text: I enjoy rainy days
Prediction: Negative
Confidence: 56.45%


## Using Binary Sentiment Dataset

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("dineshpiyasamara/sentiment-analysis-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/dineshpiyasamara/sentiment-analysis-dataset?dataset_version_number=1...


100%|██████████| 460k/460k [00:00<00:00, 38.8MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/dineshpiyasamara/sentiment-analysis-dataset/versions/1





In [None]:
import os
files = os.listdir(path)
for file in files:
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(path, file))
        print("Loaded:", file)
        break

Loaded: sentiment_analysis.csv


In [None]:
df

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...
1,2,0,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...
...,...,...,...
7915,7916,0,Live out loud #lol #liveoutloud #selfie #smile...
7916,7917,0,We would like to wish you an amazing day! Make...
7917,7918,0,Helping my lovely 90 year old neighbor with he...
7918,7919,0,Finally got my #smart #pocket #wifi stay conne...


In [None]:
df = df[['tweet', 'label']]

In [None]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(train_df)
print(val_df)

                                                  tweet  label
4252  Cool Car wash idea #TheIsland #BankHolidayMond...      0
4428  Photo: 35th #Birthday to the #Sony #walkman @t...      0
7374  iPads are the biggest pile of fucking $&@*# on...      1
1410  Yearbook? Hmmmmm #instagram #instagood #togeth...      0
7896  So pissed! Macbook crashes, Apple Company does...      1
...                                                 ...    ...
5226  #Shana #tova!!! #Jewish #newyear everyone, may...      0
5390  I'm so sick of buying new cell phone chargers....      1
860   it, Want it, Have it! Download the free #iPhon...      0
7603  Photo: #nikosx #iphone #beach #holiday #bw #ip...      0
7270  Just got an iPhone 4S :) hehe #iPhone #apple #...      1

[6336 rows x 2 columns]
                                                  tweet  label
4896  Photo: cause we both dressed up today  #boyfr...      0
7539  @skullcandy your product is brutal, 1 headphon...      1
1677  Sunset Today in Zeeland 

In [None]:
train_dataset = BERTDataset(train_df.tweet.tolist(), train_df.label.tolist(), tokenizer)
val_dataset   = BERTDataset(val_df.tweet.tolist(), val_df.label.tolist(), tokenizer)
print(train_dataset[0])
print(val_dataset[0])

{'input_ids': tensor([  101,  4658,  2482,  9378,  2801,  1001,  1996,  2483,  3122,  1001,
         2924, 14854,  8524, 24335, 29067,  2100, 10957,  1001, 10957, 11442,
         4710,  1001, 25157,  4140,  1001, 27125, 26887,  1001,  2431, 22591,
        13469,  1001,  5070, 11968,  3207,  2860,  1001,  4913,  5297,  1001,
         2258, 18656,  1001, 19102,  1001, 18059,  1001, 28205,  2015,  1001,
         2482,  1001, 13154,  1001,  1048,  2290,  1001, 18798, 16059,  1001,
         4202, 26760, 10128,   102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels'

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
print(model.classifier)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)


In [None]:
training_args = TrainingArguments(
    output_dir="./bin_cls",
    report_to="none",
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=32
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()



In [None]:
pred = trainer.predict(val_dataset)
y_pred = pred.predictions.argmax(1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))

In [None]:
trainer.save_model("./bin_cls")
tokenizer.save_pretrained("./bin_cls")

In [None]:
model = BertForSequenceClassification.from_pretrained("./bin_cls")
tokenizer = BertTokenizer.from_pretrained("./bin_cls")

In [None]:
tweet = "Battery life is awful, I regret buying this."
label, conf = predict_with_confidence(tweet)
print(f"Text: {tweet}")
print(f"Prediction: {label}")
print(f"Confidence: {conf*100:.2f}%")

# Part 02: Multi-Class Classification

In [22]:
data = {
    "text": [
        "The phone is great",       # positive
        "Battery is average",       # neutral
        "Worst camera ever",        # negative
        "Amazing sound quality",    # positive
        "Not good, not bad",        # neutral
        "Terrible performance"      # negative
    ],
    "label": [2,1,0,2,1,0]           # 0=Neg, 1=Neutral, 2=Pos
}
df = pd.DataFrame(data)

In [24]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(train_df)
print(val_df)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class BERTDataset(torch.utils.data.Dataset):
    def __init__(self, text, labels, tokenizer, max_len=16):
        self.text = text
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.text[idx], truncation=True, padding="max_length",
            max_length=self.max_len, return_tensors="pt"
        )
        enc = {k: v.squeeze(0) for k,v in enc.items()}
        enc["labels"] = torch.tensor(self.labels[idx])
        return enc

    def __len__(self):
        return len(self.text)
train_dataset = BERTDataset(train_df.text.tolist(), train_df.label.tolist(), tokenizer)
val_dataset   = BERTDataset(val_df.text.tolist(), val_df.label.tolist(), tokenizer)
print(train_dataset[0])
print(val_dataset[0])

# Re-initialize the model with the correct number of labels (3 for 0, 1, 2)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

training_args = TrainingArguments(
    output_dir="./bin_cls",
    report_to="none",
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=2e-5,
    num_train_epochs=5, #2
    per_device_train_batch_size=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()
pred = trainer.predict(val_dataset)
y_pred = pred.predictions.argmax(1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
trainer.save_model("./bin_cls")
tokenizer.save_pretrained("./bin_cls")
model = BertForSequenceClassification.from_pretrained("./bin_cls")
tokenizer = BertTokenizer.from_pretrained("./bin_cls")
def predict_with_confidence(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

    probs = torch.softmax(logits, dim=1)

    pred_id = probs.argmax(dim=1).item()
    class_name = label_names[pred_id]
    confidence = probs[0][pred_id].item()

    return class_name, confidence

                    text  label
5   Terrible performance      0
2      Worst camera ever      0
4      Not good, not bad      1
3  Amazing sound quality      2
                 text  label
0  The phone is great      2
1  Battery is average      1
{'input_ids': tensor([ 101, 6659, 2836,  102,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'labels': tensor(0)}
{'input_ids': tensor([ 101, 1996, 3042, 2003, 2307,  102,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'labels': tensor(2)}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.






Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00         1
           2       1.00      1.00      1.00         1

    accuracy                           0.50         2
   macro avg       0.33      0.33      0.33         2
weighted avg       0.50      0.50      0.50         2





In [32]:
label_names = {0: "Negative", 1: "Neutral", 2: "Positive"}  # must match the training labels: 0=Neg, 1=Neutral, 2=Pos
text = "They need to talk"
label, conf = predict_with_confidence(text)
print(f"Text: {text}")
print(f"Prediction: {label}")
print(f"Confidence: {conf*100:.2f}%")


Text: They need to talk
Prediction: Positive
Confidence: 37.56%


# Part 03: Named Entity Recognition

In [None]:
sentences = [
    "John lives in London",
    "Sara works at Google"
]
ner_tags = [
    ["B-PER", "O", "O", "B-LOC"],
    ["B-PER", "O", "O", "B-ORG"]
]

In [None]:
labels = {"O":0, "B-PER":1, "B-LOC":2, "B-ORG":3}

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [None]:
# Split each sentence into words so that word_ids() lines up with the word-level tags.
encoded = tokenizer(
    [s.split() for s in sentences],
    is_split_into_words=True,
    padding=True,
    truncation=True
)

encoded_labels = []
for i in range(len(sentences)):
    word_ids = encoded.word_ids(batch_index=i)
    tag_ids = []
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP], [PAD]) are ignored by the loss.
            tag_ids.append(-100)
        else:
            # Every sub-token inherits the tag of the word it came from.
            tag_ids.append(labels[ner_tags[i][word_id]])
    encoded_labels.append(tag_ids)
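
The label alignment in the cell above can be tried without a tokenizer. In this sketch the `word_ids` list is hypothetical (it pretends that "London" was split into two sub-tokens), and `align_labels` is an illustrative helper, not a Transformers function.

```python
def align_labels(word_ids, word_tags, label2id, ignore_index=-100):
    """Map word-level tags onto tokens; special and padding tokens get ignore_index."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:     # [CLS], [SEP] or [PAD]
            aligned.append(ignore_index)
        else:                   # every sub-token inherits its word's tag
            aligned.append(label2id[word_tags[word_id]])
    return aligned

label2id = {"O": 0, "B-PER": 1, "B-LOC": 2, "B-ORG": 3}
# Hypothetical word_ids for: [CLS] john lives in lon ##don [SEP] [PAD]
word_ids = [None, 0, 1, 2, 3, 3, None, None]
tags = ["B-PER", "O", "O", "B-LOC"]
print(align_labels(word_ids, tags, label2id))
```

The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it do not contribute to the training loss.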

In [None]:
class NERDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k,v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Part 04: Sentence Similarity

In [None]:
sent1 = ["The sky is blue", "Dogs are running"]
sent2 = ["The sky is very blue", "A group of dogs run"]
scores = [4.5, 4.0]

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encodings = tokenizer(
    sent1, sent2,
    padding=True,
    truncation=True,
    return_tensors="pt"
)
labels = torch.tensor(scores)