**Transformer-from-Scratch: News Category Classifier**

1. Ingests & cleans the News Category Dataset (HuffPost headlines and short descriptions), tokenizes text, and builds a bespoke vocabulary.  
2. Implements sinusoidal position embeddings and a stacked multi-head self-attention encoder from first principles.  
3. Adds a lightweight classification head and trains end-to-end to predict one of 42 news categories.  
4. Validates accuracy, printing tensor shapes at every stage for sanity checks.

---

### Supporting technical theory

#### 1 · Tokenization & Embeddings  
Each token ID selects a row of the embedding table **E** ∈ ℝ^{|V|×d}. Fixed sinusoidal encodings **P** supply sequence order, so for a sequence of *n* tokens the encoder input is **X** = **E**[tokens] + **P** ∈ ℝ^{n×d}.

#### 2 · Multi-Head Self-Attention  
For each head *h*:  
\[
\mathrm{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]  
Parallel heads let the model focus on multiple relational patterns; their outputs are concatenated and linearly projected.

#### 3 · Residual Add + LayerNorm  
Every sub-layer is wrapped as \( \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x)) \), stabilizing statistics and preserving gradient flow in deep stacks.
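
As a rough illustration of the wrapper above, the NumPy sketch below applies `LayerNorm(x + Sublayer(x))` to a random tensor. The names `layer_norm` and `toy_sublayer` are hypothetical, and the learnable scale and shift of a real LayerNorm are left out for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    # The learnable scale and shift of a real LayerNorm are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward block.
    return 0.5 * x + 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))          # (batch, seq_len, d_model)
out = layer_norm(x + toy_sublayer(x))   # LayerNorm(x + Sublayer(x))
print(out.shape)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))
```

The residual term `x` gives gradients a direct path around the sub-layer, while the normalization keeps each token vector on a comparable scale from layer to layer.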

#### 4 · Position-Wise Feed-Forward  
A two-layer MLP (ReLU/GELU) is applied identically to each time-step, enriching token-level representations and boosting model capacity.

> *Symbols:* \( |V| \) – vocab size, \( d \) – hidden size, \( d_k \) – per-head key/query dimension.


In [1]:
#Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

## Load the Dataset

In [2]:
import json

# Read the file
with open('/kaggle/input/news-category-dataset/News_Category_Dataset_v3.json') as f:
    data = [json.loads(line) for line in f]

# Convert to a DataFrame for easier inspection
df = pd.DataFrame(data)

# View the first five rows
df.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


In [3]:
# Total number of rows
len(df)

209527

In [4]:
# Keep only headline, short_description, and category (copy to avoid SettingWithCopyWarning later)
df = df[['headline', 'short_description', 'category']].copy()

In [5]:
# Combine headline and short_description into a single column
df['news'] = df['headline'] + ' - ' + df['short_description']

In [6]:
#Remove the headline and short_description
df = df.drop(columns = ['headline', 'short_description'])
df.head()

Unnamed: 0,category,news
0,U.S. NEWS,Over 4 Million Americans Roll Up Sleeves For O...
1,U.S. NEWS,"American Airlines Flyer Charged, Banned For Li..."
2,COMEDY,23 Of The Funniest Tweets About Cats And Dogs ...
3,PARENTING,The Funniest Tweets From Parents This Week (Se...
4,U.S. NEWS,Woman Who Called Cops On Black Bird-Watcher Lo...


In [7]:
print(df['news'][0])
print('-' * 110)
print(df['news'][10])
print('-' * 110)
print(df['news'][20])

Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters - Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.
--------------------------------------------------------------------------------------------------------------
World Cup Captains Want To Wear Rainbow Armbands In Qatar - FIFA has come under pressure from several European soccer federations who want to support a human rights campaign against discrimination at the World Cup.
--------------------------------------------------------------------------------------------------------------
Golden Globes Returning To NBC In January After Year Off-Air - For the past 18 months, Hollywood has effectively boycotted the Globes after reports that the HFPA’s 87 members of non-American journalists included no Black members.


## Cleaning the Text

In [8]:
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs (the dot in "www." is escaped so it only matches a literal dot)
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove punctuation and special characters (keep letters, digits, and whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

df['news'] = df['news'].apply(clean_text)

print(df['news'][0])
print('-' * 110)
print(df['news'][10])
print('-' * 110)
print(df['news'][20])

over 4 million americans roll up sleeves for omicrontargeted covid boosters health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the us ordered for the fall
--------------------------------------------------------------------------------------------------------------
world cup captains want to wear rainbow armbands in qatar fifa has come under pressure from several european soccer federations who want to support a human rights campaign against discrimination at the world cup
--------------------------------------------------------------------------------------------------------------
golden globes returning to nbc in january after year offair for the past 18 months hollywood has effectively boycotted the globes after reports that the hfpas 87 members of nonamerican journalists included no black members
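
The cleaned output above shows a side effect worth noticing: because punctuation is deleted rather than replaced, hyphenated words are fused ("omicrontargeted", "offair"). The sketch below reproduces that behaviour on a short sample and compares it with a hypothetical variant, `clean_text_keep_boundaries`, that replaces punctuation with a space. The variant is an illustration of the trade-off only, not something the notebook uses.

```python
import re

def clean_text(text):
    # Mirrors the cleaning steps of the cell above; the dot in "www\." is escaped here.
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_keep_boundaries(text):
    # Hypothetical variant: punctuation becomes a space instead of vanishing.
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Omicron-Targeted COVID Boosters - the U.S. ordered"
print(clean_text(sample))
print(clean_text_keep_boundaries(sample))
```

Neither choice is free: deleting punctuation fuses compounds into rare tokens, while replacing it with spaces splits abbreviations such as "U.S." into single letters.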


## Build the Vocab

In [9]:
#Building the Vocab
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

def tokenize(text):
    return word_tokenize(text.lower())

from collections import Counter

# Build the vocabulary
def build_vocab(texts, min_freq = 1):
    """
    Build vocabulary from a list of texts.

    Args:
        texts (list of str): List of text samples.
        min_freq (int): Minimum frequency for a word to be included.

    Returns:
        word2idx (dict): Mapping from word to unique integer index.
    """
    # Count token frequencies
    counter = Counter()
    for text in texts:
        tokens = word_tokenize(text.lower())
        counter.update(tokens)
    
    # Filter tokens by frequency threshold
    vocab_tokens = [token for token, freq in counter.items() if freq >= min_freq]
    
    # Add special tokens
    special_tokens = ['<PAD>', '<UNK>']
    
    # Final vocabulary: special tokens + sorted frequent tokens
    vocab = special_tokens + sorted(vocab_tokens)
    
    # Create token to index mapping
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    
    return word2idx

# Encode a single text to token IDs with padding/truncation
def encode_text(text, word2idx, max_len=32):
    tokens = word_tokenize(text.lower())
    ids = [word2idx.get(token, word2idx['<UNK>']) for token in tokens]
    if len(ids) > max_len:
        ids = ids[:max_len]
    else:
        ids += [word2idx['<PAD>']] * (max_len - len(ids))
    return ids

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
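
To make the vocabulary and encoding steps concrete, here is a small self-contained sketch on two toy sentences. It assumes plain whitespace splitting instead of `word_tokenize` so that it runs without NLTK data; `build_vocab_ws` and `encode_ws` are hypothetical stand-ins for `build_vocab` and `encode_text`.

```python
from collections import Counter

def build_vocab_ws(texts, min_freq=1):
    # Whitespace-tokenized stand-in for build_vocab (hypothetical helper).
    counter = Counter(tok for text in texts for tok in text.lower().split())
    kept = sorted(tok for tok, freq in counter.items() if freq >= min_freq)
    return {word: idx for idx, word in enumerate(['<PAD>', '<UNK>'] + kept)}

def encode_ws(text, word2idx, max_len=6):
    # Map tokens to IDs, then truncate or right-pad to exactly max_len.
    ids = [word2idx.get(tok, word2idx['<UNK>']) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [word2idx['<PAD>']] * (max_len - len(ids))

toy_texts = ["the cat sat", "the dog sat down"]
toy_vocab = build_vocab_ws(toy_texts)
print(toy_vocab)
print(encode_ws("the bird sat", toy_vocab))   # "bird" is unseen, so it maps to <UNK>
```

Because the special tokens are placed first, `<PAD>` is always index 0 and `<UNK>` is always index 1, whatever the corpus contains.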


## Making the Dataset

In [10]:
from torch.utils.data import Dataset,DataLoader

import torch

class NewsDataset(Dataset):
    """
    PyTorch Dataset class for news text classification.

    Args:
        texts (List[str]): List of input text samples (e.g., headlines or descriptions).
        labels (List[int]): Corresponding list of integer class labels.
        word2idx (dict): Vocabulary mapping from tokens to integer indices.
        max_len (int): Maximum sequence length. Texts will be padded or truncated to this length.
    """
    def __init__(self, texts, labels, word2idx, max_len=32):
        self.texts = texts              # Raw input texts
        self.labels = labels            # Integer class labels
        self.word2idx = word2idx        # Token-to-index vocabulary
        self.max_len = max_len          # Max token length per sample

    def __len__(self):
        # Return the total number of samples
        return len(self.texts)

    def __getitem__(self, idx):
        # Retrieve and encode a single example by index
        encoded_text = encode_text(self.texts[idx], self.word2idx, self.max_len)
        label = self.labels[idx]
        
        # Return tensors for model consumption
        return torch.tensor(encoded_text, dtype=torch.long), torch.tensor(label, dtype=torch.long)


In [11]:
# Check for null values
df.isna().sum()

category    0
news        0
dtype: int64

In [12]:
# Number of samples in each category
df['category'].value_counts()

category
POLITICS          35602
WELLNESS          17945
ENTERTAINMENT     17362
TRAVEL             9900
STYLE & BEAUTY     9814
PARENTING          8791
HEALTHY LIVING     6694
QUEER VOICES       6347
FOOD & DRINK       6340
BUSINESS           5992
COMEDY             5400
SPORTS             5077
BLACK VOICES       4583
HOME & LIVING      4320
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3653
WOMEN              3572
CRIME              3562
IMPACT             3484
DIVORCE            3426
WORLD NEWS         3299
MEDIA              2944
WEIRD NEWS         2777
GREEN              2622
WORLDPOST          2579
RELIGION           2577
STYLE              2254
SCIENCE            2206
TECH               2104
TASTE              2096
MONEY              1756
ARTS               1509
ENVIRONMENT        1444
FIFTY              1401
GOOD NEWS          1398
U.S. NEWS          1377
ARTS & CULTURE     1339
COLLEGE            1144
LATINO VOICES      1130
CULTURE & ARTS     1074
EDUCATI

## Label Encoder

In [13]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label'] = le.fit_transform(df['category'])
print(le.classes_)  # array of category names in order
print(df['label'].value_counts())  # distribution of encoded labels

['ARTS' 'ARTS & CULTURE' 'BLACK VOICES' 'BUSINESS' 'COLLEGE' 'COMEDY'
 'CRIME' 'CULTURE & ARTS' 'DIVORCE' 'EDUCATION' 'ENTERTAINMENT'
 'ENVIRONMENT' 'FIFTY' 'FOOD & DRINK' 'GOOD NEWS' 'GREEN' 'HEALTHY LIVING'
 'HOME & LIVING' 'IMPACT' 'LATINO VOICES' 'MEDIA' 'MONEY' 'PARENTING'
 'PARENTS' 'POLITICS' 'QUEER VOICES' 'RELIGION' 'SCIENCE' 'SPORTS' 'STYLE'
 'STYLE & BEAUTY' 'TASTE' 'TECH' 'THE WORLDPOST' 'TRAVEL' 'U.S. NEWS'
 'WEDDINGS' 'WEIRD NEWS' 'WELLNESS' 'WOMEN' 'WORLD NEWS' 'WORLDPOST']
label
24    35602
38    17945
10    17362
34     9900
30     9814
22     8791
16     6694
25     6347
13     6340
3      5992
5      5400
28     5077
2      4583
17     4320
23     3955
33     3664
36     3653
39     3572
6      3562
18     3484
8      3426
40     3299
20     2944
37     2777
15     2622
41     2579
26     2577
29     2254
27     2206
32     2104
31     2096
21     1756
0      1509
11     1444
12     1401
14     1398
35     1377
1      1339
4      1144
19     1130
7      1074
9      1

In [14]:
import torch
import torch.nn as nn

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Using device: {device}')

Using device: cuda


In [15]:
# Calculate class counts
counts = df['label'].value_counts().sort_index().values  # sorted by label index
print(counts)

# Compute class weights inversely proportional to frequency
class_weights = 1.0 / counts
class_weights = class_weights / class_weights.sum() * len(class_weights)  # normalize

# Convert to torch tensor
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32).to(device)

# Use in loss
criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)

[ 1509  1339  4583  5992  1144  5400  3562  1074  3426  1014 17362  1444
  1401  6340  1398  2622  6694  4320  3484  1130  2944  1756  8791  3955
 35602  6347  2577  2206  5077  2254  9814  2096  2104  3664  9900  1377
  3653  2777 17945  3572  3299  2579]
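
The weighting scheme in the cell above is easier to follow on a tiny example. The sketch below uses three hypothetical class counts (`toy_counts`) and the same two steps: invert the counts, then rescale so the weights sum to the number of classes.

```python
import numpy as np

toy_counts = np.array([1000, 100, 10])            # hypothetical class counts
weights = 1.0 / toy_counts                        # rarer class -> larger weight
weights = weights / weights.sum() * len(weights)  # rescale so weights sum to n_classes
print(weights.round(4))
```

The ratio between two weights equals the inverse ratio of their counts, so here the rarest class weighs 100 times more than the most frequent one. With counts as skewed as in this dataset, that can make the loss very sensitive to the smallest categories.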


## Input Embeddings

Let's focus on the Encoder Part for now:

### Transformer Encoder

The encoder begins with an input embedding followed by a positional encoding, so let's start there.

**Embedding** -> Tokenization gives us a sequence of discrete tokens; the embedding layer's job is to turn those discrete tokens into continuous vector representations.

Why is this needed?
Because a Transformer is a neural network, and neural networks operate on numbers, not words. We therefore need a numerical representation of each token that can capture semantic meaning and context.

The transformer architecture starts by embedding the sequence as vectors and then encoding each token's position in the sequence, so that tokens can be processed in parallel.

Suppose we have three tokens:
["Cat", "Dog", "Fish"]

Each token has its own unique ID in the model's vocabulary. Passing these IDs through the embedding layer yields one embedding vector per token. The length of this vector is referred to as the number of dimensions, or dimensionality.
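
An embedding layer is essentially a lookup table with one row per token ID. The NumPy sketch below illustrates this with hypothetical IDs for the three tokens and a random table of width 4; in the model itself the table is learned.

```python
import numpy as np

toy_vocab = {"cat": 0, "dog": 1, "fish": 2}   # hypothetical token IDs
d_model = 4
rng = np.random.default_rng(0)
table = rng.normal(size=(len(toy_vocab), d_model))   # one row per token

# A batch with one sequence: cat, fish, cat
ids = np.array([[toy_vocab["cat"], toy_vocab["fish"], toy_vocab["cat"]]])  # (1, 3)
vectors = table[ids]                                 # row lookup -> (1, 3, d_model)
print(vectors.shape)
```

Both occurrences of "cat" receive exactly the same vector, which is why positional information has to be added separately.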

In [16]:
import torch
import torch.nn as nn
import math

class InputEmbeddings(nn.Module):
    """
    Converts token indices into dense vector embeddings and scales them.

    Args:
        vocab_size (int): Size of the vocabulary (number of unique tokens).
        d_model (int): Dimensionality of the embedding vectors (also the model's hidden size).
    """
    def __init__(self, vocab_size: int, d_model: int) -> None:
        super().__init__()

        self.d_model = d_model                  # Embedding dimension (same as model hidden size)
        self.vocab_size = vocab_size            # Total number of tokens in vocabulary
        self.embedding = nn.Embedding(vocab_size, d_model)  # Learnable embedding table

    def forward(self, x):
        """
        Args:
            x (Tensor): Tensor of token indices of shape (batch_size, seq_len)

        Returns:
            Tensor: Embedded and scaled tensor of shape (batch_size, seq_len, d_model)
        """
        # Scale by sqrt(d_model), as the Transformer paper does for its embedding layers
        return self.embedding(x) * math.sqrt(self.d_model)

**Positional Encoding**: It is added to give the model information about the position of each word in a sequence.

Why is this needed? Because self-attention by itself has no notion of order: the word "ate" in "The cat ate the fish" plays a different role from "ate" in "Ate the cat the fish?", so the order matters.

The positional encodings are generated with a fixed formula in which sine is used for the even embedding dimensions and cosine for the odd embedding dimensions.

The positional encoding for position `pos` and dimension-pair index `i` is defined as:

$$
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$

$$
\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$

Where:
- \( pos \) is the position in the sequence,
- \( i \) indexes the sine/cosine dimension pairs, \( i = 0, \dots, d/2 - 1 \),
- \( d \) is the total embedding dimension.

Sine and cosine are periodic functions whose values lie between -1 and 1.

Why do we use them?

**They provide a unique pattern for each position**
1. Combining sine and cosine at different frequencies gives each position its own encoding vector.
2. Over the sequence lengths used here, no two positions share the same encoding, and nearby positions have similar vectors, which helps the model recognize local context.

**They capture relative position information**
1. The sinusoidal form makes it easy for the model to learn the relative positions between words.
2. For example, for a fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos), allowing the model to infer order differences like “word A is 2 steps ahead of word B”.
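
The linearity claim can be checked numerically. For each frequency, shifting the position by k rotates the (sin, cos) pair by a fixed angle, so a single block-diagonal matrix that depends only on k maps PE(pos) to PE(pos + k). The sketch below is a NumPy check under that reading; `sinusoidal_pe` and `shift_matrix` are hypothetical helper names.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Even columns hold sin, odd columns hold cos, one frequency per column pair.
    pos = np.arange(max_len)[:, None]
    freqs = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe, freqs

def shift_matrix(k, freqs):
    # Block-diagonal matrix of 2x2 rotations; it depends on the offset k only.
    m = np.zeros((2 * len(freqs), 2 * len(freqs)))
    for j, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        m[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return m

pe, freqs = sinusoidal_pe(max_len=32, d_model=8)
M = shift_matrix(3, freqs)
print(np.allclose(pe[10], M @ pe[7]))   # PE(7 + 3) recovered from PE(7)
```

The same matrix `M` works for every starting position, which is what lets attention layers pick up on relative offsets.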

In [17]:
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    """
    Implements the sinusoidal positional encoding from the Transformer paper:
    "Attention is All You Need" (Vaswani et al. 2017).

    This adds information about token positions to the input embeddings, 
    enabling the model to capture order without recurrence.

    Args:
        d_model (int): Dimensionality of the model/embedding.
        max_seq_length (int): Maximum sequence length supported.
    """
    def __init__(self, d_model, max_seq_length):
        super().__init__()

        # Initialize a matrix of shape (max_seq_length, d_model)
        pe = torch.zeros(max_seq_length, d_model)

        # Position indices (0 to max_seq_length-1) shaped as (max_seq_length, 1)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)

        # Compute the div_term (frequency) for the sinusoidal functions.
        # Only half (every 2nd dim) because sin and cos alternate over even and odd dims
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Apply sine to even indices in the array; 2i
        pe[:, 0::2] = torch.sin(position * div_term)

        # Apply cosine to odd indices in the array; 2i+1
        pe[:, 1::2] = torch.cos(position * div_term)

        # Register as a buffer (non-learnable), adds a batch dimension for broadcasting
        self.register_buffer('pe', pe.unsqueeze(0))  # shape: (1, max_seq_length, d_model)

    def forward(self, x):
        """
        Adds positional encoding to the input tensor.

        Args:
            x (Tensor): Input of shape (batch_size, seq_len, d_model)

        Returns:
            Tensor: Positionally encoded input of the same shape
        """
        # Add positional encoding to the input
        return x + self.pe[:, :x.size(1)]

## Multi-Head Attention

Before that, what is self-attention?

Self-attention is what enables transformers to identify relationships between tokens and to focus on the most relevant ones. It allows the model to look at other positions in the same input sequence when encoding a word, hence the name "self" attention.

Self-attention determines:
“Which other words in the sentence should I pay attention to when understanding this word?”

Example:
Take the sentence:

“The cat sat on the mat because it was warm.”

To understand what “it” refers to, self-attention helps the model focus on “cat” or “mat” rather than every word equally. The model figures this out on its own during training.

Each input word is first converted into an embedding. Each embedding is then projected, using separate linear transformations with learned weights, into three vectors known as Q, K and V:

Q : Query (indicates what each token is looking for in other tokens)

K : Key (represents the content of each token that other tokens might find relevant)

V : Value (the actual content to be aggregated or weighted)

🧠 Analogy: Job Search Example
Imagine you're trying to hire someone:

    - Your Query (Q) is the job requirement.

    - Each candidate has a Key (K) = their resume.

    - The actual Value (V) is what you’d get if you hired them.

You compare your Query to all the Keys (resumes) to get scores, then use those scores to weigh the Values (candidates’ actual skills).

The attention scores are computed as the dot product of the Query and Key matrices. So, attention scores = Q-K similarity (dot product), which gives an n × n matrix of scores for a sequence of n tokens.

Applying softmax to the attention scores gives the attention weights, which are then used to weight the Values.

so, below is the clear image to show


Example:
In the sentence "orange is my favorite fruit," the tokens "favorite" and "fruit" would typically receive the highest attention when processing "orange," as they directly influence its context and meaning. The model interprets "orange" as a favored fruit rather than a color or other meaning.

### ⚙️ Step-by-step:

#### 1. Compute Dot Products:  $Q \cdot K^T$

This gives a score of how much attention word A should pay to word B.

#### 2. Scale and Apply Softmax:

$$
\text{Attention\_weights} = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)
$$

This normalizes the scores into probabilities.

#### 3. Multiply with V:

$$
\text{Output} = \text{Attention\_weights} \cdot V
$$

Each word’s final output is a **weighted sum of all the Value vectors**, based on attention.
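
The three steps above can be traced on random toy matrices. This is a minimal single-head NumPy sketch with hypothetical sizes (4 tokens, key/query dimension 8); the notebook's own implementation uses PyTorch and several heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 4, 8                        # 4 tokens, key/query size 8 (toy values)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)      # steps 1 and 2a: scaled dot products, (n, n)
weights = softmax(scores)            # step 2b: each row becomes a distribution
output = weights @ V                 # step 3: weighted sum of value vectors

print(scores.shape, output.shape)
print(weights.sum(axis=-1))
```

Row i of `weights` says how much token i attends to every token in the sequence, itself included.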


### Multi-Head Attention

Multi-Head Attention is an advanced form of self-attention used in Transformers. Instead of calculating just one set of attention outputs (with one Q/K/V), it creates multiple "attention heads" — each learning different relationships or features in the input.

⚙️ Why do we need Multi-Head Attention?

A single self-attention layer may focus too narrowly. With multi-head attention:

- Each head looks at the sequence from a different perspective.

- Some heads may learn syntax (e.g., subject-verb links), others learn semantics (e.g., coreference, word meaning).

- This makes the model much more expressive.

The resulting embeddings capture token meaning, positional encoding, and contextual relationships.
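
Splitting into heads is only a reshape: the `d_model` features of each token are cut into `num_heads` chunks of size `head_dim`, and the head axis is moved in front of the sequence axis. The NumPy sketch below, with hypothetical toy sizes, shows the split and confirms that merging restores the original tensor.

```python
import numpy as np

batch, seq_len, d_model, num_heads = 2, 5, 8, 2
head_dim = d_model // num_heads

x = np.arange(batch * seq_len * d_model, dtype=float).reshape(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, head_dim)
heads = x.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)

# Merge: undo the transpose, then flatten the last two axes back to d_model
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)

print(heads.shape)
print(np.array_equal(merged, x))
```

Head 0 sees the first `head_dim` features of every token and head 1 the next `head_dim`, so the heads work on disjoint slices of the projected vectors.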

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """
    Multi-head self-attention mechanism as described in the "Attention is All You Need" paper.

    Args:
        d_model (int): Total dimensionality of the model.
        num_heads (int): Number of parallel attention heads.
    """
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads  # Dimension per head

        # Linear transformations for query, key, and value (no bias for attention projection)
        self.query_linear = nn.Linear(d_model, d_model, bias=False)
        self.key_linear = nn.Linear(d_model, d_model, bias=False)
        self.value_linear = nn.Linear(d_model, d_model, bias=False)

        # Final linear layer after concatenating all heads
        self.output_linear = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """
        Split the embedding into multiple heads.

        Args:
            x (Tensor): shape (batch_size, seq_len, d_model)

        Returns:
            Tensor: shape (batch_size, num_heads, seq_len, head_dim)
        """
        seq_length = x.size(1)
        x = x.reshape(batch_size, seq_length, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3)  # move num_heads before seq_len

    def compute_attention(self, query, key, value, mask=None):
        """
        Compute scaled dot-product attention.

        Returns:
            context vector after attention, shape: (batch_size, num_heads, seq_len, head_dim)
        """
        # Shape: (batch_size, num_heads, seq_len, seq_len)
        scores = torch.matmul(query, key.transpose(-2, -1)) / (self.head_dim ** 0.5)

        # Apply mask (if provided): it must be broadcastable to the scores shape; positions where mask == 0 are ignored
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = F.softmax(scores, dim=-1)  # softmax along last dimension
        return torch.matmul(attention_weights, value)  # context

    def combine_heads(self, x, batch_size):
        """
        Combine the heads back to a single tensor.

        Args:
            x (Tensor): shape (batch_size, num_heads, seq_len, head_dim)

        Returns:
            Tensor: shape (batch_size, seq_len, d_model)
        """
        x = x.permute(0, 2, 1, 3).contiguous()  # (batch_size, seq_len, num_heads, head_dim)
        return x.view(batch_size, -1, self.d_model)  # combine last two dims

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Apply linear transformations
        query = self.query_linear(query)
        key = self.key_linear(key)
        value = self.value_linear(value)

        # Split into heads
        query = self.split_heads(query, batch_size)
        key = self.split_heads(key, batch_size)
        value = self.split_heads(value, batch_size)

        # Apply attention on all heads
        attn_output = self.compute_attention(query, key, value, mask)

        # Combine heads and pass through final linear layer
        output = self.combine_heads(attn_output, batch_size)

        return self.output_linear(output)

## FeedForward SubLayer

After the Multi-Head Attention layer in a Transformer block, there's a FeedForward Neural Network (FFN) layer, also known as the FeedForward SubLayer. It adds non-linearity and transformation to each token independently.

📌 Why It’s Used

While attention layers let tokens communicate, the FFN lets each token transform itself — enriching its internal representation after it has “heard” from others.

Our FeedForwardSublayer class contains two fully connected linear layers separated by a ReLU activation. 

Notice we use a dimension d_ff between the linear layers, typically larger than the embedding dimension used throughout the model, to give the layer more capacity for capturing complex patterns. The forward method takes the attention mechanism outputs and passes them through the two layers.
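
"Independently" here means the same weights are applied to every position and no information moves between tokens. The NumPy sketch below checks this with hypothetical toy sizes and random weights: running the whole sequence at once gives the same result as running one token at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 4, 16, 3           # toy sizes
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))
whole = ffn(x)                                  # all positions at once
per_token = np.stack([ffn(tok) for tok in x])   # one position at a time
print(np.allclose(whole, per_token))
```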


In [19]:
import torch.nn as nn

class FeedForwardSubLayer(nn.Module):
    """
    Position-wise Feed-Forward Network used in Transformer blocks.

    Args:
        d_model (int): Input and output dimensionality (same as the embedding size).
        d_ff (int): Hidden dimensionality (usually larger, e.g., 2048 in original paper).

    Architecture:
        FFN(x) = max(0, xW1 + b1)W2 + b2
               = fc2(ReLU(fc1(x)))
    """
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # First linear transformation (expands dimension)
        self.relu = nn.ReLU()                 # Activation function
        self.fc2 = nn.Linear(d_ff, d_model)   # Second linear transformation (projects back to d_model)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))  # Apply FFN to each position independently


In [20]:
class EncoderLayer(nn.Module):
    """
    A single Transformer encoder block.

    Consists of:
    1. Multi-head self-attention layer with residual connection + LayerNorm
    2. Position-wise feed-forward network with residual connection + LayerNorm

    Args:
        d_model (int): Input/output embedding dimension.
        num_heads (int): Number of attention heads.
        d_ff (int): Hidden layer size in the feed-forward network.
        dropout (float): Dropout rate applied after attention and FFN.
    """
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()

        # Multi-head self-attention
        self.attn = MultiHeadAttention(d_model, num_heads)

        # Position-wise feed-forward network
        self.ff_sublayer = FeedForwardSubLayer(d_model, d_ff)

        # Layer normalizations for residual connections
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
            src_mask: Optional mask broadcastable to (batch_size, num_heads, seq_len, seq_len),
                      e.g. (batch_size, 1, 1, seq_len) for a padding mask

        Returns:
            Tensor of shape (batch_size, seq_len, d_model)
        """

        # === Sublayer 1: Multi-Head Self-Attention ===
        attn_output = self.attn(x, x, x, src_mask)  # Q = K = V = x
        x = self.norm1(x + self.dropout(attn_output))  # Add & Norm

        # === Sublayer 2: Feed-Forward ===
        ff_output = self.ff_sublayer(x)
        x = self.norm2(x + self.dropout(ff_output))  # Add & Norm

        return x

## Encoder Layer
A Transformer encoder layer is a single block in the stack of encoder blocks used in models like BERT and the original Transformer (GPT uses a decoder-only variant of the same block). Each layer processes a sequence of tokens to build richer, context-aware representations.

Encoder-only transformers simplify the original encoder-decoder architecture to place greater emphasis on understanding and representing the input data, which suits tasks such as text classification.

Each encoder layer has two main components:
- a multi-head self-attention mechanism that captures relationships between tokens in the sequence,

- followed by a feed-forward sublayer that maps this knowledge into abstract, nonlinear representations. Both are usually combined with other techniques like layer normalization and dropout to improve training.
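
The encoder layer accepts an optional `src_mask`. One way such a mask could be built for padded batches is sketched below in NumPy, assuming `<PAD>` has index 0 and that the mask is shaped `(batch, 1, 1, seq_len)` so it broadcasts over heads and query positions. `np.where` plays the role that `masked_fill(mask == 0, -inf)` plays in the attention code; the scores are zero-filled stand-ins.

```python
import numpy as np

PAD_ID = 0   # assumes '<PAD>' is index 0, as in build_vocab
ids = np.array([[5, 7, 9, 0, 0],
                [4, 0, 0, 0, 0]])                  # (batch=2, seq_len=5)

# True where a real token sits, False at padding
mask = (ids != PAD_ID)[:, None, None, :]            # (batch, 1, 1, seq_len)

num_heads = 2
scores = np.zeros((2, num_heads, 5, 5))             # stand-in attention scores
masked = np.where(mask, scores, -np.inf)            # padded keys get -inf

# Softmax over the key axis: padded keys end up with zero weight
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(mask.shape)
print(weights[0, 0, 0])
```

This relies on every sequence containing at least one real token; a row made up entirely of padding would produce NaNs after the softmax.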

In [21]:
class TransformerEncoder(nn.Module):
    """
    Full Transformer Encoder stack composed of:
    - Input token embeddings
    - Positional encodings
    - N stacked encoder layers

    Args:
        vocab_size (int): Size of the input vocabulary.
        d_model (int): Embedding dimension.
        num_layers (int): Number of encoder layers to stack.
        num_heads (int): Number of attention heads in each layer.
        d_ff (int): Hidden layer size in feed-forward network.
        dropout (float): Dropout rate for regularization.
        max_seq_length (int): Maximum input sequence length.
    """
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_seq_length):
        super().__init__()

        # Token embedding + positional encoding
        self.embedding = InputEmbeddings(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        # Stack of N encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x, src_mask=None):
        """
        Args:
            x (Tensor): Input token IDs, shape (batch_size, seq_len)
            src_mask (Tensor or None): Attention mask broadcastable to
                (batch_size, num_heads, seq_len, seq_len), e.g. (batch_size, 1, 1, seq_len)

        Returns:
            Tensor: Encoded representation, shape (batch_size, seq_len, d_model)
        """
        # Embed token IDs and add positional information
        x = self.embedding(x)                          # (batch_size, seq_len, d_model)
        x = self.positional_encoding(x)                # (batch_size, seq_len, d_model)

        # Pass through each encoder layer
        for layer in self.layers:
            x = layer(x, src_mask)

        return x

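We can now create a classification head, suitable for tasks like text classification and sentiment analysis. 
It consists of a linear layer that maps the encoder hidden states to class logits. Softmax turns those logits into class probabilities; during training `nn.CrossEntropyLoss` applies it internally, so the head itself returns raw logits. 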

## Classification Head

- **Tasks**: Text Classification, Sentiment Analysis, Named Entity Recognition (NER), Extractive QA and more
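
A classification head of this kind outputs raw logits, and `nn.CrossEntropyLoss` expects exactly that, because it applies log-softmax internally. The NumPy sketch below spells out the relationship on one hypothetical example with three classes.

```python
import numpy as np

logits = np.array([[2.0, 0.5, -1.0]])        # hypothetical scores for 3 classes
target = np.array([0])                       # index of the true class

shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log-softmax
probs = np.exp(log_probs)                    # softmax probabilities

loss = -log_probs[np.arange(len(target)), target].mean()   # cross-entropy from logits
print(probs.round(3), probs.argmax(axis=-1))
print(round(float(loss), 4))
```

Softmax preserves the ordering of the logits, so at prediction time `argmax` can be taken on the logits directly.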

In [22]:
class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        """
        Classification head for the Transformer model.

        Args:
            d_model (int): The dimensionality of the input features (usually
                           the hidden size of the Transformer encoder).
            num_classes (int): Number of output classes for classification.

        Components:
            - A single fully connected (linear) layer that maps the Transformer
              output feature vector to class logits.
        """
        super().__init__()
        self.fc = nn.Linear(d_model, num_classes)  # Linear layer from d_model to num_classes

    def forward(self, x):
        """
        Forward pass.

        Args:
            x (Tensor): Input tensor of shape (batch_size, d_model) or
                        possibly (batch_size, seq_len, d_model) if sequence output
                        is passed directly.

        Returns:
            Tensor: Raw logits of shape (batch_size, num_classes).

        Note:
            - If input is (batch_size, seq_len, d_model), you typically want to
              pool or select a token representation (e.g., mean pooling or first token)
              before passing here.
            - This layer returns raw logits (no activation), which is compatible with
              loss functions like nn.CrossEntropyLoss.
        """
        return self.fc(x)  # Output logits for each class
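
The docstring above notes that the head returns raw logits because `nn.CrossEntropyLoss` expects them. The sketch below, on random toy logits rather than the notebook's model, illustrates why: the loss is equivalent to log-softmax followed by negative log-likelihood, so softmax is only needed when the outputs should be read as probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy batch: 4 examples, 42 classes (the class count used later in the notebook)
logits = torch.randn(4, 42)
targets = torch.tensor([3, 15, 0, 41])

# CrossEntropyLoss on raw logits ...
loss_builtin = nn.CrossEntropyLoss()(logits, targets)
# ... matches log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Softmax turns logits into probabilities without changing the argmax
probs = torch.softmax(logits, dim=1)

print(loss_builtin.item(), loss_manual.item())
print(probs.sum(dim=1))
```

Applying softmax inside the head and then passing the result to `nn.CrossEntropyLoss` would normalize twice, which is why the head stays activation-free.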


In [23]:
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_seq_length, num_classes):
        super().__init__()
        # Transformer encoder to get contextual embeddings
        self.encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_seq_length)
        # Simple classification head
        self.classifier = ClassifierHead(d_model, num_classes)

    def forward(self, x, src_mask=None):
        # Encode input tokens: output shape (batch_size, seq_len, d_model)
        encoder_output = self.encoder(x, src_mask)

        # Take the first token's representation as the sequence summary
        # (it acts as a [CLS]-style vector only if the tokenizer prepends such a token)
        cls_token_embedding = encoder_output[:, 0, :]  # shape (batch_size, d_model)

        # Classify using the first-token representation
        logits = self.classifier(cls_token_embedding)  # shape (batch_size, num_classes)

        return logits
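
The `ClassifierHead` docstring mentions mean pooling as an alternative to selecting the first token. A minimal sketch of masked mean pooling follows; the helper name `masked_mean_pool` and `pad_id=0` are assumptions for illustration and are not part of the notebook.

```python
import torch

def masked_mean_pool(encoder_output, token_ids, pad_id=0):
    """Average hidden states over real tokens only (hypothetical helper)."""
    mask = (token_ids != pad_id).unsqueeze(-1).to(encoder_output.dtype)  # (B, L, 1)
    summed = (encoder_output * mask).sum(dim=1)                          # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)                              # (B, 1), avoids divide-by-zero
    return summed / counts

token_ids = torch.tensor([[5, 7, 0, 0], [3, 4, 6, 2]])
encoder_output = torch.ones(2, 4, 8)
encoder_output[0, 2:] = 100.0  # values at padded positions must not leak into the mean

pooled = masked_mean_pool(encoder_output, token_ids)
print(pooled.shape)
```

The pooled tensor has shape `(batch_size, d_model)`, so it could be passed to the classification head in place of `encoder_output[:, 0, :]`.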


In [24]:
# Assumes df already has 'news' (text) and 'label' (encoded category) columns
texts = df['news'].tolist()
labels = df['label'].tolist()

# Input length and model dimensions
max_seq_length = 32
d_model = 512
num_heads = 8

# Build the vocabulary; min_freq=1 keeps every token (raise it to shrink the vocab)
word2idx = build_vocab(texts, min_freq=1)
vocab_size = len(word2idx)

# Select the device first so every layer below can be moved to it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

embedding_layer = InputEmbeddings(vocab_size=vocab_size, d_model=d_model).to(device)
positional_encoding_layer = PositionalEncoding(d_model=d_model, max_seq_length=max_seq_length).to(device)
mha_layer = MultiHeadAttention(d_model, num_heads).to(device)

## Check the Embedding, Positional Encoding, and MHA Shapes

In [25]:
dataset = NewsDataset(texts, labels, word2idx, max_seq_length)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

# Take one batch from your real dataset
for batch_inputs, batch_labels in dataloader:
    batch_inputs = batch_inputs.to(device)

    # Step 1: Embed input token IDs
    embedding_output = embedding_layer(batch_inputs)  # (B, L, D)

    # Step 2: Add positional encoding
    encoding_output = positional_encoding_layer(embedding_output)  # (B, L, D)

    # Step 3: Apply Multi-Head Attention (Self-attention)
    output = mha_layer(encoding_output, encoding_output, encoding_output)  # (B, L, D)

    print("Input shape:", batch_inputs.shape)
    print("Embedding shape:", embedding_output.shape)
    print("After Positional Encoding:", encoding_output.shape)
    print("MHA Output shape:", output.shape)

    break

Input shape: torch.Size([64, 32])
Embedding shape: torch.Size([64, 32, 512])
After Positional Encoding: torch.Size([64, 32, 512])
MHA Output shape: torch.Size([64, 32, 512])
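
The shapes printed above stay at `(64, 32, 512)` because multi-head attention splits `d_model` into heads internally and merges them again. The sketch below traces those intermediate shapes for the same sizes. It omits the learned Q/K/V and output projections, and the notebook's own `MultiHeadAttention` may organize the computation differently.

```python
import torch

batch_size, seq_len, d_model, num_heads = 64, 32, 512, 8
d_k = d_model // num_heads  # per-head dimension: 512 / 8 = 64

x = torch.randn(batch_size, seq_len, d_model)  # stand-in for Q, K and V

# Split the model dimension into heads: (B, L, D) -> (B, H, L, d_k)
heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

# Scaled dot-product attention, computed independently per head
scores = heads @ heads.transpose(-2, -1) / d_k ** 0.5   # (B, H, L, L)
weights = torch.softmax(scores, dim=-1)                  # rows sum to 1
context = weights @ heads                                # (B, H, L, d_k)

# Merge the heads back: (B, H, L, d_k) -> (B, L, D)
merged = context.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(heads.shape, scores.shape, merged.shape)
```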


## Split the dataset into Train and Validation Sets

In [26]:
from torch.utils.data import random_split

dataset = NewsDataset(texts, labels, word2idx, max_len=max_seq_length)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=False)
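
`random_split` draws a fresh permutation on every run, so the train/validation membership changes between runs unless a seed is fixed. A minimal sketch of a reproducible split, using a toy list in place of `NewsDataset` and an arbitrary seed of 42 (both assumptions for illustration):

```python
import torch
from torch.utils.data import random_split

toy_dataset = list(range(100))  # stand-in for the real dataset
train_size = int(0.8 * len(toy_dataset))
val_size = len(toy_dataset) - train_size

def split(seed):
    # A dedicated generator makes the permutation repeatable
    generator = torch.Generator().manual_seed(seed)
    return random_split(toy_dataset, [train_size, val_size], generator=generator)

train_a, val_a = split(42)
train_b, val_b = split(42)

print(len(train_a), len(val_a))
print(val_a.indices == val_b.indices)
```

Fixing the split this way keeps validation metrics comparable across runs with different hyperparameters.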

## Define the Model, Loss, and Optimizer

In [27]:
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_classes = 42

model = TransformerClassifier(
    vocab_size=len(word2idx),
    d_model=512,
    num_layers=2,
    num_heads=8,
    d_ff=2048,
    dropout=0.1,
    max_seq_length=32,
    num_classes=num_classes  # number of categories
).to(device)

criterion = nn.CrossEntropyLoss(weight=class_weights_tensor.to(device))
optimizer = optim.Adam(model.parameters(), lr=1e-4)
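
The loss above is weighted by `class_weights_tensor`, which is defined earlier in the notebook. For reference, one common way to derive such weights is inverse class frequency; the sketch below is an assumption about how weights of this kind can be computed, not necessarily how the notebook computes them, and `balanced_class_weights` is a hypothetical helper.

```python
import numpy as np

def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    counts = np.maximum(counts, 1.0)  # guard against classes with no examples
    return len(labels) / (num_classes * counts)

# Toy labels: class 0 is frequent, class 2 is rare
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
weights = balanced_class_weights(toy_labels, num_classes=3)
print(weights)
```

Rare classes receive larger weights, so their errors contribute more to the loss. The resulting array could be converted with `torch.tensor(weights, dtype=torch.float)` before being passed to `nn.CrossEntropyLoss(weight=...)`.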

## Evaluate the Model

In [28]:
def evaluate(model, dataloader, criterion, device):
    model.eval()  # set eval mode
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():  # no gradients needed
        for batch_inputs, batch_labels in dataloader:
            batch_inputs = batch_inputs.to(device)
            batch_labels = batch_labels.to(device)

            outputs = model(batch_inputs)  # forward pass
            loss = criterion(outputs, batch_labels)

            running_loss += loss.item() * batch_inputs.size(0)

            _, predicted = torch.max(outputs, dim=1)
            correct += (predicted == batch_labels).sum().item()
            total += batch_inputs.size(0)

    avg_loss = running_loss / total
    accuracy = correct / total

    return avg_loss, accuracy

## Train the Model

In [29]:
from tqdm import tqdm
import torch

num_epochs = 20
patience = 3  # stop if no improvement for 3 consecutive epochs
best_val_loss = float('inf')
epochs_no_improve = 0

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    loop = tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}", leave=False)

    for batch_inputs, batch_labels in loop:
        batch_inputs = batch_inputs.to(device)
        batch_labels = batch_labels.to(device)

        optimizer.zero_grad()
        outputs = model(batch_inputs)  # logits (batch_size, num_classes)
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * batch_inputs.size(0)
        _, predicted = torch.max(outputs, dim=1)
        correct += (predicted == batch_labels).sum().item()
        total += batch_inputs.size(0)

        loop.set_postfix(loss=loss.item(), accuracy=correct/total)

    epoch_loss = running_loss / total
    epoch_acc = correct / total

    # Validation step, using the evaluate() function defined above
    val_loss, val_acc = evaluate(model, val_dataloader, criterion, device)

    print(f"Epoch {epoch+1}/{num_epochs} — Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.4f} | Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

    # Early stopping logic
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_no_improve = 0
        # Save best model if needed
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print(f"Early stopping at epoch {epoch+1} due to no improvement in val loss.")
            break

Epoch 1/20 — Train Loss: 2.9705, Train Acc: 0.2301 | Val Loss: 2.4400, Val Acc: 0.3576
Epoch 2/20 — Train Loss: 2.1419, Train Acc: 0.4068 | Val Loss: 2.1868, Val Acc: 0.4152
Epoch 3/20 — Train Loss: 1.7706, Train Acc: 0.4711 | Val Loss: 2.0930, Val Acc: 0.4352
Epoch 4/20 — Train Loss: 1.4915, Train Acc: 0.5178 | Val Loss: 2.1030, Val Acc: 0.4573
Epoch 5/20 — Train Loss: 1.2416, Train Acc: 0.5624 | Val Loss: 2.1703, Val Acc: 0.4629
Epoch 6/20 — Train Loss: 1.0319, Train Acc: 0.6047 | Val Loss: 2.2474, Val Acc: 0.4684
Early stopping at epoch 6 due to no improvement in val loss.
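
The early-stopping rule can be replayed on the validation losses from the log above to see which epoch produced the saved checkpoint. This sketch copies the logged values by hand and uses the same `patience = 3`; the variables `best_epoch` and `stopped_epoch` are added here for illustration.

```python
val_losses = [2.4400, 2.1868, 2.0930, 2.1030, 2.1703, 2.2474]  # from the training log
patience = 3

best_val_loss = float("inf")
best_epoch = None
epochs_no_improve = 0
stopped_epoch = None

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_val_loss:
        # New best: this is the point where the checkpoint is written
        best_val_loss = val_loss
        best_epoch = epoch
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            stopped_epoch = epoch
            break

print(best_epoch, best_val_loss, stopped_epoch)  # 3 2.093 6
```

The lowest validation loss is at epoch 3, so `best_model.pth` holds the epoch-3 weights while the in-memory `model` holds the epoch-6 weights. Validation accuracy was still rising at epoch 6 (0.4684 vs 0.4352), so the two checkpoints trade loss against accuracy. To predict with the lowest-loss weights, they can be restored with `model.load_state_dict(torch.load('best_model.pth', map_location=device))`.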


In [30]:
idx2label = {i: label for i, label in enumerate(le.classes_)}

In [31]:
def predict_category(text, model, word2idx, idx2label, max_len=32, device='cpu'):
    model.eval()
    tokens = word_tokenize(text.lower())
    token_ids = [word2idx.get(token, word2idx['<UNK>']) for token in tokens]

    # Padding
    if len(token_ids) < max_len:
        token_ids += [word2idx['<PAD>']] * (max_len - len(token_ids))
    else:
        token_ids = token_ids[:max_len]

    input_tensor = torch.tensor([token_ids], dtype=torch.long).to(device)

    with torch.no_grad():
        logits = model(input_tensor)  # shape: (1, num_classes)
        probs = torch.softmax(logits, dim=1)
        pred_class = torch.argmax(probs, dim=1).item()

    return idx2label[pred_class], probs.cpu().numpy().flatten()

In [33]:
texts_to_predict = [
    "The economy is showing signs of recovery after a challenging year.",
    "New breakthrough in cancer research offers hope for many patients.",
    "The government has announced new policies to tackle climate change.",
    "The latest smartphone model features a stunning new design.",
    "Local sports team wins championship after a thrilling final match.",
    "Artists around the world gather to celebrate cultural diversity.",
    "The fashion industry is embracing sustainable and eco-friendly materials.",
    "A new cafe in town offers delicious vegan options.",
    "The education system is evolving to include more technology in classrooms.",
    "Travel restrictions have eased, boosting tourism worldwide."
]

for text in texts_to_predict:
    predicted_category, probabilities = predict_category(text, model, word2idx, idx2label, max_len=32, device=device)
    print(f"Text: {text}")
    print(f"Predicted category: {predicted_category}")
    print(f"Probabilities: {probabilities}\n")

Text: The economy is showing signs of recovery after a challenging year.
Predicted category: BUSINESS
Probabilities: [7.1672985e-05 6.2448024e-05 4.0127542e-03 5.0124645e-01 3.9949510e-04
 2.8582654e-04 1.1320011e-03 2.3961327e-05 4.7754154e-05 4.5505296e-03
 2.3165389e-04 8.0711935e-03 7.7646480e-05 2.0634379e-05 5.2365763e-03
 2.2581477e-01 7.2665550e-03 7.0395108e-05 1.2457173e-01 1.2219480e-03
 6.0723198e-04 1.3797791e-02 1.7507903e-04 1.1187567e-03 3.3984665e-02
 7.3305513e-05 2.6453963e-05 1.4153704e-04 1.6003679e-03 7.1123155e-05
 1.8926252e-05 3.2589879e-04 7.6379399e-03 8.2760658e-03 2.0206635e-04
 1.3346942e-03 9.6809381e-06 2.4740578e-04 6.1990658e-04 5.9244683e-04
 2.6351145e-02 1.8371470e-02]

Text: New breakthrough in cancer research offers hope for many patients.
Predicted category: HEALTHY LIVING
Probabilities: [1.7785391e-05 6.7837021e-07 1.0860733e-04 2.5166050e-03 3.2954402e-02
 1.8130962e-04 2.8438328e-05 3.6480219e-06 1.8190220e-05 4.6419227e-03
 4.2989021e-05 8.66

In [34]:
import numpy as np

def top_k_categories(probs, idx2label, k=3):
    topk_idx = np.argsort(probs)[::-1][:k]
    return [(idx2label[i], probs[i]) for i in topk_idx]

In [35]:
top3 = top_k_categories(probabilities, idx2label)
print(top3)

[('TRAVEL', 0.9387517), ('TASTE', 0.016633136), ('WEIRD NEWS', 0.007969035)]
