<a href="https://colab.research.google.com/github/Amar-cmd/GenAI-with-Python-and-PyTorch-Code/blob/main/Chapter%203/FastText.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1.1 FastText kya hai?**

FastText Facebook AI ka model hai jo mainly **text classification + word embeddings** ke liye use hota hai.
Basic idea:

* Har word ko sirf ek token na maan kar, **character n-grams** ka bag banaya jata hai.

  * Example: word = `where`
  * Hum boundary add karte hain: `<where>`
  * 3-gram: `<wh`, `whe`, `her`, `ere`
  * 4-gram: `<whe`, `wher`, `here` …
* Har n-gram ka apna embedding hota hai.
* Word ka embedding = uske saare n-grams ka average/sum.

Isse fayda:

* **OOV (out-of-vocabulary)** words bhi handle ho jate hain (kyunki unke characters toh dekhe ja sakte hain).
* Morphology (prefix/suffix) bhi capture hoti hai (e.g., play, player, playing).

# **1.2 FastText for classification (jo hum implement karenge)**

Supervised FastText classifier roughly yeh karta hai:

1. Input: Ek sentence / document ka text.
2. Preprocess:

   * Lowercase, punctuation remove, simple tokenization (split by spaces).
3. Har word → **char n-grams** + ek special “whole word” n-gram.
4. Har n-gram ko ek **integer id** se map karte hain (hum yahan hashing trick use karenge).
5. Sentence ke sabhi n-grams ke embeddings ka **average** nikalte hain (bag-of-ngrams).
6. Yeh average vector ek **linear layer + softmax** mein jata hai → class probabilities.

Mathematically:

* Sentence text ke liye n-gram ids: ({g_1, g_2, \dots, g_M})
* Embedding matrix (E \in \mathbb{R}^{V \times d}), jahan (V) = bucket size (hashed vocab size), (d) = embedding dim.

Sentence vector:
$$
v = \frac{1}{M} \sum_{i=1}^M E[g_i]
$$

Phir classifier:

$$
\hat{y} = \text{softmax}(W v + b)
$$

Loss: standard **cross-entropy** loss between predicted distribution and true label.


In [1]:
!pip install datasets



In [2]:
# we need to clean raw text (remove punct, lowercase etc)
# python ka re module regex ke liye best hai
import re

# FastText ke liye har character n-gram ko hash karna hoga
# To hmko ek deterministic hash chahiye...jo string -> integer de.
import hashlib

# Reproducality impt hai (same shuffling, same initialisation etc.)
# Isliye global python RNG ko seend krne k liye 'random' chahiye
import random

# For implementation of code using pytorch, torch is needed
import torch

# 'nn' module se neural network building blocks milte hain
# Linear Layer, Embedding, Losses etc
import torch.nn as nn

# Pytorch ka Dataset + DataLoader pattern standard hai
# Dataset -> how to get one sample
# DataLoader → batching, shuffling nd all...
from torch.utils.data import Dataset, DataLoader


# Dataset jo hugging face se lenge...uske liye ye chahiye
# 'preinstalled nhi hota: "pip install datasets" karna hoga
from datasets import load_dataset

In [14]:
# ======================
# 1. CONFIG AND SEED
# ======================

# I want my experiments to be reproducible.
# Using a fixed random seed means shuffling, weight init, etc. will be the same every run.
SEED = 42  # classic "magic" seed, koi bhi fixed int chalega

# Python's own RNG ko seed kar raha hoon (things like random.shuffle, etc.)
random.seed(SEED)

# PyTorch ke RNG ko bhi same seed deta hoon,
# taaki weight initialisation + DataLoader ke kuch random parts stable rahein.
torch.manual_seed(SEED)


# Ab mujhe decide karna hai ki training CPU pe hogi ya GPU pe.
# If a CUDA GPU is available, use it; otherwise fall back to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:", DEVICE)


# ======================
# 2. DATASET SIZE CONFIG
# ======================

# AG News ka pura dataset kaafi bada hai; learning/demo ke liye mujhe
# itna bada dataset nahi chahiye, warna training slow ho jayegi.
# So I'll cap the number of training and test samples.
MAX_TRAIN_SAMPLES = 5000   # small but non-toy size → model kuch sikh lega
MAX_TEST_SAMPLES = 2000    # enough to get a reasonable estimate of accuracy


# ======================
# 3. FASTTEXT HYPERPARAMETERS
# ======================

# FastText ka core idea: character n-grams.
# Mujhe n ki range choose karni hai. 3–6 typical hai:
#  - 3-grams capture small patterns (pre/sufixes)
#  - 6-grams thoda longer substrings ko capture karte hain.
MIN_N = 3
MAX_N = 6

# Har unique n-gram ko dictionary me store nahi karna (bohot bada ho sakta hai),
# isliye main hashing trick use kar raha hoon.
# BUCKET_SIZE = kitne possible hashed IDs (0 .. BUCKET_SIZE-1).
# Zyada bucket size → kam collisions, but zyada memory.
BUCKET_SIZE = 200_000  # 200k n-gram "slots" should be enough for demo


# Kitne-dimensional embedding chahiye har n-gram ke liye?
# 100 is a nice middle ground: not too small, not too large.
EMBED_DIM = 100

# ======================
# 4. TRAINING HYPERPARAMETERS
# ======================

# Batch size: bada rakho to training stable hoti hai but memory usage badhta hai.
# 64 is a safe default for most GPUs/CPUs for this model.
BATCH_SIZE = 64

# Epochs: training passes over the entire dataset.
# Demo ke liye 5 epochs rakhta hoon; agar zyada accuracy chahiye,
# baad me isse increase kar sakte hain.
NUM_EPOCHS = 5

# Learning rate for SGD. FastText-style models usually kaafi simple hote hain
# (embedding + linear), toh relatively high LR (0.1) often works.
# Agar training unstable lagti hai (loss explode ho), isse kam (0.05 ya 0.01) kar sakte hain.
LR = 0.1


Using device: cuda


In [15]:
class NgramEncoder:
    """
    Converts raw text into a 1D tensor of hashed n-gram IDs.
    FastText-style: char n-grams from <word> boundaries + whole-word n-gram.
    """

    def __init__(self, min_n=3, max_n=6, bucket_size=200_000):
        # Yahan main yeh decide kar raha hoon ki shortest aur longest
        # character n-gram kitne length ka hoga.
        # Typical FastText ke liye 3–6 ek sensible default hai.
        self.min_n = min_n
        self.max_n = max_n

        # Hashing trick use kar raha hoon instead of explicit vocab dict.
        # bucket_size = kitne possible "slots" hain jahan n-grams jaa sakte hain.
        # int in [0, bucket_size).
        self.bucket_size = bucket_size

    def _basic_tokenize(self, text):
        # Pehle mujhe crude level pe text clean karna hai:
        # - Lowercase taaki "Apple" aur "apple" alag na treat hon.
        text = text.lower()

        # - Non-alphanumeric characters (punctuation etc.) ko spaces se replace karna,
        #   taaki splitting simple ho jaye.
        #   Regex "[^a-z0-9\s]" ka matlab: a-z, 0-9, ya whitespace ke alawa sab kuch.
        text = re.sub(r"[^a-z0-9\s]", " ", text)

        # Ab simple whitespace split se tokens nikal leta hoon.
        tokens = text.split()

        # Kuch cases me extra spaces se empty strings aa sakte hain,
        # isliye unhe filter out kar deta hoon.
        return [t for t in tokens if t]

    def _char_ngrams(self, word):
        """
        Create character n-grams with < > boundaries, plus a whole-word token.
        Example: word="where" -> "<where>"
        """
        # FastText ka trick: word ke start aur end pe boundary markers add karte hain.
        # Yahan "<word>" ki tarah.
        # Isse prefix/suffix properly capture hote hain (e.g., <un, ing> etc.).
        w = f"<{word}>"
        ngrams = []

        # Ab main chosen n range (min_n .. max_n) ke liye
        # sliding window se character n-grams banaunga.
        for n in range(self.min_n, self.max_n + 1):
            # Agar word ki length hi n se chhoti hai, to is length ke n-grams
            # exist nahi karenge -> continue.
            if len(w) < n:
                continue

            # Standard sliding window:
            # w[i : i + n] gives substring of length n starting at i.
            for i in range(len(w) - n + 1):
                ngrams.append(w[i : i + n])

        # FastText typically ek "full word" representation bhi rakhta hai.
        # Yahan main usko ek special pattern se mark kar raha hoon: "#<word>#"
        # taaki char n-grams se thoda distinguish ho sake (pure word embedding effect).
        ngrams.append(f"#{w}#")
        return ngrams

    def _hash_ngram(self, ngram):
        """
        Hash n-gram string to an integer in [0, bucket_size).
        We use md5 for deterministic hashing.
        """
        # Mujhe deterministic mapping chahiye from string -> integer.
        # md5 ka hexdigest stable hota hai (same input -> same output).
        h = hashlib.md5(ngram.encode("utf-8")).hexdigest()

        # Hex string ko base-16 integer me convert kar raha hoon.
        # Phir modulo bucket_size se compress kar ke
        # valid index range [0, bucket_size) me le aata hoon.
        return int(h, 16) % self.bucket_size

    def encode(self, text):
        """
        text -> LongTensor of hashed n-gram IDs for the whole sentence/document.
        """
        # Step 1: raw text ko tokens me convert karo (basic clean + split).
        tokens = self._basic_tokenize(text)

        # Yahan IDs store karne ke liye ek normal Python list rakhta hoon,
        # baad me ise torch.tensor me convert karunga.
        ids = []

        # Har token ke liye:
        # - uske char n-grams banao
        # - har n-gram ko hash karke ID me convert karo
        # - sab IDs ek hi list me daal do
        for tok in tokens:
            for ng in self._char_ngrams(tok):
                ids.append(self._hash_ngram(ng))

        # Edge case: agar cleaning ke baad text khaali ho gaya
        # (e.g., sirf punctuation tha) to ids list empty hogi.
        # EmbeddingBag ko kam se kam ek index chahiye, warna crash ho sakta hai.
        # Isliye yahan ek dummy id (0) push kar raha hoon.
        if not ids:
            ids.append(0)

        # Finally, PyTorch model ke liye LongTensor banana zaruri hai
        # (embedding layers indexes ko Long type chahiye).
        return torch.tensor(ids, dtype=torch.long)


In [16]:
class FastTextDataset(Dataset):
    """
    Wrap HF dataset into a PyTorch Dataset:
    returns (ngram_ids_tensor, label_int).
    """

    def __init__(self, texts, labels, encoder: NgramEncoder):
        # Sabse pehle sanity check: texts aur labels ki length same honi chahiye.
        # Agar mismatch ho, matlab data corrupt/config galat hai → turant assert fail karwa do.
        assert len(texts) == len(labels)

        # Raw texts (list of strings ya HF objects) ko store kar leta hoon.
        # Main unko yahin encode nahi kar raha, kyunki encoding thoda heavy hai.
        # Instead, __getitem__ pe on-the-fly encode karunga (lazy encoding).
        self.texts = texts

        # Labels ko bhi ek list ke form me store kar raha hoon (integers hone chahiye).
        self.labels = labels

        # Encoder object (NgramEncoder) ko store karna zaroori hai,
        # taki har sample ke liye text -> n-gram IDs convert kar saku.
        self.encoder = encoder

    def __len__(self):
        # DataLoader ko pata hona chahiye dataset me kitne samples hain.
        # Simple answer: jitne texts hain utne hi samples.
        return len(self.texts)

    def __getitem__(self, idx):
        # DataLoader jab koi index maangta hai (0..len-1),
        # to yahan se ek single sample return hoga.

        # Kabhi-kabhi HF dataset elements string ke alawa bhi object ho sakte hain,
        # so main defensive programming kar raha hoon: ensure it's a string.
        text = str(self.texts[idx])

        # Labels ideally already int hone chahiye, but
        # safe side: explicitly int() cast kar deta hoon.
        label = int(self.labels[idx])

        # Ab main encoder ko call karta hoon:
        # raw sentence -> 1D LongTensor of hashed n-gram IDs.
        # Ye hi humara "input representation" hai model ke liye.
        ngram_ids = self.encoder.encode(text)  # 1D LongTensor

        # Dataset ka contract: __getitem__ returns a single training example.
        # Yahan main (ngram_ids_tensor, label_int) return kar raha hoon.
        return ngram_ids, label


In [17]:
def collate_batch(batch):
    """
    batch: list of (ngram_ids_tensor, label_int)
    We need:
      - text: all n-gram ids concatenated into one long tensor
      - offsets: start index of each example inside text
      - labels: tensor of labels
    Suitable for nn.EmbeddingBag.
    """

    # DataLoader ne batch bana ke diya hai ek list ke form me:
    #   [ (ngram_ids_0, label_0),
    #     (ngram_ids_1, label_1),
    #     ... ]
    #
    # Mujhe inn ko alag-alag lists me todna hai:
    #   - ngram_tensors: [tensor0, tensor1, ...]
    #   - labels: [l0, l1, ...]
    #
    # zip(*batch) exactly ye kaam karta hai -> transpose jaisa behavior.
    ngram_tensors, labels = zip(*batch)  # list of tensors, list of ints

    # ==========================
    # Offsets banana (EmbeddingBag style)
    # ==========================
    # EmbeddingBag expect karta hai:
    #   text: ek single 1D tensor jisme saare examples ke n-gram IDs concat hon
    #   offsets: 1D tensor jiska har element batata hai
    #            "ye example kahan se start ho raha hai 'text' ke andar?"
    #
    # Example:
    #   example 0 ids: [5, 10, 11]  (len=3)
    #   example 1 ids: [7, 8]       (len=2)
    #
    #   text    = [5, 10, 11, 7, 8]
    #   offsets = [0, 3]
    #
    #   EmbeddingBag ko pata chal jaata hai:
    #     - sample 0 -> text[0:3]
    #     - sample 1 -> text[3:5]

    # Main offsets list [0] se start karta hoon (first sample always index 0 se start).
    offsets = [0]

    # Ab har tensor ko (last wale ko chhod ke) dekhunga:
    # next offset = previous offset + length of current tensor
    for t in ngram_tensors[:-1]:
        offsets.append(offsets[-1] + len(t))

    # Python list ko PyTorch LongTensor me convert karna zaroori hai,
    # kyunki EmbeddingBag ko Long dtype chahiye.
    offsets = torch.tensor(offsets, dtype=torch.long)

    # ==========================
    # text banana (concat all IDs)
    # ==========================
    # Ab mujhe saare individual n-gram id tensors ko ek single 1D vector me
    # jodna hai. torch.cat exactly yahi karta hai.
    #
    # Note: yeh sab tensors already 1D hain (thanks to encoder),
    # isliye dim=0 default hi sahi hai.
    text = torch.cat(ngram_tensors)

    # Labels ko bhi list se tensor me convert kar raha hoon.
    # CrossEntropyLoss etc. ko LongTensor labels chahiye hote hain.
    labels = torch.tensor(labels, dtype=torch.long)

    # Final return:
    #   - text: [total_ngrams_in_batch]
    #   - offsets: [batch_size]
    #   - labels: [batch_size]
    return text, offsets, labels


In [18]:
class FastTextClassifier(nn.Module):
    """
    FastText-style classifier:
      - EmbeddingBag over n-gram ids (mean pooling)
      - Linear layer to num_classes
    """

    def __init__(self, vocab_size, embed_dim, num_classes):
        # Sabse pehle main soch raha hoon: yeh PyTorch module hai,
        # toh nn.Module ko properly init karna hoga.
        super().__init__()

        # FastText ka core idea: ek simple bag-of-ngrams embedding + linear classifier.
        #
        # Yahan main nn.EmbeddingBag use kar raha hoon instead of plain nn.Embedding:
        #  - EmbeddingBag directly multiple indices ka mean/sum nikal deta hai
        #    (bag-of-words style), bina manually average karne ke.
        #  - "mode='mean'" ka matlab: har sentence ke n-gram embeddings ka average lo.
        #
        # vocab_size  -> hashing bucket size (kitne possible n-gram IDs)
        # embed_dim   -> har n-gram ka vector size
        self.embedding = nn.EmbeddingBag(
            vocab_size, embed_dim, mode="mean"
        )

        # Ab mujhe sentence embedding (embed_dim) ko class logits me map karna hai.
        # Sabse simple classifier: ek linear layer.
        # Input: [batch_size, embed_dim]
        # Output: [batch_size, num_classes]
        self.fc = nn.Linear(embed_dim, num_classes)

        # Custom weight initialization karna hai, taaki training stable start ho.
        self._init_weights()

    def _init_weights(self):
        # Initialize embeddings and linear layer
        #
        # Yahan main ek chhota sa init range decide kar raha hoon:
        # 0.5 / EMBED_DIM -> jaise-jaise dimension badhta hai,
        # range chhoti ho jati hai, taaki values zyada spread out na ho.
        # (Note: yeh global EMBED_DIM use kar raha hoon;
        #  alternatively, embed_dim argument se bhi le sakta tha.)
        init_range = 0.5 / EMBED_DIM

        # Embedding weights ko uniform distribution se init kar raha hoon
        # [-init_range, init_range] ke beech.
        self.embedding.weight.data.uniform_(-init_range, init_range)

        # Linear layer ke weights ko bhi same range me init kar raha hoon.
        self.fc.weight.data.uniform_(-init_range, init_range)

        # Bias ka best default: zero se start karo.
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        # forward pass me mujhe 2 cheeze milti hain:
        #   text:    [N_total_ngrams_in_batch]
        #   offsets: [batch_size]
        #
        # Ye exactly woh format hai jo collate_fn ne banaya tha
        # EmbeddingBag ke liye.

        # EmbeddingBag yahan ye karega:
        #   - offsets ke hisaab se text ko sentences me split karega
        #   - har sentence ke indices ke embeddings ka mean lega
        #
        # Result: embedded shape = [batch_size, embed_dim]
        embedded = self.embedding(text, offsets)  # [batch_size, embed_dim]

        # Ab simple linear classifier:
        #   embedded -> logits (pre-softmax scores)
        # logits shape: [batch_size, num_classes]
        logits = self.fc(embedded)                # [batch_size, num_classes]

        # CrossEntropyLoss automatically logits + integer labels se
        # softmax + negative log-likelihood handle karega,
        # isliye yahan sirf logits return karna kaafi hai.
        return logits


In [19]:
# Pehle main user ko thoda feedback dena chahta hoon ki ab dataset load ho raha hai,
# taaki console me silent na lage.
print("Loading AG News dataset from Hugging Face...")

# Hugging Face datasets library ka built-in text classification dataset use kar raha hoon:
# "ag_news" -> 4 news categories (World, Sports, Business, Sci/Tech).
# Isme already 'train' aur 'test' splits defined hain.
raw_dataset = load_dataset("ag_news")


# Ab mujhe class labels ki info chahiye.
# HF dataset me label feature ke andar names stored hote hain.
label_names = raw_dataset["train"].features["label"].names

# Kitni classes hain? -> simply len(label_names)
num_classes = len(label_names)

print("Classes:", label_names)


# ==========================
# Limit dataset size for learning/demo
# ==========================
# Pura AG News training set ~120k samples hai.
# Learning purpose ke liye mujhe itna bada dataset nahi chahiye,
# warna training slow ho jayegi, especially CPU pe.
#
# Isliye upar define kiya hua MAX_TRAIN_SAMPLES / MAX_TEST_SAMPLES use karke
# subset le raha hoon.

# Training texts aur labels (subset)
train_texts = raw_dataset["train"]["text"][:MAX_TRAIN_SAMPLES]
train_labels = raw_dataset["train"]["label"][:MAX_TRAIN_SAMPLES]

# Test texts aur labels (subset)
test_texts = raw_dataset["test"]["text"][:MAX_TEST_SAMPLES]
test_labels = raw_dataset["test"]["label"][:MAX_TEST_SAMPLES]


# ==========================
# Encoder + Dataset objects
# ==========================

# Ab mujhe NgramEncoder ka ek instance chahiye,
# jo raw text ko hashed n-gram IDs me convert kare.
# MIN_N, MAX_N, BUCKET_SIZE global config se aa rahe hain.
encoder = NgramEncoder(min_n=MIN_N, max_n=MAX_N, bucket_size=BUCKET_SIZE)

# Training ke liye FastTextDataset banata hoon
# jo HF lists ko PyTorch Dataset interface me wrap karega.
train_dataset = FastTextDataset(train_texts, train_labels, encoder)

# Similarly, test side ke liye bhi ek Dataset
test_dataset = FastTextDataset(test_texts, test_labels, encoder)


# ==========================
# DataLoader (batching, shuffling)
# ==========================

# Training DataLoader:
# - batch_size = BATCH_SIZE (e.g., 64)
# - shuffle=True: har epoch me data ko randomize karna, taaki training better generalize kare.
# - collate_fn=collate_batch: humara custom collate function jo
#   n-gram tensors ko concat + offsets bana deta hai (EmbeddingBag format).
train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_batch,
)

# Test DataLoader:
# - shuffle=False: evaluation ke liye order matter nahi karta,
#   but deterministic rakhna sahi habit hai.
# - baaki sab train jaisa, sirf dataset change.
test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collate_batch,
)


Loading AG News dataset from Hugging Face...
Classes: ['World', 'Sports', 'Business', 'Sci/Tech']


In [20]:
# ======================
# 6. Train & Evaluate
# ======================

# Ab tak mere paas:
#  - encoder ready
#  - Dataset + DataLoader ready
#  - config ready
# Ab mujhe actual FastText classifier ka instance banana hai.

model = FastTextClassifier(
    vocab_size=BUCKET_SIZE,   # hashing bucket size (kitne possible n-gram IDs)
    embed_dim=EMBED_DIM,      # har n-gram ka vector size
    num_classes=num_classes,  # AG News ke 4 classes
).to(DEVICE)                  # model ko CPU/GPU jahan bhi available ho, wahan bhej do

# Classification problem hai, logits + integer labels:
# CrossEntropyLoss is the standard choice (softmax + NLL combined).
criterion = nn.CrossEntropyLoss()

# Optimizer choose karna hai.
# FastText jaisa simple linear+embedding model ke liye SGD kaafi acha kaam karta hai.
# LR pehle config me set kiya tha (LR = 0.1).
optimizer = torch.optim.SGD(model.parameters(), lr=LR)


def evaluate(model, dataloader):
    # Evaluation mode:
    #  - dropout, batchnorm jaise layers alag behave karte hain, isliye model.eval() zaroori
    model.eval()

    total_loss = 0.0
    total_correct = 0
    total_examples = 0

    # Evaluation me gradients ki zarurat nahi hoti,
    # to torch.no_grad() use karke memory + compute dono bachate hain.
    with torch.no_grad():
        for text, offsets, labels in dataloader:
            # Batch ko correct device (CPU/GPU) pe move karo.
            text = text.to(DEVICE)
            offsets = offsets.to(DEVICE)
            labels = labels.to(DEVICE)

            # Forward pass se logits milenge.
            logits = model(text, offsets)

            # Loss compute karo (per batch).
            loss = criterion(logits, labels)

            # total_loss me "sum of losses weighted by batch size" store kar raha hoon,
            # taaki baad me average loss nikal saku:
            # avg_loss = total_loss / total_examples
            total_loss += loss.item() * labels.size(0)

            # Predictions: logits ka argmax class dimension (dim=1) pe.
            preds = logits.argmax(dim=1)

            # Kitne correct predictions hue is batch me?
            total_correct += (preds == labels).sum().item()

            # Total examples count bhi maintain karna hoga.
            total_examples += labels.size(0)

    # Loop ke baad: average loss aur accuracy nikal lo.
    avg_loss = total_loss / total_examples
    accuracy = 100.0 * total_correct / total_examples
    return avg_loss, accuracy


print("Starting training...")

# Epoch loop:
# Har epoch = poora training dataset ek baar dekhna.
for epoch in range(1, NUM_EPOCHS + 1):
    # Training mode ON (dropout, batchnorm etc. training behavior).
    model.train()

    total_loss = 0.0
    total_correct = 0
    total_examples = 0

    # Train DataLoader se batch-by-batch data aayega.
    for text, offsets, labels in train_loader:
        # Batch ko correct device pe move karo.
        text = text.to(DEVICE)
        offsets = offsets.to(DEVICE)
        labels = labels.to(DEVICE)

        # Har step pe gradients ko reset karna zaroori hai,
        # warna PyTorch previous gradients accumulate karta hai.
        optimizer.zero_grad()

        # Forward pass: logits compute karo.
        logits = model(text, offsets)

        # Loss compute karo current batch ka.
        loss = criterion(logits, labels)

        # Backward pass: dLoss/dParams calculate karo.
        loss.backward()

        # Optimizer step: parameters ko update karo gradients ke according.
        optimizer.step()

        # Logging ke liye training loss + accuracy track karta hoon.
        total_loss += loss.item() * labels.size(0)

        # Predictions for accuracy:
        preds = logits.argmax(dim=1)
        total_correct += (preds == labels).sum().item()
        total_examples += labels.size(0)

    # Ek epoch khatam hone ke baad:
    # Average training loss and accuracy nikalte hain.
    train_loss = total_loss / total_examples
    train_acc = 100.0 * total_correct / total_examples

    # Validation / test metrics evaluate() function se:
    val_loss, val_acc = evaluate(model, test_loader)

    # Console pe progress print kar raha hoon,
    # taaki pata chale training improve ho rahi hai ya nahi.
    print(
        f"Epoch {epoch}/{NUM_EPOCHS} | "
        f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
        f"Test Loss: {val_loss:.4f}, Test Acc: {val_acc:.2f}%"
    )

print("Training complete.")


Starting training...
Epoch 1/5 | Train Loss: 1.3803, Train Acc: 29.84% | Test Loss: 1.3955, Test Acc: 25.70%
Epoch 2/5 | Train Loss: 1.3781, Train Acc: 29.94% | Test Loss: 1.3973, Test Acc: 25.70%
Epoch 3/5 | Train Loss: 1.3779, Train Acc: 29.94% | Test Loss: 1.3964, Test Acc: 25.70%
Epoch 4/5 | Train Loss: 1.3781, Train Acc: 29.94% | Test Loss: 1.3957, Test Acc: 25.70%
Epoch 5/5 | Train Loss: 1.3780, Train Acc: 29.94% | Test Loss: 1.3964, Test Acc: 25.70%
Training complete.


In [21]:
def evaluate(model, dataloader):
    # Evaluation mode:
    #  - dropout, batchnorm jaise layers alag behave karte hain, isliye model.eval() zaroori
    model.eval()

    total_loss = 0.0
    total_correct = 0
    total_examples = 0

    # Evaluation me gradients ki zarurat nahi hoti,
    # to torch.no_grad() use karke memory + compute dono bachate hain.
    with torch.no_grad():
        for text, offsets, labels in dataloader:
            # Batch ko correct device (CPU/GPU) pe move karo.
            text = text.to(DEVICE)
            offsets = offsets.to(DEVICE)
            labels = labels.to(DEVICE)

            # Forward pass se logits milenge.
            logits = model(text, offsets)

            # Loss compute karo (per batch).
            loss = criterion(logits, labels)

            # total_loss me "sum of losses weighted by batch size" store kar raha hoon,
            # taaki baad me average loss nikal saku:
            # avg_loss = total_loss / total_examples
            total_loss += loss.item() * labels.size(0)

            # Predictions: logits ka argmax class dimension (dim=1) pe.
            preds = logits.argmax(dim=1)

            # Kitne correct predictions hue is batch me?
            total_correct += (preds == labels).sum().item()

            # Total examples count bhi maintain karna hoga.
            total_examples += labels.size(0)

    # Loop ke baad: average loss aur accuracy nikal lo.
    avg_loss = total_loss / total_examples
    accuracy = 100.0 * total_correct / total_examples
    return avg_loss, accuracy

In [22]:
print("Starting training...")

# Epoch loop:
# Har epoch = poora training dataset ek baar dekhna.
for epoch in range(1, NUM_EPOCHS + 1):
    # Training mode ON (dropout, batchnorm etc. training behavior).
    model.train()

    total_loss = 0.0
    total_correct = 0
    total_examples = 0

    # Train DataLoader se batch-by-batch data aayega.
    for text, offsets, labels in train_loader:
        # Batch ko correct device pe move karo.
        text = text.to(DEVICE)
        offsets = offsets.to(DEVICE)
        labels = labels.to(DEVICE)

        # Har step pe gradients ko reset karna zaroori hai,
        # warna PyTorch previous gradients accumulate karta hai.
        optimizer.zero_grad()

        # Forward pass: logits compute karo.
        logits = model(text, offsets)

        # Loss compute karo current batch ka.
        loss = criterion(logits, labels)

        # Backward pass: dLoss/dParams calculate karo.
        loss.backward()

        # Optimizer step: parameters ko update karo gradients ke according.
        optimizer.step()

        # Logging ke liye training loss + accuracy track karta hoon.
        total_loss += loss.item() * labels.size(0)

        # Predictions for accuracy:
        preds = logits.argmax(dim=1)
        total_correct += (preds == labels).sum().item()
        total_examples += labels.size(0)

    # Ek epoch khatam hone ke baad:
    # Average training loss and accuracy nikalte hain.
    train_loss = total_loss / total_examples
    train_acc = 100.0 * total_correct / total_examples

    # Validation / test metrics evaluate() function se:
    val_loss, val_acc = evaluate(model, test_loader)

    # Console pe progress print kar raha hoon,
    # taaki pata chale training improve ho rahi hai ya nahi.
    print(
        f"Epoch {epoch}/{NUM_EPOCHS} | "
        f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
        f"Test Loss: {val_loss:.4f}, Test Acc: {val_acc:.2f}%"
    )

print("Training complete.")

Starting training...
Epoch 1/5 | Train Loss: 1.3781, Train Acc: 29.94% | Test Loss: 1.3947, Test Acc: 25.70%
Epoch 2/5 | Train Loss: 1.3781, Train Acc: 29.94% | Test Loss: 1.3941, Test Acc: 25.70%
Epoch 3/5 | Train Loss: 1.3783, Train Acc: 29.94% | Test Loss: 1.3970, Test Acc: 25.70%
Epoch 4/5 | Train Loss: 1.3782, Train Acc: 29.94% | Test Loss: 1.3943, Test Acc: 25.70%
Epoch 5/5 | Train Loss: 1.3782, Train Acc: 29.94% | Test Loss: 1.3945, Test Acc: 25.70%
Training complete.


In [23]:
# ======================
# 7. Inference Demo
# ======================

def predict(text: str):
    # Prediction time pe mujhe ensure karna hai ki model evaluation mode me ho:
    #  - dropout, batchnorm jaise layers (agar hote) inference behavior use karein.
    model.eval()

    # Inference ke liye gradients ki zarurat nahi hai,
    # to torch.no_grad() use karke compute + memory dono bachata hoon.
    with torch.no_grad():
        # Step 1: raw input text ko wahi encoder se pass karna hai
        # jo training me use kiya tha, taaki representation consistent rahe.
        # encoder.encode() -> 1D LongTensor of n-gram IDs.
        ngram_ids = encoder.encode(text).to(DEVICE)

        # EmbeddingBag ko do cheeze chahiye: text + offsets.
        # Yahan batch size = 1 hai, to offsets hamesha [0] hoga
        # (single example ke n-grams text[0:] se start ho rahe hain).
        offsets = torch.tensor([0], dtype=torch.long, device=DEVICE)

        # Forward pass:
        #  - ngram_ids: [N_ngrams]
        #  - offsets: [1]
        # Output: logits shape [1, num_classes]
        logits = model(ngram_ids, offsets)

        # Mujhe human-friendly probabilities chahiye,
        # isliye softmax laga raha hoon class dimension (dim=1) pe.
        probs = torch.softmax(logits, dim=1)

        # Ab highest probability wali class ka index nikalna hai.
        # probs.argmax(dim=1) -> tensor([class_id])
        # .item() se python int banaya, fir explicit int(...) for safety.
        pred_label_id = int(probs.argmax(dim=1).item())

        # label_names list me se us class id ka naam le leta hoon
        # (e.g., "World", "Sports", etc.) aur confidence (probability) bhi return karta hoon.
        return label_names[pred_label_id], probs[0, pred_label_id].item()


# Ab ek chhota sanity check / demo:
# Ek sentence leke dekhte hain model kya predict karta hai.
example_text = "Apple introduces new phone model for the Indian market."

# predict() se label + confidence tuple milega.
pred_label, confidence = predict(example_text)

# Output thoda readable form me print kar dete hain.
print("\nExample text:", example_text)
print(f"Predicted class: {pred_label} (confidence: {confidence:.3f})")



Example text: Apple introduces new phone model for the Indian market.
Predicted class: Sci/Tech (confidence: 0.288)


# Book Implementation

In [None]:
import re
import pandas as pd
import numpy as np
import nltk
from sklearn.datasets import fetch_20newsgroups

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
stop_words = nltk.corpus.stopwords.words('english')

In [None]:
def normalize_document(doc):
  doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I|re.A)
  doc = doc.lower()
  doc = doc.strip()

  tokens = nltk.word_tokenize(doc)

  filtered_tokens = [token for token in tokens if token not in stop_words]

  doc = ' '.join(filtered_tokens)

  return doc

normalize_corpus = np.vectorize(normalize_document)

In [None]:
cats = ['alt.atheism', 'sci.space']
newsgroup_train = fetch_20newsgroups(subset='train',
                                     categories=cats,
                                     remove=('headers', 'footers', 'quotes'))

In [None]:
print('Number of news articles = {}'.format(len(newsgroup_train.data)))

Number of news articles = 1073


In [None]:
norm_corpus = normalize_corpus(newsgroup_train.data)
norm_corpus

array(['please enlighten omnipotence contradictory definition occur universe governed rules nature thus god break anything god must allowed rules somewhere therefore omnipotence exist contradicts rules nature obviously omnipotent god change rules say definition exactly defined certainly omnipotence seem saying rules nature preexistant somehow define nature actually cause thats mean id like hear thoughts question',
       'aprkelvinjplnasagov baalkekelvinjplnasagov sorry think missed bit info transition experiment mean loss data magellan transmit data later btw nasa cut connection magellan looking forward day curious believe something funding goverment rather funding ok thats see guys around jurriaan',
       'henry made assumption gets firstest mostest wins ohhh want put fine print says thou shall wonderous rd rather use offtheshelf hardware sorry didnt see copy pournellesque proposals run along lines dollar amount reward simple goal go ahead development ill buy shelf higher cost even 

In [None]:
# !pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.9/27.9 MB[0m [31m81.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
from gensim.models.fasttext import FastText

In [None]:
tokenized_corpus = [nltk.word_tokenize(doc) for doc in norm_corpus]

In [None]:
import time

In [None]:

embedding_size = 32
context_window = 20
min_word_count = 1
sample = 1e-3
sg = 1

start_time = time.time()

ft_model = FastText(tokenized_corpus,
                    vector_size=embedding_size,
                    window=context_window,
                    min_count=min_word_count,
                    sg = sg,
                    sample=sample,
                    epochs=100)

end_time = time.time()
time_taken = end_time - start_time
print(f"Time taken to train FastText model: {time_taken:.2f} seconds")

Time taken to train FastText model: 169.64 seconds


In [None]:
minutes, seconds = divmod(time_taken, 60)
print(f"Time taken to train FastText model: {int(minutes)} min {seconds:.2f} sec")

Time taken to train FastText model: 2 min 49.64 sec


In [None]:
print(f"Unique numbers of words in the model={ft_model.wv.vectors.shape[0]}")

Unique numbers of words in the model=19421


In [None]:
ft_model.wv['sun']

array([-0.68865156,  0.16804571, -0.9665691 , -0.3416829 ,  0.78023183,
        0.25930217,  0.3499796 , -0.48844534, -0.5560808 , -0.5228781 ,
       -0.50807303, -0.55770886, -0.3148962 , -0.9736987 , -0.31482577,
       -0.22821371,  0.06310669,  0.5029337 ,  0.2532907 , -0.4708344 ,
        0.29784685, -0.25402132,  0.19896385, -0.02769833,  0.23079884,
       -0.32078335, -1.1128368 , -0.45634848, -0.19624002, -0.15969302,
       -0.03623481, -0.34593585], dtype=float32)

In [None]:
ft_model.wv['sunny']

array([-0.7969927 ,  1.3445771 , -0.8033236 , -0.31200108,  0.69085395,
        0.39706072,  1.8380454 , -0.5289925 , -1.0682884 , -1.4317443 ,
       -0.8753653 ,  0.34827802, -1.9453781 , -0.0436891 , -0.33408877,
        0.39011073,  0.0323919 ,  1.2287263 ,  0.13863759, -0.64394426,
        0.41518155,  0.19215867,  0.3469433 ,  0.41325116, -0.41990545,
        0.8155734 , -0.7166068 , -0.11166925,  0.44339675, -0.59900945,
        0.32291928, -1.049846  ], dtype=float32)

In [None]:
ft_model.wv.most_similar(positive=['god'])

[('existence', 0.863061249256134),
 ('interestingly', 0.8570623993873596),
 ('ontology', 0.8477120995521545),
 ('eternal', 0.8461553454399109),
 ('nonexistence', 0.8411461114883423),
 ('undying', 0.8393584489822388),
 ('believing', 0.8391947746276855),
 ('denials', 0.8355113863945007),
 ('exists', 0.8312482237815857),
 ('trivially', 0.8309420943260193)]

In [None]:
ft_model.wv.most_similar(positive=['sunny'])

[('wilderness', 0.8077141642570496),
 ('much', 0.8049019575119019),
 ('bind', 0.7919706106185913),
 ('would', 0.7868427634239197),
 ('matter', 0.7867603302001953),
 ('renounce', 0.7801792025566101),
 ('concrete', 0.7763912677764893),
 ('going', 0.7736693620681763),
 ('mas', 0.773422122001648),
 ('orifices', 0.7714561820030212)]