# PyTorch Workflow: Custom Text Datasets & DataLoaders

This notebook demonstrates a complete, from-scratch workflow for text classification in PyTorch.

**Key Concepts Covered:**
1.  **Custom Dataset:** Subclassing `torch.utils.data.Dataset`.
2.  **Vocabulary Building:** Mapping raw words to integer IDs.
3.  **Custom Collate Function:** Handling variable-length text sequences using padding.
4.  **Embedding Model:** A simple neural network for binary classification.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import Counter

# Set seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x7f7880749ed0>

## 1. Prepare Synthetic Data
In a real scenario, this would be loaded from a CSV or JSON file. Here we create a list of `(text, label)` tuples.

In [2]:
# Labels: 0 = Negative, 1 = Positive
raw_data = [
    ("I love this movie", 1),
    ("This was terrible", 0),
    ("I enjoyed it", 1),
    ("Disgusting food", 0),
    ("Amazing service", 1),
    ("Bad experience", 0),
    ("Really great", 1),
    ("Not good at all", 0),
    ("I hated the plot but loved the acting", 0), # Complex example
    ("Best day ever", 1)
]
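
As a rough sketch of the "real scenario" mentioned above, the same list of `(text, label)` tuples could be built from a CSV file with Python's standard `csv` module. The column names `text` and `label`, the helper `load_pairs`, and the file name in the comment are assumptions for illustration; an in-memory string stands in for a file on disk.

```python
import csv
import io

def load_pairs(file_obj):
    """Read rows with 'text' and 'label' columns into (text, label) tuples.

    The column names are assumed for this sketch; adjust them to your file.
    """
    reader = csv.DictReader(file_obj)
    return [(row["text"], int(row["label"])) for row in reader]

# In-memory stand-in for a real file, e.g. open("reviews.csv", newline="")
# ("reviews.csv" is a hypothetical name)
csv_text = "text,label\nI love this movie,1\nThis was terrible,0\n"
pairs = load_pairs(io.StringIO(csv_text))
print(pairs)  # [('I love this movie', 1), ('This was terrible', 0)]
```

The result has the same shape as `raw_data`, so it could be passed to the rest of the workflow unchanged.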

## 2. Vocabulary Builder
Neural networks operate on numbers, not strings, so each word must be mapped to an integer ID. Two indices are reserved for special tokens:

* **`<PAD>`**: Fills short sentences so they match the length of the longest one in a batch (Index 0).
* **`<UNK>`**: Stands in for words that are not in the vocabulary (Index 1).

In [3]:
class Vocabulary:
    def __init__(self, data):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.idx = 2

        # Build vocab from data
        all_text = " ".join([text for text, label in data])
        words = all_text.lower().split()
        counter = Counter(words)

        for word, count in counter.items():
            if word not in self.word2idx:
                self.word2idx[word] = self.idx
                self.idx2word[self.idx] = word
                self.idx += 1

    def encode(self, text):
        # Convert string to list of indices
        return [self.word2idx.get(w, self.word2idx["<UNK>"]) for w in text.lower().split()]

    def __len__(self):
        return len(self.word2idx)

# Initialize vocab
vocab = Vocabulary(raw_data)
print(f"Vocabulary size: {len(vocab)}")
print(f"Sample mapping: 'love' -> {vocab.word2idx.get('love')}")

Vocabulary size: 31
Sample mapping: 'love' -> 3
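
To make the role of `<UNK>` concrete, here is a small standalone sketch that mirrors the `encode` logic above on a toy mapping. The three-word vocabulary and the `decode` helper are hypothetical and are not part of the notebook's `Vocabulary` class.

```python
# Toy mapping with the same reserved indices as the notebook's Vocabulary
word2idx = {"<PAD>": 0, "<UNK>": 1, "i": 2, "love": 3, "this": 4}
idx2word = {idx: word for word, idx in word2idx.items()}

def encode(text):
    # Words missing from the mapping fall back to the <UNK> index
    return [word2idx.get(w, word2idx["<UNK>"]) for w in text.lower().split()]

def decode(indices):
    # Hypothetical inverse of encode, handy for debugging
    return " ".join(idx2word[i] for i in indices)

ids = encode("I love this soundtrack")
print(ids)          # [2, 3, 4, 1]
print(decode(ids))  # i love this <UNK>
```

Note that encoding is lossy: once a word has been mapped to `<UNK>`, the original word cannot be recovered.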


## 3. Custom Dataset Class
The `Dataset` class returns one item at a time. Its `__getitem__` method encodes the text into a tensor of integer IDs on access, so no pre-processing pass over the data is needed.

In [4]:
class TextDataset(Dataset):
    def __init__(self, data, vocabulary):
        self.data = data
        self.vocab = vocabulary

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        # Encode lazily, each time the item is accessed
        text_indices = self.vocab.encode(text)
        # nn.Embedding needs integer (long) IDs; BCELoss expects float labels
        return torch.tensor(text_indices, dtype=torch.long), torch.tensor(label, dtype=torch.float)

# Create Dataset instance
dataset = TextDataset(raw_data, vocab)

## 4. Custom Collate Function (Padding)
This is the most critical part for text data.
Because sentences have different lengths, the default collate function cannot stack them into a single tensor. Instead, we find the longest sentence in the batch and pad the others with `0`, the `<PAD>` index.

In [5]:
def collate_batch(batch):
    # batch is a list of tuples: [(tensor([1, 2]), label), (tensor([3, 4, 5]), label)]

    label_list, text_list = [], []

    for (_text, _label) in batch:
        label_list.append(_label)
        text_list.append(_text)

    # Pad sequences to the length of the longest sentence in this batch
    # padding_value=0 corresponds to "<PAD>" in our vocab
    text_stacked = pad_sequence(text_list, batch_first=True, padding_value=0)
    # Labels are 0-dim tensors: stack them, then reshape to [batch_size, 1] for BCELoss
    label_stacked = torch.stack(label_list).unsqueeze(1)

    return text_stacked, label_stacked
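
The sketch below illustrates, on two hand-written ID tensors, why padding is needed: `torch.stack` rejects tensors of different lengths, while `pad_sequence` right-pads the shorter one with zeros. The example tensors are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

shorter = torch.tensor([2, 3])
longer = torch.tensor([4, 5, 6, 7])

# Stacking tensors of unequal length raises a RuntimeError
try:
    torch.stack([shorter, longer])
    stack_failed = False
except RuntimeError:
    stack_failed = True
print("stack failed:", stack_failed)  # stack failed: True

# pad_sequence pads every sequence to the longest length in the list
padded = pad_sequence([shorter, longer], batch_first=True, padding_value=0)
print(padded.shape)  # torch.Size([2, 4])
print(padded)
```

With `batch_first=True` the result has shape `[batch_size, max_seq_len]`, which is the layout the rest of this notebook assumes.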

## 5. DataLoader
The `DataLoader` combines the dataset with the collate function, so every batch is padded as it is drawn.

In [6]:
dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collate_batch # Connects the padding logic
)

# Verify it works
text_batch, label_batch = next(iter(dataloader))
print("Batch Shape:", text_batch.shape) # [batch_size, max_seq_len_in_batch]
print("Sample Batch (padded):\n", text_batch)

Batch Shape: torch.Size([2, 8])
Sample Batch (padded):
 tensor([[ 2, 22, 23, 24, 25, 26, 23, 27],
        [ 2,  3,  4,  5,  0,  0,  0,  0]])


## 6. The Model
A standard text classification architecture: `Embedding` -> `Pooling` -> `Linear` -> `Sigmoid`.

In [7]:
class SimpleTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=16):
        super().__init__()
        # Embedding: maps ID -> Vector
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc = nn.Linear(embed_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x shape: [batch_size, seq_len]
        embedded = self.embedding(x)

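        # Simple average pooling over the sequence dimension.
        # Padded positions (zero vectors, because padding_idx=0) are included
        # in the mean, which shrinks the average for shorter sentences.
        # That is acceptable for this demo.
        pooled = embedded.mean(dim=1)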

        output = self.fc(pooled)
        return self.sigmoid(output)

model = SimpleTextClassifier(vocab_size=len(vocab))
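
The plain `mean` above also averages over padded positions. One common alternative is a masked mean that divides by the number of real tokens only. The sketch below is a possible variant, not the notebook's model; the function name `masked_mean`, the `pad_idx` argument, and the toy embedding sizes are assumptions.

```python
import torch
import torch.nn as nn

def masked_mean(embedded, token_ids, pad_idx=0):
    """Average embeddings over non-padding positions only.

    embedded:  [batch_size, seq_len, embed_dim]
    token_ids: [batch_size, seq_len]
    """
    # 1.0 for real tokens, 0.0 for padding; shape [batch_size, seq_len, 1]
    mask = (token_ids != pad_idx).unsqueeze(-1).to(embedded.dtype)
    summed = (embedded * mask).sum(dim=1)      # [batch_size, embed_dim]
    counts = mask.sum(dim=1).clamp(min=1.0)    # [batch_size, 1], avoids division by zero
    return summed / counts

torch.manual_seed(0)
embedding = nn.Embedding(10, 4, padding_idx=0)  # toy sizes for illustration
ids = torch.tensor([[2, 3, 0, 0],
                    [4, 5, 6, 7]])
embedded = embedding(ids)

pooled = masked_mean(embedded, ids)
# For the first row, the result matches the mean over its two real tokens
expected_first = embedded[0, :2].mean(dim=0)
print(torch.allclose(pooled[0], expected_first))  # True
```

To use it in the model, `forward` would call `masked_mean(embedded, x)` in place of `embedded.mean(dim=1)`.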

## 7. Training Loop

In [8]:
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.05)

epochs = 10

print("Starting Training...")
for epoch in range(epochs):
    total_loss = 0
    for text_batch, label_batch in dataloader:
        optimizer.zero_grad()
        predictions = model(text_batch)
        loss = criterion(predictions, label_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    if (epoch+1) % 2 == 0:
        print(f"Epoch {epoch+1} | Loss: {total_loss/len(dataloader):.4f}")

Starting Training...
Epoch 2 | Loss: 0.5689
Epoch 4 | Loss: 0.2684
Epoch 6 | Loss: 0.0840
Epoch 8 | Loss: 0.0307
Epoch 10 | Loss: 0.0122
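
A frequently used alternative to `Sigmoid` followed by `nn.BCELoss` is to have the model return raw logits and train with `nn.BCEWithLogitsLoss`, which applies the sigmoid internally in a numerically more stable way. The sketch below only checks that the two formulations agree on made-up values; it does not change the notebook's model.

```python
import torch
import torch.nn as nn

# Made-up logits and targets with the same [batch_size, 1] shape used above
logits = torch.tensor([[2.0], [-1.0], [0.5]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(torch.allclose(loss_from_probs, loss_from_logits, atol=1e-6))  # True
```

With this variant, `forward` would return `self.fc(pooled)` directly, and inference code would apply `torch.sigmoid` to obtain a probability.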


## 8. Inference (Testing)
Try your own sentences below.

In [9]:
def predict_sentiment(text):
    model.eval()
    with torch.no_grad():
        # 1. Encode
        indices = vocab.encode(text)
        # 2. Tensor & Add Batch Dim [1, seq_len]
        tensor_in = torch.tensor(indices).unsqueeze(0)
        # 3. Predict
        prob = model(tensor_in).item()

        label = "Positive" if prob > 0.5 else "Negative"
        print(f"Input: '{text}'")
        print(f"Prediction: {label} ({prob:.4f})\n")

# Test cases
predict_sentiment("I love this")
predict_sentiment("Absolute garbage")
predict_sentiment("Not good")

Input: 'I love this'
Prediction: Positive (0.9966)

Input: 'Absolute garbage'
Prediction: Positive (0.7468)

Input: 'Not good'
Prediction: Negative (0.0021)
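
The "Absolute garbage" result is worth a second look. Neither word appears in the training sentences, so both are encoded as `<UNK>`, an index the model never sees during training here. The prediction therefore carries little information about the input. A quick way to spot such cases is to measure how much of an input is out of vocabulary. The sketch below rebuilds the word set from the training sentences; the helper `oov_fraction` is hypothetical.

```python
train_texts = [
    "I love this movie", "This was terrible", "I enjoyed it",
    "Disgusting food", "Amazing service", "Bad experience",
    "Really great", "Not good at all",
    "I hated the plot but loved the acting", "Best day ever",
]
# Same tokenization as Vocabulary: lowercase, split on whitespace
known_words = {w for text in train_texts for w in text.lower().split()}

def oov_fraction(text):
    """Fraction of words in `text` that would be encoded as <UNK>."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w not in known_words for w in words) / len(words)

for sentence in ["I love this", "Absolute garbage", "Not good"]:
    print(f"{sentence!r}: {oov_fraction(sentence):.0%} unknown")
# 'I love this': 0% unknown
# 'Absolute garbage': 100% unknown
# 'Not good': 0% unknown
```

A prediction on an input that is mostly or entirely unknown is best treated as unreliable rather than as a confident label.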

