# 🔢 Notebook 04: Padding, Tensors & DataLoader

## From Variable Sequences to Fixed-Size Batches

This notebook teaches you how to handle a fundamental challenge of feeding text to neural networks: batched inputs must have a fixed size, but text sequences vary in length. You'll implement padding, convert the padded sequences to PyTorch tensors, and set up efficient batch loading.


## 🧠 Concept Primer: Padding and Batching

### What We're Doing
Converting variable-length encoded sequences into fixed-size tensors that neural networks can process in batches.

### Why This Step is Critical
**Neural networks require fixed-size inputs.** The challenges:
- **Variable sequence lengths** (reviews have different numbers of words)
- **Batch processing** (neural networks are faster with batches)
- **Memory efficiency** (tensors must be rectangular)
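
To make the "rectangular" requirement concrete, here is a minimal sketch using made-up token IDs (not data from this notebook). It shows that `torch.tensor` rejects a list whose rows have different lengths, and accepts the same data once every row is padded to a common length.

```python
import torch

# Hypothetical encoded reviews of different lengths (toy values).
ragged = [[5, 8, 2], [7, 3], [9, 4, 6, 1]]

try:
    torch.tensor(ragged, dtype=torch.long)
    ragged_ok = True
except (ValueError, TypeError, RuntimeError) as err:
    # PyTorch cannot build a tensor from rows of unequal length.
    ragged_ok = False
    print("Ragged input failed:", err)

# Pad every row with 0 up to the longest row so the data becomes rectangular.
longest = max(len(row) for row in ragged)
rectangular = [row + [0] * (longest - len(row)) for row in ragged]
batch = torch.tensor(rectangular, dtype=torch.long)
print(batch.shape)
```

The rest of this notebook does the same thing at scale, with a fixed target length instead of the longest row.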

### What We'll Build
- **Padding function** that extends short sequences with `<pad>` tokens
- **Truncation logic** that cuts long sequences to max length
- **Tensor conversion** using `torch.tensor()`
- **DataLoader setup** for efficient batch loading

### Shape Expectations
- **Input sequences**: Variable length → Fixed length (128)
- **Tensors**: `[batch_size, sequence_length]` → `[16, 128]`
- **Labels**: `[batch_size]` → `[16]`

### Expected Output Example
```python
for X_batch, y_batch in train_dataloader:
    print(X_batch.shape, y_batch.shape)
# Output: torch.Size([16, 128]) torch.Size([16])
```


## 🔧 TODO #1: Implement Padding Function

**Task:** Create a function that pads short sequences and truncates long ones to a fixed length.

**Hint:** If `len(seq) < max_len`, extend with the pad token's ID (`1` in the example below; use whichever index your vocabulary assigns to `<pad>`); if longer, slice `seq[:max_len]`

**Expected Function Signature:**
```python
def pad_or_truncate(seq, max_len=128):
    # Your implementation here
    return padded_seq  # List of length max_len
```

**Expected Output Example:**
```python
pad_or_truncate([2, 3, 4], max_len=5)
# Returns: [2, 3, 4, 1, 1]  # padded

pad_or_truncate([2, 3, 4, 5, 6, 7, 8], max_len=5)
# Returns: [2, 3, 4, 5, 6]  # truncated
```


In [1]:
# TODO #1: Implement padding function
# Your code here


def pad_or_truncate(sequence, max_len, padding_value=0):
    """
    Pads or truncates a sequence to a specified target length.

    Parameters:
    sequence (list): The input sequence to be padded or truncated.
    max_len (int): The desired length of the output sequence.
    padding_value (any): The value to use for padding if the sequence is shorter than the target length.

    Returns:
    list: The padded or truncated sequence.
    """
    if len(sequence) < max_len:
        # Pad the sequence
        return sequence + [padding_value] * (max_len - len(sequence))
    else:
        # Truncate the sequence
        return sequence[:max_len]

# Test the function
sample_sequence = [1, 2, 3]
padded_sequence = pad_or_truncate(sample_sequence, 5, padding_value=0)
truncated_sequence = pad_or_truncate(sample_sequence, 2)
print("Padded Sequence:", padded_sequence)        # Output: [1, 2, 3, 0, 0]
print("Truncated Sequence:", truncated_sequence)  # Output: [1, 2]

Padded Sequence: [1, 2, 3, 0, 0]
Truncated Sequence: [1, 2]
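
As a point of comparison, PyTorch ships a helper, `torch.nn.utils.rnn.pad_sequence`, that pads a list of tensors to the length of the longest one. The sketch below is an alternative to the hand-written function above, not a replacement for the exercise; the token IDs and `max_len = 5` are illustrative assumptions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

max_len = 5  # illustrative value; the notebook uses 128

# Toy encoded sequences (hypothetical token IDs).
sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10], [11]]

# Truncate first, because pad_sequence only pads (to the longest sequence given).
truncated = [torch.tensor(seq[:max_len], dtype=torch.long) for seq in sequences]

# batch_first=True gives shape [batch_size, longest_length].
padded = pad_sequence(truncated, batch_first=True, padding_value=0)
print(padded)
```

Note one difference in behavior: if every sequence is shorter than `max_len`, `pad_sequence` produces a tensor narrower than `max_len`, whereas `pad_or_truncate` always returns exactly `max_len` items.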


## 🔧 TODO #2: Encode and Pad All Texts

**Task:** Apply encoding and padding to all training texts.

**Hint:** Use list comprehension: `padded_text_seqs = [pad_or_truncate(encode_text(text, vocab)) for text in train_texts]`

**Expected Variable:**
- `padded_text_seqs` → List of lists, each inner list has length 128

**Shape Check:** All sequences should have the same length (128)
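
The shape check can be written as an assertion rather than checked by eye. The sketch below uses toy stand-ins (a five-entry `vocab`, three made-up texts, whitespace tokenization, and `max_len = 4`) so it runs on its own; none of these values come from the notebook's dataset. It also counts how many sequences lost tokens to truncation, which is one way to judge whether a chosen `max_len` is too small.

```python
# Toy stand-ins so the sketch runs on its own; in the notebook these come
# from earlier cells (vocab, encode_text, pad_or_truncate, train_texts).
vocab = {"<PAD>": 0, "<UNK>": 1, "great": 2, "movie": 3, "bad": 4}
train_texts = ["great movie", "bad bad movie with a twist", "great"]
max_len = 4  # illustrative; the notebook uses 128


def encode_text(text, vocab):
    # Whitespace tokenization keeps the example short.
    return [vocab.get(tok, vocab["<UNK>"]) for tok in text.lower().split()]


def pad_or_truncate(seq, max_len, padding_value=0):
    # Pad (a non-positive count adds nothing), then cut to max_len.
    return (seq + [padding_value] * (max_len - len(seq)))[:max_len]


encoded = [encode_text(t, vocab) for t in train_texts]
padded_text_seqs = [pad_or_truncate(s, max_len, vocab["<PAD>"]) for s in encoded]

# Shape check: fail loudly instead of eyeballing printed lengths.
assert all(len(s) == max_len for s in padded_text_seqs)

# How many sequences lost tokens to truncation at this max_len?
n_truncated = sum(len(s) > max_len for s in encoded)
print(padded_text_seqs)
print(f"{n_truncated} of {len(encoded)} sequences were truncated")
```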


In [6]:
# TODO #2: Encode and pad all texts
# Your code here
import pandas as pd
import re
from collections import Counter

def tokenize(text):
    # Use regex to find words, ignoring punctuation
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

train_reviews_df = pd.read_csv('../data/imdb_movie_reviews_train.csv')
test_reviews_df = pd.read_csv('../data/imdb_movie_reviews_test.csv')

tokenized_corpus_train = train_reviews_df['review'].apply(tokenize).tolist() 
tokenized_corpus_test = test_reviews_df['review'].apply(tokenize).tolist()

combined_corpus = [token for sublist in tokenized_corpus_train + tokenized_corpus_test for token in sublist]

word_freqs = Counter(combined_corpus)
print(word_freqs.most_common(10))

max_vocab_size = 1002
most_common_words = word_freqs.most_common(max_vocab_size - 2)  # Reserve 2 for <PAD> and <UNK>
vocab = {'<PAD>': 0, '<UNK>': 1, **{word: idx + 2 for idx, (word, _) in enumerate(most_common_words)}}
print(list(vocab.items())[:10])  # Print first 10 items in vocabulary dictionary
vocab_size = len(vocab)

def encode_text(text, vocab):
    tokens = tokenize(text)
    encoded = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    return encoded

encoded_reviews_train = train_reviews_df['review'].apply(lambda x: encode_text(x, vocab)).tolist()
encoded_reviews_test = test_reviews_df['review'].apply(lambda x: encode_text(x, vocab)).tolist()

max_sequence_length = 128
padded_encoded_reviews_train = [pad_or_truncate(seq, max_sequence_length, padding_value=vocab['<PAD>']) for seq in encoded_reviews_train]
padded_encoded_reviews_test = [pad_or_truncate(seq, max_sequence_length, padding_value=vocab['<PAD>']) for seq in encoded_reviews_test]
print("First encoded and padded review (train):", padded_encoded_reviews_train[0])
print("First encoded and padded review (test):", padded_encoded_reviews_test[0])
print("Checking that all sequences are of length", max_sequence_length)
print("Train lengths:", [len(seq) for seq in padded_encoded_reviews_train][:5])
print("Test lengths:", [len(seq) for seq in padded_encoded_reviews_test][:5])
print("Train lengths (last 5):", [len(seq) for seq in padded_encoded_reviews_train][-5:])


[('the', 1009), ('a', 423), ('of', 405), ('and', 399), ('to', 295), ('is', 295), ('in', 235), ('it', 191), ('s', 147), ('that', 140)]
[('<PAD>', 0), ('<UNK>', 1), ('the', 2), ('a', 3), ('of', 4), ('and', 5), ('to', 6), ('is', 7), ('in', 8), ('it', 9)]
First encoded and padded review (train): [1, 242, 536, 131, 27, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
First encoded and padded review (test): [2, 66, 7, 1, 9, 36, 292, 135, 136, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## 🔧 TODO #3: Convert to PyTorch Tensors

**Task:** Convert padded sequences and labels to PyTorch tensors.

**Hint:** Use `torch.tensor(padded_text_seqs, dtype=torch.long)` and `torch.tensor(train_labels, dtype=torch.long)`

**Expected Variables:**
- `X_tensor` → Tensor of shape `[num_samples, 128]`
- `y_tensor` → Tensor of shape `[num_samples]`

**Data Types:** Use `torch.long` for both (integers for embeddings and class labels)
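
The following sketch illustrates why integer tensors are needed on both sides. The embedding size, vocabulary size, and token IDs are illustrative assumptions, and the "logits" are a stand-in for a real model's output, not the architecture used in this course.

```python
import torch
import torch.nn as nn

# Toy batch: 2 sequences of length 4 (hypothetical token IDs), 2 class labels.
X = torch.tensor([[2, 5, 0, 0], [7, 1, 3, 0]], dtype=torch.long)
y = torch.tensor([0, 1], dtype=torch.long)

# Embedding sizes are illustrative, not the notebook's model.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)
embedded = embedding(X)  # integer IDs index rows of the embedding table
print(embedded.shape)    # [batch_size, sequence_length, embedding_dim]

# CrossEntropyLoss takes class indices as a long tensor of shape [batch_size].
logits = embedded.mean(dim=1)[:, :2]  # stand-in for a model's 2-class output
loss = nn.CrossEntropyLoss()(logits, y)
print(loss.item())

# Float indices are rejected by the embedding lookup.
try:
    embedding(X.float())
    float_ok = True
except (RuntimeError, TypeError, IndexError) as err:
    float_ok = False
    print("Float indices failed:", err)
```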


In [7]:
# TODO #3: Convert to PyTorch tensors
import torch

# Your code here
X_tensor_train = torch.tensor(padded_encoded_reviews_train, dtype=torch.long)
X_tensor_test = torch.tensor(padded_encoded_reviews_test, dtype=torch.long)
y_tensor_train = torch.tensor(train_reviews_df['aspect_encoded'].values, dtype=torch.long)
y_tensor_test = torch.tensor(test_reviews_df['aspect_encoded'].values, dtype=torch.long)
print("X_tensor_train shape:", X_tensor_train.shape)
print("y_tensor_train shape:", y_tensor_train.shape)
print("X_tensor_test shape:", X_tensor_test.shape)
print("y_tensor_test shape:", y_tensor_test.shape)

X_tensor_train shape: torch.Size([369, 128])
y_tensor_train shape: torch.Size([369])
X_tensor_test shape: torch.Size([132, 128])
y_tensor_test shape: torch.Size([132])


## 🔧 TODO #4: Create DataLoader

**Task:** Build a PyTorch `DataLoader` for efficient batch processing.

**Hint:** Use `TensorDataset(X_tensor, y_tensor)` then `DataLoader(train_dataset, batch_size=16, shuffle=True)`

**Expected Variables:**
- `train_dataset` → TensorDataset combining features and labels
- `train_dataloader` → DataLoader with batch size 16 and shuffling

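**Expected Output:** When you iterate through the DataLoader, you should get batches of shape `[16, 128]` and `[16]`. The final batch can be smaller when the dataset size is not a multiple of 16, unless you pass `drop_last=True`.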
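
Here is a small sketch of how the batch count and final batch size work out. It uses synthetic all-zero tensors with 369 rows, matching the training set size printed earlier in this notebook; the values themselves carry no meaning.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in data: 369 samples of length 128, all zeros.
X = torch.zeros(369, 128, dtype=torch.long)
y = torch.zeros(369, dtype=torch.long)
dataset = TensorDataset(X, y)

loader = DataLoader(dataset, batch_size=16, shuffle=True)
batch_sizes = [X_batch.shape[0] for X_batch, _ in loader]
print(len(loader), batch_sizes[-1])   # 24 batches; the last one holds 1 sample

# drop_last=True discards the incomplete final batch.
loader_drop = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)
print(len(loader_drop))               # 23
```

Whether to drop the last batch is a design choice: a single-sample batch gives a noisy gradient, but dropping it means one sample per epoch goes unused.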


In [8]:
# TODO #4: Create DataLoader
from torch.utils.data import TensorDataset, DataLoader

# Your code here
batch_size = 16
train_dataset = TensorDataset(X_tensor_train, y_tensor_train)
test_dataset = TensorDataset(X_tensor_test, y_tensor_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
print("Number of batches in train_loader:", len(train_loader))
print("Number of batches in test_loader:", len(test_loader))
# Shape of first batch
for X_batch, y_batch in train_loader:
    print("X_batch shape:", X_batch.shape)
    print("y_batch shape:", y_batch.shape)
    break




Number of batches in train_loader: 24
Number of batches in test_loader: 9
X_batch shape: torch.Size([16, 128])
y_batch shape: torch.Size([16])


## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Why shuffle training data?** What would happen if you didn't shuffle?

2. **What's the tradeoff of max_len=128 vs 256?** Consider memory usage and information loss.

3. **Why use `torch.long` for both inputs and labels?** What would happen with other data types?

4. **How does padding affect the model's understanding?** Will the model "see" the padding tokens?

### 🎯 Batching Strategy
- Why is batch processing more efficient than processing one sample at a time?
- How does the batch size of 16 affect training speed vs memory usage?
- What happens if your dataset size isn't divisible by batch size?

---

**Write your reflections here:**


## 📝 My Reflections

### 🤔 Understanding Check Answers

1. **Why shuffle training data?**
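   - Shuffling changes the order of samples each epoch, so every mini-batch is a roughly random sample of the dataset
   - If the data is stored in a meaningful order (for example, grouped by label), unshuffled batches would each contain mostly one class, and the gradient updates would swing toward whichever class came last
   - Shuffling also keeps the model from seeing identical batches in the same order every epoch, which pushes it to learn patterns in the text rather than the ordering of the samples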

2. **What's the tradeoff of max_len=128 vs 256?**
   - **Shorter sequences (128)**: Less memory and computation per batch, but any review longer than 128 tokens loses everything past the cut-off
   - **Longer sequences (256)**: Fewer reviews are truncated, but each batch takes twice the memory for its inputs, and short reviews carry even more padding
   - The right value depends on the length distribution of the reviews: it balances computational cost against information retention

3. **Why use `torch.long` for both inputs and labels?**
   - `torch.long` represents 64-bit integers, which are needed for:
     - **Inputs**: Word indices for embedding lookup (must be integers)
     - **Labels**: Class indices for classification (must be integers)
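   - Other data types like `float` cannot serve as discrete indices: `nn.Embedding` raises an error on float input, and `nn.CrossEntropyLoss` expects class-index targets to be `torch.long`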

4. **How does padding affect the model's understanding?**
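   - The model does see the padding tokens (0s): the embedding layer looks up index 0 like any other token, so padding is not ignored automatically
   - Setting `padding_idx=0` in `nn.Embedding` keeps the padding vector at zero and excludes it from gradient updates; a mask can additionally exclude padded positions from pooling or attention
   - Without either mechanism, the model has to learn from the data that index 0 carries no meaning, and heavily padded reviews can dilute the signal from the real tokens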
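
One common way to keep padding out of the computation is sketched below, assuming a simple mean-pooling model (an assumption for illustration; this notebook does not define the model). `PAD_ID = 0` matches `vocab['<PAD>']` above, while the token IDs and layer sizes are made up.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # matches vocab['<PAD>'] in this notebook

# Toy batch (hypothetical token IDs): the second row is mostly padding.
X = torch.tensor([[4, 7, 2, 9], [5, 0, 0, 0]], dtype=torch.long)

# padding_idx keeps the PAD row of the embedding table at zero and
# excludes it from gradient updates. Sizes here are illustrative.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=PAD_ID)
embedded = embedding(X)                       # [2, 4, 3]

# Mask is 1.0 for real tokens and 0.0 for padding.
mask = (X != PAD_ID).unsqueeze(-1).float()    # [2, 4, 1]

# Masked mean: divide by the number of real tokens, not by sequence length.
summed = (embedded * mask).sum(dim=1)
counts = mask.sum(dim=1).clamp(min=1.0)
pooled = summed / counts                      # [2, 3]
print(pooled.shape)
```

With the masked mean, the second row's pooled vector equals the embedding of its single real token; a plain `embedded.mean(dim=1)` would instead shrink it toward zero by a factor of four.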

### 🎯 Batching Strategy Analysis

**Why batch processing is more efficient:**
- Processing one sample at a time gives a noisy gradient estimate for every update and leaves the hardware underused
- Batches allow parallel computation and more stable gradient updates, because the loss is averaged over several samples
- A batch size of 16 is small enough to fit comfortably in memory, even when training on a CPU

**Batch size considerations:**
- Batch size of 16 balances training speed with memory usage
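- The dataset size does not need to be divisible by the batch size: the `DataLoader` returns a smaller final batch (here 369 = 23 × 16 + 1, so the 24th training batch holds a single sample), or skips it when `drop_last=True`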
- Larger batches provide more stable gradients but require more memory
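
The memory side of this tradeoff can be estimated with simple arithmetic. The sketch below measures only the input token IDs for one batch; `batch_bytes` is a hypothetical helper, and a real model's activations and gradients would add considerably more on top.

```python
import torch


def batch_bytes(batch_size, max_len):
    # One input batch of token IDs stored as 64-bit integers (torch.long).
    X = torch.zeros(batch_size, max_len, dtype=torch.long)
    return X.element_size() * X.nelement()


for batch_size, max_len in [(16, 128), (16, 256), (64, 128)]:
    print(batch_size, max_len, batch_bytes(batch_size, max_len), "bytes")
```

Doubling `max_len` or doubling the batch size each doubles the input footprint, since the tensor holds `batch_size * max_len` elements of 8 bytes each.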

**Key insight:** Batching stacks individual samples into one tensor so the hardware can process them in parallel, and averaging the loss over the batch gives a less noisy gradient than a single sample would, all while keeping memory use bounded.
