# 03 — Sequences, Dataset & DataLoader
## Creating Training Data with Sliding Windows

---


## 🎯 Concept Primer

### What Are Sliding Windows?

Language modeling uses **sliding windows** to create (input, target) pairs:

**Example:**
```
Text IDs: [5, 8, 12, 15, 20, 23, 30, 35]
seq_length = 4

Sample 1:
  Features: [5, 8, 12, 15]     ← input sequence
  Labels:   [8, 12, 15, 20]    ← shifted right by 1

Sample 2:
  Features: [8, 12, 15, 20]
  Labels:   [12, 15, 20, 23]

...and so on
```

**Key Insight**: The labels are the features moved one position ahead in the text, so `labels[i]` is the ID that follows `features[i]`. This teaches the model: *given the characters so far, predict the next one*.
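
As a minimal sketch of the windowing described above, the plain-Python snippet below builds every (features, labels) pair from the example IDs. The helper name `make_windows` is a hypothetical name used for illustration only; it is not part of the notebook.

```python
def make_windows(ids, seq_length):
    """Return all (features, labels) pairs, each of length seq_length."""
    pairs = []
    # Stop early enough that the label slice still has seq_length items.
    for start in range(len(ids) - seq_length):
        features = ids[start : start + seq_length]
        labels = ids[start + 1 : start + seq_length + 1]
        pairs.append((features, labels))
    return pairs


text_ids = [5, 8, 12, 15, 20, 23, 30, 35]
pairs = make_windows(text_ids, seq_length=4)

for features, labels in pairs:
    print(features, "->", labels)
```

With 8 IDs and `seq_length = 4` this yields 4 pairs, and in every pair `labels[:-1]` equals `features[1:]`.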

### Why PyTorch Dataset & DataLoader?

**Dataset**: Custom class that:
- Stores the data (our ID sequence)
- Knows how many samples exist (`__len__`)
- Can fetch one sample at a time (`__getitem__`)

**DataLoader**: Handles:
- Batching (grouping multiple samples)
- Shuffling (randomizing order each epoch)
- Parallel loading (optional, not needed here)
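
To make the division of labor concrete, here is a conceptual sketch of the batching and shuffling that a loader performs over sample indices. It is plain Python and only illustrates the idea; the real `DataLoader` does more (samplers, collation into tensors, worker processes). The function name `iterate_batches` and its `seed` parameter are assumptions for this example.

```python
import random


def iterate_batches(num_samples, batch_size, shuffle=True, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(num_samples))
    if shuffle:
        # A fresh permutation of the indices; a loader does this once per epoch.
        random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start : start + batch_size]


batches = list(iterate_batches(num_samples=10, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Every index appears exactly once per pass, and the final batch is smaller when the sample count is not a multiple of the batch size.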

### What Breaks If We Skip This?

- No batching = training one sample at a time (slow!)
- Manual batching = error-prone, hard to maintain
- No Dataset = can't use PyTorch's ecosystem

### Shapes
- **Single sample features**: `[seq_length]` (e.g., `[48]`)
- **Single sample labels**: `[seq_length]` (e.g., `[48]`)
- **Batch features**: `[batch_size, seq_length]` (e.g., `[36, 48]`)
- **Batch labels**: `[batch_size, seq_length]` (e.g., `[36, 48]`)
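
The relationship between sample and batch shapes can be checked with a short sketch. It assumes that batching stacks equally sized sample tensors along a new leading dimension, which is what `torch.stack` does here; the zero-filled tensors are placeholders, not real data.

```python
import torch

seq_length, batch_size = 48, 36

# Placeholder samples: one 1-D tensor of IDs per sample.
samples = [torch.zeros(seq_length, dtype=torch.long) for _ in range(batch_size)]

# Stacking adds a new dimension 0 that indexes the samples in the batch.
batch = torch.stack(samples)

print(samples[0].shape)  # torch.Size([48])
print(batch.shape)       # torch.Size([36, 48])
```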

---


## ✅ Objectives

By the end of this notebook, you should:

- [ ] Understand sliding window concept for sequence prediction
- [ ] Create a custom `TextDataset` class with `__init__`, `__len__`, `__getitem__`
- [ ] Implement feature/label shifting (labels are the features window moved one position ahead, not the ID values plus 1)
- [ ] Instantiate the dataset with `seq_length=48`
- [ ] Create a `DataLoader` with `batch_size=36`, `shuffle=True`
- [ ] Print one batch to verify shapes: `[36, 48]`

---


## 🎓 Acceptance Criteria

**You pass this notebook when:**

✅ `TextDataset` class is defined with all three methods  
✅ One batch prints with shapes `[36, 48]` for both features and labels  
✅ You can explain: "Why are labels shifted by 1?"  
✅ You understand: "What does `__len__` return and why `len(ids) - seq_length`?"

---


## 📝 TODO 0: Setup & Imports

**Note:** Rebuild the variables from the previous notebooks and import PyTorch.


In [None]:
# Imports
import torch
from torch.utils.data import Dataset, DataLoader

# Rebuild the data from previous notebooks
with open('../datasets/frankenstein.txt', 'r', encoding='utf-8') as f:
    frankenstein = f.read()

# Character slice from the previous notebooks (8230 - 1380 = 6,850 characters)
first_letter_text = frankenstein[1380:8230]
tokenized_text = list(first_letter_text)             # one token per character
unique_char_tokens = sorted(set(tokenized_text))     # sorted for a stable ID order
c2ix = {char: idx for idx, char in enumerate(unique_char_tokens)}  # char -> ID
ix2c = {idx: char for char, idx in c2ix.items()}                   # ID -> char
vocab_size = len(c2ix)
tokenized_id_text = [c2ix[char] for char in tokenized_text]        # text as IDs

print(f"Data loaded: {len(tokenized_id_text)} IDs, vocab_size={vocab_size}")


## 📝 TODO 1: Define TextDataset Class — `__init__`

**Hint:**  
Store the ID list and sequence length.

**Steps:**
1. Define class `TextDataset` that inherits from `torch.utils.data.Dataset`
2. In `__init__(self, tokenized_ids, seq_length)`:
   - Store `self.ids = tokenized_ids`
   - Store `self.seq_length = seq_length`

**Why?**  
The `__init__` method saves the data and hyperparameters we'll need in other methods.


In [None]:
# TODO: Define TextDataset class with __init__

class TextDataset(Dataset):
    def __init__(self, tokenized_ids, seq_length):
        # TODO: Store tokenized_ids and seq_length
        # self.ids = ...
        # self.seq_length = ...
        pass  # Remove this line after implementing
    
    # We'll add __len__ and __getitem__ next


## 📝 TODO 2: Implement `__len__` Method

**Hint:**  
How many sliding windows can we fit?

**Logic:**
- We have `len(self.ids)` total IDs
- Each window needs `seq_length` IDs for features + 1 more for the last label
- Number of samples = `len(self.ids) - seq_length`

**Example:**
```
IDs: [1, 2, 3, 4, 5, 6]  (6 IDs)
seq_length = 3

Sample 0: features=[1,2,3], labels=[2,3,4]
Sample 1: features=[2,3,4], labels=[3,4,5]
Sample 2: features=[3,4,5], labels=[4,5,6]

Total samples = 6 - 3 = 3
```

**Return:**  
`len(self.ids) - self.seq_length`
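
The formula can be sanity-checked by brute force. The sketch below lists every start index that leaves room for `seq_length` feature IDs plus one extra label ID, and compares the count with `len(ids) - seq_length`. The variable names are illustrative.

```python
ids = [1, 2, 3, 4, 5, 6]
seq_length = 3

# A start index is valid if features plus one extra label ID fit in the list.
valid_starts = [
    start
    for start in range(len(ids))
    if start + seq_length + 1 <= len(ids)
]

print(valid_starts)            # [0, 1, 2]
print(len(ids) - seq_length)   # 3
```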


In [None]:
# TODO: Add __len__ to TextDataset

class TextDataset(Dataset):
    def __init__(self, tokenized_ids, seq_length):
        self.ids = tokenized_ids
        self.seq_length = seq_length
    
    def __len__(self):
        # TODO: Return the number of possible sliding windows
        # return len(self.ids) - self.seq_length
        pass  # Remove this line after implementing
    
    # We'll add __getitem__ next


## 📝 TODO 3: Implement `__getitem__` Method

**Hint:**  
Slice the ID sequence to get features and labels.

**Logic:**
- Features: `self.ids[idx : idx + seq_length]`
- Labels: `self.ids[idx + 1 : idx + seq_length + 1]` (shifted by 1)

**Example:**
```
self.ids = [10, 20, 30, 40, 50, 60]
seq_length = 3
idx = 1

Features: self.ids[1:4] = [20, 30, 40]
Labels:   self.ids[2:5] = [30, 40, 50]  ← shifted right by 1
```
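
The slicing can be tried on plain lists before any tensors are involved. This sketch (the helper `get_pair` is a hypothetical name) also shows why `__len__` must not overcount: Python slices past the end of a list are silently truncated, so an out-of-range index produces labels that are too short instead of raising an error.

```python
ids = [10, 20, 30, 40, 50, 60]
seq_length = 3


def get_pair(idx):
    features = ids[idx : idx + seq_length]
    labels = ids[idx + 1 : idx + seq_length + 1]
    return features, labels


last_valid = len(ids) - seq_length - 1  # 2

print(get_pair(1))               # ([20, 30, 40], [30, 40, 50])
print(get_pair(last_valid))      # ([30, 40, 50], [40, 50, 60])
print(get_pair(last_valid + 1))  # ([40, 50, 60], [50, 60]): labels too short
```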

**Return:**  
- Convert to PyTorch tensors: `torch.tensor(..., dtype=torch.long)`
- Return tuple: `(features_tensor, labels_tensor)`

**Why `torch.long`?**  
IDs are integer class indices, and `nn.CrossEntropyLoss` expects class-index targets of dtype `torch.long` (int64).
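
A small sketch of the dtype requirement, using a made-up vocabulary of 5 classes and all-zero logits (both are assumptions for illustration, not values from this notebook). With uniform scores the loss equals `log(vocab_size)`.

```python
import math

import torch
import torch.nn as nn

vocab_size = 5  # hypothetical toy vocabulary, not the notebook's vocab_size

labels = torch.tensor([1, 0, 4], dtype=torch.long)  # class indices
logits = torch.zeros(3, vocab_size)                 # uniform scores per class

loss = nn.CrossEntropyLoss()(logits, labels)

print(labels.dtype)                      # torch.int64
print(loss.item(), math.log(vocab_size))  # both about 1.609
```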


In [None]:
# TODO: Complete TextDataset with __getitem__

class TextDataset(Dataset):
    def __init__(self, tokenized_ids, seq_length):
        self.ids = tokenized_ids
        self.seq_length = seq_length
    
    def __len__(self):
        return len(self.ids) - self.seq_length
    
    def __getitem__(self, idx):
        # TODO: Slice features and labels
        # features = self.ids[idx : idx + self.seq_length]
        # labels = self.ids[idx + 1 : idx + self.seq_length + 1]
        
        # TODO: Convert to tensors
        # features_tensor = torch.tensor(features, dtype=torch.long)
        # labels_tensor = torch.tensor(labels, dtype=torch.long)
        
        # TODO: Return the tuple
        # return features_tensor, labels_tensor
        pass  # Remove this line after implementing


## 📝 TODO 4: Instantiate the Dataset

**Hint:**  
Create an instance with your tokenized IDs and `seq_length=48`.

**Steps:**
1. `dataset = TextDataset(tokenized_id_text, seq_length=48)`
2. Print `len(dataset)` to see how many samples


In [None]:
# TODO: Create dataset instance
# dataset = TextDataset(tokenized_id_text, seq_length=48)

dataset = None  # Replace this line

# Verify
if dataset is not None:
    print(f"Dataset created with {len(dataset)} samples")
    
    # Test one sample
    features, labels = dataset[0]
    print("\nSample 0:")
    print(f"  Features shape: {features.shape}")
    print(f"  Labels shape: {labels.shape}")
    print(f"  First 10 feature IDs: {features[:10].tolist()}")
    print(f"  First 10 label IDs: {labels[:10].tolist()}")


## 📝 TODO 5: Create DataLoader

**Hint:**  
Use `torch.utils.data.DataLoader` to batch and shuffle.

**Steps:**
1. Create `DataLoader(dataset, batch_size=36, shuffle=True)`
2. Why `shuffle=True`? It re-randomizes the sample order every epoch. Neighboring windows overlap almost entirely, so unshuffled batches would be full of highly correlated samples.

**Batch size 36:**  
Arbitrary choice. Smaller batches give more updates per epoch but noisier gradients. Larger batches give more stable gradients but fewer updates per epoch.
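
The following sketch runs a `DataLoader` over a tiny stand-in dataset so the batch shapes are easy to inspect. The class name `ToyWindows`, the 20-ID input, and the batch size of 5 are assumptions chosen for illustration; they are not the notebook's `TextDataset` or its settings.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyWindows(Dataset):
    """Hypothetical stand-in for a sliding-window dataset."""

    def __init__(self, ids, seq_length):
        self.ids = ids
        self.seq_length = seq_length

    def __len__(self):
        return len(self.ids) - self.seq_length

    def __getitem__(self, idx):
        x = self.ids[idx : idx + self.seq_length]
        y = self.ids[idx + 1 : idx + self.seq_length + 1]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)


toy = ToyWindows(list(range(20)), seq_length=4)  # 20 - 4 = 16 samples
loader = DataLoader(toy, batch_size=5, shuffle=True)

batches = list(loader)
shapes = [tuple(x.shape) for x, _ in batches]
print(shapes)  # [(5, 4), (5, 4), (5, 4), (1, 4)]

# drop_last=True discards the incomplete final batch.
full_only = DataLoader(toy, batch_size=5, shuffle=True, drop_last=True)
print(len(loader), len(full_only))  # 4 3
```

Shuffling changes which samples land in which batch, but not the shapes. The last batch holds the single leftover sample because 16 is not a multiple of 5.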


In [None]:
# TODO: Create DataLoader
# dataloader = DataLoader(dataset, batch_size=36, shuffle=True)

dataloader = None  # Replace this line

# Verify
if dataloader is not None:
    print(f"DataLoader created with batch_size={dataloader.batch_size}, shuffle=True")


## 📝 TODO 6: Test the DataLoader

**Hint:**  
Iterate once to get a batch and print shapes.

**Steps:**
1. Loop: `for batch_features, batch_labels in dataloader:`
2. Print shapes (the first batch should be `[36, 48]` for both; the last batch of an epoch can be smaller unless `drop_last=True`)
3. `break` after first batch


In [None]:
# TODO: Get one batch and print shapes
# for batch_features, batch_labels in dataloader:
#     print(f"Batch features shape: {batch_features.shape}")
#     print(f"Batch labels shape: {batch_labels.shape}")
#     print(f"\nFirst sample in batch:")
#     print(f"  Features: {batch_features[0][:20]}")  # First 20 IDs
#     print(f"  Labels:   {batch_labels[0][:20]}")
#     break

# Your code here


## 💭 Reflection Prompts

**Write your observations:**

1. **Shifting by 1**: Why are labels exactly the same as features but shifted right by one position?

2. **Dataset length**: If we have 6,850 IDs and `seq_length=48`, how many samples do we get? Why?

3. **Batch shape**: What does `[36, 48]` mean? (36 = ?, 48 = ?)

4. **Shuffling**: What would happen if we set `shuffle=False`? Would the model still learn?

5. **Why tensors**: Why do we convert to `torch.tensor` instead of keeping as Python lists?

---


## 🚀 Next Steps

Once you've completed all TODOs and printed batch shapes:

➡️ **Move to Notebook 04**: Define the LSTM model architecture

---

## 📌 Key Takeaways

- ✅ Sliding windows create (input, target) pairs for training
- ✅ Labels = Features shifted by 1 position
- ✅ `Dataset` knows how to fetch one sample
- ✅ `DataLoader` handles batching and shuffling
- ✅ Batch shape `[B, T]` where B=batch_size, T=seq_length
- ✅ The model consumes PyTorch tensors, so IDs are converted with `dtype=torch.long`

---

*Next up: Building the LSTM model that will process these batches!*
