# 02 - Vocabulary Building and Data Loading

## Learning Objectives
- Build vocabulary from cleaned text data
- Create efficient PyTorch data loaders
- Implement text-to-sequence conversion
- Handle variable-length sequences
- Understand PyTorch Dataset and DataLoader concepts

## Phase 1: PyTorch Fundamentals 🧠
*Build everything from scratch to understand the foundations*

## Phase 2: Transformers Enhancement 🚀
*Enhance with modern NLP tools after mastering fundamentals*

---

## Overview

In this notebook, you'll build the data pipeline that converts your cleaned text into numerical sequences that PyTorch models can process. This involves:

1. **Vocabulary Building**: Creating word-to-index mappings
2. **Text-to-Sequence Conversion**: Converting text to numerical sequences
3. **PyTorch Dataset**: Creating custom dataset classes
4. **Data Loading**: Implementing efficient data loaders with batching
5. **Pipeline Validation**: Testing and optimizing your data pipeline

## Prerequisites

Make sure you have completed:
- ✅ `00_exploration.ipynb` - Data exploration and EDA
- ✅ `01_preprocessing.ipynb` - Text cleaning and preprocessing

You'll be working with the cleaned datasets from the `data/interim/` folder.

---

## TODO 1: Build Vocabulary

**Goal**: Create word-to-index mapping from cleaned text

**Steps**:
1. Load cleaned text data from `data/interim/`
2. Tokenize text into words
3. Build vocabulary with special tokens:
   - `<PAD>` for padding (index 0)
   - `<UNK>` for unknown words (index 1)
   - `<START>` and `<END>` tokens (optional)
4. Calculate vocabulary statistics

**Hint**: Use `collections.Counter` to count word frequencies, and set a minimum frequency threshold to drop rare words

**Expected Output**: Word-to-index mapping and vocabulary size
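
One way the steps above could look, as a minimal sketch rather than a reference solution. The function name `build_vocab`, the `min_freq` default, the whitespace tokenizer, and the sample texts are all assumptions made for illustration; in the notebook you would pass the cleaned text loaded from `data/interim/`.

```python
from collections import Counter

PAD_TOKEN, UNK_TOKEN = "<PAD>", "<UNK>"


def build_vocab(texts, min_freq=2):
    """Map each sufficiently frequent word to an integer index."""
    # Whitespace tokenization is an assumption; swap in your own tokenizer.
    counts = Counter(word for text in texts for word in text.lower().split())

    # Reserve the indices named in the steps: <PAD> = 0, <UNK> = 1.
    word2idx = {PAD_TOKEN: 0, UNK_TOKEN: 1}

    # Sort by descending frequency, then alphabetically, so that the
    # resulting indices are reproducible across runs.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx, counts


# Hypothetical sample texts standing in for the cleaned tweets.
sample_texts = [
    "forest fire near the town",
    "fire crews reach the town",
    "lovely weather in the park",
]
word2idx, counts = build_vocab(sample_texts, min_freq=2)
print(f"Vocabulary size: {len(word2idx)}")
print(word2idx)
```

Words below the threshold are left out of the mapping, so they fall back to `<UNK>` at conversion time. Raising `min_freq` shrinks the vocabulary at the cost of more unknown tokens.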


In [None]:
# TODO 1: Build vocabulary
# Your implementation here


## TODO 2: Text to Sequence Conversion

**Goal**: Convert text to numerical sequences

**Steps**:
1. Implement function to convert text to sequence of indices
2. Handle unknown words with `<UNK>` token
3. Add padding to sequences for batch processing
4. Test conversion on sample texts

**Hint**: Use the vocabulary mapping from TODO 1 and cap sequences at a maximum length (truncate longer texts, pad shorter ones)

**Expected Output**: Numerical sequences ready for model input
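
The conversion might be sketched as follows. The function name `text_to_sequence`, the `max_len` default, and the toy vocabulary are hypothetical; the `<PAD>` = 0 and `<UNK>` = 1 indices follow the convention from TODO 1.

```python
def text_to_sequence(text, word2idx, max_len=8, pad_idx=0, unk_idx=1):
    """Convert text to a fixed-length list of vocabulary indices."""
    # Words missing from the vocabulary map to the <UNK> index.
    indices = [word2idx.get(word, unk_idx) for word in text.lower().split()]
    # Truncate sequences that exceed max_len.
    indices = indices[:max_len]
    # Right-pad shorter sequences with the <PAD> index.
    return indices + [pad_idx] * (max_len - len(indices))


# Hypothetical toy vocabulary for demonstration.
word2idx = {"<PAD>": 0, "<UNK>": 1, "the": 2, "fire": 3, "town": 4}

seq = text_to_sequence("Fire near the town", word2idx, max_len=6)
print(seq)  # [3, 1, 2, 4, 0, 0]
```

Here `near` is not in the vocabulary and becomes `1`, and the two trailing zeros are padding. Padding to a fixed length at this stage is one option; padding per batch in a collate function (TODO 4) usually wastes less memory.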


In [None]:
# TODO 2: Text to sequence conversion
# Your implementation here


## TODO 3: Custom PyTorch Dataset

**Goal**: Create PyTorch Dataset class for disaster tweets

**Steps**:
1. Inherit from `torch.utils.data.Dataset`
2. Implement `__len__` and `__getitem__` methods
3. Handle text-to-sequence conversion
4. Return tensors for text and labels

**Hint**: Use `torch.tensor()` for tensor creation, with `dtype=torch.long` for token indices

**Expected Output**: Custom dataset class ready for DataLoader
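
A possible shape for such a class is sketched below. The class name `TweetDataset`, its constructor arguments, and the toy data are assumptions, not part of the notebook. Sequences are returned unpadded on the assumption that padding happens later, at batch time.

```python
import torch
from torch.utils.data import Dataset


class TweetDataset(Dataset):
    """Hypothetical dataset returning (token_indices, label) tensor pairs."""

    def __init__(self, texts, labels, word2idx, max_len=50, unk_idx=1):
        self.texts = list(texts)
        self.labels = list(labels)
        self.word2idx = word2idx
        self.max_len = max_len
        self.unk_idx = unk_idx

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize on whitespace (an assumption) and truncate to max_len.
        words = self.texts[idx].lower().split()[: self.max_len]
        indices = [self.word2idx.get(w, self.unk_idx) for w in words]
        # Long labels suit nn.CrossEntropyLoss; a float dtype would be
        # needed for nn.BCEWithLogitsLoss instead.
        return (
            torch.tensor(indices, dtype=torch.long),
            torch.tensor(self.labels[idx], dtype=torch.long),
        )


# Hypothetical toy vocabulary and samples.
word2idx = {"<PAD>": 0, "<UNK>": 1, "the": 2, "fire": 3, "town": 4}
dataset = TweetDataset(["fire near the town", "lovely weather"], [1, 0], word2idx)

tokens, label = dataset[0]
print(len(dataset), tokens, label)
```

Doing the conversion inside `__getitem__` keeps the raw text around for debugging; pre-converting all texts in `__init__` is a reasonable alternative when the dataset fits in memory.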


In [None]:
# TODO 3: Custom PyTorch dataset
# Your implementation here


## TODO 4: Data Loading and Batching

**Goal**: Create efficient data loaders with proper batching

**Steps**:
1. Create train/validation/test splits
2. Implement custom collate function for variable-length sequences
3. Create DataLoaders with appropriate batch sizes
4. Test data loading pipeline

**Hint**: Use `torch.nn.utils.rnn.pad_sequence` inside the collate function, with `batch_first=True` and `padding_value=0` to match the `<PAD>` index

**Expected Output**: Efficient data loading pipeline ready for training
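
The splitting and batching steps could be sketched like this. The name `collate_batch`, the 6/2/2 split, the batch size, and the synthetic data are illustrative assumptions. `random_split` is used here for brevity and does not stratify by label, which may matter for an imbalanced dataset.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, random_split

PAD_IDX = 0  # matches the <PAD> index from TODO 1


def collate_batch(batch):
    """Pad the variable-length sequences in a batch to the longest one."""
    sequences, labels = zip(*batch)
    # Keep the true lengths; they are useful for packing or masking later.
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=PAD_IDX)
    return padded, torch.stack(labels), lengths


# Hypothetical toy data: 10 already-encoded (sequence, label) pairs with
# lengths 1..10. Token values start at 2, so 0 only ever means padding.
data = [
    (torch.arange(2, 2 + n, dtype=torch.long), torch.tensor(n % 2))
    for n in range(1, 11)
]

# A seeded generator makes the split reproducible.
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(data, [6, 2, 2], generator=generator)

train_loader = DataLoader(train_set, batch_size=3, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_set, batch_size=3, shuffle=False, collate_fn=collate_batch)

padded, labels, lengths = next(iter(train_loader))
print(padded.shape, labels.shape, lengths)
```

Each batch is padded only to its own longest sequence, so the second dimension of `padded` can differ from batch to batch.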


In [None]:
# TODO 4: Data loading and batching
# Your implementation here


## TODO 5: Vocabulary Analysis and Optimization

**Goal**: Analyze vocabulary characteristics and optimize for modeling

**Steps**:
1. Calculate vocabulary statistics (size, coverage, frequency distribution)
2. Analyze sequence length distribution
3. Determine optimal sequence length for padding/truncation
4. Visualize vocabulary and sequence statistics
5. Save vocabulary for later use

**Hint**: Consider how vocabulary size affects model performance and memory usage (the embedding table grows linearly with vocabulary size)

**Expected Output**: Optimized vocabulary ready for model training
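
Two of the statistics above, token coverage and a percentile-based sequence length, can be computed with the standard library alone. The sketch below is one way to do it; the helper names, the 95th-percentile choice, the sample texts, and the temporary output path are all assumptions. Visualization is left out.

```python
import json
import math
import tempfile
from collections import Counter
from pathlib import Path


def token_coverage(counts, vocab):
    """Fraction of all token occurrences that the vocabulary covers."""
    total = sum(counts.values())
    covered = sum(freq for word, freq in counts.items() if word in vocab)
    return covered / total if total else 0.0


def length_percentile(lengths, pct=95):
    """Nearest-rank percentile of the sequence lengths."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample texts and vocabulary.
texts = [
    "forest fire near the town",
    "fire crews reach the town",
    "lovely weather in the park today",
]
tokenized = [t.split() for t in texts]
counts = Counter(w for toks in tokenized for w in toks)
word2idx = {"<PAD>": 0, "<UNK>": 1, "the": 2, "fire": 3, "town": 4}

coverage = token_coverage(counts, word2idx)
max_len = length_percentile([len(t) for t in tokenized], pct=95)
print(f"coverage={coverage:.2%}, suggested max_len={max_len}")

# Save the vocabulary as JSON. The temp-dir location is a placeholder;
# a path inside the project's data folder would be more natural.
vocab_path = Path(tempfile.gettempdir()) / "vocab.json"
with open(vocab_path, "w", encoding="utf-8") as f:
    json.dump(word2idx, f, ensure_ascii=False, indent=2)
```

Choosing the maximum length from a high percentile instead of the absolute maximum keeps a few unusually long texts from inflating every padded batch.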


In [None]:
# TODO 5: Vocabulary analysis and optimization
# Your implementation here


## TODO 6: Data Pipeline Testing and Validation

**Goal**: Validate the complete data pipeline before moving to modeling

**Steps**:
1. Test data loading with sample batches
2. Verify tensor shapes and data types
3. Check for data inconsistencies, such as empty sequences, out-of-range indices, or missing labels
4. Measure data loading performance
5. Document pipeline characteristics

**Hint**: Pass `num_workers` to `torch.utils.data.DataLoader` to load batches in parallel worker processes, and time a few settings to see which is fastest

**Expected Output**: Validated data pipeline ready for model training
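
A validation pass along these lines might look like the sketch below. The function `validate_loader`, its specific checks, and the random stand-in data are assumptions; it expects a loader that yields `(tokens, labels)` batches, so adapt the unpacking if your collate function returns more items.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def validate_loader(loader, vocab_size, num_classes=2):
    """Run basic sanity checks on every batch and return summary statistics."""
    n_batches = n_samples = 0
    start = time.perf_counter()
    for tokens, labels in loader:
        assert tokens.dtype == torch.long, "token indices should be int64"
        assert tokens.dim() == 2, "expected shape (batch, seq_len)"
        assert tokens.shape[0] == labels.shape[0], "batch size mismatch"
        assert 0 <= tokens.min() and tokens.max() < vocab_size, "index out of range"
        assert 0 <= labels.min() and labels.max() < num_classes, "unexpected label"
        n_batches += 1
        n_samples += tokens.shape[0]
    elapsed = time.perf_counter() - start
    return {"batches": n_batches, "samples": n_samples, "seconds": elapsed}


# Hypothetical stand-in for the real pipeline: 64 random sequences of length 12.
torch.manual_seed(0)
vocab_size = 100
tokens = torch.randint(0, vocab_size, (64, 12))
labels = torch.randint(0, 2, (64,))

# num_workers=0 loads in the main process; larger values are worth timing.
loader = DataLoader(TensorDataset(tokens, labels), batch_size=16, num_workers=0)

stats = validate_loader(loader, vocab_size)
print(stats)
```

Running the same function with different `num_workers` values and comparing `stats["seconds"]` gives a rough measure of loading performance.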


In [None]:
# TODO 6: Data pipeline testing and validation
# Your implementation here

---

## Phase 2: Transformers Enhancement

*After completing Phase 1, consider these enhancements:*

- Use HuggingFace tokenizers for consistent tokenization
- Leverage pre-trained vocabularies (BERT, RoBERTa)
- Compare custom vocabulary vs. pre-trained tokenizers
- Analyze vocabulary size impact on model performance
