# 📝 Lab Notes: Your Learning Journal

## Reflection Solidifies Learning

This notebook serves as your personal learning journal. Research shows that reflection is one of the most powerful ways to solidify understanding and identify patterns in your learning process.

**How to use this notebook:**
- Complete at least one entry per notebook you finish
- Be honest about what worked and what didn't
- Notice patterns across your learning journey
- Use insights to improve your approach

---

## 📋 Entry Template

**Date:** [Today's date]  
**Notebook:** [Notebook number and name]  
**What Worked:** [What went well, what you understood easily]  
**What Broke:** [What was difficult, where you got stuck]  
**One Surprise:** [Something unexpected you learned or discovered]  
**One Change for Next Time:** [How you'll approach similar problems differently]

---


## 📝 Entry 1: Notebook 00 - Overview

**Date:** [Fill in when you complete this notebook]  
**Notebook:** 00_overview.ipynb  
**What Worked:**  
**What Broke:**  
**One Surprise:**  
**One Change for Next Time:**  

---


## 📝 Entry 2: Notebook 01 - Import and Inspect

**Date:** October 23, 2025  
**Notebook:** 01_import_and_inspect.ipynb  

**What Worked:**  
Loading and inspecting the data was straightforward. Understanding that `aspect_encoded` represents the numerical labels (0, 1, 2) for the three categories (Cinematography, Characters, Story) was intuitive once explained.

**What Broke:**  
Initial confusion about what `aspect_encoded` meant - it looked like random numbers (0, 1, 2, 2, 2) until I realized it was the encoded categorical labels.
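
A minimal sketch of what the encoded column means. The name-to-number order below is an assumption for illustration (the real mapping comes from the dataset's data dictionary), and `decode_aspects` is a hypothetical helper, not something from the notebook.

```python
# Hypothetical mapping: the order of the three aspects is assumed here.
ASPECT_TO_ID = {"Cinematography": 0, "Characters": 1, "Story": 2}
ID_TO_ASPECT = {idx: name for name, idx in ASPECT_TO_ID.items()}


def decode_aspects(encoded):
    """Turn a list of integer labels back into category names."""
    return [ID_TO_ASPECT[i] for i in encoded]


# The "random-looking" numbers are just category names in disguise.
decoded = decode_aspects([0, 1, 2, 2, 2])
print(decoded)
```

Decoding a few rows like this early on makes it obvious that the column is a categorical label and not a measurement.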

**One Surprise:**  
The dataset is quite small (369 training, ~130 test samples) - much smaller than I expected for a machine learning task. Also surprised that there are only 3 aspect categories, making this a manageable multi-class classification problem.

**One Change for Next Time:**  
Would spend more time upfront understanding the data dictionary and what each column represents before diving into exploration.

---


## 📝 Entry 3: Notebook 02 - Preprocessing & Tokenization

**Date:** October 23, 2025  
**Notebook:** 02_preprocessing_tokenization.ipynb  

**What Worked:**  
The basic tokenization with `re.findall(r'\b\w+\b', text.lower())` worked well for splitting text into words.

**What Broke:**  
Regex is really confusing! I don't understand what `\b` does and there are so many special characters that I just ask an LLM to do it - I don't think memorizing regex patterns makes sense when an LLM can spit it out in seconds. Also noticed that "let's" splits into "let" and "s", where "s" alone provides no information. Emojis are deleted entirely.
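
A small sketch of what the pattern does, using a made-up sentence. `\b` is a zero-width word boundary (the position between a word character and a non-word character) and `\w+` is a run of letters, digits, or underscores. `simple_tokenize` is a hypothetical wrapper name.

```python
import re


def simple_tokenize(text):
    # \b matches a position, not a character; \w+ grabs the word itself.
    # Apostrophes and emojis are not word characters, so they are skipped.
    return re.findall(r"\b\w+\b", text.lower())


tokens = simple_tokenize("Let's watch it again 🎬!")
print(tokens)  # ['let', 's', 'watch', 'it', 'again']
```

The apostrophe acts as a boundary, which is why "let's" becomes two tokens, and the emoji never matches `\w`, so it silently disappears.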

**One Surprise:**  
How much information gets lost with simple tokenization - contractions split poorly, emojis disappear, and we lose context. TinyBERT will handle these cases better (maybe "Lets" instead of "let"+"s", and emojis converted to text).

**One Change for Next Time:**  
Would explore different tokenization strategies earlier to understand the trade-offs. Maybe compare word-level vs character-level vs subword tokenization.

---


## 📝 Entry 4: Notebook 03 - Vocabulary Building & Encoding

**Date:** October 23, 2025  
**Notebook:** 03_vocab_and_encoding.ipynb  

**What Worked:**  
Building the vocabulary with `Counter` and limiting to top 1000 words made sense. Using special tokens `<PAD>` and `<UNK>` is a clever way to handle padding and unknown words.

**What Broke:**  
Lots of confusion about vocab creation! Why start at index 2? (Answer: to not overwrite `<PAD>=0` and `<UNK>=1`). How to handle test data with unseen words? (Answer: map to `<UNK>`). The vocab only has 1002 words total, which seems very limiting.

**One Surprise:**  
Most frequent words are stopwords ("the", "a", "and") which don't carry much meaning. I expected more movie-specific words. Also surprised that limiting to 1000 words means ~30-50% of test words map to `<UNK>` - that's a LOT of lost information!

**One Change for Next Time:**  
Would experiment with different vocab sizes (500, 2000, 5000) to see the trade-off between model size and performance. Also consider filtering stopwords to keep more domain-specific terms.

---


## 📝 Entry 5: Notebook 04 - Padding, Tensors & DataLoader

**Date:** October 23, 2025  
**Notebook:** 04_padding_tensors_dataloader.ipynb  

**What Worked:**  
Padding sequences to a fixed length (128) makes sense for batch processing. Converting to PyTorch tensors with `torch.long` was straightforward.

**What Broke:**  
Still confused about why we need tensors and what `torch.long` means exactly - they're still just arrays to me. Also unclear why shuffle=True is important (Answer: prevents the NN from learning sequence patterns in the data order). The math of why batch_size=16 is chosen wasn't clear.

**One Surprise:**  
Padding to max_len=128 means shorter reviews get lots of `<PAD>` tokens - the model needs to learn to ignore these! Also surprising that we process samples in batches rather than one at a time - it's more efficient but adds complexity. The dataset doesn't have to divide evenly by the batch size: with 369 training samples and batch_size=16, the last batch is just smaller than the rest.
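
A plain-Python sketch of the padding step and the batch arithmetic. It assumes right-padding with `pad_id=0` and truncation of long reviews, which may differ from the notebook. `pad_or_truncate` is a hypothetical helper.

```python
import math


def pad_or_truncate(ids, max_len=128, pad_id=0):
    """Cut sequences longer than max_len; right-pad shorter ones with pad_id."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


padded = pad_or_truncate([5, 9, 2], max_len=8)
print(padded)  # [5, 9, 2, 0, 0, 0, 0, 0]

# Batch arithmetic for the training set size mentioned in Entry 2.
n_samples, batch_size = 369, 16
n_batches = math.ceil(n_samples / batch_size)
last_batch = n_samples - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 24 1
```

PyTorch's `DataLoader` keeps a smaller final batch by default (`drop_last=False`), so 369 samples at batch size 16 come out as 23 full batches plus one batch holding a single sample.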

**One Change for Next Time:**  
Would experiment with different batch sizes (8, 32, 64) and max_length values (64, 256) to understand their impact on training speed and memory usage.

---


## 📝 Entry 6: Notebook 05 - Simple Neural Network

**Date:** October 23, 2025  
**Notebook:** 05_simpleNN_with_embedding.ipynb  

**What Worked:**  
Understanding that the embedding layer learns spatial representations of words where semantic meaning = mathematical distance. The forward pass flow (embedding → pooling → fc1 → relu → fc2) made sense eventually.

**What Broke:**  
Really struggled to define the `SimpleNN` class! The forward pass 3D → 2D tensor transformation was confusing. Where does the mask go? How does pooling work? Why do we need it? Took a lot of guidance to understand mean pooling averages embeddings while masking out padding.

**One Surprise:**  
The embedding layer (embed_size=50) means each word gets 50 numbers to represent it - these are learned during training! Also surprising that we mask padding by multiplying by 0 rather than removing it. Mean pooling is smarter than I thought - it handles variable-length sequences elegantly.
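
Here is the masking idea written out with NumPy as a stand-in for the PyTorch tensor ops, so the shapes are easy to follow. It is a sketch under assumptions: `masked_mean_pool` is a hypothetical name, `pad_id=0` is assumed, and the embeddings are dummy values, not learned ones.

```python
import numpy as np


def masked_mean_pool(embedded, token_ids, pad_id=0):
    """embedded: (batch, seq_len, embed_size); token_ids: (batch, seq_len)."""
    mask = (token_ids != pad_id).astype(embedded.dtype)   # (batch, seq_len)
    # Multiplying by 0 zeroes out the padded positions before summing.
    summed = (embedded * mask[:, :, None]).sum(axis=1)    # (batch, embed_size)
    # Divide by the number of real tokens, not by seq_len.
    n_real = np.clip(mask.sum(axis=1, keepdims=True), 1.0, None)  # (batch, 1)
    return summed / n_real


token_ids = np.array([[4, 7, 0, 0], [3, 0, 0, 0]])
embedded = np.ones((2, 4, 50))   # dummy embeddings, embed_size=50
embedded[0, 1] = 3.0             # second real token of the first sample
embedded[:, 2:] = 99.0           # junk at padded positions, should be ignored
pooled = masked_mean_pool(embedded, token_ids)
print(pooled.shape)  # (2, 50)
```

The 3D → 2D step is the sum over `axis=1`: the sequence dimension collapses, leaving one 50-number vector per review. Dividing by the real-token count keeps long and short reviews on the same scale.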

**One Change for Next Time:**  
Would draw out the tensor shapes at each step (before coding) to visualize the transformations. Also would read more about different pooling strategies (max, mean, last token).

---


## 📝 Entry 7: Notebook 06 - Training & Evaluation

**Date:** October 23, 2025  
**Notebook:** 06_eval_baseline_metrics.ipynb  

**What Worked:**  
The training loop structure made sense (forward pass → loss → backward → optimizer step). Training was fast (~2 minutes for 50 epochs). Visualizing loss and accuracy curves helped understand overfitting.

**What Broke:**  
Hit two bugs: `RuntimeError` about FloatTensor vs LongTensor (duplicate embedding call), and `AttributeError` for missing `pad_id` attribute. Also struggled to understand confusion matrices - they're really hard to read! Which axis is which?
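
On the "which axis is which?" question, a hand-rolled sketch helps. It follows the convention scikit-learn's `confusion_matrix` uses (rows = true class, columns = predicted class). The labels below are made-up toy data, not results from the notebook.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """matrix[t][p] counts samples whose true class is t and predicted class is p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    # Recall for class i = correct predictions on the diagonal / row total.
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


y_true = [0, 0, 1, 1, 2, 2]   # toy labels
y_pred = [0, 1, 1, 1, 1, 2]   # toy predictions, biased toward class 1
cm = confusion_matrix(y_true, y_pred)
recall = per_class_recall(cm)
predicted_counts = [sum(col) for col in zip(*cm)]  # column sums
print(cm, recall, predicted_counts)
```

Column sums show how often each class was predicted, so a model leaning toward one class shows up as one heavy column. Row sums are the true class counts.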

**One Surprise:**  
Training accuracy (70%) vs test accuracy (49%) shows massive overfitting - a gap of 21 percentage points! The model memorized training data but can't generalize. The confusion matrix revealed severe bias toward "Characters" class (77/132 predictions). Training was surprisingly fast but results were disappointing.

**One Change for Next Time:**  
Would implement regularization techniques (dropout, weight decay) earlier to combat overfitting. Also would track more metrics during training (per-class accuracy, F1-score) to catch bias issues sooner.

---


## 📝 Entry 8: Notebook 07 - TinyBERT Setup

**Date:** October 23, 2025  
**Notebook:** 07_tinybert_setup_and_freeze.ipynb  

**What Worked:**  
Loading pre-trained TinyBERT was easy with `transformers` library. Understanding that `num_labels=3` sets output size made sense. Freezing layers with `param.requires_grad = False` was straightforward.

**What Broke:**  
Confusion about `num_labels` (is it input or output?), attention masks (why different from padding?), and layer freezing strategy (why layer 3?). Also unclear why we don't get catastrophic forgetting when unfreezing layers - seems risky!

**One Surprise:**  
TinyBERT has 4 layers vs BERT-Base's 12 - it's really "tiny"! The pre-trained vocab (30K+ tokens) is WAY bigger than my custom vocab (1002 words). Attention masks mark which positions are real tokens and which are padding, so the model knows which tokens to "pay attention to". Strategic freezing (95% frozen, 5% trainable) helps limit catastrophic forgetting.

**One Change for Next Time:**  
Would experiment with different freezing strategies (freeze nothing, freeze everything except classifier, freeze first 2 layers vs first 3 layers) to understand the trade-offs.

---


## 📝 Entry 9: Notebook 08 - TinyBERT Fine-Tuning

**Date:** October 23, 2025  
**Notebook:** 08_tinybert_finetune_trainloop.ipynb  

**What Worked:**  
The training loop with early stopping worked beautifully! `lr=5e-4` was the "Goldilocks zone" - fast convergence without oscillation. The professional visualization (dual subplots with overfitting gap) made training behavior crystal clear.

**What Broke:**  
Initial `lr=1e-5` was too conservative (barely learned). `lr=2.5e-3` was too aggressive (oscillated wildly). Hit a `TypeError` comparing list vs float in early stopping logic. Confusion about "patience" mechanism and what "New best!" vs "Patience" messages meant.

**One Surprise:**  
Early stopping saved 24 epochs! Best model was at epoch 26, not epoch 50. Test loss dropped from 1.08 (baseline) to 0.1045 (TinyBERT) - a 90% reduction! The patience mechanism automatically restores the best checkpoint. AdamW applies weight decay separately from the gradient update (decoupled weight decay) and is a common default for fine-tuning transformers.
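
A framework-free sketch of the patience mechanism. The `EarlyStopping` class, its parameter names, and the loss values are all hypothetical (the notebook's patience setting is not recorded here). The `state` argument stands in for something like a model's `state_dict()`.

```python
import copy


class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss, state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # "New best!": remember the loss and keep a copy of the weights.
            self.best_loss, self.best_epoch = val_loss, epoch
            self.best_state = copy.deepcopy(state)
            self.bad_epochs = 0
            return False
        # "Patience": one more epoch without improvement.
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
losses = [0.9, 0.5, 0.6, 0.55, 0.4]  # made-up validation losses
for epoch, loss in enumerate(losses, start=1):
    # Pass the latest scalar loss, not the whole history list.
    if stopper.step(epoch, loss, state={"epoch": epoch}):
        break
stopped_at = epoch
print(stopped_at, stopper.best_epoch, stopper.best_loss)  # 4 2 0.5
```

Comparing a list of losses against a float is one way a list-vs-float `TypeError` can arise, which is why the loop passes a single value per epoch. After stopping, `best_state` is what would be loaded back into the model.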

**One Change for Next Time:**  
Would try learning rate scheduling (start high, decay over time) to potentially squeeze out even better performance. Also would experiment with different patience values (3, 7, 10).
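
One possible shape for "start high, decay over time" is a linear decay, sketched below in plain Python. The function name, the linear form, and the end value are assumptions for illustration. In practice a scheduler from `torch.optim.lr_scheduler` would do this job.

```python
def linear_decay_lr(epoch, total_epochs, lr_start=5e-4, lr_end=1e-4):
    """Learning rate for a 0-based epoch, moving linearly from lr_start to lr_end."""
    if total_epochs <= 1:
        return lr_start
    # Fraction of the schedule completed, capped at 1.0.
    frac = min(epoch, total_epochs - 1) / (total_epochs - 1)
    return lr_start + frac * (lr_end - lr_start)


schedule = [linear_decay_lr(e, total_epochs=5) for e in range(5)]
print(schedule)
```

The idea is to take big steps while the loss is still far from its minimum and smaller steps later, when large updates would mostly cause oscillation.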

---


## 📝 Entry 10: Notebook 09 - Evaluation & Comparison

**Date:** October 23, 2025  
**Notebook:** 09_tinybert_eval_compare.ipynb  

**What Worked:**  
Loading both models from checkpoints for fair comparison was straightforward. The comparison table clearly showed TinyBERT's superiority (94% vs 49% accuracy). Using `seaborn.heatmap` fixed the confusion matrix legend overlap issue.

**What Broke:**  
Initially tried importing `SimpleNN` from another notebook with `nbimporter` - didn't work! Had to copy the class definition directly. The manual confusion matrix plotting had overlapping legends. Confusion about why I was preprocessing data again (needed baseline model's vocab for fair comparison).

**One Surprise:**  
**94% accuracy with TinyBERT!** A 91% relative improvement over the baseline (49% → 94%, a gain of 45 percentage points). Story classification went from 36% to 98% recall (+173%!). The baseline's severe bias toward "Characters" completely disappeared with TinyBERT. Transfer learning is TRANSFORMATIVELY better than training from scratch.
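
The improvement figures mix two kinds of numbers, so here is the arithmetic spelled out using the rounded percentages from this entry. `point_gain` and `relative_gain` are hypothetical helper names.

```python
def point_gain(old, new):
    """Absolute change, in percentage points."""
    return new - old


def relative_gain(old, new):
    """Change relative to the starting value, in percent."""
    return (new - old) / old * 100


acc_points = point_gain(49, 94)          # accuracy: 49% -> 94%
acc_relative = relative_gain(49, 94)
recall_relative = relative_gain(36, 98)  # Story recall: 36% -> 98%
print(acc_points, round(acc_relative, 1), round(recall_relative, 1))  # 45 91.8 172.2
```

With rounded inputs the Story recall gain comes out near 172%, close to the +173% above, which was presumably computed from unrounded recall values.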

**One Change for Next Time:**  
Would do error analysis on the 8 misclassified TinyBERT samples to understand failure modes. Also would try ensemble methods (multiple TinyBERT models with different seeds) to push accuracy even higher.

---


## 🎯 Final Reflection: Learning Patterns

**At the end of your project, read all your entries and reflect on these questions:**

### 📊 Pattern Recognition
- What themes do you notice across your learning journey?
- Which concepts were consistently challenging?
- Where did you experience the most "aha" moments?

### 🚀 Growth Insights
- How did your understanding of NLP evolve?
- What skills did you develop that surprised you?
- Which approach worked best for your learning style?

### 🔮 Future Applications
- How will you apply these concepts to new projects?
- What would you do differently if starting over?
- What are you excited to explore next?

---

**My Final Reflections:**

### 📊 Pattern Recognition

**Recurring Themes:**
1. **Confusion → Clarity → Confidence**: Almost every notebook started with confusion (vocab indices, tensor shapes, attention masks), followed by guided exploration, then eventual understanding. This pattern repeated consistently.

2. **The Power of Visualization**: Every time I visualized something (confusion matrices, loss curves, overfitting gaps), my understanding deepened significantly. Seeing data >> hearing explanations.

3. **Hands-On Debugging = Best Learning**: My deepest learning moments came from fixing bugs (FloatTensor error, early stopping TypeError, confusion matrix overlap). Debugging forced me to truly understand what each line does.

**Consistently Challenging Concepts:**
- **Tensor shape transformations** (3D → 2D via mean pooling)
- **Why certain design choices matter** (batch size, learning rate, vocab size)
- **Regex patterns** (still don't love them, still rely on LLMs!)
- **Reading confusion matrices** (which axis is predicted vs true?)

**"Aha!" Moments:**
1. **Notebook 03**: Realizing that limiting vocab to 1000 words means 30-50% of test words map to `<UNK>` - huge information loss!
2. **Notebook 05**: Understanding embeddings as spatial representations where semantic meaning = mathematical distance 🤯
3. **Notebook 06**: Seeing 70% train accuracy vs 49% test accuracy - overfitting visualized!
4. **Notebook 08**: Discovering early stopping saved 24 epochs by automatically finding the best checkpoint
5. **Notebook 09**: The 91% relative accuracy improvement with TinyBERT (49% → 94%) - transfer learning validation!

---

### 🚀 Growth Insights

**Evolution of NLP Understanding:**

**Beginning (Notebooks 01-03):**
- NLP = "text → numbers somehow"
- Thought tokenization was simple (just split on spaces!)
- Didn't understand why vocab size mattered

**Middle (Notebooks 04-06):**
- Realized NLP pipeline complexity (tokenize → encode → pad → embed → pool → classify)
- Understood embeddings as learned spatial representations
- Saw firsthand how small vocab + no context = poor performance (49% accuracy)

**End (Notebooks 07-09):**
- Grasped transfer learning paradigm: pre-training on millions of examples >> training from scratch
- Understood attention mechanisms (context-aware vs simple embeddings)
- Realized subword tokenization (30K tokens) >> word-level (1K words)
- **Key insight**: For production NLP, always start with pre-trained transformers

**Skills That Surprised Me:**
1. **Hyperparameter tuning intuition**: Learning to "feel" when LR is too low (slow convergence) vs too high (oscillation)
2. **Debugging PyTorch errors**: Tensor type mismatches, shape errors - I can read these now!
3. **Interpreting metrics**: Confusion matrices, precision/recall trade-offs, overfitting gaps
4. **Strategic layer freezing**: Understanding 95% frozen + 5% trainable prevents catastrophic forgetting

**Best Learning Approach:**
**Guided discovery with immediate feedback** worked best for me:
- Attempt solution → hit error → understand why → fix → reflect
- This beats pure lecture or pure trial-and-error
- Having hints without complete solutions forced me to think

---

### 🔮 Future Applications

**How I'll Apply These Concepts:**

1. **For Any NLP Task:**
   - Start with pre-trained transformer (TinyBERT, BERT, RoBERTa)
   - Use strategic layer freezing (freeze most, train classifier + last layer)
   - Employ early stopping with patience (don't waste epochs!)
   - Always visualize: loss curves, confusion matrices, overfitting gaps
   - Compare to simple baseline (appreciate improvements!)

2. **For Small Datasets (<1000 samples):**
   - Transfer learning is ESSENTIAL (can't learn from scratch)
   - Freeze most layers to prevent overfitting
   - Higher LR works when only 5% of model is trainable
   - Data augmentation might help (back-translation, paraphrasing)

3. **For Production Systems:**
   - TinyBERT >> baseline for real users (94% vs 49%)
   - Model size trade-off: TinyBERT (64MB) is acceptable for most apps
   - Always track per-class metrics to catch bias (baseline: 77 of 132 predictions, about 58%, were "Characters"!)

**What I'd Do Differently:**

1. **Spend more time on data exploration** (Notebook 01): Understanding the 3-class balance, review length distribution, vocab coverage upfront would have informed later decisions.

2. **Experiment with vocab sizes earlier** (Notebook 03): Trying 500, 2000, 5000 word vocabs would have taught me the sweet spot faster.

3. **Implement learning rate scheduling from the start** (Notebook 08): Starting high (5e-4) and decaying to low (1e-4) might have reached 95%+ accuracy.

4. **Do error analysis immediately** (Notebook 09): Manually reviewing the 8 misclassified TinyBERT samples could reveal systematic failures.

5. **Track more metrics during training**: Per-class accuracy, F1-score, not just overall loss/accuracy - would have caught "Characters" bias sooner.

**What I'm Excited to Explore:**

1. **Larger Models**: BERT-Base (110M params) vs TinyBERT (14M params) - diminishing returns?
2. **Different Domains**: Sentiment analysis, question-answering, named entity recognition
3. **Multimodal Models**: Combining text + images (e.g., VisualBERT)
4. **Efficient Fine-Tuning**: LoRA, QLoRA - update <1% of parameters!
5. **Explainability**: Why did TinyBERT misclassify those 8 samples?
6. **Ensemble Methods**: Combine 5 TinyBERT models trained with different seeds
7. **Larger Datasets**: How does TinyBERT scale to 10K, 100K, 1M samples?

---

### 🎓 **The Big Picture**

**What This Project Taught Me:**

This wasn't just about getting 94% accuracy. It was about:
- **Understanding the fundamentals**: Tokenization → Embeddings → Classification
- **Appreciating modern NLP**: Transfer learning changed everything
- **Learning to learn**: Debugging, visualizing, iterating
- **Thinking like an ML engineer**: Hyperparameters, metrics, trade-offs

**The Most Important Lesson:**

**The era of building NLP from scratch is over for production.** Transfer learning with pre-trained transformers (TinyBERT, BERT, etc.) is:
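- 91% more accurate than the baseline in relative terms (94% vs 49% accuracy)
- Cheaper to train: early stopping at epoch 26 used 48% fewer epochs than the 50-epoch budget
- More generalizable (6.6 vs 21 percentage-point train/test gap)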
- Handles any vocabulary (subword tokenization)

But understanding the baseline pipeline (Notebooks 01-06) was ESSENTIAL to appreciate why transformers are so powerful. You can't understand the revolution without knowing what came before.

---

### 🚀 **Next Steps**

1. **Short Term** (This Week):
   - Error analysis on 8 misclassified TinyBERT samples
   - Try unfreezing layer 2 as well (layers 2-3 + classifier)
   - Experiment with learning rate scheduling

2. **Medium Term** (This Month):
   - Apply TinyBERT to a different dataset (e.g., product reviews)
   - Try BERT-Base vs TinyBERT comparison
   - Implement data augmentation (back-translation)

3. **Long Term** (This Year):
   - Explore multimodal models (text + images)
   - Learn about LLMs (GPT-style models)
   - Build a production NLP app (deployed API)

**Final Thought:**

I started this project confused about tokenization and vocab indices. I'm ending it with a working transformer model at 94% accuracy, a deep understanding of the NLP pipeline, and confidence to tackle any text classification task. That's genuine progress. 🎉

**Thank you for this structured, didactic approach!** The TODO-style notebooks with hints (not solutions) forced me to think. The reflection prompts forced me to articulate understanding. The bug fixes forced me to truly learn. This is how ML education should be done. 🚀
