# 🧪 Lab Notes & Experiments

**Project**: Natural Language Processing with Disaster Tweets  
**Competition**: Kaggle NLP Getting Started  
**Timeline**: October 2025  
**Goal**: Learn PyTorch fundamentals through hands-on implementation

---

## 📝 Experiment Log

### Experiment 1: Baseline Model - First Attempt
**Date**: October 2025  
**Goal**: Establish baseline performance with simple architecture

**Architecture**:
- Vocabulary: 3,160 words (min_freq=5 likely)
- Embedding dim: 100
- Hidden dim: 128
- Dropout: 0.5
- Learning rate: 0.001
- Optimizer: Adam
- Epochs: 10

**Results**:
- **Validation Accuracy**: 72.0%
- **Validation F1**: 0.68
- **Train Accuracy**: 98.4%
- **Train/Val Gap**: 26.4% ❌

**Observations**:
- 🚨 **SEVERE OVERFITTING**: Model memorizing training data
- Train loss → 0, Val loss → 18 (complete divergence)
- Training accuracy near perfect, validation stuck at 72%
- Model has too much capacity for the data

**Conclusion**:
- ❌ **DISCARD**: Architecture needs major changes
- **Root Cause**: 329K parameters for only 6K training samples (54 params/sample!)
- **Lesson**: Model capacity must match dataset size
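
The 329K figure can be checked by hand. Here is a minimal sketch, assuming the model is an embedding layer, mean pooling, one hidden linear layer, and a single-logit output layer. That layout is an assumption (it is not spelled out in these notes), and `count_parameters` is a hypothetical helper, but it does reproduce the 329K number:

```python
def count_parameters(vocab_size: int, embed_dim: int, hidden_dim: int) -> dict:
    """Count trainable parameters for an ASSUMED layout:
    Embedding -> mean pooling -> Linear(embed, hidden) -> Linear(hidden, 1).
    """
    embedding = vocab_size * embed_dim              # one vector per word
    hidden = embed_dim * hidden_dim + hidden_dim    # weights + biases
    output = hidden_dim * 1 + 1                     # single logit + bias
    return {
        "embedding": embedding,
        "hidden": hidden,
        "output": output,
        "total": embedding + hidden + output,
    }


exp1 = count_parameters(vocab_size=3_160, embed_dim=100, hidden_dim=128)
exp2 = count_parameters(vocab_size=41_400, embed_dim=50, hidden_dim=64)

print(exp1["total"])                       # 329057
print(round(exp1["total"] / 6_000, 1))     # 54.8 params per training sample
print(exp2["total"])                       # 2073329
```

Under the same assumption, the Experiment 2 settings (41,400 words, 50/64) come out at roughly 2.07M parameters, almost all of them in the embedding table. The later "smaller network" is smaller in its dense layers, not in total parameter count.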

**Key Insight**: High training accuracy is NOT a success metric if validation lags behind!

---

### Experiment 2: Vocabulary Expansion + Architecture Reduction
**Date**: October 2025  
**Goal**: Fix overfitting through vocabulary expansion and smaller network

**Changes**:
- ✅ Vocabulary: 3,160 → **41,400 words** (min_freq=1)
  - Rationale: Capture rare disaster-specific terms
- ✅ Embedding dim: 100 → **50** (50% reduction)
  - Rationale: Less capacity to memorize
- ✅ Hidden dim: 128 → **64** (50% reduction)
  - Rationale: Simpler model for better generalization

**Results**:
- **Validation Accuracy**: 77.7% (+5.7% 🎉)
- **Validation F1**: 0.74 (+0.06)
- **Train Accuracy**: 92.8%
- **Train/Val Gap**: 15.1% (improved from 26.4%)

**Observations**:
- ✅ Loss curves much healthier (both decreasing together)
- ✅ Validation loss plateaus around 0.50 (vs 18 before!)
- ✅ Both accuracy curves rise together initially
- ⚠️ Still some overfitting but manageable

**Conclusion**:
- ✅ **MAJOR SUCCESS**: Right direction!
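- Larger vocabulary + smaller layers together gave +5.7% validation accuracy
- Both changes were made in the same run, so their individual effects were not isolated
- Smaller dense layers can generalize better on a dataset this size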

**Key Insight**: Large vocabulary + small network > small vocabulary + large network

---

### Experiment 3: Aggressive Regularization
**Date**: October 2025  
**Goal**: Further reduce overfitting with regularization techniques

**Changes**:
- ✅ Learning rate: 0.001 → **0.00001** (100× slower!)
  - Rationale: Careful, gradual learning
- ✅ Weight decay: None → **1e-4**
  - Rationale: L2 regularization penalizes large weights
- ✅ Dropout: 0.5 → **0.6** (likely)
  - Rationale: More aggressive dropout
- ✅ Epochs: 10 → **4** (fixed cap, used as a simple form of early stopping)
  - Rationale: Stop before overfitting sets in

**Results**:
- **Validation Accuracy**: 79.4% (+1.7%)
- **Validation F1**: 0.75 (+0.01)
- **Train Accuracy**: 85.7%
- **Train/Val Gap**: 5.9% ✅ (near-perfect!)

**Observations**:
- 🎯 **EXCELLENT generalization**: Only 6% gap!
- Train and val losses stay close throughout training
- Best val performance at Epoch 4
- Both curves still improving (could train slightly longer)

**Conclusion**:
- ✅ **KEEP**: This is the sweet spot!
- Learning rate was the largest change (100× lower), but its effect was not isolated from the other three changes
- Multiple regularization techniques appear to stack effectively
- Model learning generalizable patterns, not memorizing

**Key Insight**: Learning rate is often the most important hyperparameter!

---

### Experiment 4: Extended Training
**Date**: October 2025  
**Goal**: Push for 80% validation accuracy milestone

**Changes**:
- ✅ Epochs: 4 → **6**
  - Rationale: Both curves still improving at Epoch 4

**Results**:
- **Validation Accuracy**: 80.0% (+0.6% 🎯)
- **Validation F1**: 0.76 (+0.01)
- **Train Accuracy**: 93.0%
- **Train/Val Gap**: 13.0% (increased from 5.9%)
- **Peak Validation**: Epoch 5 (79.8%)

**Observations**:
- 🎯 **HIT THE TARGET**: 80% validation accuracy!
- Val loss started rising after Epoch 3
- Best performance was actually Epoch 5 (79.8%)
- Training for 6 epochs increased overfitting slightly

**Conclusion**:
- ✅ **SUCCESS**: Achieved 80% goal!
- ⚠️ Should have stopped at Epoch 5
- Mild overfitting returned (13% gap)
- Demonstrates importance of early stopping

**Key Insight**: More training ≠ better performance. Monitor validation loss!
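
A patience-based check is one way to automate "monitor validation loss" instead of picking the epoch count by hand. The sketch below is framework-agnostic. The `EarlyStopping` class and its `patience` / `min_delta` parameters are hypothetical names, and the loss values are made up for illustration (they are not the numbers from these experiments):

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience        # epochs to wait without improvement
        self.min_delta = min_delta      # minimum drop that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch     # this is the checkpoint worth keeping
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only.
val_losses = [0.62, 0.55, 0.51, 0.52, 0.54, 0.57]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)   # 3 5
```

In a real training loop, the model weights would be saved whenever `best_epoch` updates, so the final model comes from the best epoch rather than the last one.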

---

### Experiment 5: Kaggle Submission
**Date**: October 2025  
**Goal**: Test generalization on unseen test set

**Model Used**: Experiment 4 (6 epochs, best checkpoint)

**Results**:
- **Public F1-Score**: 0.78516
- **Validation F1**: 0.76
- **Difference**: ≈ +0.025 (about 2.5 points of F1)
- **Leaderboard Rank**: #658

**Observations**:
- 🎉 **TEST > VALIDATION**: Public test F1 came in slightly above validation F1
- The difference is small and may just be sampling noise from the validation split
- Generalization to unseen data held up, despite the 13% train/val gap in Experiment 4
- Preprocessing pipeline works on new data

**Conclusion**:
- ✅ **GOOD RESULT**: Test score matched or beat the validation estimate
- Suggests the preprocessing is robust
- Consistent with the hyperparameter choices being reasonable
- Solid baseline for from-scratch implementation

**Key Insight**: When test ≈ validation, the validation split was a trustworthy estimate!

---

## 📊 Complete Experiment Comparison

| Exp | Vocab | Emb | Hidden | LR | Dropout | WD | Epochs | Val Acc | Val F1 | Train/Val Gap | Status |
|-----|-------|-----|--------|----|---------|----|--------|---------|--------|---------------|--------|
| **1** | 3.2K | 100 | 128 | 1e-3 | 0.5 | 0 | 10 | 72.0% | 0.68 | **26.4%** ❌ | Overfit |
| **2** | 41K | 50 | 64 | 1e-3 | 0.5? | 0 | 10 | 77.7% | 0.74 | 15.1% | Better |
| **3** | 41K | 50 | 64 | **1e-5** | 0.6 | **1e-4** | **4** | 79.4% | 0.75 | **5.9%** ✅ | Excellent |
| **4** | 41K | 50 | 64 | 1e-5 | 0.6 | 1e-4 | **6** | **80.0%** | **0.76** | 13.0% | Goal! |
| **5** | 41K | 50 | 64 | 1e-5 | 0.6 | 1e-4 | 6 | N/A | **Test: 0.785** | N/A | Submitted |

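**Progression (validation accuracy)**: 72.0% → 77.7% → 79.4% → 80.0%, then **0.785 F1 on the public test set**

**Total Improvement**: +8.0% validation accuracy (72.0% → 80.0%); F1 rose from 0.68 (first model, validation) to 0.785 (test), a gain of about 0.105!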

---

## 💡 Ideas Tried & Lessons Learned

### ✅ What Worked:

1. **Large Vocabulary (41K words)**
   - Captured rare disaster terms (tsunami, wildfire, evacuation)
   - min_freq=1 was crucial
   - More words > fewer words for NLP

2. **Smaller Network Architecture**
   - Embedding: 100 → 50
   - Hidden: 128 → 64
   - Less capacity = better generalization
   - Counter-intuitive but effective!

3. **Very Low Learning Rate (1e-5)**
   - Largest single hyperparameter change (100× lower than in Exp 1-2)
   - Slow, careful learning
   - Introduced in Exp 3, where the train/val gap fell from 15.1% to 5.9%
   - Worth the longer training time

4. **Multiple Regularization Stacking**
   - Dropout (0.6) + Weight Decay (1e-4) + Early Stopping
   - Added together in Exp 3 for +1.7% validation accuracy (individual contributions not measured)
   - Combined effect was powerful

5. **Early Stopping (4-6 epochs)**
   - Validation loss was the guide
   - Best performance at Epoch 5
   - Prevented overtraining

### ❌ What Didn't Work:

1. **Large Network with Small Vocab**
   - 329K params for 6K samples = disaster
   - Model memorized everything
   - High training accuracy is a trap!

2. **High Learning Rate (0.001)**
   - Too aggressive for this dataset
   - Train/val gap stayed at 15% or more in Exp 1 and Exp 2
   - Validation loss diverged in Exp 1

3. **Training Too Long (10 epochs)**
   - Validation peaked around Epoch 5 (seen in Exp 4)
   - Extra epochs hurt performance
   - Wasted compute and increased overfitting

### 🤔 Surprises:

1. **Test Score > Validation Score**
   - Usually it's the opposite!
   - Shows excellent generalization
   - Robust preprocessing pipeline

2. **Vocabulary Size > Model Size**
   - Expected larger embeddings to help
   - Actually, large vocab + small embeddings won
   - Representation diversity > capacity

3. **100× Learning Rate Reduction**
   - Didn't expect such a huge change to work
   - 1e-5 is very slow but highly effective
   - Sometimes patience is the answer

---

## 🎯 Ideas to Try Next

### Phase 2: Transformers (High Priority)

- [x] Complete Phase 1 baseline
- [ ] **Implement DistilBERT** (smallest, fastest)
  - Expected: +5-7% F1
  - Target: 0.82-0.83 F1
- [ ] **Try RoBERTa** (more powerful)
  - Expected: +6-8% F1
  - Target: 0.83-0.84 F1
- [ ] **Fine-tune carefully**
  - Lower LR (1e-5 or 2e-5)
  - Few epochs (2-4)
  - Monitor for overfitting

### Quick Wins (Medium Priority)

- [ ] **Optimize Decision Threshold**
  - Currently using 0.5
  - F1-score may be maximized at a different threshold
  - Expected: +1-2% F1

- [ ] **Try Max Pooling**
  - Currently using mean pooling
  - Max might capture stronger signals
  - Expected: +0.5-1% F1

- [ ] **Ensemble Multiple Models**
  - Combine 3-5 different models
  - Majority vote or average probabilities
  - Expected: +2-3% F1
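
The threshold idea above can be sketched as a simple sweep over candidate cutoffs. This is a minimal pure-Python sketch. `f1_score` and `best_threshold` are hypothetical helpers written for these notes, and the labels and probabilities are toy values. In practice the sweep should run on validation predictions, never on the test set:

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates=None):
    """Return (threshold, f1) for the candidate cutoff with the highest F1."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 96, 5)]   # 0.05 ... 0.95
    scored = [
        (f1_score(y_true, [int(p >= t) for p in probs]), t)
        for t in candidates
    ]
    best_f1, best_t = max(scored, key=lambda s: s[0])
    return best_t, best_f1


# Toy validation labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.45, 0.6, 0.3, 0.55, 0.1, 0.42, 0.2]

f1_at_default = f1_score(y_true, [int(p >= 0.5) for p in probs])
threshold, f1_at_best = best_threshold(y_true, probs)
print(f"default 0.5: {f1_at_default:.3f}, tuned {threshold}: {f1_at_best:.3f}")
```

On this toy data a lower cutoff recovers two positives that 0.5 misses. Whether the real model benefits is still an open question until it is tried.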

### Advanced (Low Priority for Now)

- [ ] LSTM/GRU architecture
- [ ] Pre-trained embeddings (GloVe)
- [ ] Data augmentation
- [ ] Cross-validation
- [ ] Pseudo-labeling

---

## 🐛 Debugging Notes

### Issue 1: AttributeError - float.split()
**Problem**: Vocabulary building failed on NaN values  
**Cause**: `text_clean` column had NaN (float) instead of strings  
**Solution**: 
- Added `fillna("UNK")` in preprocessing
- Added type checking before `.split()`
- Ensured preprocessing returns strings, not NaN

**Lesson**: Always validate data types after preprocessing!
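
The type check can live inside the vocabulary builder itself. This is a minimal sketch, not the project's actual code: `build_vocab`, the `<PAD>` / `<UNK>` token names, and the sample texts are all assumptions. It skips non-string rows instead of filling them, which complements the `fillna` fix:

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"   # hypothetical special-token names


def build_vocab(texts, min_freq=1):
    """Map each word seen at least `min_freq` times to an integer id."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):   # NaN arrives as a float, None as NoneType
            continue                    # skip instead of crashing on .split()
        counts.update(text.split())

    vocab = {PAD: 0, UNK: 1}            # reserve ids for padding and unknown words
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab


# Toy inputs, including the kind of bad rows that caused the error.
texts = ["forest fire near la ronge", float("nan"), "fire evacuation ordered", None]

full_vocab = build_vocab(texts, min_freq=1)
small_vocab = build_vocab(texts, min_freq=2)
print(len(full_vocab), len(small_vocab))   # 9 3
```

The same function also shows why `min_freq` changes the vocabulary size so much: with `min_freq=2`, only "fire" survives from the toy data.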

### Issue 2: KeyError in DataLoader
**Problem**: `KeyError: 1335` when iterating through batches  
**Cause**: Using label-based indexing `[idx]` instead of positional `.iloc[idx]`  
**Solution**: Changed to `.iloc[idx]` in `__getitem__`  

**Lesson**: Pandas Series indexing can be tricky!

### Issue 3: Import Error - ModuleNotFoundError: 'src'
**Problem**: Can't import from `src.models.baseline_model`  
**Cause**: Python path doesn't include project root  
**Solution**: 
```python
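import os
import sys

# Assumes the working directory is one level below the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, project_root)  # must run before any `from src...` import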
```

**Lesson**: Always fix Python path BEFORE importing custom modules!

### Issue 4: RuntimeError - mat1 and mat2 shapes mismatch
**Problem**: Matrix multiplication error in forward pass  
**Cause**: Missing pooling step after embedding  
**Solution**: Added `pooled = x.mean(dim=1)` to aggregate sequence  

**Lesson**: Always track tensor shapes through the network!
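
To make the shape tracking concrete, here is a minimal PyTorch sketch of where the pooling step sits. The class name, layer names, ReLU activation, and `padding_idx=0` are assumptions for illustration, not the project's actual model. The default sizes follow the Experiment 3 settings:

```python
import torch
import torch.nn as nn


class MeanPoolClassifier(nn.Module):
    """Embedding -> mean pooling -> hidden layer -> single logit (assumed layout)."""

    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64, dropout=0.6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        x = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        pooled = x.mean(dim=1)            # (batch, embed_dim): the missing step
        h = self.dropout(torch.relu(self.fc1(pooled)))   # (batch, hidden_dim)
        return self.fc2(h).squeeze(-1)    # (batch,)


model = MeanPoolClassifier(vocab_size=1000)
batch = torch.randint(0, 1000, (8, 20))   # 8 tweets, 20 token ids each
logits = model(batch)
print(logits.shape)   # torch.Size([8])
```

Without the `mean(dim=1)` line, `fc1` would receive a 3-D tensor and the downstream shapes would no longer line up. Note that a plain mean also averages over padding positions. A masked mean would be a possible refinement.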

---

## 💭 Key Learnings & Insights

### On Overfitting:
> "High training accuracy is not success—it's often a warning sign. The train/val gap is your most important metric."

### On Hyperparameters:
> "Learning rate matters more than any other hyperparameter. Spend time here first."

### On Architecture:
> "Bigger is not always better. Sometimes a smaller model learns more general patterns."

### On Data:
> "In NLP, vocabulary richness often beats model complexity. Don't throw away rare words too quickly."

### On Patience:
> "Training slower (low LR) with fewer epochs often beats training faster for longer."

### On Validation:
> "Early stopping based on validation loss is not optional—it's essential."

### On Generalization:
> "When test performance > validation performance, you've achieved something special. It means your model truly learned, not memorized."

---

## 🎓 Skills Developed

Through this project, I learned:

1. ✅ **End-to-end ML pipeline** - From raw data to Kaggle submission
2. ✅ **Systematic debugging** - Identify and fix issues methodically
3. ✅ **Hyperparameter tuning** - Test hypotheses and iterate
4. ✅ **Overfitting diagnosis** - Recognize and fix through multiple techniques
5. ✅ **PyTorch fundamentals** - Build custom models, datasets, training loops
6. ✅ **NLP preprocessing** - Text cleaning, tokenization, vocabulary building
7. ✅ **Evaluation metrics** - Precision, recall, F1, confusion matrices
8. ✅ **Training dynamics** - Interpret loss curves, accuracy plots
9. ✅ **Professional workflow** - Version control, documentation, reproducibility

---

## 📈 Next Milestones

- [x] **Phase 1**: PyTorch Fundamentals (val accuracy 72% → 80%; test F1 0.785) ✅
- [ ] **Phase 2**: Transformers (Target: 0.82-0.83 test F1)
- [ ] **Phase 3**: Ensemble & Optimization (Target: 0.84+ test F1, Top 10%)

**Current Status**: Ready for Phase 2! 🚀

---

## 🙏 Acknowledgments

**Learning Resources:**
- Kaggle Competition: Natural Language Processing with Disaster Tweets
- PyTorch Documentation
- Codecademy ML Course
- Trial, error, and persistence!

**Key Insight:**
> "It ain't much, but it's honest work." 🚜
> 
> Building from scratch teaches 10× more than using pre-built solutions.

---

**Last Updated**: October 2025  
**Status**: Phase 1 Complete ✅ | Kaggle Score: 0.78516 | Rank: #658
