# CIFAR-100 Competition - Data Exploration

Welcome to the CIFAR-100 image classification competition!

**Goal:** Build a CNN that achieves the highest accuracy on the test set.

**Dataset:** CIFAR-100 has 100 classes of 32×32 color images.

**What You'll Do:**
1. Explore the CIFAR-100 dataset in this notebook
2. Modify `model.py` to improve the CNN architecture
3. Modify `main.py` to add data augmentations
4. Run `python main.py` to train and generate `submission.csv`
5. Submit `submission.csv` to Kaggle

Good luck! 🚀

In [None]:
# Import libraries
import torch
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 
                      'mps' if torch.backends.mps.is_available() else 'cpu')
print(f'Using device: {device}')

## Part 1: Load and Explore CIFAR-100

CIFAR-100 is much more challenging than CIFAR-10:
- 100 fine-grained classes (vs 10 in CIFAR-10)
- Same tiny 32×32 pixel images
- 500 training images per class (50,000 total), versus 5,000 per class in CIFAR-10
- 100 test images per class (10,000 total)

In [None]:
# Load CIFAR-100 dataset
basic_transform = transforms.Compose([transforms.ToTensor()])

train_dataset = datasets.CIFAR100(root='./data', train=True, download=True, transform=basic_transform)
test_dataset = datasets.CIFAR100(root='./data', train=False, download=True, transform=basic_transform)

print(f'Training images: {len(train_dataset)}')
print(f'Test images: {len(test_dataset)}')
print(f'Number of classes: {len(train_dataset.classes)}')
print(f'Image shape: {train_dataset[0][0].shape}')  # (3, 32, 32)

## Part 2: Visualize Random Samples

Let's see what kinds of images we're working with!

In [None]:
# Visualize random samples from the training set
fig, axes = plt.subplots(4, 8, figsize=(16, 8))
for i, ax in enumerate(axes.flat):
    img, label = train_dataset[np.random.randint(len(train_dataset))]
    ax.imshow(img.permute(1, 2, 0))  # Convert from (C, H, W) to (H, W, C)
    ax.set_title(f'Class {label}', fontsize=8)
    ax.axis('off')
plt.suptitle('Random CIFAR-100 Training Samples', fontsize=16)
plt.tight_layout()
plt.show()

print('\nCIFAR-100 has 100 different classes!')
print('Much more challenging than CIFAR-10 (10 classes)')

## Part 3: Class Distribution

Let's verify that the dataset is balanced (equal number of images per class).

In [None]:
# Count images per class in the training set.
# Reading .targets avoids decoding and transforming all 50,000 images.
train_labels = train_dataset.targets
label_counts = Counter(train_labels)

# Plot distribution
plt.figure(figsize=(14, 4))
plt.bar(label_counts.keys(), label_counts.values(), color='steelblue', alpha=0.7)
plt.xlabel('Class ID')
plt.ylabel('Number of Images')
plt.title('CIFAR-100 Training Set - Class Distribution')
plt.axhline(y=500, color='red', linestyle='--', label='Expected: 500 per class')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f'Total classes: {len(label_counts)}')
print(f'Images per class: {label_counts[0]}')
print('Dataset is balanced!' if len(set(label_counts.values())) == 1 else 'Dataset is imbalanced!')

## Part 4: Image Statistics

Understanding pixel value distributions helps us choose good normalization strategies.

In [None]:
# Sample 1000 random images and compute statistics
sample_images = [train_dataset[i][0] for i in np.random.choice(len(train_dataset), 1000, replace=False)]
sample_tensor = torch.stack(sample_images)

# Compute mean and std per channel
mean = sample_tensor.mean(dim=[0, 2, 3])
std = sample_tensor.std(dim=[0, 2, 3])

print('Pixel Statistics (from 1000 random images):')
print(f'Mean (R, G, B): {mean.numpy()}')
print(f'Std  (R, G, B): {std.numpy()}')
print('\nNote: Values are in [0, 1] range after ToTensor()')
print('These statistics can be used for normalization in your transforms!')

## Part 5: Visualize Samples from Specific Classes

Let's look at multiple samples from the same class to understand intra-class variation.

In [None]:
# Pick a random class
target_class = np.random.randint(0, 100)

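# Collect the first 16 training images of this class.
# Labels are looked up via .targets so only the 16 selected images are decoded.
class_indices = [i for i, t in enumerate(train_dataset.targets) if t == target_class][:16]
class_images = [train_dataset[i] for i in class_indices]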

# Visualize
fig, axes = plt.subplots(4, 4, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    if i < len(class_images):
        img, label = class_images[i]
        ax.imshow(img.permute(1, 2, 0))
        ax.axis('off')
plt.suptitle(f'16 Samples from Class {target_class}', fontsize=16)
plt.tight_layout()
plt.show()

print('Notice the variation within the same class!')
print('Different angles, lighting, colors - this is why data augmentation helps!')

## Part 6: Compare with Test Set

Let's visualize some test images to see whether they look different from the training images.

In [None]:
# Visualize test samples
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
for i, ax in enumerate(axes.flat):
    img, label = test_dataset[np.random.randint(len(test_dataset))]
    ax.imshow(img.permute(1, 2, 0))
    ax.set_title(f'Class {label}', fontsize=8)
    ax.axis('off')
plt.suptitle('Random CIFAR-100 Test Samples', fontsize=16)
plt.tight_layout()
plt.show()

print('Standard CIFAR-100 test set looks similar to training set.')
print('However, the COMPETITION test set will have augmentations!')
print('(noise, blur, color shifts, etc.)')

## Next Steps: Training Your Model

Now that you've explored the data, it's time to build and train your model!

### 1. Improve the Model (`model.py`)
- Add more convolutional layers
- Add BatchNorm for better training
- Try different architectures
- Experiment with dropout rates
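
The ideas above can be sketched as a small network. This is one possible starting point, not the architecture in `model.py`: the names `conv_block` and `SmallCNN`, the channel widths, and the dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """3x3 convolution followed by BatchNorm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SmallCNN(nn.Module):
    """Hypothetical example network for 32x32 RGB inputs."""

    def __init__(self, num_classes=100, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64), conv_block(64, 64), nn.MaxPool2d(2),      # 32x32 -> 16x16
            conv_block(64, 128), conv_block(128, 128), nn.MaxPool2d(2),  # 16x16 -> 8x8
            conv_block(128, 256), nn.AdaptiveAvgPool2d(1),               # 8x8 -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)
```

The convolutions drop their bias because the following BatchNorm layer has its own learnable shift, which makes the bias redundant.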

### 2. Add Data Augmentation (`main.py`)
**This is KEY to success!** The competition test set has augmentations.

Suggested augmentations in `get_transforms()`:
```python
transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)
transforms.RandomRotation(15)
transforms.RandomAffine(degrees=0, translate=(0.1, 0.1))
transforms.RandomGrayscale(p=0.1)
```

### 3. Train the Model
```bash
python main.py
```

This will:
- Train your model for 10 epochs (adjust `EPOCHS` in `main.py` for longer training)
- Save the best model as `best_model.pth`
- Generate `submission.csv` for Kaggle (if `test.csv` and `test_images/` are present)

### 4. Submit to Kaggle
1. Download `test.csv` and `test_images.zip` from Kaggle
2. Unzip `test_images.zip` in the `cifar100_comp/` folder
3. Run `python main.py` to generate `submission.csv`
4. Upload `submission.csv` to Kaggle!

---

## Tips for Success 💡

1. **Data Augmentation is CRITICAL!** The test set has augmentations.
2. Train longer (20-50 epochs) for better performance
3. Experiment with different learning rates and optimizers
4. Add BatchNorm to your model architecture
5. Monitor training vs. held-out (validation) accuracy to detect overfitting
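
Tip 5 can be sketched as follows. This is a toy example under stated assumptions: random tensors stand in for CIFAR-100, a single linear layer stands in for your CNN, and the `accuracy` helper is hypothetical rather than part of `main.py`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

@torch.no_grad()
def accuracy(model, loader):
    """Fraction of correctly classified samples in the loader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Synthetic stand-in data: 200 random "images" with random labels.
torch.manual_seed(0)
full = TensorDataset(torch.rand(200, 3, 32, 32), torch.randint(0, 100, (200,)))
train_set, val_set = random_split(full, [160, 40])  # hold out 20% for validation
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(2):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
    train_acc = accuracy(model, train_loader)
    val_acc = accuracy(model, val_loader)
    print(f'epoch {epoch}: train={train_acc:.3f} val={val_acc:.3f} gap={train_acc - val_acc:.3f}')
```

A gap that keeps widening while validation accuracy stalls is the usual sign of overfitting, and a cue to add augmentation, dropout, or weight decay.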

Good luck! 🚀