# Image Caption Generator - Quick Start

This notebook provides an interactive quick start guide to the image captioning project.

**What you'll do:**
1. Set up the environment
2. Download the Flickr8k dataset
3. Preprocess and explore the data
4. Train a small demo model
5. Generate captions
6. Compare generated captions with the reference captions

## 1. Set Up Environment

In [1]:
# Install PyTorch
!pip install torch torchvision torchaudio



In [2]:
# Install other required packages
!pip install pandas numpy matplotlib seaborn pillow tqdm nltk




In [5]:
!pip install --upgrade typing_extensions




In [7]:
# Check GPU availability
import torch
import sys
from pathlib import Path

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Python version: 3.11.7 (main, Dec 15 2023, 12:09:56) [Clang 14.0.6 ]
PyTorch version: 2.9.1
CUDA available: False


In [9]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All libraries imported successfully!")

✓ All libraries imported successfully!


## 2. Download Dataset

**Option A:** Run this if you have the Kaggle API set up

In [11]:
# Download Flickr8k dataset
import os

# Save current directory
notebook_dir = os.getcwd()
print(f"Current directory: {notebook_dir}")

# Go up to the project root (Caption_Gen)
os.chdir('..')
print(f"Changed to: {os.getcwd()}")

# Run download script
!python data/download_dataset.py

# Go back to Jupyter folder
os.chdir(notebook_dir)


Current directory: /Users/bahar/Caption_Gen/Jupyter
Changed to: /Users/bahar/Caption_Gen
Downloading Flickr8k dataset from Kaggle...
Make sure you have kaggle.json in ~/.kaggle/
Get your API token from: https://www.kaggle.com/settings/account


✗ Kaggle CLI not found. Install it with:
pip install kaggle

Or download manually from:
https://www.kaggle.com/datasets/adityajn105/flickr8k


**Option B:** Manual download

If the Kaggle API doesn't work:
1. Go to https://www.kaggle.com/datasets/adityajn105/flickr8k
2. Download and extract to `data/flickr8k/`
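
If you downloaded the archive through the browser, the extraction step can be scripted. The sketch below is only an illustration: the helper name `extract_dataset` and the archive location in the usage comment are assumptions, so adjust them to match the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, target_dir):
    """Extract a downloaded zip archive into target_dir and return the file count."""
    archive_path = Path(archive_path)
    target_dir = Path(target_dir)
    if not archive_path.exists():
        raise FileNotFoundError(f"Archive not found: {archive_path}")
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
        # Count regular files only (directory entries end with '/')
        return sum(1 for name in zf.namelist() if not name.endswith('/'))


# Usage (hypothetical download location, run from the notebook folder):
# extract_dataset(Path.home() / 'Downloads' / 'archive.zip', '../data/flickr8k')
```

After extraction, the dataset check in the next cell should find `captions.txt` and the `Images` folder.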

In [None]:
# Check if dataset exists
data_dir = Path('../data/flickr8k')
images_dir = data_dir / 'Images'
captions_file = data_dir / 'captions.txt'

if captions_file.exists():
    num_images = len(list(images_dir.glob('*.jpg')))
    print("✓ Dataset found!")
    print(f"  Images: {num_images}")
    print(f"  Location: {data_dir}")
else:
    print("✗ Dataset not found. Please download it first.")

## 3. Preprocess Data

In [None]:
# Preprocess data
import os

notebook_dir = os.getcwd()
os.chdir('..')  # Go to the project root (Caption_Gen)

!python data/preprocess.py

os.chdir(notebook_dir)  # Return to Jupyter
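
The save/`chdir`/restore pattern used in these cells leaves the notebook in the wrong directory if a cell is interrupted partway through. A minimal sketch of a safer variant follows. `working_directory` is a hypothetical helper, not part of the project. On Python 3.11+, the standard library's `contextlib.chdir` does the same job.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises or is interrupted
        os.chdir(previous)


# Usage (runs the preprocessing script from the project root):
# import subprocess, sys
# with working_directory('..'):
#     subprocess.run([sys.executable, 'data/preprocess.py'], check=True)
```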


## 4. Quick Data Exploration

In [None]:
# Load processed captions
df = pd.read_csv('../data/flickr8k/processed/captions_clean.csv')

print(f"Total captions: {len(df):,}")
print(f"Unique images: {df['image'].nunique():,}")
print(f"Captions per image: {len(df) / df['image'].nunique():.1f}")
print(f"\nFirst few rows:")
df.head(10)

In [None]:
# Caption length distribution
df['length'] = df['caption'].apply(lambda x: len(x.split()))

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(df['length'], bins=30, edgecolor='black', alpha=0.7)
plt.axvline(df['length'].mean(), color='red', linestyle='--', label=f'Mean: {df["length"].mean():.1f}')
plt.xlabel('Caption Length (words)')
plt.ylabel('Frequency')
plt.title('Distribution of Caption Lengths')
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
df['length'].plot(kind='box', vert=False)
plt.xlabel('Caption Length (words)')
plt.title('Caption Length Box Plot')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Statistics:")
print(f"  Mean: {df['length'].mean():.2f} words")
print(f"  Median: {df['length'].median():.0f} words")
print(f"  Min: {df['length'].min()} words")
print(f"  Max: {df['length'].max()} words")
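
One practical use of these statistics is sanity-checking the maximum caption length used at generation time (the generation cell later passes `max_length=40`). The sketch below computes a nearest-rank percentile of caption lengths. The helper `length_percentile` and the toy captions are made up for illustration and do not come from the dataset.

```python
import math


def length_percentile(captions, pct):
    """Return the nearest-rank percentile of caption lengths, in words."""
    lengths = sorted(len(c.split()) for c in captions)
    if not lengths:
        raise ValueError("captions is empty")
    # Nearest-rank method: smallest length with at least pct% of values at or below it
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]


# Toy captions (illustrative only)
toy_captions = [
    "a dog runs",
    "two children play in the park",
    "a man rides a bike down a hill",
    "a cat",
]
print(length_percentile(toy_captions, 50))
print(length_percentile(toy_captions, 100))
```

With the real data you could call `length_percentile(df['caption'], 99)` and check that the result sits comfortably below the decoding limit.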

In [None]:
# Show sample images with captions
import random

sample_images = df['image'].unique()[:6]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, img_name in enumerate(sample_images):
    # Load image
    img_path = images_dir / img_name
    img = Image.open(img_path)
    
    # Get captions
    captions = df[df['image'] == img_name]['caption'].values
    
    # Display
    axes[idx].imshow(img)
    axes[idx].axis('off')
    axes[idx].set_title(f"{img_name}\n{captions[0][:50]}...", fontsize=9)

plt.tight_layout()
plt.show()

## 5. Train Model (Small Demo)

**Note:** For full training, use the Python script:
```bash
python training/train.py --model cnn_lstm --epochs 20
```

Here we'll train for just 2 epochs to demonstrate:

In [None]:
# Train model
import os

notebook_dir = os.getcwd()
os.chdir('..')  # Go to the project root (Caption_Gen)

!python training/train.py \
  --model cnn_lstm \
  --epochs 2 \
  --batch_size 32 \
  --learning_rate 0.0001 \
  --experiment_name jupyter_training

os.chdir(notebook_dir)  # Return to Jupyter


## 6. Generate Captions

Load a trained model and generate captions:

In [None]:
# Load trained model
sys.path.append('../')
from models.cnn_lstm import CNNLSTMModel
from torchvision import transforms

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load checkpoint (use your best model)
checkpoint_path = '../checkpoints/cnn_lstm_best.pth'

if Path(checkpoint_path).exists():
    model, vocab, epoch, loss = CNNLSTMModel.load_from_checkpoint(checkpoint_path, device)
    print(f"✓ Model loaded from epoch {epoch}")
    print(f"  Validation loss: {loss:.4f}")
    print(f"  Vocabulary size: {len(vocab)}")
else:
    print("✗ No trained model found. Please train first.")
    print("  Run: python training/train.py --model cnn_lstm --epochs 20")

In [None]:
# Define transform
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                       std=[0.229, 0.224, 0.225])
])

def generate_caption_for_image(image_path, beam_size=3):
    """Generate caption for an image"""
    # Load and preprocess
    image = Image.open(image_path).convert('RGB')
    image_tensor = transform(image).unsqueeze(0).to(device)
    
    # Generate caption
    caption = model.generate_caption(image_tensor, vocab, max_length=40, beam_size=beam_size)
    
    return image, caption
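
The `beam_size` argument controls how many partial captions are kept alive at each decoding step. The project's `generate_caption` internals are not shown here, so the following is a generic sketch of beam search over a toy next-token table, not the model's actual implementation. The names `beam_search`, `toy_step`, and `TOY_MODEL` are all hypothetical.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_size=3, max_length=10):
    """Generic beam search; step_fn(sequence) returns {next_token: probability}."""
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                if prob <= 0:
                    continue
                candidates.append((seq + [token], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_size highest-scoring partial sequences
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached end_token
    pool = finished or beams
    return max(pool, key=lambda item: item[1])[0]


# Toy next-token probabilities keyed by the previous token (illustrative only)
TOY_MODEL = {
    '<start>': {'a': 0.6, 'the': 0.4},
    'a': {'dog': 0.55, 'cat': 0.45},
    'the': {'dog': 0.9, 'cat': 0.1},
    'dog': {'<end>': 1.0},
    'cat': {'<end>': 1.0},
}


def toy_step(sequence):
    return TOY_MODEL[sequence[-1]]


print(beam_search(toy_step, '<start>', '<end>', beam_size=1))
print(beam_search(toy_step, '<start>', '<end>', beam_size=2))
```

With `beam_size=1` the search is greedy and commits to the locally best first word, while a wider beam can recover a sequence whose overall probability is higher. Larger beams cost proportionally more computation per step.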

In [None]:
# Generate captions for sample images
test_images = list(images_dir.glob('*.jpg'))[:6]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, img_path in enumerate(test_images):
    image, caption = generate_caption_for_image(img_path)
    
    axes[idx].imshow(image)
    axes[idx].axis('off')
    axes[idx].set_title(f"Generated: {caption}", fontsize=10, wrap=True)

plt.tight_layout()
plt.show()

## 7. Compare with Ground Truth

In [None]:
# Compare generated vs reference captions
sample_img = random.choice(list(images_dir.glob('*.jpg')))
img_name = sample_img.name

# Get reference captions
reference_captions = df[df['image'] == img_name]['caption'].values

# Generate caption
image, generated_caption = generate_caption_for_image(sample_img, beam_size=5)

# Display
plt.figure(figsize=(10, 6))
plt.imshow(image)
plt.axis('off')
plt.title(f"Image: {img_name}", fontsize=12, fontweight='bold')
plt.show()

print("\n" + "="*70)
print("GENERATED CAPTION:")
print("="*70)
print(f"→ {generated_caption}")
print("\n" + "="*70)
print("REFERENCE CAPTIONS:")
print("="*70)
for i, ref in enumerate(reference_captions, 1):
    print(f"{i}. {ref}")
print("="*70)
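
Reading the captions side by side is a useful first check, but a rough numeric score helps when comparing many images. The sketch below computes clipped n-gram precision, the building block of BLEU, in plain Python. It is a simplification rather than the full metric (no brevity penalty, no averaging over n-gram orders), and the helper names and example captions are made up for illustration. For real BLEU scores, `nltk.translate.bleu_score.sentence_bleu` is available in the `nltk` package installed earlier.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams found in the references, with clipped counts."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count seen in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# Illustrative captions (not taken from the dataset)
generated = "a dog runs on the grass"
references = [
    "a dog is running on the grass",
    "a brown dog runs through grass",
]
print(f"Unigram precision: {clipped_precision(generated, references, n=1):.2f}")
print(f"Bigram precision:  {clipped_precision(generated, references, n=2):.2f}")
```

In the notebook, the same function could be called as `clipped_precision(generated_caption, reference_captions)`.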

## 8. Next Steps

Now that you've completed the quick start:

1. **Full Training:** Train for 20+ epochs using the Python script (`training/train.py`)
2. **Evaluation:** Check `02_model_evaluation.ipynb`
3. **Explainability:** See `03_explainability.ipynb` for Grad-CAM
4. **Analysis:** Explore `04_error_analysis.ipynb`
5. **Presentation:** Use `05_presentation.ipynb` for final results

---

**Good luck with your project! 🚀**