# 🌸 Image-Grounded Botany-VQA Dataset Generation (Google Colab)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yourusername/Botany-VQA-Corrected/blob/main/generate_dataset_colab.ipynb)

This notebook generates a corrected, image-grounded VQA dataset using the **free GPU** available in Google Colab.

**Estimated Time**: 3-4 hours for the full dataset (8,189 images) on a Colab GPU

## ⚙️ Step 1: Setup GPU Runtime

**IMPORTANT**: Make sure you're using GPU runtime!

1. Click **Runtime** → **Change runtime type**
2. Select **T4 GPU** (or any available GPU)
3. Click **Save**

Verify that a GPU is available:

In [None]:
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Please enable GPU runtime.")

## 📦 Step 2: Install Dependencies

In [None]:
# Install required packages
!pip install -q transformers accelerate bitsandbytes scikit-learn opencv-python

print("✓ Dependencies installed!")

## 📥 Step 3: Clone Repository and Download Dataset

In [None]:
# Clone your repository (update with your GitHub URL)
!git clone https://github.com/yourusername/Botany-VQA-Corrected.git
%cd Botany-VQA-Corrected

# Or upload files manually if not using git
# from google.colab import files
# uploaded = files.upload()  # Upload the Python files

In [None]:
# Download Oxford Flowers 102 dataset
!mkdir -p oxford_flowers_102
%cd oxford_flowers_102

# Download images
!wget -q https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz
!tar -xzf 102flowers.tgz
print("✓ Images downloaded")

# Download labels
!wget -q https://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat
print("✓ Labels downloaded")

# Download category names from a community gist
# (if the URL is unavailable, create this file by hand with one name per line)
!wget -q https://gist.githubusercontent.com/JosephKJ/94c7728ed1a8e0cd87fe6a029769cde1/raw/403325f5110cb0f3099734c5edb9f457539c77e9/Oxford-102_Flower_dataset_labels.txt
print("✓ Category names downloaded")

%cd ..

## 🏷️ Step 4: Create Labels JSON

We need to create a mapping from image filenames to flower names:

In [None]:
import json
import scipy.io
import os

# Load image labels
mat_data = scipy.io.loadmat('oxford_flowers_102/imagelabels.mat')
image_labels = mat_data['labels'][0]  # Array of category IDs (1-102)

# Load category names
with open('oxford_flowers_102/Oxford-102_Flower_dataset_labels.txt', 'r') as f:
    # Skip blank lines so a trailing newline does not create a spurious extra category
    category_names = [line.strip() for line in f if line.strip()]

# Create category ID to name mapping
cat_to_name = {str(i+1): name for i, name in enumerate(category_names)}

# Save cat_to_name.json
with open('oxford_flowers_102/cat_to_name.json', 'w') as f:
    json.dump(cat_to_name, f, indent=2)

# Get image files
image_files = sorted([f for f in os.listdir('oxford_flowers_102/jpg') if f.endswith('.jpg')])

# Create labels.json
labels_dict = {}
for idx, image_file in enumerate(image_files):
    if idx < len(image_labels):
        category_id = str(image_labels[idx])
        flower_name = cat_to_name.get(category_id, f"unknown_{category_id}")
        labels_dict[image_file] = flower_name

# Save labels.json
with open('oxford_flowers_102/labels.json', 'w') as f:
    json.dump(labels_dict, f, indent=2)

print(f"✓ Created labels for {len(labels_dict)} images")
print("\nSample labels:")
for img, label in list(labels_dict.items())[:3]:
    print(f"  {img}: {label}")
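
The loop above pairs the i-th sorted filename with the i-th label, which works because the filenames are zero-padded (`image_00001.jpg`, `image_00002.jpg`, ...). A slightly more defensive variant, sketched below, reads the image number out of the filename instead, so a missing or extra file cannot shift every later label. The helper name `build_labels` and the toy inputs are illustrative and not part of the repository.

```python
import re

def build_labels(image_files, image_labels, cat_to_name):
    """Map each filename to a flower name using the number embedded in the filename.

    Assumes names look like image_00001.jpg and that image_labels[n - 1]
    holds the category ID for image number n (hypothetical helper).
    """
    pattern = re.compile(r"image_(\d+)\.jpg$")
    labels = {}
    for name in image_files:
        match = pattern.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        number = int(match.group(1))
        if not 1 <= number <= len(image_labels):
            continue  # no label available for this image number
        category_id = str(int(image_labels[number - 1]))
        labels[name] = cat_to_name.get(category_id, f"unknown_{category_id}")
    return labels

# Toy data: image_00002.jpg is missing, yet image_00003.jpg still gets category 3.
toy_files = ["image_00001.jpg", "image_00003.jpg", "notes.txt"]
toy_labels = [1, 2, 3]
toy_names = {"1": "flower A", "2": "flower B", "3": "flower C"}
toy_result = build_labels(toy_files, toy_labels, toy_names)
print(toy_result)
```

With the real data, the call would be `build_labels(image_files, image_labels, cat_to_name)` using the variables defined in the cell above.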

## 📝 Step 5: Upload Python Modules

If you didn't clone from GitHub, upload the Python files:

In [None]:
# Option 1: If you cloned from GitHub, skip this cell

# Option 2: Upload files manually
from google.colab import files

print("Please upload these files:")
print("  - dataset_generator.py")
print("  - question_templates.py")
print("  - visual_feature_extractor.py")
print("  - vqa_validator.py")

uploaded = files.upload()

## 🤖 Step 6: Load BLIP-2 Model

In [None]:
from dataset_generator import BotanyVQAGenerator

# Initialize generator with GPU
generator = BotanyVQAGenerator(
    model_name="Salesforce/blip2-opt-2.7b",  # or "Salesforce/blip2-flan-t5-xl" for better quality
    device="cuda"  # Use GPU
)

print("✓ Model loaded on GPU!")

## 🧪 Step 7: Test on Sample Image

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

# Load a sample image
sample_image_path = "oxford_flowers_102/jpg/image_00001.jpg"
sample_image = Image.open(sample_image_path)

# Display image
plt.figure(figsize=(6, 6))
plt.imshow(sample_image)
plt.axis('off')
plt.title("Sample Flower Image")
plt.show()

# Ask test questions
test_questions = [
    "What type of flower is this?",
    "What color are the petals?",
    "How many petals are visible?"
]

print("\nTest Questions and Answers:")
print("=" * 60)
for question in test_questions:
    answer = generator.ask_question(sample_image_path, question)
    print(f"Q: {question}")
    print(f"A: {answer}")
    print()

## 🚀 Step 8: Generate Pilot Dataset (100 images)

Let's start with a pilot to test quality:

In [None]:
# Generate pilot dataset
pilot_df = generator.generate_dataset(
    image_dir="oxford_flowers_102/jpg",
    labels_file="oxford_flowers_102/labels.json",
    output_csv="botany_vqa_pilot.csv",
    num_images=100,  # Only 100 images for pilot
    qa_per_image=10
)

print("\n✓ Pilot dataset generated!")
print(f"Total QA pairs: {len(pilot_df)}")
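
Timing the pilot gives a rough way to check the 3-4 hour estimate before committing to the full run. The sketch below extrapolates linearly from a measured pilot duration. `estimate_full_runtime_hours` is an illustrative helper, and the 150-second figure is a made-up example rather than a measured result.

```python
def estimate_full_runtime_hours(pilot_seconds, pilot_images, total_images=8189):
    """Linearly extrapolate pilot wall-clock time to the full image count."""
    if pilot_images <= 0:
        raise ValueError("pilot_images must be positive")
    seconds_per_image = pilot_seconds / pilot_images
    return seconds_per_image * total_images / 3600

# Example: a hypothetical pilot of 100 images that took 150 seconds.
hours = estimate_full_runtime_hours(pilot_seconds=150, pilot_images=100)
print(f"Estimated full run: {hours:.2f} hours")
```

To get a real `pilot_seconds`, wrap the pilot call in `time.perf_counter()` readings. The estimate assumes per-image cost stays constant, which ignores one-time costs such as model warm-up.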

## ✅ Step 9: Validate Pilot Dataset

In [None]:
from vqa_validator import VQAValidator

# Run validation
validator = VQAValidator(pilot_df['flower_category'].unique().tolist())
validation_results = validator.run_all_validations(pilot_df)

# Print report
report = validator.generate_validation_report(validation_results)
print(report)
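
Independently of `VQAValidator`, a few cheap checks on the raw rows can catch obvious problems early. The sketch below counts empty answers and exact duplicate question/answer pairs per image. The column names `image`, `question`, and `answer` are assumptions about the CSV schema (only `question_type`, `difficulty_level`, and `flower_category` appear in this notebook), so adjust them to the actual output.

```python
from collections import Counter

def quick_checks(rows, image_col="image", question_col="question", answer_col="answer"):
    """Return simple quality counts for a list of row dicts.

    Column names are assumed, not taken from the repository.
    """
    empty_answers = 0
    seen = Counter()
    for row in rows:
        answer = str(row.get(answer_col, "")).strip()
        if not answer:
            empty_answers += 1
        seen[(row.get(image_col), row.get(question_col), answer)] += 1
    # Every repeat beyond the first occurrence counts as one duplicate.
    duplicates = sum(count - 1 for count in seen.values() if count > 1)
    return {"rows": len(rows), "empty_answers": empty_answers, "duplicates": duplicates}

# Toy rows standing in for the pilot CSV.
toy_rows = [
    {"image": "image_00001.jpg", "question": "What color are the petals?", "answer": "pink"},
    {"image": "image_00001.jpg", "question": "What color are the petals?", "answer": "pink"},
    {"image": "image_00002.jpg", "question": "How many petals are visible?", "answer": " "},
]
summary = quick_checks(toy_rows)
print(summary)
```

On the pilot DataFrame this could be called as `quick_checks(pilot_df.fillna("").to_dict("records"))`, where `fillna("")` keeps missing values from being counted as the string `"nan"`.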

## 📊 Step 10: Visualize Pilot Results

In [None]:
import matplotlib.pyplot as plt

# Question type distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
pilot_df['question_type'].value_counts().plot(kind='bar')
plt.title('Question Type Distribution')
plt.xlabel('Question Type')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')

plt.subplot(1, 2, 2)
pilot_df['difficulty_level'].value_counts().sort_index().plot(kind='bar')
plt.title('Difficulty Level Distribution')
plt.xlabel('Difficulty Level')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

## 💾 Step 11: Download Pilot Dataset

In [None]:
from google.colab import files

# Download pilot dataset
files.download('botany_vqa_pilot.csv')
print("✓ Pilot dataset downloaded!")

## 🎯 Step 12: Generate FULL Dataset (All 8,189 images)

⚠️ **WARNING**: This will take 3-4 hours with Colab GPU!

**Tips to avoid timeout**:
1. Keep the browser tab active
2. Click in the notebook occasionally to prevent idle timeout
3. Consider using Colab Pro for longer runtime
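
If the session still disconnects, the cost is much smaller when results are written to disk incrementally and a rerun can skip finished images. Whether `generate_dataset` already does this is not shown here, so the sketch below is a generic pattern rather than the repository's behavior: `process_image` is a hypothetical stand-in for per-image QA generation, and the CSV columns are illustrative.

```python
import csv
import os
import tempfile

FIELDS = ["image", "question", "answer"]  # illustrative columns, not the repository's schema

def load_done(csv_path):
    """Return the set of image names already present in the checkpoint CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["image"] for row in csv.DictReader(f)}

def run_with_checkpoint(image_names, process_image, csv_path):
    """Process images not yet in csv_path, appending each image's rows as it finishes."""
    done = load_done(csv_path)
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        for name in image_names:
            if name in done:
                continue  # finished in an earlier session
            rows = [{"image": name, "question": q, "answer": a}
                    for q, a in process_image(name)]
            writer.writerows(rows)  # write one image's rows together
            f.flush()  # push rows to disk before moving on
    return load_done(csv_path)

# Demo with a fake generator: the second call only processes the new image.
processed = []

def fake_process(name):
    processed.append(name)
    return [("What color are the petals?", "pink")]

checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")
run_with_checkpoint(["image_00001.jpg", "image_00002.jpg"], fake_process, checkpoint)
finished = run_with_checkpoint(
    ["image_00001.jpg", "image_00002.jpg", "image_00003.jpg"], fake_process, checkpoint
)
print(processed)
```

For the checkpoint to survive a lost runtime, `csv_path` would need to point at persistent storage such as a mounted Google Drive folder rather than the Colab VM's local disk.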

In [None]:
# Generate full dataset
full_df = generator.generate_dataset(
    image_dir="oxford_flowers_102/jpg",
    labels_file="oxford_flowers_102/labels.json",
    output_csv="botany_vqa_grounded.csv",
    num_images=None,  # Process ALL images
    qa_per_image=10
)

print("\n✓ Full dataset generated!")
print(f"Total QA pairs: {len(full_df)}")

## 📈 Step 13: Generate Statistics

In [None]:
# Generate statistics
generator.generate_statistics(full_df, "dataset_statistics.json")

# Load and display
with open("dataset_statistics.json", "r") as f:
    stats = json.load(f)

print("\nDataset Statistics:")
print(json.dumps(stats, indent=2))

## ✅ Step 14: Final Validation

In [None]:
# Run final validation
validator = VQAValidator(full_df['flower_category'].unique().tolist())
validation_results = validator.run_all_validations(full_df)
report = validator.generate_validation_report(validation_results)

print(report)

# Save report
with open("validation_report.txt", "w") as f:
    f.write(report)

## 💾 Step 15: Download Full Dataset

In [None]:
from google.colab import files

# Download all generated files
files.download('botany_vqa_grounded.csv')
files.download('dataset_statistics.json')
files.download('validation_report.txt')

print("✓ All files downloaded!")

## 📤 Step 16: Save to Google Drive (Optional)

To avoid losing your work if Colab disconnects:

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Create directory in Drive
!mkdir -p '/content/drive/MyDrive/Botany-VQA-Dataset'

# Copy files to Drive
!cp botany_vqa_grounded.csv '/content/drive/MyDrive/Botany-VQA-Dataset/'
!cp dataset_statistics.json '/content/drive/MyDrive/Botany-VQA-Dataset/'
!cp validation_report.txt '/content/drive/MyDrive/Botany-VQA-Dataset/'

print("✓ Files saved to Google Drive!")
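
The `!cp` commands report an error for any file that is missing, for example when the run stopped before the validation report was written. A Python variant that copies only the files that exist and reports the rest is sketched below. `copy_existing` is an illustrative helper, not part of the repository, and the demo uses a temporary directory in place of Drive.

```python
import os
import shutil
import tempfile

def copy_existing(filenames, dest_dir):
    """Copy each existing file into dest_dir and return (copied, missing) lists."""
    os.makedirs(dest_dir, exist_ok=True)
    copied, missing = [], []
    for name in filenames:
        if os.path.isfile(name):
            shutil.copy2(name, dest_dir)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing

# Demo: one file exists, one does not.
workdir = tempfile.mkdtemp()
existing = os.path.join(workdir, "botany_vqa_grounded.csv")
with open(existing, "w", encoding="utf-8") as f:
    f.write("image,question,answer\n")
absent = os.path.join(workdir, "validation_report.txt")
copied, missing = copy_existing([existing, absent], os.path.join(workdir, "backup"))
print("copied:", copied)
print("missing:", missing)
```

In the notebook, the destination would be `'/content/drive/MyDrive/Botany-VQA-Dataset'` after `drive.mount` has succeeded.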

## 🎉 Done!

Your image-grounded Botany-VQA dataset is ready!

**What you have:**
- ✅ `botany_vqa_grounded.csv` - Full dataset (~82,000 QA pairs)
- ✅ `dataset_statistics.json` - Dataset metrics
- ✅ `validation_report.txt` - Quality validation results

**Next steps:**
1. Review the validation report
2. Upload dataset to GitHub/HuggingFace
3. Use for your VQA research
4. Train your VQA model on the corrected dataset!