# Prepare Food-101 Dataset

This notebook processes the Food-101 dataset, organizing it into train, validation, and test directories for use in training the food recognition model.

In [1]:
import os
import shutil
import random
import json

## Define Paths

Set up paths for the raw dataset and output directories.

In [2]:
# Paths
raw_data_dir = '../data/food101/food-101/images'
output_dir = '../data/food101'
train_dir = os.path.join(output_dir, 'train')
valid_dir = os.path.join(output_dir, 'validation')
test_dir = os.path.join(output_dir, 'test')

# Create output directories
for dir_path in [train_dir, valid_dir, test_dir]:
    os.makedirs(dir_path, exist_ok=True)

## Split Dataset

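Split each class's images into 70% train, 15% validation, and 15% test. Because the split is done per class, every class contributes the same proportions to each split, which keeps the classes balanced.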

In [3]:
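# Get list of food classes (directories only, so stray files are not saved as labels)
food_classes = sorted(
    d for d in os.listdir(raw_data_dir)
    if os.path.isdir(os.path.join(raw_data_dir, d))
)
random.seed(42)  # For reproducibility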

for food_class in food_classes:
    class_path = os.path.join(raw_data_dir, food_class)
    if not os.path.isdir(class_path):
        continue
    
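    # Get all images; sort first because os.listdir order is arbitrary,
    # so the seeded shuffle produces the same split on every machine
    images = sorted(f for f in os.listdir(class_path) if f.endswith('.jpg'))
    random.shuffle(images)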
    
    # Calculate split sizes
    total = len(images)
    train_count = int(total * 0.70)
    valid_count = int(total * 0.15)
    test_count = total - train_count - valid_count
    
    # Assign images to splits
    train_images = images[:train_count]
    valid_images = images[train_count:train_count + valid_count]
    test_images = images[train_count + valid_count:]
    
    # Copy images to respective directories
    for split, split_images, split_dir in [
        ('train', train_images, train_dir),
        ('validation', valid_images, valid_dir),
        ('test', test_images, test_dir)
    ]:
        class_split_dir = os.path.join(split_dir, food_class)
        os.makedirs(class_split_dir, exist_ok=True)
        for img in split_images:
            src = os.path.join(class_path, img)
            dst = os.path.join(class_split_dir, img)
            shutil.copyfile(src, dst)
        print(f'Copied {len(split_images)} images to {split}/{food_class}')

Copied 700 images to train/apple_pie
Copied 150 images to validation/apple_pie
Copied 150 images to test/apple_pie
Copied 700 images to train/baby_back_ribs
Copied 150 images to validation/baby_back_ribs
Copied 150 images to test/baby_back_ribs
Copied 700 images to train/baklava
Copied 150 images to validation/baklava
Copied 150 images to test/baklava
Copied 700 images to train/beef_carpaccio
Copied 150 images to validation/beef_carpaccio
Copied 150 images to test/beef_carpaccio
Copied 700 images to train/beef_tartare
Copied 150 images to validation/beef_tartare
Copied 150 images to test/beef_tartare
Copied 700 images to train/beet_salad
Copied 150 images to validation/beet_salad
Copied 150 images to test/beet_salad
Copied 700 images to train/beignets
Copied 150 images to validation/beignets
Copied 150 images to test/beignets
Copied 700 images to train/bibimbap
Copied 150 images to validation/bibimbap
Copied 150 images to test/bibimbap
Copied 700 images to train/bread_pudding
Copied 15
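
The split arithmetic in the cell above can be isolated into a small helper to see how the 70/15/15 ratios map onto counts. This is a minimal sketch, not part of the original notebook: `split_counts` is a hypothetical name, and it mirrors the cell's behavior of truncating the train and validation counts and giving the remainder to the test split.

```python
def split_counts(total, train_frac=0.70, valid_frac=0.15):
    """Return (train, valid, test) counts for one class.

    Hypothetical helper mirroring the split cell: train and validation
    counts are truncated, and the test split receives the remainder, so
    the three counts always sum to `total`.
    """
    train_count = int(total * train_frac)
    valid_count = int(total * valid_frac)
    test_count = total - train_count - valid_count
    return train_count, valid_count, test_count


# Each class in the output above has 1000 images
print(split_counts(1000))  # (700, 150, 150)
```

Because the test split takes the remainder, any images lost to truncation in the other two counts end up in the test split instead of being dropped.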

## Save Class Labels

Save the list of food classes to `food_classes.json` for use in model inference.

In [4]:
# Save class labels (create the models directory first if it does not exist)
os.makedirs('../models', exist_ok=True)
with open('../models/food_classes.json', 'w') as f:
    json.dump(food_classes, f)
print('Saved food classes to models/food_classes.json')

Saved food classes to models/food_classes.json
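
At inference time the saved list can be loaded back to turn a predicted class index into a label. The following is a minimal sketch, assuming the model's output indices follow the same sorted order as the list saved above. `load_class_names` and `index_to_label` are hypothetical helpers and are not part of the original notebook.

```python
import json


def load_class_names(path):
    """Load the class list written with json.dump (hypothetical helper)."""
    with open(path) as f:
        return json.load(f)


def index_to_label(class_names, index):
    """Map a predicted class index to its label.

    Assumes the model's indices follow the order of the saved list.
    """
    return class_names[index]
```

For example, `load_class_names('../models/food_classes.json')` returns the list saved by the cell above, and `index_to_label(class_names, 0)` returns its first entry.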


## Verify Splits

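Count the images in each split. Food-101 has 101 classes of 1,000 images each, so a 70/15/15 split should give 70,700 train, 15,150 validation, and 15,150 test images.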

In [5]:
def count_images(directory):
    total = 0
    for root, _, files in os.walk(directory):
        total += len([f for f in files if f.endswith('.jpg')])
    return total

print(f'Train images: {count_images(train_dir)}')
print(f'Validation images: {count_images(valid_dir)}')
print(f'Test images: {count_images(test_dir)}')

Train images: 70700
Validation images: 15150
Test images: 15150
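
Total counts alone do not show whether an image landed in more than one split. The sketch below is one way to check that, assuming the `split/class/image.jpg` layout produced above. `list_images` and `find_overlaps` are hypothetical helpers, not part of the original notebook.

```python
import os


def list_images(split_dir):
    """Return the set of 'class/filename' keys for every .jpg under split_dir."""
    keys = set()
    for class_name in os.listdir(split_dir):
        class_dir = os.path.join(split_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        for name in os.listdir(class_dir):
            if name.endswith('.jpg'):
                keys.add(f'{class_name}/{name}')
    return keys


def find_overlaps(*split_dirs):
    """Return the keys that appear in more than one of the given split directories."""
    seen = set()
    overlaps = set()
    for split_dir in split_dirs:
        keys = list_images(split_dir)
        overlaps |= seen & keys  # keys already seen in an earlier split
        seen |= keys
    return overlaps
```

With the directories defined earlier, `find_overlaps(train_dir, valid_dir, test_dir)` should return an empty set on a fresh run. A non-empty result would suggest leftover files from an earlier run with a different seed or split ratio.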
