Dataset Preparation for Image Recommendation System

This notebook handles downloading, organizing, and preprocessing the dataset used to train and test the image recommendation system. It takes fashion product images from Kaggle and sorts them into train and test folders.

The notebook is structured as follows:

- Import necessary libraries and set up environment
  - Uses `os`, `shutil`, `pandas`, and `train_test_split` from sklearn.
  - Configures the dataset downloaded from Kaggle.
- Load and preprocess metadata
  - Loads `styles.csv`, fills missing category values, drops rare classes, and makes a stratified train/test split.
- Copy images to train/test folders
  - Copies the images that exist on disk into the directory matching their split.

In [1]:
import os
import shutil
import pandas as pd
from sklearn.model_selection import train_test_split

Imports modules for file operations (`os`, `shutil`), data handling (`pandas`), and splitting datasets into training and testing sets (`train_test_split`).


In [None]:
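# Load the metadata; malformed rows are skipped instead of raising an error
# (forward slashes keep the path portable across operating systems)
df = pd.read_csv('fashion-dataset/styles.csv', on_bad_lines='skip')
# Fill missing categories with defaults so the combined key below is never NaN
df = df.fillna({'masterCategory': 'Apparel', 'subCategory': 'Topwear', 'season': 'Summer', 'usage': 'Casual'})
# Combined key used to stratify the train/test split
df['stratify_col'] = df['masterCategory'] + "_" + df['subCategory'] + "_" + df['season'] + "_" + df['usage']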

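Loads `styles.csv` into a DataFrame, skipping malformed lines, and fills missing values in `masterCategory`, `subCategory`, `season`, and `usage` with default categories. It then creates `stratify_col`, a key that joins those four columns with underscores and is used later for stratified sampling.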
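To make the order of operations concrete, here is a minimal sketch on a toy DataFrame. The rows are hypothetical and not taken from `styles.csv`; only the column names and default values mirror the cell above. It illustrates why the defaults are filled in first: string concatenation with a missing value would otherwise produce a missing key.

```python
import pandas as pd

# Hypothetical rows with a few missing values (not real dataset entries)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "masterCategory": ["Apparel", None, "Footwear"],
    "subCategory": ["Topwear", "Topwear", None],
    "season": ["Summer", "Winter", None],
    "usage": [None, "Casual", "Sports"],
})

# Same defaults as the notebook cell
defaults = {
    "masterCategory": "Apparel",
    "subCategory": "Topwear",
    "season": "Summer",
    "usage": "Casual",
}
toy = toy.fillna(defaults)

# Build the combined key after filling, so no key ends up missing
toy["stratify_col"] = (
    toy["masterCategory"] + "_" + toy["subCategory"] + "_"
    + toy["season"] + "_" + toy["usage"]
)
print(toy["stratify_col"].tolist())
```

Each row ends up with a complete four-part key, with defaults standing in wherever a value was missing.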


In [3]:
# Filter out rare classes
min_instances = 2
class_counts = df['stratify_col'].value_counts()
valid_classes = class_counts[class_counts >= min_instances].index
df = df[df['stratify_col'].isin(valid_classes)]

# Train-test split
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['stratify_col'], random_state=42)

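Drops classes with fewer than 2 instances, because `train_test_split` raises an error when a stratification class has only one member. The remaining data is split 80/20 into training and test sets, stratified on the combined key so that both sets keep roughly the same class distribution. Setting `random_state=42` makes the split reproducible.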
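The effect of the rare-class filter and the stratified split can be checked on a small example. This is a sketch with made-up class labels `A`, `B`, and `C`, not the real categories; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two classes with 10 rows each and one singleton class
toy = pd.DataFrame({
    "id": range(21),
    "stratify_col": ["A"] * 10 + ["B"] * 10 + ["C"],
})

# Drop classes with fewer than 2 rows, as in the notebook cell
counts = toy["stratify_col"].value_counts()
kept = toy[toy["stratify_col"].isin(counts[counts >= 2].index)]

# 80/20 split that preserves the class proportions
train_part, test_part = train_test_split(
    kept, test_size=0.2, stratify=kept["stratify_col"], random_state=42
)

print(train_part["stratify_col"].value_counts().to_dict())
print(test_part["stratify_col"].value_counts().to_dict())
```

The singleton class `C` is removed before the split, and the two remaining classes appear in equal numbers in both parts.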


In [None]:
# Paths
image_dataset_path = "fashion-dataset/images"
train_path = "data/Train/train"
test_path = "data/Test/test"
os.makedirs(train_path, exist_ok=True)
os.makedirs(test_path, exist_ok=True)

Sets directory paths for the image dataset and training/testing folders, then creates the training and testing directories if they don’t already exist.


In [6]:
# Copy each image listed in df into dest_folder, skipping files missing on disk
def copy_images(df, dest_folder):
    for image_id in df['id']:
        src = os.path.join(image_dataset_path, f"{image_id}.jpg")
        dst = os.path.join(dest_folder, f"{image_id}.jpg")
        if os.path.exists(src):
            shutil.copy(src, dst)

copy_images(train_df, train_path)
copy_images(test_df, test_path)
print("Images copied to train and test folders.")

Images copied to train and test folders.


Defines a function that copies each image listed in a DataFrame from the dataset folder to a destination folder, silently skipping any ID that has no matching `.jpg` file. It is then applied to the training and test sets.
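Because missing files are skipped silently, it can be useful to know how many images were actually copied. The following is one possible variant, not part of the original notebook: `copy_images_counted` is a hypothetical helper that takes the source folder as an argument and returns the copied and missing counts. It is demonstrated on placeholder files in a temporary directory.

```python
import os
import shutil
import tempfile

import pandas as pd


def copy_images_counted(df, src_folder, dest_folder):
    """Copy <id>.jpg for each row of df; return (copied, missing) counts."""
    os.makedirs(dest_folder, exist_ok=True)
    copied, missing = 0, 0
    for image_id in df["id"]:
        src = os.path.join(src_folder, f"{image_id}.jpg")
        if os.path.exists(src):
            shutil.copy(src, os.path.join(dest_folder, f"{image_id}.jpg"))
            copied += 1
        else:
            missing += 1
    return copied, missing


# Demo on placeholder files: IDs 1 and 2 exist, ID 3 does not
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "images")
    dst_dir = os.path.join(tmp, "train")
    os.makedirs(src_dir)
    for image_id in (1, 2):
        with open(os.path.join(src_dir, f"{image_id}.jpg"), "wb") as fh:
            fh.write(b"placeholder")
    demo_df = pd.DataFrame({"id": [1, 2, 3]})
    result = copy_images_counted(demo_df, src_dir, dst_dir)
    copied_files = sorted(os.listdir(dst_dir))

print(result, copied_files)
```

In the notebook itself, the same helper could be called with `image_dataset_path` and `train_path` or `test_path` to report how many metadata rows had no image on disk.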
