# Image Download Notebook

The purpose of this notebook is to fetch and download a portion of the images to your local storage. Since committing huge media files is generally bad practice (even a 10% sample is still about 1.5 GB), this lets everyone control how much data they pull and work with different slices of the image pool, which also keeps things fast.

In [1]:
# RUN CELL TO LOAD CSV FILES

import pandas as pd

# Load image index (images.csv)
links = pd.read_csv('./images.csv')

# Load styles index (styles.csv)
styles = pd.read_csv('./styles.csv', on_bad_lines='skip')

styles.sample(5)

| | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|---|---|---|---|---|---|---|---|---|---|
| 13290 | 34307 | Women | Accessories | Stoles | Stoles | Black | Summer | 2012.0 | Ethnic | Fabindia Women Black Silk & Wool Stole |
| 28464 | 18940 | Men | Apparel | Topwear | Tshirts | Olive | Fall | 2011.0 | Casual | Levi's Men Polo Olive Tshirt |
| 2788 | 35588 | Men | Footwear | Sandal | Sandals | Black | Fall | 2012.0 | Casual | Gliders Men Black Sandals |
| 14290 | 7804 | Men | Footwear | Shoes | Casual Shoes | Black | Fall | 2011.0 | Casual | Puma Men's Benecio Mid Leather Black White Shoe |
| 41900 | 27856 | Women | Apparel | Topwear | Shirts | Red | Summer | 2012.0 | Casual | Scullers For Her Check Red Shirt |


## Sampling

Our main goal here is to pick a 10% sample (lower or higher as you wish) from the image pool. This sample should not just be random but also reflect the original distribution of attributes (e.g. 16% of the original images are T-shirts, so roughly 16% of the sample should be too). This is called stratified sampling.
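
To make the idea concrete, here is a minimal sketch on made-up data (the `toy` frame and its 60/30/10 split are assumptions for illustration, not values from *styles.csv*). It uses pandas' built-in `groupby(...).sample(...)`, which is a close cousin of the `groupby(...).apply(...)` pattern used in this notebook but will not pick the exact same rows.

```python
import pandas as pd

# Toy catalogue (hypothetical data): 60% Tshirts, 30% Shirts, 10% Stoles
toy = pd.DataFrame({
    "articleType": ["Tshirts"] * 60 + ["Shirts"] * 30 + ["Stoles"] * 10,
})

# Draw 10% from each group separately so every group keeps its share
toy_sample = toy.groupby("articleType").sample(frac=0.1, random_state=123)

# Compare the share of each article type before and after sampling
original_share = toy["articleType"].value_counts(normalize=True)
sample_share = toy_sample["articleType"].value_counts(normalize=True)
print(pd.DataFrame({"original": original_share, "sample": sample_share}))
```

With groups this evenly divisible, the sample shares line up with the original ones; real data with many small groups will be messier.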

The cell below takes a controlled fraction of the image pool, using a fixed RNG seed for reproducibility. You may change the constants SAMPLE_FRAC and SEED to produce a different size or composition, respectively.

The output DataFrame is df_sampled.

In [2]:
# CONTROL PARAMETERS FOR SAMPLING
SAMPLE_FRAC = 0.1
SEED = 123


cols = ['gender', 'masterCategory', 'subCategory', 'articleType', 'baseColour', 'season', 'usage']

df_sampled = styles.groupby(cols, group_keys=False).apply(
    lambda x: x.sample(frac=SAMPLE_FRAC, random_state=SEED)
).reset_index(drop=True)

print(f"Sampled {len(df_sampled)} entries ({SAMPLE_FRAC*100}%) from original {len(styles)} entries as 'df_sampled'.\n")

# Comparing distributions between sample and original
for col in cols:
    print(f"\n--- Summary for '{col}' ---")
    orig_dist = styles[col].value_counts(normalize=True).sort_index()
    samp_dist = df_sampled[col].value_counts(normalize=True).sort_index()
        
    # Align indices (add missing with 0)
    all_vals = set(orig_dist.index) | set(samp_dist.index)
    orig_aligned = orig_dist.reindex(all_vals, fill_value=0)
    samp_aligned = samp_dist.reindex(all_vals, fill_value=0)
        
    # Calculate % diffs for aligned values
    pct_diffs = []
    for val in all_vals:
        orig_p = orig_aligned[val]
        samp_p = samp_aligned[val]
        if orig_p == 0:
            if samp_p == 0:
                continue  # Skip 0-0
            else:
                pct_diffs.append(float('inf'))  # Flag new values
        else:
            pct_diff = ((samp_p - orig_p) / orig_p) * 100
            pct_diffs.append(abs(pct_diff))  # Absolute for median
        
    # Median abs % diff (ignore inf for median calc)
    numeric_diffs = [d for d in pct_diffs if isinstance(d, float) and d < float('inf')]
    if numeric_diffs:
        sorted_diffs = sorted(numeric_diffs)
        n = len(sorted_diffs)
        if n % 2 == 1:
            median_abs_diff = sorted_diffs[n//2]
        else:
            median_abs_diff = (sorted_diffs[n//2 - 1] + sorted_diffs[n//2]) / 2
    else:
        median_abs_diff = 0
        
    print(f"Median absolute % difference across all unique values: {median_abs_diff:.2f}%")
    if any(d == float('inf') for d in pct_diffs):
        print("Warning: New categories appeared in sample (check manually)")
print("\n\n")

Sampled 3823 entries (10.0%) from original 44424 entries as 'df_sampled'.


--- Summary for 'gender' ---
Median absolute % difference across all unique values: 13.20%

--- Summary for 'masterCategory' ---
Median absolute % difference across all unique values: 13.92%

--- Summary for 'subCategory' ---
Median absolute % difference across all unique values: 30.77%

--- Summary for 'articleType' ---
Median absolute % difference across all unique values: 61.27%

--- Summary for 'baseColour' ---
Median absolute % difference across all unique values: 29.23%

--- Summary for 'season' ---
Median absolute % difference across all unique values: 1.49%

--- Summary for 'usage' ---
Median absolute % difference across all unique values: 32.82%







The diagnostics above are just a way to see how closely the distribution of unique values in each column (except *id* and *productDisplayName*) matches the original 44.4k set. Some **rare** uniques may be dropped: groups too small to yield a whole row at the chosen fraction contribute nothing, which is likely why the 3823 sampled rows are about 8.6% of 44424 rather than a full 10%. This shouldn't matter much, as we aim to reinforce the model with more commonly observed values. This is where the median % error comes in handy: it shows how many more or fewer samples were taken with respect to the original proportions of unique values in each column.
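
For reference, the same metric can be written more compactly with pandas. This is only a sketch: `median_abs_pct_diff` is a hypothetical helper name, the toy values are made up, and unlike the cell above it does not flag categories that appear only in the sample (which cannot happen when the sample is a subset of the original).

```python
import pandas as pd

def median_abs_pct_diff(original: pd.Series, sample: pd.Series) -> float:
    """Median of |sample share - original share| / original share, in percent."""
    orig = original.value_counts(normalize=True)
    # Categories missing from the sample get a share of 0
    samp = sample.value_counts(normalize=True).reindex(orig.index, fill_value=0.0)
    return float(((samp - orig).abs() / orig * 100).median())

# Hypothetical column values: 'Unisex' is rare and missing from the sample
original = pd.Series(["Men"] * 50 + ["Women"] * 40 + ["Unisex"] * 10)
sample = pd.Series(["Men"] * 6 + ["Women"] * 4)

result = median_abs_pct_diff(original, sample)
print(f"{result:.2f}%")
```

Here the per-category errors are 20% (Men), 0% (Women) and 100% (Unisex, dropped entirely), so the median is 20%. A dropped category always counts as a 100% error, which may be part of why columns with many rare values, such as *articleType*, score high.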

Ultimately, this comes down to a trade-off: a higher sample fraction means less error, but also more data to handle. Make your decision.
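
A rough way to get a feel for this trade-off is to sweep a few fractions over a toy table and count how many groups survive. The stratum sizes below are invented for illustration, and the sketch uses `groupby(...).sample(...)` rather than this notebook's `apply` pattern.

```python
import pandas as pd

# Hypothetical strata: one common, one medium, two rare
toy = pd.DataFrame({"stratum": ["a"] * 200 + ["b"] * 40 + ["c"] * 4 + ["d"] * 2})

kept = {}
for frac in (0.1, 0.25, 0.5):
    # Each stratum contributes round(frac * size) rows, which can be zero
    sampled = toy.groupby("stratum").sample(frac=frac, random_state=123)
    kept[frac] = (len(sampled), sampled["stratum"].nunique())
    print(f"frac={frac}: {kept[frac][0]} rows, {kept[frac][1]} of 4 strata kept")
```

The larger the fraction, the more of the rare strata make it into the sample, at the cost of more rows (and more images to download).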

## Download Images

Once you've decided on your personal slice of the dataset, it's time to pull the images via the URLs in the *images.csv* file.

To break it down, these cells create a new folder, *raw_images*, in the current directory and use an HTTP client (*requests*) to download the specified URLs into that folder.

If you execute the **2ND CELL** before reassigning *DOWNLOAD_INDEX*, 10 random images from the sampled pool will be downloaded instead of the whole thing. This is a safety measure to prevent the unintended consequences of Run Alls and uninformed decisions :p

To download the **sample set**, make sure *DOWNLOAD_INDEX = df_images_to_download* (not *df_test*), then run. The images will be downloaded in around 10-15 minutes and named by their ID in the *raw_images* folder. Be aware that this uses multiple threads (10 by default, set by *max_workers*) to download as quickly as possible.
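
Threads help here because downloading is mostly waiting on the network, not computing. The sketch below shows the same `ThreadPoolExecutor` pattern with a stand-in task; `fake_download` and its 0.05 s sleep are assumptions used in place of a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(name):
    """Stand-in for a real download: sleeps instead of hitting the network."""
    time.sleep(0.05)
    return name, "success"

names = [f"img_{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fake_download, n) for n in names]
    # Collect results in completion order, not submission order
    results = [f.result() for f in as_completed(futures)]
elapsed = time.perf_counter() - start

print(f"{len(results)} tasks finished in {elapsed:.2f}s")
```

Run one after another, 20 such tasks would take about a second; with 10 workers they overlap, so the total should be noticeably shorter.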

Don't worry if the download is interrupted: re-running the cell skips the images that were already downloaded and picks up where you left off.
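
The resume logic boils down to "skip the file if it already exists". One caveat is that a file cut off mid-write would also be skipped next time. A possible hardening, sketched here as a suggestion rather than what the download cell does, is to write to a temporary `.part` file and rename it once complete. `save_if_missing` and its `fetch` argument are hypothetical names; `fetch` stands in for the HTTP call.

```python
import os
import tempfile

def save_if_missing(filepath, fetch):
    """Write fetch() to filepath unless it already exists.

    `fetch` is a hypothetical zero-argument callable returning bytes,
    standing in for something like `session.get(url).content`.
    """
    if os.path.exists(filepath):
        return "already_exists"  # resume: nothing to do
    part_path = filepath + ".part"
    with open(part_path, "wb") as f:
        f.write(fetch())
    os.replace(part_path, filepath)  # rename only after a complete write
    return "success"

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "12345.jpg")
    first = save_if_missing(target, lambda: b"fake image bytes")
    second = save_if_missing(target, lambda: b"never fetched")
    print(first, second)
```

The second call returns without fetching anything, which is what makes an interrupted run cheap to restart.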

Have a go at it and good luck!

In [3]:
# Shortlist images by matching IDs from sampled styles with images.csv
links['id'] = links['filename'].str.split('.').str[0]
sampled_ids = set(df_sampled['id'].astype(str))
df_images_to_download = links[links['id'].isin(sampled_ids)].reset_index(drop=True)

# A small test sample
df_test = df_images_to_download.sample(10, random_state=SEED)
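
A note on the `astype(str)` in the cell above: IDs parsed from filenames are strings, while the IDs in *styles.csv* are presumably read as integers, and the two never match unless one side is converted. A tiny sketch with made-up filenames and IDs:

```python
import pandas as pd

# Hypothetical rows; real filenames and ids come from images.csv and styles.csv
links_toy = pd.DataFrame({"filename": ["15970.jpg", "39386.jpg", "59263.jpg"]})
styles_toy = pd.DataFrame({"id": [15970, 59263]})  # assumed integer ids

# '15970.jpg' -> '15970' (a string)
links_toy["id"] = links_toy["filename"].str.split(".").str[0]

# Comparing strings against integers matches nothing
matches_as_int = links_toy["id"].isin(set(styles_toy["id"])).sum()
# Converting the ids to strings first gives the intended matches
matches_as_str = links_toy["id"].isin(set(styles_toy["id"].astype(str))).sum()

print(matches_as_int, matches_as_str)
```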

In [None]:
import os
import requests
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

# IMPORTANT: BEFORE RUNNING, CHANGE TO df_images_to_download TO DOWNLOAD FULL SET
DOWNLOAD_INDEX = df_test

# Making directory to save images
download_dir = 'raw_images'
os.makedirs(download_dir, exist_ok=True)



# Download images with progress bar
# Create session for connection pooling (reuses TCP connections for speed)
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # Avoid blocks

successful = 0
failed = []
max_workers = 10

# Function to download a single image
def download_image(args):
    name, url = args
    filename = f"{name}.jpg"  # Saved as <id>.jpg
    
    filepath = os.path.join(download_dir, filename)
    
    if os.path.exists(filepath):
        return name, 'already_exists'  # Skip if done
    
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        
        with open(filepath, 'wb') as f:
            f.write(response.content)
        return name, 'success'
    except Exception as e:
        return name, f'error: {str(e)}'

# Multithreaded download with progress bar
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    # Submit all tasks
    future_to_idx = {
        executor.submit(download_image, (row['id'], row['link'])): idx 
        for idx, row in DOWNLOAD_INDEX.iterrows() 
        if not pd.isna(row['link']) and isinstance(row['link'], str) and row['link'].startswith('http')
    }
    
    # Progress bar over completed futures
    with tqdm(total=len(future_to_idx), desc="Downloading") as pbar:
        for future in as_completed(future_to_idx):
            idx = future_to_idx[future]
            try:
                result = future.result()
                status = result[1]
                if status == 'success':
                    successful += 1
                elif status != 'already_exists':
                    # Files skipped on a resumed run are not failures
                    failed.append((idx, status))
            except Exception as e:
                failed.append((idx, f'unexpected: {str(e)}'))
            pbar.update(1)
    
# Summary
print("\nDownload complete!")
print(f"Successful: {successful}/{len(DOWNLOAD_INDEX)}")
print(f"Failed: {len(failed)}")
    
if failed:
    print("\nSample failures:")
    for idx, reason in failed[:5]:
        print(f"\tRow {idx}: {reason}")

Downloading:   0%|          | 0/10 [00:00<?, ?it/s]

Downloading: 100%|██████████| 10/10 [00:00<00:00, 18.43it/s]


Download complete!
Successful: 10/10
Failed: 0





After the last cell has finished executing, you should have the same number of images as there are samples selected in the first step. Don't worry if a re-run reports 0 successful: images already on disk are skipped rather than downloaded again, so they do not count as new successes. It's clunky spaghetti code sourced from Grok, but your images should be fine.
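
If you'd rather check than trust the summary, you can compare the expected IDs against the files actually on disk. This is a small sketch with a hypothetical helper, `missing_images`, demonstrated on a temporary folder with made-up IDs.

```python
import os
import tempfile

def missing_images(expected_ids, folder):
    """Return the ids that have no <id>.jpg file in folder."""
    on_disk = {
        os.path.splitext(name)[0]
        for name in os.listdir(folder)
        if name.endswith(".jpg")
    }
    return sorted(set(map(str, expected_ids)) - on_disk)

with tempfile.TemporaryDirectory() as tmp:
    # Create two empty placeholder files, leaving '103' missing
    for image_id in ("101", "102"):
        open(os.path.join(tmp, f"{image_id}.jpg"), "wb").close()
    missing = missing_images(["101", "102", "103"], tmp)
    print(missing)  # ['103']
```

In this notebook the call would look something like `missing_images(DOWNLOAD_INDEX['id'], download_dir)`; an empty list means every selected image is present.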

Beyond this point, we can start looking into getting a basic CLIP model up and fitting the images.