# Module 3: CNN Training Data Preparation

**Objective**: Clean and prepare the stock codes from the CNN training dataset for image scraping.

**Input**: CNN_Model_Train_Data.csv with target product categories

**Output**: Clean stock codes ready for web scraping

**Why This Matters**: Clean stock codes ensure we scrape the right product images for CNN training.

## Load CNN Training Codes

Load the specific products we want to train the CNN model to recognize.

In [None]:
import pandas as pd
import re

# Load CNN training codes
cnn_df = pd.read_csv('data/raw/CNN_Model_Train_Data.csv')

print(f"Total training products: {len(cnn_df)}")
print("\nStock codes to train on:")
print(cnn_df)
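
Since the cleaning step relies on a `StockCode` column, a quick sanity check right after loading can catch a malformed file early. The following is a minimal sketch, not part of the original pipeline: it uses a small in-memory CSV as a stand-in for the real file, and the sample rows and the `sample_df` / `n_duplicates` names are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/raw/CNN_Model_Train_Data.csv (rows are invented)
sample_csv = io.StringIO("StockCode\n85123A\n85123A\n22423B\n")
sample_df = pd.read_csv(sample_csv)

# Fail early if the column the cleaning step depends on is absent
if "StockCode" not in sample_df.columns:
    raise KeyError("Expected a 'StockCode' column in the training CSV")

# Count repeated codes so they can be reviewed before scraping
n_duplicates = int(sample_df["StockCode"].duplicated().sum())
print(f"Rows: {len(sample_df)}, duplicate codes: {n_duplicates}")
```

Whether duplicates should be dropped depends on how the scraper handles repeated codes, so this sketch only reports them.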

## Clean Stock Codes

Apply the same cleaning logic as the main dataset to ensure consistency.

In [None]:
def clean_stockcode(code):
    """Remove non-alphanumeric characters from stock code"""
    if pd.isna(code):
        return code
    return re.sub(r'[^A-Za-z0-9]', '', str(code))

# Before cleaning
print("Before cleaning:")
print(cnn_df['StockCode'].values)

# Clean
cnn_df['StockCode'] = cnn_df['StockCode'].apply(clean_stockcode)

# After cleaning
print("\nAfter cleaning:")
print(cnn_df['StockCode'].values)

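# Remove missing values and codes that became empty after cleaning
# (NaN passes through clean_stockcode unchanged, so check for it explicitly)
cnn_df = cnn_df[cnn_df['StockCode'].notna() & (cnn_df['StockCode'] != '')]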

print(f"\nFinal count: {len(cnn_df)} clean codes")
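
To make the effect of the regex concrete, here is a small self-contained check on a handful of invented codes. The sample values are illustrative only and do not come from the dataset; the function is repeated so the snippet runs on its own.

```python
import re
import pandas as pd

def clean_stockcode(code):
    """Remove non-alphanumeric characters from a stock code."""
    if pd.isna(code):
        return code
    return re.sub(r'[^A-Za-z0-9]', '', str(code))

# Invented examples: padded code with a hyphen, a clean code,
# a code made only of symbols, and a missing value
samples = pd.Series([' 85123-A ', '22423', '***', None])
cleaned = samples.apply(clean_stockcode)

# Keep only codes that are neither missing nor empty after cleaning
cleaned = cleaned[cleaned.notna() & (cleaned != '')]
print(cleaned.tolist())
```

Note that a code consisting only of symbols collapses to an empty string, and a missing value stays missing, which is why the filter checks for both.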

## Save Cleaned Codes

Save the cleaned codes to the processed directory for use by the scraper.

In [None]:
import os

# Create output directory
os.makedirs('data/processed', exist_ok=True)

# Save
output_path = 'data/processed/cnn_train_codes_clean.csv'
cnn_df.to_csv(output_path, index=False)

print(f"Saved to: {output_path}")
print("Ready for image scraping!")
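
One thing to watch when the scraper reads this file back: if every code happens to be purely numeric, `pd.read_csv` will infer an integer column and drop any leading zeros. The sketch below illustrates this with invented codes written to a temporary directory. Passing `dtype` for `StockCode` is a suggestion here, not something the original pipeline is known to do.

```python
import os
import tempfile
import pandas as pd

# Invented, purely numeric codes; one has leading zeros
codes = pd.DataFrame({"StockCode": ["00123", "22423"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cnn_train_codes_clean.csv")
    codes.to_csv(path, index=False)

    # Default parsing infers integers, so "00123" loses its zeros
    default = pd.read_csv(path)
    # Forcing a string dtype preserves the codes exactly as written
    as_text = pd.read_csv(path, dtype={"StockCode": str})

print(default["StockCode"].tolist())
print(as_text["StockCode"].tolist())
```

If the real codes always contain a letter suffix this may never trigger, but reading them as strings is a cheap safeguard.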

## Next Steps

1. ✅ Clean codes prepared
2. ➡️ Run scraper to download product images
3. ➡️ Train CNN model on scraped images

**Expected**: ~50 images per product category for training