## Data Loading and Insights

In [1]:
import ast
import re

import pandas as pd

In [2]:
DATASET_PATH = "../data/product_data.csv"
df = pd.read_csv(DATASET_PATH)
df.head()

Unnamed: 0,title,brand,description,price,categories,images,manufacturer,package_dimensions,country_of_origin,material,color,uniq_id
0,"GOYMFK 1pc Free Standing Shoe Rack, Multi-laye...",GOYMFK,"multiple shoes, coats, hats, and other items E...",$24.99,"['Home & Kitchen', 'Storage & Organization', '...",['https://m.media-amazon.com/images/I/416WaLx1...,GOYMFK,"2.36""D x 7.87""W x 21.6""H",China,Metal,White,02593e81-5c09-5069-8516-b0b29f439ded
1,"subrtex Leather ding Room, Dining Chairs Set o...",subrtex,subrtex Dining chairs Set of 2,,"['Home & Kitchen', 'Furniture', 'Dining Room F...",['https://m.media-amazon.com/images/I/31SejUEW...,Subrtex Houseware INC,"18.5""D x 16""W x 35""H",,Sponge,Black,5938d217-b8c5-5d3e-b1cf-e28e340f292e
2,Plant Repotting Mat MUYETOL Waterproof Transpl...,MUYETOL,,$5.98,"['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo...",['https://m.media-amazon.com/images/I/41RgefVq...,MUYETOL,"26.8""L x 26.8""W",,Polyethylene,Green,b2ede786-3f51-5a45-9a5b-bcf856958cd8
3,"Pickleball Doormat, Welcome Doormat Absorbent ...",VEWETOL,The decorative doormat features a subtle textu...,$13.99,"['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo...",['https://m.media-amazon.com/images/I/61vz1Igl...,Contrence,"24""L x 16""W",,Rubber,A5589,8fd9377b-cfa6-5f10-835c-6b8eca2816b5
4,JOIN IRON Foldable TV Trays for Eating Set of ...,JOIN IRON Store,Set of Four Folding Trays With Matching Storag...,$89.99,"['Home & Kitchen', 'Furniture', 'Game & Recrea...",['https://m.media-amazon.com/images/I/41p4d4VJ...,,"18.9""D x 14.2""W x 26""H",,Iron,Grey Set of 4,bdc9aa30-9439-50dc-8e89-213ea211d66a


A quick preview shows a few things:
- The `price` column contains null values.
- Prices are stored as strings with a leading `$`, so they must be cleaned before they can be converted to numbers.
- `categories` and `images` appear to hold stringified lists (for example `"[]"`) rather than real lists, so they need to be parsed with care.
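
To make the last two points concrete, here is a minimal sketch using hand-written values shaped like the preview rows. The sample strings are illustrative assumptions and are not read from the CSV. It shows that `float()` rejects the `$` prefix, and that a stringified Python list is not valid JSON but can be recovered with `ast.literal_eval`.

```python
import ast
import json

# Illustrative values shaped like the preview rows (assumed, not read from the CSV).
raw_price = "$24.99"
raw_categories = "['Home & Kitchen', 'Furniture']"

# float() rejects the currency symbol, so prices need cleaning first.
try:
    float(raw_price)
    price_converts_directly = True
except ValueError:
    price_converts_directly = False

# The list is a Python repr with single quotes, which is not valid JSON.
try:
    json.loads(raw_categories)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# ast.literal_eval evaluates Python literals only; it does not execute code.
categories = ast.literal_eval(raw_categories)

print(price_converts_directly)  # False
print(is_valid_json)            # False
print(categories)               # ['Home & Kitchen', 'Furniture']
```

This is why the parsing helpers later in the notebook rely on `re` and `ast` rather than a direct cast or a JSON parser.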

In [3]:
# Check for null values
df.isnull().sum()

title                   0
brand                   0
description           153
price                  97
categories              0
images                  0
manufacturer          107
package_dimensions      6
country_of_origin     187
material               94
color                  47
uniq_id                 0
dtype: int64

Several columns contain null values, but `title`, `brand`, `categories`, `images`, and `uniq_id` are complete, so we don't need to handle missing values there.

---
## Preprocessing

In [None]:
def parse_price(price_str):
    """Convert price like '$59.99' or '$1,299' -> float or None"""
    if not isinstance(price_str, str):
        return None
    price_str = price_str.strip()
    match = re.search(r"[\d,.]+", price_str)
    if match:
        try:
            return float(match.group(0).replace(",", ""))
        except ValueError:
            return None
    return None
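
To see what the regular expression in `parse_price` actually captures, the sketch below isolates the matching step. `first_numeric_run` is a hypothetical helper introduced only for this illustration, and the input strings are made-up examples; whether strings like the last two occur in the dataset is not known.

```python
import re

# Same pattern as in parse_price: a run of digits, commas, and dots.
PRICE_PATTERN = re.compile(r"[\d,.]+")


def first_numeric_run(text):
    """Return the first run of digits, commas, and dots, or None (hypothetical helper)."""
    match = PRICE_PATTERN.search(text)
    return match.group(0) if match else None


print(first_numeric_run("$59.99"))     # 59.99
print(first_numeric_run("$1,299"))     # 1,299
print(first_numeric_run("N/A"))        # None
print(first_numeric_run("2 for $10"))  # 2
print(first_numeric_run("$1.2.3"))     # 1.2.3
```

Two consequences follow from using only the first match. A string such as `"2 for $10"` would be parsed as `2.0`, not `10.0`. A run such as `"1.2.3"` is matched but cannot be converted by `float()`, which is the case the `except ValueError` branch in `parse_price` handles by returning `None`.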

In [None]:
def parse_list_column(value):
    """Convert stringified lists like "['a', 'b']" into actual Python lists"""
    if isinstance(value, list):
        return [str(v).strip() for v in value]
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
            if isinstance(parsed, list):
                return [str(v).strip() for v in parsed]
            else:
                return [value.strip()]
        except (ValueError, SyntaxError):
            return [value.strip()]
    return []
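
The branches in `parse_list_column` map onto the different outcomes of `ast.literal_eval`. The sketch below classifies a few made-up inputs by outcome; `literal_eval_outcome` is a hypothetical helper used only for this illustration.

```python
import ast


def literal_eval_outcome(value):
    """Describe what ast.literal_eval does with a string (hypothetical helper)."""
    try:
        parsed = ast.literal_eval(value)
    except ValueError:
        return "ValueError"
    except SyntaxError:
        return "SyntaxError"
    return type(parsed).__name__


print(literal_eval_outcome("['a', 'b']"))      # list
print(literal_eval_outcome("[]"))              # list
print(literal_eval_outcome("42"))              # int
print(literal_eval_outcome("Home & Kitchen"))  # ValueError
print(literal_eval_outcome("['a', 'b'"))       # SyntaxError
```

In `parse_list_column`, the `list` outcomes are returned as cleaned lists, while the `int`, `ValueError`, and `SyntaxError` outcomes all fall back to wrapping the original string in a single-element list. Non-string, non-list inputs such as a missing value (`NaN`, which is a `float`) skip parsing entirely and yield `[]`.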

We don't need much preprocessing because we are using a deep learning (DL) approach, and DL models can consume the data largely as is.

The actual preprocessing is done in `scripts/preprocessing_data.py`.