## Data Preprocessing

### Project Overview

- This notebook is part of a project aimed at building an AI-powered semantic search tool for Nike products. 
- The goal is to leverage natural language processing techniques to enable intuitive and accurate product recommendations from a dataset of Nike products.

### Why This Dataset?
I selected this dataset because:
- It contains detailed product information, including names, subtitles, descriptions, average ratings, review counts, prices, and availability, which are essential for building a semantic search engine.
- It includes structured and semi-structured data that allow for feature extraction and preprocessing to support natural language processing (NLP) tasks.
- The dataset provides real-world e-commerce data, making it applicable for developing a practical recommendation system.


### What Needs to Be Done
1. Clean and preprocess textual data to make it usable for NLP tasks.
2. Handle missing values in fields such as ratings, review counts, colors, and availability.
3. Extract meaningful information from raw HTML descriptions.
4. Parse and structure size data to enhance filtering capabilities.
5. Save the cleaned and preprocessed data for use in subsequent stages of the project.
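
As a rough illustration of step 4, the sketch below splits a size string on the `|` delimiter seen in the `available_sizes` sample rows. The helper name `parse_sizes` is hypothetical and not part of the notebook, and the delimiter is assumed from the first few rows only; other rows may use a different format.

```python
from typing import List


def parse_sizes(raw: object) -> List[str]:
    """Split a pipe-delimited size string into a list of trimmed labels.

    Non-string input (for example the float NaN that pandas uses for
    missing values) yields an empty list.
    """
    if not isinstance(raw, str):
        return []
    return [part.strip() for part in raw.split("|") if part.strip()]


print(parse_sizes("S | M | L | XL | 2XL"))  # ['S', 'M', 'L', 'XL', '2XL']
print(parse_sizes("L (12–14)"))             # ['L (12–14)']
print(parse_sizes(float("nan")))            # []
```

In the notebook this could be applied with `data['available_sizes'].apply(parse_sizes)`, which keeps rows with missing sizes as empty lists rather than dropping them.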

In [58]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re

In [59]:
# Load the dataset
data = pd.read_csv('../data/nike_data_2022_09.csv')

In [60]:
# Display the first few rows to understand the structure
data.head()

Unnamed: 0,index,url,name,sub_title,brand,model,color,price,currency,availability,description,raw_description,avg_rating,review_count,images,available_sizes,uniq_id,scraped_at
0,0,https://www.nike.com/t/dri-fit-team-minnesota-...,Nike Dri-FIT Team (MLB Minnesota Twins),Men's Long-Sleeve T-Shirt,Nike,14226571,Navy,40.0,USD,InStock,SWEAT-WICKING COMFORT.The Nike Dri-FIT Team (M...,"<div class=""pi-pdpmainbody""><p><b class=""headl...",,,https://static.nike.com/a/images/t_PDP_1280_v1...,S | M | L | XL | 2XL,c3229e54-aa58-5fdd-9f71-fbe66366b2b2,20/09/2022 23:32:28
1,1,https://www.nike.com/t/club-américa-womens-dri...,Club América,Women's Nike Dri-FIT Soccer Jersey Dress,Nike,13814665,Black/Black,90.0,USD,InStock,"Inspired by traditional soccer jerseys, the Cl...","<div class=""pi-pdpmainbody""><br/><p>Inspired b...",5.0,1.0,https://static.nike.com/a/images/t_PDP_1280_v1...,L (12–14),f8ebb2ed-17ae-5719-b750-5ea3ec69b75c,20/09/2022 23:32:40
2,2,https://www.nike.com/t/sportswear-swoosh-mens-...,Nike Sportswear Swoosh,Men's Overalls,Nike,13015648,Black/White,140.0,USD,OutOfStock,WORKING HARD TO KEEP YOU COMFORTABLE.The Nike ...,"<div class=""pi-pdpmainbody""><p><b class=""headl...",4.9,11.0,https://static.nike.com/a/images/t_PDP_1280_v1...,,88120081-e6cb-5399-b9dc-a2d3d5dd5206,20/09/2022 23:33:16
3,3,https://www.nike.com/t/dri-fit-one-luxe-big-ki...,Nike Dri-FIT One Luxe,Big Kids' (Girls') Printed Tights (Extended Size),Nike,13809796,Black/Rush Pink,22.97,USD,OutOfStock,ELEVATED COMFORT GOES FULL BLOOM.The Nike Dri-...,"<div class=""pi-pdpmainbody""><p><b class=""headl...",,,https://static.nike.com/a/images/t_PDP_1280_v1...,,98348cc5-1520-5b6e-a5f6-c42547b6a092,20/09/2022 23:33:17
4,4,https://www.nike.com/t/paris-saint-germain-rep...,Paris Saint-Germain Repel Academy AWF,Big Kids' Soccer Jacket,Nike,13327415,Dark Grey/Black/Siren Red/Siren Red,70.0,USD,InStock,WATER-REPELLENT COVERAGE GETS PSG DETAILS.The ...,"<div class=""pi-pdpmainbody""><p><b class=""headl...",,,https://static.nike.com/a/images/t_PDP_1280_v1...,XS | S | M | L | XL,f15981a5-d8c9-53fa-880d-80606be188fe,20/09/2022 23:33:22


### Dataset Overview
The dataset contains the following key columns:
- `name`: Name of the product.
- `sub_title`: A short product-type label (e.g., "Men's Long-Sleeve T-Shirt").
- `brand`: Brand name (Nike).
- `color`: Product color.
- `price`: Product price in numeric format.
- `currency`: Currency of the price.
- `availability`: Stock status of the product.
- `raw_description`: Product description in HTML format.
- `avg_rating`: Average customer rating.
- `review_count`: Number of customer reviews.
- `available_sizes`: Sizes available for the product, stored as a single delimited string (e.g., `S | M | L | XL | 2XL`).
- `scraped_at`: Timestamp when the product data was scraped.
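
Because `raw_description` holds HTML, one possible way to recover plain text is sketched below with BeautifulSoup, which the notebook already imports. The helper name `html_to_text` and the sample fragment are made up for illustration; the real markup is truncated in the preview above and may be more complex.

```python
from bs4 import BeautifulSoup


def html_to_text(raw_html: str) -> str:
    """Strip tags from an HTML fragment and collapse runs of whitespace."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # A space separator keeps text from adjacent tags from running together.
    text = soup.get_text(separator=" ")
    return " ".join(text.split())


# Hypothetical fragment, loosely shaped like the rows shown above.
sample = (
    '<div class="pi-pdpmainbody">'
    "<p><b>SAMPLE HEADLINE.</b></p>"
    "<p>Body text goes here.</p>"
    "</div>"
)
print(html_to_text(sample))  # SAMPLE HEADLINE. Body text goes here.
```

Passing a separator is a deliberate choice here: the existing `description` column appears to join the headline and body without a space (for example `COMFORT.The`), which re-extracting from the HTML could avoid.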

In [61]:
print(data.columns)
data.info()  # info() prints its own summary and returns None, so no print() is needed

Index(['index', 'url', 'name', 'sub_title', 'brand', 'model', 'color', 'price',
       'currency', 'availability', 'description', 'raw_description',
       'avg_rating', 'review_count', 'images', 'available_sizes', 'uniq_id',
       'scraped_at'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   index            112 non-null    int64  
 1   url              112 non-null    object 
 2   name             112 non-null    object 
 3   sub_title        112 non-null    object 
 4   brand            112 non-null    object 
 5   model            112 non-null    int64  
 6   color            110 non-null    object 
 7   price            112 non-null    float64
 8   currency         112 non-null    object 
 9   availability     108 non-null    object 
 10  description      112 non-null    object 
 11  raw_description  112 n

In [62]:
# Step 1: Handling Missing Values
# Identify columns with missing values
missing_values = data.isnull().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 index               0
url                 0
name                0
sub_title           0
brand               0
model               0
color               2
price               0
currency            0
availability        4
description         0
raw_description     0
avg_rating         89
review_count       89
images              4
available_sizes    56
uniq_id             0
scraped_at          0
dtype: int64
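
The counts above are easier to judge as a share of the 112 rows. The sketch below redoes that arithmetic in plain Python using only the non-zero counts printed above; in the notebook itself, `data.isnull().mean()` should give the same fractions directly.

```python
# Non-zero missing counts copied from the output above; 112 rows in total.
n_rows = 112
missing_counts = {
    "color": 2,
    "availability": 4,
    "avg_rating": 89,
    "review_count": 89,
    "images": 4,
    "available_sizes": 56,
}

# Percentage of rows missing per column, rounded to one decimal place.
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing_counts.items()}

# Print the columns from most to least affected.
for col, pct in sorted(missing_pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<16}{pct:5.1f}%")
```

Roughly four in five rows have no rating or review count and half have no size list, so any fill strategy for those columns shapes most of their values.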


In [63]:
# Fill missing avg_rating with the column mean and missing review_count with 0.
# Note: 89 of 112 ratings are missing, so most filled ratings are placeholders.
data['avg_rating'] = data['avg_rating'].fillna(data['avg_rating'].mean())
data['review_count'] = data['review_count'].fillna(0)
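
Since 89 of the 112 ratings are missing, a mean-filled `avg_rating` cannot be told apart from a real one afterwards. One option, sketched below on a toy frame rather than the real data, is to record a flag before filling. The column name `has_rating` is an assumption for illustration and does not exist in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative.
toy = pd.DataFrame(
    {
        "avg_rating": [5.0, np.nan, 4.9, np.nan],
        "review_count": [1.0, np.nan, 11.0, np.nan],
    }
)

# Record which rows had a real rating before any filling happens.
toy["has_rating"] = toy["avg_rating"].notna()

# Same fill strategy as the notebook cell: mean for ratings, 0 for counts.
toy["avg_rating"] = toy["avg_rating"].fillna(toy["avg_rating"].mean())
toy["review_count"] = toy["review_count"].fillna(0).astype(int)

print(toy)
```

Later stages could then filter or down-weight rows where `has_rating` is `False` instead of treating the imputed mean as a genuine customer rating.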

In [64]:
# Fill missing `color` and `availability` values with placeholders
data['color'] = data['color'].fillna('unknown')
data['availability'] = data['availability'].fillna('unknown')

In [65]:
# Step 2: Drop Unnecessary Columns
data = data.drop(columns=['url', 'index', 'images', 'scraped_at'])