# 3. Pre-processing and Training Data Development

* [3 Training Data](#2_Data_training_introduction)

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
print("Loaded Libraries")

Loaded Libraries


In [27]:
# Load the data
products = pd.read_csv("../data/processed/products.csv")

# Cast columns to explicit dtypes: 'category' for Brand and the target, 'string' for text fields
products['Brand'] = products['Brand'].astype('category')
products['Color'] = products['Color'].astype('string')
products['Size'] = products['Size'].astype('string')
products['Description'] = products['Description'].astype('string')
products['ParentCategory'] = products['ParentCategory'].astype('string')
products['Category'] = products['Category'].astype('category')  # Target variable

# Drop rows with missing values
products.dropna(inplace=True)

In [28]:
# Define numerical and categorical features
numerical_features = ['MSRP']
categorical_features = ['Brand', 'Size', 'Color', 'Description', 'ParentCategory']

In [29]:
# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])  # No need for imputation since we dropped missing values

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])
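
To see what a preprocessor of this shape produces, here is a minimal sketch on a made-up three-row frame. The toy values and the names `toy`, `toy_preprocessor`, and `unseen` are illustrative assumptions, not part of the dataset, and only two categorical columns are used for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "MSRP": [100.0, 200.0, 300.0],
    "Brand": ["Obermeyer", "Nordica", "Nordica"],
    "Color": ["Black", "Blue", "Black"],
})

# Same layout as the notebook's preprocessor, on fewer columns
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("scaler", StandardScaler())]), ["MSRP"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Brand", "Color"]),
])

# One scaled numeric column + two Brand levels + two Color levels
encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)

# A brand that was not seen during fit is encoded as all zeros instead of raising
unseen = pd.DataFrame({"MSRP": [200.0], "Brand": ["Smith Optics"], "Color": ["Black"]})
unseen_encoded = toy_preprocessor.transform(unseen)
```

Because `handle_unknown='ignore'` silently zeroes out unseen levels, a high-cardinality column such as `Description` may contribute little signal for test rows whose text never appeared in training.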

In [30]:
# Define X and y
X = products.drop(columns=['Category']) 
y = products['Category']

In [34]:
# Rebuild X and preview the first rows
X = products.drop(columns=['Category']) 
X.head()

| Unnamed: 0 | Brand | Description | Keyword | UPC | MSRP | Quantity | SKU | Color | Size | StyleNumber | StyleName | ParentCategory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | Obermeyer | Jette Jacket Glacier melt 12 | Obermeyer | 888555767674 | 299.0 | 1 | JET243078952 | Glacier melt | 12 | 11210 | Jette Jacket | Clothing |
| 10 | Obermeyer | Obermeyer - Keystone Pant - Black - LS | Obermeyer | 888555573336 | 109.5 | 0 | KEY19725340S | Black | LS | 25102 | Keystone Pant | Clothing |
| 15 | Nordica | Nordica - Speedmachine 75 W Ski Boots - Black/... | Nordica | 888341789040 | 400.0 | 0 | SPE39922955 | Black - Anthracite - Purple | 23.5 | 050H4803735 | SPEEDMACHINE 75 W | Ski Hardgoods |
| 18 | Smith Optics | Smith - I/O MAG Goggles - Sunrise, ChromaPop S... | Smith Optics | 716736827513 | 270.0 | 2 | IOM31217987S | Sunrise \| ChromaPop Sun Black | OS | M00427 | I/O MAG | Accessories |
| 21 | Nordica | Santa Ana 93 Flat Blue - Rasperry 158 | Nordica | 888341757582 | 750.0 | 0 | SAN291544758 | Blue - Rasperry | 158 | 0A031800001 | SANTA ANA 93 (flat) | Ski Hardgoods |


In [31]:
class_counts = products['Category'].value_counts()
class_counts

Category
Accessories - Winter - Gloves & Mittens               551
Accessories - Winter - Socks                          256
Clothing - Winter - Outerwear - Mens                  244
Accessories - Winter - Hats, Hoods, Collars           217
Snowboard Hardgoods - Boots - Men                     162
                                                     ... 
Lift Tickets                                            0
Logo Merchandise - Clothing                             0
Logo Merchandise - Clothing - Crewneck Sweatshirts      0
Accessories - Summer - Waterbottles & Cages             0
Acccessories - Winter - Beanies                         0
Name: count, Length: 269, dtype: int64
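
The zero counts at the bottom of this list are likely a side effect of the `category` dtype: a category stays registered even after `dropna` removes every row that used it, and `value_counts` then reports it as 0. A minimal sketch on made-up labels, assuming that is the cause here (the toy frame is hypothetical):

```python
import pandas as pd

# Hypothetical labels; the last row has a missing value in another column
toy = pd.DataFrame({
    "Category": ["Socks", "Socks", "Lift Tickets"],
    "MSRP": [20.0, 25.0, None],
})
toy["Category"] = toy["Category"].astype("category")
toy = toy.dropna()

# 'Lift Tickets' has no rows left but is still a registered category
counts_before = toy["Category"].value_counts()

# Removing unused categories leaves only the classes that actually occur
toy["Category"] = toy["Category"].cat.remove_unused_categories()
counts_after = toy["Category"].value_counts()

print(counts_before)
print(counts_after)
```

Whether to drop the unused categories depends on whether later sections need the full label set; this sketch only shows where the zeros can come from.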

In [36]:
class_percentages = pd.Series([(x / products.shape[0]) * 100.00 for x in class_counts])
print(class_percentages)

0      12.423901
1       5.772266
2       5.501691
3       4.892897
4       3.652762
         ...    
264     0.000000
265     0.000000
266     0.000000
267     0.000000
268     0.000000
Length: 269, dtype: float64
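
The list comprehension above discards the category labels, which is why the output is indexed 0 to 268. If the labels are wanted, `value_counts(normalize=True)` yields the same percentages (as long as the column has no missing values) while keeping them. A small sketch on made-up labels:

```python
import pandas as pd

# Hypothetical labels for illustration
labels = pd.Series(["Gloves", "Gloves", "Socks", "Hats"], dtype="category")

# Proportions per class, scaled to percentages, indexed by class label
class_percentages = labels.value_counts(normalize=True) * 100
print(class_percentages.round(2))
```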


In [39]:
# Split data into training and testing sets (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
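
The split above is purely random, so rare categories can land entirely in one partition. One possible alternative, not used in this notebook, is a stratified split. It requires every class to have at least two rows, so singleton classes are filtered out first. A sketch on made-up data, where the `min_count` threshold and the toy values are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two usable classes and one singleton class
toy = pd.DataFrame({
    "MSRP": [10.0, 12.0, 11.0, 13.0, 9.0, 14.0, 50.0, 55.0, 60.0, 52.0, 300.0],
    "Category": ["Socks"] * 6 + ["Gloves"] * 4 + ["Skis"],
})

# Keep only classes with enough rows to appear in both partitions
min_count = 2
counts = toy["Category"].value_counts()
keep = counts[counts >= min_count].index
toy_kept = toy[toy["Category"].isin(keep)]

X_toy = toy_kept.drop(columns=["Category"])
y_toy = toy_kept["Category"]

# stratify=y_toy preserves class proportions in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy
)
print(y_tr.value_counts())
print(y_te.value_counts())
```

Filtering changes which rows the model sees, so any accuracy measured this way would not be directly comparable to the figure reported below.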

In [40]:
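# Build the pipeline (its definition is missing from this export; default
# RandomForestClassifier hyperparameters are an assumption)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier())])

# Train the model
pipeline.fit(X_train, y_train)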

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.7925591882750845


## Summary

This notebook loads the processed product data, casts the text columns to explicit dtypes, and drops rows with missing values. The numerical feature (`MSRP`) is standardized; the categorical features (`Brand`, `Size`, `Color`, `Description`, `ParentCategory`) are imputed with a constant and one-hot encoded. The data is split 60/40 into training and test sets, and a pipeline that combines the preprocessing with a `RandomForestClassifier` is fit on the training set and scored on the test set to predict each product's `Category`.

The first model reaches about 79% accuracy on the test set. I think we can do much better. Stay tuned for the next section!

In [43]:
%store products
%store X_train
%store X_test
%store y_train
%store y_test

Stored 'products' (DataFrame)
Stored 'X_train' (DataFrame)
Stored 'X_test' (DataFrame)
Stored 'y_train' (Series)
Stored 'y_test' (Series)
