# 3. Pre-processing and Training Data Development

* [3 Training Data](#2_Data_training_introduction)

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

print("Loaded Libraries")

Loaded Libraries


In [10]:
# Load the data
products = pd.read_csv("../data/processed/products.csv")
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21396 entries, 0 to 21395
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Category        21396 non-null  object 
 1   Brand           18956 non-null  object 
 2   Description     21396 non-null  object 
 3   Keyword         18956 non-null  object 
 4   UPC             18970 non-null  object 
 5   MSRP            21396 non-null  float64
 6   Quantity        21396 non-null  int64  
 7   SKU             21396 non-null  object 
 8   Color           15387 non-null  object 
 9   Size            15806 non-null  object 
 10  StyleNumber     7000 non-null   object 
 11  StyleName       8881 non-null   object 
 12  ParentCategory  21396 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 2.1+ MB


In [11]:
# Cast feature and target columns to explicit dtypes
products['Brand'] = products['Brand'].astype('category')
products['Color'] = products['Color'].astype('category')
products['Size'] = products['Size'].astype('category')
products['Description'] = products['Description'].astype('string')  # free text, kept as string
products['ParentCategory'] = products['ParentCategory'].astype('category')
products['Category'] = products['Category'].astype('category')  # target variable

# Define numerical and categorical features
numerical_features = ['MSRP']
categorical_features = ['Brand', 'Size', 'Color', 'Description', 'ParentCategory']

# Drop rows with a missing value in any column (including columns not used as features)
products.dropna(inplace=True)
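
Because `dropna()` with no arguments drops a row if any column is missing, sparse columns that are not model features (such as `StyleNumber`, 7,000 non-null) decide how much data survives. The class percentages printed later in this notebook (551 rows being 12.42% of the total) imply that roughly 4,435 of the 21,396 rows remain after this step. The sketch below, on a made-up toy frame, shows the difference between dropping on every column and dropping only on the modelling columns; `toy` and `model_columns` are hypothetical names, and restricting the subset is one possible option, not what the notebook does.

```python
import pandas as pd

# Toy stand-in for the products table (values are invented for illustration)
toy = pd.DataFrame({
    "Category": ["Gloves", "Socks", "Hats", "Gloves"],
    "Brand": ["A", "B", None, "A"],
    "MSRP": [20.0, 8.0, 15.0, 25.0],
    "StyleNumber": [None, "S-1", None, None],  # sparse column, not a feature
})

# Hypothetical list: the target plus the features the model actually uses
model_columns = ["Category", "Brand", "MSRP"]

rows_all = len(toy.dropna())                         # drop on any missing value
rows_subset = len(toy.dropna(subset=model_columns))  # drop only on modelling columns

print("rows kept, any column:", rows_all)
print("rows kept, modelling columns only:", rows_subset)
```

Whether the extra rows are worth keeping depends on how the missing feature values are then handled; the pipeline's imputer would fill missing categorical values with the constant `'missing'`.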

In [14]:
# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

# Define the model pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())])



In [17]:
# Define X and y
X = products.drop(columns=['Category']) 
y = products['Category']

In [18]:
# Split data into training, validation, and testing sets (60% / 20% / 20%)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2

In [19]:
# Train the model on the training data
pipeline.fit(X_train, y_train)

In [20]:
# Evaluate the model on the validation data
accuracy = pipeline.score(X_val, y_val)
print("Model Accuracy on Validation Data:", accuracy)

Model Accuracy on Validation Data: 0.7925591882750845


In [21]:
# Finally, evaluate the model on the test data
accuracy_test = pipeline.score(X_test, y_test)
print("Model Accuracy on Test Data:", accuracy_test)

Model Accuracy on Test Data: 0.7880496054114995


In [22]:
class_counts = products['Category'].value_counts()
class_counts

Category
Accessories - Winter - Gloves & Mittens               551
Accessories - Winter - Socks                          256
Clothing - Winter - Outerwear - Mens                  244
Accessories - Winter - Hats, Hoods, Collars           217
Snowboard Hardgoods - Boots - Men                     162
                                                     ... 
Lift Tickets                                            0
Logo Merchandise - Clothing                             0
Logo Merchandise - Clothing - Crewneck Sweatshirts      0
Accessories - Summer - Waterbottles & Cages             0
Acccessories - Winter - Beanies                         0
Name: count, Length: 269, dtype: int64
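
The zero counts in this output are most likely labels that lost all of their rows in the earlier `dropna` call: a `category` column keeps every label it was created with, and `value_counts` reports unused labels as 0. A minimal sketch on toy data (the series `s` is invented for illustration) of how `Series.cat.remove_unused_categories` restricts the counts to labels that are still present:

```python
import pandas as pd

# Toy categorical series; one label ends up with no rows after filtering
s = pd.Series(["Gloves", "Socks", "Lift Tickets", "Gloves"], dtype="category")
kept = s[s != "Lift Tickets"]

# The category dtype still remembers "Lift Tickets", so it is counted as 0
counts_with_empty = kept.value_counts()

# Dropping unused categories leaves only labels that still have rows
counts_observed = kept.cat.remove_unused_categories().value_counts()

print(counts_with_empty)
print(counts_observed)
```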

In [23]:
class_percentages = pd.Series([(x / products.shape[0]) * 100.00 for x in class_counts])
print(class_percentages)

0      12.423901
1       5.772266
2       5.501691
3       4.892897
4       3.652762
         ...    
264     0.000000
265     0.000000
266     0.000000
267     0.000000
268     0.000000
Length: 269, dtype: float64
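
The list comprehension above discards the category labels, which is why the percentages are indexed 0 to 268. A small alternative sketch, shown on toy data (`toy_categories` is a made-up series), uses `value_counts(normalize=True)` so each percentage stays attached to its label:

```python
import pandas as pd

# Toy target column, invented for illustration
toy_categories = pd.Series(["Gloves", "Gloves", "Socks", "Hats"], name="Category")

# normalize=True returns fractions; multiply by 100 for percentages
class_pct = toy_categories.value_counts(normalize=True) * 100

print(class_pct.round(2))
```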


Ways in which this code prevents data leakage:

* **Data Splitting Before Fitting:** The code splits the data into training, validation, and testing sets before any transformer or model is fitted, and the testing set is scored only once, at the final evaluation stage. Nothing learned from the testing set feeds back into training.

* **Stratified Splitting:** The `train_test_split` calls above do not pass `stratify`, so the splits are random rather than stratified. Stratifying on `y` would preserve the class distribution across the training, validation, and testing sets, which matters with this many category labels, many of them rare. It improves how representative each split is rather than preventing leakage as such.

* **Preprocessing within Pipelines:** Scaling, imputation, and one-hot encoding live inside the pipeline, so `pipeline.fit(X_train, y_train)` learns their parameters (means, standard deviations, category vocabularies) from the training set only. The validation and testing sets are then transformed with those fitted parameters, so they cannot influence the preprocessing.

* **Explicit Data Handling:** The code drops rows with missing values before splitting. Dropping is decided row by row and uses no statistics from other rows, so it cannot carry information between splits. It also leaves the pipeline's imputer with nothing to fill on this data.

Together, these choices keep the testing set out of model fitting, so the reported test accuracy is a fair estimate of how the model generalizes.
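
As an illustration of stratified splitting, here is a minimal sketch of a 60/20/20 split that passes `stratify` to both `train_test_split` calls. The toy `X` and `y` are invented for the example. One caveat that applies to real data with rare labels: `train_test_split` raises an error when a stratified class has only one member, so very rare categories would need to be grouped or filtered first (an assumption about how one might handle them, not something done in this notebook).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an 80/20 class imbalance (invented for illustration)
X = pd.DataFrame({"MSRP": range(100)})
y = pd.Series(["Gloves"] * 80 + ["Socks"] * 20)

# First split: hold out 20% for testing, preserving class proportions
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: 25% of the remaining 80% becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42,
    stratify=y_train_val)

# Each split should show the same class proportions as the full data
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, part.value_counts(normalize=True).to_dict())
```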

## Summary

The code loads the products dataset and casts the feature and target columns to explicit dtypes. It handles missing values by dropping every row that contains a NaN. Preprocessing is defined inside a pipeline: the numerical feature (`MSRP`) is scaled, and the categorical features are imputed and then one-hot encoded. The dataset is split into training, validation, and testing sets (60% / 20% / 20%). The pipeline combines the preprocessing with a `RandomForestClassifier`, is trained on the training set, and is scored for accuracy on the validation set (0.793) and then on the testing set (0.788), with the goal of classifying products into categories from their features.

The first model reaches an accuracy of about 79% on both the validation and testing sets. I think we can do much better. Stay tuned for the next section!

In [26]:
%store products
%store X_train
%store X_test
%store y_train
%store y_test

Stored 'products' (DataFrame)
Stored 'X_train' (DataFrame)
Stored 'X_test' (DataFrame)
Stored 'y_train' (Series)
Stored 'y_test' (Series)


<!-- TO DO: Future pre-processing could involve cross-validation (k-fold) to minimize the variance of the performance estimate. That gives K models; decide which ones you want to use; then train the model once more without any cross-validation, since the more data you have the better. -->
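
As a possible next step, k-fold cross-validation could replace the single validation split. The sketch below is a minimal example on synthetic data, not the notebook's dataset: the `toy` frame, its column values, and the choice of 5 folds and 50 trees are all assumptions made for illustration. Because the preprocessing sits inside the pipeline, `cross_val_score` refits it on the training portion of each fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the products table (invented for illustration)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "MSRP": rng.uniform(5, 200, n),
    "Brand": rng.choice(["A", "B", "C"], n),
})
toy["Category"] = np.where(toy["MSRP"] > 100, "Outerwear", "Accessories")

# Same structure as the notebook's pipeline, reduced to two features
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["MSRP"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# 5-fold cross-validation: one accuracy score per held-out fold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, toy[["MSRP", "Brand"]], toy["Category"], cv=cv)

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

The spread of the fold scores gives a sense of how much a single validation figure might vary with the choice of split.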