# 3. Pre-processing and Training Data Development

* [3 Training Data](#2_Data_training_introduction)
  * [3.1 Dummy Variables/One Hot Encoding for Categorical](#3.1_one_hot_encoding)
  * [3.2 Standardize Numerical Data](#3.2_standardize)
  * [3.3 Testing Training](#3.3_testing_training)
  * [3.4 Summary](#3.7_Summary)

## Training Data <a id="2_Data_training_introduction"></a>

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

print("Loaded Libraries")

Loaded Libraries


In [33]:
# Load the data
products = pd.read_csv("../data/processed/products.csv")

# New category list (used by the encoding cells below)
new_categories = pd.read_csv("../data/processed/Sunlight-Categories.csv")["Category"].tolist()

# Inspect data
print(products.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21396 entries, 0 to 21395
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Category        21396 non-null  object 
 1   Brand           18956 non-null  object 
 2   Description     21396 non-null  object 
 3   Keyword         18956 non-null  object 
 4   UPC             18970 non-null  object 
 5   MSRP            21396 non-null  float64
 6   Quantity        21396 non-null  int64  
 7   SKU             21396 non-null  object 
 8   Color           15387 non-null  object 
 9   Size            15806 non-null  object 
 10  StyleNumber     7000 non-null   object 
 11  StyleName       8881 non-null   object 
 12  ParentCategory  21396 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 2.1+ MB
None


### I. Dummy Variables/One Hot Encoding for Categorical <a id="3.1_one_hot_encoding"></a>
Many machine learning models, Logistic Regression among them, cannot use categorical features directly; the features must first be encoded as numbers. A common transformation is dummy (one-hot) encoding: each possible value of a feature becomes a new column, which holds a 1 if the data instance (a row of the data) has that value and a 0 otherwise.

Let's identify the categorical variables that need dummy encoding and create dummy variables using `pd.get_dummies` and `OneHotEncoder`.
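
To make the encoding concrete, here is a minimal sketch on a hypothetical toy frame. The `Color` values are illustrative and are not taken from `products.csv`; only `pd.get_dummies` itself is used the same way as in the cells that follow.

```python
import pandas as pd

# Hypothetical toy data (assumption: not drawn from products.csv)
toy = pd.DataFrame({"Color": ["Red", "Blue", "Red", "Green"]})

# One new column per distinct value; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy, columns=["Color"], dtype=int)
print(dummies)
```

Each row ends up with exactly one 1 across the `Color_*` columns, which is the property the explanation above describes.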

In [34]:
# Cast the feature columns to categorical or string dtypes
products['Brand'] = products['Brand'].astype('category')
products['Color'] = products['Color'].astype('string')
products['Size'] = products['Size'].astype('string')
products['Description'] = products['Description'].astype('string')
products['ParentCategory'] = products['ParentCategory'].astype('category')

# One dummy column per entry in the new_categories list
dummy_categories = pd.get_dummies(pd.Series(new_categories), prefix='category')
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21396 entries, 0 to 21395
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Category        21396 non-null  object  
 1   Brand           18956 non-null  category
 2   Description     21396 non-null  string  
 3   Keyword         18956 non-null  object  
 4   UPC             18970 non-null  object  
 5   MSRP            21396 non-null  float64 
 6   Quantity        21396 non-null  int64   
 7   SKU             21396 non-null  object  
 8   Color           15387 non-null  string  
 9   Size            15806 non-null  string  
 10  StyleNumber     7000 non-null   object  
 11  StyleName       8881 non-null   object  
 12  ParentCategory  21396 non-null  category
dtypes: category(2), float64(1), int64(1), object(6), string(3)
memory usage: 1.9+ MB


In [35]:
# Concatenate dummy variables with the original dataset.
# pd.concat aligns on the row index, so rows beyond len(new_categories)
# receive NaN in the dummy columns.
df = pd.concat([products, dummy_categories], axis=1)

# Scale numerical fields using StandardScaler
scaler = StandardScaler()
numerical_fields = ['MSRP']  # Add other numerical fields as needed
df[numerical_fields] = scaler.fit_transform(df[numerical_fields])

So far, we have scaled the one numerical variable, MSRP, and dummy-coded the list of new categories. Next, let's encode the categorical features as well.

In [38]:
categorical_features = ['Brand',
                        'Size',
                        'Color',
                        'Description',
                        'ParentCategory',
                        'Category']
# One-hot encode the categorical columns into a new frame
dflog = pd.get_dummies(products, columns=categorical_features)

# Report the shape and columns of the original (unencoded) products frame
print('The data have ', products.shape[0], ' rows and ', products.shape[1], ' columns\n')
print('column names: \n')
print('\n'.join(list(products.columns)))

The data have  21396  rows and  13  columns

column names: 

Category
Brand
Description
Keyword
UPC
MSRP
Quantity
SKU
Color
Size
StyleNumber
StyleName
ParentCategory


In [37]:
# Convert the list to a numpy array
new_categories_array = np.array(new_categories)

# Reshape the array to have a single column
new_categories_array = new_categories_array.reshape(-1, 1)

# One-hot encoding for new categories
# (sparse_output replaces the `sparse` argument in scikit-learn 1.2 and later)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_features = pd.DataFrame(encoder.fit_transform(new_categories_array))

# Encoded features now contain columns for each new category
print(encoded_features)

[Output truncated in the source: a 147 x 147 DataFrame of one-hot indicator values, with rows and columns indexed 0 to 146.]



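### II. Standardize Numerical Data <a id="3.2_standardize"></a>

Standardize the numeric features with `StandardScaler` so that each has zero mean and unit variance. Here that means `MSRP`; the `products` frame loaded above has no `Cost` column.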
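
As a quick illustration of what the scaler does, the sketch below standardizes four hypothetical prices (made-up numbers, not MSRP values from the dataset). `StandardScaler` computes z = (x - mean) / std for each column, using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical prices (assumption: illustrative values only)
prices = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(prices)

print("learned mean:", scaler.mean_[0])
print("scaled values:", scaled.ravel())
print("mean and std after scaling:", scaled.mean(), scaled.std())
```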

In [26]:
# Scale numerical fields using StandardScaler
scaler = StandardScaler()
numerical_fields = ['MSRP']  # Add other numerical fields as needed
df[numerical_fields] = scaler.fit_transform(df[numerical_fields])

In [39]:
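# Scaled numerical columns (MSRP was standardized in place above)
scaled_numerical_cols = df[numerical_fields]

# `features` is not defined earlier in the notebook. As an assumption,
# use MSRP plus one-hot encoded Brand, Size and Color.
features = pd.get_dummies(products[['MSRP', 'Brand', 'Size', 'Color']],
                          columns=['Brand', 'Size', 'Color'])

# Combine scaled numerical columns with non-scaled features
scaled_features = pd.concat([scaled_numerical_cols, features.drop(['MSRP'], axis=1)], axis=1)

### III. Splitting Data into Training and Testing Sets <a id="3.3_testing_training"></a>

In [40]:
# `target` is not defined earlier in the notebook. As an assumption,
# predict ParentCategory, the class whose distribution is examined below.
target = products['ParentCategory']

# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(scaled_features, target, test_size=0.2, random_state=42)

# Now you have:
# - X_train: Training features (scaled)
# - X_test: Testing features (scaled)
# - y_train: Training target variable
# - y_test: Testing target variable

# It's ready for training and evaluation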
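
The cells above fit the scaler on the full dataset before splitting. A common alternative, sketched here as an assumption rather than as this notebook's method, is to split first and fit the preprocessing on the training rows only, so that no test-set statistics leak into it. The toy frame and its values are hypothetical; `ColumnTransformer` is already imported at the top of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (assumption: illustrative values only)
toy = pd.DataFrame({
    "MSRP": [10.0, 25.0, 40.0, 15.0, 60.0, 35.0, 20.0, 55.0],
    "Brand": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "ParentCategory": ["Shoes", "Hats", "Shoes", "Hats",
                       "Shoes", "Hats", "Shoes", "Hats"],
})
X = toy[["MSRP", "Brand"]]
y = toy["ParentCategory"]

# Split first, so the test rows never influence the fitted statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale the numeric column and one-hot encode the categorical one;
# sparse_threshold=0 forces a dense array as output
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["MSRP"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand"]),
    ],
    sparse_threshold=0,
)

X_train_prepared = preprocess.fit_transform(X_train)  # fit on training rows only
X_test_prepared = preprocess.transform(X_test)        # reuse the fitted statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the encoder uses `handle_unknown="ignore"`, a brand that appears only in the test rows is encoded as all zeros instead of raising an error.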

#### Proportion of classes
When building classification models, it helps to check up front how many samples fall in each class and what share of the total each class represents. First we get the counts of each class.

In [None]:
class_counts = products['ParentCategory'].value_counts()
class_counts

In [None]:
# Share of each class as a percentage of all rows (keeps the class labels as the index)
class_percentages = class_counts / products.shape[0] * 100
class_percentages

In [None]:
fig, ax = plt.subplots()
ax.bar(class_counts.index, class_counts)
ax.set_ylabel('Count')
ax.set_xlabel('Category')
ax.set_title('Category Distribution',
              fontsize = 10)
plt.show()
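
One reason the class proportions matter is that a purely random split can under-represent rare classes in the test set. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class shares equal in both parts. It is an optional variation on the split used earlier, and the toy data (80% "Shoes", 20% "Hats") is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data (assumption: illustrative values only)
toy = pd.DataFrame({
    "MSRP": list(range(20)),
    "ParentCategory": ["Shoes"] * 16 + ["Hats"] * 4,
})

# stratify preserves the class proportions in both the train and test parts
train, test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["ParentCategory"]
)

train_share = train["ParentCategory"].value_counts(normalize=True)
test_share = test["ParentCategory"].value_counts(normalize=True)

print(train_share)
print(test_share)
```

Note that stratification requires at least two samples in every class, so very rare categories may need to be merged or dropped first.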

In [31]:
import seaborn as sns  # required for sns.heatmap below

# Compute correlation matrix
correlation_matrix = encoded_features.corr()

# Plot heatmap; annotations are disabled because a 147 x 147 grid of labels is unreadable
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Heatmap of Encoded Features')
plt.show()