# ML Capstone 1 - Part 2 E-Commerce Customer Segmentation

## TODO

### Feature Selection
- Selecting relevant features for segmentation and classification based on EDA insights
- Scaling numerical features and encoding categorical variables

### Model Preparation
- Preparing the dataset for clustering algorithms and classification models


### Grading and Important Instructions
- Each of the above steps is mandatory and should be completed in good faith
- Before submitting, make sure the code is in fully working condition
- It is fine to use ChatGPT, Stack Overflow, and similar resources; just provide reference links to the sources you used
- Debugging is an art. If you get stuck on errors, use Stack Overflow and ChatGPT to resolve the issue; if it is still unresolved, reach out to me for help.
- You need to score at least 7/10 to pass the project; anything less will be marked Required and will need resubmission.
- Feedback will be provided on 3 levels (Awesome, Suggestion, and Required). Required changes are mandatory.
- For submission, please upload the project to GitHub and share the link to the file with us through the LMS.

#### Write your code below and do not delete the above instructions

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Loading the pre-processed dataset.
It has already been cleaned to handle missing values, duplicates, and outliers.

In [2]:
# Load the cleaned CSV file
data = pd.read_csv("ecommerce_data_cleaned.csv")
# If the file fails to decode with the default encoding, try encoding='ISO-8859-1'


# Feature Selection

Ensuring 'TotalAmount' exists

In [4]:
if 'TotalAmount' not in data.columns:
    data['TotalAmount'] = data['Quantity'] * data['UnitPrice']

# Selecting numerical and categorical features
numerical_features = ['Quantity', 'UnitPrice', 'TotalAmount']
categorical_features = ['Country'] if 'Country' in data.columns else []


In [5]:
if categorical_features:
    # scikit-learn >= 1.2 names this parameter `sparse_output`; older releases call it `sparse`
    encoder = OneHotEncoder(sparse_output=False, drop='first')
    encoded_features = encoder.fit_transform(data[categorical_features])
    encoded_feature_names = encoder.get_feature_names_out(categorical_features)
    encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)
    data = pd.concat([data.reset_index(drop=True), encoded_df], axis=1)

Categorical features are one-hot encoded, and the first category of each feature is dropped to avoid multicollinearity.

In [7]:
# Dropping original categorical columns
data = data.drop(columns=categorical_features)

# Model Preparation

Normalization/Standardization

Min-Max Normalization for Clustering

In [8]:
scaler_minmax = MinMaxScaler()
data_normalized = scaler_minmax.fit_transform(data[numerical_features])

# Standardization for Classification
scaler_standard = StandardScaler()
data_standardized = scaler_standard.fit_transform(data[numerical_features])
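As a quick check of what the two scalers compute, the sketch below compares them with the hand-written formulas on a hypothetical toy column (the values are made up for illustration). Min-max scaling maps each value to `(x - min) / (max - min)`, while standardization maps it to `(x - mean) / std` using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy column with one large value, for illustration only.
values = np.array([[1.0], [2.0], [3.0], [10.0]])

toy_minmax = MinMaxScaler().fit_transform(values)
toy_standard = StandardScaler().fit_transform(values)

# Hand-written equivalents of the two transformations.
manual_minmax = (values - values.min()) / (values.max() - values.min())
manual_standard = (values - values.mean()) / values.std()  # ddof=0, as StandardScaler uses

print("Min-max matches manual formula:", np.allclose(toy_minmax, manual_minmax))
print("Standardization matches manual formula:", np.allclose(toy_standard, manual_standard))
```

Min-max output is bounded to [0, 1], which keeps distance-based clustering from being dominated by one feature's units. Standardized output has zero mean and unit variance but no fixed range.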


In [9]:
# Adding scaled features to the dataset
data_normalized_df = pd.DataFrame(data_normalized, columns=[f"{col}_normalized" for col in numerical_features])
data_standardized_df = pd.DataFrame(data_standardized, columns=[f"{col}_standardized" for col in numerical_features])
data = pd.concat([data, data_normalized_df, data_standardized_df], axis=1)

In [10]:
# Dimensionality Reduction (PCA for Clustering)
# Reducing to 2 dimensions for visualization or clustering
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_normalized)

# Adding PCA results to the dataset
data['PCA_1'] = data_pca[:, 0]
data['PCA_2'] = data_pca[:, 1]
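Before relying on two components, it can help to check how much variance they retain. The sketch below does this on randomly generated data that stands in for `data_normalized`; the array `toy_features` and its shape are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the normalized feature matrix (200 rows, 3 features).
rng = np.random.default_rng(42)
toy_features = rng.random((200, 3))

toy_pca_model = PCA(n_components=2)
toy_components = toy_pca_model.fit_transform(toy_features)

# Fraction of total variance captured by each retained component.
variance_ratio = toy_pca_model.explained_variance_ratio_
retained = variance_ratio.sum()

print("Explained variance ratio per component:", variance_ratio)
print("Total variance retained:", retained)
```

On the real data, the same attribute is available as `pca.explained_variance_ratio_` after fitting. A low total would suggest that two components discard a lot of structure.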


In [11]:
# Data Splitting (For Classification)
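# Use the existing 'Cluster' column if present; otherwise generate a synthetic target for demonstration
if 'Cluster' not in data.columns:
    # Random placeholder labels (seeded for reproducibility); they carry no real segment information
    data['Cluster'] = np.random.default_rng(42).integers(0, 4, size=len(data))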
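Random labels are only a placeholder. One possible alternative, shown here as a sketch and not as this notebook's actual method, is to derive labels from a clustering model such as `KMeans`. The toy points below stand in for the two PCA columns, and the choice of 4 clusters simply mirrors the placeholder above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for data[['PCA_1', 'PCA_2']].
rng = np.random.default_rng(42)
toy_points = rng.random((300, 2))

# n_clusters=4 is an assumption that mirrors the synthetic target, not a tuned value.
toy_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
toy_labels = toy_kmeans.fit_predict(toy_points)

# Number of points assigned to each cluster.
print(np.bincount(toy_labels))
```

Labels produced this way reflect structure in the features, so a classifier trained on them has something to learn, unlike with uniformly random targets.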

Splitting the data into training (70%), validation (15%), and test (15%) sets, stratified by cluster label

In [12]:
 
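# Features. Note: X still contains identifier and text columns (e.g. InvoiceNo, Description),
# which most classifiers cannot use directly.
X = data.drop(columns=['Cluster'])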
y = data['Cluster']  # Target

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
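The two-stage split can be sanity-checked on synthetic data. The sketch below uses a hypothetical array of 1000 rows with four balanced classes and confirms that holding out 30% and then halving the held-out part gives a 70/15/15 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, 4 balanced classes.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.tile([0, 1, 2, 3], 250)

# Stage 1: hold out 30% of the rows.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)
# Stage 2: split the held-out 30% evenly into validation and test sets.
X_v, X_te, y_v, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

sizes = (len(X_tr), len(X_v), len(X_te))
print("Train / validation / test sizes:", sizes)
print("Train class counts:", np.bincount(y_tr))
```

Because `stratify` is passed at both stages, each subset keeps roughly the same class proportions as the full dataset.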



# Display results
Display the prepared dataset and splits using standard print statements

In [14]:

print("Prepared Dataset (sample):")
print(data.head())

print("\nTraining Set (sample):")
print(X_train.head())

print("\nValidation Set (sample):")
print(X_val.head())

print("\nTest Set (sample):")
print(X_test.head())

print("\nFeature engineering, normalization, and data splitting complete.")

Prepared Dataset (sample):
   InvoiceNo StockCode                          Description  Quantity  \
0     536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1     536365     71053                  WHITE METAL LANTERN         6   
2     536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3     536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4     536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID  Country_Austria  \
0  2010-12-01 08:26:00       2.55     17850.0                0   
1  2010-12-01 08:26:00       3.39     17850.0                0   
2  2010-12-01 08:26:00       2.75     17850.0                0   
3  2010-12-01 08:26:00       3.39     17850.0                0   
4  2010-12-01 08:26:00       3.39     17850.0                0   

   Country_Bahrain  Country_Belgium  ...  TotalAmount  Quantity_normalized  \
0                0                0  ...        15.30      