In [None]:
#Title: Week 10.2: Project Milestone 3: Model Building and Evaluation
#Author: Brett Werner
#Date: 16 Nov 2025
#Created By: Sathya Raj Eswaran
#Description: Project Milestone 3: Model Building and Evaluation
#===========================================

In [25]:
# Importing Libraries
import warnings

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [3]:
# Load the dataset
# A forward slash in the path works on Windows, macOS, and Linux
df = pd.read_csv("data/E-commerce Customer Behavior Dataset.csv")
# Display the first 5 rows to visually inspect the data.
print("Successfully loaded")
print("\nFirst 5 rows of the DataFrame:")
print(df.head())

Successfully loaded

First 5 rows of the DataFrame:
   Customer ID  Gender  Age           City Membership Type  Total Spend  \
0          101  Female   29       New York            Gold      1120.20   
1          102    Male   34    Los Angeles          Silver       780.50   
2          103  Female   43        Chicago          Bronze       510.75   
3          104    Male   30  San Francisco            Gold      1480.30   
4          105    Male   27          Miami          Silver       720.40   

   Items Purchased  Average Rating  Discount Applied  \
0               14             4.6              True   
1               11             4.1             False   
2                9             3.4              True   
3               19             4.7             False   
4               13             4.0              True   

   Days Since Last Purchase Satisfaction Level  
0                        25          Satisfied  
1                        18            Neutral  
2            

In [5]:
# 1. Data Cleaning
# Drop rows with missing 'Satisfaction Level'

print(f"Original Row Count: {len(df)}")
df_temp = df.dropna(subset=['Satisfaction Level']).copy()

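# Compute the number of dropped rows instead of hard-coding it
rows_dropped = len(df) - len(df_temp)
print(f"Final Row Count (after dropping {rows_dropped} NaNs): {len(df_temp)}")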
print("\nFinal Data Info:")


Original Row Count: 350
Final Row Count (after dropping 2 NaNs): 348

Final Data Info:


In [7]:
# Identify the feature to drop
features_to_drop = ['Customer ID']

# Drop the non-useful feature
df_dropped = df_temp.drop(columns=features_to_drop)

In [9]:
# Display the first 5 rows of the new DataFrame to confirm the drop
print(f"Features dropped: {', '.join(features_to_drop)}")
print("\nFirst 5 rows of the DataFrame after dropping 'Customer ID':")
print(df_dropped.head().to_markdown(index=False, numalign="left", stralign="left"))

# Compare against df_temp (the frame the column was dropped from), not the raw df
print("\nShape of the DataFrame before dropping:", df_temp.shape)
print("Shape of the DataFrame after dropping:", df_dropped.shape)

Features dropped: Customer ID

First 5 rows of the DataFrame after dropping 'Customer ID':
| Gender   | Age   | City          | Membership Type   | Total Spend   | Items Purchased   | Average Rating   | Discount Applied   | Days Since Last Purchase   | Satisfaction Level   |
|:---------|:------|:--------------|:------------------|:--------------|:------------------|:-----------------|:-------------------|:---------------------------|:---------------------|
| Female   | 29    | New York      | Gold              | 1120.2        | 14                | 4.6              | True               | 25                         | Satisfied            |
| Male     | 34    | Los Angeles   | Silver            | 780.5         | 11                | 4.1              | False              | 18                         | Neutral              |
| Female   | 43    | Chicago       | Bronze            | 510.75        | 9                 | 3.4              | True               | 42                         | Unsatis

In [11]:
# 2. Convert 'Discount Applied' from boolean to integer (0 or 1) for modeling
df_dropped['Discount Applied'] = df_dropped['Discount Applied'].astype(int)
print(df_dropped.info())
print("\nFirst 5 rows of the Final Cleaned DataFrame:")
print(df_dropped.head().to_markdown(index=False, numalign="left", stralign="left"))

<class 'pandas.core.frame.DataFrame'>
Index: 348 entries, 0 to 349
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Gender                    348 non-null    object 
 1   Age                       348 non-null    int64  
 2   City                      348 non-null    object 
 3   Membership Type           348 non-null    object 
 4   Total Spend               348 non-null    float64
 5   Items Purchased           348 non-null    int64  
 6   Average Rating            348 non-null    float64
 7   Discount Applied          348 non-null    int32  
 8   Days Since Last Purchase  348 non-null    int64  
 9   Satisfaction Level        348 non-null    object 
dtypes: float64(2), int32(1), int64(3), object(4)
memory usage: 28.5+ KB
None

First 5 rows of the Final Cleaned DataFrame:
| Gender   | Age   | City          | Membership Type   | Total Spend   | Items Purchased   | Average Rating   | Disc

In [13]:
# --- Feature Transformation (Encoding) ---

# 5. One-Hot Encoding for nominal features: 'Gender' and 'City'
# 'drop_first=True' is used to avoid multicollinearity.
df_F = pd.get_dummies(df_dropped, columns=['Gender', 'City'], drop_first=True)

# 6. Ordinal Encoding for 'Membership Type' (assuming a logical order: Bronze < Silver < Gold)
membership_order = {'Bronze': 1, 'Silver': 2, 'Gold': 3}
df_F['Membership Type Encoded'] = df_F['Membership Type'].map(membership_order)
df_F = df_F.drop(columns=['Membership Type'])

# 7. Label Encoding for the target variable 'Satisfaction Level'
# (Assigning levels: 0=Unsatisfied, 1=Neutral, 2=Satisfied)
satisfaction_order = {'Unsatisfied': 0, 'Neutral': 1, 'Satisfied': 2}
df_F['Satisfaction Level Encoded'] = df_F['Satisfaction Level'].map(satisfaction_order)
df_F = df_F.drop(columns=['Satisfaction Level'])

In [15]:
# 'Discount Applied' was already converted to integer (1/0) in an earlier cell,
# so no further conversion is needed here.

# 3. Engineer New Useful Features (RFM-derived metrics)
# Note: these ratios assume 'Items Purchased' and 'Days Since Last Purchase' are never 0.
# Average Item Price: Captures the customer's value segment (how expensive their items are)
df_F['Average_Item_Price'] = df_F['Total Spend'] / df_F['Items Purchased']

# Engagement Score: Measures frequency/volume relative to recency
df_F['Engagement_Score'] = df_F['Items Purchased'] / df_F['Days Since Last Purchase']

# Recency Value Ratio: Measures monetary value relative to recency (high-value, active customer)
df_F['Recency_Value_Ratio'] = df_F['Total Spend'] / df_F['Days Since Last Purchase']
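
The three ratios above divide by `Items Purchased` and `Days Since Last Purchase`. If either column ever contained a 0, pandas would produce `inf` (or `NaN` for 0/0), which `StandardScaler` and `LogisticRegression` reject. The following is a minimal sketch of one possible guard, assuming a zero denominator should be treated as a missing value. The helper name `safe_ratio` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN wherever the denominator is 0 (hypothetical helper)."""
    return numerator / denominator.replace(0, np.nan)


# Toy data (not from the real dataset); the second row has a zero denominator
toy = pd.DataFrame({
    "Total Spend": [100.0, 50.0, 80.0],
    "Days Since Last Purchase": [10, 0, 4],
})
toy["Recency_Value_Ratio"] = safe_ratio(
    toy["Total Spend"], toy["Days Since Last Purchase"]
)
print(toy)
```

Rows that end up with `NaN` would still need to be imputed or dropped before scaling; which of the two is appropriate depends on what a zero means in the data.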

In [17]:
df_F.head(5)

Unnamed: 0,Age,Total Spend,Items Purchased,Average Rating,Discount Applied,Days Since Last Purchase,Gender_Male,City_Houston,City_Los Angeles,City_Miami,City_New York,City_San Francisco,Membership Type Encoded,Satisfaction Level Encoded,Average_Item_Price,Engagement_Score,Recency_Value_Ratio
0,29,1120.2,14,4.6,1,25,False,False,False,False,True,False,3,2,80.014286,0.56,44.808
1,34,780.5,11,4.1,0,18,True,False,True,False,False,False,2,1,70.954545,0.611111,43.361111
2,43,510.75,9,3.4,1,42,False,False,False,False,False,False,1,0,56.75,0.214286,12.160714
3,30,1480.3,19,4.7,0,12,True,False,False,False,False,True,3,2,77.910526,1.583333,123.358333
4,27,720.4,13,4.0,1,55,True,False,False,True,False,False,2,0,55.415385,0.236364,13.098182


In [27]:
# --- 1. Data Splitting (Train/Test) ---

# Define Features (X) and Target (y)
X = df_F.drop('Satisfaction Level Encoded', axis=1)
y = df_F['Satisfaction Level Encoded']

# Split the data (70% Train, 30% Test) and stratify to maintain class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Data split into: Train ({len(X_train)} samples), Test ({len(X_test)} samples)")

# --- 2. Feature Scaling (recommended: helps lbfgs converge and keeps the L2 penalty comparable across features) ---

# Identify numerical columns for scaling
numerical_cols = [
    'Age', 'Total Spend', 'Items Purchased', 'Average Rating',
    'Days Since Last Purchase', 'Average_Item_Price',
    'Engagement_Score', 'Recency_Value_Ratio'
]

# Initialize StandardScaler
scaler = StandardScaler()

# Create copies for scaling
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Fit scaler only on the training data and transform both sets
X_train_scaled[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test_scaled[numerical_cols] = scaler.transform(X_test[numerical_cols])

print("\nNumerical features scaled using StandardScaler.")

# --- 3. Model Training (Logistic Regression Baseline) ---

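# Initialize the Logistic Regression Model
# With the lbfgs solver, scikit-learn fits a multinomial (softmax) model for 3+ classes by default,
# so the 'multi_class' argument (deprecated in recent scikit-learn releases) is omitted.
log_reg = LogisticRegression(
    solver='lbfgs',      # Default solver, good for small datasets
    max_iter=1000,       # Increased iterations for robust convergence
    random_state=42
)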

# Train the model on the scaled training data
log_reg.fit(X_train_scaled, y_train)

print("\nLogistic Regression model trained successfully.")

# --- 4. Model Evaluation ---

# Make predictions on the scaled test set
y_pred = log_reg.predict(X_test_scaled)

print(f"\n--- Evaluation Results (Logistic Regression Baseline) ---\n")

# 4.1. Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Accuracy: {accuracy:.4f}\n")

# 4.2. Classification Report (F1-Score, Precision, Recall)
target_names = ['Unsatisfied (0)', 'Neutral (1)', 'Satisfied (2)']
report = classification_report(y_test, y_pred, target_names=target_names)
print("Classification Report:")
print(report)

# 4.3. Confusion Matrix (Visualizing Errors)
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix (True Rows vs Predicted Columns):")
print(pd.DataFrame(conf_matrix,
                   index=['True ' + name for name in target_names],
                   columns=['Pred ' + name for name in target_names]))

# 4.4. Feature Importance (from coefficients for the first class, Unsatisfied)
# Note: the multinomial model has no reference class. log_reg.coef_ holds one row of
# coefficients per class, in the order of log_reg.classes_, so coef_[0] contains the
# coefficients that raise or lower the model's score for class 0 (Unsatisfied).
coef_col = 'Coeff (Class 0: Unsatisfied)'
coefficients_df = pd.DataFrame(log_reg.coef_[0], index=X_train_scaled.columns, columns=[coef_col])
coefficients_df['Absolute Value'] = coefficients_df[coef_col].abs()
coefficients_df = coefficients_df.sort_values(by='Absolute Value', ascending=False)

print("\nTop 5 Feature Importances (Absolute Coefficients for Class 0, Unsatisfied):")
print(coefficients_df.head(5)[[coef_col]].to_markdown(numalign="left", stralign="left"))

Data split into: Train (243 samples), Test (105 samples)

Numerical features scaled using StandardScaler.

Logistic Regression model trained successfully.

--- Evaluation Results (Logistic Regression Baseline) ---

Overall Accuracy: 1.0000

Classification Report:
                 precision    recall  f1-score   support

Unsatisfied (0)       1.00      1.00      1.00        35
    Neutral (1)       1.00      1.00      1.00        32
  Satisfied (2)       1.00      1.00      1.00        38

       accuracy                           1.00       105
      macro avg       1.00      1.00      1.00       105
   weighted avg       1.00      1.00      1.00       105


Confusion Matrix (True Rows vs Predicted Columns):
                      Pred Unsatisfied (0)  Pred Neutral (1)  \
True Unsatisfied (0)                    35                 0   
True Neutral (1)                         0                32   
True Satisfied (2)                       0                 0   

                      Pre

### Model Building and Evaluation Overview

The model-building phase used a Logistic Regression classifier to predict customer satisfaction levels (Unsatisfied, Neutral, Satisfied). The evaluation phase provided two key insights: an exceptionally high-performing model and a set of important features driving the classification.

### Model Performance

The Logistic Regression model achieved **100%** accuracy on the 105-sample test set. Precision, recall, and F1-score were also **1.00** for all three classes, and the confusion matrix showed zero misclassifications.

### Confusion Matrix Layout

| True Rows \ Predicted Columns | Pred Unsatisfied (0) | Pred Neutral (1) | Pred Satisfied (2) |
|-------------------------------|----------------------|------------------|--------------------|
| **True Unsatisfied (0)**      | 35                   | 0                | 0                  |
| **True Neutral (1)**          | 0                    | 32               | 0                  |
| **True Satisfied (2)**        | 0                    | 0                | 38                 |
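
As a worked check, the per-class metrics in the classification report can be recomputed directly from this matrix. The sketch below copies the counts from the table; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the table (rows = true class, columns = predicted class)
cm = np.array([
    [35, 0, 0],   # True Unsatisfied (0)
    [0, 32, 0],   # True Neutral (1)
    [0, 0, 38],   # True Satisfied (2)
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = number predicted per class
recall = true_positives / cm.sum(axis=1)     # row sums = true count (support) per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print("Precision:", precision)
print("Recall:   ", recall)
print("F1-score: ", f1)
print("Accuracy: ", accuracy)
```

Because every off-diagonal cell is zero, each ratio divides a count by itself, which is why all reported metrics equal 1.00.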

### Conclusion on Performance

A perfect score on held-out data is uncommon in real-world settings and should be treated as a warning sign rather than as proof of an effective model. Possible explanations include data leakage, where a feature highly correlated with or directly derived from the target variable was inadvertently included in the training data, or a dataset whose classes are trivially separable. Both possibilities warrant further investigation, for example through cross-validation and by retraining without individual features, to confirm the model's generalizability before deployment.
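
One way to start that investigation is stratified k-fold cross-validation with the scaler inside a pipeline, so that each fold fits the scaler on its own training portion only. The sketch below is not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and it scales every column rather than only the numerical ones, both of which are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (348 rows, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=348, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# The scaler is part of the pipeline, so it is refit on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

If the real data also scores 1.00 on every fold, the perfect result is unlikely to be an artifact of one lucky split, and attention can shift to whether individual features determine the label on their own.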

### Key Feature Insights

The feature importance is derived from the absolute value of the model's coefficients in `log_reg.coef_[0]`, which is the coefficient row for class 0 (Unsatisfied). Because the model is multinomial, this row is not a Neutral-versus-Unsatisfied contrast: a positive coefficient raises the model's score for Unsatisfied relative to the other two classes, and a negative coefficient lowers it. Numerical features were standardized, so their coefficients are expressed per standard deviation. The top drivers were:

### Feature Importance: Class 0 (Unsatisfied)

| Feature | Coeff (Class 0: Unsatisfied) | Interpretation |
|:---|:---:|:---|
| **Days Since Last Purchase** | +1.469 | Most influential. A higher number of days since the last purchase is associated with a higher score for **Unsatisfied**. |
| **Discount Applied** | +1.102 | Receiving a discount is associated with a higher score for **Unsatisfied**. |
| **City_Houston** | -0.764 | Customers from **Houston** receive a lower score for **Unsatisfied** (negative coefficient). |
| **City_Miami** | +0.741 | Customers from **Miami** receive a higher score for **Unsatisfied** (positive coefficient). |
| **Average_Item_Price** | -0.564 | A higher average item price is associated with a lower score for **Unsatisfied** (negative coefficient). |

### Conclusion on Features

Days Since Last Purchase and Discount Applied carry the largest coefficients for the Unsatisfied class, making them the most influential variables for separating Unsatisfied customers from the rest. The presence of City features among the top five suggests that geographic location may play a non-trivial role in determining customer satisfaction, a finding that could inform targeted, localized marketing or service improvements.
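
To make the reading of the coefficients concrete, the sketch below shows how the rows of `coef_` line up with `classes_` in a multinomial `LogisticRegression`, and how a contrast between two classes can be obtained by subtracting their rows. It runs on synthetic data from `make_classification`, so the numbers it prints are unrelated to the customer dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data used purely for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

print("classes_:", model.classes_)        # row k of coef_ belongs to classes_[k]
print("coef_ shape:", model.coef_.shape)  # one coefficient row per class

# Per-class linear scores; a softmax over them gives the predicted probabilities
scores = X_demo @ model.coef_.T + model.intercept_
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
print("matches predict_proba:", np.allclose(probs, model.predict_proba(X_demo)))

# Log-odds of class 1 versus class 0 use the difference between the two rows
class1_vs_class0 = model.coef_[1] - model.coef_[0]
print("class 1 vs class 0 coefficients:", class1_vs_class0.round(3))
```

Applied to the fitted `log_reg`, a Neutral-versus-Unsatisfied contrast would therefore be `log_reg.coef_[1] - log_reg.coef_[0]` rather than `log_reg.coef_[0]` alone.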