# E-Commerce Cart Abandonment Prediction

- Name: **Fahrettin Ege Bilge**
- ID: **21070001052**
- Instructor: **Assoc. Prof. Dr. Ömer ÇETİN**

## Table of Contents

1. [Introduction](#Introduction)
2. [Preprocessing](#Preprocessing)
    - [Overview of Preprocessing Steps](#Overview-of-Preprocessing-Steps)
3. [Split Dataset](#Split-Dataset)
4. [Machine Learning Algorithm Selection](#Machine-Learning-Algorithm-Selection)
    - [1. Logistic Regression: A baseline algorithm for binary classification.](#1.-Logistic-Regression:-A-baseline-algorithm-for-binary-classification.)
    - [2.K-Nearest Neighbors (KNN): A distance-based classification model.](#2.K-Nearest-Neighbors-(KNN):-A-distance-based-classification-model.)
    - [3. Support Vector Machines (SVM): Clear decision boundary for classification tasks.](#3.-Support-Vector-Machines-(SVM):-Clear-decision-boundary-for-classification-tasks.)
    - [4. Naive Bayes: Complements one-hot encoding.](#4.-Naive-Bayes:-Complements-one-hot-encoding.)
5. [Evaluation Metrics](#Evaluation-Metrics)
    - [Accuracy](#Accuracy)
    - [Precision](#Precision)
    - [Recall](#Recall)
    - [F1-Score](#F1-Score)
    - [Confusion Matrix](#Confusion-Matrix)
6. [Results](#Results)
7. [Conclusion](#Conclusion)
8. [Appendices](#Appendices)
9. [References](#References)


## Introduction
Predicting cart abandonment is crucial for e-commerce platforms to reduce lost revenue and improve user experience. This project uses supervised learning techniques to predict the likelihood of users abandoning their carts based on features like cart contents, payment methods, and purchase history.

## Preprocessing
### Overview of Preprocessing Steps
1. **Filtering Relevant Rows**:
   - Rows with `status` values other than `canceled` and `complete` were removed.
   - Justification: `canceled` maps to `abandoned = 1`, while `complete` maps to `abandoned = 0`. Other statuses do not provide relevant information for this task.
2. **Handling Categorical Features**:
   - Categorical variables (`category_name_1` and `payment_method`) were one-hot encoded.
   - Justification: One-hot encoding represents these variables in a format suitable for machine learning models without assuming any ordinal relationship.
3. **Handling Numerical Features**:
   - Numerical features (`price`, `grand_total`, `discount_amount`, `total_purchases`, and `total_orders`) were scaled using `MinMaxScaler`.
   - Justification: Scaling normalizes all features to a common range, preventing features with large magnitudes from dominating the model.
4. **Outlier Handling**:
   - Numerical columns were clipped at the 95th percentile to mitigate the effect of outliers.
   - Justification: Outliers can disproportionately influence certain machine learning models such as Logistic Regression or KNN.
5. **Tracking Customer History**:
   - Aggregated `total_purchases` (sum of `grand_total` for each customer) and `total_orders` (number of orders per customer) were added as features.
   - Justification: These features provide insights into customer behavior and engagement, which are critical for predicting cart abandonment.
6. **Balancing the Dataset**:
   - Undersampling was used to balance the dataset by ensuring equal representation of the abandoned (1) and not abandoned (0) classes.
   - Justification: An imbalanced dataset can bias the model towards the majority class.
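
The steps listed above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions, not the code inside `hlp.preprocess_dataset`, which is not shown in this notebook. The column names `status` and `customer_id`, the function name `preprocess_sketch`, and the toy data are hypothetical. The sketch also clips before scaling so that the scaled values span [0, 1]; the actual order used by the helper may differ.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess_sketch(df: pd.DataFrame, random_state: int = 42) -> pd.DataFrame:
    """Illustrative version of the six preprocessing steps (hypothetical helper)."""
    # Step 1: keep only canceled/complete rows and derive the binary target
    df = df[df["status"].isin(["canceled", "complete"])].copy()
    df["abandoned"] = (df["status"] == "canceled").astype(int)

    # Step 5: customer history ('customer_id' is an assumed column name)
    per_customer = df.groupby("customer_id")["grand_total"]
    df["total_purchases"] = per_customer.transform("sum")
    df["total_orders"] = per_customer.transform("count")

    numeric_cols = ["price", "grand_total", "discount_amount",
                    "total_purchases", "total_orders"]
    df[numeric_cols] = df[numeric_cols].astype(float)

    # Step 4: clip each numerical column at its 95th percentile
    for col in numeric_cols:
        df[col] = df[col].clip(upper=df[col].quantile(0.95))

    # Step 3: scale numerical columns to [0, 1]
    df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

    # Step 2: one-hot encode the categorical columns
    df = pd.get_dummies(df, columns=["category_name_1", "payment_method"], dtype=int)

    # Step 6: undersample so both classes have the size of the smaller class
    n_minority = df["abandoned"].value_counts().min()
    balanced = df.groupby("abandoned").sample(n=n_minority, random_state=random_state)

    # Drop helper columns and shuffle the rows
    balanced = balanced.drop(columns=["status", "customer_id"])
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data, invented purely for illustration
raw = pd.DataFrame({
    "status": ["complete", "canceled", "complete", "refund", "complete", "canceled"],
    "customer_id": [1, 1, 2, 2, 3, 3],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "grand_total": [10.0, 40.0, 30.0, 40.0, 100.0, 60.0],
    "discount_amount": [0.0, 0.0, 5.0, 0.0, 0.0, 10.0],
    "category_name_1": ["A", "B", "A", "B", "A", "B"],
    "payment_method": ["cod", "cod", "card", "card", "cod", "card"],
})
processed = preprocess_sketch(raw)
print(processed["abandoned"].value_counts())
```

Computing the percentile, the scaler range, and the customer aggregates on the full dataset is the simplest option; fitting them on the training portion only is a common alternative when information leakage into the test set is a concern.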

In [None]:
import helper as hlp
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
kaggle_dataset = 'data/kaggle_dataset/Pakistan_Largest_Ecommerce_Dataset.csv'
preprocessed_dataset = 'data/preprocessed_dataset/preprocessed_dataset.csv'


### Preprocessing the Data

In [None]:
hlp.preprocess_dataset(kaggle_dataset, preprocessed_dataset)

### Visualization of Preprocessed Data

In [None]:
hlp.visualize_preprocessed_data(preprocessed_dataset)

## Split Dataset
To train and evaluate machine learning models effectively, the dataset is split into training and testing subsets. This ensures that the model is trained on one portion of the data and evaluated on unseen data to measure its performance. 

#### Steps:
1. **Train-Test Split**: 
   - The dataset is split into 80% training data and 20% testing data.
2. **Stratification**: 
   - Stratified splitting ensures that the class distribution (abandoned vs. not abandoned) is preserved in both the training and testing datasets.
3. **Random State**: 
   - Setting a `random_state` ensures reproducibility of results.

In [None]:
# Load the dataset
df = pd.read_csv(preprocessed_dataset) 

# Ensure 'abandoned' is in the dataframe (required for splitting)
if 'abandoned' not in df.columns:
    raise ValueError("'abandoned' column is missing. Ensure preprocessing includes this target variable.")

# Define features (X) and target variable (y)
X = df.drop(columns=['abandoned'])  # Features
y = df['abandoned']                # Target variable

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,          # 20% of the data for testing
    stratify=y,             # Preserve class distribution
    random_state=42         # Ensure reproducibility
)

# Display shapes and class distribution
print("Dataset Split Information:")
print(f"Training data shape: {X_train.shape} (Features), {y_train.shape} (Target)")
print(f"Testing data shape: {X_test.shape} (Features), {y_test.shape} (Target)")

# Display class distribution in training and testing sets
train_class_distribution = y_train.value_counts(normalize=True).to_dict()
test_class_distribution = y_test.value_counts(normalize=True).to_dict()

print("\nClass Distribution in Training Data:")
for class_label, proportion in train_class_distribution.items():
    print(f"  Class {class_label}: {proportion:.2%}")

print("\nClass Distribution in Testing Data:")
for class_label, proportion in test_class_distribution.items():
    print(f"  Class {class_label}: {proportion:.2%}")


## Machine Learning Algorithm Selection
### Justification for Algorithm Choices

1. **Logistic Regression**:
   - Chosen for its simplicity, interpretability, and efficiency when the classes are close to linearly separable.
   - It also outputs class probabilities, making it suitable for estimating the likelihood of cart abandonment.
2. **K-Nearest Neighbors (KNN)**:
   - A non-parametric algorithm that classifies each sample from the labels of its nearest neighbors under a distance measure.
   - Effective for capturing local patterns and relationships in the data.
3. **Support Vector Machines (SVM)**:
   - Well suited to high-dimensional feature spaces and to scaled numerical features.
   - Provides a clear decision boundary for classification tasks. The notebook uses `LinearSVC`, so this boundary is a linear hyperplane.
4. **Naive Bayes**:
   - Efficient to train, relying on the assumption that features are conditionally independent given the class.
   - It complements the one-hot encoded categorical features in our dataset. The notebook uses `GaussianNB`, which models every feature, including the one-hot columns, as Gaussian.

Two further models, an artificial neural network (`MLPClassifier`) and a random forest (`RandomForestClassifier`), are also trained below for comparison.
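
Each model is fitted and scored through `hlp.evaluate_model` in the cells that follow. The helper's source is not part of this notebook, so the block below is only a plausible sketch of what such a function might do. The result keys mirror the ones the later plotting cells read (`Model`, `Accuracy`, `Precision`, `Recall`, `F1-Score`, `Confusion Matrix`, `Training Time (s)`, `Evaluation Time (s)`, `Memory Used (MB)`). The name `evaluate_model_sketch`, the use of `tracemalloc` for memory measurement, and the synthetic demo data are assumptions.

```python
import time
import tracemalloc

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split


def evaluate_model_sketch(model, X_train, y_train, X_test, y_test, model_name):
    """Fit a model, time it, and collect test-set metrics (hypothetical helper)."""
    tracemalloc.start()  # assumption: peak traced memory as the memory figure

    start = time.perf_counter()
    model.fit(X_train, y_train)
    training_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    evaluation_time = time.perf_counter() - start

    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "Model": model_name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1-Score": f1_score(y_test, y_pred, zero_division=0),
        # Rows are true labels, columns are predicted labels
        "Confusion Matrix": confusion_matrix(y_test, y_pred),
        "Training Time (s)": training_time,
        "Evaluation Time (s)": evaluation_time,
        "Memory Used (MB)": peak_bytes / 1024 ** 2,
    }


# Demo on synthetic data (not the project dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
demo_result = evaluate_model_sketch(
    LogisticRegression(random_state=42), X_tr, y_tr, X_te, y_te,
    "Logistic Regression (demo)",
)
print({k: v for k, v in demo_result.items() if k != "Confusion Matrix"})
```

Returning a plain dictionary per model keeps the later `pd.DataFrame(results)` call trivial, since each dictionary becomes one row.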

In [None]:
# Initialize models
logistic_regression = LogisticRegression(random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)
svm = LinearSVC(random_state=42, max_iter=100000, verbose=True)
naive_bayes = GaussianNB()
# Artificial neural network (MLPClassifier)
ann = MLPClassifier(
    hidden_layer_sizes=(128, 64),  # Two hidden layers with 128 and 64 neurons
    activation='relu',            # ReLU activation function
    solver='adam',                # Adam optimizer
    max_iter=200,                 # Maximum number of iterations
    random_state=42               # Reproducibility
)
random_forest = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)

# Store results
results = []

### 1. Logistic Regression: A baseline algorithm for binary classification.

In [None]:
results.append(hlp.evaluate_model(logistic_regression, X_train, y_train, X_test, y_test, "Logistic Regression"))

### 2.K-Nearest Neighbors (KNN): A distance-based classification model.

In [None]:
results.append(hlp.evaluate_model(knn, X_train, y_train, X_test, y_test, "K-Nearest Neighbors"))

### 3. Support Vector Machines (SVM): Clear decision boundary for classification tasks.

In [None]:
results.append(hlp.evaluate_model(svm, X_train, y_train, X_test, y_test, "Support Vector Machines"))

### 4. Naive Bayes: Complements one-hot encoding.

In [None]:
results.append(hlp.evaluate_model(naive_bayes, X_train, y_train, X_test, y_test, "Naive Bayes"))

### 5. ANN: A costly solution.

In [None]:
results.append(hlp.evaluate_model(ann, X_train, y_train, X_test, y_test, "Artificial Neural Network (ANN)"))

### 6. Random Forest

In [None]:
results.append(hlp.evaluate_model(random_forest, X_train, y_train, X_test, y_test, "Random Forest"))

### Evaluation Metrics

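- **Accuracy**: Share of all predictions that are correct.
- **Precision**: Ratio of true positives to all predicted positives.
- **Recall**: Ratio of true positives to all actual positives.
- **F1 Score**: Harmonic mean of precision and recall.
- **Confusion Matrix**: Counts of correct and incorrect predictions, broken down by true and predicted class.

In the formulas below, the positive class is abandoned (1). TP and TN denote true positives and true negatives; FP and FN denote false positives and false negatives.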

#### Accuracy
$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$

#### Precision
$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

#### Recall
$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

#### F1-Score
$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

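#### Confusion Matrix
With rows as true labels and columns as predicted labels, both ordered (Not Abandoned, Abandoned) as in the heatmaps in the Results section:
$$
\begin{bmatrix}
\text{TN} & \text{FP} \\
\text{FN} & \text{TP}
\end{bmatrix}
$$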
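
As a worked example of the formulas above, the following sketch computes the four scores from raw confusion-matrix counts. The function name `classification_metrics` and the counts are made up for illustration; they are not results from this project.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1-Score": f1}


# Hypothetical counts: 100 test samples, 50 of them abandoned
metrics = classification_metrics(tn=40, fp=10, fn=5, tp=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 85/100, precision is 45/55, recall is 45/50, and F1 is 90/105, which shows how a model can have high recall while precision lags when false positives are more common than false negatives.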


In [None]:
# Create a DataFrame for results
results_df = pd.DataFrame(results)
results_df

## Results
- Comparison of model performance on test data.
- Discussion of strengths and weaknesses of each algorithm.

In [None]:
# Extract metrics from results
performance_metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
filtered_metrics_df = results_df[['Model'] + performance_metrics]

# Plot performance metrics
filtered_metrics_df.set_index('Model').plot(kind='bar', figsize=(12, 6), alpha=0.8)
plt.title("Model Performance Metrics")
plt.ylabel("Score")
plt.xlabel("Model")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(loc='lower right', title="Metrics")
plt.show()


In [None]:
# Plot confusion matrices side by side
fig, axes = plt.subplots(1, len(results), figsize=(20, 6), constrained_layout=True)

for i, result in enumerate(results):
    cm = result['Confusion Matrix']
    model_name = result['Model']
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, 
                xticklabels=['Not Abandoned', 'Abandoned'], 
                yticklabels=['Not Abandoned', 'Abandoned'], ax=axes[i])
    
    axes[i].set_title(f"{model_name}\nConfusion Matrix")
    axes[i].set_ylabel('True Labels' if i == 0 else '')  # Only the first plot has the y-axis label
    axes[i].set_xlabel('Predicted Labels')

plt.suptitle("Confusion Matrices of All Models", fontsize=16)
plt.show()


In [None]:
# Prepare data for grouped bar chart
metrics_melted = pd.melt(
    results_df, id_vars='Model', 
    value_vars=['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    var_name='Metric', value_name='Score'
)

# Plot grouped bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=metrics_melted, x='Metric', y='Score', hue='Model', palette='Set2')
plt.title("Performance Metrics by Model")
plt.ylabel("Score")
plt.xlabel("Metric")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(loc='lower right')
plt.show()


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5), constrained_layout=True)

# Training Time
results_df.plot(kind='bar', x='Model', y='Training Time (s)', ax=axes[0], color='skyblue', legend=False, logy=True)
axes[0].set_title('Training Time (Log Scale)')
axes[0].set_ylabel('Time (s, log scale)')
axes[0].set_xlabel('Model')
axes[0].grid(axis='y', linestyle='--', alpha=0.7)

# Evaluation Time
results_df.plot(kind='bar', x='Model', y='Evaluation Time (s)', ax=axes[1], color='lightgreen', legend=False, logy=True)
axes[1].set_title('Evaluation Time (Log Scale)')
axes[1].set_ylabel('Time (s, log scale)')
axes[1].set_xlabel('Model')
axes[1].grid(axis='y', linestyle='--', alpha=0.7)

# Memory Usage
results_df.plot(kind='bar', x='Model', y='Memory Used (MB)', ax=axes[2], color='salmon', legend=False, logy=True)
axes[2].set_title('Memory Usage (Log Scale)')
axes[2].set_ylabel('Memory (MB, log scale)')
axes[2].set_xlabel('Model')
axes[2].grid(axis='y', linestyle='--', alpha=0.7)

plt.suptitle("Model Training and Evaluation Performance (Log Scale)", fontsize=16)
plt.show()


## Conclusion
Summarize findings, highlight key insights, and suggest potential improvements for future work.

## Appendices
- Sample code snippets.

## References
- Dataset: https://www.kaggle.com/datasets/zusmani/pakistans-largest-ecommerce-dataset/data