# E-Commerce Cart Abandonment Prediction

- Name: **Fahrettin Ege Bilge**
- ID: **21070001052**
- Instructor: **Assoc. Prof. Dr. Ömer ÇETİN**

## Table of Contents

1. [Introduction](#Introduction)
2. [Preprocessing](#Preprocessing)
    - [Overview of Preprocessing Steps](#Overview-of-Preprocessing-Steps)
3. [Split Dataset](#Split-Dataset)
4. [Machine Learning Algorithm Selection](#Machine-Learning-Algorithm-Selection)
    - [1. Logistic Regression: A baseline algorithm for binary classification.](#1.-Logistic-Regression:-A-baseline-algorithm-for-binary-classification.)
    - [2. K-Nearest Neighbors (KNN): A distance-based classification model.](#2.K-Nearest-Neighbors-(KNN):-A-distance-based-classification-model.)
    - [3. Support Vector Machines (SVM): Clear decision boundary for classification tasks.](#3.-Support-Vector-Machines-(SVM):-Clear-decision-boundary-for-classification-tasks.)
    - [4. Naive Bayes: Complements one-hot encoding.](#4.-Naive-Bayes:-Complements-one-hot-encoding.)
5. [Evaluation Metrics](#Evaluation-Metrics)
    - [Accuracy](#Accuracy)
    - [Precision](#Precision)
    - [Recall](#Recall)
    - [F1-Score](#F1-Score)
    - [Confusion Matrix](#Confusion-Matrix)
6. [Results](#Results)
7. [Conclusion](#Conclusion)
8. [Appendices](#Appendices)
9. [References](#References)


## Introduction
Predicting cart abandonment is crucial for e-commerce platforms to reduce lost revenue and improve user experience. This project uses supervised learning techniques to predict the likelihood of users abandoning their carts based on features like cart contents, payment methods, and purchase history.

## Preprocessing
### Overview of Preprocessing Steps
1. **Filtering Relevant Rows**:
   - Rows with `status` values other than `canceled` and `complete` were removed.
   - Justification: `canceled` maps to `abandoned = 1`, while `complete` maps to `abandoned = 0`. Other statuses do not provide relevant information for this task.
2. **Handling Categorical Features**:
   - Categorical variables (`category_name_1` and `payment_method`) were one-hot encoded.
   - Justification: One-hot encoding represents these variables in a numeric format suitable for machine learning models without assuming any ordinal relationship between categories.
3. **Handling Numerical Features**:
   - Numerical features (`price`, `grand_total`, `discount_amount`, `total_purchases`, and `total_orders`) were scaled using `MinMaxScaler`.
   - Justification: Scaling maps every feature to the same [0, 1] range, preventing features with large magnitudes from dominating the model.
4. **Outlier Handling**:
   - Numerical columns were clipped at the 95th percentile to mitigate the effect of outliers.
   - Justification: Outliers can disproportionately influence certain machine learning models, such as Logistic Regression or KNN.
5. **Tracking Customer History**:
   - Two aggregated features were added: `total_purchases` (sum of `grand_total` for each customer) and `total_orders` (number of orders per customer).
   - Justification: These features describe customer behavior and engagement, which are critical for predicting cart abandonment.
6. **Balancing the Dataset**:
   - Undersampling was used to balance the dataset so that the abandoned (1) and not abandoned (0) classes are equally represented.
   - Justification: An imbalanced dataset can bias the model towards the majority class.
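The six steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's `helper.preprocess_dataset` implementation (which is not shown here): the function name `preprocess`, the `customer_id` column name, and the toy data are all hypothetical. The customer history features are computed before clipping and scaling because they are among the scaled columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

NUM_COLS = ["price", "grand_total", "discount_amount", "total_purchases", "total_orders"]
CAT_COLS = ["category_name_1", "payment_method"]


def preprocess(df: pd.DataFrame, customer_col: str = "customer_id", seed: int = 42) -> pd.DataFrame:
    """Hypothetical sketch of the preprocessing steps (not the project's helper)."""
    # Step 1: keep only canceled/complete orders and derive the binary target.
    df = df[df["status"].isin(["canceled", "complete"])].copy()
    df["abandoned"] = (df["status"] == "canceled").astype(int)

    # Step 5: per-customer history (total spend and number of orders).
    grouped = df.groupby(customer_col)["grand_total"]
    df["total_purchases"] = grouped.transform("sum")
    df["total_orders"] = grouped.transform("count")

    # Step 4: clip each numerical column at its own 95th percentile.
    df[NUM_COLS] = df[NUM_COLS].astype(float)
    df[NUM_COLS] = df[NUM_COLS].clip(upper=df[NUM_COLS].quantile(0.95), axis=1)

    # Step 3: rescale numerical columns to the [0, 1] range.
    df[NUM_COLS] = MinMaxScaler().fit_transform(df[NUM_COLS])

    # Step 2: one-hot encode the categorical columns.
    df = pd.get_dummies(df, columns=CAT_COLS)

    # Step 6: undersample so both classes have as many rows as the smaller one.
    n_min = df["abandoned"].value_counts().min()
    df = df.groupby("abandoned").sample(n=n_min, random_state=seed)

    return df.drop(columns=["status", customer_col]).reset_index(drop=True)


# Toy input (made-up values) to exercise the function.
raw = pd.DataFrame({
    "status": ["complete", "canceled", "complete", "received", "complete", "canceled"],
    "customer_id": [1, 1, 2, 2, 3, 3],
    "category_name_1": ["Mobiles", "Mobiles", "Fashion", "Fashion", "Books", "Mobiles"],
    "payment_method": ["cod", "cod", "card", "cod", "card", "cod"],
    "price": [100.0, 250.0, 40.0, 60.0, 15.0, 900.0],
    "grand_total": [100.0, 250.0, 80.0, 60.0, 15.0, 900.0],
    "discount_amount": [0.0, 10.0, 0.0, 5.0, 0.0, 50.0],
})
clean = preprocess(raw)
print(clean["abandoned"].value_counts().to_dict())
```

On the toy input, the `received` row is dropped, and the remaining three complete and two canceled orders are undersampled to two rows per class.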

In [1]:
import helper as hlp
kaggle_dataset = 'data/kaggle_dataset/Pakistan_Largest_Ecommerce_Dataset.csv'
preprocessed_dataset = 'data/preprocessed_dataset/preprocessed_dataset.csv'
hlp.preprocess_dataset(kaggle_dataset, preprocessed_dataset)



Preprocessed dataset saved to data/preprocessed_dataset/preprocessed_dataset.csv


## Split Dataset
To train and evaluate machine learning models effectively, the dataset is split into training and testing subsets. This ensures that the model is trained on one portion of the data and evaluated on unseen data to measure its performance. 

#### Steps:
1. **Train-Test Split**: 
   - The dataset is split into 80% training data and 20% testing data.
2. **Stratification**: 
   - Stratified splitting ensures that the class distribution (abandoned vs. not abandoned) is preserved in both the training and testing datasets.
3. **Random State**: 
   - Setting a `random_state` ensures reproducibility of results.

In [5]:
from sklearn.model_selection import train_test_split

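# Define features (X) and target variable (y).
# Note: `df` is assumed to have been loaded in an earlier cell that is not shown here,
# for example with pandas: df = pd.read_csv(preprocessed_dataset)
X = df.drop(columns=['abandoned'])  # Features
y = df['abandoned']                # Target variable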

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,          # 20% of the data for testing
    stratify=y,             # Preserve class distribution
    random_state=42         # Ensure reproducibility
)

# Display shapes and class distribution
print("Dataset Split Information:")
print(f"Training data shape: {X_train.shape} (Features), {y_train.shape} (Target)")
print(f"Testing data shape: {X_test.shape} (Features), {y_test.shape} (Target)")

# Display class distribution in training and testing sets
train_class_distribution = y_train.value_counts(normalize=True).to_dict()
test_class_distribution = y_test.value_counts(normalize=True).to_dict()

print("\nClass Distribution in Training Data:")
for class_label, proportion in train_class_distribution.items():
    print(f"  Class {class_label}: {proportion:.2%}")

print("\nClass Distribution in Testing Data:")
for class_label, proportion in test_class_distribution.items():
    print(f"  Class {class_label}: {proportion:.2%}")


Dataset Split Information:
Training data shape: (321998, 43) (Features), (321998,) (Target)
Testing data shape: (80500, 43) (Features), (80500,) (Target)

Class Distribution in Training Data:
  Class 1.0: 50.00%
  Class 0.0: 50.00%

Class Distribution in Testing Data:
  Class 0.0: 50.00%
  Class 1.0: 50.00%


## Machine Learning Algorithm Selection
#### Justification for Algorithm Choices

1. **Logistic Regression**:
   - Chosen for its simplicity, interpretability, and efficiency when the classes are roughly linearly separable.
   - It outputs class probabilities, so each prediction can be read as an estimated likelihood of cart abandonment.
2. **K-Nearest Neighbors (KNN)**:
   - A non-parametric algorithm that classifies each sample by the majority label among its nearest neighbors under a distance measure.
   - Effective for capturing local patterns and relationships in the data. Because it is distance-based, it relies on the feature scaling applied during preprocessing.
3. **Support Vector Machines (SVM)**:
   - Effective in high-dimensional feature spaces, and its soft margin tolerates some outliers, making it a good choice for scaled numerical features.
   - Finds a maximum-margin decision boundary between the two classes.
4. **Naive Bayes**:
   - Efficient to train and to predict with, because it assumes that features are conditionally independent given the class.
   - It pairs naturally with the one-hot encoded categorical features in our dataset.
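As a rough illustration of how the four algorithms could be trained and compared, the sketch below uses scikit-learn on a synthetic dataset rather than the project's data. The specific choices here are assumptions made for the example, not settings taken from this project: the RBF kernel for the SVM, the Gaussian variant of Naive Bayes, `n_neighbors=5`, and `max_iter=1000`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed cart data (binary target).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X = MinMaxScaler().fit_transform(X)  # mimic the [0, 1] scaling used in preprocessing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One estimator per algorithm; hyperparameters are illustrative assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Keeping the estimators in a dictionary makes it straightforward to apply the same fit and evaluate loop to every model. Note that kernel SVMs scale poorly with the number of samples, so on a training set of several hundred thousand rows a linear SVM or a subsample may be more practical.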

### 1. Logistic Regression: A baseline algorithm for binary classification.

### 2.K-Nearest Neighbors (KNN): A distance-based classification model.

### 3. Support Vector Machines (SVM): Clear decision boundary for classification tasks.

### 4. Naive Bayes: Complements one-hot encoding.

## Evaluation Metrics

In the definitions below, the positive class is `abandoned = 1`.

- **Accuracy**: Fraction of all predictions that are correct.
- **Precision**: Ratio of true positives to all predicted positives.
- **Recall**: Ratio of true positives to all actual positives.
- **F1-Score**: Harmonic mean of precision and recall.
- **Confusion Matrix**: Counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).

#### Accuracy
$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$

#### Precision
$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

#### Recall
$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

#### F1-Score
$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

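#### Confusion Matrix
Rows correspond to the predicted class and columns to the actual class, with the positive class listed first:

$$
\begin{bmatrix}
\text{TP} & \text{FP} \\
\text{FN} & \text{TN}
\end{bmatrix}
$$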
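To make the formulas concrete, here is a small worked example on made-up labels (the `y_true` and `y_pred` values are illustrative, not project results). It computes each metric by hand from the four counts and checks the values against scikit-learn. Note that `sklearn.metrics.confusion_matrix` returns the counts in the order `[[TN, FP], [FN, TP]]`, so the sketch rearranges them into the layout shown above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = abandoned, 0 = not abandoned.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] (rows = actual class).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Rearrange into the layout used in this report: [[TP, FP], [FN, TN]].
matrix = np.array([[tp, fp], [fn, tn]])

# Metrics computed directly from the formulas.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn's helpers.
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))

print(matrix)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For these labels, TP = 3, FP = 1, FN = 1, and TN = 3, so accuracy, precision, recall, and F1-score all come out to 0.75.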


## Results
- Comparison of model performance on test data.
- Discussion of strengths and weaknesses of each algorithm.

## Conclusion
Summarize findings, highlight key insights, and suggest potential improvements for future work.

## Appendices
- Sample code snippets.

## References
- Dataset: https://www.kaggle.com/datasets/zusmani/pakistans-largest-ecommerce-dataset/data