# To start the project on churn prediction for the E-Commerce dataset, we'll follow these steps:

1. **Exploratory Data Analysis (EDA)**: Analyze the dataset to understand the features, their characteristics, and the data distribution.
2. **Data Cleaning and Preprocessing**: Clean the data and prepare it for modeling. This includes handling missing values, encoding categorical variables, etc.
3. **Feature Selection/Engineering**: Select or construct features that are relevant to the prediction task.
4. **Model Selection**: Choose a suitable machine learning model. For starters, models like Logistic Regression, Random Forest, or Gradient Boosting could be good choices.
5. **Model Training**: Train the model on the prepared dataset.
6. **Evaluation**: Evaluate the model performance using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC AUC.
7. **Tuning and Optimization**: Fine-tune the model parameters to improve performance.
8. **Deployment Strategy**: Outline a deployment strategy for making the model production-ready, focusing on scalability, maintainability, and monitoring.

First, let's load and explore the dataset to understand its structure and the type of data we are working with.

In [1]:
import pandas as pd

# Load the E-Commerce Dataset
ecommerce_data = pd.read_excel('/home/kutty/Desktop/github project/churnpediction/data/raw/E Commerce Dataset.xlsx')

# Display the dataset; the notebook renders the first and last few rows
ecommerce_data

Unnamed: 0,CustomerID,Churn,Tenure,PreferredLoginDevice,CityTier,WarehouseToHome,PreferredPaymentMode,Gender,HourSpendOnApp,NumberOfDeviceRegistered,PreferedOrderCat,SatisfactionScore,MaritalStatus,NumberOfAddress,Complain,OrderAmountHikeFromlastYear,CouponUsed,OrderCount,DaySinceLastOrder,CashbackAmount
0,50001,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3,Laptop & Accessory,2,Single,9,1,11.0,1.0,1.0,5.0,159.93
1,50002,1,,Phone,1,8.0,UPI,Male,3.0,4,Mobile,3,Single,7,1,15.0,0.0,1.0,0.0,120.90
2,50003,1,,Phone,1,30.0,Debit Card,Male,2.0,4,Mobile,3,Single,6,1,14.0,0.0,1.0,3.0,120.28
3,50004,1,0.0,Phone,3,15.0,Debit Card,Male,2.0,4,Laptop & Accessory,5,Single,8,0,23.0,0.0,1.0,3.0,134.07
4,50005,1,0.0,Phone,1,12.0,CC,Male,,3,Mobile,5,Single,3,0,11.0,1.0,1.0,3.0,129.60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5625,55626,0,10.0,Computer,1,30.0,Credit Card,Male,3.0,2,Laptop & Accessory,1,Married,6,0,18.0,1.0,2.0,4.0,150.71
5626,55627,0,13.0,Mobile Phone,1,13.0,Credit Card,Male,3.0,5,Fashion,5,Married,6,0,16.0,1.0,2.0,,224.91
5627,55628,0,1.0,Mobile Phone,1,11.0,Debit Card,Male,3.0,2,Laptop & Accessory,4,Married,3,1,21.0,1.0,2.0,4.0,186.42
5628,55629,0,23.0,Computer,3,9.0,Credit Card,Male,4.0,5,Laptop & Accessory,4,Married,4,0,15.0,2.0,2.0,9.0,178.90


The dataset comprises several features related to customer behavior and attributes in an E-Commerce setting. Some of the key columns observed are:

- **CustomerID:** Unique identifier for each customer.
- **Churn:** Indicates whether the customer has churned (1) or not (0).
- **Tenure:** The duration (in some time unit) for which the customer has been with the E-Commerce platform.
- **PreferredLoginDevice:** The device preferred by the customer for logging in.
- **CityTier:** The tier of the city the customer resides in.
- **Other features:** Metrics such as `WarehouseToHome`, `HourSpendOnApp`, `NumberOfDeviceRegistered`, `SatisfactionScore`, and `NumberOfAddress`, among others.

Given that our target variable is `Churn`, this project will focus on predicting customer churn based on these features.

Next, we'll proceed to the Exploratory Data Analysis (EDA) to gain insights into the dataset, including checking for missing values and understanding the distribution of key variables.

In [2]:
# Summary statistics for numerical features
descriptive_stats = ecommerce_data.describe()

# Count of unique values for categorical features
categorical_unique_counts = ecommerce_data.select_dtypes(include=['object', 'category']).nunique()

# Checking for missing values
missing_values = ecommerce_data.isnull().sum()

(descriptive_stats, categorical_unique_counts, missing_values)

(         CustomerID        Churn       Tenure     CityTier  WarehouseToHome  \
 count   5630.000000  5630.000000  5366.000000  5630.000000      5379.000000   
 mean   52815.500000     0.168384    10.189899     1.654707        15.639896   
 std     1625.385339     0.374240     8.557241     0.915389         8.531475   
 min    50001.000000     0.000000     0.000000     1.000000         5.000000   
 25%    51408.250000     0.000000     2.000000     1.000000         9.000000   
 50%    52815.500000     0.000000     9.000000     1.000000        14.000000   
 75%    54222.750000     0.000000    16.000000     3.000000        20.000000   
 max    55630.000000     1.000000    61.000000     3.000000       127.000000   
 
        HourSpendOnApp  NumberOfDeviceRegistered  SatisfactionScore  \
 count     5375.000000               5630.000000        5630.000000   
 mean         2.931535                  3.688988           3.066785   
 std          0.721926                  1.023999           1.3801

### Exploratory Data Analysis (EDA) Summary:

**Summary Statistics:**
- The dataset contains 5630 entries.
- `CustomerID` acts as a unique identifier.
- `Churn` is our target variable, with approx. 17% of the customers having churned.
- `Tenure` ranges from 0 to 61, with an average tenure of about 10 units.
- Other features such as `CityTier`, `WarehouseToHome`, `HourSpendOnApp`, etc., display varying ranges and distributions.

**Unique Counts for Categorical Features:**
- Categorical variables like `PreferredLoginDevice`, `PreferredPaymentMode`, `Gender`, `PreferedOrderCat`, and `MaritalStatus` have a limited number of unique values, indicating they are mostly nominal categories.

**Missing Values:**
- Several features have missing values, notably `Tenure`, `WarehouseToHome`, `HourSpendOnApp`, `OrderAmountHikeFromlastYear`, `CouponUsed`, `OrderCount`, and `DaySinceLastOrder`.
- The missing values need to be addressed before modeling, as they can impact the performance of machine learning algorithms.

Let's start with handling missing values. Depending on the nature of each feature, we will choose a suitable method, such as imputation or dropping rows/columns with missing values.

In [3]:
from sklearn.impute import SimpleImputer

# Numerical columns with missing values (identified in the EDA step)
cols_with_missing = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount', 'DaySinceLastOrder']

# Numerical imputer: replace each missing value with the column mean
num_imputer = SimpleImputer(strategy='mean')
ecommerce_data[cols_with_missing] = num_imputer.fit_transform(ecommerce_data[cols_with_missing])

# Check if any missing values remain
ecommerce_data.isnull().sum()

CustomerID                     0
Churn                          0
Tenure                         0
PreferredLoginDevice           0
CityTier                       0
WarehouseToHome                0
PreferredPaymentMode           0
Gender                         0
HourSpendOnApp                 0
NumberOfDeviceRegistered       0
PreferedOrderCat               0
SatisfactionScore              0
MaritalStatus                  0
NumberOfAddress                0
Complain                       0
OrderAmountHikeFromlastYear    0
CouponUsed                     0
OrderCount                     0
DaySinceLastOrder              0
CashbackAmount                 0
dtype: int64

All missing values have been successfully imputed using the mean strategy for numerical columns. There are no missing values remaining in the dataset.
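
One caveat on the step above: the imputer was fitted on the full dataset, so rows that later land in the test set contribute to the column means. A minimal sketch of a leak-free variant follows. It fits the imputer on the training rows only and reuses those means for the test rows. The toy frame and the names `toy`, `train`, and `test` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for two numeric columns with gaps (values are made up)
toy = pd.DataFrame({
    'Tenure': [4.0, np.nan, 0.0, 10.0, 13.0, np.nan, 23.0, 1.0],
    'WarehouseToHome': [6.0, 8.0, np.nan, 15.0, 12.0, 30.0, 9.0, 11.0],
})

# Split first, so the test rows never influence the imputation statistics
train, test = train_test_split(toy, test_size=0.25, random_state=42)

imputer = SimpleImputer(strategy='mean')

# Learn the column means from the training rows only
train_imputed = pd.DataFrame(
    imputer.fit_transform(train), columns=toy.columns, index=train.index
)

# Apply the training means to the test rows without refitting
test_imputed = pd.DataFrame(
    imputer.transform(test), columns=toy.columns, index=test.index
)

print('Means learned from the training rows:', imputer.statistics_)
```

With a dataset of this size the difference from full-data imputation is likely small, but splitting first keeps the test-set metrics an honest estimate.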

Next, we will proceed with encoding categorical variables. Considering that machine learning models require numerical input, we will transform categorical variables into a format that can be provided to the models. This typically involves one-hot encoding for nominal variables and ordinal encoding for ordinal variables if any.

Let's encode the categorical variables in the dataset.

In [4]:
# One-hot encoding for categorical variables
categorical_features = ecommerce_data.select_dtypes(include=['object', 'category']).columns
ecommerce_data_encoded = pd.get_dummies(ecommerce_data, columns=categorical_features, drop_first=True)

# Display the encoded dataset structure
ecommerce_data_encoded

Unnamed: 0,CustomerID,Churn,Tenure,CityTier,WarehouseToHome,HourSpendOnApp,NumberOfDeviceRegistered,SatisfactionScore,NumberOfAddress,Complain,...,PreferredPaymentMode_E wallet,PreferredPaymentMode_UPI,Gender_Male,PreferedOrderCat_Grocery,PreferedOrderCat_Laptop & Accessory,PreferedOrderCat_Mobile,PreferedOrderCat_Mobile Phone,PreferedOrderCat_Others,MaritalStatus_Married,MaritalStatus_Single
0,50001,1,4.000000,3,6.0,3.000000,3,2,9,1,...,False,False,False,False,True,False,False,False,False,True
1,50002,1,10.189899,1,8.0,3.000000,4,3,7,1,...,False,True,True,False,False,True,False,False,False,True
2,50003,1,10.189899,1,30.0,2.000000,4,3,6,1,...,False,False,True,False,False,True,False,False,False,True
3,50004,1,0.000000,3,15.0,2.000000,4,5,8,0,...,False,False,True,False,True,False,False,False,False,True
4,50005,1,0.000000,1,12.0,2.931535,3,5,3,0,...,False,False,True,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5625,55626,0,10.000000,1,30.0,3.000000,2,1,6,0,...,False,False,True,False,True,False,False,False,True,False
5626,55627,0,13.000000,1,13.0,3.000000,5,5,6,0,...,False,False,True,False,False,False,False,False,True,False
5627,55628,0,1.000000,1,11.0,3.000000,2,4,3,1,...,False,False,True,False,True,False,False,False,True,False
5628,55629,0,23.000000,3,9.0,4.000000,5,4,4,0,...,False,False,True,False,True,False,False,False,True,False


The categorical variables in the dataset have been successfully encoded using one-hot encoding, which resulted in the expansion of the dataset to 31 columns due to the addition of dummy variables for these categorical features. 
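
The encoded column names include both `PreferedOrderCat_Mobile` and `PreferedOrderCat_Mobile Phone`, and the raw preview shows `Phone` next to `Mobile Phone` and `CC` next to `Credit Card`. These pairs may be different spellings of the same category. If so, merging them before encoding avoids splitting one signal across two dummy columns. The sketch below uses a toy frame, and the synonym map is an assumption that should be confirmed against the data dictionary before being applied.

```python
import pandas as pd

# Toy rows using labels that appear in the dataset preview (rows are illustrative)
toy = pd.DataFrame({
    'PreferredLoginDevice': ['Mobile Phone', 'Phone', 'Computer', 'Phone'],
    'PreferredPaymentMode': ['Debit Card', 'CC', 'Credit Card', 'UPI'],
    'PreferedOrderCat': ['Laptop & Accessory', 'Mobile', 'Mobile Phone', 'Fashion'],
})

# Assumed synonym map: treating these pairs as one category is a guess
synonyms = {
    'PreferredLoginDevice': {'Phone': 'Mobile Phone'},
    'PreferredPaymentMode': {'CC': 'Credit Card'},
    'PreferedOrderCat': {'Mobile': 'Mobile Phone'},
}

# A nested dict tells DataFrame.replace which values to swap in which column
cleaned = toy.replace(synonyms)

# One-hot encode after cleaning; drop_first removes one baseline level per feature
encoded = pd.get_dummies(cleaned, drop_first=True)

print(sorted(encoded.columns))
```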

With the dataset now preprocessed (missing values handled and categorical variables encoded), we can proceed with feature selection or engineering if required, and then model selection for predicting customer churn.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Dropping the 'CustomerID' column as it's not a useful feature for prediction
ecommerce_data_encoded.drop(['CustomerID'], axis=1, inplace=True)

# Splitting the dataset into features (X) and target (y)
X = ecommerce_data_encoded.drop('Churn', axis=1)
y = ecommerce_data_encoded['Churn']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initializing the Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Training the model
rf.fit(X_train, y_train)

# Predictions
predictions = rf.predict(X_test)

# Evaluation (results use new names so the imported metric functions are not overwritten)
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)

print("Accuracy Score:", accuracy)
print("Classification Report:\n", report)

Accuracy Score: 0.9544108940201302
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.97      1414
           1       0.92      0.79      0.85       275

    accuracy                           0.95      1689
   macro avg       0.94      0.89      0.91      1689
weighted avg       0.95      0.95      0.95      1689



The Random Forest Classifier was trained on the preprocessed E-Commerce dataset and achieved an accuracy of approximately 95.4% on the test set. Because only 275 of the 1,689 test customers churned, a model that always predicts "not churned" would already score about 84%, so the churn-class metrics are more informative than accuracy alone. Precision for the churned class is 0.92, meaning about 92% of the customers the model flags as churners actually churned. Recall for the churned class is 0.79, meaning the model catches about 79% of actual churners and misses roughly one in five.

This high level of performance suggests that the model could be a valuable tool for predicting customer churn in an E-Commerce setting, potentially enabling targeted interventions to retain customers at risk of churn. 
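
The AUC metric named in the evaluation step is not computed in the cells above. A minimal sketch of how it could be obtained follows, along with a stratified split that keeps the churn rate equal in both partitions. It runs on synthetic data from `make_classification` with a similar class balance, so the printed numbers say nothing about the real dataset. All variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 17% positives, similar to the churn rate
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.83, 0.17], random_state=42
)

# stratify=y_demo keeps the positive rate the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)

# ROC AUC needs scores, not hard labels: use the positive-class probability
churn_proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, churn_proba)

# Lowering the decision threshold trades precision for recall
recall_05 = recall_score(y_te, churn_proba >= 0.5)
recall_03 = recall_score(y_te, churn_proba >= 0.3)

print('ROC AUC:', auc)
print('Recall at threshold 0.5:', recall_05)
print('Recall at threshold 0.3:', recall_03)
```

If missed churners cost more than unnecessary retention offers, a threshold below 0.5 may be worth considering; the right value depends on those costs.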

In [6]:
ecommerce_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   CustomerID                   5630 non-null   int64  
 1   Churn                        5630 non-null   int64  
 2   Tenure                       5630 non-null   float64
 3   PreferredLoginDevice         5630 non-null   object 
 4   CityTier                     5630 non-null   int64  
 5   WarehouseToHome              5630 non-null   float64
 6   PreferredPaymentMode         5630 non-null   object 
 7   Gender                       5630 non-null   object 
 8   HourSpendOnApp               5630 non-null   float64
 9   NumberOfDeviceRegistered     5630 non-null   int64  
 10  PreferedOrderCat             5630 non-null   object 
 11  SatisfactionScore            5630 non-null   int64  
 12  MaritalStatus                5630 non-null   object 
 13  NumberOfAddress   

In [7]:
print('Training set:', X_train.shape, y_train.shape, '\nTesting set:', X_test.shape, y_test.shape)

Training set: (3941, 29) (3941,) 
Testing set: (1689, 29) (1689,)


In [8]:
import mlflow
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Select (or create) the MLflow experiment that subsequent runs are logged under
mlflow.set_experiment('E-Commerce Churn Prediction')

<Experiment: artifact_location='file:///home/kutty/Desktop/github%20project/churnpediction/mlruns/184263391567962105', creation_time=1710970175528, experiment_id='184263391567962105', last_update_time=1710970175528, lifecycle_stage='active', name='E-Commerce Churn Prediction', tags={}>

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

with mlflow.start_run(run_name='RandomForest Base Model'):
    # Defining the model
    clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    
    # Training the model
    clf.fit(X_train, y_train)
    
    # Predicting on test set
    y_pred = clf.predict(X_test)
    
    # Evaluating the model
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Logging parameters, metrics, and model
    mlflow.log_params({'n_estimators': 100, 'max_depth': 5})
    mlflow.log_metrics({'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1_score': f1})
    
    mlflow.sklearn.log_model(clf, 'model')
    
    print('Model Evaluation Metrics:\nAccuracy:', accuracy, '\nPrecision:', precision, '\nRecall:', recall, '\nF1 Score:', f1)
    

Model Evaluation Metrics:
Accuracy: 0.8685612788632326 
Precision: 0.8732394366197183 
Recall: 0.22545454545454546 
F1 Score: 0.35838150289017345
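
To see what these numbers mean in customer counts, the confusion matrix can be reconstructed from the metrics and the class supports reported earlier (275 churned and 1,414 retained customers in the test set). The counts below are inferred from those figures rather than printed by the notebook. They show that the depth-limited model flags only 71 customers and misses 213 of the 275 churners.

```python
# Counts inferred from the reported metrics and the 275 / 1414 class supports
tp, fn = 62, 213    # churners caught / churners missed (62 + 213 = 275)
fp, tn = 9, 1405    # non-churners wrongly flagged / correctly passed (9 + 1405 = 1414)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of flagged customers who really churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
```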


In [10]:
mlflow.search_runs()

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.f1_score,metrics.recall,metrics.accuracy,metrics.precision,params.n_estimators,params.max_depth,tags.mlflow.runName,tags.mlflow.user,tags.mlflow.source.type,tags.mlflow.source.name,tags.mlflow.log-model.history
0,6354ea3bc2da4e0fa86e994cf73f2494,184263391567962105,FINISHED,file:///home/kutty/Desktop/github%20project/ch...,2024-03-21 09:15:41.101000+00:00,2024-03-21 09:15:58.389000+00:00,0.358382,0.225455,0.868561,0.873239,100,5,RandomForest Base Model,kutty,LOCAL,/home/kutty/miniconda3/envs/myenv/lib/python3....,"[{""run_id"": ""6354ea3bc2da4e0fa86e994cf73f2494""..."
1,d71fc7a20f524597aa7a1f22cd879ffe,184263391567962105,FINISHED,file:///home/kutty/Desktop/github%20project/ch...,2024-03-20 21:49:13.132000+00:00,2024-03-20 21:49:20.939000+00:00,0.358382,0.225455,0.868561,0.873239,100,5,RandomForest Base Model,kutty,LOCAL,/home/kutty/miniconda3/envs/myenv/lib/python3....,"[{""run_id"": ""d71fc7a20f524597aa7a1f22cd879ffe""..."
2,e2f82e8fe8a44f7abbe3e3c5827025c2,184263391567962105,FINISHED,file:///home/kutty/Desktop/github%20project/ch...,2024-03-20 21:30:06.687000+00:00,2024-03-20 21:30:16.527000+00:00,0.358382,0.225455,0.868561,0.873239,100,5,RandomForest Base Model,kutty,LOCAL,/home/kutty/miniconda3/envs/myenv/lib/python3....,"[{""run_id"": ""e2f82e8fe8a44f7abbe3e3c5827025c2""..."


In [2]:
import requests

# URL of the Flask app's endpoint expected to handle POST requests for predictions
url = 'http://127.0.0.1:5001/predict'

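# Data to be sent to the Flask app for making a prediction
# These values are just placeholders and should be replaced with real input data for prediction.
# For real inputs, each one-hot column (e.g. 'MaritalStatus_Single') must be 0 or 1, with at most
# one column set to 1 per original categorical feature; several placeholders below do not follow this.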
data = {
    'Tenure': [0],
    'CityTier': [3],
    'WarehouseToHome': [15],
    'HourSpendOnApp': [0],
    'NumberOfDeviceRegistered': [3],
    'SatisfactionScore': [2],
    'NumberOfAddress': [9],
    'Complain': [1],
    'OrderAmountHikeFromlastYear': [11],
    'CouponUsed': [1],
    'OrderCount': [1],
    'DaySinceLastOrder': [5],
    'CashbackAmount': [160],
    'PreferredLoginDevice_Mobile Phone': [1],
    'PreferredLoginDevice_Phone': [0],
    'PreferredPaymentMode_COD': [0],
    'PreferredPaymentMode_Cash on Delivery': [0],
    'PreferredPaymentMode_Credit Card': [0],
    'PreferredPaymentMode_Debit Card': [1],
    'PreferredPaymentMode_E wallet': [0],
    'PreferredPaymentMode_UPI': [1],
    'Gender_Male': [0],
    'PreferedOrderCat_Grocery': [0],
    'PreferedOrderCat_Laptop & Accessory': [1],
    'PreferedOrderCat_Mobile': [0],
    'PreferedOrderCat_Mobile Phone': [20],
    'PreferedOrderCat_Others': [60],
    'MaritalStatus_Married': [10],
    'MaritalStatus_Single': [10]
}

# Send the POST request with form data
response = requests.post(url, data=data)

# Print the prediction result received from the Flask app
print(f'Response from server: {response.text}')

Response from server: Prediction: 1
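
Filling the one-hot columns by hand is error-prone. A small helper can build the request payload from a customer's raw values instead. The sketch below is an illustration only: `build_payload` is a hypothetical helper, the shortened `feature_columns` list stands in for the 29 training columns, and the example values are taken from the first row of the dataset preview.

```python
def build_payload(numeric, categorical, feature_columns):
    """Return a {column: [value]} dict with 0/1 indicators for one-hot columns."""
    # Names of the dummy columns that should be switched on for this customer
    active = {f'{name}_{value}' for name, value in categorical.items()}

    payload = {}
    for col in feature_columns:
        if col in numeric:
            payload[col] = [numeric[col]]
        else:
            # Any non-numeric column is treated as a one-hot indicator
            payload[col] = [1 if col in active else 0]
    return payload


# Shortened, illustrative column list; the real model expects all 29 columns
feature_columns = [
    'Tenure', 'CityTier',
    'PreferredPaymentMode_Debit Card', 'PreferredPaymentMode_UPI',
    'MaritalStatus_Married', 'MaritalStatus_Single',
]

payload = build_payload(
    numeric={'Tenure': 4, 'CityTier': 3},
    categorical={'PreferredPaymentMode': 'Debit Card', 'MaritalStatus': 'Single'},
    feature_columns=feature_columns,
)
print(payload)
```

A category that was dropped as the baseline during encoding produces all zeros for its feature, which matches what `drop_first=True` produced at training time.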
