
## E-Commerce Dataset

Customer churn, or attrition, is one of the most critical problems for any business.

It is important to track how many customers leave the platform, how many stay, and why. Understanding customer behaviour improves decision-making and helps reduce churn, which in turn improves profitability.

## Dataset Overview

The dataset contains 60,000 rows and 51 columns, covering:

1.   User information
2.   Product details
3.   User-product interaction


Goal:

Build a model for **CHURN PREDICTION** that identifies users likely to stop purchasing.

DATA PREPARATION

In [20]:
#Importing required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os

In [21]:
# Loading the dataset

e_churn = pd.read_csv('/content/ecommerce_recommendation_dataset.csv')

In [22]:
# Reviewing the structure of the dataset
e_churn.head()

|  | user_id | product_id | category | price | rating | review_count | user_age | user_gender | user_location | purchase_history | ... | product_rating_variance | review_sentiment_score | user_engagement_score | ad_click_rate | time_of_day | day_of_week | season | payment_method | coupon_used | product_popularity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 78517 | 1645 | Books | 842.23 | 2 | 155 | 24 | Other | Urban | False | ... | 0.13 | -0.28 | 0.68 | 0.04 | Night | Thursday | Summer | Debit Card | False | 0.54 |
| 1 | 52887 | 100 | Books | 253.76 | 3 | 331 | 43 | Other | Suburban | False | ... | 0.02 | 0.28 | 0.11 | 0.89 | Morning | Saturday | Summer | Debit Card | False | 0.77 |
| 2 | 59395 | 585 | Books | 483.65 | 2 | 236 | 64 | Female | Rural | True | ... | 1.55 | 0.23 | 0.35 | 0.99 | Evening | Tuesday | Fall | Debit Card | False | 0.14 |
| 3 | 54739 | 3774 | Groceries | 459.37 | 2 | 227 | 34 | Female | Urban | False | ... | 1.41 | 0.93 | 0.73 | 0.16 | Afternoon | Tuesday | Spring | Credit Card | False | 0.18 |
| 4 | 42723 | 2119 | Groceries | 150.11 | 2 | 214 | 51 | Female | Urban | True | ... | 1.29 | 0.11 | 0.26 | 0.17 | Night | Wednesday | Spring | PayPal | False | 0.66 |


In [23]:
# Printing the number of Rows and Columns
print("The number of Rows : {} ".format(e_churn.shape[0]))
print("The number of Columns : {} ".format(e_churn.shape[1]))

The number of Rows : 60000 
The number of Columns : 51 


In [24]:
# Displaying the Datatypes in Dataset
e_churn.dtypes

user_id               int64
product_id            int64
category             object
price               float64
rating                int64
review_count          int64
user_age              int64
user_gender          object
user_location        object
purchase_history       bool


In [25]:
# Review the Structure of the dataset
e_churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 51 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   user_id                  60000 non-null  int64  
 1   product_id               60000 non-null  int64  
 2   category                 60000 non-null  object 
 3   price                    60000 non-null  float64
 4   rating                   60000 non-null  int64  
 5   review_count             60000 non-null  int64  
 6   user_age                 60000 non-null  int64  
 7   user_gender              60000 non-null  object 
 8   user_location            60000 non-null  object 
 9   purchase_history         60000 non-null  bool   
 10  time_on_page             60000 non-null  float64
 11  add_to_cart_count        60000 non-null  int64  
 12  search_keywords          60000 non-null  object 
 13  discount_applied         60000 non-null  bool   
 14  user_membership       

In [26]:
# Counting numerical and categorical columns (every non-object dtype, including bool, counts as numerical)

num,obj = 0,0

for cols in e_churn.columns:
  if e_churn.dtypes[cols] != 'O':
    num += 1
  else:
    obj += 1
print(" The number of numerical value is {}".format(num))
print(" The number of categorical value is {}".format(obj))

 The number of numerical value is 31
 The number of categorical value is 20
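
The loop above treats every non-object column as numerical, so boolean columns such as `purchase_history` land in the numerical count. As a rough sketch of an alternative, `select_dtypes` can report the three families separately. The frame `toy` below is a hypothetical stand-in for `e_churn` with made-up rows; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for e_churn (values are illustrative only).
toy = pd.DataFrame({
    "price": [842.23, 253.76, 483.65],
    "rating": [2, 3, 2],
    "category": ["Books", "Books", "Groceries"],
    "user_gender": ["Other", "Other", "Female"],
    "purchase_history": [False, False, True],
})

# "number" covers integer and float dtypes but not bool.
n_numeric = toy.select_dtypes(include="number").shape[1]
n_boolean = toy.select_dtypes(include="bool").shape[1]
# Everything that is neither numeric nor boolean is treated as categorical here.
n_categorical = toy.select_dtypes(exclude=["number", "bool"]).shape[1]

print(f"numeric: {n_numeric}, boolean: {n_boolean}, categorical: {n_categorical}")
```

Keeping booleans in their own bucket makes it easier to decide later whether they should be scaled, encoded, or passed through unchanged.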


CHECKING FOR MISSING DATA AND DUPLICATES IN THE DATASET

In [27]:
print(f"The missing data in the dataset is : \n {e_churn.isnull().sum()}")
print("=" *50)
print(f"The duplicate data in the dataset is : \n {e_churn.duplicated().sum()}")


The missing data in the dataset is : 
 user_id                    0
product_id                 0
category                   0
price                      0
rating                     0
review_count               0
user_age                   0
user_gender                0
user_location              0
purchase_history           0
time_on_page               0
add_to_cart_count          0
search_keywords            0
discount_applied           0
user_membership            0
user_browser               0
user_device                0
purchase_time              0
session_duration           0
clicks_on_ads              0
page_views                 0
referral_source            0
wishlist_additions         0
cart_abandonment_rate      0
average_spent              0
user_income                0
user_education             0
user_marital_status        0
product_availability       0
stock_status               0
product_return_rate        0
product_color              0
product_size               0
is_t
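
A per-column listing gets long when every count is zero. One possible shorter check, sketched here on a small hypothetical frame (`toy` and its values are made up for illustration), reports only the columns that actually contain missing values, together with the number of duplicated rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing price, one missing category,
# and one fully duplicated row (the last two rows are identical).
toy = pd.DataFrame({
    "price": [842.23, np.nan, 483.65, 483.65],
    "rating": [2, 3, 2, 2],
    "category": ["Books", None, "Groceries", "Groceries"],
})

# Keep only the columns that have at least one missing value.
missing = toy.isnull().sum()
missing = missing[missing > 0]

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(toy.duplicated().sum())

print("Columns with missing values:")
print(missing)
print(f"Duplicated rows: {n_duplicates}")
```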

ANALYSING THE TARGET VARIABLE

In [28]:
# From the dataset given : ['purchase_history']
print(f" The Values in ['purchase_history'] \n" ,e_churn['purchase_history'].unique())
print("*" * 25)
print(f" The Count of ['purchase_history'] \n " ,e_churn['purchase_history'].value_counts())
print("*" * 25)
print(f" Fractions(Proportions) of the class distribution of ['purchase_history'] \n", e_churn['purchase_history'].value_counts(normalize = True))

 The Values in ['purchase_history'] 
 [False  True]
*************************
 The Count of ['purchase_history'] 
  purchase_history
True     30088
False    29912
Name: count, dtype: int64
*************************
 Fractions(Proportions) of the class distribution of ['purchase_history'] 
 purchase_history
True     0.501467
False    0.498533
Name: proportion, dtype: float64


A user is considered churned if they stop engaging or purchasing for a certain period (no purchase in the last 90 days).
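
The notebook uses `purchase_history` directly as the target. Purely as an illustration of how a 90-day rule could be turned into a label, the sketch below assumes a column holding each user's most recent purchase timestamp and a fixed snapshot date. Both assumptions are hypothetical: the format of `purchase_time` in the real dataset is not shown here, and `reference_date`, `CHURN_WINDOW_DAYS`, and the toy rows are invented for the example.

```python
import pandas as pd

CHURN_WINDOW_DAYS = 90  # window length taken from the definition above

# Hypothetical data: one row per user with the date of their latest purchase.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "purchase_time": ["2024-01-05", "2024-05-20", "2024-02-15"],
})
toy["purchase_time"] = pd.to_datetime(toy["purchase_time"])

# Hypothetical snapshot date against which recency is measured.
reference_date = pd.Timestamp("2024-06-01")

# Days elapsed since the last purchase; more than the window means churned.
days_since_purchase = (reference_date - toy["purchase_time"]).dt.days
toy["churned"] = days_since_purchase > CHURN_WINDOW_DAYS

print(toy)
```

With these made-up dates, users 1 and 3 exceed the 90-day window and user 2 does not.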

ANALYSING THE FEATURES FOR CHURN PREDICTION



> We need Behavioral + Engagement + Demographic Features







User behavior: [time_on_page, page_views, session_duration, wishlist_additions, add_to_cart_count, cart_abandonment_rate, clicks_on_ads, ad_click_rate, user_engagement_score]

Purchase-related: [average_spent, discount_applied, discount_percentage, coupon_used, delivery_time, shipping_fee, product_return_rate]

User profile: [user_age, user_gender, user_location, user_income, user_membership]

Temporal: [time_of_day, day_of_week, season]
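
The four groups above can be kept in one place and used to select columns. The sketch below is one way to do that, not the notebook's own method: the dictionary `feature_groups` and the frame `toy` are hypothetical names, and `toy` contains only a handful of the listed columns so the "not found" branch is exercised.

```python
import pandas as pd

# Feature groups as listed above.
feature_groups = {
    "behavior": ["time_on_page", "page_views", "session_duration",
                 "wishlist_additions", "add_to_cart_count",
                 "cart_abandonment_rate", "clicks_on_ads", "ad_click_rate",
                 "user_engagement_score"],
    "purchase": ["average_spent", "discount_applied", "discount_percentage",
                 "coupon_used", "delivery_time", "shipping_fee",
                 "product_return_rate"],
    "profile": ["user_age", "user_gender", "user_location", "user_income",
                "user_membership"],
    "temporal": ["time_of_day", "day_of_week", "season"],
}

# Hypothetical toy frame holding only a few of those columns.
toy = pd.DataFrame({
    "user_id": [78517, 52887],
    "time_on_page": [1.2, 3.4],
    "average_spent": [100.0, 250.0],
    "user_age": [24, 43],
    "season": ["Summer", "Summer"],
})

all_features = [col for cols in feature_groups.values() for col in cols]
selected = [col for col in all_features if col in toy.columns]
not_found = [col for col in all_features if col not in toy.columns]

X_toy = toy[selected]
print(f"Selected {len(selected)} features; {len(not_found)} not present in the frame")
```

Listing the columns that were not found guards against silent typos in feature names.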

## BUILDING THE MODEL

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Target variable
y = e_churn["purchase_history"]

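# Features (drop identifiers, purchase_time, and the target)
X = e_churn.drop(columns=["user_id", "product_id", "purchase_time", "purchase_history"])

# Separate numerical and categorical columns.
# Note: boolean columns (e.g. discount_applied) match neither selector, so the
# ColumnTransformer below drops them under its default remainder="drop".
num_features = X.select_dtypes(include=["int64", "float64"]).columns
cat_features = X.select_dtypes(include=["object", "category"]).columns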

# Preprocessing
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
])

# Pipeline with Logistic Regression
churn_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit
churn_pipeline.fit(X_train, y_train)

# Evaluate
y_pred = churn_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.49      0.47      0.48      5982
        True       0.50      0.52      0.51      6018

    accuracy                           0.49     12000
   macro avg       0.49      0.49      0.49     12000
weighted avg       0.49      0.49      0.49     12000
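
The report shows an accuracy of 0.49 on a target split roughly 50.1% / 49.9%, so it helps to read that number against a baseline that always predicts the majority class. The sketch below shows one way to make that comparison with scikit-learn's `DummyClassifier`. It runs on synthetic data (random features and random labels, all hypothetical) rather than on `e_churn`, so its scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: features and labels are independent random draws.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(2000, 5))
y_toy = rng.integers(0, 2, size=2000).astype(bool)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))
# ROC AUC uses the predicted probability of the positive (True) class.
model_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")
print(f"Model ROC AUC:     {model_auc:.3f}")
```

A model whose accuracy and ROC AUC sit near the baseline and 0.5 respectively is not separating the classes, which is worth checking before tuning further.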

