<h1 style='text-align:center;font-weight:bold;color:orange'>Customer Churn Prediction</h1>

## **1 Introduction**
### **1.1 Context**

**Quoting from the original source:**

The data set belongs to a leading online E-Commerce company. An online retail (E commerce) company wants to know the customers who are going to churn, so accordingly they can approach customer to offer some promos.

**What is E-commerce customer churn?**

Customer churn occurs when a customer stops using a product or service. The source describes the company only as "An Online Retail E-commerce Company" and does not mention an offline store, so we assume there is no offline store and that the company sells solely through its online platform.

Thus, we will tailor our analysis, solutions, and strategies to online purchasing activity.

**Why Does Churn Happen?**

There are several underlying factors, such as issues with:

1.   Service quality
2.   Product quality
3.   An ineffective retention strategy

**Business Case of E-commerce:**

Addressing customer churn matters because churn directly affects revenue:
*   **Cost of Acquisition vs. Retention**: Acquiring a new customer is more expensive than retaining an existing one. High churn rates can lead to increased marketing and acquisition costs.

*   **Lifetime Value**: Loyal customers contribute more lifetime value (LTV), making churn prediction critical for sustaining revenue.

Justification:
1.   Why retaining an existing customer is cheaper than acquiring a new one (Customer Acquisition Cost, CAC).
* Factors and references:
  *   **Trust**: Consumers tend to buy from brands they **trust**. This is why it takes a lot more effort to convert a new customer than to hold a loyal one. That trust can come from **good customer service**, **ease of use**, or **simply because the product solved their problem effectively**. [Forbes](https://www.forbes.com/sites/forbesbusinesscouncil/2022/12/12/customer-retention-versus-customer-acquisition/)
  *   **Higher likelihood to purchase**: Studies show that existing customers are 50% more likely to try new products and spend 31% more than new customers. [Forbes](https://www.forbes.com/sites/forbesagencycouncil/2020/01/29/the-value-of-investing-in-loyal-customers/?sh=1f4d77a21f6b)



2.   Customer Acquisition Cost (CAC) for online retail companies.
  * The average CAC varies across industries. For e-commerce businesses, it is $70. [Average Customer Acquisition Cost](https://userpilot.com/blog/average-customer-acquisition-cost/)

  * Some of the larger companies, like Amazon and eBay, pay between $150 and $200 per customer. For smaller online stores, however, this figure is generally closer to $20 per customer. [Average Customer Acquisition in E-Commerce](https://beprofit.co/a/blog/the-customer-acquisition-cost-in-e-commerce-and-industry)

  * Average advertising spend in a few top retail industries; for e-commerce (as a whole): $68. [Customer acquisition cost statistics](https://www.lightspeedhq.com/blog/customer-acquisition-cost/)

### **1.2 Problem Statement**

**Business Problem Statement:**

`How can we predict whether a customer will churn (stop using the product/services), so that we can apply an appropriate strategy to retain existing customers?`

**Machine Learning System Objective:**

  * Input: Customer information.
  * Output: Whether the customer is likely to churn or not.
  * Objective Function: Minimize the difference between "Predicted churn" and "Actual churn".
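
One common way to quantify the "difference" between predicted and actual churn is binary cross-entropy (log loss). The source does not name a specific loss, so the sketch below is only an illustration under that assumption; the function name `binary_log_loss` and the toy labels and probabilities are hypothetical.

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy between actual labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Hypothetical labels and predicted churn probabilities, for illustration only
actual = [1, 0, 0, 1]
good_model = [0.9, 0.1, 0.2, 0.8]   # predictions close to the actual labels
poor_model = [0.4, 0.6, 0.7, 0.3]   # predictions far from the actual labels

loss_good = binary_log_loss(actual, good_model)
loss_poor = binary_log_loss(actual, poor_model)
print(f'good model: {loss_good:.3f}, poor model: {loss_poor:.3f}')
```

A lower value means the predicted probabilities sit closer to the actual churn labels, so the first set of predictions scores better than the second.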
   
### **1.3 Analytical Approach**
### **1.4 Metrics**
### **1.5 Dataset**
The dataset, an Excel file, was obtained from [Kaggle](https://www.kaggle.com/datasets/ankitverma2010/ecommerce-customer-churn-analysis-and-prediction/data). The file consists of two sheets: `Data Dict` (information about each column in the dataset) and `E Comm` (the dataset).
- `CustomerID`: Unique customer ID
- `Churn`: Churn status
- `Tenure`: Tenure of customer in organization
- `PreferredLoginDevice`: Preferred login device of customer
- `CityTier`: City tier
- `WarehouseToHome`: Distance between the warehouse and the customer's home
- `PreferredPaymentMode`: Preferred payment method of customer
- `Gender`: Gender of customer
- `HourSpendOnApp`: Number of hours spent on mobile app or website
- `NumberOfDeviceRegistered`: Total number of devices registered by a customer
- `PreferedOrderCat`: Preferred order category of customer in last month
- `SatisfactionScore`: Satisfaction score of customer on service
- `MaritalStatus`: Marital status of customer
- `NumberOfAddress`: Total number of addresses of customer
- `Complain`: Complaint raised in last month
- `OrderAmountHikeFromlastYear`: Percentage increase in orders from last year
- `CouponUsed`: Total number of coupons used in last month
- `OrderCount`: Total number of orders placed in last month
- `DaySinceLastOrder`: Days since last order by customer
- `CashbackAmount`: Average cashback in last month

Note that whether every variable is used for data analysis and modeling will be decided from the findings of the data exploration. For example, variables that are moderately to highly correlated with one another will be excluded from later stages to avoid problems in the modeling phase.
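
As a rough illustration of the correlation screening described above, the sketch below lists numeric column pairs whose absolute Pearson correlation reaches a cutoff. The helper name `high_correlation_pairs`, the `0.7` threshold, and the toy values are all assumptions; the source only says "moderate to high".

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.7):
    """Return numeric column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include='number').corr().abs()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are dropped here
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy frame with hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    'CouponUsed': [0, 1, 2, 3, 4, 5],
    'OrderCount': [1, 2, 3, 4, 5, 7],
    'SatisfactionScore': [3, 1, 4, 1, 5, 2],
})
flagged = high_correlation_pairs(toy, threshold=0.7)
print(flagged)
```

Which column of a flagged pair to drop remains a judgment call, for example keeping the one that is easier to interpret or has fewer missing values.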

In [33]:
# simulate cost for each misclassification
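
The cell above is only a placeholder, so what follows is a hedged sketch of how such a simulation could look, not the author's implementation. It assumes a missed churner (false negative) costs one customer acquisition, using the $70 average e-commerce CAC cited in section 1.1, and that an unnecessary promo (false positive) costs a purely hypothetical $10. The function name and toy labels are also hypothetical.

```python
import numpy as np

def misclassification_cost(y_true, y_pred, cost_fn, cost_fp):
    """Total cost of false negatives (missed churners) and false positives (unneeded promos)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    n_fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return n_fn * cost_fn + n_fp * cost_fp

# ASSUMPTION: a missed churner must be replaced at the average CAC from section 1.1
COST_FN = 70
# ASSUMPTION: hypothetical cost of a promo sent to a customer who would have stayed
COST_FP = 10

# Toy labels for illustration: 2 false negatives and 1 false positive
actual    = [1, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 0]
total = misclassification_cost(actual, predicted, COST_FN, COST_FP)
print(total)
```

Under these assumed costs a false negative is seven times as expensive as a false positive, which would argue for a metric that favors recall on the churn class.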

## **2 Initial Inspection**

In [2]:
import pandas as pd     # for data wrangling
import numpy as np      # for numerical operations

In [3]:
# import dataset
data = pd.read_excel('../data/E Commerce Dataset.xlsx', 
                     sheet_name=1)

In [4]:
# create function to inspect df
def inspect_dataframe(df):
    """Print the shape and dtype counts of df and return a per-column summary."""
    print(f'The dataframe contains {df.shape[0]} rows and {df.shape[1]} cols.')
    print(f"- {len(df.select_dtypes(include='number').columns)} are numeric cols")
    print(f"- {len(df.select_dtypes(include='O').columns)} are object cols")
    summary = {
        'ColumnName': df.columns.values.tolist(),
        'Nrow': df.shape[0],                                    # same value for every column
        'DataType': df.dtypes.values.tolist(),
        'NAPct': (df.isna().mean() * 100).round(2).tolist(),    # % of missing values per column
        # % of fully duplicated rows in the whole frame (same value for every column)
        'DuplicatePct': (df.duplicated().sum()/len(df)*100).round(2),
        'UniqueValue': df.nunique().tolist(),                   # distinct values, NaN excluded
        'Sample': [df[col].unique() for col in df.columns]
    }
    return pd.DataFrame(summary)

In [5]:
# inspect df
inspect_dataframe(data)

The dataframe contains 5630 rows and 20 cols.
- 15 are numeric cols
- 5 are object cols


| | ColumnName | Nrow | DataType | NAPct | DuplicatePct | UniqueValue | Sample |
|---|---|---|---|---|---|---|---|
| 0 | CustomerID | 5630 | int64 | 0.0 | 0.0 | 5630 | [50001, 50002, 50003, 50004, 50005, 50006, 500... |
| 1 | Churn | 5630 | int64 | 0.0 | 0.0 | 2 | [1, 0] |
| 2 | Tenure | 5630 | float64 | 4.69 | 0.0 | 36 | [4.0, nan, 0.0, 13.0, 11.0, 9.0, 19.0, 20.0, 1... |
| 3 | PreferredLoginDevice | 5630 | object | 0.0 | 0.0 | 3 | [Mobile Phone, Phone, Computer] |
| 4 | CityTier | 5630 | int64 | 0.0 | 0.0 | 3 | [3, 1, 2] |
| 5 | WarehouseToHome | 5630 | float64 | 4.46 | 0.0 | 34 | [6.0, 8.0, 30.0, 15.0, 12.0, 22.0, 11.0, 9.0, ... |
| 6 | PreferredPaymentMode | 5630 | object | 0.0 | 0.0 | 7 | [Debit Card, UPI, CC, Cash on Delivery, E wall... |
| 7 | Gender | 5630 | object | 0.0 | 0.0 | 2 | [Female, Male] |
| 8 | HourSpendOnApp | 5630 | float64 | 4.53 | 0.0 | 6 | [3.0, 2.0, nan, 1.0, 0.0, 4.0, 5.0] |
| 9 | NumberOfDeviceRegistered | 5630 | int64 | 0.0 | 0.0 | 6 | [3, 4, 5, 2, 1, 6] |


**Note**
- The dataset contains 5630 rows and 20 columns: 5 object columns and 15 numeric columns (including `CustomerID` and the target `Churn`). The 5 object columns will later need to be converted to a numeric representation so they can be modeled together with the other columns.
- No duplicated rows were found in the dataset, so no treatment is needed.
- Missing values were found in 7 columns, namely `Tenure` (4.69%), `WarehouseToHome` (4.46%), `HourSpendOnApp` (4.53%), `OrderAmountHikeFromlastYear` (4.71%), `CouponUsed` (4.55%), `OrderCount` (4.58%), and `DaySinceLastOrder` (5.45%). These columns should be investigated further to understand the mechanism behind the missingness and which treatment is appropriate for each case.
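
One simple first check on the missingness mechanism is to compare the churn rate between rows where a column is missing and rows where it is present: a large gap suggests the values are not missing completely at random. The sketch below assumes that approach; the helper name `churn_rate_by_missingness` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def churn_rate_by_missingness(df, columns, target='Churn'):
    """Compare the mean of the target between rows where each column is missing vs present."""
    rows = []
    for col in columns:
        is_missing = df[col].isna()
        rows.append({
            'Column': col,
            'ChurnRateMissing': df.loc[is_missing, target].mean(),
            'ChurnRatePresent': df.loc[~is_missing, target].mean(),
        })
    return pd.DataFrame(rows)

# Toy frame with hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    'Tenure': [4.0, np.nan, 0.0, 13.0, np.nan, 9.0],
    'Churn':  [1, 1, 0, 0, 1, 0],
})
report = churn_rate_by_missingness(toy, ['Tenure'])
print(report)
```

On the real data, the same call could be made with the seven columns listed above. A formal test would still be needed before drawing conclusions from any gap.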

In [7]:
# get statistical summary for numerical var
data.describe().round(2).transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 5630.0 | 52815.5 | 1625.39 | 50001.0 | 51408.25 | 52815.5 | 54222.75 | 55630.0 |
| Churn | 5630.0 | 0.17 | 0.37 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Tenure | 5366.0 | 10.19 | 8.56 | 0.0 | 2.0 | 9.0 | 16.0 | 61.0 |
| CityTier | 5630.0 | 1.65 | 0.92 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| WarehouseToHome | 5379.0 | 15.64 | 8.53 | 5.0 | 9.0 | 14.0 | 20.0 | 127.0 |
| HourSpendOnApp | 5375.0 | 2.93 | 0.72 | 0.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfDeviceRegistered | 5630.0 | 3.69 | 1.02 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| SatisfactionScore | 5630.0 | 3.07 | 1.38 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| NumberOfAddress | 5630.0 | 4.21 | 2.58 | 1.0 | 2.0 | 3.0 | 6.0 | 22.0 |
| Complain | 5630.0 | 0.28 | 0.45 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


**Note**
- The columns `Tenure`, `WarehouseToHome`, `DaySinceLastOrder`, and `CashbackAmount` appear to contain outliers, as their max values are far greater than their 75th-percentile values. However, this is only an initial inspection: the values may change after data cleaning, and a deeper statistical check will be done later to confirm whether these columns really contain outliers.
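
The note above does not say which statistical technique will be used. One common option is Tukey's IQR rule, sketched below as an assumption. For `WarehouseToHome` in the summary above, Q1 = 9 and Q3 = 20, so the IQR is 11 and the upper fence is 20 + 1.5 × 11 = 36.5, well below the max of 127. The helper name `iqr_outlier_mask` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule); NaN is never flagged."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values chosen so the quartiles match the WarehouseToHome summary (Q1 = 9, Q3 = 20)
toy = pd.Series([6, 8, 9, 12, 14, 15, 20, 22, 127], name='WarehouseToHome')
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

The multiplier `k=1.5` is the conventional default; a larger value such as 3 flags only the most extreme points.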

<div class="alert alert-block alert-warning">
<b>Next steps for data preprocessing</b><br>

- Handle missing values

- Convert object cols to numeric

- Rescale numbers

- Check for class imbalance in the target variable
</div>
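
The first three steps listed above could look roughly like the pandas-only sketch below. The specific choices (median imputation, min-max scaling, one-hot encoding) are assumptions for illustration and are not stated in the source; the function name `preprocess` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess(df, target='Churn'):
    """Impute, rescale, and encode the features; returns (X, y)."""
    X = df.drop(columns=[target]).copy()
    y = df[target]
    num_cols = X.select_dtypes(include='number').columns
    cat_cols = X.columns.difference(num_cols)

    # 1. Handle missing values: median imputation (assumed choice)
    X[num_cols] = X[num_cols].fillna(X[num_cols].median())

    # 2. Rescale numbers to [0, 1] with min-max scaling (assumed choice);
    #    constant columns get a span of 1 to avoid division by zero
    col_min = X[num_cols].min()
    span = (X[num_cols].max() - col_min).replace(0, 1)
    X[num_cols] = (X[num_cols] - col_min) / span

    # 3. Convert object columns to numeric with one-hot encoding
    X = pd.get_dummies(X, columns=list(cat_cols), dtype=float)
    return X, y

# Toy frame with hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    'Tenure': [4.0, np.nan, 0.0, 10.0],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Churn':  [1, 0, 0, 1],
})
X, y = preprocess(toy)
print(X)
```

In a real modeling workflow the medians and min/max values would be learned on the training split only and then applied to the test split, to avoid leaking information.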

In [6]:
# check churn rate
data['Churn'].value_counts(normalize=True)

0    0.831616
1    0.168384
Name: Churn, dtype: float64
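
With roughly 17% churners, the target is imbalanced. One possible way to account for this during modeling is to weight each class inversely to its frequency, using `n_samples / (n_classes * class_count)`. The sketch below assumes that heuristic; the source does not say how the imbalance will be handled, and the helper name `balanced_class_weights` is hypothetical.

```python
import pandas as pd

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    return (len(y) / (len(counts) * counts)).to_dict()

# Toy target with a similar imbalance to the output above (1 churner in 6 customers)
toy_target = pd.Series([0, 0, 0, 0, 0, 1], name='Churn')
weights = balanced_class_weights(toy_target)
print(weights)
```

The minority class receives the larger weight, so errors on churners count more during training.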