<div>
<img src="https://www.thebluediamondgallery.com/handwriting/images/customer-satisfaction.jpg" width="500"/>
</div>

# CHURN PREDICTION


1. [Project Scope](#1)
2. [Analytical approach](#1)
    1. [Exploratory data Analysis](#1)
        1. [Definition of training period, testing period and validation period](#1)
        2. [Outlier removal](#1)
        3. [Transposition of variables](#1)
        4. [Missing value imputation](#1)
    2. [Model Development](#1)
        1. [Types of data pulled](#1)
        2. [Treatment of CID level data](#1)
        3. [Input data](#1)
        4. [Category of variables used](#1)
    3. [Code Snippets](#2)
        1. [Logistic Regression](#2)
        2. [Random Forest](#2)
        3. [Gradient boosted machines](#2)
        4. [Light GBM](#2)
    4. [Selection of Algorithms (Basic -> Advanced)](#3)
3. [Evaluation](#4)
4. [Appendix](#4)




## Project Scope

* Predict the likelihood that each customer will churn within the next 12 weeks.
* Identify key variables that influence customer churn.

<a id="1"></a>

## Analytical approach
a. **Exploratory data Analysis**
* Separate churn models were built for **Online** and **Stores** customers.
* The data was divided into the following time periods:

| Training data| Test data  | Out of time validation |
| --- | --- | --- |
| Sep-2017 to Aug-2018 | Nov-2018 to Oct-2019 | Sep-2016 to Aug-2017 |


* Treatment of customer level data
    * **Categorical importance**: Categorical variables were transposed to capture the relevance of each category, e.g. the contribution of male and female customers.
    * **Outlier removal**: Values beyond the 99th percentile were capped at the 99th percentile value.
    * **Missing value imputation**: Missing values were imputed while keeping the initial variable composition intact.
    * **Elimination of multicollinearity**

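The customer-level treatment steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual pipeline: the sample data, the `prepare` helper and the median/mode imputation rules are assumptions made for the example.

```python
import pandas as pd

# Hypothetical customer-level sample; the column names are illustrative only.
customers = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Gender": ["FEMALE", "MALE", "FEMALE", None, "MALE"],
    "Sales": [120.0, 80.0, None, 95.0, 10000.0],
})


def prepare(df, categorical, numeric):
    """Cap outliers, impute missing values, then one-hot encode categoricals."""
    out = df.copy()
    for col in numeric:
        # Outlier treatment: cap values above the 99th percentile.
        out[col] = out[col].clip(upper=out[col].quantile(0.99))
        # Assumed imputation rule: fill numeric gaps with the median.
        out[col] = out[col].fillna(out[col].median())
    for col in categorical:
        # Assumed imputation rule: fill categorical gaps with the most frequent level.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    # Transposition: one 0/1 indicator column per category level.
    return pd.get_dummies(out, columns=categorical, dtype=int)


prepared = prepare(customers, categorical=["Gender"], numeric=["Sales"])
print(prepared)
```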
b. **Model Development**

* **Types of data pulled**
    * Transaction data: sales, trips, units, margin
    * Demographic data: ethnicity, age, gender
    * Seasonal data: holiday vs non-holiday
    * Loyalty data: credit card holders, loyalty customers
        

<a id="2"></a>

## Code Snippets

import time

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv('data.csv')
data.head()  # the data has 78 variables to start with


############ Separate features and target ####################

# `train` and `test` are assumed to be DataFrames already split by time period
feature = [x for x in train.columns if x != 'Sales']  # every column except the target
y = train[['ID', 'Sales']]
X = train[feature]

X = X.set_index('ID')
test = test.set_index('ID')
y = y.set_index('ID')



############# Remove multicollinearity  ##########################

# Iteratively drop the variable with the highest VIF until no VIF exceeds thresh
def calculate_vif_(X, thresh=5.0):
    variables = list(X.columns)
    dropped = True
    while dropped:
        dropped = False
        print(len(variables))
        # Compute the VIF of every remaining variable in parallel
        vif = Parallel(n_jobs=-1, verbose=5)(
            delayed(variance_inflation_factor)(X[variables].values, ix)
            for ix in range(len(variables))
        )

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(time.ctime() + " dropping '" + variables[maxloc] + "' at index: " + str(maxloc))
            variables.pop(maxloc)
            dropped = True

    print('Remaining variables:')
    print(variables)
    return X[variables]

# VIF is only defined for numeric columns, so non-numeric columns are excluded here
df = train.select_dtypes(include=[np.number])

df2 = calculate_vif_(df, 5)  # Actually running the function

##################### Remove outliers ###########################

# Cap Sales above the 98th percentile and below the 1st percentile
upper_lim = train['Sales'].quantile(.98)
lower_lim = train['Sales'].quantile(.01)
train.loc[train['Sales'] > upper_lim, 'Sales'] = upper_lim
train.loc[train['Sales'] < lower_lim, 'Sales'] = lower_lim


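| Unnamed: 0 | GOLD | 11 | Credit | STORES | FEMALE | 55-64 | Hispanic | Children | $50K-75K | Single | ... | 1.19 | 0.27 | 0.28 | 0.29 | 1.20 | 0.30 | 0.31 | 0.32 | 1.21 | 1.22 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PLATINUM | 14 | NCL | STORES | FEMALE | 25-34 | Hispanic | No Children | $40K-50K | Single | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | GOLD | 16 | Credit | STORES | FEMALE | 35-44 | Hispanic | No Children | <$15K | Single | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | RED | 19 | NCL | STORES | FEMALE | 18-24 | Hispanic | No Children | <$15K | Single | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | RED | 21 | NCL | STORES | FEMALE | 65-74 | Hispanic | Children | $50K-75K | Married | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 4 | RED | 22 | NCL | STORES | FEMALE | 45-54 | Caucasian | Children | $150K-$175K | Married | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |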



### Logistic regression in SAS
proc logistic data=model.stores_2018 descending outest=output_sample outmodel=model.LogisticModel;
Class &Cat_var./param=reference ref=first;
model CHURN_FLAG(event='1') = &All_var./ 
CL selection=stepwise sle=.01 sls=.01 RSQUARE WALDCL LACKFIT CTABLE;
      output out=model.predictions_2019 predicted=p_1 predprob=(individual crossvalidate);
score data=model.stores_2019 out=model.validation_2019;
run;



### Random Forest 
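from pyspark.ml.classification import RandomForestClassifier

# trainingData / testData: Spark DataFrames with a "features" vector column
rf = RandomForestClassifier(labelCol="CHURN_FLAG", featuresCol="features")
rf_model = rf.fit(trainingData)
feature_importance = rf_model.featureImportances
rf_prediction = rf_model.transform(testData)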


### GBM 
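from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

gbt = GBTClassifier(labelCol="CHURN_FLAG", featuresCol="features", maxIter=10)

# 6-fold cross-validation scored by area under the ROC curve (the evaluator's default metric)
evaluator = BinaryClassificationEvaluator(labelCol="CHURN_FLAG")
paramGrid = ParamGridBuilder().build()  # empty grid: only the default parameters are evaluated
crossval = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=6)
model = crossval.fit(trainingData)

# The fitted GBT model is exposed as `bestModel` on the CrossValidatorModel
model.bestModel.save("gbt.model")
print(model.bestModel.featureImportances)
gbt_prediction = model.transform(testData)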



### Light GBM 

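import lightgbm as lgb

lgbm = lgb.LGBMClassifier(
    n_estimators=300, colsample_bytree=0.7, max_depth=20, num_leaves=100,
    reg_alpha=1.1, reg_lambda=1.1, min_split_gain=0.4,
    subsample=0.8, subsample_freq=20,
)
# Stop once the validation AUC has not improved for 5 rounds. LightGBM 4.x takes
# this as a callback; older versions accepted early_stopping_rounds=5 in fit().
model = lgbm.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric='auc',
    callbacks=[lgb.early_stopping(stopping_rounds=5)],
)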



* Categorization for identifying customers at risk :

|Score |Risk | 
| --- | --- | 
| >.72 | High probability to churn | 
| .56 - .74 |Equally likely to churn|
| .4-.56 | Least likely to churn| 
| <.44 | Minimal risk customers |


<a id="3"></a>

## Selection of Algorithms (Basic -> Advanced)
* Exploratory data analysis was performed on the data using PySpark and Python Scripts
* Multiple algorithms were used to predict which customers are likely to churn



| Metric | Logistic Regression | Random Forest | Gradient boosted machine | Light GBM |
| --- | --- | --- | --- | --- |
| Platform used | SAS | PySpark/Pandas | PySpark/Pandas | Pandas |
| Variable importance | Variables with IV < .02 rejected | All | All | All |
| Multicollinearity check | Eliminated (VIF) | Not required | Not required | Not required |
| Final variable selection | ~69 | ~128 | ~128 | ~128 |
| Accuracy | Online (76%) & Stores (73%) | 76% (20% data) | 86% (20% data) | 80% (Store data) |
| Recall | 76% | 76% | 76% | 76% |
| Caveat | Time consuming and accuracy < 80% | No significant improvement over Logistic Regression | Takes 3x time | Results need optimization |
| Status | Accepted as baseline model | Rejected | Rejected | In progress |


## Evaluation : Comparison of Light GBM with Logistic Regression for Store data
### In-time validation

| Metrics | Logistic Regression | Light GBM |
| --- | --- | --- |
| Accuracy |70% | 80%|
| AUC | 73% | 86%|
| Recall | 72% | 77% |

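For reference, the three metrics in the table can be computed from labels and predicted scores as in the sketch below. It is a dependency-free illustration on made-up data; the function names, the 0.5 classification threshold and the sample values are assumptions, not the project's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Share of actual churners (label 1) that were predicted as churners."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos


def auc(y_true, scores):
    """Probability that a random churner is scored above a random non-churner."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Count a tie between a churner and a non-churner as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = churned) and model scores, for illustration only.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred), recall(y_true, y_pred), auc(y_true, scores))
```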

<a id="4"></a>

## Appendix

## Terminologies and Algorithm Description
**Logistic regression:**
* Weight of Evidence (WoE) and Information Value (IV) were calculated for all variables. Variables with an IV score < .2 were excluded.
* Multicollinearity: variables with VIF > 5 were excluded.
* Logistic regression predicts the probability, bounded between 0 and 1, that a customer churns. The model coefficients also give an indication of the relative importance of each input variable.

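As an illustration of the Weight of Evidence and Information Value screening mentioned above, the sketch below uses the common definitions WoE = ln(% non-churners / % churners) per category and IV = sum of (% non-churners - % churners) * WoE. The `information_value` helper, the smoothing constant and the sample data are assumptions for the example, not the SAS implementation used in the project.

```python
import math
from collections import Counter


def information_value(categories, churn_flags, smoothing=0.5):
    """Return (WoE per category, IV) for one categorical variable."""
    events = Counter()      # churners per category
    non_events = Counter()  # non-churners per category
    for cat, flag in zip(categories, churn_flags):
        if flag == 1:
            events[cat] += 1
        else:
            non_events[cat] += 1

    levels = set(events) | set(non_events)
    # Smoothing avoids log(0) when a category has no churners or no non-churners.
    total_events = sum(events.values()) + smoothing * len(levels)
    total_non_events = sum(non_events.values()) + smoothing * len(levels)

    woe = {}
    iv = 0.0
    for level in levels:
        pct_event = (events[level] + smoothing) / total_events
        pct_non_event = (non_events[level] + smoothing) / total_non_events
        woe[level] = math.log(pct_non_event / pct_event)
        iv += (pct_non_event - pct_event) * woe[level]
    return woe, iv


# Made-up data: loyalty tier separates churners, gender does not.
churn = [0, 0, 0, 1, 1, 1, 1, 0]
tier = ["GOLD"] * 4 + ["RED"] * 4
gender = ["F", "M"] * 4

woe_tier, iv_tier = information_value(tier, churn)
woe_gender, iv_gender = information_value(gender, churn)
print(round(iv_tier, 3), round(iv_gender, 3))
```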
**Random Forest**
* A random forest consists of a large number of individual decision trees that operate as an ensemble. Each tree outputs a class prediction, and the class with the most votes becomes the model's prediction.
* Two mechanisms keep the individual trees from being too correlated with one another:
    * Bagging (Bootstrap Aggregation): each tree is trained on a random sample of the dataset drawn with replacement, so the resulting trees differ from one another.
    * Feature randomness: in a normal decision tree, every possible feature is considered when splitting a node, and the one that best separates the observations in the left node from those in the right node is chosen. In a random forest, each tree can pick only from a random subset of features at each split.
* Feature importance can be extracted, and the relevant features can be fed in one by one to measure model performance.

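The two sources of randomness and the voting step described above can be illustrated with the standard library alone. This is a toy sketch: the helper names, the seed and the feature list are assumptions, and no actual trees are trained.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement."""
    return rng.choices(rows, k=len(rows))


def random_feature_subset(features, k, rng):
    """Feature randomness: pick k distinct candidate features for a split."""
    return rng.sample(features, k)


def majority_vote(predictions):
    """Ensemble prediction: the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for row indices of the training data
features = ["Sales", "Trips", "Units", "Margin", "Age", "Gender"]

sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, 2, rng)
print(sample, subset, majority_vote([1, 0, 1]))
```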
**Gradient boosted Trees:**
* Boosting is a method for creating an ensemble. It starts by fitting an initial model (e.g. a tree or linear regression) to the data. A second model is then built that focuses on accurately predicting the cases where the first model performs poorly. The combination of these two models is expected to be better than either model alone. This process is repeated many times, with each successive model attempting to correct the shortcomings of the combined ensemble of all previous models.
* Feature importance can be extracted, and the relevant features can be fed in one by one to measure model performance.

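The boosting loop described above can be sketched in a few lines. This toy version assumes squared-error loss, a single numeric feature and one-split trees (stumps); `fit_stump`, `boost` and the sample data are hypothetical names and values chosen for the illustration, not the PySpark or Light GBM implementations.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree to the residuals by minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return a predictor built from an initial constant plus n_rounds stumps."""
    base = sum(ys) / len(ys)  # initial model: predict the mean
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Made-up data: the target flips from 0 to 1 once x exceeds 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(model(1), model(6))
```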
**Light GBM:**
* Light GBM is a gradient boosting framework that uses tree-based learning algorithms.
* The 'Light' prefix refers to its high speed. Light GBM can handle large datasets and needs less memory to run. It is also popular because it focuses on the accuracy of results.

**[Confusion matrix and recall](https://www.researchgate.net/figure/Modified-Confusion-Matrix-Table-for-Accuracy-Prediction-of-24_fig4_319183193)**

**[ROC & AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)**

