
# CUSTOMER CHURN CLASSIFICATION PROJECT


### INTRODUCTION

Customer attrition is one of the biggest costs any organization faces. Customer churn, also known as customer attrition or customer turnover, is the percentage of customers that stopped using a company's product or service within a specified timeframe.

For instance, if you began the year with 500 customers and ended it with 480, then 20 of the original 500 left, which is a churn rate of 4%. If we could work out with reasonable accuracy why and when a customer leaves, the organization could plan its retention initiatives far more effectively.
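The arithmetic in the example above can be sketched as a small helper. The function name `churn_rate` is only illustrative, and the sketch assumes no new customers joined during the period.

```python
def churn_rate(customers_at_start: int, customers_at_end: int) -> float:
    """Return the share of starting customers lost over the period, in percent."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    customers_lost = customers_at_start - customers_at_end
    return 100 * customers_lost / customers_at_start


# Figures from the example in the introduction: 500 customers at the start, 480 at the end
rate = churn_rate(500, 480)
print(f"Churn rate: {rate:.1f}%")
```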

### PROJECT OBJECTIVE
In this project, we aim to find the likelihood of a customer leaving the organization, the key indicators of churn, and the retention strategies that can be implemented to avert this problem.


### PROJECT STRUCTURE

#### Our analysis comprises the following steps:
1) Hypothesis formation and data processing - importing the relevant libraries and modules, cleaning the data, checking data types, encoding data labels, etc.

2) Data evaluation - performing bivariate and multivariate analysis (EDA).

3) Build and select model - training models on the dataset and selecting the best-performing model.

4) Evaluate the chosen model.

5) Model improvement.

6) Future predictions.

7) Key insights and conclusion.


## TECHNICAL CONTENT

### IMPORTING ALL THE RELEVANT LIBRARIES

We start by importing the libraries relevant to this project. The libraries used are listed below.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import os

import pickle

from scipy.stats import norm, skew

from scipy import stats

import statsmodels.api as sm 

### sklearn modules for data preprocessing:

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OrdinalEncoder

from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler 

### sklearn modules for Model Selection:

from sklearn import svm, tree, linear_model, neighbors

from sklearn import naive_bayes, ensemble, discriminant_analysis, gaussian_process

from sklearn.neighbors import KNeighborsClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from xgboost import XGBClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

from sklearn.naive_bayes import GaussianNB

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier 

### sklearn modules for Model Evaluation & Improvement:
    
from sklearn.metrics import confusion_matrix, accuracy_score 

from sklearn.metrics import f1_score, precision_score, recall_score, fbeta_score

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import ShuffleSplit

from sklearn.model_selection import KFold

from sklearn import feature_selection

from sklearn import model_selection

from sklearn import metrics

from sklearn.metrics import classification_report, precision_recall_curve

from sklearn.metrics import auc, roc_auc_score, roc_curve

from sklearn.metrics import balanced_accuracy_score

from sklearn.metrics import make_scorer, recall_score, log_loss

from sklearn.metrics import average_precision_score #Standard 

from imblearn.over_sampling import RandomOverSampler

from scipy.stats import chi2_contingency

from scipy.stats import ttest_ind

import warnings 

warnings.filterwarnings('ignore')


## LOADING ALL THE DATASETS

In this project, three datasets are provided. Two are combined to form the train dataset, while the third is used as the test dataset.


### FETCHING THE 1ST DATASET FROM A MICROSOFT SQL SERVER DATABASE

One of the train datasets was accessed from a Microsoft SQL Server database using the following connection procedure.

%pip install pyodbc

%pip install python-dotenv

import pyodbc #just installed with pip

from dotenv import dotenv_values #import the dotenv_values function from the dotenv package

import pandas as pd

import warnings

warnings.filterwarnings('ignore')

#Load environment variables from .env file into a dictionary

environment_variables = dotenv_values('.env')

#Get the values for the credentials you set in the '.env' file

database = environment_variables.get("DATABASE")

server = environment_variables.get("SERVER")

username = environment_variables.get("USERNAME")

password = environment_variables.get("PASSWORD")


connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"

#Use the connect method of the pyodbc library and pass in the connection string.

#This will connect to the server and might take a few seconds to be complete.

#Check your internet connection if it takes more time than necessary

connection=pyodbc.connect(connection_string)
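A missing entry in the `.env` file only shows up as a confusing driver error at connection time. The sketch below is one possible way to fail earlier. The helper `build_connection_string`, the `REQUIRED_KEYS` tuple and the example credentials are all hypothetical and not part of the original notebook.

```python
# Keys the project reads from the '.env' file
REQUIRED_KEYS = ("SERVER", "DATABASE", "USERNAME", "PASSWORD")


def build_connection_string(env: dict) -> str:
    """Build the ODBC connection string, failing early if a credential is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing credentials in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};"
        f"DATABASE={env['DATABASE']};"
        f"UID={env['USERNAME']};"
        f"PWD={env['PASSWORD']}"
    )


# Placeholder values for illustration only; real values come from dotenv_values('.env')
example_env = {
    "SERVER": "example-server",
    "DATABASE": "example_db",
    "USERNAME": "reader",
    "PASSWORD": "secret",
}
example_connection_string = build_connection_string(example_env)
print(example_connection_string)
```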

#Now the sql query to get the data is shown below.

#Note that you will not have permission to insert, delete or update this database table.

query = "Select * from dbo.LP2_Telco_churn_first_3000"


data = pd.read_sql(query,connection)

#Inspect the loaded dataset (data)

data.head()

data.info()

### IMPORTING THE 2ND DATASET

data2 = pd.read_csv('LP2_Telco-churn-last-2000.csv')

data2.head()

#You can concatenate this with other DataFrames to get the train data set for this project



### CONCATENATING THE TWO DATASETS AS THE TRAIN DATA

train_df = pd.concat([data,data2])

![train%20data%20concat.PNG](attachment:train%20data%20concat.PNG)
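As a small illustration of the concatenation step, the sketch below stacks two toy frames. The frames and their values are made up, and `ignore_index=True` is an optional choice (not confirmed by the source) that renumbers the rows so both parts do not share index labels.

```python
import pandas as pd

# Two toy frames standing in for the database extract and the CSV file
first_part = pd.DataFrame({"customerID": ["A1", "A2"], "Churn": ["No", "Yes"]})
second_part = pd.DataFrame({"customerID": ["B1", "B2"], "Churn": ["Yes", "No"]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's own index
combined = pd.concat([first_part, second_part], ignore_index=True)
print(combined.shape)
```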


### OVERVIEW OF THE TRAIN DATASET
train.shape

![train.head.PNG](attachment:train.head.PNG)



![train.info%20and%20shape.PNG](attachment:train.info%20and%20shape.PNG)

![train.describe%20and%20missing%20value%20checking.PNG](attachment:train.describe%20and%20missing%20value%20checking.PNG)

From the above, the train data has 5,043 rows and 21 columns, and some of its columns contain missing values.
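A minimal sketch of how the shape and the missing values per column can be checked with pandas. The toy frame below is invented for illustration and does not reproduce the project data.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries
sample = pd.DataFrame({
    "tenure": [1, 34, np.nan, 45],
    "TotalCharges": [29.85, np.nan, 108.15, np.nan],
    "Churn": ["No", "No", "Yes", None],
})

n_rows, n_cols = sample.shape            # number of rows and columns
missing_per_column = sample.isna().sum() # count of missing values in each column
print(n_rows, n_cols)
print(missing_per_column)
```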



### IMPORTING THE TEST DATASET

The test dataset is loaded from an Excel file as shown below:

data3 = pd.read_excel(r'Telco-churn-second-2000.xlsx')

test_df = data3

![test%20data.PNG](attachment:test%20data.PNG)

### OVERVIEW OF THE TEST DATA

![testing%20data%20with%2019%20cols.PNG](attachment:testing%20data%20with%2019%20cols.PNG)

From test.shape above, the test data has 2,000 rows and 21 columns. We do not need the Churn column, as the test data is not supposed to contain the target variable.

The customerID column is also not needed, so we remove it as indicated above.

The test data now has 2,000 rows and 19 columns.
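Dropping the two columns can be sketched as follows. The toy frame is invented, and `errors="ignore"` is an optional safeguard (an assumption, not something the source shows) so the call also works when a column is already absent.

```python
import pandas as pd

# Toy stand-in for the test data
test_sample = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "gender": ["Male", "Female"],
    "tenure": [12, 40],
    "Churn": ["No", "Yes"],
})

# Remove the identifier and the target so only input features remain
features_only = test_sample.drop(columns=["customerID", "Churn"], errors="ignore")
print(features_only.shape)
```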

### DATA PROCESSING

![replacing%20false%20with%20no%20in%20train%20data.PNG](attachment:replacing%20false%20with%20no%20in%20train%20data.PNG)



![dividing%20train%20data%20into%20cat%20and%20num,%20correlation.PNG](attachment:dividing%20train%20data%20into%20cat%20and%20num,%20correlation.PNG)

### UNIVARIATE ANALYSIS

Univariate analysis is the simplest method of analyzing data: we examine each variable individually.
For categorical features, we use frequency tables or bar plots, which count the observations in each category of a variable.

For numerical features, probability density plots can be used to look at the distribution of the variable.

#### Target Variable

We first look at the target variable, the Churn column. As it is a categorical variable, let us look at its frequency table, percentage distribution and bar plot.

![uni%201.PNG](attachment:uni%201.PNG)

![uni%201a.PNG](attachment:uni%201a.PNG)

From the above, we can deduce the following:

1) Total number of customers that churned = 1,336, which is about 26.5%

2) Total number of customers that did not churn = 3,706, which is about 73.5%
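The frequency table and percentage distribution can be reproduced from the counts reported above. This is a sketch that rebuilds a Churn series from those two counts rather than from the actual dataset.

```python
import pandas as pd

# Rebuild the target column from the reported counts: 1,336 churned and 3,706 did not
churn = pd.Series(["Yes"] * 1336 + ["No"] * 3706, name="Churn")

counts = churn.value_counts()                                    # frequency table
percentages = (churn.value_counts(normalize=True) * 100).round(1)  # percentage distribution
print(counts)
print(percentages)
```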

### Let us look at some of the other categorical variables

## Gender Column

![gender%20uni.PNG](attachment:gender%20uni.PNG)

From the above, there is little difference between the shares of male and female customers:

Male = 51%

Female = 49%

Across the entire dataset the difference is 2 percentage points, which shows that the gender distribution is balanced.

## Senior Citizens Column

![sc%20uni%20a.PNG](attachment:sc%20uni%20a.PNG)

![sc%20uni%20b.PNG](attachment:sc%20uni%20b.PNG)

From the above graph, 

1) 84% are not Senior Citizens 

2) 16% are Senior Citizens.

The two groups are far from equal in size, so this column is imbalanced.

## Partners Column

![partners%20uni%201.PNG](attachment:partners%20uni%201.PNG)

![partners%20uni%202.PNG](attachment:partners%20uni%202.PNG)

From the analysis above,

1) Total number of customers with a partner = 2,458 (49%)

2) Total number of customers without a partner = 2,585 (51%)

The difference is 2 percentage points, which shows that the two groups are almost evenly split.

## Dependents Column

![dp%20uni%201.PNG](attachment:dp%20uni%201.PNG)

![dp%20uni%202.PNG](attachment:dp%20uni%202.PNG)

Analysis:

1) Total number of customers with dependents = 1,561 (31%)

2) Total number of customers without dependents = 3,482 (69%)

The two groups are not evenly split, so this column is imbalanced.

### BIVARIATE ANALYSIS

Let us look at the relationship between the target variable and the Categorical variables. 

With the bar plots below, we can see the proportion of customers who churned (Yes) and who did not churn (No).

### Gender Vs Churn 


![bi%201a.PNG](attachment:bi%201a.PNG)

![bi%201.PNG](attachment:bi%201.PNG)

Analysis:

1) Total number of males that churned (Yes) = 675 (13.4%)

2) Total number of females that churned (Yes) = 661 (13.1%)

3) Total number of males that did not churn (No) = 1,883 (37.4%)

4) Total number of females that did not churn (No) = 1,823 (36.1%)

This shows that the churn rate is almost the same for both genders.
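The percentages above are shares of the whole dataset. To compare the genders directly, it helps to compute the churn rate within each gender. The sketch below does this from the four counts reported above. The table layout is an illustrative choice, not the notebook's own code.

```python
import pandas as pd

# Gender-by-churn counts taken from the analysis above
counts = pd.DataFrame(
    {"No": [1823, 1883], "Yes": [661, 675]},
    index=pd.Index(["Female", "Male"], name="gender"),
)

# Share of each gender's customers that churned, in percent
churn_rate_by_gender = (counts["Yes"] / counts.sum(axis=1) * 100).round(1)
print(churn_rate_by_gender)
```

Both rates come out at roughly 26%, which is consistent with the observation that gender makes little difference to churn.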

### MULTIVARIATE ANALYSIS

Now, let's look at how three or more variables are related in the train dataset.

We shall look at the numerical columns to see how strongly they relate to one another.

![mult%201.PNG](attachment:mult%201.PNG)


## Correlations

![correlations.PNG](attachment:correlations.PNG)

![mult%202.PNG](attachment:mult%202.PNG)

![mult%203.PNG](attachment:mult%203.PNG)

From the above,

1) TotalCharges is strongly correlated with tenure

2) MonthlyCharges shows little correlation with tenure but is strongly correlated with TotalCharges

3) Each variable is perfectly correlated with itself (the diagonal of the correlation matrix).
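A correlation matrix like the one discussed above can be sketched with `DataFrame.corr()`. The data below is synthetic: it assumes, purely for illustration, that TotalCharges is simply tenure multiplied by MonthlyCharges, which is not how the real dataset was built.

```python
import numpy as np
import pandas as pd

# Synthetic numerical columns (assumption: TotalCharges = tenure * MonthlyCharges)
rng = np.random.default_rng(42)
tenure = rng.integers(1, 73, size=200)
monthly_charges = rng.uniform(20, 120, size=200)
toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly_charges,
    "TotalCharges": tenure * monthly_charges,
})

# Pearson correlation between every pair of numerical columns
corr = toy.corr()
print(corr.round(2))
```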

### TRAIN DATA SPLITTING

Before we carry out the train-test split, we declare our input variables 'x' and our output variable 'y' (the target variable).

This is done as shown below:

![data%20splitting.PNG](attachment:data%20splitting.PNG)

### Train-test splitting to create x_train, y_train, x_eval and y_eval (training data and evaluation data)

After separating the train dataset into the input data 'x' and the output data 'y', we can now carry out the train-test split.

![traintest%20splitting%20col%20splitting.PNG](attachment:traintest%20splitting%20col%20splitting.PNG)
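A minimal sketch of both steps on a toy frame. The 80/20 split, `stratify=y` and `random_state=42` are assumptions chosen for illustration. Stratifying is a common choice when the target is imbalanced, but the source does not say whether it was used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced target (5 'Yes' against 15 'No')
toy = pd.DataFrame({
    "tenure": list(range(1, 21)),
    "Contract": ["Month-to-month", "Two year"] * 10,
    "Churn": ["Yes"] * 5 + ["No"] * 15,
})

x = toy.drop(columns="Churn")  # input variables
y = toy["Churn"]               # target variable

# Hold out 20% for evaluation, keeping the Yes/No ratio the same in both parts
x_train, x_eval, y_train, y_eval = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)
print(x_train.shape, x_eval.shape)
```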

## ENCODING, SCALING AND PIPELINE

For our categorical data, we will use OrdinalEncoder, and for scaling our numerical data, we will use StandardScaler.

To treat missing values, SimpleImputer will be used. For the numerical data, the mean is used as the strategy, while the mode ('most_frequent') is used for the categorical data.

We shall also create a pipeline for each of the numerical and categorical column groups.

This is shown below:

![encoding%20and%20scaling.PNG](attachment:encoding%20and%20scaling.PNG)
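The preprocessing described above can be sketched as two small pipelines combined in a ColumnTransformer. The toy data, the column lists and the step names (`imputer`, `scaler`, `encoder`, `num`, `cat`) are illustrative assumptions and may differ from the notebook shown in the screenshot.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data with one missing numerical and one missing categorical value
toy = pd.DataFrame({
    "tenure": [1.0, 34.0, np.nan, 45.0],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Contract": ["Month-to-month", "One year", np.nan, "One year"],
})
num_cols = ["tenure", "MonthlyCharges"]
cat_cols = ["Contract"]

# Numerical columns: fill gaps with the mean, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then encode as integers
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder()),
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

x_prepared = preprocessor.fit_transform(toy)
print(x_prepared.shape)
```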

### We can now view the x_trained data after fitting and transforming it with the preprocessor shown above.

![x_trained.PNG](attachment:x_trained.PNG)

### END2END_PIPELINE

This end2end_pipeline contains all the steps above, with a slot where we can plug in whichever model we want to use. The 'None' marks the place where the model goes.

![end2end%20pipeline.PNG](attachment:end2end%20pipeline.PNG)
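The idea of a pipeline with an empty model slot can be sketched as follows. Here a plain StandardScaler stands in for the full preprocessor, and the step names `preprocessor` and `model` are assumptions. The slot is filled with `set_params` before fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The final step is None until a model is plugged in
end2end_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),  # stand-in for the real preprocessor
    ("model", None),
])

# Tiny, clearly separable toy data
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fill the slot with a model, then fit and predict through the whole pipeline
end2end_pipeline.set_params(model=LogisticRegression())
end2end_pipeline.fit(X, y)
predictions = end2end_pipeline.predict(X)
print(predictions)
```

Swapping in another estimator only needs a different argument to `set_params`, which is what makes the single pipeline reusable across models.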

## From the above, our end2end_pipeline is ready to be used with any of our models

### MODEL TRAINING AND EVALUATIONS

We shall train the following models and, at the end, select the three best-performing ones:

1) Logistic Regression

2) Decision Tree

3) Support Vector Machine

4) Random Forest

5) KNeighborsClassifier


### OUR PERFORMANCE METRIC SHALL BE THE F1 SCORE

### METRIC, MODELS_TRAINED LIST AND POS_LABEL DECLARATION

![METRIC%20DECL.PNG](attachment:METRIC%20DECL.PNG)

### 1) 1ST MODEL TRAINING - LOGISTIC REGRESSION MODEL

We call our end2end_pipeline, put LogisticRegression in place of 'None', and instantiate and train our model. We then evaluate it on the evaluation dataset and calculate the F1 score as shown below:

![LR1.PNG](attachment:LR1.PNG)

![LR2.PNG](attachment:LR2.PNG)

### 2) 2ND MODEL TRAINING - DECISION TREE MODEL

We call our end2end_pipeline, put DecisionTreeClassifier in place of 'None', and instantiate and train our model. We then evaluate it on the evaluation dataset and calculate the F1 score as shown below:

![dt.PNG](attachment:dt.PNG)

### 3) 3RD MODEL TRAINING - SUPPORT VECTOR MACHINE MODEL

We call our end2end_pipeline, put SVC in place of 'None', and instantiate and train our model. We then evaluate it on the evaluation dataset and calculate the F1 score as shown below:

![svm.PNG](attachment:svm.PNG)


### 4) 4TH MODEL TRAINING - RANDOMFORESTCLASSIFIER MODEL

We call our end2end_pipeline, put RandomForestClassifier in place of 'None', and instantiate and train our model. We then evaluate it on the evaluation dataset and calculate the F1 score as shown below:


![rf1.PNG](attachment:rf1.PNG)

![rf2.PNG](attachment:rf2.PNG)

### 5) 5TH MODEL TRAINING - KNEIGHBORSCLASSIFIER MODEL

We call our end2end_pipeline, put KNeighborsClassifier in place of 'None', and instantiate and train our model. We then evaluate it on the evaluation dataset and calculate the F1 score as shown below:

![knn1.PNG](attachment:knn1.PNG)

![knn2.PNG](attachment:knn2.PNG)

## COMPARING OUR MODELS

After training all five models, we compare them using their F1 scores. This is done by printing all their F1 scores as shown below:

![model%20comparism.PNG](attachment:model%20comparism.PNG)
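A rough sketch of such a comparison loop on synthetic data. The dataset from `make_classification`, the two models chosen and the dictionary name `f1_scores` are illustrative assumptions. In the project the positive label would be the churn class (for example `pos_label='Yes'`), whereas the synthetic labels here are 0 and 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the churn dataset
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.74, 0.26], random_state=42
)
x_train, x_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

# Train each model and record its F1 score on the evaluation split
f1_scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    f1_scores[name] = f1_score(y_eval, model.predict(x_eval), pos_label=1)

# Print the models from best to worst F1 score
for name, score in sorted(f1_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```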

## HYPERPARAMETER TUNING OF OUR THREE BEST MODELS

Before we proceed to tune our best-performing models, let us do the following:

1) Create a list, best_models_trained

2) Encode our target variable 'y' with LabelEncoder

3) Store our three best-performing models above as model2

4) Refit on the F1 score, as our data is imbalanced

5) Use cross-validation with 5 folds to validate our models

This is shown below:

![best%20model.PNG](attachment:best%20model.PNG)
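Encoding the target with LabelEncoder can be sketched as below, using a handful of made-up labels. LabelEncoder sorts the classes, so 'No' becomes 0 and 'Yes' becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# A few example target labels
y = ["No", "Yes", "No", "No", "Yes"]

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(list(label_encoder.classes_))  # classes in sorted order
print(y_encoded.tolist())            # integer codes for each label
```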

### HYPERPARAMETER TUNING OF THE 1ST MODEL - RandomForestClassifier

First, we shall get the parameters by applying .get_params() to the model as shown below:

![rf%20para.PNG](attachment:rf%20para.PNG)

Secondly, we manually select the parameters above that we want to modify, pass them to a grid search and fit it to get the best parameters and the best estimator, as shown below.

![rf%20search.PNG](attachment:rf%20search.PNG)

![rf%20best%20estimator.PNG](attachment:rf%20best%20estimator.PNG)
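A minimal grid-search sketch, assuming `GridSearchCV` with F1 scoring and 5-fold cross-validation as described earlier. The synthetic data and the small parameter grid are placeholders chosen to run quickly. They are not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary data standing in for the encoded train set
X, y = make_classification(
    n_samples=200, n_features=6, weights=[0.74, 0.26], random_state=42
)

# Placeholder grid: a few values for two of the parameters listed by get_params()
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # optimise the F1 score because the classes are imbalanced
    cv=5,          # 5-fold cross-validation
    refit=True,    # refit the best combination on all the data
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the refitted model with the best parameter combination.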

### HYPERPARAMETER TUNING OF THE 2ND BEST MODEL - Support Vector Machine

We get the parameters and apply the same process as above:

![svm%20para1.PNG](attachment:svm%20para1.PNG)

![svm%20para%202.PNG](attachment:svm%20para%202.PNG)

![svm%20para%203.PNG](attachment:svm%20para%203.PNG)

### HYPERPARAMETER TUNING OF THE 3RD BEST MODEL - Decision Tree

We get the parameters and apply the same process as above:

![dt%20param1.PNG](attachment:dt%20param1.PNG)

![dt%20param%202.PNG](attachment:dt%20param%202.PNG)

![dt%20para3.PNG](attachment:dt%20para3.PNG)

![dt%20param%204.PNG](attachment:dt%20param%204.PNG)

## COMPARING THE THREE MODELS

At the end of fine-tuning the models, the results are printed out as shown below.

From the results, RandomForestClassifier is the best of the three, with a score of 58%.

![best.PNG](attachment:best.PNG)

1) RandomForestClassifier = 58%

2) Support Vector Machine = 56%

3) DecisionTree = 52%

## EXPORTING KEY COMPONENTS

After achieving the best model, the key components are exported with pickle for future use. This is done as shown below.

![export.PNG](attachment:export.PNG)
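A small round-trip sketch of exporting and reloading with pickle. The `components` dictionary, its keys and the file name are hypothetical placeholders. In the project the exported objects would be the fitted pipeline, encoder and similar components.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder for the objects to export
components = {
    "model_name": "RandomForestClassifier",
    "target_classes": ["No", "Yes"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "components.pkl")

    # Write the components to disk in binary mode
    with open(path, "wb") as file:
        pickle.dump(components, file)

    # Read them back, as a later prediction script would
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == components)
```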

## HYPOTHESIS TESTING


### HYPOTHESIS STATEMENT

#### H0: CHURN RATE IS NOT DEPENDENT ON THE GENDER

#### H1: CHURN RATE IS DEPENDENT ON THE GENDER


### ANALYTICAL QUESTIONS

#### 1) What are the total charges of all the services?

#### 2) What are the total charges of males and females respectively?

#### 3) Which service has the highest churn rate?

#### 4) Which gender churns the most?

To carry out the Hypothesis testing and also answer some of the questions above, the train dataset will be used.

![data.PNG](attachment:data.PNG)

## HYPOTHESIS TESTING USING CHI-SQUARED TEST

### OUR SIGNIFICANCE LEVEL IS 5% (ALPHA = 0.05)

![hypo%201.PNG](attachment:hypo%201.PNG)

![hypo%202.PNG](attachment:hypo%202.PNG)

From the hypothesis test above, we fail to reject H0: the churn rate does not depend on gender.
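As a cross-check, the chi-squared test of independence can be sketched directly from the gender-by-churn counts reported in the bivariate analysis. This uses `scipy.stats.chi2_contingency`, which applies a continuity correction to 2x2 tables by default. It is a sketch and may differ in detail from the code in the screenshots.

```python
from scipy.stats import chi2_contingency

# Rows: Male, Female. Columns: churned (Yes), did not churn (No).
observed = [
    [675, 1883],
    [661, 1823],
]

chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
reject_h0 = p_value < alpha  # True would mean churn depends on gender

print(f"chi2 = {chi2:.4f}, p-value = {p_value:.4f}, dof = {dof}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```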

## ANSWERING THE ANALYTICAL QUESTIONS

### 1) What are the total charges of all the services?

![Q1.PNG](attachment:Q1.PNG)

### 2) What are the total charges of males and females respectively?

![Q2A.PNG](attachment:Q2A.PNG)

![Q2B.PNG](attachment:Q2B.PNG)

![Q2C.PNG](attachment:Q2C.PNG)

![Q2D.PNG](attachment:Q2D.PNG)

![Q2E.PNG](attachment:Q2E.PNG)
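Questions 1 and 2 come down to a sum and a grouped sum. The sketch below shows both on a toy frame with invented values. The column names `gender` and `TotalCharges` follow the dataset.

```python
import pandas as pd

# Toy frame with invented charges
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "TotalCharges": [100.0, 250.5, 49.5, 150.0],
})

total_charges = toy["TotalCharges"].sum()                     # question 1
charges_by_gender = toy.groupby("gender")["TotalCharges"].sum()  # question 2

print(total_charges)
print(charges_by_gender)
```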

### 3) Which service has the highest churn rate?

![q3.PNG](attachment:q3.PNG)


![q3a.PNG](attachment:q3a.PNG)

![q3b.PNG](attachment:q3b.PNG)

![q3c.PNG](attachment:q3c.PNG)

![q3d.PNG](attachment:q3d.PNG)
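One way to compare churn across services is to compute, for each service, the share of its subscribers that churned. The sketch below does this on a toy frame with invented values. Only two services are included for brevity, and the approach is an assumption rather than a copy of the notebook's code.

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    "StreamingTV": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "PhoneService": ["Yes", "No", "Yes", "Yes", "Yes", "Yes"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes"],
})
services = ["StreamingTV", "PhoneService"]

# For each service, keep only its subscribers and measure how many of them churned
churn_rate_by_service = {}
for service in services:
    subscribers = toy[toy[service] == "Yes"]
    churn_rate_by_service[service] = (subscribers["Churn"] == "Yes").mean() * 100

for service, rate in churn_rate_by_service.items():
    print(f"{service}: {rate:.1f}%")
```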

### 4) Which gender churns the most?

![q5.PNG](attachment:q5.PNG)

![q5a.PNG](attachment:q5a.PNG)

## CONCLUSION AND RECOMMENDATIONS

## The following are the conclusions and recommendations:

1) The company has a churn rate of 26.5%.

2) The company has a higher percentage of non-senior citizens than senior citizens, and as such makes more money, as non-senior citizens are more likely to use more of the digital services.

3) Gender does not have any effect on the churn rate.

4) The service with the highest churn rate is StreamingTV, followed by StreamingMovies and PhoneService.

5) The churn rate can be reduced by lowering the charges for StreamingTV, StreamingMovies and PhoneService, as well as by offering some bonuses.