<a href="https://colab.research.google.com/github/KimaruThagna/data-science-in-pycaret/blob/main/telco_churn_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Churn Analysis
In this analysis, I set out to answer two business questions:
1. How can different ML algorithms be compared quickly and efficiently when classifying the telco churn dataset?
2. What are the top determinants of churn? For each determinant, which values are likely to lead to churn and which are not?

In [None]:
pip install pycaret shap sweetviz

Collecting pycaret
  Downloading https://files.pythonhosted.org/packages/30/4b/c2b856b18c0553238908f34d53e6c211f3cc4bfa13a8e8d522567a00b3d7/pycaret-2.3.0-py3-none-any.whl (261kB)
Collecting shap
  Downloading https://files.pythonhosted.org/packages/b9/f4/c5b95cddae15be80f8e58b25edceca105aa83c0b8c86a1edad24a6af80d3/shap-0.39.0.tar.gz (356kB)
Collecting sweetviz
  Downloading https://files.pythonhosted.org/packages/92/6f/58c132de8243a16c64b741dfc2aa8b31af66334ae6858d97c41846afe642/sweetviz-2.0.9-py3-none-any.whl (15.1MB)
Collecting scikit-plot
  Downloading https://files.pythonhosted.org/packages/7c/47/32520e259340c140a4ad27c1b97050dd3254fdc517b1d59974d47037510e/scikit_plot-0.3.7-py3-none-any.whl
Collecting Boruta
  Downloading https://files.pythonhosted.org/packages/b2/11/583f4ea

In [None]:
import sweetviz as sv
import pandas as pd


In [None]:
telco_df = pd.read_csv('telecom_users.csv')
telco_df.columns

Index(['Unnamed: 0', 'customerID', 'gender', 'SeniorCitizen', 'Partner',
       'Dependents', 'tenure', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges',
       'Churn'],
      dtype='object')

In [None]:
# Initial EDA with a simplified tool (sweetviz)
telco_eda = sv.analyze(telco_df)
telco_eda.show_notebook()





# Observations and Analysis


In [None]:
# Set up the PyCaret classification experiment, ignoring the identifier columns
from pycaret.classification import *
setup_1 = setup(telco_df, target='Churn', ignore_features=['customerID', 'Unnamed: 0'])

| | Description | Value |
|---|---|---|
| 0 | session_id | 7638 |
| 1 | Target | Churn |
| 2 | Target Type | Binary |
| 3 | Label Encoded | No: 0, Yes: 1 |
| 4 | Original Data | (5986, 22) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 2 |
| 7 | Categorical Features | 17 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


# Effective Model Comparison

PyCaret's `compare_models()` function addresses the first question: it cross-validates the available classification algorithms and ranks them by the metric passed as `sort`, for example Accuracy, AUC or Recall.
The `include` parameter restricts the comparison to a chosen list of algorithms.

The ranking is only as good as the data it is run on, so feature engineering should be done before comparing models to get the best possible results.
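To make the idea of "ranking by a sort metric" concrete, here is a minimal sketch of the same pattern written directly against scikit-learn. It is not PyCaret's implementation. The synthetic data, the three candidate models, the 5-fold setting and the helper name `rank_models` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the churn table (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.73, 0.27], random_state=0
)

# Hypothetical shortlist of candidates, similar in spirit to passing a list of models.
candidates = {
    "gbc": GradientBoostingClassifier(random_state=0),
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=0),
}

METRICS = ("accuracy", "recall", "roc_auc")


def rank_models(models, X, y, sort="accuracy", folds=5):
    """Cross-validate each model and return (name, mean scores) sorted by `sort`."""
    rows = []
    for name, model in models.items():
        cv = cross_validate(model, X, y, cv=folds, scoring=list(METRICS))
        means = {m: float(cv[f"test_{m}"].mean()) for m in METRICS}
        rows.append((name, means))
    # Highest mean score for the chosen metric comes first.
    return sorted(rows, key=lambda row: row[1][sort], reverse=True)


ranking = rank_models(candidates, X, y, sort="recall")
for name, means in ranking:
    print(name, {m: round(v, 4) for m, v in means.items()})
```

Changing `sort` changes the order of the ranking but not the scores themselves, which is why the choice of metric should follow the business question (for churn, recall on the churners is often of more interest than raw accuracy).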

In [None]:
best = compare_models(sort='Accuracy')

| ID | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| gbc | Gradient Boosting Classifier | 0.8019 | 0.8431 | 0.4897 | 0.676 | 0.5677 | 0.4434 | 0.4533 | 14.716 |
| lr | Logistic Regression | 0.8 | 0.8373 | 0.5301 | 0.6529 | 0.5847 | 0.4549 | 0.4595 | 17.936 |
| ada | Ada Boost Classifier | 0.799 | 0.8408 | 0.531 | 0.6499 | 0.584 | 0.4533 | 0.4577 | 4.074 |
| ridge | Ridge Classifier | 0.7947 | 0.0 | 0.4932 | 0.6497 | 0.5606 | 0.43 | 0.437 | 1.698 |
| lightgbm | Light Gradient Boosting Machine | 0.7888 | 0.8223 | 0.5077 | 0.6282 | 0.5608 | 0.4239 | 0.4285 | 0.564 |
| rf | Random Forest Classifier | 0.7871 | 0.8226 | 0.4501 | 0.6434 | 0.5289 | 0.3969 | 0.408 | 5.265 |
| et | Extra Trees Classifier | 0.7776 | 0.8094 | 0.4438 | 0.6134 | 0.514 | 0.3748 | 0.3836 | 7.791 |
| dt | Decision Tree Classifier | 0.7675 | 0.6873 | 0.514 | 0.5708 | 0.5398 | 0.3851 | 0.3868 | 0.479 |
| knn | K Neighbors Classifier | 0.7632 | 0.7685 | 0.5041 | 0.5609 | 0.5304 | 0.3729 | 0.3742 | 2.545 |
| svm | SVM - Linear Kernel | 0.6735 | 0.0 | 0.6935 | 0.4375 | 0.5035 | 0.3085 | 0.3512 | 2.805 |


In [None]:
tuned = tune_model(best)
evaluate_model(tuned)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.79 | 0.8392 | 0.4595 | 0.6456 | 0.5368 | 0.406 | 0.4158 |
| 1 | 0.7924 | 0.8344 | 0.4414 | 0.6622 | 0.5297 | 0.4033 | 0.4169 |
| 2 | 0.7924 | 0.8368 | 0.4324 | 0.6667 | 0.5246 | 0.3994 | 0.4147 |
| 3 | 0.8091 | 0.8432 | 0.4505 | 0.7246 | 0.5556 | 0.4423 | 0.4626 |
| 4 | 0.7757 | 0.8341 | 0.4234 | 0.6104 | 0.5 | 0.3614 | 0.3715 |
| 5 | 0.8019 | 0.8359 | 0.4324 | 0.7059 | 0.5363 | 0.4195 | 0.4398 |
| 6 | 0.7924 | 0.824 | 0.4595 | 0.6538 | 0.5397 | 0.4109 | 0.4215 |
| 7 | 0.7924 | 0.8125 | 0.4196 | 0.6812 | 0.5193 | 0.3963 | 0.4152 |
| 8 | 0.8138 | 0.8655 | 0.4554 | 0.75 | 0.5667 | 0.457 | 0.4801 |
| 9 | 0.7947 | 0.7895 | 0.4554 | 0.6711 | 0.5426 | 0.4164 | 0.4294 |
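The tuning output above lists per-fold scores only, so it is not obvious at a glance whether tuning helped. A small sketch that averages the folds and compares them with the untuned scores, assuming `best` is the Gradient Boosting Classifier that tops the comparison table (the variable names here are illustrative, not part of PyCaret):

```python
from statistics import mean, pstdev

# Per-fold scores copied from the tune_model output above.
fold_accuracy = [0.79, 0.7924, 0.7924, 0.8091, 0.7757,
                 0.8019, 0.7924, 0.7924, 0.8138, 0.7947]
fold_recall = [0.4595, 0.4414, 0.4324, 0.4505, 0.4234,
               0.4324, 0.4595, 0.4196, 0.4554, 0.4554]

# Scores reported by compare_models for the untuned gbc model
# (assumption: this is the model held in `best`).
baseline = {"Accuracy": 0.8019, "Recall": 0.4897}

tuned_means = {"Accuracy": mean(fold_accuracy), "Recall": mean(fold_recall)}

for metric, value in tuned_means.items():
    diff = value - baseline[metric]
    print(f"{metric}: tuned mean={value:.4f}, "
          f"untuned={baseline[metric]:.4f}, diff={diff:+.4f}")

# Population standard deviation of accuracy across the 10 folds.
print(f"Accuracy SD across folds: {pstdev(fold_accuracy):.4f}")
```

On these numbers, the tuned model's mean accuracy and recall both come out slightly below the untuned scores, so tuning with default settings did not improve the model in this run.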

