# Mobile Customer Churn

In this Portfolio task you will work with some (fake but realistic) data on Mobile Customer Churn.  Churn is where
a customer leaves the mobile provider.   The goal is to build a simple predictive model to predict churn from available features. 

The data was generated (by Hume Winzar at Macquarie) based on a real dataset provided by Optus.  The data is simulated but the column headings are the same. (Note that I'm not sure if all of the real relationships in this data are preserved so you need to be cautious in interpreting the results of your analysis here).  

The data is provided in file `MobileCustomerChurn.csv` and column headings are defined in a file `MobileChurnDataDictionary.csv` (store these in the `files` folder in your project).

Your high-level goal in this notebook is to build and evaluate a __predictive model for churn__: predict the value of the `CHURN_IND` field in the data from some of the other fields.  Note that the three `RECON` fields should not be used, as they indicate whether the customer reconnected after having churned. 

__Note:__ you are not being evaluated on the _accuracy_ of the model but on the _process_ that you use to generate it.  You can use a simple model such as Logistic Regression for this task or try one of the more advanced methods covered in recent weeks.  Explore the data, build a model using a selection of features and then do some work on finding out which features provide the most accurate results.  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_selection import RFE

In [2]:
churn = pd.read_csv("files/MobileCustomerChurn.csv", na_values=["NA", "#VALUE!"], index_col='INDEX')
churn.head()

INDEX,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,...,CONTRACT_STATUS,PREV_CONTRACT_DURATION,HANDSET_USED_BRAND,CHURN_IND,MONTHLY_SPEND,COUNTRY_METRO_REGION,STATE,RECON_SMS_NEXT_MTH,RECON_TELE_NEXT_MTH,RECON_EMAIL_NEXT_MTH
1,1,46,1,30.0,CONSUMER,46,54.54,NON BYO,15,0,...,OFF-CONTRACT,24,SAMSUNG,1,61.4,COUNTRY,WA,,,
2,2,60,3,55.0,CONSUMER,59,54.54,NON BYO,5,0,...,OFF-CONTRACT,24,APPLE,1,54.54,METRO,NSW,,,
3,5,65,1,29.0,CONSUMER,65,40.9,BYO,15,0,...,OFF-CONTRACT,12,APPLE,1,2.5,COUNTRY,WA,,,
4,6,31,1,51.0,CONSUMER,31,31.81,NON BYO,31,0,...,OFF-CONTRACT,24,APPLE,1,6.48,COUNTRY,VIC,,,
5,8,95,1,31.0,CONSUMER,95,54.54,NON BYO,0,0,...,OFF-CONTRACT,24,APPLE,1,100.22,METRO,NSW,,,


## Data Preparation

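Remove the three `RECON` columns. They record whether the customer reconnected after churning, so they are only known after the outcome and must not be used as predictors.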

In [3]:
churn = churn.drop(['RECON_SMS_NEXT_MTH', 'RECON_TELE_NEXT_MTH', 'RECON_EMAIL_NEXT_MTH'], axis = 1)

In [4]:
churn.head()

INDEX,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,CONTRACT_STATUS,PREV_CONTRACT_DURATION,HANDSET_USED_BRAND,CHURN_IND,MONTHLY_SPEND,COUNTRY_METRO_REGION,STATE
1,1,46,1,30.0,CONSUMER,46,54.54,NON BYO,15,0,24,OFF-CONTRACT,24,SAMSUNG,1,61.4,COUNTRY,WA
2,2,60,3,55.0,CONSUMER,59,54.54,NON BYO,5,0,24,OFF-CONTRACT,24,APPLE,1,54.54,METRO,NSW
3,5,65,1,29.0,CONSUMER,65,40.9,BYO,15,0,12,OFF-CONTRACT,12,APPLE,1,2.5,COUNTRY,WA
4,6,31,1,51.0,CONSUMER,31,31.81,NON BYO,31,0,24,OFF-CONTRACT,24,APPLE,1,6.48,COUNTRY,VIC
5,8,95,1,31.0,CONSUMER,95,54.54,NON BYO,0,0,24,OFF-CONTRACT,24,APPLE,1,100.22,METRO,NSW


Identify the categorical columns that take only two values, since each of these can be encoded as a single 0/1 column.

In [5]:
churn.value_counts('COUNTRY_METRO_REGION')


COUNTRY_METRO_REGION
METRO      31826
COUNTRY    14379
dtype: int64

In [6]:
churn.value_counts('CFU')

CFU
CONSUMER          39087
SMALL BUSINESS     7119
dtype: int64

In [7]:
churn.value_counts('BYO_PLAN_STATUS')

BYO_PLAN_STATUS
NON BYO    35475
BYO        10731
dtype: int64

In [8]:
churn.value_counts('CONTRACT_STATUS')

CONTRACT_STATUS
ON-CONTRACT     28281
OFF-CONTRACT    12460
NO-CONTRACT      5465
dtype: int64

Convert the three two-valued categorical columns into 0/1 indicator columns. `CONTRACT_STATUS` takes three values, so it is not converted here.

In [9]:
churn['Is Metro'] = (churn['COUNTRY_METRO_REGION'] =='METRO').astype(int)

In [10]:
churn['is BYO'] = (churn['BYO_PLAN_STATUS'] == 'BYO').astype(int)

In [11]:
churn['is CONSUMER'] = (churn['CFU'] == 'CONSUMER').astype(int)

In [12]:
churn.head()

INDEX,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,...,CONTRACT_STATUS,PREV_CONTRACT_DURATION,HANDSET_USED_BRAND,CHURN_IND,MONTHLY_SPEND,COUNTRY_METRO_REGION,STATE,Is Metro,is BYO,is CONSUMER
1,1,46,1,30.0,CONSUMER,46,54.54,NON BYO,15,0,...,OFF-CONTRACT,24,SAMSUNG,1,61.4,COUNTRY,WA,0,0,1
2,2,60,3,55.0,CONSUMER,59,54.54,NON BYO,5,0,...,OFF-CONTRACT,24,APPLE,1,54.54,METRO,NSW,1,0,1
3,5,65,1,29.0,CONSUMER,65,40.9,BYO,15,0,...,OFF-CONTRACT,12,APPLE,1,2.5,COUNTRY,WA,0,1,1
4,6,31,1,51.0,CONSUMER,31,31.81,NON BYO,31,0,...,OFF-CONTRACT,24,APPLE,1,6.48,COUNTRY,VIC,0,0,1
5,8,95,1,31.0,CONSUMER,95,54.54,NON BYO,0,0,...,OFF-CONTRACT,24,APPLE,1,100.22,METRO,NSW,1,0,1


Check for NaN values and drop the rows that contain them

In [13]:
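# Print the NaN count per column (a bare expression that is not last in a cell is not displayed)
print(churn.isna().sum())
# Drop every row that contains at least one NaN
churn = churn.dropna(axis=0)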
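
A minimal sketch of the same check on a toy frame. The values are made up and `toy` is a hypothetical stand-in for `churn`. It lists only the columns that actually contain NaN values, then reports how many rows the drop removed, which is worth knowing before discarding data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `churn`; the values are invented for illustration only.
toy = pd.DataFrame({
    "AGE": [30.0, np.nan, 29.0, 51.0],
    "STATE": ["WA", "NSW", None, "VIC"],
    "CHURN_IND": [1, 1, 0, 0],
})

# NaN count per column, keeping only the columns that have any.
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Drop incomplete rows and report how many were lost.
rows_before = len(toy)
toy_clean = toy.dropna(axis=0)
print(f"Dropped {rows_before - len(toy_clean)} of {rows_before} rows")
```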

## Exploring the Dataset

In [14]:
churn.shape

(46129, 21)

In [15]:
churn.describe()

Unnamed: 0,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,SERVICE_TENURE,PLAN_ACCESS_FEE,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND,Is Metro,is BYO,is CONSUMER
count,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0
mean,42338.001344,45.887229,1.554402,41.411607,50.364413,51.360367,10.851157,8.234733,20.350755,15.253051,0.385072,75.16741,0.688612,0.232327,0.847146
std,22102.853209,33.073285,0.834352,15.263812,51.942875,20.854578,9.772148,8.339838,8.033236,10.98164,0.486618,73.392728,0.463066,0.422321,0.359851
min,1.0,0.0,1.0,-4.0,0.0,8.18,0.0,0.0,0.0,0.0,0.0,1.02,0.0,0.0,0.0
25%,24951.0,14.0,1.0,28.0,11.0,36.36,3.0,0.0,24.0,0.0,0.0,36.36,0.0,0.0,1.0
50%,43264.0,44.0,1.0,40.0,35.0,54.54,8.0,7.0,24.0,24.0,0.0,54.54,1.0,0.0,1.0
75%,61141.0,77.0,2.0,52.0,69.0,72.72,16.0,16.0,24.0,24.0,1.0,84.53,1.0,0.0,1.0
max,79500.0,120.0,4.0,116.0,259.0,234.54,147.0,24.0,36.0,36.0,1.0,1965.89,1.0,1.0,1.0


In [16]:
churn.value_counts('CHURN_IND')

CHURN_IND
0    28366
1    17763
dtype: int64
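
These two counts give a reference point for any accuracy figure. As a small worked calculation using only the counts above, a model that always predicts the majority class (no churn) is already correct for roughly 61% of customers, so a useful model needs to beat that.

```python
# Class counts copied from the value_counts output above.
n_stay, n_churn = 28366, 17763
total = n_stay + n_churn

# Share of customers who churned.
churn_rate = n_churn / total

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_stay, n_churn) / total

print(f"churn rate: {churn_rate:.4f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```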

In [17]:
#sns.pairplot(data = churn.sample(1000), hue = 'CHURN_IND')

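## Logistic Regression

Drop the remaining non-numerical columns, since `LogisticRegression` needs numerical inputs. `COUNTRY_METRO_REGION`, `BYO_PLAN_STATUS` and `CFU` are already represented by the indicator columns created earlier; `CONTRACT_STATUS`, `HANDSET_USED_BRAND` and `STATE` are dropped without being encoded.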

In [20]:
churn = churn.drop(['COUNTRY_METRO_REGION', 'BYO_PLAN_STATUS', 'CFU',
                    'CONTRACT_STATUS', 'HANDSET_USED_BRAND', 'STATE'], axis = 1)
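
Dropping `CONTRACT_STATUS`, `HANDSET_USED_BRAND` and `STATE` discards whatever signal they carry. One alternative, shown here only as a sketch on a toy frame and not applied in this notebook, is one-hot encoding with `pd.get_dummies`. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in using the three CONTRACT_STATUS values seen in the data.
toy = pd.DataFrame({
    "CONTRACT_STATUS": ["ON-CONTRACT", "OFF-CONTRACT", "NO-CONTRACT", "ON-CONTRACT"],
    "CHURN_IND": [0, 1, 1, 0],
})

# One 0/1 column per category; drop_first removes one category
# (the alphabetically first) so the columns are not perfectly collinear.
encoded = pd.get_dummies(toy, columns=["CONTRACT_STATUS"], drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True`, the dropped category becomes the reference level against which the other coefficients are interpreted.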

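Split the data into a training set (80%) and a test set (20%). No `random_state` is set, so the exact split, and therefore the accuracies reported below, will vary slightly between runs.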

In [52]:
train, test = train_test_split(churn, test_size = 0.2)
print(train.shape)
print(test.shape)

(36903, 15)
(9226, 15)
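
As an illustrative variant rather than what was run here, the split can be made reproducible with `random_state` and kept class-balanced with `stratify`. The sketch uses synthetic data (a hypothetical `toy` frame with made-up values) in place of `churn`, and the seed 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `churn`; values are random, not real customer data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "MONTHLY_SPEND": rng.uniform(10, 200, size=1000),
    "CHURN_IND": rng.binomial(1, 0.385, size=1000),
})

# random_state fixes the shuffle; stratify keeps the churn rate
# (approximately) equal in the training and test sets.
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["CHURN_IND"]
)

print(train.shape, test.shape)
print(train["CHURN_IND"].mean(), test["CHURN_IND"].mean())
```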


Separate the features from the target, then fit the logistic regression model.

In [63]:
# CUST_ID is an identifier and CHURN_IND is the target, so neither is used as a feature
X_train = train.drop(['CHURN_IND', 'CUST_ID'], axis=1)
y_train = train['CHURN_IND']
X_test = test.drop(['CHURN_IND', 'CUST_ID'], axis=1)
y_test = test['CHURN_IND']

In [64]:
lr = LogisticRegression(max_iter=500)
lr.fit(X_train, y_train)

LogisticRegression(max_iter=500)

In [65]:
lr.coef_

array([[-0.00781907, -0.01455692, -0.01418935, -0.00482768, -0.0091249 ,
         0.00562672, -0.12815835,  0.01047812, -0.00310294,  0.01119849,
        -0.28527938, -0.44918513,  0.10035319]])
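
`coef_` lists the coefficients in the same order as the columns of `X_train`, so pairing them with the column names makes them readable. Because the features are on very different scales, the raw magnitudes are not directly comparable; standardizing the features first is one common way to address this. The sketch below uses synthetic data with a made-up relationship, not the churn dataset, and the names `model`, `coefs` and `ranked` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features that borrow three column names from the notebook.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "ACCOUNT_TENURE": rng.integers(0, 121, size=n),
    "MONTHLY_SPEND": rng.uniform(1, 300, size=n),
    "is BYO": rng.integers(0, 2, size=n),
})

# Assumption for illustration only: churn depends on tenure and nothing else.
logit = 1.0 - 0.03 * X["ACCOUNT_TENURE"].to_numpy()
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Standardize, then fit, so coefficients are per standard deviation of each feature.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
model.fit(X, y)

# Label each coefficient with its column name and rank by absolute size.
coefs = pd.Series(model[-1].coef_[0], index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```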

In [67]:
# Predict on both the training and test sets, then compare the two accuracies
train_preds = lr.predict(X_train)
test_preds = lr.predict(X_test)
print("Train Accuracy:")
print(accuracy_score(y_train, train_preds))
print("Test Accuracy:")
print(accuracy_score(y_test, test_preds))

Train Accuracy:
0.710104869522803
Test Accuracy:
0.716453500975504
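
Accuracy alone does not show which kinds of errors the model makes, and the task also asks which features matter most. `confusion_matrix` and `RFE` are imported at the top of the notebook, and the following sketch shows one possible way to use them. It runs on synthetic data with a made-up relationship and a hypothetical `NOISE` column, so the numbers say nothing about the real churn data. Features are standardized first because `RFE` ranks features by coefficient magnitude.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative columns and two uninformative ones.
rng = np.random.default_rng(0)
n = 3000
X = pd.DataFrame({
    "ACCOUNT_TENURE": rng.integers(0, 121, size=n),
    "MONTHS_OF_CONTRACT_REMAINING": rng.integers(0, 25, size=n),
    "is BYO": rng.integers(0, 2, size=n),
    "NOISE": rng.normal(size=n),  # hypothetical column, unrelated to the target
})
# Assumed relationship, for illustration only.
logit = (1.5 - 0.02 * X["ACCOUNT_TENURE"].to_numpy()
         - 0.1 * X["MONTHS_OF_CONTRACT_REMAINING"].to_numpy())
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
X_test_s = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

lr = LogisticRegression(max_iter=500).fit(X_train_s, y_train)
test_preds = lr.predict(X_test_s)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, test_preds)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("Test accuracy:", accuracy_score(y_test, test_preds))

# Recursive feature elimination: repeatedly drop the weakest feature until two remain.
selector = RFE(LogisticRegression(max_iter=500), n_features_to_select=2)
selector.fit(X_train_s, y_train)
selected = list(X_train_s.columns[selector.support_])
print("Selected features:", selected)
```

On the real data, the same pattern could be repeated for several values of `n_features_to_select`, comparing test accuracy at each step to see where adding features stops helping.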
