# Mobile Customer Churn

In this Portfolio task you will work with some (fake but realistic) data on Mobile Customer Churn. Churn occurs when
a customer leaves the mobile provider. The goal is to build a simple model that predicts churn from the available features.

The data was generated (by Hume Winzar at Macquarie) based on a real dataset provided by Optus.  The data is simulated but the column headings are the same. (Note that I'm not sure if all of the real relationships in this data are preserved so you need to be cautious in interpreting the results of your analysis here).  

The data is provided in file `MobileCustomerChurn.csv` and column headings are defined in a file `MobileChurnDataDictionary.csv` (store these in the `files` folder in your project).

Your high-level goal in this notebook is to build and evaluate a __predictive model for churn__: predict the value of the `CHURN_IND` field from some of the other fields. Note that the three `RECON` fields should not be used, as they indicate whether the customer reconnected after having churned.

__Note:__ you are not being evaluated on the _accuracy_ of the model but on the _process_ that you use to generate it.  You can use a simple model such as Logistic Regression for this task or try one of the more advanced methods covered in recent weeks.  Explore the data, build a model using a selection of features and then do some work on finding out which features provide the most accurate results.  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
churn = pd.read_csv("files/MobileCustomerChurn.csv", na_values=["NA", "#VALUE!"], index_col='INDEX')
churn.head()

INDEX,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,...,CONTRACT_STATUS,PREV_CONTRACT_DURATION,HANDSET_USED_BRAND,CHURN_IND,MONTHLY_SPEND,COUNTRY_METRO_REGION,STATE,RECON_SMS_NEXT_MTH,RECON_TELE_NEXT_MTH,RECON_EMAIL_NEXT_MTH
1,1,46,1,30.0,CONSUMER,46,54.54,NON BYO,15,0,...,OFF-CONTRACT,24,SAMSUNG,1,61.4,COUNTRY,WA,,,
2,2,60,3,55.0,CONSUMER,59,54.54,NON BYO,5,0,...,OFF-CONTRACT,24,APPLE,1,54.54,METRO,NSW,,,
3,5,65,1,29.0,CONSUMER,65,40.9,BYO,15,0,...,OFF-CONTRACT,12,APPLE,1,2.5,COUNTRY,WA,,,
4,6,31,1,51.0,CONSUMER,31,31.81,NON BYO,31,0,...,OFF-CONTRACT,24,APPLE,1,6.48,COUNTRY,VIC,,,
5,8,95,1,31.0,CONSUMER,95,54.54,NON BYO,0,0,...,OFF-CONTRACT,24,APPLE,1,100.22,METRO,NSW,,,


In [3]:
#Explore the data, build a model using a selection of features and then do some work on finding out 
#which features provide the most accurate results.

churn = churn.dropna(subset=["STATE","COUNTRY_METRO_REGION","AGE"]) #drop rows with missing values in these columns
churn.isna().sum() #confirm the drop (the RECON columns keep their NaNs since they are excluded from the features anyway)

CUST_ID                             0
ACCOUNT_TENURE                      0
ACCT_CNT_SERVICES                   0
AGE                                 0
CFU                                 0
SERVICE_TENURE                      0
PLAN_ACCESS_FEE                     0
BYO_PLAN_STATUS                     0
PLAN_TENURE                         0
MONTHS_OF_CONTRACT_REMAINING        0
LAST_FX_CONTRACT_DURATION           0
CONTRACT_STATUS                     0
PREV_CONTRACT_DURATION              0
HANDSET_USED_BRAND                  0
CHURN_IND                           0
MONTHLY_SPEND                       0
COUNTRY_METRO_REGION                0
STATE                               0
RECON_SMS_NEXT_MTH              17763
RECON_TELE_NEXT_MTH             17763
RECON_EMAIL_NEXT_MTH            17763
dtype: int64

In [4]:
#Examining data
churn.describe()

statistic,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,SERVICE_TENURE,PLAN_ACCESS_FEE,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND,RECON_SMS_NEXT_MTH,RECON_TELE_NEXT_MTH,RECON_EMAIL_NEXT_MTH
count,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,46129.0,28366.0,28366.0,28366.0
mean,42338.001344,45.887229,1.554402,41.411607,50.364413,51.360367,10.851157,8.234733,20.350755,15.253051,0.385072,75.16741,0.014665,0.191462,0.007051
std,22102.853209,33.073285,0.834352,15.263812,51.942875,20.854578,9.772148,8.339838,8.033236,10.98164,0.486618,73.392728,0.120212,0.393458,0.083673
min,1.0,0.0,1.0,-4.0,0.0,8.18,0.0,0.0,0.0,0.0,0.0,1.02,0.0,0.0,0.0
25%,24951.0,14.0,1.0,28.0,11.0,36.36,3.0,0.0,24.0,0.0,0.0,36.36,0.0,0.0,0.0
50%,43264.0,44.0,1.0,40.0,35.0,54.54,8.0,7.0,24.0,24.0,0.0,54.54,0.0,0.0,0.0
75%,61141.0,77.0,2.0,52.0,69.0,72.72,16.0,16.0,24.0,24.0,1.0,84.53,0.0,0.0,0.0
max,79500.0,120.0,4.0,116.0,259.0,234.54,147.0,24.0,36.0,36.0,1.0,1965.89,1.0,1.0,1.0
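
The summary shows `AGE` running from -4 to 116, so some recorded ages are implausible. A minimal sketch of flagging such rows before modelling, using a small made-up frame; the bounds of 16 and 100 are assumptions for illustration and do not come from the data dictionary.

```python
import pandas as pd

# Toy stand-in for the churn frame; the values are illustrative only.
toy = pd.DataFrame({"AGE": [30.0, -4.0, 55.0, 116.0, 29.0],
                    "CHURN_IND": [1, 0, 1, 0, 1]})

# Assumed plausible age range (hypothetical bounds).
MIN_AGE, MAX_AGE = 16, 100

# between() is inclusive on both ends by default.
plausible = toy["AGE"].between(MIN_AGE, MAX_AGE)
print("rows flagged as implausible:", int((~plausible).sum()))

# Keep only the rows whose age falls inside the assumed range.
toy_clean = toy[plausible]
print(toy_clean["AGE"].describe()[["min", "max"]])
```

Whether to drop, cap, or keep such rows is a modelling decision; the point is only to make the choice explicit.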


In [5]:
print(churn.shape)

(46129, 21)


In [6]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn import metrics
from sklearn.feature_selection import RFE

train, test = train_test_split(churn, test_size=0.2, random_state=42) #80% training, 20% testing
print(train.shape)   #fixing random_state keeps the split identical across runs, so feature sets and models are compared on the same rows
print(test.shape)

(36903, 21)
(9226, 21)


In [7]:
set(churn['CHURN_IND']) #Checking CHURN_IND classes

{0, 1}

In [8]:
# Logistic Regression function (with confusion matrix and ROC-AUC score)

def LogisticRegressionModel(x_train,y_train,x_test,y_test):
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    model.fit(x_train,y_train)
    
    #Testing (predict once and reuse the result)
    yhat = model.predict(x_test)
    
    print("\nActual CHURN_IND sample values from test dataset:")
    print(y_test[:5])

    print("\nCorresponding Predicted CHURN_IND samples:")
    print(yhat[:5])
    
    #Second model: logistic regression wrapped in RFE (recursive feature elimination).
    #n_features_to_select=5 means nothing is eliminated when x_train has 5 or fewer columns.
    lr_model = LogisticRegression()
    rfe = RFE(estimator=lr_model, n_features_to_select=5, step=1)
    rfe.fit(x_train,y_train)

    print("\nConfusion matrix on test set: ")
    print(confusion_matrix(y_test, yhat))

    #scikit-learn layout for labels [0, 1] (rows = actual, columns = predicted):
    #[TN FP 
    # FN TP]

    #accuracy of the RFE model (a fraction between 0 and 1, not a percentage)
    y_test_hat = rfe.predict(x_test) #predicting, so use x_test
    print("Accuracy of RFE model: {:.2f}".format(accuracy_score(y_test, y_test_hat)))

    #ROC-AUC score, computed here from hard 0/1 predictions rather than predicted probabilities
    score = metrics.roc_auc_score(y_test,y_test_hat)
    print("ROC score: {:.2f}".format(score))
    
    #Accuracy score of the first model
    testing_accuracy = metrics.accuracy_score(y_test, yhat)
    print("\n\nTesting accuracy: {:.2f}".format(testing_accuracy))
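
`roc_auc_score` is usually given a continuous score, such as the predicted probability of the positive class; with hard 0/1 predictions only a single threshold is evaluated. A small sketch of the difference on synthetic data (the dataset and variable names are illustrative and are not the churn data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the churn features.
X, y = make_classification(n_samples=2000, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard class labels: only the default 0.5 threshold is evaluated.
auc_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the probability of class 1: every threshold contributes.
auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:        {:.3f}".format(auc_labels))
print("AUC from probabilities: {:.3f}".format(auc_proba))
```

The two numbers answer different questions, so they should not be compared across notebooks without knowing which input was used.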
    

In [9]:
# Logistic Regression function

def LogisticRegressionModel0(x_train,y_train,x_test,y_test):
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    model.fit(x_train,y_train)
    
    #Testing (predict once and reuse the result)
    y_pred = model.predict(x_test)

    print("Actual CHURN_IND sample values from test dataset:")
    print(y_test[:5])

    print("\nCorresponding Predicted CHURN_IND samples:")
    print(y_pred[:5])
    
    #Evaluation of model (accuracy is a fraction between 0 and 1, not a percentage)
    testing_accuracy = metrics.accuracy_score(y_test, y_pred)
    print("\n\nTesting accuracy: {:.2f}".format(testing_accuracy))

In [10]:
# K Nearest Neighbour function

def KNNModel(x_train,y_train,x_test,y_test):
    from sklearn.neighbors import KNeighborsClassifier

    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(x_train, y_train)
    
    #Testing (predict once and reuse the result)
    y_pred = model.predict(x_test)
    
    print("Actual CHURN_IND sample values from test dataset:")
    print(y_test[:5])

    print("\nCorresponding Predicted CHURN_IND samples:")
    print(y_pred[:5])

    #Evaluation of model (accuracy is a fraction between 0 and 1, not a percentage)
    testing_accuracy = accuracy_score(y_test,y_pred)
    print("\n\nTesting accuracy: {:.2f}".format(testing_accuracy))
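
K Nearest Neighbour is distance-based, so a feature on a large scale (for example `MONTHLY_SPEND`, which reaches 1965.89) can dominate one on a small scale (such as `ACCT_CNT_SERVICES`, at most 4). A hedged sketch of standardising features inside a pipeline; the synthetic data, the rescaling factor, and `n_neighbors=5` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X = X * np.array([1, 1, 1, 1, 1, 1000.0])  # blow up the scale of one column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Unscaled KNN: distances are dominated by the large-scale column.
raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# The scaler is fitted on the training split only, then applied to the test split.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

acc_raw = raw.score(X_te, y_te)
acc_scaled = scaled.score(X_te, y_te)
print("unscaled: {:.3f}  scaled: {:.3f}".format(acc_raw, acc_scaled))
```

Putting the scaler in the pipeline keeps test-set statistics out of the fitting step, which matters when comparing models on a fixed split.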

In [11]:
#GaussianNBModel function

def GaussianNBModel(x_train,y_train,x_test,y_test):
    from sklearn.naive_bayes import GaussianNB
    model = GaussianNB()
    model.fit(x_train, y_train)

    #Testing (predict once and reuse the result)
    y_pred = model.predict(x_test)
    
    print("Actual CHURN_IND sample values from test dataset:")
    print(y_test[:5])

    print("\nCorresponding Predicted CHURN_IND samples:")
    print(y_pred[:5])
    
    #Evaluation of model (true labels first; accuracy is a fraction between 0 and 1)
    testing_accuracy = accuracy_score(y_test, y_pred)
    print("\n\nTesting accuracy: {:.2f}".format(testing_accuracy))

In [25]:
#Define the columns to DROP for each feature set (each parN lists what is removed; the columns left over are the features)

par1 = ['CHURN_IND','BYO_PLAN_STATUS','CFU','CONTRACT_STATUS','HANDSET_USED_BRAND',
                       'COUNTRY_METRO_REGION','STATE','RECON_SMS_NEXT_MTH','RECON_TELE_NEXT_MTH',
                       'RECON_EMAIL_NEXT_MTH']

par2 = ['ACCT_CNT_SERVICES','ACCOUNT_TENURE','SERVICE_TENURE','PLAN_ACCESS_FEE','PLAN_TENURE',
                       'MONTHS_OF_CONTRACT_REMAINING','LAST_FX_CONTRACT_DURATION','PREV_CONTRACT_DURATION', 
                       'CHURN_IND','BYO_PLAN_STATUS','CFU','CONTRACT_STATUS','HANDSET_USED_BRAND',
                       'COUNTRY_METRO_REGION','STATE','RECON_SMS_NEXT_MTH','RECON_TELE_NEXT_MTH',
                       'RECON_EMAIL_NEXT_MTH']

par3 = ['MONTHLY_SPEND','AGE','SERVICE_TENURE','PLAN_ACCESS_FEE','PLAN_TENURE',
                       'MONTHS_OF_CONTRACT_REMAINING','LAST_FX_CONTRACT_DURATION','PREV_CONTRACT_DURATION', 
                       'CHURN_IND','BYO_PLAN_STATUS','CFU','CONTRACT_STATUS','HANDSET_USED_BRAND',
                       'COUNTRY_METRO_REGION','STATE','RECON_SMS_NEXT_MTH','RECON_TELE_NEXT_MTH',
                       'RECON_EMAIL_NEXT_MTH']

par4 = ['MONTHLY_SPEND','AGE','SERVICE_TENURE','PLAN_ACCESS_FEE','PLAN_TENURE',
                       'ACCOUNT_TENURE','ACCT_CNT_SERVICES', 
                       'CHURN_IND','BYO_PLAN_STATUS','CFU','CONTRACT_STATUS','HANDSET_USED_BRAND',
                       'COUNTRY_METRO_REGION','STATE','RECON_SMS_NEXT_MTH','RECON_TELE_NEXT_MTH',
                       'RECON_EMAIL_NEXT_MTH']

par5 = ['AGE','ACCOUNT_TENURE','ACCT_CNT_SERVICES',
                       'MONTHS_OF_CONTRACT_REMAINING','LAST_FX_CONTRACT_DURATION','PREV_CONTRACT_DURATION', 
                       'CHURN_IND','BYO_PLAN_STATUS','CFU','CONTRACT_STATUS','HANDSET_USED_BRAND',
                       'COUNTRY_METRO_REGION','STATE','RECON_SMS_NEXT_MTH','RECON_TELE_NEXT_MTH',
                       'RECON_EMAIL_NEXT_MTH']

par6 = ['ACCT_CNT_SERVICES','AGE','MONTHLY_SPEND','PLAN_ACCESS_FEE',
                       'MONTHS_OF_CONTRACT_REMAINING','LAST_FX_CONTRACT_DURATION','PREV_CONTRACT_DURATION', 
                       'CHURN_IND','BYO_PLAN_STATUS','CFU','CONTRACT_STATUS','HANDSET_USED_BRAND',
                       'COUNTRY_METRO_REGION','STATE','RECON_SMS_NEXT_MTH','RECON_TELE_NEXT_MTH',
                       'RECON_EMAIL_NEXT_MTH']

#Note: CUST_ID is not in any drop list, so it stays in every feature set below
x_train1 = train.drop(par1,axis=1)
x_test1 = test.drop(par1,axis=1) #All numerical fields

x_train2 = train.drop(par2,axis=1)
x_test2 = test.drop(par2,axis=1) #Age and monthly spend

x_train3 = train.drop(par3,axis=1)
x_test3 = test.drop(par3,axis=1) #Account information

x_train4 = train.drop(par4,axis=1)
x_test4 = test.drop(par4,axis=1) #Monthly contract info (excluding monthly spending)

x_train5 = train.drop(par5,axis=1)
x_test5 = test.drop(par5,axis=1) #Plans/service information and monthly spending

x_train6 = train.drop(par6,axis=1)
x_test6 = test.drop(par6,axis=1) #Tenure information


#y_train/test will always be the same (CHURN_IND)
y_train1 = train['CHURN_IND']
y_test1 = test['CHURN_IND']
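
Because each `parN` list names the columns to drop, any column that is not listed (such as `CUST_ID`) silently stays in the feature set. One alternative is to name the columns to keep. The sketch below uses a tiny made-up frame; `feature_sets` and `select_features` are hypothetical names, and the groupings simply mirror the comments above.

```python
import pandas as pd

# Hypothetical keep-lists mirroring the feature groups described above.
feature_sets = {
    "age_and_spend": ["AGE", "MONTHLY_SPEND"],
    "account_info": ["ACCOUNT_TENURE", "ACCT_CNT_SERVICES"],
    "tenure_info": ["ACCOUNT_TENURE", "SERVICE_TENURE", "PLAN_TENURE"],
}

def select_features(frame, name):
    """Return only the columns named in feature_sets[name]."""
    return frame[feature_sets[name]]

# Toy rows shaped like churn.head(); this is not the real train/test split.
toy = pd.DataFrame({
    "CUST_ID": [1, 2, 5],
    "ACCOUNT_TENURE": [46, 60, 65],
    "ACCT_CNT_SERVICES": [1, 3, 1],
    "AGE": [30.0, 55.0, 29.0],
    "SERVICE_TENURE": [46, 59, 65],
    "PLAN_TENURE": [15, 5, 15],
    "MONTHLY_SPEND": [61.4, 54.54, 2.5],
    "CHURN_IND": [1, 1, 1],
})

x_toy = select_features(toy, "age_and_spend")
print(list(x_toy.columns))  # CUST_ID and CHURN_IND are excluded by construction
```

With keep-lists, adding a new column to the CSV cannot change an existing feature set by accident.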

In [13]:
#Which features should we use for our model? Does the choice change the accuracy in predicting CHURN_IND?
print("Logistic Regression predicting CHURN_IND from Account information:")
LogisticRegressionModel(x_train3,y_train1,x_test3,y_test1)

Logistic Regression predicting CHURN_IND from Account information:

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 0 0 0]

Confusion matrix on test set: 
[[5677    0]
 [2859  690]]
Accuracy of RFE model: 0.69
ROC score: 0.60


Testing accuracy: 0.69
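
The figures above can be reproduced from the confusion matrix itself. A short worked sketch in plain Python, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, labels ordered 0 then 1):

```python
# Confusion matrix printed above for the account-information features.
tn, fp = 5677, 0
fn, tp = 2859, 690

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)            # share of actual churners that were caught
specificity = tn / (tn + fp)       # share of non-churners correctly kept
balanced = (recall + specificity) / 2

# Accuracy of always predicting the majority class (no churn).
baseline = (tn + fp) / total

print("accuracy    {:.3f}".format(accuracy))
print("recall      {:.3f}".format(recall))
print("balanced    {:.3f}".format(balanced))
print("baseline    {:.3f}".format(baseline))
```

This model catches about 19% of churners, and its accuracy of about 0.69 sits only a few points above the roughly 0.62 obtained by always predicting no churn. The balanced figure of about 0.60 equals the ROC score printed above, which is what `roc_auc_score` returns when it is given hard 0/1 predictions.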


In [14]:
print("6.Logistic Regression predicting CHURN_IND from Monthly contract info (excluding monthly spending):")
LogisticRegressionModel(x_train4,y_train1,x_test4,y_test1)

6.Logistic Regression predicting CHURN_IND from Monthly contract info (excluding monthly spending):

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 1 0 0]

Confusion matrix on test set: 
[[5132  545]
 [1912 1637]]
Accuracy of RFE model: 0.73
ROC score: 0.68


Testing accuracy: 0.73


In [15]:
print("Logistic Regression predicting CHURN_IND from Tenure information:")
LogisticRegressionModel(x_train6,y_train1,x_test6,y_test1)

Logistic Regression predicting CHURN_IND from Tenure information:

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 0 0 0]

Confusion matrix on test set: 
[[5303  374]
 [2063 1486]]
Accuracy of RFE model: 0.74
ROC score: 0.68


Testing accuracy: 0.74


In [16]:
print("Logistic Regression predicting CHURN_IND from Plans/service information and monthly spending:")
LogisticRegressionModel(x_train5,y_train1,x_test5,y_test1)

Logistic Regression predicting CHURN_IND from Plans/service information and monthly spending:

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 0 0 0]

Confusion matrix on test set: 
[[5195  482]
 [1529 2020]]
Accuracy of RFE model: 0.78
ROC score: 0.74


Testing accuracy: 0.78


In [17]:
print("Logistic Regression predicting CHURN_IND from age and monthly spending:")
LogisticRegressionModel(x_train2,y_train1,x_test2,y_test1)

Logistic Regression predicting CHURN_IND from age and monthly spending:

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 0 0 0]

Confusion matrix on test set: 
[[5213  464]
 [1444 2105]]
Accuracy of RFE model: 0.79
ROC score: 0.76


Testing accuracy: 0.79


In [18]:
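print("From the results above, age and monthly spending is the most accurate feature subset (testing accuracy 0.79).\n"
     "Using all numerical fields in churn gives the same 0.79 (next cell), so all numerical fields are used below.")

From the results above, age and monthly spending is the most accurate feature subset (testing accuracy 0.79).
Using all numerical fields in churn gives the same 0.79 (next cell), so all numerical fields are used below.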


In [26]:
#Does choosing a different model affect our accuracy in predicting CHURN_IND (which model should we choose?)

print("1.Logistic Regression predicting CHURN_IND from all numerical fields-\n")
LogisticRegressionModel0(x_train1,y_train1,x_test1,y_test1)

1.Logistic Regression predicting CHURN_IND from all numerical fields-

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 1 0 0]


Testing accuracy: 0.79


In [20]:
print("1.K Nearest Neighbour predicting CHURN_IND from all numerical fields-\n")
KNNModel(x_train1,y_train1,x_test1,y_test1)

1.K Nearest Neighbour predicting CHURN_IND from all numerical fields-

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 0 0 0]


Testing accuracy: 0.77


In [21]:
print("1.GaussianNBModel predicting CHURN_IND from all numerical fields-\n")
GaussianNBModel(x_train1,y_train1,x_test1,y_test1)

1.GaussianNBModel predicting CHURN_IND from all numerical fields-

Actual CHURN_IND sample values from test dataset:
INDEX
20090    0
29283    0
23036    1
14911    0
33262    0
Name: CHURN_IND, dtype: int64

Corresponding Predicted CHURN_IND samples:
[0 0 0 0 0]


Testing accuracy: 0.80


In [22]:
print("From the results above, GaussianNB has the highest testing accuracy of the 3 models (0.80, against 0.79 for\n"
      "Logistic Regression and 0.77 for K Nearest Neighbour)."
      "\n\nHOWEVER, in the five test samples shown, GaussianNB and K Nearest Neighbour miss the one churned customer, while\n"
      "Logistic Regression predicts all five correctly. Five samples are too few to rank the models, but since Logistic\n"
      "Regression has almost identical testing accuracy to GaussianNB, we use the Logistic Regression model.")

From the results above, GaussianNB has the highest testing accuracy of the 3 models (0.80, against 0.79 for
Logistic Regression and 0.77 for K Nearest Neighbour).

HOWEVER, in the five test samples shown, GaussianNB and K Nearest Neighbour miss the one churned customer, while
Logistic Regression predicts all five correctly. Five samples are too few to rank the models, but since Logistic
Regression has almost identical testing accuracy to GaussianNB, we use the Logistic Regression model.
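
Whether a gap of 0.80 against 0.79 is meaningful depends on the size of the test set. A rough sketch of the binomial standard error of an accuracy estimate on the 9226 test rows; it uses the rounded accuracies printed above and treats test rows as independent, both of which are simplifying assumptions.

```python
import math

n_test = 9226                 # rows in the test split, from test.shape
acc_nb, acc_lr = 0.80, 0.79   # rounded testing accuracies printed above

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1 - acc) / n)

se_nb = accuracy_std_error(acc_nb, n_test)
se_lr = accuracy_std_error(acc_lr, n_test)
gap = acc_nb - acc_lr

print("standard error (GaussianNB):          {:.4f}".format(se_nb))
print("standard error (Logistic Regression): {:.4f}".format(se_lr))
print("gap between models:                   {:.4f}".format(gap))
```

Each standard error is about 0.004, so a 0.01 gap is small relative to the sampling noise, which is consistent with treating the two accuracies as almost identical.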
