## Mobile Customer Churn
In this Portfolio task you will work with some (fake but realistic) data on Mobile Customer Churn with the goal of characterising customers who churn and building a simple predictive model to predict churn from available features.

The data was generated (by Hume Winzar at Macquarie) based on a real dataset provided by Optus. The data is simulated but the column headings are the same. (Note that I'm not sure if all of the real relationships in this data are preserved so you need to be cautious in interpreting the results of your analysis here).

The data is provided in file MobileCustomerChurn.csv and column headings are defined in a file MobileChurnDataDictionary.csv (store these in the files folder in your project).

Your high-level goals in this notebook are to:

- look for significant clusters within the churn data; you might look separately at those who churn and those who don't, or group them all together.
- try to build and evaluate a predictive model for churn, i.e. predict the value of the CHURN_IND field in the data from some of the other fields.

In [83]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing

In [3]:

import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_ddd8810590c249348fe4ffccf137fd22 = 'https://s3-api.us-geo.objectstorage.softlayer.net'
else:
    endpoint_ddd8810590c249348fe4ffccf137fd22 = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'

client_ddd8810590c249348fe4ffccf137fd22 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential redacted; supply your own key
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_ddd8810590c249348fe4ffccf137fd22)

body = client_ddd8810590c249348fe4ffccf137fd22.get_object(Bucket='practiceaiampml-donotdelete-pr-dccezt0pe2cdoj',Key='MobileCustomerChurn.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

customers_df = pd.read_csv(body)
customers_df.head()


Unnamed: 0,INDEX,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,...,CONTRACT_STATUS,PREV_CONTRACT_DURATION,HANDSET_USED_BRAND,CHURN_IND,MONTHLY_SPEND,COUNTRY_METRO_REGION,STATE,RECON_SMS_NEXT_MTH,RECON_TELE_NEXT_MTH,RECON_EMAIL_NEXT_MTH
0,1,1,46,1,30,CONSUMER,46,54.54,NON BYO,15,...,OFF-CONTRACT,24,SAMSUNG,1,61.4,COUNTRY,WA,,,
1,2,2,60,3,55,CONSUMER,59,54.54,NON BYO,5,...,OFF-CONTRACT,24,APPLE,1,54.54,METRO,NSW,,,
2,3,5,65,1,29,CONSUMER,65,40.9,BYO,15,...,OFF-CONTRACT,12,APPLE,1,2.5,COUNTRY,WA,,,
3,4,6,31,1,51,CONSUMER,31,31.81,NON BYO,31,...,OFF-CONTRACT,24,APPLE,1,6.48,COUNTRY,VIC,,,
4,5,8,95,1,31,CONSUMER,95,54.54,NON BYO,0,...,OFF-CONTRACT,24,APPLE,1,100.22,METRO,NSW,,,


In [62]:
customers_df.shape

(46206, 22)

In [76]:
#Check if there are multiple rows with same customers, e.g. for multiple services
customers_df['CUST_ID'].nunique()

46206

In [24]:
customers_df.isnull().sum(axis=0)

INDEX                               0
CUST_ID                             0
ACCOUNT_TENURE                      0
ACCT_CNT_SERVICES                   0
AGE                                 0
CFU                                 0
SERVICE_TENURE                      0
PLAN_ACCESS_FEE                     0
BYO_PLAN_STATUS                     0
PLAN_TENURE                         0
MONTHS_OF_CONTRACT_REMAINING        0
LAST_FX_CONTRACT_DURATION           0
CONTRACT_STATUS                     0
PREV_CONTRACT_DURATION              0
HANDSET_USED_BRAND                  0
CHURN_IND                           0
MONTHLY_SPEND                       0
COUNTRY_METRO_REGION                1
STATE                               1
RECON_SMS_NEXT_MTH              17790
RECON_TELE_NEXT_MTH             17790
RECON_EMAIL_NEXT_MTH            17790
dtype: int64

In [25]:
100*customers_df.isnull().sum()/len(customers_df)

INDEX                            0.000000
CUST_ID                          0.000000
ACCOUNT_TENURE                   0.000000
ACCT_CNT_SERVICES                0.000000
AGE                              0.000000
CFU                              0.000000
SERVICE_TENURE                   0.000000
PLAN_ACCESS_FEE                  0.000000
BYO_PLAN_STATUS                  0.000000
PLAN_TENURE                      0.000000
MONTHS_OF_CONTRACT_REMAINING     0.000000
LAST_FX_CONTRACT_DURATION        0.000000
CONTRACT_STATUS                  0.000000
PREV_CONTRACT_DURATION           0.000000
HANDSET_USED_BRAND               0.000000
CHURN_IND                        0.000000
MONTHLY_SPEND                    0.000000
COUNTRY_METRO_REGION             0.002164
STATE                            0.002164
RECON_SMS_NEXT_MTH              38.501493
RECON_TELE_NEXT_MTH             38.501493
RECON_EMAIL_NEXT_MTH            38.501493
dtype: float64
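
The two missing-value cells above can be combined into a single summary table. A minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are illustrative names, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)

# Hypothetical toy data standing in for customers_df
toy = pd.DataFrame({
    "AGE": [30, 55, 29, 51],
    "STATE": ["WA", "NSW", None, "VIC"],
    "RECON_SMS_NEXT_MTH": [np.nan, np.nan, 1.0, np.nan],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(customers_df)` would put the three RECON_* columns (about 38.5% missing each) at the top of the table.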

In [4]:
customers_df.groupby(['BYO_PLAN_STATUS'])['CHURN_IND'].value_counts(normalize=True)

BYO_PLAN_STATUS  CHURN_IND
BYO              0            0.549436
                 1            0.450564
NON BYO          0            0.634813
                 1            0.365187
Name: CHURN_IND, dtype: float64

In [5]:
customers_df.groupby(['STATE'])['CHURN_IND'].value_counts(normalize=True)

STATE  CHURN_IND
ACT    0            0.660020
       1            0.339980
NSW    0            0.614987
       1            0.385013
NT     0            0.548387
       1            0.451613
QLD    0            0.595039
       1            0.404961
SA     0            0.661143
       1            0.338857
TAS    0            0.586093
       1            0.413907
VIC    0            0.626539
       1            0.373461
WA     0            0.583161
       1            0.416839
Name: CHURN_IND, dtype: float64

In [13]:
customers_df['CONTRACT_STATUS'].value_counts()

ON-CONTRACT     28281
OFF-CONTRACT    12460
NO-CONTRACT      5465
Name: CONTRACT_STATUS, dtype: int64

In [66]:
customers_df.groupby(['CONTRACT_STATUS'])['CHURN_IND'].value_counts(normalize=True)

CONTRACT_STATUS  CHURN_IND
NO-CONTRACT      0            0.527722
                 1            0.472278
OFF-CONTRACT     1            0.553291
                 0            0.446709
ON-CONTRACT      0            0.705986
                 1            0.294014
Name: CHURN_IND, dtype: float64

In [16]:
customers_df['CONTRACT_STATUS'].unique()

array(['OFF-CONTRACT', 'ON-CONTRACT', 'NO-CONTRACT'], dtype=object)

In [17]:
customers_df['CFU'].unique()

array(['CONSUMER', 'SMALL BUSINESS'], dtype=object)

In [18]:
customers_df.groupby(['CFU'])['CHURN_IND'].value_counts(normalize=True)

CFU             CHURN_IND
CONSUMER        0            0.605521
                1            0.394479
SMALL BUSINESS  0            0.666948
                1            0.333052
Name: CHURN_IND, dtype: float64

In [19]:
customers_df['HANDSET_USED_BRAND'].unique()

array(['SAMSUNG', 'APPLE', 'UNKNOWN', 'OTHER', 'GOOGLE', 'HUAWEI'],
      dtype=object)

In [20]:
customers_df.groupby(['HANDSET_USED_BRAND'])['CHURN_IND'].value_counts(normalize=True)

HANDSET_USED_BRAND  CHURN_IND
APPLE               0            0.636888
                    1            0.363112
GOOGLE              0            0.611987
                    1            0.388013
HUAWEI              0            0.586466
                    1            0.413534
OTHER               0            0.511693
                    1            0.488307
SAMSUNG             0            0.626609
                    1            0.373391
UNKNOWN             1            0.588696
                    0            0.411304
Name: CHURN_IND, dtype: float64

In [21]:
customers_df['COUNTRY_METRO_REGION'].unique()

array(['COUNTRY', 'METRO', nan], dtype=object)

In [22]:
customers_df.groupby(['COUNTRY_METRO_REGION'])['CHURN_IND'].value_counts(normalize=True)

COUNTRY_METRO_REGION  CHURN_IND
COUNTRY               0            0.565408
                      1            0.434592
METRO                 0            0.637372
                      1            0.362628
Name: CHURN_IND, dtype: float64
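
Because CHURN_IND is coded 0/1, its mean within a group is that group's churn rate, so the repeated `value_counts(normalize=True)` cells can be condensed. A sketch with a hypothetical helper `churn_rate_by` and made-up rows:

```python
import pandas as pd

def churn_rate_by(df, column, target="CHURN_IND"):
    """Churn rate and group size per category, highest churn first."""
    out = df.groupby(column)[target].agg(churn_rate="mean", n="size")
    return out.sort_values("churn_rate", ascending=False)

# Hypothetical toy data standing in for customers_df
toy = pd.DataFrame({
    "CONTRACT_STATUS": ["ON-CONTRACT", "ON-CONTRACT", "OFF-CONTRACT",
                        "OFF-CONTRACT", "NO-CONTRACT", "ON-CONTRACT"],
    "CHURN_IND": [0, 0, 1, 1, 1, 1],
})
rates = churn_rate_by(toy, "CONTRACT_STATUS")
print(rates)
```

Reporting the group size next to the rate helps when comparing categories of very different sizes, such as the states or handset brands above.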

In [61]:
customers_df.shape

(46206, 22)

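Let's count the churned customers whose account tenure equals their plan tenure, used here as a proxy for customers who left as soon as their tenure was over.

In [44]:
# Parenthesise each comparison: `&` binds more tightly than `==`
a = len(customers_df.loc[(customers_df['ACCOUNT_TENURE'] == customers_df['PLAN_TENURE']) & (customers_df['CHURN_IND'] == 1)])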

Next, let's count the churned customers whose account tenure exceeds their plan tenure, i.e. those who stayed on after their tenure was over but eventually left.

In [45]:
b = len(customers_df.loc[(customers_df['ACCOUNT_TENURE'] > customers_df['PLAN_TENURE']) & (customers_df['CHURN_IND'] == 1)])

In [48]:
tot_churn = len(customers_df.loc[(customers_df['CHURN_IND'] == 1)])

In [57]:
#Percentage of churned customers who left immediately after their Tenure was over:
print(a/tot_churn*100)
#Percentage of churned customers who continued after their Tenure was over but dropped eventually: 
print(b/tot_churn*100)

34.19336706014615
65.74480044969083
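
The same two percentages can be computed from named boolean masks, which keeps each comparison explicit and avoids operator-precedence surprises when combining conditions with `&`. A minimal sketch on made-up rows (the `toy` frame and the mask names are illustrative):

```python
import pandas as pd

# Hypothetical toy data standing in for customers_df
toy = pd.DataFrame({
    "ACCOUNT_TENURE": [46, 60, 31, 95, 12],
    "PLAN_TENURE":    [15,  5, 31,  0, 12],
    "CHURN_IND":      [ 1,  1,  1,  0,  0],
})

churned = toy["CHURN_IND"] == 1
same_tenure = toy["ACCOUNT_TENURE"] == toy["PLAN_TENURE"]
longer_tenure = toy["ACCOUNT_TENURE"] > toy["PLAN_TENURE"]

# Share of churned customers falling in each group, in percent
pct_same = 100 * (churned & same_tenure).sum() / churned.sum()
pct_longer = 100 * (churned & longer_tenure).sum() / churned.sum()
print(pct_same, pct_longer)
```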


In [68]:
Feature = customers_df.drop(['INDEX','CUST_ID','HANDSET_USED_BRAND', 'COUNTRY_METRO_REGION', 'STATE', 'RECON_SMS_NEXT_MTH','RECON_TELE_NEXT_MTH', 'RECON_EMAIL_NEXT_MTH'],axis=1)
Feature.head()

Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,CONTRACT_STATUS,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND
0,46,1,30,CONSUMER,46,54.54,NON BYO,15,0,24,OFF-CONTRACT,24,1,61.4
1,60,3,55,CONSUMER,59,54.54,NON BYO,5,0,24,OFF-CONTRACT,24,1,54.54
2,65,1,29,CONSUMER,65,40.9,BYO,15,0,12,OFF-CONTRACT,12,1,2.5
3,31,1,51,CONSUMER,31,31.81,NON BYO,31,0,24,OFF-CONTRACT,24,1,6.48
4,95,1,31,CONSUMER,95,54.54,NON BYO,0,0,24,OFF-CONTRACT,24,1,100.22


In [69]:
# Encode the CFU column: CONSUMER -> 0, SMALL BUSINESS -> 1
# (assigning the result back avoids relying on an in-place replace of a column selection)
Feature['CFU'] = Feature['CFU'].map({'CONSUMER': 0, 'SMALL BUSINESS': 1})
Feature.head()

Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,CONTRACT_STATUS,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND
0,46,1,30,0,46,54.54,NON BYO,15,0,24,OFF-CONTRACT,24,1,61.4
1,60,3,55,0,59,54.54,NON BYO,5,0,24,OFF-CONTRACT,24,1,54.54
2,65,1,29,0,65,40.9,BYO,15,0,12,OFF-CONTRACT,12,1,2.5
3,31,1,51,0,31,31.81,NON BYO,31,0,24,OFF-CONTRACT,24,1,6.48
4,95,1,31,0,95,54.54,NON BYO,0,0,24,OFF-CONTRACT,24,1,100.22


In [72]:
# Encode the BYO_PLAN_STATUS column: NON BYO -> 0, BYO -> 1
# (assigning the result back avoids relying on an in-place replace of a column selection)
Feature['BYO_PLAN_STATUS'] = Feature['BYO_PLAN_STATUS'].map({'NON BYO': 0, 'BYO': 1})
Feature.head()

Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,CONTRACT_STATUS,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND
0,46,1,30,0,46,54.54,0,15,0,24,OFF-CONTRACT,24,1,61.4
1,60,3,55,0,59,54.54,0,5,0,24,OFF-CONTRACT,24,1,54.54
2,65,1,29,0,65,40.9,1,15,0,12,OFF-CONTRACT,12,1,2.5
3,31,1,51,0,31,31.81,0,31,0,24,OFF-CONTRACT,24,1,6.48
4,95,1,31,0,95,54.54,0,0,0,24,OFF-CONTRACT,24,1,100.22


In [73]:
Feature.shape

(46206, 14)

In [77]:
Feature = pd.concat([Feature,pd.get_dummies(customers_df['CONTRACT_STATUS'])], axis=1)
Feature.head()

Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,CONTRACT_STATUS,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND,NO-CONTRACT,OFF-CONTRACT,ON-CONTRACT
0,46,1,30,0,46,54.54,0,15,0,24,OFF-CONTRACT,24,1,61.4,0,1,0
1,60,3,55,0,59,54.54,0,5,0,24,OFF-CONTRACT,24,1,54.54,0,1,0
2,65,1,29,0,65,40.9,1,15,0,12,OFF-CONTRACT,12,1,2.5,0,1,0
3,31,1,51,0,31,31.81,0,31,0,24,OFF-CONTRACT,24,1,6.48,0,1,0
4,95,1,31,0,95,54.54,0,0,0,24,OFF-CONTRACT,24,1,100.22,0,1,0


In [78]:
Feature = Feature.drop('CONTRACT_STATUS',axis = 1)
Feature.head()

Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND,NO-CONTRACT,OFF-CONTRACT,ON-CONTRACT
0,46,1,30,0,46,54.54,0,15,0,24,24,1,61.4,0,1,0
1,60,3,55,0,59,54.54,0,5,0,24,24,1,54.54,0,1,0
2,65,1,29,0,65,40.9,1,15,0,12,12,1,2.5,0,1,0
3,31,1,51,0,31,31.81,0,31,0,24,24,1,6.48,0,1,0
4,95,1,31,0,95,54.54,0,0,0,24,24,1,100.22,0,1,0


In [110]:
Feature.head()

Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND,NO-CONTRACT,OFF-CONTRACT,ON-CONTRACT
0,46,1,30,0,46,54.54,0,15,0,24,24,1,61.4,0,1,0
1,60,3,55,0,59,54.54,0,5,0,24,24,1,54.54,0,1,0
2,65,1,29,0,65,40.9,1,15,0,12,12,1,2.5,0,1,0
3,31,1,51,0,31,31.81,0,31,0,24,24,1,6.48,0,1,0
4,95,1,31,0,95,54.54,0,0,0,24,24,1,100.22,0,1,0


In [111]:
Feature.dtypes

ACCOUNT_TENURE                    int64
ACCT_CNT_SERVICES                 int64
AGE                              object
CFU                               int64
SERVICE_TENURE                    int64
PLAN_ACCESS_FEE                 float64
BYO_PLAN_STATUS                   int64
PLAN_TENURE                       int64
MONTHS_OF_CONTRACT_REMAINING      int64
LAST_FX_CONTRACT_DURATION         int64
PREV_CONTRACT_DURATION            int64
CHURN_IND                         int64
MONTHLY_SPEND                   float64
NO-CONTRACT                       uint8
OFF-CONTRACT                      uint8
ON-CONTRACT                       uint8
dtype: object

In [112]:
# Drop rows where AGE holds the spreadsheet error string '#VALUE!'.
# .copy() makes Feature an independent frame, so later column assignments
# do not raise SettingWithCopyWarning.
Feature = Feature[Feature['AGE'] != '#VALUE!'].copy()

In [113]:
Feature.shape

(46130, 16)

In [115]:
Feature['AGE'] = pd.to_numeric(Feature['AGE'])


In [116]:
Feature.dtypes

ACCOUNT_TENURE                    int64
ACCT_CNT_SERVICES                 int64
AGE                               int64
CFU                               int64
SERVICE_TENURE                    int64
PLAN_ACCESS_FEE                 float64
BYO_PLAN_STATUS                   int64
PLAN_TENURE                       int64
MONTHS_OF_CONTRACT_REMAINING      int64
LAST_FX_CONTRACT_DURATION         int64
PREV_CONTRACT_DURATION            int64
CHURN_IND                         int64
MONTHLY_SPEND                   float64
NO-CONTRACT                       uint8
OFF-CONTRACT                      uint8
ON-CONTRACT                       uint8
dtype: object
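
An alternative to filtering on the literal string `'#VALUE!'` is to let pandas coerce anything unparseable to NaN and then drop those rows, which also catches other stray non-numeric entries. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: AGE was read as text because of one bad cell
toy = pd.DataFrame({"AGE": ["30", "55", "#VALUE!", "51"],
                    "CHURN_IND": [1, 1, 0, 0]})

# errors="coerce" turns anything unparseable (such as '#VALUE!') into NaN
age = pd.to_numeric(toy["AGE"], errors="coerce")

# Replace the column, drop the rows that failed to parse, restore an integer dtype
clean = toy.assign(AGE=age).dropna(subset=["AGE"]).astype({"AGE": "int64"})
print(clean.dtypes)
```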

In [117]:
y = Feature['CHURN_IND'].values
y[0:5]

array([1, 1, 1, 1, 1])

In [118]:
X = Feature.drop('CHURN_IND', axis = 1)
X[0:5]

Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,PREV_CONTRACT_DURATION,MONTHLY_SPEND,NO-CONTRACT,OFF-CONTRACT,ON-CONTRACT
0,46,1,30,0,46,54.54,0,15,0,24,24,61.4,0,1,0
1,60,3,55,0,59,54.54,0,5,0,24,24,54.54,0,1,0
2,65,1,29,0,65,40.9,1,15,0,12,12,2.5,0,1,0
3,31,1,51,0,31,31.81,0,31,0,24,24,6.48,0,1,0
4,95,1,31,0,95,54.54,0,0,0,24,24,100.22,0,1,0


In [126]:
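# Standardise each feature to zero mean and unit variance
# (note: the scaler is fitted on all rows here, before the train/test split)
X = preprocessing.StandardScaler().fit(X).transform(X)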
X[0:5]

array([[ 0.00337818, -0.66446638, -0.74765229, -0.4247698 , -0.08407812,
         0.15244627, -0.55011724,  0.42458162, -0.98742672,  0.45426729,
         0.79650115, -0.18758882, -0.36613672,  1.64538911, -1.25605374],
       [ 0.42668002,  1.73264272,  0.89023552, -0.4247698 ,  0.16618151,
         0.15244627, -0.55011724, -0.59874983, -0.98742672,  0.45426729,
         0.79650115, -0.2810606 , -0.36613672,  1.64538911, -1.25605374],
       [ 0.57785924, -0.66446638, -0.81316781, -0.4247698 ,  0.28168595,
        -0.5016135 ,  1.81779433,  0.42458162, -0.98742672, -1.03955577,
        -0.29624802, -0.99013804, -0.36613672,  1.64538911, -1.25605374],
       [-0.4501595 , -0.66446638,  0.62817347, -0.4247698 , -0.37283923,
        -0.93749351, -0.55011724,  2.06191194, -0.98742672,  0.45426729,
         0.79650115, -0.93590806, -0.36613672,  1.64538911, -1.25605374],
       [ 1.4849346 , -0.66446638, -0.68213678, -0.4247698 ,  0.85920816,
         0.15244627, -0.55011724, -1.11041556, 

In [127]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (36904, 15) (36904,)
Test set: (9226, 15) (9226,)
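
The scaler above is fitted on every row before the split, so the test rows influence the scaling statistics. One common alternative, sketched here on synthetic data rather than the churn data, is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The variable names and the synthetic arrays are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and churn labels
rng = np.random.default_rng(4)
X_raw = rng.normal(size=(200, 5))
y_toy = (X_raw[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Split first, stratifying so both sets keep the same class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

# The pipeline fits the scaler on the training rows only,
# then applies the same scaling to the test rows at predict time
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print("Test accuracy:", acc)
```

With this layout the same `model` object can be passed to cross-validation utilities without leaking test statistics into the preprocessing.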


## K Nearest Neighbours

In [128]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [129]:
# KNN: try k = 1..9 and record accuracy and weighted F1 on the test set
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
f1_score = np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    f1_score[n-1] = metrics.f1_score(y_test, yhat, average='weighted') 
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

f1_score

array([0.68697784, 0.68782933, 0.71520006, 0.70609482, 0.72233409,
       0.71612849, 0.72436054, 0.71911948, 0.7276681 ])

In [130]:
print( "The best f1_score was with", f1_score.max(), "with k=", f1_score.argmax()+1)

The best f1_score was with 0.7276681025330027 with k= 9
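
The loop above chooses k by scoring each model on the test set, so the reported best score is likely to be somewhat optimistic. A hedged alternative is to pick k by cross-validation on the training data and keep the test set for a single final check. The sketch below uses synthetic data and scikit-learn's `GridSearchCV`; the arrays and the choice of 5 folds are assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

# Evaluate k = 1..9 with 5-fold cross-validation, scored by weighted F1
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 10))},
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```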


## Decision Tree

In [131]:

from sklearn.tree import DecisionTreeClassifier

In [132]:
churnTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
churnTree # display the configured estimator

DecisionTreeClassifier(criterion='entropy', max_depth=4)

In [133]:

churnTree.fit(X_train,y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

In [134]:

predTree = churnTree.predict(X_test)

In [135]:

print (predTree [0:5])
print (y_test [0:5])

[0 0 0 0 1]
[0 0 0 0 1]


In [136]:
print("Decision tree accuracy: ", metrics.accuracy_score(y_test, predTree))

Decision tree accuracy:  0.724474311727726


In [138]:
print("Decision tree f1_score: ", metrics.f1_score(y_test, predTree, average='weighted'))

Decision tree f1_score:  0.7077387426249984
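
A shallow decision tree is also useful for characterising churners, because its splits can be read as rules. The sketch below fits a small tree on synthetic data with a made-up churn rule and prints the splits with `export_text`; the feature values and the rule are assumptions, only the column names are borrowed from the dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
feature_names = ["MONTHS_OF_CONTRACT_REMAINING", "AGE", "MONTHLY_SPEND"]
X_toy = rng.uniform(0, 1, size=(400, 3))
# Made-up rule for illustration: little contract time left -> churn
y_toy = (X_toy[:, 0] < 0.3).astype(int)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Print the learned splits as nested if/else rules
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Relative contribution of each feature to the tree's splits
importances = dict(zip(feature_names, tree.feature_importances_))
print(importances)
```

In this notebook `X` becomes a NumPy array after scaling, so the column names would need to be captured beforehand, for example from `Feature.drop('CHURN_IND', axis=1).columns`.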


## Support Vector Machine

In [139]:

from sklearn import svm

In [141]:
kernel_list = ['rbf', 'linear', 'poly', 'sigmoid']

f1_score = np.zeros((4))
jacc_score = np.zeros((4))

for i, knl in enumerate(kernel_list):
    clf = svm.SVC(kernel=knl)
    clf.fit(X_train, y_train)
    
    yhat_svm = clf.predict(X_test)
    f1_score[i] = metrics.f1_score(y_test, yhat_svm, average='weighted') 
    jacc_score[i] = metrics.jaccard_score(y_test, yhat_svm, pos_label=1)
    
f1_score

array([0.73452581, 0.70113883, 0.72756369, 0.61096932])

In [142]:

print( "The best f1_score was with", f1_score.max(), "with kernel=", kernel_list[f1_score.argmax()])
print( "The best jaccard_score was with", jacc_score.max(), "with kernel=", kernel_list[jacc_score.argmax()])

The best f1_score was with 0.7345258118899542 with kernel= rbf
The best jaccard_score was with 0.4428909952606635 with kernel= rbf


In [143]:
jacc_score

array([0.442891  , 0.37815126, 0.4293517 , 0.33116279])
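
For reference, the Jaccard score of the positive (churn) class is TP / (TP + FP + FN): it ignores true negatives, and for a given class it is never higher than that class's F1. A small worked check on made-up labels, compared against scikit-learn's `jaccard_score`:

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Made-up true and predicted labels
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # churners correctly flagged
fp = np.sum((y_true == 0) & (y_pred == 1))  # non-churners flagged as churners
fn = np.sum((y_true == 1) & (y_pred == 0))  # churners that were missed

manual = tp / (tp + fp + fn)
library = jaccard_score(y_true, y_pred, pos_label=1)
print(manual, library)
```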

## Logistic Regression

In [144]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.01, solver='liblinear')

In [145]:

yhat_LR = LR.predict(X_test)
yhat_LR

array([0, 0, 0, ..., 0, 1, 1])

In [146]:
#Evaluating F1 Score
metrics.f1_score(y_test, yhat_LR, average='weighted')

0.717154500704285

In [147]:
#Evaluating Jaccard Score

metrics.jaccard_score(y_test, yhat_LR, pos_label=1)

0.43172690763052207
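
The cell above imports `confusion_matrix` but does not use it. A sketch of how it could complement the F1 and Jaccard scores, fitted on synthetic data with the same logistic regression settings; the data and the 240/60 split are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-in for the scaled features and churn labels
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression(C=0.01, solver="liblinear").fit(X_toy[:240], y_toy[:240])
y_hat = clf.predict(X_toy[240:])

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_toy[240:], y_hat, labels=[0, 1])
print(cm)
print(classification_report(y_toy[240:], y_hat, labels=[0, 1], zero_division=0))

# Predicted churn probability per customer, useful for ranking customers by risk
proba = clf.predict_proba(X_toy[240:])[:, 1]
```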

## K-Means Clustering

In [149]:
from sklearn.cluster import KMeans

In [150]:
clusterNum = 3
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(X)
labels = k_means.labels_
print(labels)

[1 1 1 ... 0 1 2]


In [151]:
# assign() returns a new frame with the cluster labels attached,
# which avoids SettingWithCopyWarning on a filtered frame
Feature = Feature.assign(Clus_km=labels)
Feature.head(5)


Unnamed: 0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,PREV_CONTRACT_DURATION,CHURN_IND,MONTHLY_SPEND,NO-CONTRACT,OFF-CONTRACT,ON-CONTRACT,Clus_km
0,46,1,30,0,46,54.54,0,15,0,24,24,1,61.4,0,1,0,1
1,60,3,55,0,59,54.54,0,5,0,24,24,1,54.54,0,1,0,1
2,65,1,29,0,65,40.9,1,15,0,12,12,1,2.5,0,1,0,1
3,31,1,51,0,31,31.81,0,31,0,24,24,1,6.48,0,1,0,1
4,95,1,31,0,95,54.54,0,0,0,24,24,1,100.22,0,1,0,1


In [152]:
Feature['Clus_km'].value_counts()

0    28188
1    12499
2     5443
Name: Clus_km, dtype: int64

In [None]:
# Average of every feature within each cluster
Feature.groupby('Clus_km').mean()
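
Beyond per-cluster averages, it can help to report each cluster's size and churn rate side by side. A minimal sketch on synthetic data with two well-separated groups; the `toy` frame, the two chosen features and `n_clusters=2` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: a low-spend/short-tenure group and a high-spend/long-tenure group
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "MONTHLY_SPEND": np.concatenate([rng.normal(30, 3, 50), rng.normal(90, 3, 50)]),
    "PLAN_TENURE": np.concatenate([rng.normal(5, 1, 50), rng.normal(20, 1, 50)]),
    "CHURN_IND": rng.integers(0, 2, 100),
})

# Cluster on standardised features only; the churn label is left out
features = ["MONTHLY_SPEND", "PLAN_TENURE"]
X_toy = StandardScaler().fit_transform(toy[features])
km = KMeans(n_clusters=2, init="k-means++", n_init=12, random_state=0).fit(X_toy)
toy = toy.assign(Clus_km=km.labels_)

# Size, churn rate and average spend per cluster
profile = toy.groupby("Clus_km").agg(
    n=("CHURN_IND", "size"),
    churn_rate=("CHURN_IND", "mean"),
    mean_spend=("MONTHLY_SPEND", "mean"),
)
print(profile)
```

Setting `random_state` makes the cluster labels reproducible between runs, which matters when the labels are written back into the frame.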