In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
churn_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
churn_data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


First we analyze the relationship between the continuous variables and churn, so we extract tenure, MonthlyCharges and Churn into a new dataframe. We leave out TotalCharges because it is highly correlated with MonthlyCharges.

In [3]:
continuous = churn_data[['tenure', 'MonthlyCharges', 'Churn']]
continuous.head()

Unnamed: 0,tenure,MonthlyCharges,Churn
0,1,29.85,No
1,34,56.95,No
2,2,53.85,Yes
3,45,42.3,No
4,2,70.7,Yes
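
The correlation claim above can be checked directly. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative, not taken from the Telco file). It assumes TotalCharges might be read as text with blank entries, which is why it is coerced to numeric first.

```python
import pandas as pd

# Hypothetical toy frame standing in for churn_data; the values are made up.
toy = pd.DataFrame({
    'tenure': [1, 34, 2, 45, 2, 60],
    'MonthlyCharges': [29.85, 56.95, 53.85, 42.30, 70.70, 99.00],
    # Stored as text here to mimic a column that may contain blank entries.
    'TotalCharges': ['29.85', '1889.5', '108.15', '1840.75', '151.65', ' '],
})

# Coerce to numeric; entries that cannot be parsed (such as ' ') become NaN.
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# Pairwise Pearson correlations; rows with NaN are skipped pair by pair.
corr = toy[['tenure', 'MonthlyCharges', 'TotalCharges']].corr()
print(corr.round(2))
```

Running the same two steps on churn_data would show how strongly TotalCharges tracks the other two columns in the real data.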


Next we convert Churn to a numeric variable.

In [4]:
# Map the labels to 0/1; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
continuous = continuous.assign(Churn=continuous['Churn'].map({'No': 0, 'Yes': 1}))
continuous.head()

Unnamed: 0,tenure,MonthlyCharges,Churn
0,1,29.85,0
1,34,56.95,0
2,2,53.85,1
3,45,42.3,0
4,2,70.7,1
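
Assigning into a column subset of another DataFrame can trigger pandas' SettingWithCopyWarning, because pandas cannot tell whether the subset is a view or a copy. One common way to sidestep this is to take an explicit copy before modifying it. The sketch below uses a made-up frame named `toy` as a stand-in for churn_data.

```python
import pandas as pd

# Hypothetical stand-in for churn_data with made-up rows.
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'MonthlyCharges': [29.85, 56.95, 53.85],
    'Churn': ['No', 'No', 'Yes'],
})

# .copy() makes the subset an independent DataFrame, so assigning to it is unambiguous.
subset = toy[['tenure', 'MonthlyCharges', 'Churn']].copy()
subset['Churn'] = subset['Churn'].map({'No': 0, 'Yes': 1})

print(subset['Churn'].tolist())  # [0, 0, 1]
print(toy['Churn'].tolist())     # ['No', 'No', 'Yes'], the original is untouched
```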


Check whether the dataset contains null values.

In [5]:
continuous.isnull().sum()

tenure            0
MonthlyCharges    0
Churn             0
dtype: int64

The dataset looks fine. Now we split it into a training set and a test set and fit our logistic regression model on the training set. After that, we use cross-validation to estimate how well the model generalizes.

In [6]:
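# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X = continuous.values[:, :2]   # features: tenure, MonthlyCharges
Y = continuous.values[:, -1]   # target: Churn (0/1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)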



In [7]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [8]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X_train, Y_train)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [9]:
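# Caution: logreg was fitted on the unscaled X_train but is scored on the standardized X_test_std,
# so this score is not comparable with the cross-validation accuracy; fit on X_train_std
# (or score on X_test) for a consistent test accuracy.
prepro = logreg.predict_proba(X_test_std)
logreg.score(X_test_std,Y_test)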

0.7388218594748048
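
A model should see features on the same scale at fit time and at score time. One way to guarantee that is to bundle the scaler and the classifier in a scikit-learn pipeline. This is a minimal sketch on synthetic data from `make_classification`, standing in for (tenure, MonthlyCharges); it is not the notebook's original approach, and the resulting accuracy says nothing about the Telco data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data; an assumption made purely for illustration.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The pipeline learns the scaling on the training data and reuses it when scoring.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1e5))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print('test accuracy:', test_accuracy)
```

Because the scaler lives inside the pipeline, the same object can also be passed to cross_val_score, in which case the scaling is refitted within each fold.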

In [10]:
# cross_val_score also lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg, X_train, Y_train, cv=10, scoring='accuracy')
print('accuracy of each fold is: ')
print(scores)
print('cv accuracy is:', scores.mean())

accuracy of each fold is: 
[0.7840708  0.79432624 0.78900709 0.79218472 0.76376554 0.78152753
 0.77264654 0.77797513 0.79573712 0.81527531]
cv accuracy is: 0.7866516030326369


This model is not bad: its cross-validated accuracy averages 78.67%. However, monthly charges alone tell us little about the clients, so the company may not know how to adjust its marketing strategy. Since the monthly charge depends on which services a customer has signed up for, we use those service variables to fit a new logistic model.
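
To judge whether an accuracy figure is good, it helps to compare it with a baseline that always predicts the most frequent class. The sketch below uses made-up labels (70 zeros and 30 ones); the real class balance of the churn data is not shown in this notebook, so the printed number is illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up imbalanced labels; an assumption, not the Telco class balance.
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X, y, cv=10, scoring='accuracy')
print('baseline cv accuracy:', baseline_scores.mean())
```

Applied to X_train and Y_train, the same call would give the accuracy a logistic model has to beat to be useful.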

In [11]:
data = churn_data[['tenure', 'PhoneService', 'InternetService', 'Churn']]
data.head()

Unnamed: 0,tenure,PhoneService,InternetService,Churn
0,1,No,DSL,No
1,34,Yes,DSL,No
2,2,Yes,DSL,Yes
3,45,No,DSL,No
4,2,Yes,Fiber optic,Yes


In [12]:
# Map Yes/No to 1/0; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
data = data.assign(
    Churn=data['Churn'].map({'No': 0, 'Yes': 1}),
    PhoneService=data['PhoneService'].map({'No': 0, 'Yes': 1}),
)
data.head()

Unnamed: 0,tenure,PhoneService,InternetService,Churn
0,1,0,DSL,0
1,34,1,DSL,0
2,2,1,DSL,1
3,45,0,DSL,0
4,2,1,Fiber optic,1


We must encode InternetService into dummy variables.

In [13]:
# One-hot encode InternetService (columns in alphabetical order: DSL, Fiber optic, No).
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22, so pandas is used instead.
dummies = pd.get_dummies(data['InternetService'], dtype=float)
# Keep the same column order as before: dummies first, then tenure, PhoneService, Churn
data = pd.concat([dummies, data[['tenure', 'PhoneService', 'Churn']]], axis=1).values.astype(float)
data

array([[ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0., 34.,  1.,  0.],
       [ 1.,  0.,  0.,  2.,  1.,  1.],
       ...,
       [ 1.,  0.,  0., 11.,  0.,  0.],
       [ 0.,  1.,  0.,  4.,  1.,  1.],
       [ 0.,  1.,  0., 66.,  1.,  0.]])
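
The array above has lost its column names, which makes the model harder to interpret later. As a sketch, the same kind of encoding can be done with `pd.get_dummies` on a DataFrame so that each indicator keeps a readable name. The toy frame is made up, and `drop_first=True` is an optional design choice (not something the original notebook does) that removes one redundant indicator.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    'tenure': [1, 34, 2, 45],
    'InternetService': ['DSL', 'Fiber optic', 'No', 'DSL'],
})

# One indicator column per category, named after the category.
full = pd.get_dummies(toy, columns=['InternetService'], dtype=float)
print(full.columns.tolist())

# drop_first=True omits the first category ('DSL'), which then acts as the reference level.
reduced = pd.get_dummies(toy, columns=['InternetService'],
                         drop_first=True, dtype=float)
print(reduced.columns.tolist())
```

The three indicators in the full encoding always sum to one, so with an intercept in the model one of them carries no extra information; dropping it makes the remaining coefficients read as differences from the reference category.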

In [14]:
X = data[:, :-1]
Y = data[:, -1]

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [16]:
logreg = linear_model.LogisticRegression(C=1e5)
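logreg.fit(X_train, Y_train)
# Caution: logreg is fitted on the unscaled X_train but scored on X_test_std;
# fit on X_train_std (or score on X_test) for a consistent test accuracy.
prepro = logreg.predict_proba(X_test_std)
logreg.score(X_test_std,Y_test)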

0.6210078069552875

In [17]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg, X_train, Y_train, cv=10, scoring='accuracy')
print('accuracy of each fold is: ')
print(scores)
print('cv accuracy is:', scores.mean())

accuracy of each fold is: 
[0.79115044 0.79432624 0.78900709 0.79040853 0.77264654 0.77797513
 0.77797513 0.78685613 0.79396092 0.81527531]
cv accuracy is: 0.7889581466752594


After 10-fold cross-validation, the new model's average accuracy is 78.90%, slightly higher than the 78.67% of the first model. The company now also knows which kinds of customers are likely to stay with its service and which customers it should try to attract more.
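
To see which kinds of customers a logistic model associates with churn, one option is to look at the fitted coefficients as odds ratios. The sketch below runs on synthetic data generated from an assumed rule (longer tenure lowers churn odds, a hypothetical `fiber` flag raises them); the names and numbers are illustrative and are not results from the Telco data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
tenure = rng.integers(1, 73, size=n)  # months, made up
fiber = rng.integers(0, 2, size=n)    # 1 = hypothetical fiber-optic customer

# Assumed generating rule, chosen only for this illustration.
logit = 0.5 - 0.05 * tenure + 1.0 * fiber
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = np.column_stack([tenure, fiber])
model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in churn odds per unit increase.
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(['tenure', 'fiber'], odds_ratios):
    print(name, round(float(ratio), 3))
```

An odds ratio below 1 means the feature is associated with lower churn odds, and one above 1 with higher churn odds, holding the other features fixed.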