# Predicting Customer Churn

Credit: This example is taken from [yhat](http://blog.yhat.com/posts/predicting-customer-churn-with-sklearn.html).

In the business world, one important measure is the "churn rate": the rate at which customers terminate their service. In telecommunications, for example, customer churn is one of the most important business health metrics. It matters enough that companies employ customer churn agents, whose job is to call customers who are likely to leave and offer them special deals to tempt them to stay.

In this example, we look at a dataset from the telecommunications industry. Each customer profile (row) has several variables, as shown in the table below.

The last column (`Churn?`) is the output variable we are interested in. We want to look at every customer in our database and predict how likely each one is to leave the business.

In [14]:
import pandas as pd
import numpy as np

# Load the data and print part of it to get a general idea

churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()

print("Column names:")
print(col_names)

to_show = col_names[:6] + col_names[-6:]

print("\nSample data:")
churn_df[to_show].head(6)

Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']

Sample data:


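| | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 11.01 | 10.0 | 3 | 2.7 | 1 | False. |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 11.45 | 13.7 | 3 | 3.7 | 1 | False. |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |
| 5 | AL | 118 | 510 | 391-8027 | yes | no | 9.18 | 6.3 | 6 | 1.7 | 0 | False. |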


### Data Preprocessing

Now that we have a rough idea of what the data look like, let's transform them into a usable format.

In [15]:
# Drop columns that we don't think would help, plus the target column
to_drop = ['State','Area Code','Phone','Churn?']
X = churn_df.drop(to_drop,axis=1)

# Separate target values, [Churn?] column, into another variable called churn_result
# Transform [Churn?] column to 1 and 0 (instead of 'True.' and 'False.')
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.',1,0)

# Transform the columns with 'yes'/'no' values and convert to 1 and 0
# X stores all remaining feature columns
yes_no_cols = ["Int'l Plan","VMail Plan"]
X[yes_no_cols] = X[yes_no_cols] == 'yes'
X = X.astype(float).to_numpy()
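
As a small illustration of the two transformations above, the following sketch applies them to a few made-up values. The arrays `churn_raw` and `plan_raw` are hypothetical stand-ins for the `Churn?` and `Int'l Plan` columns, not rows taken from `churn.csv`.

```python
import numpy as np

# Hypothetical raw values, mimicking the string format seen in the sample data
churn_raw = np.array(['False.', 'True.', 'False.'])
plan_raw = np.array(['no', 'yes', 'yes'])

# Map the target strings to 0/1: anything other than 'True.' becomes 0
y_toy = np.where(churn_raw == 'True.', 1, 0)

# Comparing against 'yes' gives booleans; casting to float gives 1.0/0.0
plan_toy = (plan_raw == 'yes').astype(float)

print(y_toy)     # [0 1 0]
print(plan_toy)  # [0. 1. 1.]
```

Note that `np.where` maps every non-matching string to 0, so a mistyped label (for example `'True'` without the trailing period) would silently produce all zeros. Printing the unique target labels, as the next cell does, is a cheap way to catch that.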

The following step is important. We take our 17-column feature table and scale it. `StandardScaler` standardizes every feature column, so that each column has zero mean and unit standard deviation.

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("Unique target labels:", np.unique(y))

Feature space holds 3333 observations and 17 features
Unique target labels: [0 1]
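
To make the scaling step concrete, here is a minimal sketch of the same standardization done by hand with NumPy. The matrix `X_toy` is a made-up example, and the sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical 4x2 feature matrix whose columns have very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

# Column-wise mean and population standard deviation (ddof=0)
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column
X_scaled = (X_toy - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, the two columns become identical even though they started a factor of 100 apart. That is the point of the step: no feature dominates a distance-based model such as K-nearest-neighbors just because of its units.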


### Training the model

Now we have our features all preprocessed and our target formatted. Let us have fun with some model training.

Remember that, to detect overfitting, we recommended dividing the data into training and testing sets: train the model on the training set and evaluate it on the testing set. This hold-out split is the simplest form of cross-validation.

To be even more robust, we can do k-fold cross-validation. The original data are randomly partitioned into k chunks. In each round of validation, one chunk is used as the testing set, while the other k-1 chunks are used as the training set. After all k rounds are completed, the model's performance is averaged across the rounds to give a single validation metric.
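
The partitioning described above can be sketched by hand with NumPy. This is only an illustration on toy sizes (`n_samples = 10`, `k = 5`, and the fixed seed are all assumptions), not the splitter used later in this example.

```python
import numpy as np

n_samples, k = 10, 5  # toy sizes chosen for illustration

rng = np.random.default_rng(0)        # fixed seed so the split is repeatable
indices = rng.permutation(n_samples)  # shuffle the row indices
folds = np.array_split(indices, k)    # k roughly equal chunks

for i, test_index in enumerate(folds):
    # All chunks except the i-th one form the training set for this round
    train_index = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("round", i, "train size", len(train_index), "test size", len(test_index))

# Across the k rounds, every row is used for testing exactly once
all_test = np.sort(np.concatenate(folds))
```

Because each row lands in exactly one testing chunk, every observation ends up with one prediction made by a model that never saw it during training.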


sklearn provides a simple way to do this with its `KFold` class.

The function below is written to run this validation quickly on any model class passed in as `clf_class`. It first constructs the k data chunks and then loops through them. In each round it constructs a model from the keyword arguments we specify in `**kwargs`, fits it on the training set, and stores its predictions for the testing set.


In [17]:
from sklearn.model_selection import KFold

def run_cv(X,y,clf_class,**kwargs):
    # Construct a k-fold splitter with 5 shuffled folds
    kf = KFold(n_splits=5,shuffle=True)
    y_pred = y.copy()
    
    # Iterate through folds
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with keyword arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        # Store the predictions for the rows held out in this fold
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

Another function, `accuracy`, takes the actual and predicted categories (churn or no churn) and compares them to give the model's accuracy score, a fraction between 0 and 1 (that is, 0% to 100%).

In [18]:
def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)
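
As a quick worked example of this calculation, the sketch below applies the same function to two short, made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical and not taken from the churn data).

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Element-wise comparison gives booleans; their mean is the fraction correct
    return np.mean(y_true == y_pred)

# Hypothetical labels: 1 = churned, 0 = stayed
y_true_toy = np.array([0, 0, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0])

# 4 of the 5 predictions match the true labels, so the score is 0.8
print(accuracy(y_true_toy, y_pred_toy))
```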

Then we will use the cross-validation function that we just wrote to test three classic classification models.

In [19]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN

print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X,y,SVC)))
print("Random forest:")
print("%.3f" % accuracy(y, run_cv(X,y,RF)))
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X,y,KNN)))

Support vector machines:
0.921
Random forest:
0.940
K-nearest-neighbors:
0.891


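Judged by cross-validated accuracy alone, random forest appears to be the best of the three models here. Because the folds are shuffled at random, the exact scores will vary slightly from run to run.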