# Machine Learning: Practical Application


In this tutorial we will build a simple model to predict whether a customer is about to churn.

Goals:
1. Explore the dataset
2. Build a simple predictive model
3. Iterate and improve your score


How to follow along:
    
- install [Anaconda Python](https://www.continuum.io/downloads) (or create a conda environment with miniconda)
- download and unzip `www.dataweekends.com/tdwi`
- `cd tdwi_machine_learning`
- `jupyter notebook`
    
    

We start by importing the necessary libraries:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 1) Explore the dataset

#### Data exploration

- Load the csv file into memory using Pandas
- Describe each attribute
    - is it discrete?
    - is it continuous?
    - is it a number?
    - is it text?
- Identify the target

Load the csv file into memory using Pandas

In [None]:
df = pd.read_csv('churn.csv')

What's the content of `df`?

In [None]:
df.head(3)

Describe each attribute (is it discrete? is it continuous? is it a number? is it text?)

In [None]:
df.info()

#### Mental notes so far:

- Dataset contains 7043 entries
- 1 Target column (```Churn```)
- 19 Features:
    - 4 numerical, 15 text
    - Some features probably binary
    - Some features categorical (more than 2 values)
    - No missing data

Target:

In [None]:
df['Churn'].value_counts()

Binary variable.

Approximately 1 in every 4 customers churns.

If we always predicted "no churn" we would be accurate 73.5% of the time. This is our benchmark: a model is only useful if it beats this accuracy.

In [None]:
# value_counts() sorts by frequency, so the first entry is the majority class
benchmark_accuracy = df['Churn'].value_counts().iloc[0] / len(df)
benchmark_accuracy
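
The same majority-class benchmark can also be obtained from scikit-learn's `DummyClassifier`. The sketch below is only an illustration: `X_toy` and `y_toy` are made-up stand-ins for the churn data, with 3 of every 4 labels set to `'No'`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels (not the churn dataset): 75% 'No', 25% 'Yes'
y_toy = np.array(['No', 'No', 'No', 'Yes'] * 25)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print("Baseline accuracy: %0.3f" % baseline_accuracy)
```

Any real model should be compared against this number rather than against 50%.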

Binary encode target

In [None]:
y = (df['Churn'] == 'Yes')

In [None]:
y.head(4)

In [None]:
y.value_counts()

Drop churn column from df

In [None]:
dfnochurn = df.drop('Churn', axis=1)

Feature cardinality

In [None]:
card = dfnochurn.apply(lambda x:len(x.unique()))
card

Some features are numerical, some are binary, some are categorical. Let's start with just the numerical features.

Copy numerical features to a DataFrame called `X`.

In [None]:
X = df[['tenure', 'MonthlyCharges', 'TotalCharges']].copy()

## 2) Build a simple model

Train / Test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
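
Because the target is imbalanced, it can help to keep the churn ratio the same in the training and test sets. A minimal sketch, assuming synthetic data (`X_toy`, `y_toy` are hypothetical and not part of the tutorial), using the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 25% positive labels
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([True] * 250 + [False] * 750)

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

print("Positive rate in train: %0.3f" % y_tr.mean())
print("Positive rate in test:  %0.3f" % y_te.mean())
```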

Let's use a Decision tree model

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)
model

Train the model

In [None]:
model.fit(X_train, y_train)

Calculate the accuracy score

In [None]:
my_score = model.score(X_test, y_test)

print("Classification Score: %0.3f" % my_score)
print("Benchmark Score: %0.3f" % benchmark_accuracy)

Very bad!

Let's try with a Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
my_score = model.score(X_test, y_test)

print("Classification Score: %0.3f" % my_score)
print("Benchmark Score: %0.3f" % benchmark_accuracy)

Barely better than the benchmark.

Print the confusion matrix for the random forest model (`model` now refers to the random forest trained above)

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

pd.DataFrame(cm, index=['No Churn', 'Churn'],
                 columns=['Pred No Churn', 'Pred Churn'])
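
Accuracy alone hides how the errors are distributed. As a small worked example, precision and recall for the positive class can be read off a confusion matrix. The labels below are invented for illustration and are not taken from the churn data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = churn, 0 = no churn)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_toy.ravel()

precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were caught

print("Precision: %0.3f" % precision)
print("Recall:    %0.3f" % recall)
```

The same values are returned by `precision_score(y_true, y_hat)` and `recall_score(y_true, y_hat)`.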

## 3) Iterate and improve

Now you have a basic pipeline. How can you improve the score? Try:
- rescale the numerical features:
    - can you use the log of Total Charges?
- add other features:
    - can you add the binary features to the model? See if you can create auxiliary boolean columns in `X` that reproduce the binary features in `dfnochurn`. For example, you could create a column called `IsMale` that is equal to `True` when `df['gender'] == 'Male'`.
    - can you add the categorical features to the model? To do this you will have to use the function `pd.get_dummies` to perform one-hot encoding of the categorical features.

- visual exploration:
    - can you display the histogram of the numerical features?
    - can you display the relative ratio of binary and categorical variables using pie charts?

- change the parameters of the model.
    - can you change the initialization of the decision tree or the random forest classifier to improve their score? You can check the documentation here:
        - http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
        - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- change the model itself. You can find many other models here:
  http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

Try to get the best score on the test set
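
One way to explore model parameters systematically is a grid search with cross-validation on the training set, keeping the test set for a final check. This is a sketch under assumptions: the data comes from `make_classification` and the parameter grid is an arbitrary example, not a recommendation for the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical imbalanced dataset standing in for the churn features
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=0)

# Small, illustrative grid of random forest parameters
param_grid = {'n_estimators': [20, 50], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X_tr, y_tr)  # cross-validates every combination on the training set

test_score = search.score(X_te, y_te)  # uses the refitted best model
print(search.best_params_)
print("Test accuracy: %0.3f" % test_score)
```

Selecting parameters on cross-validation folds rather than on the test set keeps the final test score an honest estimate.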

Log of Total Charges

In [None]:
X['TotalCharges'].min(), X['TotalCharges'].max()

In [None]:
X['LogTotalCharges'] = np.log10(2 + X['TotalCharges'])

In [None]:
X.head(3)
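
To see why a log transform can help, compare the skewness of a long-tailed variable before and after the transform. The series below is synthetic (drawn from a log-normal distribution) and only mimics the shape that charge totals often have; it is not the `TotalCharges` column.

```python
import numpy as np
import pandas as pd

# Hypothetical long-tailed values, similar in spirit to cumulative charges
rng = np.random.RandomState(0)
charges = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=1000))

# Same transform as above: shift by 2, then take the base-10 logarithm
log_charges = np.log10(2 + charges)

print("Skewness before: %0.2f" % charges.skew())
print("Skewness after:  %0.2f" % log_charges.skew())
```

Tree-based models are insensitive to monotonic rescaling, so the benefit shows up mostly with linear models and neural networks.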

Features with only 2 values => Binary

In [None]:
binary_features = card[card == 2].index
df[binary_features].head(3)

Create new binary features to represent them

In [None]:
X['IsMale'] = (df['gender'] == 'Male')
X['IsSeniorCitizen'] = (df['SeniorCitizen'] == 1)
X['HasPartner'] = (df['Partner'] == 'Yes')
X['HasDependents'] = (df['Dependents'] == 'Yes')
X['HasPhoneService'] = (df['PhoneService'] == 'Yes')
X['HasPaperlessBilling'] = (df['PaperlessBilling'] == 'Yes')

In [None]:
X.head(3)

Features with more than 2 values => Categorical

In [None]:
categorical_features = card[(card == 3) | (card == 4)].index
categorical_features

#### Visual exploration

Let's visually explore the distribution of each feature in order to decide how to treat it.

Distribution of Monthly charges

In [None]:
X['MonthlyCharges'].plot(kind='hist', bins=20)

Distribution of Total charges

In [None]:
X['TotalCharges'].plot(kind='hist', bins=20)

Distribution of Log Total charges

In [None]:
X['LogTotalCharges'].plot(kind='hist', bins=20)

Ratios of binary variables

In [None]:
binary = ["IsMale", "IsSeniorCitizen", "HasPartner", "HasDependents", "HasPhoneService", "HasPaperlessBilling"]
plt.figure(figsize=(10, 6))
for i, c in enumerate(binary):
    plt.subplot(2, 3, i + 1)
    X[c].value_counts().plot(kind='pie')
plt.tight_layout()

Ratios of categorical variables

In [None]:
plt.figure(figsize=(12, 6))
for i, c in enumerate(categorical_features):
    plt.subplot(2, 5, i + 1)
    df[c].value_counts().plot(kind='pie', title=c)
    plt.ylabel('')
plt.tight_layout()

Create categorical dummy columns (one-hot encoding)

In [None]:
X_categorical = pd.get_dummies(df[categorical_features])
X_categorical.head(3)
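
To make the encoding concrete, here is what `pd.get_dummies` does to a single categorical column. The tiny frame `toy` and its values are hypothetical, chosen only to show the shape of the output.

```python
import pandas as pd

# Hypothetical categorical column with three distinct values
toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year',
                                 'Two year', 'One year']})

# One new indicator column per distinct value, named <column>_<value>
dummies = pd.get_dummies(toy)
print(dummies)
```

Each row has exactly one indicator set, which is where the name one-hot encoding comes from.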

Combine features

In [None]:
X = pd.concat([X, X_categorical], axis=1)

In [None]:
X.head(3)

In [None]:
X.shape

We have 41 features and 7043 data points

## Final Model building

Train / Test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

Random Forest Classifier

In [None]:
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
my_score = model.score(X_test, y_test)

print("Classification Score: %0.3f" % my_score)
print("Benchmark Score: %0.3f" % benchmark_accuracy)

Print the confusion matrix for the random forest model

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

pd.DataFrame(cm, index=['No Churn', 'Churn'],
                 columns=['Pred No Churn', 'Pred Churn'])

Let's rank the features by importance

In [None]:
pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False).head(10)

Try other models

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
my_score = model.score(X_test, y_test)

print("Classification Score: %0.4f" % my_score)
print("Benchmark Score: %0.4f" % benchmark_accuracy)
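
Logistic regression is sensitive to feature scale, and features measured in very different units can slow or prevent convergence. A possible remedy, sketched here on synthetic data (`X_toy`, `y_toy` are assumptions, not the churn features), is to standardize the inputs inside a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset; one feature is put on a much larger scale
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   random_state=0)
X_toy[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=0)

# The scaler is fitted on the training data only, then applied to the test data
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_tr, y_tr)

scaled_score = scaled_lr.score(X_te, y_te)
print("Scaled logistic regression accuracy: %0.4f" % scaled_score)
```

Wrapping the scaler and the model in one pipeline avoids leaking test-set statistics into training.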

In [None]:
model = RandomForestClassifier(n_estimators=100, max_depth=4)
model.fit(X_train, y_train)
my_score = model.score(X_test, y_test)

print("Classification Score: %0.4f" % my_score)
print("Benchmark Score: %0.4f" % benchmark_accuracy)

In [None]:
from sklearn.neural_network import MLPClassifier

model = MLPClassifier((100, 50, 20), batch_size=32, max_iter=2000)

model.fit(X_train, y_train)
my_score = model.score(X_test, y_test)

print("Classification Score: %0.4f" % my_score)
print("Benchmark Score: %0.4f" % benchmark_accuracy)

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

model = VotingClassifier([('lr', LogisticRegression()),
                          ('rf', RandomForestClassifier(n_estimators=100, max_depth=4)),
                          ('mlp', MLPClassifier((100, 50, 20), batch_size=32, max_iter=2000)),
                          ('svc', SVC(probability=True))],
                         voting='soft',
                         n_jobs=-1)

model.fit(X_train, y_train)
my_score = model.score(X_test, y_test)

print("Classification Score: %0.4f" % my_score)
print("Benchmark Score: %0.4f" % benchmark_accuracy)
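
A single 80/20 split gives a somewhat noisy score, so small differences between models may not be meaningful. A rough sketch of a more stable comparison, assuming synthetic data from `make_classification` rather than the churn features, uses `cross_val_score`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical dataset standing in for X and y
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   random_state=0)

cv_model = RandomForestClassifier(n_estimators=100, max_depth=4,
                                  random_state=0)

# Five train/validation splits, one accuracy score per split
scores = cross_val_score(cv_model, X_toy, y_toy, cv=5)
print("Accuracy: %0.4f +/- %0.4f" % (scores.mean(), scores.std()))
```

The spread across folds gives a sense of how much of a score difference is just noise.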

*Copyright &copy; 2017 Dataweekends & CATALIT LLC*