# Machine Learning: Practical Application


In this tutorial we will build a simple model to predict whether a customer is about to churn.

Goals:
1. Explore the dataset
2. Build a simple predictive model
3. Iterate and improve your score


How to follow along:
    
- install [Anaconda Python](https://www.continuum.io/downloads) (or create a conda environment with miniconda)
- download and unzip `www.dataweekends.com/tdwi`
- `cd tdwi_machine_learning`
- `jupyter notebook`
    
    

We start by importing the necessary libraries:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 1) Explore the dataset

#### Data exploration

- Load the csv file into memory using Pandas
- Describe each attribute
    - is it discrete?
    - is it continuous?
    - is it a number?
    - is it text?
- Identify the target

Load the csv file into memory using Pandas

In [2]:
df = pd.read_csv('churn.csv')

What's the content of `df`?

In [3]:
df.head(3)

| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | Nophoneservice | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electroniccheck | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | Oneyear | No | Mailedcheck | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailedcheck | 53.85 | 108.15 | Yes |


Describe each attribute (is it discrete? is it continuous? is it a number? is it text?)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null float64
Churn               7043 non-null object
dtypes: float64(2), int64(2), object(16)
memory usage: 1.1+ MB


#### Mental notes so far:

- Dataset contains 7043 entries
- 1 Target column (```Churn```)
- 19 Features:
    - 4 numerical, 15 text
    - Some features are probably binary
    - Some features are categorical (more than 2 values)
    - No missing data

Target:

In [5]:
df['Churn'].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

Binary variable.

Approximately 1 in every 4 customers churns (1869 of 7043, about 26.5%).

If we always predicted no churn, we would be accurate 73.5% of the time. This majority-class accuracy is our benchmark.

In [6]:
# Look up the majority class by label; integer indexing on a label-indexed Series is deprecated
benchmark_accuracy = df['Churn'].value_counts()['No'] / len(df)
benchmark_accuracy

0.7346301292063041
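
The benchmark can also be computed without naming the majority class in advance. A minimal sketch, assuming a target column rebuilt from the class counts shown above (the names `churn`, `class_shares`, and `majority_class` are illustrative, not part of the notebook):

```python
import pandas as pd

# Rebuild a target column with the class counts reported above:
# 5174 "No" and 1869 "Yes".
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

# Share of each class; the largest share is the accuracy of a model
# that always predicts the majority class.
class_shares = churn.value_counts(normalize=True)
majority_class = class_shares.idxmax()
benchmark_accuracy = class_shares.max()

print(majority_class, round(benchmark_accuracy, 3))
```

With the real data, replacing `churn` with `df['Churn']` should give the same number as the cell above.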

Binary encode target

In [7]:
y = (df['Churn'] == 'Yes')

In [8]:
y.head(4)

0    False
1    False
2     True
3    False
Name: Churn, dtype: bool

In [9]:
y.value_counts()

False    5174
True     1869
Name: Churn, dtype: int64

Drop churn column from df

In [10]:
dfnochurn = df.drop('Churn', axis=1)

Feature cardinality

In [11]:
card = dfnochurn.apply(lambda x:len(x.unique()))
card

gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
dtype: int64
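
One way to turn the cardinalities into feature groups is a simple rule of thumb: two distinct values means binary, more than two non-numeric values means categorical, and the rest is numerical. The sketch below applies that heuristic to a small made-up frame; `toy` and its rows are hypothetical stand-ins for `dfnochurn`, and the thresholds are an assumption, not a rule from the dataset.

```python
import pandas as pd

# Toy stand-in for `dfnochurn`; the rows are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

card = toy.nunique()  # number of distinct values per column
numeric = set(toy.select_dtypes(include="number").columns)

# Heuristic grouping based on cardinality and dtype.
binary_cols = [c for c in toy.columns if card[c] == 2]
categorical_cols = [c for c in toy.columns if card[c] > 2 and c not in numeric]
numerical_cols = [c for c in toy.columns if card[c] > 2 and c in numeric]

print(binary_cols, categorical_cols, numerical_cols)
```

Under this heuristic a column such as `SeniorCitizen`, which is stored as an integer but has only two values, would land in the binary group.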

Some features are numerical, some are binary, some are categorical. Let's start with just the numerical features.

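Copy the three continuous numerical features to a DataFrame called `X` (`SeniorCitizen` is stored as an integer but takes only two values, so we leave it out for now).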

In [12]:
X = df[['tenure', 'MonthlyCharges', 'TotalCharges']].copy()

## 2) Build a simple model

Train / Test split

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

Let's use a Decision tree model

In [14]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)
model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Train the model

In [15]:
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Calculate the accuracy score

In [16]:
my_score = model.score(X_test, y_test)

print("Classification Score: %0.3f" % my_score)
print("Benchmark Score: %0.3f" % benchmark_accuracy)

Classification Score: 0.708
Benchmark Score: 0.735


Very bad! The decision tree is less accurate than always predicting no churn.

Let's try with a Random Forest Classifier

In [17]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
my_score = model.score(X_test, y_test)

print("Classification Score: %0.3f" % my_score)
print("Benchmark Score: %0.3f" % benchmark_accuracy)

Classification Score: 0.743
Benchmark Score: 0.735


Barely better than the benchmark.

Print the confusion matrix for the random forest model (`model` now refers to the random forest trained above)

In [18]:
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

pd.DataFrame(cm, index=['No Churn', 'Churn'],
                 columns=['Pred No Churn', 'Pred Churn'])

| | Pred No Churn | Pred Churn |
|---|---|---|
| No Churn | 902 | 139 |
| Churn | 223 | 145 |
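
Accuracy alone hides how the model treats each class. As a worked example, the counts in the table above give the accuracy, the recall (share of actual churners that were caught) and the precision (share of predicted churners that really churned). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the table above:
# rows are actual classes, columns are predicted classes.
cm = np.array([[902, 139],
               [223, 145]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
recall = tp / (tp + fn)           # share of actual churners that were caught
precision = tp / (tp + fp)        # share of predicted churners that really churned

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
# prints: accuracy=0.743 recall=0.394 precision=0.511
```

So the model matches the 0.743 accuracy reported earlier, but it catches fewer than 4 in 10 of the customers who actually churn.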


## 3) Iterate and improve

Now you have a basic pipeline. How can you improve the score? Try:
- rescale the numerical features:
    - can you use the log of Total Charges?
- add other features:
    - can you add the binary features to the model? See if you can create auxiliary boolean columns in `X` that reproduce the binary features in `dfnochurn`. For example, you could create a column called `IsMale` that is equal to `True` when `df['gender'] == 'Male'`.
    - can you add the categorical features to the model? To do this you will have to use the function `pd.get_dummies` to perform one-hot encoding of the categorical features.

- visual exploration:
    - can you display the histogram of the numerical features?
    - can you display the relative ratio of binary and categorical variables using pie charts?

- change the parameters of the model.
    - can you change the initialization of the decision tree or the random forest classifier to improve their score? You can check the documentation here:
        - http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
        - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- change the model itself. You can find many other models here:
  http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
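
The feature ideas in the list above could be sketched as follows. This is only one possible approach, shown on a three-row toy frame with values in the style of the dataset; `toy`, `LogTotalCharges`, and `contract_dummies` are hypothetical names, and `np.log1p` is a choice made here so that a charge of 0 does not produce `-inf`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df`; the rows are made up for illustration.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "TotalCharges": [29.85, 1889.5, 108.15],
})

X = pd.DataFrame(index=toy.index)

# Rescale a skewed numerical feature; log1p(x) = log(1 + x) stays finite at 0.
X["LogTotalCharges"] = np.log1p(toy["TotalCharges"])

# Binary feature as a boolean column.
X["IsMale"] = toy["gender"] == "Male"

# One-hot encode a categorical feature and append the new columns.
contract_dummies = pd.get_dummies(toy["Contract"], prefix="Contract")
X = pd.concat([X, contract_dummies], axis=1)

print(X.columns.tolist())
```

The same pattern should carry over to the full dataset by swapping `toy` for `df` and repeating the steps for each binary and categorical column.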

Try to get the best score on the test set.
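
When comparing many parameter settings, choosing the winner by its test score tends to make that score look better than it really is. One common alternative is to pick parameters with cross-validation on the training set and score the test set once at the end. The sketch below shows this with `GridSearchCV` on synthetic data from `make_classification`; the data and the grid values are assumptions for illustration, not tuned recommendations for the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn features and target.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hypothetical grid; the values are starting points, not recommendations.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)  # cross-validation uses the training set only

print(search.best_params_)
print("Test accuracy: %0.3f" % search.score(X_test, y_test))
```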

*Copyright &copy; 2017 Dataweekends & CATALIT LLC*