# 3. Machine Learning for Classification

We'll use logistic regression to predict churn.


## 3.1 Churn prediction project

* Dataset: https://www.kaggle.com/blastchar/telco-customer-churn
* Raw CSV: https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv


## 3.2 Data preparation

* Download the data, read it with pandas
* Look at the data
* Make column names and values look uniform
* Check if all the columns read correctly
* Check if the churn variable needs any preparation
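
The "make column names and values look uniform" step can be sketched on a tiny, made-up DataFrame. The column names and values below are illustrative assumptions, not rows from the real dataset:

```python
import pandas as pd

# Toy frame with inconsistent naming; the values are made up for illustration.
toy = pd.DataFrame({
    'Transmission Type': ['MANUAL', 'Automated Manual'],
    'Engine HP': [335.0, 300.0],
})

# Lowercase the column names and replace spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(' ', '_')

# Apply the same normalization to the values of every string column.
string_columns = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]
for c in string_columns:
    toy[c] = toy[c].str.lower().str.replace(' ', '_')

print(toy.columns.tolist())
print(toy['transmission_type'].tolist())
```

Numeric columns are left untouched because only string columns are selected for the value cleanup.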

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [2]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv'

In [3]:
df = pd.read_csv(data)
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [4]:
df.head().T

Unnamed: 0,0,1,2,3,4
Make,BMW,BMW,BMW,BMW,BMW
Model,1 Series M,1 Series,1 Series,1 Series,1 Series
Year,2011,2011,2011,2011,2011
Engine Fuel Type,premium unleaded (required),premium unleaded (required),premium unleaded (required),premium unleaded (required),premium unleaded (required)
Engine HP,335.0,300.0,300.0,230.0,230.0
Engine Cylinders,6.0,6.0,6.0,6.0,6.0
Transmission Type,MANUAL,MANUAL,MANUAL,MANUAL,MANUAL
Driven_Wheels,rear wheel drive,rear wheel drive,rear wheel drive,rear wheel drive,rear wheel drive
Number of Doors,2.0,2.0,2.0,2.0,2.0
Market Category,"Factory Tuner,Luxury,High-Performance","Luxury,Performance","Luxury,High-Performance","Luxury,Performance",Luxury


In [5]:
df.dtypes

Make                  object
Model                 object
Year                   int64
Engine Fuel Type      object
Engine HP            float64
Engine Cylinders     float64
Transmission Type     object
Driven_Wheels         object
Number of Doors      float64
Market Category       object
Vehicle Size          object
Vehicle Style         object
highway MPG            int64
city mpg               int64
Popularity             int64
MSRP                   int64
dtype: object

In [6]:
df.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
      dtype='object')

In [7]:
df = df[['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
         'Transmission Type', 'Vehicle Style', 'highway MPG',
         'city mpg', 'MSRP']].copy()  # copy so later in-place edits don't act on a slice

In [8]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

In [9]:
df.isnull().sum()

make                  0
model                 0
year                  0
engine_hp            69
engine_cylinders     30
transmission_type     0
vehicle_style         0
highway_mpg           0
city_mpg              0
msrp                  0
dtype: int64

In [10]:
df.fillna(0, inplace=True)

In [11]:
df.rename(columns={'msrp': 'price'}, inplace=True)

In [12]:
df.price.mean()

40594.737032063116

In [13]:
df['transmission_type'].value_counts()

transmission_type
automatic           8266
manual              2935
automated_manual     626
direct_drive          68
unknown               19
Name: count, dtype: int64

In [14]:
numerical = ['year', 'engine_hp', 'engine_cylinders',
            'highway_mpg', 'city_mpg']

categorical = ['make', 'model', 'transmission_type', 'vehicle_style']

In [15]:
df[categorical].nunique()

make                  48
model                914
transmission_type      5
vehicle_style         16
dtype: int64

## 3.7 Feature importance: Correlation

How about numerical columns?

* Correlation coefficient - https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
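
As a rough illustration of what the coefficient measures, the sketch below computes Pearson's r by hand and compares it with pandas. The two series are made-up numbers chosen only to keep the arithmetic easy to follow:

```python
import numpy as np
import pandas as pd

# Made-up values, only to illustrate the formula.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)), with centered values.
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Series.corr uses the Pearson coefficient by default.
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one.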

In [16]:
df[numerical].corr()

Unnamed: 0,year,engine_hp,engine_cylinders,highway_mpg,city_mpg
year,1.0,0.338714,-0.040708,0.25824,0.198171
engine_hp,0.338714,1.0,0.774851,-0.415707,-0.424918
engine_cylinders,-0.040708,0.774851,1.0,-0.614541,-0.587306
highway_mpg,0.25824,-0.415707,-0.614541,1.0,0.886829
city_mpg,0.198171,-0.424918,-0.587306,0.886829,1.0


In [17]:
df['above_average'] = df['price'] > df.price.mean()

## 3.3 Setting up the validation framework

* Perform the train/validation/test split with Scikit-Learn
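
A two-step split applies the second `test_size` to what is left after the first split, so the resulting proportions are worth checking. The sketch below uses a plain array of 1000 row indices as a stand-in for a DataFrame (an assumption made to keep the example small):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for 1000 DataFrame rows

# First hold out 20% for testing, leaving 80% as "full train".
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Taking 20% of the remaining 80% gives a validation set of 16% of the data.
train_a, val_a = train_test_split(full_train, test_size=0.2, random_state=42)

# Taking 25% of the remaining 80% gives a validation set of 20% of the data.
train_b, val_b = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train_a), len(val_a), len(test))  # 640 160 200
print(len(train_b), len(val_b), len(test))  # 600 200 200
```

Either choice is workable; the point is that `test_size=0.2` twice yields a 64/16/20 split rather than 60/20/20.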

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.20, random_state=42)

In [20]:
len(df_train), len(df_val), len(df_test)

(7624, 1907, 2383)

In [21]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [22]:
y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

del df_train['above_average']
del df_val['above_average']
del df_test['above_average']

## 3.6 Feature importance: Mutual information

Mutual information is a concept from information theory. It tells us how much
we can learn about one variable if we know the value of another.

* https://en.wikipedia.org/wiki/Mutual_information
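
To get a feel for the scale of the score, here is a minimal sketch with two made-up categorical variables: one that fully determines a binary target and one that is independent of it. The variable names are hypothetical:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Made-up data: a binary target and two categorical variables.
target      = [0, 0, 1, 1]
informative = ['a', 'a', 'b', 'b']   # fully determines the target
unrelated   = ['x', 'y', 'x', 'y']   # independent of the target

mi_informative = mutual_info_score(informative, target)
mi_unrelated = mutual_info_score(unrelated, target)

# For a balanced binary target the maximum is ln(2), about 0.693 (in nats).
print(round(mi_informative, 3), round(mi_unrelated, 3))
```

The score is symmetric and has no fixed upper bound across variables, so it is most useful for ranking features against the same target.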

In [23]:
from sklearn.metrics import mutual_info_score

In [24]:
def mutual_info_churn_score(series):
    return mutual_info_score(series, df.above_average)

In [25]:
mi = df[categorical].apply(mutual_info_churn_score)
mi.sort_values(ascending=False)

model                0.457469
make                 0.237731
vehicle_style        0.082633
transmission_type    0.019954
dtype: float64

In [26]:
df['above_average'] = df['price'] > df.price.mean()

## 3.8 One-hot encoding

* Use Scikit-Learn to encode categorical features

In [27]:
from sklearn.feature_extraction import DictVectorizer

In [28]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

## 3.10 Training logistic regression with Scikit-Learn

* Train a model with Scikit-Learn
* Apply it to the validation dataset
* Calculate the accuracy
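
The three steps above can be sketched on synthetic data. The example also shows how the predicted probability relates to the learned intercept and weights through the sigmoid function. The data-generating rule is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target depends mostly on the first two features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0)

clf = LogisticRegression(solver='liblinear', random_state=42)
clf.fit(X, y)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Linear score (intercept + weights . features) passed through the sigmoid.
score = clf.intercept_[0] + X @ clf.coef_[0]
proba_manual = sigmoid(score)

# predict_proba returns one column per class; column 1 is the positive class.
proba_sklearn = clf.predict_proba(X)[:, 1]

# Accuracy at a 0.5 threshold (on the training data here, for brevity).
accuracy = ((proba_sklearn >= 0.5) == y).mean()
print(np.allclose(proba_manual, proba_sklearn), accuracy)
```

In the notebook the accuracy is computed on the validation set instead, which gives a more honest estimate than training accuracy.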

In [29]:
from sklearn.linear_model import LogisticRegression

In [30]:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [31]:
model.intercept_[0]

-0.10435160548921182

In [32]:
model.coef_[0].round(2)

array([ 0.02, -0.06,  0.03,  0.08,  1.  ,  0.7 ,  0.27,  2.52,  0.09,
        2.08,  0.  , -0.41,  2.07, -1.41, -1.21, -2.87,  0.15, -0.26,
       -1.49,  0.19, -0.74, -1.26, -0.  , -2.18,  0.09, -1.49,  0.  ,
        1.93,  1.19,  1.17,  2.54,  0.49,  0.  , -1.2 ,  0.  ,  1.17,
       -1.05, -0.95, -0.62, -0.06, -1.74,  1.43,  0.38,  0.9 , -0.12,
        0.12, -1.72, -1.08,  1.68, -0.64, -0.66,  0.88, -0.08, -0.02,
       -0.  , -0.36, -0.  , -0.45, -0.  , -0.  , -0.  , -0.43, -0.13,
       -0.3 , -0.17, -0.04, -0.09, -0.05, -0.  , -0.  , -0.51,  0.07,
        0.01, -0.07,  0.63, -0.16, -0.02,  0.04,  0.  ,  0.7 ,  0.39,
        0.7 ,  0.3 , -0.01, -0.02, -0.75, -0.07, -0.02, -0.12,  0.01,
       -0.06,  0.  ,  0.  ,  0.01,  0.  ,  0.54,  0.29, -0.07, -0.64,
       -0.  ,  0.  ,  0.  , -0.01,  0.18,  0.09, -0.01, -0.  , -0.01,
        0.1 , -0.01, -0.35, -0.  , -0.39, -0.03,  0.67,  0.29, -0.24,
        0.23,  0.1 , -0.05, -0.08, -0.04,  0.49, -0.18, -0.01, -0.03,
       -0.07, -0.02,

In [33]:
y_pred = model.predict_proba(X_val)[:, 1]

In [34]:
churn_decision = (y_pred >= 0.5)

In [35]:
round((y_val == churn_decision).mean(), 2)

0.94

In [36]:
model = LogisticRegression(solver='liblinear',
                           C=10,
                           max_iter=1000,
                           random_state=42
                        )
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
original = round((y_val == churn_decision).mean(), 2)
features = categorical + numerical
for i in features:
    dv = DictVectorizer(sparse=False)
    cols = features.copy()
    cols.remove(i)
    train_dict = df_train[cols].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    val_dict = df_val[cols].to_dict(orient='records')
    X_val = dv.transform(val_dict)
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    # liblinear is chosen explicitly here; recent scikit-learn versions
    # default to solver='lbfgs' when no solver is given
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    churn_decision = (y_pred >= 0.5)
    new_accuracy = (y_val == churn_decision).mean()
    print(f"{i}: {abs(original - new_accuracy)}")

make: 0.007561615102254948
model: 0.022852648138437237
transmission_type: 0.0024016780283167005
vehicle_style: 0.007121132669113739
year: 0.007037231253277487
engine_hp: 0.011316203460933316
engine_cylinders: 0.004415312008390182
highway_mpg: 0.003890928159412721
city_mpg: 0.00336654431043526
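
One possible way to organize this kind of leave-one-feature-out comparison is to wrap training and scoring in a helper so that no shared variables are overwritten between runs. The sketch below is a variation rather than the notebook's exact procedure: it runs on synthetic data with hypothetical column names (`signal`, `noise`, `style`), compares against an unrounded baseline, and keeps the sign of the difference instead of taking the absolute value:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def validation_accuracy(df_tr, y_tr, df_va, y_va, columns):
    """Fit on `columns` only and return accuracy on the validation part."""
    vec = DictVectorizer(sparse=False)
    X_tr = vec.fit_transform(df_tr[columns].to_dict(orient='records'))
    X_va = vec.transform(df_va[columns].to_dict(orient='records'))
    clf = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    clf.fit(X_tr, y_tr)
    return (clf.predict(X_va) == y_va).mean()

# Synthetic data: only `signal` carries information about the target.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'signal': rng.normal(0, 1, n),
    'noise': rng.normal(0, 1, n),
    'style': rng.choice(['coupe', 'sedan'], n),
})
y = (toy['signal'] > 0).values
toy_train, toy_val = toy.iloc[:300], toy.iloc[300:]
y_tr, y_va = y[:300], y[300:]

all_columns = ['signal', 'noise', 'style']
baseline = validation_accuracy(toy_train, y_tr, toy_val, y_va, all_columns)

# Positive drop: accuracy got worse without the feature, so it was useful.
drops = {}
for col in all_columns:
    kept = [c for c in all_columns if c != col]
    drops[col] = baseline - validation_accuracy(toy_train, y_tr, toy_val, y_va, kept)

print(baseline, drops)
```

Keeping the sign distinguishes features whose removal hurts accuracy from features whose removal slightly helps.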


In [37]:
import numpy as np

In [38]:
y_train_ridge = np.log1p(df_train.price)
y_val_ridge  = np.log1p(df_val.price)
y_test_ridge  = np.log1p(df_test.price)
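
`np.log1p(x)` computes `log(1 + x)`, which compresses the long tail of prices and is defined at zero. Predictions made on this scale can be mapped back with `np.expm1`. A quick check on a few made-up prices:

```python
import numpy as np

# Made-up prices, including zero to show that log1p handles it.
prices = np.array([0.0, 2000.0, 46135.0])

log_prices = np.log1p(prices)    # log(1 + price)
restored = np.expm1(log_prices)  # exp(value) - 1, the inverse transform

print(log_prices.round(3))
print(restored.round(2))
```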

In [39]:
def rmse(y, y_pred):
    error = y_pred - y
    mse = (error ** 2).mean()
    return np.sqrt(mse)

In [40]:
from sklearn.linear_model import Ridge
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha,
                  solver="sag",
                  random_state=42
                ).fit(X_train, y_train_ridge)
    y_pred = model.predict(X_val)
    rmse_val = rmse(y_val_ridge, y_pred)
    print(f"{alpha}: {round(rmse_val, 3)}")

0: 0.485
0.01: 0.485
0.1: 0.485
1: 0.485
10: 0.485

## 3.13 Summary

* Feature importance can be assessed with risk, mutual information, and correlation
* One-hot encoding can be implemented with `DictVectorizer`
* Logistic regression is a linear model, like linear regression
* The output of logistic regression is a probability
* Interpretation of the weights is similar to linear regression

## 3.14 Explore more

More things

* Try to exclude least useful features


Use scikit-learn in last week's project

* Re-implement the train/val/test split in last week's project using scikit-learn
* Also, instead of our own linear regression, use `LinearRegression` (not regularized) and `Ridge` (regularized). Find the best regularization parameter for `Ridge`
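
A minimal sketch of that exercise, assuming synthetic data in place of last week's dataset (the features, true weights, and list of `alpha` values are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real project data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.3, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1)

def rmse(y_true, y_hat):
    return np.sqrt(((y_hat - y_true) ** 2).mean())

# Unregularized baseline.
scores = {'linear': rmse(y_va, LinearRegression().fit(X_tr, y_tr).predict(X_va))}

# Try several regularization strengths and score each on the validation set.
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    scores[alpha] = rmse(y_va, model.predict(X_va))

ridge_scores = {a: s for a, s in scores.items() if a != 'linear'}
best_alpha = min(ridge_scores, key=ridge_scores.get)
print(scores)
print('best alpha:', best_alpha)
```

The best `alpha` should be chosen on the validation set and the final model checked once on the test set.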

Other projects

* Lead scoring - https://www.kaggle.com/ashydv/leads-dataset
* Default prediction - https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

