<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/HOGENT-ML/course/blob/main/720-exercise_churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

In [1]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation

from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

![](img/penguins.png)

Churn prediction is one of the classic machine learning applications. Companies want to predict how likely a customer or employee is to leave, so that those who are "at risk" can receive special treatment. The dataset used in this exercise contains historical data on bank customers. For each customer, we know whether they left ("Exited") or not.

In [2]:
churn = pd.read_csv('https://raw.githubusercontent.com/HOGENT-ML/course/main/datasets/churn.csv')
churn.head()

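|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |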


Get some general info about the dataset (type of each column, null values, ...)

In [3]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


Perform basic data cleaning and preparation. 

Tip: use the solution of the exercise "demographic student score" as a source of inspiration. 

Remove the columns you don't need

In [4]:
churn.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

Is this a skewed (imbalanced) dataset?

In [5]:
churn['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64
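
One way to quantify the imbalance is to turn the counts into proportions. The sketch below re-enters the two counts printed above by hand, so it runs without the dataset, and derives the accuracy of a trivial classifier that always predicts "not exited". The name `baseline_accuracy` is introduced here for illustration only.

```python
import pandas as pd

# Class counts copied by hand from the value_counts() output above
counts = pd.Series({0: 7963, 1: 2037}, name="Exited")

# Share of each class in the dataset
proportions = counts / counts.sum()

# Accuracy of always predicting the majority class (0 = stayed)
baseline_accuracy = proportions.max()

print(proportions)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported later in the exercise is best read against this baseline of roughly 80% rather than against 50%.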

What is X and what is y?

In [6]:
X = churn.drop(['Exited'], axis=1)
y = churn['Exited']

Define the data preparation for the categorical and numerical columns.

Setting remainder='passthrough' means that all columns not listed in "transformers" are passed through without transformation, instead of being dropped.

In [7]:
from sklearn.preprocessing import OneHotEncoder
categorical_ix = X.select_dtypes(include=['object']).columns
print(categorical_ix)

numerical_ix = X.select_dtypes(include=['int64','float64']).columns
print(numerical_ix)

Index(['Geography', 'Gender'], dtype='object')
Index(['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary'],
      dtype='object')


In [8]:
col_transform = ColumnTransformer(transformers=[('cat',OneHotEncoder(),categorical_ix),
                                                ('num',MinMaxScaler(),numerical_ix)],
                                  remainder='passthrough')

What is X_train, y_train, X_test, y_test?

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
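
Because only about 20% of the customers exited, a purely random split can give the train and test sets slightly different class ratios. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses made-up arrays (`X_demo`, `y_demo`) with an 80/20 class ratio to show that both parts keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 negatives and 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class ratio equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

print(len(y_tr), y_tr.mean())  # training part: size and share of positives
print(len(y_te), y_te.mean())  # test part: size and share of positives
```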

Fit a LogisticRegression, a Support Vector Machine with a 3rd-degree polynomial kernel, a Decision Tree and a Random Forest, each with default parameters. Which one gives the best accuracy?

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = [
    ('lr', LogisticRegression(random_state=42)),
    ('svc_poly',SVC(kernel='poly', degree=3, random_state=42)),
    ('tree', DecisionTreeClassifier(random_state=42)),
    ('rf', RandomForestClassifier(random_state=42))
]

for key, clf in classifiers:
    pipeline = Pipeline(steps=[('prep', col_transform),(key,clf)])
    acc = np.mean(cross_val_score(pipeline, X_train, y_train, cv=3, scoring='accuracy'))
    print(f'{key}: {acc}')



lr: 0.8118666666666666
svc_poly: 0.8498666666666667
tree: 0.7906666666666666
rf: 0.8588
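
The loop above reports only the mean over the three folds. `cross_val_score` returns one score per fold, so the spread is also available and helps judge whether a small gap between two models is meaningful. A minimal sketch on synthetic data (generated with `make_classification`, not the churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, used only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# One accuracy value per fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=3, scoring="accuracy"
)

print(scores)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```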


Does a soft voting classifier using the above classifiers perform better?

In [12]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(estimators=classifiers, voting='soft')
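# Soft voting averages predict_proba outputs; SVC only provides predict_proba when probability=True
voting_clf.named_estimators['svc_poly'].probability = True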
pipeline = Pipeline(steps=[('prep', col_transform),('voting',voting_clf)])
# pipeline.fit(X_train, y_train)
acc = np.mean(cross_val_score(pipeline, X_train, y_train, cv=3, scoring='accuracy'))
print(f'Voting: {acc}')

Voting: 0.8513333333333333
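
Soft voting averages the class probabilities predicted by each estimator and picks the class with the highest average, whereas hard voting counts one vote per estimator. The sketch below uses hypothetical probabilities for a single sample (the numbers are invented) to show a case where the two rules disagree.

```python
import numpy as np

# Hypothetical predict_proba outputs of three classifiers for ONE sample
# (columns: P(class 0), P(class 1))
probas = np.array([
    [0.90, 0.10],   # very confident in class 0
    [0.45, 0.55],   # weakly prefers class 1
    [0.45, 0.55],   # weakly prefers class 1
])

# Soft voting: average the probabilities, then take the most likely class
avg = probas.mean(axis=0)
soft_vote = int(avg.argmax())

# Hard voting: each classifier votes for its own most likely class
hard_vote = int(np.bincount(probas.argmax(axis=1)).argmax())

print(avg)        # [0.6 0.4]
print(soft_vote)  # 0
print(hard_vote)  # 1
```

One confident estimator can therefore outweigh two hesitant ones under soft voting, which is why each estimator needs to expose probabilities.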


Continue with the best of the 4 individual classifiers above and apply grid search to find the best parameter combination.

What's the best parameter combination and the corresponding accuracy?

In [16]:
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('preprocessing', col_transform),
    ('rf', RandomForestClassifier(random_state=42))
])

param_grid = [
    {'rf__bootstrap':[True, False],'rf__n_estimators':[30,50,100,200], 'rf__max_features':[4,6,8,10]}
]

gridsearch = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')

gridsearch.fit(X_train, y_train)




In [21]:
print(gridsearch.best_params_)
print(gridsearch.best_score_)

{'rf__bootstrap': True, 'rf__max_features': 6, 'rf__n_estimators': 100}
0.8597333333333333
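
It helps to know how much work a grid search implies before running it. The sketch below counts the candidate combinations in the grid used above with `ParameterGrid` and multiplies by the number of folds. The names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = [
    {'rf__bootstrap': [True, False],
     'rf__n_estimators': [30, 50, 100, 200],
     'rf__max_features': [4, 6, 8, 10]}
]

n_candidates = len(ParameterGrid(param_grid))  # 2 * 4 * 4 combinations
n_fits = n_candidates * 3                      # one fit per candidate per fold (cv=3)

print(n_candidates)  # 32
print(n_fits)        # 96
```

With the default `refit=True`, one additional fit of the best combination on the full training set follows these cross-validation fits.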


What is the accuracy score on the test set and what are the most important features?

In [22]:
from sklearn.metrics import accuracy_score

# No random_state is set here, so the accuracy below can vary slightly between runs
rf = RandomForestClassifier(max_features=6, n_estimators=100, bootstrap=True)

pipeline = Pipeline(steps=[('prep', col_transform),('rf', rf)])
pipeline.fit(X_train, y_train)

acc = pipeline.score(X_test,y_test)

print(f'Accuracy= {acc}')



Accuracy= 0.8684
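
As an alternative to retyping the best parameters, a fitted `GridSearchCV` (with the default `refit=True`) exposes the winning model, already refit on the whole training set, as `best_estimator_`. The sketch below shows this on synthetic data with a deliberately tiny grid. All names and values are illustrative, not those of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20]},   # tiny illustrative grid
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with the best parameters
best_model = search.best_estimator_
test_acc = best_model.score(X_te, y_te)

print(search.best_params_)
print(f"Test accuracy: {test_acc:.3f}")
```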


In [23]:
rf.feature_importances_

array([0.01026872, 0.02215481, 0.01030942, 0.01201913, 0.01143353,
       0.13886179, 0.22904918, 0.07399237, 0.14761231, 0.13526635,
       0.01756125, 0.0479812 , 0.14348995])

In [24]:
col_transform.get_feature_names_out()

array(['cat__Geography_France', 'cat__Geography_Germany',
       'cat__Geography_Spain', 'cat__Gender_Female', 'cat__Gender_Male',
       'num__CreditScore', 'num__Age', 'num__Tenure', 'num__Balance',
       'num__NumOfProducts', 'num__HasCrCard', 'num__IsActiveMember',
       'num__EstimatedSalary'], dtype=object)

In [25]:
for score, name in zip(rf.feature_importances_, col_transform.get_feature_names_out()):
    print(round(score,2), name)

0.01 cat__Geography_France
0.02 cat__Geography_Germany
0.01 cat__Geography_Spain
0.01 cat__Gender_Female
0.01 cat__Gender_Male
0.14 num__CreditScore
0.23 num__Age
0.07 num__Tenure
0.15 num__Balance
0.14 num__NumOfProducts
0.02 num__HasCrCard
0.05 num__IsActiveMember
0.14 num__EstimatedSalary
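
The list is easier to read once it is sorted, and the one-hot columns of a single original feature can be summed to get one figure per feature. The sketch below re-enters the importances and names printed above by hand, so it runs without the fitted model.

```python
import pandas as pd

# Values and names copied from the outputs above
importances = pd.Series({
    "cat__Geography_France": 0.01026872,
    "cat__Geography_Germany": 0.02215481,
    "cat__Geography_Spain": 0.01030942,
    "cat__Gender_Female": 0.01201913,
    "cat__Gender_Male": 0.01143353,
    "num__CreditScore": 0.13886179,
    "num__Age": 0.22904918,
    "num__Tenure": 0.07399237,
    "num__Balance": 0.14761231,
    "num__NumOfProducts": 0.13526635,
    "num__HasCrCard": 0.01756125,
    "num__IsActiveMember": 0.0479812,
    "num__EstimatedSalary": 0.14348995,
})

# Most important features first
ranked = importances.sort_values(ascending=False)
print(ranked.round(3))

# Total importance of the one-hot encoded Geography columns
geography_total = importances.filter(like="Geography").sum()
print(f"Geography (all categories): {geography_total:.3f}")
```

Age comes out on top, followed by Balance, EstimatedSalary, CreditScore and NumOfProducts, which are close to each other.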


Does AdaBoost or stacking lead to better accuracy on the test set?

For stacking you can use the same estimators as for voting, but give the best classifier the optimal parameter combination you found above.

In [27]:
from sklearn.ensemble import AdaBoostClassifier

X_train_prep = col_transform.fit_transform(X_train)
X_test_prep = col_transform.transform(X_test)

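# Note: algorithm="SAMME.R" was deprecated in scikit-learn 1.4; on newer versions, omit this argument
ada_clf = AdaBoostClassifier(RandomForestClassifier(n_estimators=100, max_features=6, random_state=42),
                              n_estimators=200,algorithm="SAMME.R", learning_rate=0.5, random_state=42)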

ada_clf.fit(X_train_prep, y_train)

print(ada_clf.score(X_test_prep, y_test))



0.8632
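
The cell above boosts a full random forest, which is already a strong learner. AdaBoost is more commonly combined with weak learners, and scikit-learn's default base estimator is a decision stump (a tree with `max_depth=1`). The sketch below shows that default setup on synthetic data as a point of comparison. It is not the configuration used in the exercise, and its score says nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Default base estimator: a decision stump (DecisionTreeClassifier(max_depth=1))
ada_demo = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_demo.fit(X_tr, y_tr)

demo_acc = ada_demo.score(X_te, y_te)
print(f"Depth of first base learner: {ada_demo.estimators_[0].max_depth}")
print(f"Test accuracy on synthetic data: {demo_acc:.3f}")
```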


In [29]:
from sklearn.ensemble import StackingClassifier

classifiers = [
    ('lr', LogisticRegression(random_state=42)),
    ('svc_poly',SVC(kernel='poly', degree=3, random_state=42)),
    ('tree', DecisionTreeClassifier(random_state=42)),
    ('rf', RandomForestClassifier(max_features=6, n_estimators=100, bootstrap=True, random_state=42))
]

stacking_clf = StackingClassifier(estimators=classifiers, 
                                  final_estimator=RandomForestClassifier(max_features=6, 
                                                                         n_estimators=100, bootstrap=True, 
                                                                         random_state=42),
                                  cv=3)

stacking_clf.fit(X_train_prep, y_train)

print(stacking_clf.score(X_test_prep, y_test))

0.8552


Conclusion: which model delivers the best results? 