# Foundations of Statistical Analysis and Machine Learning - Final exam exercise 2

### Guidelines

The exam consists of two exercises (in separate notebooks), each with its own data set. In total, there are three independent parts to complete:
* Exercise 1 (8 points): regression
* Exercise 2 part 1 (9 points): classification
* Exercise 2 part 2 (3 points): clustering

You can work directly in the notebooks. At the end of the 2-hour exam, you will have 5 minutes to upload them to Teams.

Don't forget that you have to complete BOTH notebooks to get the maximum grade.

Many questions can be tackled even if the previous ones are not completed or not correct.

The exam is long, but don't worry if you cannot complete all of the questions; do as much as you can. If you get stuck at some point, don't panic: just move on to the next question.

Avoid "naive" copy-pasting: you will not understand what you are doing, and that will cause problems in the following questions. Notebooks cluttered with useless code mindlessly copied from previous examples will be penalized. Moreover, similarities between students' work are easy to spot.

Good luck!

## Exercise 2

Here are some libraries that could be useful in the exercises.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn 

We will work on the provided data set BankChurners that gathers information on the customers of a bank.

In [2]:
# Loading the original data set
data = pd.read_csv("BankChurners.csv")
data.head(10)

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1,Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,...,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,9.3e-05,0.99991
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,...,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,5.7e-05,0.99994
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,...,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,2.1e-05,0.99998
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,...,3313.0,2517,796.0,1.405,1171,20,2.333,0.76,0.000134,0.99987
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,...,4716.0,0,4716.0,2.175,816,28,2.5,0.0,2.2e-05,0.99998
5,713061558,Existing Customer,44,M,2,Graduate,Married,$40K - $60K,Blue,36,...,4010.0,1247,2763.0,1.376,1088,24,0.846,0.311,5.5e-05,0.99994
6,810347208,Existing Customer,51,M,4,Unknown,Married,$120K +,Gold,46,...,34516.0,2264,32252.0,1.975,1330,31,0.722,0.066,0.000123,0.99988
7,818906208,Existing Customer,32,M,0,High School,Unknown,$60K - $80K,Silver,27,...,29081.0,1396,27685.0,2.204,1538,36,0.714,0.048,8.6e-05,0.99991
8,710930508,Existing Customer,37,M,3,Uneducated,Single,$60K - $80K,Blue,36,...,22352.0,2517,19835.0,3.355,1350,24,1.182,0.113,4.5e-05,0.99996
9,719661558,Existing Customer,48,M,2,Graduate,Single,$80K - $120K,Blue,36,...,11656.0,1677,9979.0,1.524,1441,32,0.882,0.144,0.000303,0.9997


Here is some code to execute to prepare the data set.

In [3]:
# Removing the last two columns that are useless
data = data.iloc[:,:-2]

# Encoding the customer churn
data.Attrition_Flag = data.Attrition_Flag.replace({'Attrited Customer':1,'Existing Customer':0})

# Encoding the gender
data.Gender = data.Gender.replace({'F':1,'M':0})

# Dropping rows with unknown values
data = data.drop(data[data.Education_Level=='Unknown'].index, axis=0)
data = data.drop(data[data.Income_Category=='Unknown'].index, axis=0)

# Encoding the education level with an (ordered) scale
data.Education_Level = data.Education_Level.replace({'Uneducated':0,'High School':1,'College':2,'Graduate':3,'Post-Graduate':4,'Doctorate':5})

# Encoding the income category with the mean value of each income interval
data.Income_Category = data.Income_Category.replace({'Less than $40K':20,'$40K - $60K':50,'$60K - $80K':70,'$80K - $120K':100,'$120K +':150})

# Encoding the marital status (one-hot, as 0/1 integers; 'Unknown' is dropped)
data = pd.concat([data,pd.get_dummies(data['Marital_Status'], dtype=int).drop(columns=['Unknown'])],axis=1)

# Encoding the card category (one-hot, as 0/1 integers)
data = pd.concat([data,pd.get_dummies(data['Card_Category'], prefix='Card', dtype=int)], axis=1)
# data = pd.concat([data,pd.get_dummies(data['Card_Category']).drop(columns=['Platinum'])],axis=1)
# data.drop(columns = ['Education_Level','Income_Category','Marital_Status','Card_Category','CLIENTNUM'],inplace=True)

# Removing the useless columns
data = data.drop(columns = ['Marital_Status','Card_Category','CLIENTNUM'])

data.head(10)

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Income_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Divorced,Married,Single,Card_Blue,Card_Gold,Card_Platinum,Card_Silver
0,0,45,0,3,1,70,39,5,1,3,...,42,1.625,0.061,0,1,0,1,0,0,0
1,0,49,1,5,3,20,44,6,1,2,...,33,3.714,0.105,0,0,1,1,0,0,0
2,0,51,0,3,3,100,36,4,1,0,...,20,2.333,0.0,0,1,0,1,0,0,0
3,0,40,1,4,1,20,34,3,4,1,...,20,2.333,0.76,0,0,0,1,0,0,0
4,0,40,0,3,0,70,21,5,1,0,...,28,2.5,0.0,0,1,0,1,0,0,0
5,0,44,0,2,3,50,36,3,1,2,...,24,0.846,0.311,0,1,0,1,0,0,0
7,0,32,0,0,1,70,27,2,2,2,...,36,0.714,0.048,0,0,0,0,0,0,1
8,0,37,0,3,0,70,36,5,2,0,...,24,1.182,0.113,0,0,1,1,0,0,0
9,0,48,0,2,3,100,36,6,3,3,...,32,0.882,0.144,0,0,1,1,0,0,0
10,0,42,0,5,0,150,31,5,3,2,...,42,0.68,0.217,0,0,0,1,0,0,0


You can assume that the data set is cleaned and prepared now.

## PART 1: Predicting the churn (9 points)

Churn is an important phenomenon for EPIBank. You are asked to build a model that can detect profiles of clients who are likely to churn (i.e. leave the bank). EPIBank would like to use it to offer these clients promotions in order to retain them.

Attrition_Flag corresponds to the customer churn and is our target here. The value is 0 when the customer is still with the bank, and 1 when the customer has left the bank. <br>
The other columns will be considered as predictors (or features).

### 1) Prepare y (for the target) and X (for the predictors)

In [None]:
X = 

In [None]:
y =
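
As a minimal sketch of separating a target from its predictors, the toy frame below stands in for the prepared data set (its values are made up, and `toy`, `X_toy` and `y_toy` are hypothetical names). On the real data, the `Unnamed: 0` column looks like a leftover row index, so it may be worth dropping it as well.

```python
import pandas as pd

# Toy frame standing in for the prepared data set (values are made up)
toy = pd.DataFrame({
    "Attrition_Flag": [0, 1, 0, 1],
    "Customer_Age": [45, 49, 51, 40],
    "Total_Trans_Ct": [42, 33, 20, 20],
})

# The target is a single column; the predictors are everything else
y_toy = toy["Attrition_Flag"]
X_toy = toy.drop(columns=["Attrition_Flag"])

print(X_toy.columns.tolist())
```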

### 2) Plot the density function of _Customer_Age_ with one color for the customers who have left (Attrition_Flag = 1) and another color for the customers who are still in the bank (Attrition_Flag = 0). Plot the same for _Total_Trans_Ct_ and _Total_Revolving_Bal_
### (3 plots expected)
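
One possible way to overlay per-class densities is sketched below on synthetic data. It approximates the density with a normalized histogram (`density=True`) rather than a kernel density estimate. The helper `plot_density_by_class` and the toy frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_density_by_class(df, feature, target="Attrition_Flag", bins=20):
    """Overlay one normalized histogram of `feature` per value of `target`."""
    fig, ax = plt.subplots()
    for value, color in [(0, "tab:blue"), (1, "tab:orange")]:
        ax.hist(
            df.loc[df[target] == value, feature],
            bins=bins,
            density=True,   # normalize so each histogram integrates to 1
            alpha=0.5,      # transparency keeps both classes visible
            color=color,
            label=f"{target} = {value}",
        )
    ax.set_xlabel(feature)
    ax.set_ylabel("Density")
    ax.legend()
    return ax

# Synthetic stand-in for the real data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Attrition_Flag": rng.integers(0, 2, size=200),
    "Customer_Age": rng.normal(46, 8, size=200),
})
ax = plot_density_by_class(toy, "Customer_Age")
```

Calling the helper once per feature would give one figure per predictor.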

### 3) Plot with stacked bars the distribution of _Gender_ among customers who have left (Attrition_Flag = 1) and customers who are still in the bank (Attrition_Flag = 0). Plot the same for _Education_Level_
### (2 plots expected)
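
A stacked bar chart of a categorical feature per class can be built from a contingency table. The sketch below uses `pd.crosstab` on a made-up toy frame. Passing `normalize="index"` to `crosstab` would show proportions instead of counts, which can be easier to compare when the classes are imbalanced.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for the real columns (values are made up)
toy = pd.DataFrame({
    "Attrition_Flag": [0, 0, 0, 1, 1, 0, 1, 0],
    "Gender":         [0, 1, 1, 1, 0, 0, 1, 1],
})

# Rows: churn class; columns: gender; cells: number of customers
counts = pd.crosstab(toy["Attrition_Flag"], toy["Gender"])

# One bar per churn class, stacked by gender
ax = counts.plot(kind="bar", stacked=True)
ax.set_ylabel("Number of customers")
```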

### 4) When you look at the figures of questions 2 and 3, which predictor(s) among  _Customer_Age_ , _Total_Trans_Ct_ , _Total_Revolving_Bal_ , _Gender_ , and _Education_Level_ do you think is(are) the best for predicting the churn? Explain why.

### 5) Split the data set. Bear in mind that you will be asked for an accurate estimation of the performance of your best model at the end. Keep 60 % of the examples for the test set.
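
A minimal sketch of a 40/60 split with scikit-learn's `train_test_split`, on synthetic arrays. Stratifying on the target is an assumption here, not a stated requirement: it keeps the churn rate the same in both subsets, which tends to matter when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 examples, 20 % positives (values are made up)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# test_size=0.6 keeps 60 % of the examples for the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.6, stratify=y_toy, random_state=0
)

print(X_train.shape, X_test.shape)
```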

### 6) Choose ONE metric that you will use to evaluate models that will predict churn. Write it down and explain why you chose this metric.

### 7) Train a model of your choice (except Random Forest), using ALL features

### 8) Train a Random Forest, using all features, and tune the following hyperparameters: [ number of estimators, max_depth ]  
Notes: 
- "Tune" means finding the optimal value for that hyperparameter
- If you are not sure how to tune more than one parameter at once, just tune the number of estimators
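
One common way to tune several hyperparameters at once is a cross-validated grid search. The sketch below uses `GridSearchCV` on synthetic data. The candidate values in the grid and the `recall` scoring are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training set
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Every combination of these values is evaluated by cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 6, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # assumed metric; use the one chosen for the task
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

The search should be fitted on the training set only, so that the test set remains untouched for the final performance estimate.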


### 9) Compare the performance of the models you have trained. Comment on the results based on the metric you chose earlier.

### 10) Plot the ROC curves for your models. Do they confirm your choice?
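
A ROC curve is drawn from predicted scores (for instance, class-1 probabilities), not from hard class predictions. The sketch below shows this for a single model on synthetic data. The logistic regression is only a placeholder, and the same pattern could be repeated on the same axes for each trained model.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.6, stratify=y_toy, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class for each test example
scores = model.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"Logistic regression (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```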

## PART 2: Making clusters of customers (3 points)

In this part, we will not consider the churn; we will focus only on the information in X to create three clusters of customers. Customers within each cluster should share similarities so that specific EPIBank employees can focus their attention on each cluster.

### 11) Train a k-Means clustering (on ALL features) with k = 3
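
A minimal k-means sketch on synthetic points is given below. Standardizing the features first is an assumption rather than a stated requirement: k-means relies on Euclidean distances, so a feature with a large scale (such as a credit limit in dollars) would otherwise dominate the clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three synthetic blobs standing in for the customers (values are made up)
rng = np.random.default_rng(0)
X_toy = np.vstack(
    [rng.normal(loc=center, scale=0.5, size=(50, 2)) for center in (0, 5, 10)]
)

# Put every feature on a comparable scale before computing distances
X_scaled = StandardScaler().fit_transform(X_toy)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)  # one cluster index per row

print(np.bincount(labels))
```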

### 12) Plot (scatter) Credit_Limit vs. Income_Category with colors corresponding to the cluster attribution

### 13) Compute the mean values for the following features among the whole group of customers: [Customer_Age, Credit_Limit, Gender, Income_Category, Education_Level, Card_Silver]

### 14) Compute the mean values for the same features as in question 13 among each individual cluster. 
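
Overall and per-cluster means can be compared side by side with `mean` and `groupby`. The sketch below uses a made-up toy frame in which a hypothetical `cluster` column holds the cluster index of each customer.

```python
import pandas as pd

# Toy frame: two features plus a cluster label per customer (values are made up)
toy = pd.DataFrame({
    "Customer_Age": [30, 32, 50, 52, 70, 72],
    "Credit_Limit": [2000.0, 3000.0, 8000.0, 9000.0, 30000.0, 34000.0],
    "cluster": [0, 0, 1, 1, 2, 2],
})
features = ["Customer_Age", "Credit_Limit"]

# Mean of each feature over all customers
overall_means = toy[features].mean()

# Mean of each feature within each cluster (one row per cluster)
cluster_means = toy.groupby("cluster")[features].mean()

print(overall_means)
print(cluster_means)
```

Reading each row of `cluster_means` against `overall_means` shows which features set a cluster apart from the average customer.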

### 15) Based on your results for the two previous questions, describe the three clusters in terms of the characteristics of their members.