## Customers Churn Prediction

This project is focused on developing a model that will help Beta Bank predict whether customers will leave or stay based on their Credit Score, Geographical Location, Gender, Age, how long they've been with the bank (Tenure), Balance, Number of Products they use, whether they have a credit card, if they are active members, and their Estimated Salary.

### Data Description

- `RowNumber`: row index in the dataset
- `CustomerId`: unique customer identifier
- `Surname`: customer surname
- `CreditScore`: customer credit score
- `Geography`: country of residence
- `Gender`: gender
- `Age`: age of customer
- `Tenure`: period of maturation for a customer’s fixed deposit (years)
- `Balance`: account balance
- `NumOfProducts`: number of banking products used by the customer
- `HasCrCard`: whether the customer has a credit card
- `IsActiveMember`: whether the customer is an active member
- `EstimatedSalary`: estimated salary
- `Exited`: whether the customer has left the bank (target)

### Table of Contents

1. General Information
2. Data Preprocessing
    * Deleting columns
    * Filling missing Tenure values
    * Dummies for the categorical columns
    * Features, target and splitting the data
    * Scaling the numerical columns
3. Building a Model with the Class Imbalance
    * Decision Tree
    * Random Forest
    * Logistic Regression
4. Dealing with Class Imbalance
    * Class Weight Adjustment
    * Upsampling
5. Final Test on Test Set
6. Conclusion

## General Information

Importing the necessary libraries

In [1]:
import pandas as pd 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle 
from sklearn.metrics import f1_score 
from sklearn.metrics import roc_auc_score

Loading the dataset

In [2]:
data = pd.read_csv('/datasets/Churn.csv')

data.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The dataset contains 10000 rows and 14 columns. All columns have appropriate data types, and only the "Tenure" column has missing values (909 of them), which we will fill. The categorical columns will be converted to numerical values using one-hot encoding. After that, we will standardize (scale) the numerical columns, leaving out the binary ones (values 0 or 1) such as HasCrCard and IsActiveMember.

Our target will be the Exited column, while the rest of the columns will serve as features.

## Data Preprocessing

### Deleting columns

We will look at the following columns: RowNumber, CustomerId, and Surname. RowNumber is essentially the index, except that it starts from 1 rather than 0. CustomerId uniquely identifies each customer, and Surname is another means of identification. These columns identify individual observations rather than describe customer behavior, so including them will not help train our models. We will therefore drop them.

In [4]:
data = data.drop(["RowNumber", "CustomerId", "Surname"], axis = 1)

In [5]:
data.head(10)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           9091 non-null   float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


### Filling missing "Tenure" values

Checking the column to see its characteristics

In [7]:
data["Tenure"].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

The column contains some NaN values, which we will fill with the median of the Tenure column.

In [8]:
data["Tenure"] = data["Tenure"].fillna(data["Tenure"].median())

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


All missing values in the Tenure column have been filled.
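To illustrate what the median fill does, here is a minimal sketch on a made-up series (the values below are not taken from the dataset). It relies on the fact that `Series.median()` skips NaN values by default.

```python
import pandas as pd

# Toy stand-in for the Tenure column (values are made up for illustration).
tenure = pd.Series([2.0, 1.0, None, 8.0, 5.0, None])

# median() ignores NaN by default, so it is taken over 1, 2, 5 and 8.
median_tenure = tenure.median()

# fillna returns a new Series; the original is left unchanged.
tenure_filled = tenure.fillna(median_tenure)

print(median_tenure)               # 3.5
print(tenure_filled.isna().sum())  # 0
```

Because the median is computed from the observed values only, the filled entries sit in the middle of the existing distribution and do not shift it towards either extreme.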

### Dummies for the categorical columns

From the dataset, there are two categorical columns, namely: Geography and Gender

Taking a look at the values in each categorical column. Let's start with Geography

In [10]:
data["Geography"].value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

Next, we look at the Gender column

In [11]:
data["Gender"].value_counts()

Male      5457
Female    4543
Name: Gender, dtype: int64

Using One-hot encoding to change the categorical values to numerical values, so that they can be suitable for the ML Models

In [12]:
data = pd.get_dummies(data, drop_first = True)
#replaces the categorical columns by their dummies and drops the first dummy column for each replaced column

data.head(5)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


The Geography column contains 3 values: Spain, Germany, and France. Creating dummies for this column replaces it with one column per value: Geography_Spain, Geography_Germany, and Geography_France. Each dummy column takes the value 1 in the observations where Geography equals that country, and 0 otherwise.
The same is done for the Gender column. pd.get_dummies is applied to the whole dataset since these are the only categorical columns.
We can drop one dummy column in each case because it is implied by the others: a 0 in both Geography_Spain and Geography_Germany directly implies France. This is done by setting the parameter drop_first = True, which is why Geography_France and Gender_Female do not appear in the output above.
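The point about the dropped column being recoverable can be checked on a small made-up frame. This is only a sketch: the four rows below are invented, and `is_france` is an illustrative name, not something used elsewhere in the notebook.

```python
import pandas as pd

# Made-up rows with the same two categorical columns as the dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# drop_first=True removes the first category (alphabetically) of each column.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
# ['Geography_Germany', 'Geography_Spain', 'Gender_Male']

# France is implied whenever both remaining Geography dummies are 0.
is_france = (encoded["Geography_Germany"] == 0) & (encoded["Geography_Spain"] == 0)
print(is_france.tolist())  # [True, False, False, True]
```

Dropping the redundant column loses no information and avoids perfectly correlated features, which matters mostly for linear models such as Logistic Regression.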

### Features, target and Splitting the data

Our target for this project is the Exited column, while the rest of the columns will be assigned as features. The whole dataset will be split into three parts, namely: training, validation and test sets, making up 60%, 20%, and 20% respectively. The train_test_split() function will be used to split the dataset. The random_state will be set to 12345, and we will keep it the same throughout model training.

In [13]:
features=data.drop('Exited', axis=1)
target=data['Exited']

#splitting features and target into features_train (60%), features_test_valid (40%),
#target_train (60%) and target_test_valid (40%)

features_train, features_test_valid, target_train, target_test_valid = train_test_split(
    features, target, test_size=0.4, random_state=12345)

#Getting the shape of the datasets

print(features_train.shape)
print(target_train.shape)
print(features_test_valid.shape)


(6000, 11)
(6000,)
(4000, 11)


In [14]:
#splitting features_test_valid and target_test_valid into 
#features_valid(50%), features_test(50%), target_valid(50%), target_test(50%)
features_valid, features_test, target_valid, target_test = train_test_split(features_test_valid, \
                                                                        target_test_valid, test_size=0.5, \
                                                                       random_state=12345)

In [15]:
#Getting the shape of the datasets

print(features_valid.shape)
print(features_test.shape)
print(target_valid.shape)
print(target_test.shape)


(2000, 11)
(2000, 11)
(2000,)
(2000,)
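The two-step split used above can be wrapped into one function. The sketch below assumes a hypothetical helper named `split_60_20_20` and runs it on made-up data of 100 rows, so the resulting sizes are easy to check by eye.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(features, target, random_state=12345):
    """Hypothetical helper: two chained train_test_split calls."""
    # First cut: 60% for training, 40% held out.
    f_train, f_rest, t_train, t_rest = train_test_split(
        features, target, test_size=0.4, random_state=random_state)
    # Second cut: the held-out 40% is halved into validation and test.
    f_valid, f_test, t_valid, t_test = train_test_split(
        f_rest, t_rest, test_size=0.5, random_state=random_state)
    return (f_train, t_train), (f_valid, t_valid), (f_test, t_test)


# Made-up data: 100 rows, one feature, a binary target.
toy_features = pd.DataFrame({"x": range(100)})
toy_target = pd.Series([i % 2 for i in range(100)], name="y")

train, valid, test = split_60_20_20(toy_features, toy_target)
print(len(train[0]), len(valid[0]), len(test[0]))  # 60 20 20
```

Taking half of the held-out 40% is what yields the 20% validation and 20% test shares of the full dataset.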


### Scaling the numerical columns

From our dataset, the numerical columns are "CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary", and "NumOfProducts". These columns have to be standardized because some algorithms would otherwise treat variables with a larger scale or dispersion as more important than those with a smaller one. Therefore, we create a StandardScaler(), fit() it on the numerical columns of the training set, and use it to transform() the training, validation, and test sets, thereby getting our scaled values.
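To make the standardization concrete, here is a small sketch in plain Python of the arithmetic StandardScaler performs on one column: subtract the training mean and divide by the training standard deviation (the population form, with ddof=0). The balance values and the `standardize` helper are made up for illustration.

```python
from statistics import mean, pstdev

# Made-up balances standing in for one numeric column.
train_balance = [0.0, 50_000.0, 100_000.0, 150_000.0]
valid_balance = [75_000.0, 200_000.0]

# Statistics are learned from the training values only.
mu = mean(train_balance)
sigma = pstdev(train_balance)  # population standard deviation (ddof=0)


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma with already-fitted statistics."""
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_balance, mu, sigma)
valid_scaled = standardize(valid_balance, mu, sigma)

print(mu)               # 75000.0
print(valid_scaled[0])  # 0.0, because 75000 equals the training mean
```

Reusing the training mean and standard deviation for the validation and test values keeps information from those sets out of the fitted scaler.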

In [16]:
num_dat = ["CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary", "NumOfProducts"]
#we create a list containing the numeric column names and assign it to a variable named num_dat

scaler = StandardScaler()

scaler.fit(features_train[num_dat])
#trains the scaler with the data from the numeric columns of the training set only

features_train = features_train.copy()
#works on an explicit copy so that pandas does not emit a SettingWithCopyWarning

features_train[num_dat] = scaler.transform(features_train[num_dat])#transforms the data into scaled values
features_train.head()


Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
7479,-0.886751,-0.373192,1.082277,1.232271,-0.89156,1,0,-0.187705,0,1,1
3411,0.608663,-0.183385,1.082277,0.600563,-0.89156,0,0,-0.333945,0,0,0
6027,2.052152,0.480939,-0.737696,1.027098,0.830152,0,1,1.503095,1,0,1
1247,-1.457915,-1.417129,0.354288,-1.233163,0.830152,1,0,-1.071061,0,0,1
3716,0.130961,-1.132419,-1.10169,1.140475,-0.89156,0,0,1.524268,1,0,0


In [17]:
features_test = features_test.copy()#explicit copy, avoids the SettingWithCopyWarning
features_test[num_dat] = scaler.transform(features_test[num_dat])#transforms the data into scaled values
features_test.head()


Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
7041,-2.226392,-0.088482,-1.10169,-1.233163,0.830152,1,0,0.647083,0,0,1
5709,-0.08712,0.006422,1.446272,-1.233163,-0.89156,1,0,-1.65841,0,0,0
7117,-0.917905,-0.752805,-0.009707,0.722307,-0.89156,1,1,-1.369334,0,1,1
7775,-0.253277,0.101325,1.810266,-1.233163,0.830152,1,0,0.075086,0,1,1
8735,0.785204,-0.847708,1.810266,0.615625,-0.89156,0,1,-1.070919,0,0,1


In [18]:
features_valid = features_valid.copy()#explicit copy, avoids the SettingWithCopyWarning
features_valid[num_dat] = scaler.transform(features_valid[num_dat])#transforms the data into scaled values
features_valid.head()


Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
8532,-0.699824,-0.373192,-1.10169,-1.233163,0.830152,1,0,-0.015173,0,0,0
5799,-0.284431,0.575842,-0.737696,-1.233163,-0.89156,1,1,1.471724,0,0,0
5511,0.151731,-0.657902,-1.829679,0.438711,-0.89156,1,0,-1.367107,1,0,1
7365,-0.876366,-0.278288,1.810266,1.239884,-0.89156,1,1,-0.786517,0,1,0
7367,-0.481743,0.291132,1.810266,-1.233163,0.830152,1,0,1.358533,0,1,1


## Building a Model with the Class Imbalance

### Decision Tree Classification

For the Decision Tree classifier we will use DecisionTreeClassifier(). We will set two hyperparameters: random_state and max_depth. random_state is fixed at 12345 throughout. max_depth is the hyperparameter we will tune: we loop over max_depth values from 1 to 12 and compute the F1 score and AUC-ROC on the validation set, both of which are metrics of model quality.

In [28]:
#Looping through the Tree with varying max_depth  

for depth in range(1, 13):
    dtc_model = DecisionTreeClassifier(random_state=12345, max_depth = depth)
    
    dtc_model.fit(features_train, target_train) 
    #Here the model is trained using the features and target
    #of the training set
    
    dtc_pred_valid= dtc_model.predict(features_valid)
    #The model makes predictions on the features of the validation set 
    
    probabilities_valid = dtc_model.predict_proba(features_valid)
    #this line of code gets both negative class and positive class probabilities for 
    #each observation of the features_valid set
    
    probabilities_one_valid = probabilities_valid[:, 1]
    #gets the positive class probabilities for each observation of the features_valid set
    print('Max depth', depth, 'F1 score =', f1_score(target_valid, dtc_pred_valid), 'AUC-ROC score =', \
         roc_auc_score(target_valid, probabilities_one_valid))
    

Max depth 1 F1 score = 0.0 AUC-ROC score = 0.6925565119556736
Max depth 2 F1 score = 0.5217391304347825 AUC-ROC score = 0.7501814673449512
Max depth 3 F1 score = 0.4234875444839857 AUC-ROC score = 0.7973440741838507
Max depth 4 F1 score = 0.5528700906344411 AUC-ROC score = 0.813428129858032
Max depth 5 F1 score = 0.5406249999999999 AUC-ROC score = 0.8221680508592478
Max depth 6 F1 score = 0.5696969696969697 AUC-ROC score = 0.8164631712023421
Max depth 7 F1 score = 0.5320813771517998 AUC-ROC score = 0.8138530658907929
Max depth 8 F1 score = 0.5454545454545454 AUC-ROC score = 0.8119854644656693
Max depth 9 F1 score = 0.5633802816901409 AUC-ROC score = 0.7801515554775917
Max depth 10 F1 score = 0.5385694249649369 AUC-ROC score = 0.7657619511368929
Max depth 11 F1 score = 0.5059920106524634 AUC-ROC score = 0.7311735190752424
Max depth 12 F1 score = 0.521072796934866 AUC-ROC score = 0.7165306165655491


The best F1 score is approximately 0.57, observed when max_depth is 6, with an AUC-ROC value of ~0.82.
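The depth 1 row, with an F1 score of 0.0 next to an AUC-ROC of about 0.69, may look contradictory. One plausible reading is that both leaves of the stump predict class 0, so there are no true positives, while the leaf probabilities still rank customers usefully. The sketch below reproduces that situation in plain Python on made-up labels and scores; `f1_from_labels` and `auc_from_scores` are illustrative helpers, not scikit-learn functions.

```python
def f1_from_labels(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall); 0.0 if no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_from_scores(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: a stump with two leaves, both with probability below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.1, 0.1, 0.1, 0.4, 0.4, 0.4, 0.4, 0.1, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in scores]  # every prediction is 0

print(f1_from_labels(y_true, y_pred))           # 0.0
print(round(auc_from_scores(y_true, scores), 3))  # 0.69
```

F1 depends on the hard predictions at the default 0.5 cut-off, whereas AUC-ROC depends only on how the probabilities order the observations, which is why the two metrics can disagree this sharply.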

### Random Forest Classification

For the Random Forest classifier we will use RandomForestClassifier(). The hyperparameters we will tune are n_estimators, the number of trees, and max_depth, the maximum depth of each tree.
We loop through values of max_depth and, within that loop, through values of n_estimators. Each combination of max_depth and n_estimators is used to train a model, and we keep track of the combination with the highest F1 score on the validation set.
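The search over the two hyperparameters can be sketched without training anything. In the example below, `toy_score` is a hypothetical stand-in for the validation F1 score (it is not derived from the data), used only to show how many combinations are visited and how the strict `>` comparison resolves ties.

```python
from itertools import product

depths = range(1, 12)         # max_depth values 1..11
n_trees = range(10, 101, 10)  # n_estimators values 10, 20, ..., 100

# Every (max_depth, n_estimators) pair, in the same order as the nested loops.
grid = list(product(depths, n_trees))
print(len(grid))  # 110


def toy_score(depth, n_estimators):
    """Hypothetical score: peaks at depth 7 and ignores the number of trees."""
    return 1.0 - abs(depth - 7) / 10


best_score, best_params = 0.0, None
for depth, n in grid:
    score = toy_score(depth, n)
    # Strict '>' keeps the first combination among equally good ones.
    if score > best_score:
        best_score, best_params = score, (depth, n)

print(best_params)  # (7, 10)
```

With 11 depths and 10 forest sizes, 110 models are trained in total, and when several combinations tie, the one visited first (the smallest n_estimators for that depth) is the one reported.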


In [27]:
score_f1 = 0

best_est_rf = 0

best_depth_rf = 0

best_roc_score = 0
#initialized up front so that the final print never refers to an undefined name

for i in range(1, 12):
    #loops through various depths of Trees from 1 to 11 
    for j in range(10, 101, 10):
        #loops through several trees from 10 to 100 in steps of 10
        
        model_rf = RandomForestClassifier(random_state = 12345, max_depth = i, n_estimators = j )
        
        model_rf.fit(features_train, target_train)
        
        predictions_valid_rf = model_rf.predict(features_valid)
        
        probabilities_valid_rf = model_rf.predict_proba(features_valid)
        
        probabilities_one_valid_rf = probabilities_valid_rf[:, 1]
        
        roc_score = roc_auc_score(target_valid, probabilities_one_valid_rf)
        
        result_F1 = f1_score(target_valid, predictions_valid_rf)
        
              
        if result_F1 > score_f1:
                        
            best_est_rf = j
            
            best_depth_rf = i
            
            best_roc_score = roc_score
            
            score_f1 = result_F1

print('best depth:', best_depth_rf, "n_estimators:", best_est_rf, 'F1 score =', score_f1,
      'AUC-ROC score =', best_roc_score)
        

best depth: 10 n_estimators: 10 F1 score = 0.5891238670694864 AUC-ROC score = 0.8456038023457679


The best F1 score is 0.5891 (~0.59), with an AUC-ROC score of 0.8456.

### Logistic Regression

For Logistic Regression, we will use LogisticRegression() with the same random_state. The max_depth and n_estimators hyperparameters do not apply here; we only set the solver, for which we will use 'liblinear'.

In [29]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear')

lr_model.fit(features_train, target_train)

lr_valid_pred = lr_model.predict(features_valid)

probabilities_lr_valid = lr_model.predict_proba(features_valid)

probabilities_lr_one_valid = probabilities_lr_valid[:, 1]

print('F1 score =', f1_score(target_valid, lr_valid_pred), 
      'AUC-ROC =', roc_auc_score(target_valid, probabilities_lr_one_valid))

F1 score = 0.33108108108108103 AUC-ROC = 0.7587512627102753


### Conclusion

The best of the 3 models is the Random Forest Classifier with max_depth = 10 and n_estimators = 10 hyperparameters since it had the highest f1 score (about 0.59) and AUC-ROC score (about 0.85). 

## Dealing with Class Imbalance

To study the class imbalance, we need to find the distribution of the classes in the target of the training set.

In [30]:
target_train.value_counts(normalize = True)

0    0.800667
1    0.199333
Name: Exited, dtype: float64

The negative class (0) is approximately 80% of the data, while the positive class (1) is approximately 20% of the data. So there are about 4 times as many 0s as there are 1s.
Let's consider some methods to tackle the class imbalance.

### Class Weight Adjustment

Here we want to give the rare class (i.e. 1) more weight. This is achieved by setting the hyperparameter class_weight = "balanced" when creating the model.
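As a rough illustration of what "balanced" means, scikit-learn documents the weight of each class as n_samples / (n_classes * count of that class). The sketch below applies that formula in plain Python; the class counts 4804 and 1196 are inferred from the proportions printed above for the 6000 training rows, so treat them as an assumption.

```python
# Class counts inferred from the 0.800667 / 0.199333 split of 6000 training rows.
counts = {0: 4804, 1: 1196}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" weight of a class: n_samples / (n_classes * class count).
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(round(weights[0], 3))  # 0.624
print(round(weights[1], 3))  # 2.508

# After weighting, both classes carry the same total weight.
print(round(weights[0] * counts[0]), round(weights[1] * counts[1]))  # 3000 3000
```

Each positive observation therefore counts about four times as much as a negative one during training, without any rows being duplicated.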

In [31]:
model_rand_for = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=10, 
                                       class_weight='balanced')
model_rand_for.fit(features_train, target_train)

pred_model_randfor = model_rand_for.predict(features_valid)

proba_model_randfor = model_rand_for.predict_proba(features_valid)

proba_one_model_randfor = proba_model_randfor[:, 1]

print('F1 score =', f1_score(target_valid, pred_model_randfor), 
      'AUC-ROC =', roc_auc_score(target_valid, proba_one_model_randfor))

F1 score = 0.6038647342995168 AUC-ROC = 0.8378377560957906


After adjusting the class weights, the F1 score improved from about 0.59 to about 0.60. The AUC-ROC value, however, dipped slightly, from 0.8456 to 0.8378.

### Upsampling

Here, we will duplicate the observations of the rarer (positive) class several times so that it is roughly evenly matched with the other class. We saw earlier that there are about 4 times as many 0s as 1s.

Therefore, we will repeat the positive observations 4 times to match the zeros in the training set. Afterwards, we shuffle the rows with the shuffle() function so that the duplicated observations are not grouped together.

In [32]:
#Working with the training dataset

def upsample(features, target, repeat):
    
    #splitting the observations passed to the function by class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    
    #duplicating the positive class observations and combining them with the negative class observations
    
    arg_1 = pd.concat([features_zeros] + [features_ones] * repeat)
    
    arg_2 = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(arg_1, arg_2, random_state = 12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 4)
    
print(features_upsampled.shape, target_upsampled.shape)
  


(9588, 11) (9588,)
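The mechanics of repeating the positive rows can be seen on a tiny made-up frame. The sketch below uses invented values and illustrative variable names; the final line checks the 9588 figure, assuming the class counts 4804 and 1196 implied by the training-set proportions.

```python
import pandas as pd

# Made-up data: four negatives and one positive.
features = pd.DataFrame({"Age": [30, 41, 52, 28, 60]})
target = pd.Series([0, 0, 0, 0, 1], name="Exited")

repeat = 4

# [frame] * repeat builds a list with the same frame 4 times; concat stacks them.
up_features = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
up_target = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)

print(len(up_features))  # 8
print(up_target.mean())  # 0.5

# Same arithmetic for the training set, with the inferred class counts.
print(4804 + 4 * 1196)   # 9588
```

The duplicated rows keep their original index labels, so the index of the upsampled frame is no longer unique; that is harmless for model fitting but worth remembering before any index-based lookup.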


Next, we train the model after upsampling

In [33]:
model = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=10) 

model.fit(features_upsampled, target_upsampled)
                            
predicted_valid = model.predict(features_valid)

proba_predict_valid = model.predict_proba(features_valid)

proba_one_valid= proba_predict_valid[:, 1]

print('F1 score =', f1_score(target_valid, predicted_valid), 'AUC-ROC =', \
      roc_auc_score(target_valid, proba_one_valid))

F1 score = 0.5849462365591399 AUC-ROC = 0.8331627036214833


The F1 score after upsampling (0.585) is lower than what we got when we adjusted the class weights (0.604). The AUC-ROC (0.833) is also lower than it was for class weight adjustment (0.838).

### Conclusion

We will move forward with the class weight adjustment approach as it has the higher F1 score of 0.604.

## Final Test on Test Set

Here, we apply our model with class weight adjustment to the test set. Before that, we need to train the model using both the training and validation sets; we will join them using the pd.concat() function.

In [26]:
merged_features = pd.concat([features_train, features_valid], axis = 0)

merged_target = pd.concat([target_train, target_valid], axis = 0)

rf_model_fin = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=10, 
                                       class_weight='balanced')

rf_model_fin.fit(merged_features, merged_target)

pred_final = rf_model_fin.predict(features_test)

pred_final_proba = rf_model_fin.predict_proba(features_test)

proba_one_final = pred_final_proba[:, 1]

print('F1 score =', f1_score(target_test, pred_final), 'AUC-ROC =', \
      roc_auc_score(target_test, proba_one_final))

F1 score = 0.5979381443298969 AUC-ROC = 0.847932229103049


Our final F1 score on the test set is 0.5979 (~0.60), which is above the 0.5891 baseline set by the Random Forest on the validation set before the class imbalance was addressed.

## Conclusion

This project was done to develop a model that will help Beta Bank predict whether customers will leave or stay based on their profile, account details, credit score, and estimated salary.

The dataset was split into three parts, namely: training, validation, and test. 60% of the dataset was assigned for training, 20% was assigned for validation, and the last 20% was assigned for testing.

The data was also processed: missing Tenure values were filled and the categorical columns were one-hot encoded before the split, and the numeric columns were scaled afterwards using a scaler fitted on the training set only. Then, without taking the 4:1 class imbalance into account, Decision Tree, Random Forest, and Logistic Regression classifiers were trained and validated to find the model with the best F1 score.

Random Forest had the best F1 score of 0.5891 and an AUC-ROC value of 0.8456.

We then took the class imbalance into account and used two approaches to address it, namely: Class Weight Adjustment and Upsampling. We chose to go with Class Weight Adjustment since it had the higher F1 score of 0.604, with an AUC-ROC score of 0.8378.
We merged the training and validation sets and used the merged dataset to train the model. After training, the model was applied to the test set, and we got an F1 score of 0.5979 (~0.60) and an AUC-ROC of 0.8479.