**Review**

Hello Cory!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  Thank you so much for your feedback. I've split the cells into multiple cells so they're easier to follow. Hopefully I got it right this time. Thank you!
</div>
  
First of all, thank you for turning in the project! You did a great job overall, but there are some small problems that need to be fixed before the project will be accepted. Let me know if you have any questions!


Beta Bank has asked me to look into the steady loss of its customers and make predictions about it. My goals are to 1. see how many customers still remain at the bank, and 2. predict, based on all the data, how likely the remaining customers are to leave. Finally, we will make some suggestions that may help keep customers loyal to Beta Bank.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv('/datasets/Churn.csv')

display(df.head())
print(df.info())

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None


In [3]:
total_customers = df.shape[0]
remaining_customers = df[df['Exited'] == 0].shape[0]

print(f"Total number of customers: {total_customers}")
print(f"Number of customers still with the bank: {remaining_customers}")

Total number of customers: 10000
Number of customers still with the bank: 7963


All right! The first step is done! We started off with 10,000 members and are now down to 7,963. Let's focus on the remaining members to see how likely they are to leave soon, and then explore what we can do to keep them with us!
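The two counts above also give the class balance, which matters later when choosing metrics. The sketch below rebuilds a stand-in for `df['Exited']` from the printed counts (the `exited` Series is an assumption made for illustration, not the real column) and computes the share of each class.

```python
import pandas as pd

# Stand-in for df['Exited'], rebuilt from the counts printed above:
# 7,963 customers stayed (0) and 10,000 - 7,963 = 2,037 left (1).
exited = pd.Series([0] * 7963 + [1] * 2037, name="Exited")

# normalize=True returns fractions instead of raw counts.
class_share = exited.value_counts(normalize=True)
stay_rate = class_share.loc[0]
churn_rate = class_share.loc[1]

print(f"Share of customers who stayed: {stay_rate:.1%}")
print(f"Share of customers who left:   {churn_rate:.1%}")
```

Roughly one customer in five has left, so the classes are imbalanced and a model that always predicts "stays" would already reach about 80% accuracy.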

<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct

</div>

In [4]:
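# Fill missing values (only Tenure has NaNs) with column means; numeric_only skips the text columns
df.fillna(df.mean(numeric_only=True), inplace=True)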

categorical_features = df.select_dtypes(include=['object']).columns.tolist()

df = pd.get_dummies(df, columns=categorical_features, drop_first=True)

X = df.drop(columns=['Exited']) 
y = df['Exited'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345)

model = RandomForestClassifier(n_estimators=100, random_state=12345)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")  
print(f"AUC-ROC Score: {roc_auc:.3f}") 
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.843
F1 Score: 0.470
AUC-ROC Score: 0.842

Confusion Matrix:
[[1547   26]
 [ 288  139]]


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct. But:
    
1. The main metric is the F1 score, not accuracy. The additional metric is AUC-ROC. You can calculate accuracy if you want, but you should calculate the F1 score and AUC-ROC as well.
2. Why didn't you use categorical features? You should encode them and use during model training.

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct. Good job!

</div>

This prediction is looking promising for us! According to the confusion matrix, 1,573 customers in the test set stayed, 26 of whom were predicted to leave. The unfortunate side is that 427 customers actually left, and 288 of those were predicted to stay! An accuracy of 84.3% sounds like something we can take to the bank, but with an F1 score of only 0.470 the model is missing most of the customers who leave. Let's see if we can improve our F1 score.
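To see where the F1 score comes from, the metrics can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch that assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); the variable names are only for illustration.

```python
import numpy as np

# Confusion matrix from the baseline random forest output above.
# Row 0 = actually stayed, row 1 = actually left;
# column 0 = predicted to stay, column 1 = predicted to leave.
cm = np.array([[1547, 26],
               [288, 139]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted leavers, how many really left
recall = tp / (tp + fn)      # of the real leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

The low recall (139 of 427 leavers caught) is what pulls the F1 score down to 0.470 even though accuracy is 0.843.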

In [5]:
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
numerical_features = df.select_dtypes(exclude=['object']).drop(columns=['Exited']).columns.tolist()

df_train = pd.concat([X_train, y_train], axis=1)

df_majority = df_train[df_train['Exited'] == 0]
df_minority = df_train[df_train['Exited'] == 1]

df_minority_upsampled = resample(df_minority, 
                                 replace=True,  
                                 n_samples=len(df_majority),  
                                 random_state=12345)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])

X_train = df_upsampled.drop(columns=['Exited'])
y_train = df_upsampled['Exited']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),  
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features) 
])

param_grid = {'classifier__C': [0.01, 0.1, 1, 10, 100], 'classifier__penalty': ['l1', 'l2']}

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear'))
])

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Best Model: {best_model}")
print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")  
print(f"AUC-ROC Score: {roc_auc:.3f}")  
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Best Model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['RowNumber', 'CustomerId',
                                                   'CreditScore', 'Age',
                                                   'Tenure', 'Balance',
                                                   'NumOfProducts', 'HasCrCard',
                                                   'IsActiveMember',
                                                   'EstimatedSalary',
                                                   'Surname_Abbie',
                                                   'Surname_Abbott',
                                                   'Surname_Abdullah',
                                                   'Surname_Abdulov',
                                                   'Surname_Abel',
                                                   'Surname_Abernathy',
                    

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct. Good job! But:

1. You have already created the variables X and y and split them above. You should not do the same thing a second time here. So, please, clean up your code.
2. Why didn't you use categorical features? You should encode them and use them during model training.
3. The main metric is the F1 score, not accuracy. The additional metric is AUC-ROC. You can calculate accuracy if you want, but you should calculate the F1 score and AUC-ROC as well.

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

If you're going to use any linear models, all quantitative features should be scaled. Not all features but only quantitative ones. Binary features which you got by one hot encoding have a perfect scale by default and additional scaling only ruins it. So, please, fix it.

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Not fixed. You keep scaling all the features including binary ones.

</div>
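One possible way to scale only the quantitative features is to name them explicitly and let `ColumnTransformer` pass every other column through untouched. The sketch below uses a small synthetic frame; `X_demo`, `y_demo`, and the `quantitative` list are assumptions for illustration, with column names borrowed from the dataset, not the project's actual variables.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)
n = 200

# Synthetic stand-in for already one-hot encoded training data (made-up values).
X_demo = pd.DataFrame({
    "CreditScore": rng.integers(350, 850, n),
    "Age": rng.integers(18, 90, n),
    "Balance": rng.uniform(0, 200_000, n),
    "HasCrCard": rng.integers(0, 2, n),
    "Geography_Germany": rng.integers(0, 2, n),
    "Gender_Male": rng.integers(0, 2, n),
})
y_demo = rng.integers(0, 2, n)

# Only the truly quantitative columns are listed for scaling.
quantitative = ["CreditScore", "Age", "Balance"]

preprocessor = ColumnTransformer(
    [("num", StandardScaler(), quantitative)],
    remainder="passthrough",  # binary columns are kept exactly as they are
)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="liblinear")),
])
pipeline.fit(X_demo, y_demo)

# Scaled columns come first, passthrough columns follow in their original order.
transformed = pipeline.named_steps["preprocessor"].transform(X_demo)
print(transformed[:3].round(2))
```

Selecting columns with `select_dtypes` after `pd.get_dummies` has already run would put the dummy columns into the numeric list, which is why an explicit list is used here.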

Looks like logistic regression is not the way to go! Let's look at a few other possibilities.

In [6]:
model = DecisionTreeClassifier(class_weight='balanced', random_state=12345)

param_grid = {
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']  
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='f1', verbose=0)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]  # Probability of class 1

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Best Model: {best_model}")
print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")  
print(f"AUC-ROC Score: {roc_auc:.3f}")  
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Best Model: DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=30, random_state=12345)
Accuracy: 0.803
F1 Score: 0.504
AUC-ROC Score: 0.681

Confusion Matrix:
[[1406  167]
 [ 227  200]]


In [7]:
model = RandomForestClassifier(class_weight='balanced', random_state=12345)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='f1', verbose=0)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Best Model: {best_model}")
print(f"Accuracy: {accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")
print(f"AUC-ROC Score: {roc_auc:.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Best Model: RandomForestClassifier(class_weight='balanced', max_depth=30, n_estimators=300,
                       random_state=12345)
Accuracy: 0.811
F1 Score: 0.602
AUC-ROC Score: 0.839

Confusion Matrix:
[[1336  237]
 [ 141  286]]


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct. Good job! But:

1. You have already filled the NaNs, created the variables X and y, and split them above. You should not do the same thing a second time here. So, please, clean up your code.
2. The main metric is the F1 score. The additional metric is AUC-ROC. You should calculate both the F1 score and AUC-ROC.
3. We need to scale the data only if we use linear models. For tree-based models it is a useless operation because scaling does not affect them. But if you use linear models, you need to scale only the quantitative features, because one-hot encoded features already have a perfect scale by default and additional scaling only ruins it.
4. Please, try at least one more model while working with imbalance.
5. Please, try at least one more way to deal with imbalance. For instance, you can train a model with the parameter class_weight='balanced'.
6. Please, tune hyperparameters for at least one model while working with imbalance. The easiest way is to use GridSearchCV with a model that has class_weight='balanced'.

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

1. If you're going to use any linear models, all quantitative features should be scaled. Not all features but only quantitative ones. Binary features which you got by one hot encoding have a perfect scale by default and additional scaling only ruins it. So, please, fix it.
2. Please, try at least one more model while working with imbalance. Right now you tried only one model (RandomForestClassifier) but you should try at least two models.
3. Please, try at least one more way to deal with imbalance. For instance, you can apply upsampling or downsampling function for the train data. You have such examples in the lesson. You don't even need to tune hyperparameters in such case. Just try at least one model with upsampled or downsampled train data.

</div>

<div class="alert alert-info">
  Just for some clarification, I thought I had used 2 different models, logistic regression and random forest classifier. As I discussed with a tutor, this section makes sense logically, but the practical use has made zero sense to me. Am I wrong about the logistic regression vs. random forest classifier? Sorry for not getting this project at all, and thanks for your patience with me.
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

You tried LogisticRegression before you started to work with imbalance. But in the section where you work with imbalance you need to try at least two models but now you tried only one RandomForestClassifier model.
    
So, all 3 problems from my previous comment are not fixed yet.

</div>
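A minimal sketch of what "at least two models with upsampled train data" could look like is shown below. It runs on synthetic data from `make_classification`; the `upsample` helper, the `X_tr`/`X_te` names, and the choice of models are assumptions for illustration only. The helper assumes class 1 is the minority, and only the training part is resampled, so the test set keeps its original imbalance.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data (about 80% class 0, 20% class 1), not the bank data.
X_arr, y_arr = make_classification(n_samples=2000, n_features=8,
                                   weights=[0.8, 0.2], random_state=12345)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
y_demo = pd.Series(y_arr, name="Exited")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=12345)


def upsample(features, target, random_state=12345):
    """Repeat minority-class (1) rows until both classes have the same size."""
    train = pd.concat([features, target], axis=1)
    majority = train[train[target.name] == 0]
    minority = train[train[target.name] == 1]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=random_state)
    # Shuffle so the classes are not stored in two contiguous blocks.
    balanced = pd.concat([majority, minority_up]).sample(
        frac=1, random_state=random_state)
    return balanced.drop(columns=[target.name]), balanced[target.name]


X_up, y_up = upsample(X_tr, y_tr)

models = {
    "LogisticRegression": LogisticRegression(solver="liblinear"),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=6, random_state=12345),
}

scores = {}
for name, model in models.items():
    model.fit(X_up, y_up)                      # train on the balanced data
    pred = model.predict(X_te)                 # evaluate on the untouched test set
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = (f1_score(y_te, pred), roc_auc_score(y_te, proba))
    print(f"{name}: F1 = {scores[name][0]:.3f}, AUC-ROC = {scores[name][1]:.3f}")
```

The synthetic features are already on a similar scale, so no scaler is used here; with the real data, the logistic regression would still need its quantitative features scaled.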

Our prediction models are telling us that it looks like the worst is behind us! With more customers staying than leaving, this should be music to our ears. As a side note on keeping more customers with us here at Beta Bank, we could look into things such as higher interest rates on savings accounts, loyalty gifts for those with longer tenure, or maybe even a cash-back credit card. All that being said, let's figure out more ways to keep those customers who are predicted to leave!