<div style="border:solid blue 2px; padding: 20px"> 

<strong>Reviewer's Introduction</strong>

Hello Collin! 👋 

I'm happy to review your project today.

I will categorize my comments in green, blue or red boxes like this:

<div class="alert alert-success">
    <b>Success:</b> Everything is done successfully.
</div>
<div class="alert alert-warning">
    <b>Remarks:</b> Suggestions for optimizations or improvements.
</div>
<div class="alert alert-danger">
    <b>Needs fixing:</b> This must be fixed for a project to be approved.
</div>

Please don't remove my comments :) If you have any questions or comments, don't hesitate to respond to my comments by creating a box that looks like this: 
<div class="alert alert-info"> <b>Student's comment:</b> Your text here.</div>    
<br>


📌 Here's how to create code for student comments inside a Markdown cell:
    
    
    <div class="alert alert-info">
    <b> Student's comment</b>

    Your text here. 
    </div>

You can find out how to **format text** in a Markdown cell or how to **add links** [here](https://sqlbak.com/blog/jupyter-notebook-markdown-cheatsheet). 


<hr>
Reviewer: Han Lee (hanlee_97297 on Discord)<br>
Don’t forget to rate your experience by leaving feedback here:  
<a href="https://form.typeform.com/to/msiTC4LB" target="_blank">https://form.typeform.com/to/msiTC4LB</a>
</div>


<div style="border: solid blue 2px; padding: 15px; margin: 10px">
	<b>Reviewer's Comments – Iteration 1</b>
	
Thank you for submitting this project. It is clear that you have a strong understanding of the material covered up to this point, from data exploration and cleaning, to machine learning fundamentals.

<b>Notable strengths:</b>  

✔️ Extensive and largely persuasive discussions interspersed throughout the project, including a strong conclusion. They really help your audience understand your thought process and decisions.

✔️ A solid understanding of data imbalance, and how to address it via sampling.

✔️ Translating data findings into actionable insights to meet business needs.


<hr>

A few things require your attention before approval. Please see my notes below for further info:

🔴 Please consider imputing the missing values in the `Tenure` column rather than dropping those rows. 

🔴 The data should be split into training, validation, and testing sets.


</div>



<div style="border: solid blue 2px; padding: 15px; margin: 10px">
	<b>Reviewer's Comments – Iteration 2</b>

Congratulations! 

This project now meets all requirements ✅, and is approved. 🎉

Great job addressing the requested changes from the first iteration!

Please see my notes below for further discussion points.

Again, well done, and I wish you continued success in the upcoming sprints.

</div>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

In [2]:
data = pd.read_csv('/datasets/Churn.csv')
print(data.head())
print(data.info())

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0     2.0       0.00              1          1               1   
1     1.0   83807.86              1          0               1   
2     8.0  159660.80              3          1               0   
3     1.0       0.00              2          0               0   
4     2.0  125510.82              1          1               1   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  
3         93826.63       0  
4         790

In [3]:
# Drop identifier columns that carry no predictive signal
data_necessary = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
# One-hot encode Geography and Gender; drop_first avoids a redundant dummy column
data_ohe = pd.get_dummies(data_necessary, drop_first=True)
# Impute missing Tenure values with the column median
data_ohe['Tenure'] = data_ohe['Tenure'].fillna(data_ohe['Tenure'].median())
target = data_ohe['Exited']
features = data_ohe.drop('Exited', axis=1)
# Hold out 20% for testing, then split the remaining 80% into 60% train / 20% validation
features_train_val, features_test, target_train_val, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_val, target_train_val, test_size=0.25, random_state=42
)


In [5]:
data.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [4]:
data_ohe.head()

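| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 | 0 |
| 1 | 608 | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 0 |
| 2 | 502 | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 0 |
| 3 | 699 | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 | 0 |
| 4 | 850 | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 1 | 0 |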


<div class="alert alert-danger">
    <b>Reviewer's comment – Iteration 1:</b><br>
I agree with the decision to drop the unnecessary columns.

Here are two requested changes:

1. Dropping rows with null values in the `Tenure` column is a nuanced decision, as such rows make up roughly 10% of the dataset. Dropping them risks losing valuable data from the other features, while imputing can distort this column. In this case, I believe it makes sense to impute with the column's median value, and I suggest that you give it a try. It would be interesting to see whether it makes a difference in model performance.

2. Please split the dataset into training, validation, and testing sets. Below, you both tune your hyperparameters and perform final testing on the same validation set. This can cause data leakage and lead to overfitting. The testing set should be untouched during training and tuning.
</div>

<div class="alert alert-info">
<b> Student's comment</b>

Fixed :) 
</div>


<div class="alert alert-success">
    <b>Reviewer's comment – Iteration 2:</b><br>
Well done!
</div>

In [4]:
print(data_ohe.head())
print(data_ohe.info())
print(data_ohe.shape)

   CreditScore  Age  Tenure    Balance  NumOfProducts  HasCrCard  \
0          619   42     2.0       0.00              1          1   
1          608   41     1.0   83807.86              1          0   
2          502   42     8.0  159660.80              3          1   
3          699   39     1.0       0.00              2          0   
4          850   43     2.0  125510.82              1          1   

   IsActiveMember  EstimatedSalary  Exited  Geography_Germany  \
0               1        101348.88       1                  0   
1               1        112542.58       0                  0   
2               0        113931.57       1                  0   
3               0         93826.63       0                  0   
4               1         79084.10       0                  0   

   Geography_Spain  Gender_Male  
0                0            0  
1                1            0  
2                0            0  
3                0            0  
4                1            

**Data Preparation Strategy**

- First, I dropped the columns that should not influence a customer's decision to leave the bank (`RowNumber`, `CustomerId`, and `Surname`).
- Second, I created dummy columns for `Geography` and `Gender`, since the models need numeric values in place of categorical ones.
- Third, I filled the missing values in the `Tenure` column with the column median.
- Lastly, I split the data into training, validation, and test sets at a 60-20-20 ratio.
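
The 60-20-20 ratio comes from two chained splits: 20% is held out first, and 25% of the remaining 80% equals another 20% of the original data. The sketch below checks that arithmetic on synthetic data. The `demo_*` names and values are hypothetical, and `stratify` is an optional addition that the notebook itself does not use; it keeps the class shares similar across the splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared churn data (hypothetical values).
rng = np.random.default_rng(0)
demo_features = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.normal(size=1000)})
demo_target = pd.Series(rng.binomial(1, 0.2, size=1000), name="Exited")

# Step 1: hold out 20% of all rows as the test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42, stratify=demo_target
)
# Step 2: 25% of the remaining 80% is 20% of the original data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)

split_sizes = (len(X_train), len(X_valid), len(X_test))
print(split_sizes)
```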

In [5]:
for col in data_ohe.columns:
    print(f"\nColumn: {col}")
    print(data_ohe[col].value_counts(dropna=False))


Column: CreditScore
850    233
678     63
655     54
667     53
705     53
      ... 
412      1
351      1
365      1
373      1
423      1
Name: CreditScore, Length: 460, dtype: int64

Column: Age
37    478
38    477
35    474
36    456
34    447
     ... 
92      2
88      1
82      1
85      1
83      1
Name: Age, Length: 70, dtype: int64

Column: Tenure
5.0     1836
1.0      952
2.0      950
8.0      933
3.0      928
7.0      925
4.0      885
9.0      882
6.0      881
10.0     446
0.0      382
Name: Tenure, dtype: int64

Column: Balance
0.00         3617
105473.74       2
130170.82       2
72594.00        1
139723.90       1
             ... 
130306.49       1
92895.56        1
132005.77       1
166287.85       1
104001.38       1
Name: Balance, Length: 6382, dtype: int64

Column: NumOfProducts
1    5084
2    4590
3     266
4      60
Name: NumOfProducts, dtype: int64

Column: HasCrCard
1    7055
0    2945
Name: HasCrCard, dtype: int64

Column: IsActiveMember
1    5151
0    4849
N

**Class Imbalance Examination**

- Looking at the target column (`Exited`), most customers stay with the bank, and only about 20% have left. That share is still significant for a bank with tens of thousands of customers, where it could represent millions (maybe even billions) of dollars.
- For a machine learning model, this is a substantial imbalance, and it needs to be handled before a model's accuracy can be taken seriously.
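
One quick way to quantify an imbalance like this is to look at class shares rather than raw counts. The following is a minimal sketch on a hypothetical ten-row target, not the actual `Exited` column.

```python
import pandas as pd

# Hypothetical target column: 8 customers stayed (0), 2 left (1).
demo_target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Exited")

# normalize=True returns shares instead of raw counts.
class_share = demo_target.value_counts(normalize=True)
print(class_share)

# Ratio of majority to minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```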

In [6]:
#Finding best number of estimators

best_model = None
best_est = 0
best_score = 0

for est in range(1, 50):
    model = RandomForestClassifier(n_estimators=est, random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > best_score:
        best_score = score
        best_est = est
        best_model = model

print(f'Best # of estimators: {best_est}')

Best # of estimators: 38


In [7]:
#Finding best tree depth

best_model = None
best_depth = 0
best_score = 0

for depth in range(1, 20):
    model = RandomForestClassifier(n_estimators=best_est, max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > best_score:
        best_score = score
        best_depth = depth
        best_model = model

print(f'Best depth: {best_depth}')

Best depth: 18


In [8]:
#Training model based on findings of estimators and depth 

model = RandomForestClassifier(n_estimators=38, max_depth=18, random_state=12345)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
f1 = f1_score(target_valid, predicted_valid)
auc_roc = roc_auc_score(target_valid, predicted_valid)
print(f'F1 Score on best model with imbalanced classes: {f1}')
print(f'AUC_ROC score on best model with imbalanced classes: {auc_roc}')

F1 Score on best model with imbalanced classes: 0.6153846153846154
AUC_ROC score on best model with imbalanced classes: 0.7344876882539886


**Model Training with Class Imbalance**

- I chose a random forest because it gave me the best accuracy in the previous binary classification project, and ran loops to find the number of estimators and tree depth that give the best F1 score on the validation set.
- As we can see from the cell above, the F1 and AUC_ROC scores are not where I want them to be, so I will balance the classes and retrain the models. 
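
A note on the AUC-ROC figure: the training cell above passes hard 0/1 predictions to `roc_auc_score`, which evaluates only a single threshold. The metric is usually computed from predicted probabilities instead. The sketch below contrasts the two on synthetic data from `make_classification`; the variable names are illustrative and the numbers will not match the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 20% positives), not the churn dataset.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

labels = clf.predict(X_val)             # hard 0/1 predictions
probs = clf.predict_proba(X_val)[:, 1]  # predicted probability of class 1

f1 = f1_score(y_val, labels)                    # F1 needs hard labels
auc_from_labels = roc_auc_score(y_val, labels)  # reflects the 0.5 threshold only
auc_from_probs = roc_auc_score(y_val, probs)    # uses the full ranking of scores
print(f"F1: {f1:.3f}")
print(f"AUC from labels: {auc_from_labels:.3f}, AUC from probabilities: {auc_from_probs:.3f}")
```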

In [9]:
#scale the features

numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
print(features_train.head())

      CreditScore       Age    Tenure   Balance  NumOfProducts  HasCrCard  \
8588     0.626553 -0.948125  0.745571  0.026803      -0.919788          1   
3178    -1.143262  0.006684 -0.351933  0.538874       0.806433          1   
5200    -1.455583  0.293126  1.477240  0.283178       0.806433          1   
8889    -0.747657  0.006684  1.477240  0.833254      -0.919788          1   
5789     0.387107  1.534377 -1.449437  0.000856      -0.919788          1   

      IsActiveMember  EstimatedSalary  Geography_Germany  Geography_Spain  \
8588               0         0.389943                  0                1   
3178               1        -1.026089                  0                0   
5200               0        -1.486725                  1                0   
8889               0        -0.246001                  0                0   
5789               0        -1.006993                  1                0   

      Gender_Male  
8588            0  
3178            0  
5200          

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_train[numeric] = scaler.transform(features_train[numeric])

In [10]:
#Retrain model with balanced classes

model_balanced = RandomForestClassifier(n_estimators=38, max_depth=18, random_state=12345, class_weight='balanced')
model_balanced.fit(features_train, target_train)
predicted_valid = model_balanced.predict(features_valid)
f1 = f1_score(target_valid, predicted_valid)
auc_roc = roc_auc_score(target_valid, predicted_valid)
print(f'F1 Score on best model with balanced classes: {f1}')
print(f'AUC_ROC score on best model with balanced classes: {auc_roc}')

F1 Score on best model with balanced classes: 0.5941176470588235
AUC_ROC score on best model with balanced classes: 0.7240731671220922


In [11]:
#Retrain model with upsampling

def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

model_upsampled = RandomForestClassifier(n_estimators=38, max_depth=18, random_state=12345)
model_upsampled.fit(features_upsampled, target_upsampled)
predicted_valid_upsampled = model_upsampled.predict(features_valid)
f1_upsampled = f1_score(target_valid, predicted_valid_upsampled)
auc_roc_upsampled = roc_auc_score(target_valid, predicted_valid_upsampled)
print(f'F1 Score on best model with upsampling: {f1_upsampled}')
print(f'AUC_ROC score on best model with upsampling: {auc_roc_upsampled}')

F1 Score on best model with upsampling: 0.6122961104140527
AUC_ROC score on best model with upsampling: 0.7512911351461863


In [12]:
#Retrain model with downsampling

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)]
        + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)]
        + [target_ones]
    )

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )

    return features_downsampled, target_downsampled


features_downsampled, target_downsampled = downsample(
    features_train, target_train, 0.1
)

model_downsampled = RandomForestClassifier(n_estimators=38, max_depth=18, random_state=12345)
model_downsampled.fit(features_downsampled, target_downsampled)
predicted_valid_downsampled = model_downsampled.predict(features_valid)
f1_downsampled = f1_score(target_valid, predicted_valid_downsampled)
auc_roc_downsampled = roc_auc_score(target_valid, predicted_valid_downsampled)
print(f'F1 Score on best model with downsampling: {f1_downsampled}')
print(f'AUC_ROC score on best model with downsampling: {auc_roc_downsampled}')

F1 Score on best model with downsampling: 0.4873294346978557
AUC_ROC score on best model with downsampling: 0.7173852014933686


**Balancing Techniques Examination**

- Techniques used: class weighting (`class_weight='balanced'`), upsampling, downsampling
- Best technique by validation F1 score: upsampling (F1 of 0.61, versus 0.59 for class weighting and 0.49 for downsampling)
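
The `repeat=10` and `fraction=0.1` values used above go well past an even class balance when the minority class is about 20% of the data. As a rough sketch, the hypothetical helper below (not part of the notebook) derives factors that roughly equalize the two classes. It assumes class 1 is the minority and that both classes are present.

```python
import pandas as pd

def balancing_factors(target):
    """Return (repeat, fraction) that roughly equalize classes 0 and 1."""
    n_zeros = int((target == 0).sum())
    n_ones = int((target == 1).sum())
    repeat = max(1, round(n_zeros / n_ones))  # copies of class 1 for upsampling
    fraction = min(1.0, n_ones / n_zeros)     # share of class 0 kept when downsampling
    return repeat, fraction

# Hypothetical target with an 80/20 split, similar to the share described earlier.
demo_target = pd.Series([0] * 800 + [1] * 200)
repeat, fraction = balancing_factors(demo_target)
print(repeat, fraction)

# For comparison: share of class 1 after upsampling this target with repeat=10.
share_after_repeat_10 = (200 * 10) / (800 + 200 * 10)
print(f"{share_after_repeat_10:.2f}")
```

Whether an exactly even balance gives the best F1 score is an empirical question, so these factors are better treated as a starting point to compare on the validation set.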

In [13]:
#Final Model based on all findings

final_model = RandomForestClassifier(n_estimators=38, max_depth=18, random_state=12345)
final_model.fit(features_upsampled, target_upsampled)
predicted_test = final_model.predict(features_test)
f1 = f1_score(target_test, predicted_test)
auc_roc = roc_auc_score(target_test, predicted_test)
print(f'F1 Score on final model (upsampling), test set: {f1}')
print(f'AUC_ROC score on final model (upsampling), test set: {auc_roc}')

F1 Score on final model (upsampling), test set: 0.15887850467289721
AUC_ROC score on final model (upsampling), test set: 0.4636141815942022


In [14]:
probs = final_model.predict_proba(features_valid)[:, 1]
thresholds = [i/100 for i in range(30, 70)]  # try thresholds from 0.30 to 0.69
best_threshold = 0.5
best_f1 = 0

for t in thresholds:
    preds = (probs > t).astype(int)
    score = f1_score(target_valid, preds)
    if score > best_f1:
        best_f1 = score
        best_threshold = t

print(f"Best F1 score: {best_f1:.4f} at threshold {best_threshold}")


Best F1 score: 0.6173 at threshold 0.49


In [15]:
probs_test = final_model.predict_proba(features_test)[:, 1]
final_preds = (probs_test > best_threshold).astype(int)

f1 = f1_score(target_test, final_preds)
auc = roc_auc_score(target_test, final_preds)

print(f'Final F1 score on test set: {f1:.4f}')
print(f'Final AUC-ROC score on test set: {auc:.4f}')

Final F1 score on test set: 0.1695
Final AUC-ROC score on test set: 0.4678


### Conclusion

In this project, we developed a machine learning model to predict customer churn for **Beta Bank**, with a focus on maximizing the **F1 score** to account for class imbalance. The objective was to help the bank proactively identify customers at risk of leaving, as retaining existing clients is more cost-effective than acquiring new ones.

**Key Highlights:**

* The data was thoroughly preprocessed, including handling missing values, encoding categorical variables, and scaling numerical features.
* Class imbalance was addressed by upsampling the minority class, after also comparing `class_weight='balanced'` and downsampling.
* Several models were evaluated, and the best-performing one was selected based on validation performance.

**Final Model Performance:**

* **F1 Score (Validation Set):** approximately **0.61**
* **ROC AUC Score (Validation Set):** approximately **0.75**

On the validation set, these results exceed the project requirement of an F1 score of 0.59, indicating that the model can identify customers likely to churn while balancing precision and recall. The test-set run in cell 15 scored much lower (F1 of about 0.17), so the validation figures should not be read as final held-out performance.

**Next Steps and Recommendations:**

* Tune the classification threshold to better align with the bank’s risk tolerance and operational priorities.
* Explore more advanced algorithms for potentially improved results.
* Consider incorporating additional customer data, such as product usage patterns or customer service interactions, to further enhance predictive power.
* Deploy the model in a production environment for real-time churn monitoring and intervention.

With this model, Beta Bank gains a valuable tool for improving customer retention and reducing churn, supporting long-term business sustainability.


In [16]:
logreg_model = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42)
logreg_model.fit(features_train, target_train)

# Predict probabilities instead of class labels
probs_valid = logreg_model.predict_proba(features_valid)[:, 1]

# Try different thresholds to find the best F1 score
best_threshold = 0.5
best_f1 = 0
for t in [i/100 for i in range(20, 80)]:
    preds = (probs_valid > t).astype(int)
    score = f1_score(target_valid, preds)
    if score > best_f1:
        best_f1 = score
        best_threshold = t

print(f'Best threshold: {best_threshold}, Best F1 on validation: {best_f1:.4f}')


Best threshold: 0.56, Best F1 on validation: 0.5253


In [17]:
probs_test = logreg_model.predict_proba(features_test)[:, 1]
final_preds = (probs_test > best_threshold).astype(int)

f1 = f1_score(target_test, final_preds)
auc = roc_auc_score(target_test, final_preds)

print(f'Final F1 score on test set: {f1:.4f}')
print(f'Final AUC-ROC score on test set: {auc:.4f}')

Final F1 score on test set: 0.3540
Final AUC-ROC score on test set: 0.5699


<div class="alert alert-info">
<b> Student's comment</b>

Okay, after making predictions on the test set, I don't get nearly as good an F1 score. I tried a logistic regression model and implemented threshold tuning to see if that would work better, but I'm still far off the required target. Do you have any suggestions? 
</div>
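
One possible contributor to the gap, judging only from the cells shown above: `scaler.transform` is applied to `features_train` and `features_valid` in cell 9, but no cell applies it to `features_test`. The final model would then be trained on scaled features and scored on unscaled ones. The sketch below shows the usual pattern on hypothetical miniature splits (the `demo_*` frames are made up; only the column names come from the notebook): fit the scaler on the training split, then transform all three splits, working on explicit copies to avoid the `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature splits; column names borrowed from the notebook.
numeric = ["CreditScore", "Balance"]
rng = np.random.default_rng(0)

def make_split(n):
    return pd.DataFrame({
        "CreditScore": rng.normal(650, 100, size=n),
        "Balance": rng.normal(75000, 60000, size=n),
        "HasCrCard": rng.integers(0, 2, size=n),
    })

demo_train, demo_valid, demo_test = make_split(600), make_split(200), make_split(200)

# Fit on the training split only, so nothing leaks from validation or test.
scaler = StandardScaler().fit(demo_train[numeric])

scaled = {}
for name, split in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    split = split.copy()  # explicit copy avoids SettingWithCopyWarning
    split[numeric] = scaler.transform(split[numeric])
    scaled[name] = split

print(scaled["test"][numeric].describe().loc[["mean", "std"]])
```

If this is indeed the cause, the test metrics would need to be recomputed after scaling `features_test` with the already-fitted scaler.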

<div class="alert alert-success">
    <b>Reviewer's comment – Iteration 2:</b><br>
Logistic regression is less robust than random forest here because the former is a linear model, and the patterns in this dataset are not well captured by linearity. So it's unlikely that we'll be able to achieve comparable results with logistic regression. (The other side of the coin is that logistic regression is less prone to overfitting than random forest.)

For your reference, I've also included a snippet of how GridSearchCV can simplify the tuning process a bit. Note that it incorporates cross-validation, which does not require a separate validation set: just training and testing sets will do, as seen below (`X` stands for the features, and `y` for the target).

For more information on cross-validation, you can peruse this page: https://scikit-learn.org/stable/modules/cross_validation.html (the diagram is helpful).
</div>

In [20]:
# Reviewer Code
from sklearn.model_selection import GridSearchCV

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

In [19]:
param_grid_rf = {'n_estimators': [100, 200, 300], 
                 'max_depth': [10, 20, 30, None], 
                 'min_samples_split': [2, 5, 10], 
                 'class_weight': [None, 'balanced'] }
randomf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, scoring="f1", cv=3, n_jobs=-1)
randomf.fit(X_train, y_train)
best_rf = randomf.best_estimator_

y_pred = best_rf.predict(X_test)
f1_score_best_rf = f1_score(y_test, y_pred)
f1_score_best_rf

0.6201183431952663
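
After a grid search like the one above, it can help to check which combination won and what its cross-validated score was before looking at the test set. The sketch below does that on synthetic data with a deliberately tiny grid so it runs quickly. `X_demo`, `y_demo`, and `small_grid` are illustrative names, not part of the reviewer's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, chosen only to keep the example fast.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
small_grid = {"n_estimators": [20, 40], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, scoring="f1", cv=3
)
search.fit(X_demo, y_demo)

print(search.best_params_)          # winning hyperparameter combination
print(f"{search.best_score_:.3f}")  # mean cross-validated F1 of that combination
```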