<div style="border:solid green 2px; padding: 20px">

**Hello Jason,**

My name is **John Dickson** (https://hub.tripleten.com/u/3cb57352) and today I'll be reviewing your project!

You’ll find specific notes inside the project file, marked green, yellow or red.


**Note:** Please do not remove or change my comments. They will help me in future reviews and will make the process smoother for both of us. 

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment</b> 
    
Green comments mark efficient solutions and good ideas that can be reused in other projects. They also point out document formatting that was done for you in this project but that you will need to do yourself in future ones.
</div>

<div class="alert alert-warning"; style="border-left: 7px solid gold">
<b>⚠️ Reviewer's comment</b> 
    
Yellow comments indicate that there is room for optimisation. The correction is not required, but it is worth implementing.
</div>

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment</b> 
    
If you see such a comment, it means that there is a problem that needs to be fixed. Please note that I won't be able to accept your project until the issue is resolved.
</div>

---
    
You are also welcome to leave your own comments, explain any changes you've made, or ask questions by marking them with a different color. You can use the example below (copy the code and use it in a Markdown-type cell):

```
    
<div class="alert alert-info"; style="border-left: 7px solid blue">
<b>Student’s Comment</b></div>

```
    
It will appear like this:
    
<div class="alert alert-info"; style="border-left: 7px solid blue">
<b>Student’s Comment</b></div>
</div>

<div style="border:solid Red 2px; padding: 20px">
 
**What Was Great:**

- You have checked the class imbalance
- Correctly one hot encoded the data and filled missing data.
- Correctly identified irrelevant features

**What could be improved:**

- You used only one model in the project. The instructions say "use different models", so at least 2 types of models are expected.
- All library imports should be in the first cell, and data loading should be in a second cell. Library imports should not be scattered throughout the project.
- You should be doing some hyperparameter tuning.

---

Overall you have done well with the project. You just missed a few small details, and I am sure you will get them cleaned up for next time. 

<div style="border:solid Green 2px; padding: 20px">
Great work addressing the issues here. 
</div>

<div style="border:solid Red 2px; padding: 20px">
 
**What Was improved:**

 - Added introduction and summary
 - Fixed scaling
 - imports have been added to the first cell

**What could be improved:**

- You only used the second model once. Both models should be used throughout.
- There are duplicate imports. Each library should be imported only once.
- **You should be doing some hyperparameter tuning.**

---

<div style="border:solid Green 2px; padding: 20px">

Great work with addressing these.
</div>

<div class="alert alert-block alert-info">
<b>Reviewer's comment v3:</b>

You did a great job in this project! I left you a couple of comments to help you address some details before approving it!

Looking forward to reviewing your next submissions! Best of luck!
    
</div>

<div style="border:solid Red 2px; padding: 20px">

 V4  
**What Was improved:**

 - Everything was improved, Great progress!


**What could be improved:**

- You have not trained 'RandomForestClassifier' without taking class imbalance into account. This is the only thing that is missing. 

---

You are so close to completion. This one addition will push you over the line. 

<div style="border:solid Green 2px; padding: 20px">

 V5  
**What Was improved:**

 - We now have the final piece



---

Well done! Time to move on to the next Project. 

**Introduction** 
    
The goal of this project is to predict customer churn for Beta Bank using historical customer data. The primary metric is F1 score, with a minimum threshold of 0.59. AUC-ROC is also used to evaluate model quality. The project includes data preprocessing, class imbalance handling, model comparison, and final testing.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.utils import resample

data = pd.read_csv('/datasets/Churn.csv')
print(data.info())
print(data.describe())
print(data['Exited'].value_counts(normalize=True))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None
         RowNumber    CustomerId   CreditScore           Age       Tenure  \
count  10000

In [2]:

# Drop irrelevant columns
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

# Fill missing values
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())

# Encode categorical variables
data = pd.get_dummies(data, drop_first=True)

# Separate features and target
features = data.drop('Exited', axis=1)
target = data['Exited']

# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    features, target, test_size=0.20, random_state=12345, stratify=target
)

# Second split: train/validation from remaining 80% (60/20/20 overall)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=12345, stratify=y_temp
)

# SCALING
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment V1</b> 
    
Correctly dropped irrelevant columns, one-hot encoded the data, and filled missing values. You have also correctly split the data into features and target. 
</div>
<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment V1</b> 
    
<strike>Scaling should be done after splitting into different datasets. When scaling we fit the data to the training set and transform all the datasets based on that fit.</strike>
</div>

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment V1</b> 
    
<strike>There is no dataset here created for performing final testing.</strike>
</div>

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment V2</b> 
    
<strike>There is still no dataset here created for performing final testing.</strike>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment v3:</b>

Awesome job creating the train, test and validation datasets!
    
</div>

- Dropped irrelevant columns: `RowNumber`, `CustomerId`, `Surname`  
  *These columns do not contain predictive information and may introduce noise.*
- Imputed missing values in `Tenure` using the median  
  *This ensures that the model can train without errors due to missing data, while preserving the central tendency of the feature.*
- Encoded categorical variables (`Geography`, `Gender`) using one-hot encoding  
  *This converts non-numeric features into a format suitable for machine learning models.*
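- Scaled all features using `StandardScaler`, fitted on the training set only and then applied to the validation and test sets  
  *Standardization improves model convergence and puts features on comparable scales; fitting on the training set alone keeps validation and test information out of the scaler.*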
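
As a quick check of the preprocessing logic, the sketch below uses plain Python rather than scikit-learn. The helper names (`split_fractions`, `fit_standardizer`, `standardize`) and the sample values are illustrative assumptions, not part of the project code. It shows that holding out 20% for testing and then 25% of the remainder for validation gives roughly a 60/20/20 split, and that the standardization statistics are learned from the training values only.

```python
from statistics import mean, pstdev


def split_fractions(test_size, valid_size_of_rest):
    """Return overall (train, valid, test) fractions for a two-step split."""
    rest = 1.0 - test_size
    valid = rest * valid_size_of_rest
    return rest - valid, valid, test_size


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply previously learned statistics to any split."""
    return [(v - mu) / sigma for v in values]


train_frac, valid_frac, test_frac = split_fractions(0.20, 0.25)
print(round(train_frac, 2), round(valid_frac, 2), round(test_frac, 2))

# Hypothetical feature values, for illustration only
train_values = [1.0, 2.0, 3.0]
valid_values = [2.0, 5.0]

mu, sigma = fit_standardizer(train_values)
train_scaled = standardize(train_values, mu, sigma)
valid_scaled = standardize(valid_values, mu, sigma)  # reuses the training statistics
print(train_scaled, valid_scaled)
```

The validation values are not centred on zero after scaling, which is expected: they are expressed relative to the training distribution.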


In [3]:
# HANDLE CLASS IMBALANCE (UPSAMPLING)

train_df = pd.DataFrame(X_train_scaled)
train_df['Exited'] = y_train.values

majority = train_df[train_df['Exited'] == 0]
minority = train_df[train_df['Exited'] == 1]

minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=12345)
upsampled = pd.concat([majority, minority_upsampled])

X_train_up = upsampled.drop('Exited', axis=1)
y_train_up = upsampled['Exited']
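
The upsampling step can be illustrated without scikit-learn. The sketch below is a simplified stand-in for `resample(..., replace=True)`: it uses Python's `random` module instead, and the function name `upsample_minority` and the toy rows are assumptions made for illustration.

```python
import random


def upsample_minority(rows, label_key="Exited", seed=12345):
    """Sample the smaller class with replacement until both classes have equal size."""
    majority = [r for r in rows if r[label_key] == 0]
    minority = [r for r in rows if r[label_key] == 1]
    if len(minority) > len(majority):
        majority, minority = minority, majority
    rng = random.Random(seed)
    resampled = rng.choices(minority, k=len(majority))
    return majority + resampled


# Toy data (hypothetical): 8 customers stayed, 2 left
rows = [{"Age": 30 + i, "Exited": 0} for i in range(8)]
rows += [{"Age": 50 + i, "Exited": 1} for i in range(2)]

balanced = upsample_minority(rows)
counts = {label: sum(r["Exited"] == label for r in balanced) for label in (0, 1)}
print(counts)  # {0: 8, 1: 8}
```

As in the project, only the training split is upsampled; the validation and test sets keep their original class ratio.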

In [4]:
# MODEL 1: LOGISTIC REGRESSION

print("\n--- Logistic Regression ---")

# Base model
log_model = LogisticRegression(random_state=12345, max_iter=1000)
log_model.fit(X_train_scaled, y_train)
log_preds = log_model.predict(X_valid_scaled)

print("Base Logistic Regression F1:", f1_score(y_valid, log_preds))
print("Base Logistic Regression AUC-ROC:", roc_auc_score(y_valid, log_model.predict_proba(X_valid_scaled)[:, 1]))

# Upsampled model
log_model_up = LogisticRegression(random_state=12345, max_iter=1000)
log_model_up.fit(X_train_up, y_train_up)
log_preds_up = log_model_up.predict(X_valid_scaled)

print("Upsampled Logistic Regression F1:", f1_score(y_valid, log_preds_up))
print("Upsampled Logistic Regression AUC-ROC:", roc_auc_score(y_valid, log_model_up.predict_proba(X_valid_scaled)[:, 1]))

# Hyperparameter tuning
param_grid_lr = {'C': [0.1, 1, 10]}
grid_lr = GridSearchCV(LogisticRegression(max_iter=1000, random_state=12345),
                       param_grid_lr, scoring='f1', cv=3, n_jobs=-1)
grid_lr.fit(X_train_up, y_train_up)
best_log_model = grid_lr.best_estimator_

print("Best Logistic Regression Params:", grid_lr.best_params_)
log_best_preds = best_log_model.predict(X_valid_scaled)
print("Tuned Logistic Regression F1:", f1_score(y_valid, log_best_preds))
print("Tuned Logistic Regression AUC-ROC:", roc_auc_score(y_valid, best_log_model.predict_proba(X_valid_scaled)[:, 1]))


--- Logistic Regression ---
Base Logistic Regression F1: 0.3214953271028037
Base Logistic Regression AUC-ROC: 0.7874608044099569
Upsampled Logistic Regression F1: 0.5125541125541125
Upsampled Logistic Regression AUC-ROC: 0.7924179958078265
Best Logistic Regression Params: {'C': 0.1}
Tuned Logistic Regression F1: 0.5125541125541125
Tuned Logistic Regression AUC-ROC: 0.7924365043009112


In [5]:
# BASELINE: RANDOM FOREST WITHOUT CLASS IMBALANCE HANDLING
print("\n--- Random Forest (Baseline - No Imbalance Handling) ---")

# Train on original imbalanced data without any imbalance handling
rf_baseline = RandomForestClassifier(random_state=12345)
rf_baseline.fit(X_train_scaled, y_train)

# Validate performance
rf_baseline_preds = rf_baseline.predict(X_valid_scaled)
print("Baseline Random Forest F1:", f1_score(y_valid, rf_baseline_preds))
print("Baseline Random Forest AUC-ROC:", roc_auc_score(y_valid, rf_baseline.predict_proba(X_valid_scaled)[:, 1]))

# Basic hyperparameter tuning for fair comparison
param_grid_baseline = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 2, 3]
    # No class_weight parameter - this is the baseline
}

grid_baseline = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid_baseline,
    scoring='f1',
    cv=3,
    n_jobs=-1
)

grid_baseline.fit(X_train_scaled, y_train)
best_baseline_rf = grid_baseline.best_estimator_

# Validate tuned baseline model
rf_baseline_tuned_preds = best_baseline_rf.predict(X_valid_scaled)
print("Best Baseline RF Params:", grid_baseline.best_params_)
print("Tuned Baseline RF F1:", f1_score(y_valid, rf_baseline_tuned_preds))
print("Tuned Baseline RF AUC-ROC:", roc_auc_score(y_valid, best_baseline_rf.predict_proba(X_valid_scaled)[:, 1]))


--- Random Forest (Baseline - No Imbalance Handling) ---
Baseline Random Forest F1: 0.5555555555555556
Baseline Random Forest AUC-ROC: 0.8534003957732772
Best Baseline RF Params: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Tuned Baseline RF F1: 0.5656877897990726
Tuned Baseline RF AUC-ROC: 0.8553669231635334


In [6]:

# MODEL 2: RANDOM FOREST (CLASS WEIGHTING)

print("\n--- Random Forest(original) ---")

# Train on the original (non-upsampled) data with class_weight='balanced'
rf_model_weighted = RandomForestClassifier(class_weight='balanced', random_state=12345)
rf_model_weighted.fit(X_train_scaled, y_train)

# Validate performance
rf_preds_weighted = rf_model_weighted.predict(X_valid_scaled)
print("Weighted Random Forest F1:", f1_score(y_valid, rf_preds_weighted))
print("Weighted Random Forest AUC-ROC:", roc_auc_score(y_valid, rf_model_weighted.predict_proba(X_valid_scaled)[:, 1]))

# Hyperparameter tuning on the original (non-upsampled) data with class weighting
param_grid_weighted = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 2, 3],
    'class_weight': ['balanced']  # Only test weighted here
}

grid_weighted = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid_weighted,
    scoring='f1',
    cv=3,
    n_jobs=-1
)
grid_weighted.fit(X_train_scaled, y_train)
best_weighted_rf = grid_weighted.best_estimator_

# Validate tuned model
rf_weighted_preds = best_weighted_rf.predict(X_valid_scaled)
print("Best Weighted RF Params:", grid_weighted.best_params_)
print("Tuned Weighted RF F1:", f1_score(y_valid, rf_weighted_preds))
print("Tuned Weighted RF AUC-ROC:", roc_auc_score(y_valid, best_weighted_rf.predict_proba(X_valid_scaled)[:, 1]))

# THRESHOLD OPTIMIZATION

# Adjust threshold to maximize F1 on validation
probs_valid = best_weighted_rf.predict_proba(X_valid_scaled)[:, 1]
best_threshold = 0.5
best_f1 = 0

for t in [x / 100 for x in range(20, 80)]:
    preds = (probs_valid > t).astype(int)
    score = f1_score(y_valid, preds)
    if score > best_f1:
        best_f1 = score
        best_threshold = t

print(f"\nOptimal threshold based on validation: {best_threshold:.2f}")
print(f"Best validation F1 with threshold: {best_f1:.3f}")



--- Random Forest(original) ---
Weighted Random Forest F1: 0.5522620904836193
Weighted Random Forest AUC-ROC: 0.8493747985273408
Best Weighted RF Params: {'class_weight': 'balanced', 'max_depth': 15, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 200}
Tuned Weighted RF F1: 0.6293888166449935
Tuned Weighted RF AUC-ROC: 0.8671028501536977

Optimal threshold based on validation: 0.42
Best validation F1 with threshold: 0.647
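
For context on what `class_weight='balanced'` does: scikit-learn documents the balanced weights as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python. The function name `balanced_class_weights` and the 80/20 label counts are hypothetical and are not the project's actual class proportions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels: 800 negatives, 200 positives
labels = [0] * 800 + [1] * 200
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so its misclassifications count for more during training without any rows being duplicated.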


In [7]:
# MODEL 2: RANDOM FOREST (UPSAMPLING)

print("\n--- Random Forest(upsample) ---")

# Train on upsampled data
rf_model_up = RandomForestClassifier(random_state=12345)
rf_model_up.fit(X_train_up, y_train_up)

# Validate performance
rf_preds_up = rf_model_up.predict(X_valid_scaled)
print("Upsampled Random Forest F1:", f1_score(y_valid, rf_preds_up))
print("Upsampled Random Forest AUC-ROC:", roc_auc_score(y_valid, rf_model_up.predict_proba(X_valid_scaled)[:, 1]))

# Hyperparameter tuning (broader grid)
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 2, 3]
    # No class_weight here: this grid tests upsampling only
}

grid_rf = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid_rf,
    scoring='f1',
    cv=3,
    n_jobs=-1
)
grid_rf.fit(X_train_up, y_train_up)
best_rf = grid_rf.best_estimator_

rf_best_preds = best_rf.predict(X_valid_scaled)
print("Best Random Forest Params:", grid_rf.best_params_)
print("Tuned Random Forest F1:", f1_score(y_valid, rf_best_preds))
print("Tuned Random Forest AUC-ROC:", roc_auc_score(y_valid, best_rf.predict_proba(X_valid_scaled)[:, 1]))

# THRESHOLD OPTIMIZATION
# Adjust threshold to maximize F1 on validation
probs_valid = best_rf.predict_proba(X_valid_scaled)[:, 1]
best_threshold = 0.5
best_f1 = 0

for t in [x / 100 for x in range(20, 80)]:
    preds = (probs_valid > t).astype(int)
    score = f1_score(y_valid, preds)
    if score > best_f1:
        best_f1 = score
        best_threshold = t

# Calculate F1 scores for comparison
best_weighted_rf_f1 = f1_score(y_valid, best_weighted_rf.predict(X_valid_scaled))
best_upsampled_rf_f1 = f1_score(y_valid, best_rf.predict(X_valid_scaled))


print(f"\nOptimal threshold based on validation: {best_threshold:.2f}")
print(f"Best validation F1 with threshold: {best_f1:.3f}")

print("\n=== APPROACH COMPARISON ===")
print(f"Class Weighting F1: {best_weighted_rf_f1:.3f}")
print(f"Upsampling F1: {best_upsampled_rf_f1:.3f}")


--- Random Forest(upsample) ---
Upsampled Random Forest F1: 0.5938375350140056
Upsampled Random Forest AUC-ROC: 0.8564188225205175
Best Random Forest Params: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 300}
Tuned Random Forest F1: 0.5949720670391062
Tuned Random Forest AUC-ROC: 0.8580329173549512

Optimal threshold based on validation: 0.40
Best validation F1 with threshold: 0.626

=== APPROACH COMPARISON ===
Class Weighting F1: 0.629
Upsampling F1: 0.595
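
The threshold search used in the two Random Forest cells can be tried on a small example. The sketch below is a pure-Python version with hypothetical labels and probabilities. `f1_binary` and `find_best_threshold` are illustrative names; the project itself uses `f1_score` from scikit-learn.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class, computed from TP, FP and FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def find_best_threshold(y_true, probs, thresholds):
    """Return the first threshold that gives the highest F1, with that F1."""
    best_t, best_score = 0.5, 0.0
    for t in thresholds:
        preds = [int(p > t) for p in probs]
        score = f1_binary(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical validation labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.45, 0.40, 0.80, 0.30, 0.60]
thresholds = [x / 100 for x in range(20, 80)]

best_t, best_score = find_best_threshold(y_true, probs, thresholds)
print(best_t, round(best_score, 3))
```

Because the threshold is chosen on the validation set, the validation F1 at that threshold tends to be optimistic; the score on the separate test set is the fairer estimate.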


<div class="alert alert-block alert-danger">
<b>Reviewer's comment v3:</b>

Great work including `class_weight='balanced'` in your tuning!

However, since you’re already training on upsampled data, this doesn’t act as a separate imbalance-handling approach.

Try testing it without upsampling to properly compare both methods (Project instructions 3: Make sure you use at least two approaches to fixing class imbalance).

</div>

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment V4</b> 
 
You have now done the 2 approaches for testing class imbalance, and trained the models on those. You are now missing this part `Train the model without taking into account the imbalance. Briefly describe your findings.` Note that using `class_weight='balanced'` is considered an approach to fixing class imbalance, so we would first need to train a model without this. 
</div>

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment V5</b> 
    
Great, you now have the unbalanced model and two models with the class imbalance fixed.
</div>

Comment: This was added above

## Model Performance Comparison

Validation-set scores for the tuned Random Forest models at the default 0.5 threshold:

- **Baseline Random Forest (No Imbalance Handling)**  
  - F1 Score: 0.566  
  - AUC-ROC: 0.855

- **Class Weighting Approach (`class_weight='balanced'`)**  
  - F1 Score: 0.629  
  - AUC-ROC: 0.867

- **Upsampling Approach**  
  - F1 Score: 0.595  
  - AUC-ROC: 0.858

---

###  Key Insights

- Class weighting performed best on F1, reaching **0.629** on the validation set, about 0.04 above the required threshold of **0.59**.
- The baseline model reached an F1 of **0.566** without any imbalance handling, which is close to but still below the 0.59 threshold.
- The upsampling approach was moderately effective with F1 = **0.595**, better than the baseline but below class weighting.
- All three approaches achieved similar AUC-ROC scores (roughly **0.85 to 0.87**), indicating that ranking ability depends little on the imbalance handling method.
- Both imbalance correction methods improved F1 over the baseline: class weighting by **+0.063** and upsampling by **+0.029**.
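
The improvement figures follow directly from the reported validation scores. A minimal sketch of the arithmetic, using the tuned F1 values from the cell outputs (the dictionary and variable names are illustrative):

```python
REQUIRED_F1 = 0.59

# Tuned validation F1 scores, rounded as reported in the cell outputs
validation_f1 = {
    "baseline": 0.566,
    "class_weighting": 0.629,
    "upsampling": 0.595,
}

baseline = validation_f1["baseline"]
gains = {}
for name, score in validation_f1.items():
    gains[name] = round(score - baseline, 3)
    meets = score >= REQUIRED_F1
    print(f"{name}: F1={score:.3f}, gain={gains[name]:+.3f}, meets threshold={meets}")
```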

In [8]:
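# MODEL COMPARISON
# Compares the tuned logistic regression with the tuned upsampled random forest (best_rf);
# the class-weighted forest (best_weighted_rf) is not part of this comparison.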

log_f1 = f1_score(y_valid, log_best_preds)
rf_f1 = f1_score(y_valid, rf_best_preds)

if rf_f1 > log_f1:
    best_model = best_rf
    best_model_name = "Random Forest"
else:
    best_model = best_log_model
    best_model_name = "Logistic Regression"

print(f"\nBest model selected: {best_model_name}")



Best model selected: Random Forest


<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment V1</b> 
    
<strike>You need to use at least 2 different models.</strike>

<strike>There is no final testing, and there is no clear project introduction and summary.</strike>

</div>

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment V1</b> 
    
<strike>You need to use at least 2 different models for this project, you have used the second model only once.</strike>

<strike>There is no final testing using a separate test set. Please include some form of analysis in the conclusion.</strike>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment v3:</b>

Good job training two different models and performing the final testing using a separate test set in the section below!

</div>

In [9]:
# FINAL TEST EVALUATION

# Evaluate the selected model (tuned Random Forest trained on upsampled data)
# using the threshold chosen on the validation set in the upsampling cell
probs_test = best_rf.predict_proba(X_test_scaled)[:, 1]
test_preds = (probs_test > best_threshold).astype(int)

# Compute metrics
test_f1 = f1_score(y_test, test_preds)
test_auc = roc_auc_score(y_test, probs_test)

print("\n--- Final Test Evaluation ---")
print(f"Optimal Threshold: {best_threshold:.2f}")
print(f"Final Test F1 Score: {test_f1:.3f}")
print(f"Final Test AUC-ROC: {test_auc:.3f}")



--- Final Test Evaluation ---
Optimal Threshold: 0.40
Final Test F1 Score: 0.606
Final Test AUC-ROC: 0.859


<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment V2</b> 
    
<strike>The test dataset should not have any data overlapping with the training or validation sets. This test set is exactly the same as the validation set which is why the scores did not change.</strike>
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment v3:</b>

Great job doing the final testing using a separate test set! You now have 2 different datasets (one for validation and the other for testing), so your scores change!
    
</div>

**Summary** 

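After preprocessing and addressing class imbalance, two model types (logistic regression and random forest) were trained and compared on the validation set. The model taken to final testing was a RandomForestClassifier trained on upsampled data, with the decision threshold set to 0.40. On the test set it scored F1 = 0.606, above the required 0.59, with AUC-ROC = 0.859. The class-weighted random forest scored higher on validation (F1 = 0.629) but was not evaluated on the test set. The project meets the required F1 threshold and is ready for submission.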