**Review**

Hello Leonardo!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  Thank you so much for your feedback. I've split the cells into several smaller ones so they're easier to read. Hopefully I got it right this time. Thank you!
</div>
  
First of all, thank you for turning in the project! You did a great job overall, but there are some small problems that need to be fixed before the project will be accepted. Let me know if you have any questions!


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

I can't accept the work in its current state.

1. Give your work a clear structure: introduction, titles, subtitles, intermediate conclusions, and a final conclusion.
2. Train some models before addressing the imbalance, investigate the imbalance, try at least two ways of handling it, try several models, and tune hyperparameters for at least one of them. Please follow the project description step by step.

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Use Markdown cells for text.

</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

To encode categorical features, it's better to use the corresponding sklearn methods rather than manual encoding.

</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

We should scale only the quantitative features, not all of them.

</div>
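
The two suggestions above (encode with sklearn, scale only quantitative columns) can be combined in one preprocessing step. The following is a minimal sketch, not the notebook's actual code: the toy rows and the `Age`/`Balance` column choice are assumptions made for illustration, and in the real workflow the transformer would be fitted on the training split only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows standing in for the churn data; the values are made up.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [42, 35, 51, 28],
    "Balance": [0.0, 83807.86, 159660.80, 12000.0],
})

categorical_cols = ["Geography", "Gender"]
numeric_cols = ["Age", "Balance"]

preprocess = ColumnTransformer(
    transformers=[
        # One-hot encode categories, dropping the first level of each
        # (levels are sorted alphabetically, so France and Female are dropped).
        ("cat", OneHotEncoder(drop="first"), categorical_cols),
        # Standardize only the quantitative columns.
        ("num", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: Geography_Germany, Geography_Spain, Gender_Male, Age, Balance
encoded = preprocess.fit_transform(toy)
print(encoded.shape)
```

With a real train/test split, `preprocess.fit_transform(X_train)` followed by `preprocess.transform(X_test)` keeps test-set statistics out of the scaler.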

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

1. Before addressing the imbalance, you have another task: "Train the model without taking into account the imbalance. Briefly describe your findings."
2. You didn't use class_weight_dict below. So what is the purpose of having it in the code?

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

1. Use Markdown cells for text.
2. You need to try some models before addressing the imbalance.
3. You need to try at least two ways of handling the imbalance.
4. You need to try several models while handling the imbalance and tune hyperparameters for at least one of them.

</div>

# **Churn Prediction: Addressing Class Imbalance and Model Performance**

## **Project Overview**
Customer churn is a major concern for businesses, as it directly impacts revenue and customer retention.  
This project aims to predict customer churn using machine learning while addressing class imbalance.  

## **Objectives**
1. Train a baseline model without handling class imbalance.
2. Evaluate its performance using F1-score and AUC-ROC.
3. Investigate class imbalance and its effect on predictions.
4. Handle imbalance with class weighting and random oversampling, tune the Random Forest hyperparameters, and compare results.
5. Conclude on the best approach for accurate churn prediction.
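
To make the F1 metric from the objectives concrete, here is a small sketch of how it follows from prediction counts. The function name `f1_from_counts` and the counts are hypothetical and chosen only for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 60 churners caught, 20 false alarms, 40 churners missed
example_f1 = f1_from_counts(tp=60, fp=20, fn=40)
print(round(example_f1, 4))

# A model that never predicts churn has no true positives, so its F1 is 0
never_churn_f1 = f1_from_counts(tp=0, fp=0, fn=100)
print(never_churn_f1)
```

This is why F1 is more informative than accuracy on imbalanced data: predicting "stayed" for everyone can look accurate while scoring zero on the churn class.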

## **Dataset Information**
The dataset includes customer demographic details, account history, and churn status.  
We will preprocess the data, encode categorical variables, scale numerical features, and train machine learning models.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.utils.class_weight import compute_class_weight

df = pd.read_csv('/datasets/Churn.csv')

<div class="alert alert-block alert-success">
<b>Reviewer's comment V6</b> <a class="tocSkip"></a>

Correct

</div>

In [2]:
# 4. Data Preprocessing
print("\n### Data Preprocessing ###")

# Drop irrelevant columns
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

# Convert categorical columns to numerical values
df['Geography'] = df['Geography'].map({'France': 0, 'Spain': 1, 'Germany': 2})
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Split the dataset into features (X) and target (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

# Treat infinite values as missing, then fill every gap with the column median
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(X.median())


### Data Preprocessing ###


<div class="alert alert-block alert-success">
<b>Reviewer's comment V6</b> <a class="tocSkip"></a>

Good job!

</div>

In [3]:
# 5. Splitting and Scaling Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

<div class="alert alert-block alert-success">
<b>Reviewer's comment V6</b> <a class="tocSkip"></a>

Correct

</div>

In [4]:
# 6. Train Baseline Model Without Addressing Imbalance
print("\n### Training the Model Without Addressing Imbalance ###")
rf_no_imbalance = RandomForestClassifier(random_state=42)
rf_no_imbalance.fit(X_train_scaled, y_train)
y_pred_no_imbalance = rf_no_imbalance.predict(X_test_scaled)

# Evaluate Baseline Model
f1_no_imbalance = f1_score(y_test, y_pred_no_imbalance)
roc_auc_no_imbalance = roc_auc_score(y_test, rf_no_imbalance.predict_proba(X_test_scaled)[:, 1])
print(f"\nF1 Score (no imbalance handling): {f1_no_imbalance}")
print(f"AUC-ROC (no imbalance handling): {roc_auc_no_imbalance}")


### Training the Model Without Addressing Imbalance ###

F1 Score (no imbalance handling): 0.5779816513761468
AUC-ROC (no imbalance handling): 0.8501953417207655


In [5]:
# 7. Handle Imbalance with Class Weights (Random Forest)
print("\n### Investigating Class Imbalance ###")
print(y_train.value_counts())

# Compute balanced class weights explicitly so they can be inspected
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = dict(zip(np.unique(y_train), class_weights))

# Train Random Forest with Class Weights
print("\n### Training the Model with Class Weights ###")
# The dict holds the same weights that class_weight='balanced' would derive from y_train
rf_with_imbalance = RandomForestClassifier(random_state=42, class_weight=class_weight_dict)
rf_with_imbalance.fit(X_train_scaled, y_train)
y_pred_with_imbalance = rf_with_imbalance.predict(X_test_scaled)

# Evaluate Model with Class Weights
f1_with_imbalance = f1_score(y_test, y_pred_with_imbalance)
roc_auc_with_imbalance = roc_auc_score(y_test, rf_with_imbalance.predict_proba(X_test_scaled)[:, 1])
print(f"\nF1 Score (with class weights): {f1_with_imbalance}")
print(f"AUC-ROC (with class weights): {roc_auc_with_imbalance}")


### Investigating Class Imbalance ###
0    6370
1    1630
Name: Exited, dtype: int64

### Training the Model with Class Weights ###

F1 Score (with class weights): 0.5705329153605014
AUC-ROC (with class weights): 0.85563838106211
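
For reference, the "balanced" weights follow the formula `n_samples / (n_classes * class_count)`. The sketch below applies it to the class counts printed above; the variable names are illustrative only.

```python
# Class counts taken from the y_train.value_counts() output above
class_counts = {0: 6370, 1: 1630}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" weighting: rarer classes get proportionally larger weights
weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights, each class contributes the same total weight (`n_samples / n_classes`) during training, so the churned customers are no longer outvoted by the majority class.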


In [6]:
from collections import Counter

# 8. Handle Imbalance with Random Oversampling (Logistic Regression)
print("\n### Handling Imbalance with Random Oversampling ###")

# Convert y_train to NumPy array
y_train_np = np.array(y_train)

# Identify the minority class properly
class_counts = Counter(y_train_np)
minority_class = min(class_counts, key=class_counts.get)  # Class with the fewest occurrences

# Get indices of the minority class
minority_indices = np.where(y_train_np == minority_class)[0]

# Ensure minority_indices is not empty
if len(minority_indices) == 0:
    raise ValueError("No samples found for the minority class in y_train.")

# Randomly oversample the minority class
oversampled_indices = np.random.choice(minority_indices, size=class_counts[max(class_counts, key=class_counts.get)], replace=True)

# Append the resampled minority rows to the full training set
X_train_oversampled = np.vstack((X_train_scaled, X_train_scaled[oversampled_indices]))
y_train_oversampled = np.hstack((y_train_np, y_train_np[oversampled_indices]))

# Train Logistic Regression with Oversampled Data
print("\n### Training Logistic Regression Model with Oversampling ###")
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_oversampled, y_train_oversampled)
y_pred_oversampled = log_reg.predict(X_test_scaled)

# Evaluate Logistic Regression with Oversampling
f1_oversampled = f1_score(y_test, y_pred_oversampled)
roc_auc_oversampled = roc_auc_score(y_test, log_reg.predict_proba(X_test_scaled)[:, 1])
print(f"\nF1 Score (Logistic Regression with Oversampling): {f1_oversampled}")
print(f"AUC-ROC (Logistic Regression with Oversampling): {roc_auc_oversampled}")


### Handling Imbalance with Random Oversampling ###

### Training Logistic Regression Model with Oversampling ###

F1 Score (Logistic Regression with Oversampling): 0.4858670741023682
AUC-ROC (Logistic Regression with Oversampling): 0.7752883854578769
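
Note that the cell above draws as many extra minority rows as there are majority rows, so with the counts printed earlier the minority class ends up with 1630 + 6370 = 8000 rows against 6370, and the draw is unseeded. The sketch below is a different variant, offered as an assumption rather than the notebook's method: it draws only the difference between the two class counts and uses a seeded generator. The helper name `oversample_minority` and the toy arrays are hypothetical, and the function assumes a binary target.

```python
import numpy as np

def oversample_minority(X, y, seed=42):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)  # seeded for reproducible draws
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    n_extra = counts.max() - counts.min()  # rows needed to reach parity
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    X_balanced = np.vstack((X, X[extra_idx]))
    y_balanced = np.concatenate((y, y[extra_idx]))
    return X_balanced, y_balanced

# Toy data: 8 rows of class 0 and 2 rows of class 1
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(np.bincount(y_bal))  # both classes now have 8 rows
```

Oversampling should be applied to the training split only, as in the cell above, so that the test set keeps the original class distribution.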


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V8</b> <a class="tocSkip"></a>

Broken code

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V9</b> <a class="tocSkip"></a>

Fixed

</div>

In [8]:
# 9. Hyperparameter Tuning for Random Forest
print("\n### Hyperparameter Tuning for Random Forest ###")
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

rf_tuned = RandomForestClassifier(random_state=42, class_weight='balanced')
grid_search = GridSearchCV(estimator=rf_tuned, param_grid=param_grid, scoring='f1', cv=3, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best parameters and model
best_params = grid_search.best_params_
print(f"\nBest Parameters: {best_params}")

rf_tuned_best = grid_search.best_estimator_
y_pred_tuned = rf_tuned_best.predict(X_test_scaled)

# Evaluate Tuned Random Forest Model
f1_tuned = f1_score(y_test, y_pred_tuned)
roc_auc_tuned = roc_auc_score(y_test, rf_tuned_best.predict_proba(X_test_scaled)[:, 1])
print(f"\nF1 Score (Tuned Random Forest): {f1_tuned}")
print(f"AUC-ROC (Tuned Random Forest): {roc_auc_tuned}")

# Final Conclusion
print("\n### Final Conclusion ###")
print("In these runs, class weights and tuning kept Random Forest F1 and AUC-ROC close to the baseline; Logistic Regression with oversampling scored lowest.")


### Hyperparameter Tuning for Random Forest ###

Best Parameters: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 100}

F1 Score (Tuned Random Forest): 0.5790987535953979
AUC-ROC (Tuned Random Forest): 0.8397750601140431

### Final Conclusion ###
In these runs, class weights and tuning kept Random Forest F1 and AUC-ROC close to the baseline; Logistic Regression with oversampling scored lowest.


<div class="alert alert-block alert-success">
<b>Reviewer's comment V6</b> <a class="tocSkip"></a>

Well done!

</div>
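
To get a feel for the cost of the grid search above, the sketch below counts the parameter combinations in the `param_grid` from the tuning cell and multiplies by the number of cross-validation folds. It uses only the standard library; the variable names `combos`, `cv_folds`, and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as in the tuning cell
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
cv_folds = 3

# Every combination of the listed values is one candidate model
combos = list(product(*param_grid.values()))
n_fits = len(combos) * cv_folds

print(f"{len(combos)} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` trains the best candidate once more on the whole training set after these cross-validation fits.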

### Final Conclusion

The baseline Random Forest, trained without addressing class imbalance, reached F1 ≈ 0.578 and AUC-ROC ≈ 0.850 on the test set.
Adding class weights left the scores almost unchanged (F1 ≈ 0.571, AUC-ROC ≈ 0.856), and tuning the class-weighted Random Forest with grid search gave F1 ≈ 0.579 and AUC-ROC ≈ 0.840.
Logistic Regression trained on randomly oversampled data scored lowest (F1 ≈ 0.486, AUC-ROC ≈ 0.775).
In these runs, handling imbalance did not clearly improve on the baseline, so further tuning and other techniques are worth exploring to better predict the minority class (churned customers).
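
The comparison can be checked by collecting the scores printed in the cells above (rounded to four decimals) and ranking them. This is only a convenience sketch; the dictionary keys are descriptive labels, not names used elsewhere in the notebook.

```python
# (F1, AUC-ROC) pairs copied from the outputs above, rounded to 4 decimals
results = {
    "Random Forest, no imbalance handling": (0.5780, 0.8502),
    "Random Forest, class weights": (0.5705, 0.8556),
    "Logistic Regression, oversampling": (0.4859, 0.7753),
    "Random Forest, class weights + grid search": (0.5791, 0.8398),
}

best_by_f1 = max(results, key=lambda name: results[name][0])
best_by_auc = max(results, key=lambda name: results[name][1])

# Print the models from highest to lowest F1
for name, (f1, auc) in sorted(results.items(), key=lambda item: -item[1][0]):
    print(f"{name:<45} F1={f1:.4f}  AUC-ROC={auc:.4f}")

print("Best F1:", best_by_f1)
print("Best AUC-ROC:", best_by_auc)
```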


I think I sent the wrong code beforehand, sorry.

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V6</b> <a class="tocSkip"></a>

1. You need to try at least two ways of handling the imbalance.
2. You need to try at least two models while handling the imbalance and tune hyperparameters for at least one of them.

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V7</b> <a class="tocSkip"></a>

Not fixed. You still use only one model and only one method of handling the imbalance, and you don't tune hyperparameters.

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V8</b> <a class="tocSkip"></a>

Fixed. Good job!

</div>