# Model Quality Project

I will attempt to improve model quality by checking alternative routes.

## Preparing Data

In [2]:
# imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
import plotly.express as px

All needed imports

In [3]:
data = pd.read_csv(r"C:\Users\alexi\Desktop\Coding Projects\Churn-Project\Churn.csv")

Loaded the CSV file

In [4]:
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


I checked for anomalies. One column (Tenure) has missing values (only 9,091 of 10,000 rows are non-null), but the data types look fine, so I'll inspect the missing values.

In [5]:
missing_values = data[data.isnull().any(axis=1)]
missing_values.sample(20)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
2988,2989,15684801,Abbott,689,France,Male,47,,93871.95,3,1,0,156878.42,1
5536,5537,15795878,Anayochukwu,636,Spain,Male,45,,0.0,2,1,1,159463.8,0
3250,3251,15587419,Shipton,611,France,Male,58,,0.0,2,0,1,107665.68,1
1965,1966,15772243,MacDonald,612,France,Female,33,,0.0,1,0,0,142797.5,1
5172,5173,15813095,Nwebube,553,France,Male,37,,0.0,2,1,0,33877.29,0
8617,8618,15672481,Ulyanov,641,France,Male,37,,0.0,2,1,0,45309.24,0
5443,5444,15590199,Temple,701,Spain,Male,28,,103421.32,1,0,1,76304.73,0
2900,2901,15668575,Hao,626,Spain,Female,26,,148610.41,3,0,1,104502.02,1
2120,2121,15651554,Anenechukwu,618,Germany,Female,54,,118449.21,1,1,1,133573.29,1
9901,9902,15802909,Hu,706,Germany,Female,56,,139603.22,1,1,1,86383.61,0


Created a DataFrame of the rows with missing values so they are easier to inspect.

In [6]:
data['Exited'].value_counts()

Exited
0    7963
1    2037
Name: count, dtype: int64

In [7]:
missing_values['Exited']

30      1
48      0
51      0
53      1
60      0
       ..
9944    0
9956    1
9964    0
9985    0
9999    0
Name: Exited, Length: 909, dtype: int64

The two cells above were a check for a relationship between exited customers and the missing values, and I don't believe there is one.
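Printing the raw `Exited` values makes it hard to judge a relationship by eye. A minimal sketch of a more direct check is to compare the churn rate of rows with and without a missing `Tenure`. The helper name `churn_rate_by_missing` and the toy data are my own assumptions for illustration, not part of the original notebook or the Churn.csv values.

```python
import numpy as np
import pandas as pd

def churn_rate_by_missing(df, column="Tenure", target="Exited"):
    """Mean of `target` and row count for rows where `column` is missing vs. present."""
    status = pd.Series(
        np.where(df[column].isna(), "missing", "present"),
        index=df.index,
        name=f"{column}_status",
    )
    return df.groupby(status)[target].agg(["mean", "count"])

# Toy data for illustration only (not the Churn.csv values)
toy = pd.DataFrame({
    "Tenure": [1.0, np.nan, 3.0, np.nan, 5.0, 2.0],
    "Exited": [0, 1, 0, 0, 1, 0],
})
summary = churn_rate_by_missing(toy)
print(summary)
```

On the real data this would be called as `churn_rate_by_missing(data)` before the fill; similar means in both groups would support the conclusion that the missing values are unrelated to churn.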

In [8]:
missing_values['IsActiveMember'].value_counts()

IsActiveMember
1    464
0    445
Name: count, dtype: int64

I also tested whether the missing values had anything to do with customer activity. After running this, I don't believe there is another pattern or correlation to be found. It is also too much data to lose (909 rows), so I will fill the missing values with 0. My thinking is that a missing Tenure could indicate newer accounts, or simply ones that haven't been active for more than a year, and since the goal is to predict customer turnover this could be useful information later.
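One side effect of filling with 0 is that the filled rows become indistinguishable from customers whose tenure really is 0. A hedged sketch of one way to keep that information is to add an indicator column before filling. The helper `fill_with_flag` and the `Tenure_missing` column are hypothetical names I am introducing here; the notebook itself only calls `fillna(0)`.

```python
import numpy as np
import pandas as pd

def fill_with_flag(df, column="Tenure", fill_value=0):
    """Return a copy with a 0/1 `<column>_missing` flag and NaNs in `column` filled."""
    out = df.copy()
    out[f"{column}_missing"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out

# Toy data for illustration only
toy = pd.DataFrame({"Tenure": [2.0, np.nan, 0.0, 8.0]})
filled = fill_with_flag(toy)
print(filled)
```

The original frame is left untouched, so the two versions can be compared side by side.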

In [9]:
data['Tenure']  = data['Tenure'].fillna(0)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The values have been filled and I can work with the data now.

## Checking and Calculating With Imbalance

In [11]:
class_distribution = data['Exited'].value_counts(normalize=True) * 100
class_distribution

Exited
0    79.63
1    20.37
Name: proportion, dtype: float64

The classes are imbalanced: about 80% of customers stayed (0) and only about 20% exited (1), so a model trained on this data may favor the majority class.
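To make the risk concrete, here is a small worked calculation using the class counts reported earlier. It shows the accuracy a model would get by always predicting the majority class, which is why accuracy alone is a weak yardstick here.

```python
import pandas as pd

# Class counts as reported by data['Exited'].value_counts() above
counts = pd.Series({0: 7963, 1: 2037}, name="Exited")

majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()

# A model that always predicts the majority class never flags a churner,
# so its recall on class 1 is zero despite the high accuracy.
minority_recall = 0.0

print(f"Always predicting {majority_class}: accuracy = {baseline_accuracy:.4f}, "
      f"recall on class 1 = {minority_recall:.1f}")
```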

In [12]:
# Encoding Categorical Data
encode_geo = LabelEncoder()
encode_gender = LabelEncoder()

data['Geography'] = encode_geo.fit_transform(data['Geography'])
data['Gender'] = encode_gender.fit_transform(data['Gender'])

# Feature and Target
features = data.drop(columns=['Exited', 'Surname'])
target = data['Exited']

# Splitting Data into Training and Test
x = features
y = target
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=50, stratify=y)

# Standardize the Features
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# RandomForest Model Imbalanced Training
model = RandomForestClassifier(random_state=50)
model.fit(x_train_scaled, y_train)

# Predictions
y_pred = model.predict(x_test_scaled)

# Evaluating the Model
report = classification_report(y_test, y_pred, output_dict=True)
report

{'0': {'precision': 0.8715179079022172,
  'recall': 0.9623352165725048,
  'f1-score': 0.9146778042959427,
  'support': 1593.0},
 '1': {'precision': 0.7510373443983402,
  'recall': 0.44471744471744473,
  'f1-score': 0.558641975308642,
  'support': 407.0},
 'accuracy': 0.857,
 'macro avg': {'precision': 0.8112776261502788,
  'recall': 0.7035263306449747,
  'f1-score': 0.7366598898022924,
  'support': 2000.0},
 'weighted avg': {'precision': 0.8470001132291781,
  'recall': 0.857,
  'f1-score': 0.8422245130970271,
  'support': 2000.0}}

I ran a classification report for a more detailed view, and it returned some useful information. For class 0 (non-exited) the model does well: it identifies ~96% of the class (recall), it is right ~87% of the time when it predicts 0 (precision), and the F1 score is ~91%. For class 1 (exited) it performs poorly: it identifies only ~44% of the class, although with ~75% precision its churn predictions are fairly reliable when it does make them. The F1 score of ~56% for this class shows a need for improvement, most likely due to the imbalance.
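As a quick sanity check on how the F1 score ties precision and recall together, the reported class 1 value can be reproduced from the other two numbers. F1 is the harmonic mean of precision and recall; the function name below is my own.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 values copied from the classification report above
precision_1 = 0.7510373443983402
recall_1 = 0.44471744471744473

f1_1 = f1_from(precision_1, recall_1)
print(f"F1 for class 1: {f1_1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall drags F1 down even though precision is fairly high.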

## Balancing, Choosing Best Model, and Hyperparameter Tuning

### Balancing and Model Choice

In [13]:
# Applying SMOTE for Oversample
smote = SMOTE(random_state=50)
x_train_rescale, y_train_smote = smote.fit_resample(x_train, y_train)

# Scaling for Resample
x_resample_scaled = scaler.fit_transform(x_train_rescale)

# Balancing Models 
rf_model = RandomForestClassifier(class_weight="balanced", random_state=50)
log_reg_model = LogisticRegression(class_weight="balanced",max_iter=1000, random_state=50)
dt_model = DecisionTreeClassifier(class_weight="balanced", random_state=50)

# Train Models
rf_model.fit(x_resample_scaled, y_train_smote)
log_reg_model.fit(x_resample_scaled, y_train_smote)
dt_model.fit(x_resample_scaled, y_train_smote)

# Make Predictions
rf_pred = rf_model.predict(x_test_scaled)
log_reg_pred = log_reg_model.predict(x_test_scaled)
dt_pred = dt_model.predict(x_test_scaled)

# Evaluate Models
rf_report = classification_report(y_test, rf_pred, output_dict=True)
log_reg_report = classification_report(y_test, log_reg_pred, output_dict=True)
dt_report = classification_report(y_test, dt_pred, output_dict=True)

# Convert to DF For Easy Comparison
rf_df = pd.DataFrame(rf_report).transpose()
log_reg_df = pd.DataFrame(log_reg_report).transpose()
dt_df = pd.DataFrame(dt_report).transpose()

print("Random Forest Results:\n", rf_df)
print("\nLogistic Regression Results:\n", log_reg_df)
print("\nDecision Tree Results:\n", dt_df)

Random Forest Results:
               precision    recall  f1-score    support
0              0.913502  0.815443  0.861692  1593.0000
1              0.491349  0.697789  0.576650   407.0000
accuracy       0.791500  0.791500  0.791500     0.7915
macro avg      0.702426  0.756616  0.719171  2000.0000
weighted avg   0.827594  0.791500  0.803686  2000.0000

Logistic Regression Results:
               precision    recall  f1-score   support
0              0.889103  0.578782  0.701141  1593.000
1              0.303219  0.717445  0.426277   407.000
accuracy       0.607000  0.607000  0.607000     0.607
macro avg      0.596161  0.648113  0.563709  2000.000
weighted avg   0.769876  0.607000  0.645206  2000.000

Decision Tree Results:
               precision    recall  f1-score    support
0              0.895631  0.694915  0.782609  1593.0000
1              0.363874  0.683047  0.474808   407.0000
accuracy       0.692500  0.692500  0.692500     0.6925
macro avg      0.629753  0.688981  0.628708  2

I applied SMOTE for oversampling because it creates synthetic minority-class rows, which made more sense than just duplicating existing rows. I then scaled the resampled data, set class_weight="balanced" on all three models, trained them, and converted each classification report to a DataFrame for easy comparison. The results are telling for what we are looking for, which is a consistent way to predict churn. Logistic regression has the highest churn recall (~72%), but its precision (~30%) is too low. That leaves random forest as our best option: its recall of ~70% is slightly lower, but its precision of ~49% is almost 20 percentage points higher, and it has the best F1 score for the churn class (~58%). It is the most balanced approach of the three, so I will try to improve on it with hyperparameter tuning.
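One detail in the cell above is worth flagging: `scaler` is refit on the resampled training data, while `x_test_scaled` was produced earlier by the scaler fitted on the original `x_train`. The training and test features are therefore standardized with different parameters, which may affect the reported scores (I have not re-run the notebook to measure by how much). A minimal sketch of a consistent ordering follows. It uses toy data and naive random oversampling as a dependency-free stand-in for SMOTE, so it illustrates the scaling order only, not the notebook's exact method.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(50)

# Toy stand-ins for x_train / y_train / x_test (not the Churn.csv data)
x_train = rng.normal(loc=[40.0, 60000.0], scale=[10.0, 20000.0], size=(200, 2))
y_train = (rng.random(200) < 0.2).astype(int)
x_test = rng.normal(loc=[40.0, 60000.0], scale=[10.0, 20000.0], size=(50, 2))

# Naive random oversampling of the minority class; a stand-in for SMOTE here
minority_idx = np.flatnonzero(y_train == 1)
n_extra = int((y_train == 0).sum()) - minority_idx.size
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
x_resampled = np.vstack([x_train, x_train[extra_idx]])
y_resampled = np.concatenate([y_train, y_train[extra_idx]])

# Fit the scaler once on the data the model will train on,
# then reuse the same fitted scaler for the test set.
scaler = StandardScaler()
x_resampled_scaled = scaler.fit_transform(x_resampled)
x_test_scaled = scaler.transform(x_test)

print(x_resampled_scaled.shape, x_test_scaled.shape)
```

In the notebook, the equivalent change would be to recompute `x_test_scaled = scaler.transform(x_test)` after the scaler is refit, or to wrap the resampling, scaling, and model in a single pipeline so the order cannot drift.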

### Hyperparameter Tuning

In [14]:
# Hyperparameters
param_grid_rf = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Perform RandomSearchCV for Best Parameters
rf_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_grid_rf,
    n_iter=20,
    cv=3,
    scoring='f1',
    random_state=50,
    n_jobs=-1
)

# Fit Model
rf_search.fit(x_resample_scaled, y_train_smote)

# Best Params
best_params = rf_search.best_params_

# Train Best rf Model
best_rf_model = RandomForestClassifier(**best_params, class_weight='balanced', random_state=50)
best_rf_model.fit(x_resample_scaled, y_train_smote)

# Predict, Evaluate, Report
best_rf_pred = best_rf_model.predict(x_test_scaled)
best_rf_report = classification_report(y_test, best_rf_pred, output_dict=True)
best_rf_df = pd.DataFrame(best_rf_report).transpose()

# Print
print("Best RandomForest Results:")
best_rf_df

Best RandomForest Results:


Unnamed: 0,precision,recall,f1-score,support
0,0.909972,0.824859,0.865328,1593.0
1,0.498201,0.68059,0.575286,407.0
accuracy,0.7955,0.7955,0.7955,0.7955
macro avg,0.704087,0.752724,0.720307,2000.0
weighted avg,0.826177,0.7955,0.806304,2000.0


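I set up a parameter grid and used RandomizedSearchCV to find the best parameters for the random forest model. The gains are marginal: accuracy (79.15% to 79.55%) and churn precision (49.1% to 49.8%) increase slightly, while churn recall decreases (69.8% to 68.1%) and the churn F1 score is essentially unchanged (0.577 to 0.575).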
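Part of the reason the improvement is small may be that the random search only samples a fraction of the grid. The sketch below counts the combinations in the grid defined above and the number of cross-validation fits implied by `n_iter=20` and `cv=3`. It uses scikit-learn's `ParameterGrid` purely for counting; the search itself is unchanged.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
param_grid_rf = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_combinations = len(ParameterGrid(param_grid_rf))  # 4 * 4 * 3 * 3 * 2
n_iter, cv = 20, 3
n_cv_fits = n_iter * cv          # fits during the search (excludes the final refit)
coverage = n_iter / n_combinations

print(f"{n_combinations} combinations, {n_cv_fits} CV fits, "
      f"{coverage:.1%} of the grid sampled")
```

Raising `n_iter` would explore more of the grid at a proportional cost in training time.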

In [15]:
# Predict probabilities for ROC curve
rf_probs = best_rf_model.predict_proba(x_test_scaled)[:, 1]

# Compute AUC-ROC
auc_roc = roc_auc_score(y_test, rf_probs)

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, rf_probs)

# Create a DataFrame for Plotly
roc_df = pd.DataFrame({'False Positive Rate': fpr, 'True Positive Rate': tpr})

# Plot using Plotly Express
fig = px.line(
    roc_df, x='False Positive Rate', y='True Positive Rate',
    title=f'Receiver Operating Characteristic (ROC) Curve (AUC = {auc_roc:.4f})',
    labels={'False Positive Rate': 'False Positive Rate', 'True Positive Rate': 'True Positive Rate'}
)

# Add diagonal reference line
fig.add_shape(type='line', x0=0, y0=0, x1=1, y1=1, line=dict(dash='dash', color='gray'))

# Show plot
fig.show()

# Return AUC-ROC Score
auc_roc

np.float64(0.8295953889174229)

Our AUC-ROC is ~0.83, which suggests the model performs well at distinguishing between churners and non-churners.
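One way to read this number: AUC-ROC equals the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner (ties count as half). The sketch below checks that interpretation against `roc_auc_score` on a small made-up example; the labels and scores are illustrative and not taken from the model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.3])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sklearn_auc)
```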

## Conclusion

The tuned Random Forest model reaches 79.55% accuracy, 49.82% precision, and 68.06% recall on the churn class, with an AUC-ROC of ~0.83, making it a reasonable tool for predicting customer churn. The initial model had higher accuracy (85.7%) and churn precision (~75%), but it missed most churn cases, identifying only ~44% of them. SMOTE balanced the training data and class weighting gave churn cases more significance, which traded some accuracy and precision for a much higher recall. Hyperparameter tuning then changed the results only marginally, making the model slightly more precise while still identifying a good portion of churners. Overall, this version is the best so far for catching churners, but there's room for improvement: adjusting the decision threshold, testing XGBoost or LightGBM, or analyzing feature importance could take it even further.
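As a starting point for the threshold idea, here is a minimal sketch of how precision and recall for the churn class move as the decision threshold changes. The helper `threshold_table` and the toy labels and probabilities are my own assumptions; on the real model the inputs would be `y_test` and the `rf_probs` array from the ROC cell.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def threshold_table(y_true, probs, thresholds):
    """Precision and recall for class 1 at each probability threshold."""
    rows = []
    for t in thresholds:
        pred = (probs >= t).astype(int)  # predict churn when probability >= t
        rows.append({
            "threshold": t,
            "precision": precision_score(y_true, pred, zero_division=0),
            "recall": recall_score(y_true, pred, zero_division=0),
        })
    return pd.DataFrame(rows)

# Made-up labels and probabilities for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
probs = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.3])

table = threshold_table(y_true, probs, thresholds=[0.3, 0.5, 0.7])
print(table)
```

Lowering the threshold below the default 0.5 generally raises recall at the expense of precision, so the right value depends on how costly a missed churner is compared with a false alarm.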