In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

                                       Task-1: Data Exploration & Preprocessing

1.1 Dataset Overview
Loaded dataset: Customer_data.csv with 7043 records and 21 columns

Dataset includes:

Customer demographics: gender, senior citizen, dependents

Services used: internet, phone, streaming, online security

Billing info: monthly charges, total charges, contract type

Target variable: Churn (Yes/No)

In [3]:
df=pd.read_csv('Customer_data.csv')   #Loaded dataset (Customer_data.csv)
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


1.2 Data Inspection
Used df.info(), df.isnull().sum() and df.describe() to understand:

Data types (object, int64, float64)

Statistical summaries for numeric columns

Found that TotalCharges had 11 missing values

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
count,7043.0,7043.0,7043.0,7032.0
mean,0.162147,32.371149,64.761692,2283.300441
std,0.368612,24.559481,30.090047,2266.771362
min,0.0,0.0,18.25,18.8
25%,0.0,9.0,35.5,401.45
50%,0.0,29.0,70.35,1397.475
75%,0.0,55.0,89.85,3794.7375
max,1.0,72.0,118.75,8684.8


In [6]:
df.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

1.3 Missing Value Handling
TotalCharges had 11 null entries:

In [7]:
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())  # assign back instead of chained inplace fillna


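Chose median imputation because TotalCharges is right-skewed (mean 2283.30 vs. median 1397.48 in df.describe()), so the median is a more typical fill value than the mean for the 11 missing entries.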
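
The effect of skew on the fill value can be seen on a small example. This is a minimal sketch on made-up numbers, not the real TotalCharges column; the series `charges` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed charges with two missing entries (not the real data)
charges = pd.Series([20.0, 35.0, 50.0, 80.0, 120.0, 4000.0, np.nan, np.nan])

mean_fill = charges.mean()      # pulled upward by the single large value
median_fill = charges.median()  # stays near the bulk of the values

# Assign the result back rather than calling fillna(..., inplace=True) on a column selection
filled = charges.fillna(median_fill)

print(f"mean={mean_fill:.2f}, median={median_fill:.2f}")
print("remaining nulls:", filled.isna().sum())
```

With one large value in the sample, the mean lands far from most customers while the median does not, which is the reason for preferring the median here.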

1.4 Column Cleanup
Dropped customerID since it holds no predictive value:

In [8]:
df.drop('customerID', axis=1, inplace=True)

1.4(1) Null counts after cleanup: customerID is gone and no missing values remain.

In [9]:
df.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

1.5 Categorical Variable Encoding
Binary columns (Yes/No) converted using .map():

In [10]:
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})
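
One thing to watch with .map(): any value that is not a key in the dictionary silently becomes NaN. The sketch below uses a made-up column with inconsistent labels (an assumption, the real columns may be clean) to show how such values could be detected and normalized before encoding.

```python
import pandas as pd

# Hypothetical column with inconsistent labels (not taken from Customer_data.csv)
partner = pd.Series(['Yes', 'No', 'yes', 'No '])

# Unmapped labels turn into NaN without any error
encoded = partner.map({'Yes': 1, 'No': 0})
unmapped = partner[encoded.isna()].unique().tolist()
print("labels that failed to map:", unmapped)

# Normalizing whitespace and case before mapping recovers the stray labels
cleaned = partner.str.strip().str.capitalize().map({'Yes': 1, 'No': 0})
print(cleaned.tolist())
```

Checking `df[binary_cols].isnull().sum()` right after the mapping loop would catch the same problem on the real data.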


Multi-category columns (e.g., InternetService, Contract) were one-hot encoded with pd.get_dummies(); drop_first=True drops the first level of each column so the remaining dummy columns are not redundant:

In [11]:
multi_cat_cols = ['gender', 'MultipleLines', 'InternetService', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection', 'TechSupport', 
                  'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']
df = pd.get_dummies(df, columns=multi_cat_cols, drop_first=True)
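
To make the effect of drop_first=True concrete, here is a small sketch on a toy frame that holds only the three Contract levels appearing in this notebook. The frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Toy frame with the three Contract levels that appear in this notebook
toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year', 'One year']})

full = pd.get_dummies(toy, columns=['Contract'])
reduced = pd.get_dummies(toy, columns=['Contract'], drop_first=True)

print(list(full.columns))     # one column per level
print(list(reduced.columns))  # first level (alphabetical order) is dropped
```

The dropped level, Month-to-month, becomes the baseline: a row with both remaining Contract columns set to False is a month-to-month customer.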


In [12]:
df.head()

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,gender_Male,...,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,0,1,0,1,29.85,29.85,0,False,...,False,False,False,False,False,False,False,False,True,False
1,0,0,0,34,1,0,56.95,1889.5,0,True,...,False,False,False,False,False,True,False,False,False,True
2,0,0,0,2,1,1,53.85,108.15,1,True,...,False,False,False,False,False,False,False,False,False,True
3,0,0,0,45,0,0,42.3,1840.75,0,True,...,True,False,False,False,False,True,False,False,False,False
4,0,0,0,2,1,1,70.7,151.65,1,False,...,False,False,False,False,False,False,False,False,True,False


1.6 Feature Scaling
Standardized the continuous features (tenure, MonthlyCharges, TotalCharges) to zero mean and unit variance so that scale-sensitive models such as SVM, Logistic Regression, and KNN weight them comparably:

In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['tenure', 'MonthlyCharges', 'TotalCharges']] = scaler.fit_transform(df[['tenure', 'MonthlyCharges', 'TotalCharges']])


1.7 Train-Test Split
Split data into 80% training and 20% test:

In [14]:
from sklearn.model_selection import train_test_split

X = df.drop('Churn', axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
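
In cell 13 the scaler is fit on the full dataset before this split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the arrays `X_demo` and `y_demo`, the name `demo_scaler`, and the use of `stratify` are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for tenure, MonthlyCharges, TotalCharges (not the real data)
X_demo = rng.normal(loc=[30.0, 65.0, 2000.0], scale=[24.0, 30.0, 2200.0], size=(1000, 3))
y_demo = (rng.random(1000) < 0.27).astype(int)  # imbalanced binary target

# stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_scaler = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(3), X_te_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean zero by construction, while the test columns are only close to zero, which is what an unseen sample should look like.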


                                   Task-2: Machine Learning Modeling 

Models Training:

We have trained six models on the preprocessed data:

1. Logistic Regression

2. Random Forest

3. Decision Tree

4. XGBoost

5. K-Nearest Neighbors (KNN)

6. Support Vector Machine (SVM)

2.1 Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score



# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # Positive-class probabilities for ROC-AUC

# Evaluate model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\n Classification Report:\n", classification_report(y_test, y_pred))
print("\n Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n ROC-AUC Score:", roc_auc_score(y_test, y_proba))


Accuracy: 0.8211497515968772

 Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.90      0.88      1036
           1       0.69      0.60      0.64       373

    accuracy                           0.82      1409
   macro avg       0.77      0.75      0.76      1409
weighted avg       0.82      0.82      0.82      1409


 Confusion Matrix:
 [[934 102]
 [150 223]]

 ROC-AUC Score: 0.8621748424027245
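
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four counts printed above, using scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[934, 102],
               [150, 223]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # correct predictions over all 1409 test rows
precision = tp / (tp + fp)           # share of predicted churners who churned
recall = tp / (tp + fn)              # share of actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The recall term shows the main weakness: 150 of the 373 actual churners in the test set are predicted as staying.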


2.2 Random Forest 

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Initialize the model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
rf_pred = rf_model.predict(X_test)
rf_proba = rf_model.predict_proba(X_test)[:, 1]

# Evaluate
print("Random Forest Results")
print("Accuracy:", rf_model.score(X_test, y_test))
print("Classification Report:\n", classification_report(y_test, rf_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, rf_proba))


Random Forest Results
Accuracy: 0.7984386089425124
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.92      0.87      1036
           1       0.67      0.47      0.55       373

    accuracy                           0.80      1409
   macro avg       0.75      0.69      0.71      1409
weighted avg       0.79      0.80      0.79      1409

ROC-AUC Score: 0.8374923659776207


2.3 Decision Tree

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize the model
dt_model = DecisionTreeClassifier(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

# Predict
dt_preds = dt_model.predict(X_test)
dt_probs = dt_model.predict_proba(X_test)[:, 1]

# Evaluate
print("Decision Tree Performance:")
print(f"Accuracy  : {accuracy_score(y_test, dt_preds):.4f}")
print(f"Precision : {precision_score(y_test, dt_preds):.4f}")
print(f"Recall    : {recall_score(y_test, dt_preds):.4f}")
print(f"F1 Score  : {f1_score(y_test, dt_preds):.4f}")
print(f"ROC-AUC   : {roc_auc_score(y_test, dt_probs):.4f}")



Decision Tree Performance:
Accuracy  : 0.7119
Precision : 0.4576
Recall    : 0.4772
F1 Score  : 0.4672
ROC-AUC   : 0.6373
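
The unconstrained tree scores well below Logistic Regression and Random Forest (ROC-AUC 0.6373 vs. 0.8622 and 0.8375). One plausible cause, not verified in this notebook, is overfitting: with no depth limit a tree can memorize the training set. The sketch below illustrates that behavior on synthetic noisy data; the dataset and the choice of max_depth=4 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data (a stand-in, not Customer_data.csv)
X_syn, y_syn = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

# Compare the train/test gap of the two trees
for label, tree in [("unconstrained", full_tree), ("max_depth=4", shallow_tree)]:
    print(f"{label:>13}: depth={tree.get_depth()}, "
          f"train acc={tree.score(X_tr, y_tr):.3f}, test acc={tree.score(X_te, y_te):.3f}")
```

Running the same train-versus-test comparison on `dt_model` would show whether the gap is present on the churn data as well.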


2.4 XGBoost

In [18]:
from xgboost import XGBClassifier

# Initialize the model
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)  # use_label_encoder dropped: this XGBoost version ignores it

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions
xgb_pred = xgb_model.predict(X_test)
xgb_proba = xgb_model.predict_proba(X_test)[:, 1]

# Evaluate
print("XGBoost Results")
print("Accuracy:", xgb_model.score(X_test, y_test))
print("Classification Report:\n", classification_report(y_test, xgb_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, xgb_proba))


XGBoost Results
Accuracy: 0.7892122072391767
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.89      0.86      1036
           1       0.63      0.50      0.56       373

    accuracy                           0.79      1409
   macro avg       0.73      0.70      0.71      1409
weighted avg       0.78      0.79      0.78      1409

ROC-AUC Score: 0.8391770265094662


2.5 KNN

In [19]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the model
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn_model.fit(X_train, y_train)

# Predictions
knn_pred = knn_model.predict(X_test)
knn_proba = knn_model.predict_proba(X_test)[:, 1]

# Evaluation
print("KNN Results")
print("Accuracy:", knn_model.score(X_test, y_test))
print("Classification Report:\n", classification_report(y_test, knn_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, knn_proba))


KNN Results
Accuracy: 0.7721788502484032
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.86      0.85      1036
           1       0.58      0.53      0.55       373

    accuracy                           0.77      1409
   macro avg       0.71      0.69      0.70      1409
weighted avg       0.77      0.77      0.77      1409

ROC-AUC Score: 0.7965804755348992


2.6 Support Vector Machine (SVM)

In [20]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Initialize the model
svm_model = SVC(probability=True, random_state=42)

# Train the model
svm_model.fit(X_train, y_train)

# Predictions
svm_pred = svm_model.predict(X_test)
svm_proba = svm_model.predict_proba(X_test)[:, 1]

In [21]:
print("SVM Results")
print("Accuracy:", svm_model.score(X_test, y_test))
print("Classification Report:\n", classification_report(y_test, svm_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, svm_proba))

SVM Results
Accuracy: 0.8140525195173882
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.92      0.88      1036
           1       0.70      0.52      0.60       373

    accuracy                           0.81      1409
   macro avg       0.77      0.72      0.74      1409
weighted avg       0.80      0.81      0.80      1409

ROC-AUC Score: 0.8084209736354508


                                      Task-3: Model Evaluation 
                                    (Before Hyperparameter Tuning)

3.1 Evaluation Metrics Used
All models were evaluated using:

Accuracy

Precision

Recall

F1 Score

ROC-AUC Score

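Precision, recall, and F1 score describe performance on the minority churn class (373 of the 1,409 test customers), which accuracy alone can mask; ROC-AUC measures how well the predicted probabilities rank churners above non-churners.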
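
ROC-AUC has a direct ranking interpretation: it is the fraction of (churner, non-churner) pairs in which the churner receives the higher score. The sketch below checks this on a handful of made-up labels and scores (assumptions, not model output) against roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted churn probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.50, 0.45])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (churner, non-churner) pairs ranked correctly; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Because only the ordering of scores matters, ROC-AUC does not depend on the decision threshold, unlike precision and recall.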

In [22]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

#  Define evaluation function
def evaluate_model(name, model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # Ensure model supports predict_proba
    return {
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba)
    }

# Trained models from Task 2
models = {
    "Logistic Regression": model,
    "Random Forest": rf_model,
    "XGBoost": xgb_model,
    "SVM": svm_model,
    "KNN": knn_model,
    "Decision Tree": dt_model
}

# Evaluate each model
results = []
for name, fitted_model in models.items():  # separate loop name so `model` (Logistic Regression) is not overwritten
    results.append(evaluate_model(name, fitted_model, X_test, y_test))

#  Display results
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by="ROC-AUC", ascending=False)
print(results_df)


                 Model  Accuracy  Precision    Recall  F1 Score   ROC-AUC
0  Logistic Regression  0.821150   0.686154  0.597855  0.638968  0.862175
2              XGBoost  0.789212   0.628378  0.498660  0.556054  0.839177
1        Random Forest  0.798439   0.667925  0.474531  0.554859  0.837492
3                  SVM  0.814053   0.700361  0.520107  0.596923  0.808421
4                  KNN  0.772179   0.576471  0.525469  0.549790  0.796580
5        Decision Tree  0.711852   0.457584  0.477212  0.467192  0.637336


3.2 Hyperparameter Tuning

Used GridSearchCV with 5-fold cross-validation on five of the six models (the Decision Tree was not tuned):

Logistic Regression: tuned C (penalty fixed to l2, solver fixed to lbfgs)

Random Forest: tuned n_estimators, max_depth, min_samples_split, min_samples_leaf

XGBoost: tuned learning_rate, max_depth, n_estimators, subsample

SVM: tuned C, kernel, gamma

KNN: tuned n_neighbors, weights, metric

The best configuration for each model was selected by mean cross-validated ROC-AUC on the training set.
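
The general pattern used in the cells that follow can be sketched on synthetic data. This sketch also wraps the scaler and the classifier in a Pipeline so that scaling is re-fit inside each cross-validation fold; that wrapping, the synthetic dataset, and the step names `scale` and `clf` are assumptions, not what the notebook's own cells do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (a stand-in, not Customer_data.csv)
X_syn, y_syn = make_classification(n_samples=600, n_features=10,
                                   weights=[0.73, 0.27], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          stratify=y_syn, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # re-fit on each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}      # step name + double underscore + parameter

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV ROC-AUC:", round(search.best_score_, 4))
print("test ROC-AUC:", round(search.score(X_te, y_te), 4))
```

`best_score_` is the cross-validated score on the training data, so it should be reported separately from the held-out test score.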

3.2(1) Logistic Regression using Hyperparameter Tuning

In [23]:
from sklearn.model_selection import GridSearchCV


# Define model
lr = LogisticRegression(max_iter=1000)

# Define hyperparameter grid
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs']
}

# Grid Search
grid_lr = GridSearchCV(lr, param_grid_lr, cv=5, scoring='roc_auc', n_jobs=-1)
grid_lr.fit(X_train, y_train)

# Best model
best_lr = grid_lr.best_estimator_

# Predict
y_pred = best_lr.predict(X_test)
y_proba = best_lr.predict_proba(X_test)[:, 1]

# Evaluation metrics
print("Best Parameters (Logistic Regression):", grid_lr.best_params_)
print("Evaluation Metrics:")
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))
print("F1 Score  :", f1_score(y_test, y_pred))
print("ROC-AUC   :", roc_auc_score(y_test, y_proba))


Best Parameters (Logistic Regression): {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
Evaluation Metrics:
Accuracy  : 0.8204400283889283
Precision : 0.6851851851851852
Recall    : 0.5951742627345844
F1 Score  : 0.6370157819225251
ROC-AUC   : 0.8618306644446054


3.2(2) Random Forest Classification using Hyperparameter Tuning

In [24]:

from sklearn.model_selection import GridSearchCV


# 1. Define base model
rf = RandomForestClassifier(random_state=42)

# 2. Define hyperparameter grid
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# 3. Grid Search
grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='roc_auc', n_jobs=-1)
grid_rf.fit(X_train, y_train)

# 4. Best model
best_rf = grid_rf.best_estimator_

# 5. Predictions
y_pred = best_rf.predict(X_test)
y_proba = best_rf.predict_proba(X_test)[:, 1]

# 6. Evaluation
print("Best Parameters (Random Forest):", grid_rf.best_params_)
print("Evaluation Metrics:")
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))
print("F1 Score  :", f1_score(y_test, y_pred))
print("ROC-AUC   :", roc_auc_score(y_test, y_proba))


Best Parameters (Random Forest): {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Evaluation Metrics:
Accuracy  : 0.8147622427253371
Precision : 0.7028985507246377
Recall    : 0.5201072386058981
F1 Score  : 0.5978428351309707
ROC-AUC   : 0.8598186466819174


3.2(3) XGBoost Classification using Hyperparameter Tuning

In [25]:

from sklearn.model_selection import GridSearchCV


# 1. Define the model
xgb = XGBClassifier(eval_metric='logloss', random_state=42)  # use_label_encoder dropped: this XGBoost version ignores it

# 2. Define hyperparameters
param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0]
}

# 3. Grid Search
grid_xgb = GridSearchCV(xgb, param_grid_xgb, cv=5, scoring='roc_auc', n_jobs=-1)
grid_xgb.fit(X_train, y_train)

# 4. Best model
best_xgb = grid_xgb.best_estimator_

# 5. Predictions
y_pred = best_xgb.predict(X_test)
y_proba = best_xgb.predict_proba(X_test)[:, 1]

# 6. Evaluation
print("Best Parameters (XGBoost):", grid_xgb.best_params_)
print("Evaluation Metrics:")
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))
print("F1 Score  :", f1_score(y_test, y_pred))
print("ROC-AUC   :", roc_auc_score(y_test, y_proba))


Best Parameters (XGBoost): {'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Evaluation Metrics:
Accuracy  : 0.8133427963094393
Precision : 0.6883561643835616
Recall    : 0.5388739946380697
F1 Score  : 0.6045112781954888
ROC-AUC   : 0.8647717039137951


3.2(4) Support Vector Machine (SVM) Classification using Hyperparameter Tuning

In [26]:

from sklearn.model_selection import GridSearchCV


# 1. Define SVM model
svm = SVC(probability=True, random_state=42)

# 2. Define hyperparameter grid
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}

# 3. Grid Search
grid_svm = GridSearchCV(svm, param_grid_svm, cv=5, scoring='roc_auc', n_jobs=-1)
grid_svm.fit(X_train, y_train)

# 4. Best model
best_svm = grid_svm.best_estimator_

# 5. Predictions
y_pred = best_svm.predict(X_test)
y_proba = best_svm.predict_proba(X_test)[:, 1]

# 6. Evaluation
print("Best Parameters (SVM):", grid_svm.best_params_)
print("Evaluation Metrics:")
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))
print("F1 Score  :", f1_score(y_test, y_pred))
print("ROC-AUC   :", roc_auc_score(y_test, y_proba))


Best Parameters (SVM): {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Evaluation Metrics:
Accuracy  : 0.8197303051809794
Precision : 0.6842105263157895
Recall    : 0.5924932975871313
F1 Score  : 0.6350574712643678
ROC-AUC   : 0.8523398925543697


3.2(5) KNN Classification using Hyperparameter Tuning

In [27]:

from sklearn.model_selection import GridSearchCV


# 1. Define KNN model
knn = KNeighborsClassifier()

# 2. Define hyperparameter grid
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# 3. Grid Search
grid_knn = GridSearchCV(knn, param_grid_knn, cv=5, scoring='roc_auc', n_jobs=-1)
grid_knn.fit(X_train, y_train)

# 4. Best model
best_knn = grid_knn.best_estimator_

# 5. Predictions
y_pred = best_knn.predict(X_test)
y_proba = best_knn.predict_proba(X_test)[:, 1]

# 6. Evaluation
print("Best Parameters (KNN):", grid_knn.best_params_)
print("Evaluation Metrics:")
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))
print("F1 Score  :", f1_score(y_test, y_pred))
print("ROC-AUC   :", roc_auc_score(y_test, y_proba))


Best Parameters (KNN): {'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'uniform'}
Evaluation Metrics:
Accuracy  : 0.7955997161107168
Precision : 0.6253687315634219
Recall    : 0.5683646112600537
F1 Score  : 0.5955056179775281
ROC-AUC   : 0.8305389361019389


                                 Final Evaluation Table after Hyperparameter Tuning

In [28]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Tuned estimators from section 3.2 (SVM was fit with probability=True)
tuned_models = {
    'Logistic Regression': best_lr,
    'Random Forest': best_rf,
    'XGBoost': best_xgb,
    'SVM': best_svm,
    'KNN': best_knn
}

# Predict once per model, then compute every metric on the test set
tuned_results = []
for name, estimator in tuned_models.items():
    pred = estimator.predict(X_test)
    proba = estimator.predict_proba(X_test)[:, 1]
    tuned_results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, pred),
        'Precision': precision_score(y_test, pred),
        'Recall': recall_score(y_test, pred),
        'F1 Score': f1_score(y_test, pred),
        'ROC-AUC': roc_auc_score(y_test, proba)
    })

# Convert to DataFrame and sort by ROC-AUC
tuned_results_df = pd.DataFrame(tuned_results)
tuned_results_df = tuned_results_df.round(4)
tuned_results_df.sort_values(by='ROC-AUC', ascending=False, inplace=True)

# Display final table
print("Final Evaluation Table (After Hyperparameter Tuning):")
print(tuned_results_df)


Final Evaluation Table (After Hyperparameter Tuning):
                 Model  Accuracy  Precision  Recall  F1 Score  ROC-AUC
2              XGBoost    0.8133     0.6884  0.5389    0.6045   0.8648
0  Logistic Regression    0.8204     0.6852  0.5952    0.6370   0.8618
1        Random Forest    0.8148     0.7029  0.5201    0.5978   0.8598
3                  SVM    0.8197     0.6842  0.5925    0.6351   0.8523
4                  KNN    0.7956     0.6254  0.5684    0.5955   0.8305


Conclusion

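XGBoost has the highest test ROC-AUC after tuning (0.8648), narrowly ahead of Logistic Regression (0.8618) and Random Forest (0.8598).

Logistic Regression has the highest accuracy (0.8204), recall (0.5952), and F1 score (0.6370) among the tuned models, and is also the most interpretable.

All five tuned models exceeded 0.79 accuracy, but recall on the churn class (0.52 to 0.60) trails precision (0.63 to 0.70), so each model misses at least 40% of actual churners.

GridSearchCV raised test ROC-AUC for Random Forest (0.8375 to 0.8598), XGBoost (0.8392 to 0.8648), SVM (0.8084 to 0.8523), and KNN (0.7966 to 0.8305), while Logistic Regression was essentially unchanged (0.8622 to 0.8618). These are single-split comparisons; no statistical significance test was run.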
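
The effect of tuning can be read off by placing the two result tables side by side. The sketch below does this for test ROC-AUC, with the values copied from the tables in this notebook and rounded to four decimals; the frame name `roc_auc` is an assumption.

```python
import pandas as pd

# Test-set ROC-AUC values copied from the two result tables in this notebook
roc_auc = pd.DataFrame({
    "before": {"Logistic Regression": 0.8622, "Random Forest": 0.8375,
               "XGBoost": 0.8392, "SVM": 0.8084, "KNN": 0.7966},
    "after": {"Logistic Regression": 0.8618, "Random Forest": 0.8598,
              "XGBoost": 0.8648, "SVM": 0.8523, "KNN": 0.8305},
})

# Positive delta means tuning improved the score on the held-out test set
roc_auc["delta"] = (roc_auc["after"] - roc_auc["before"]).round(4)
print(roc_auc.sort_values("delta", ascending=False))
```

SVM and KNN gain the most, while the ranking at the top is decided by differences of a few thousandths.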

Actionable Insights to Reduce Customer Churn

Based on model predictions and feature importance analysis, the following key insights and business recommendations were derived:



1. Focus on Customers with Low Tenure

   Insight: Customers with a tenure of less than 12 months have a significantly higher risk of churning.

   Action: Implement a "Welcome Program" that includes onboarding support, personalized offers, and proactive engagement in the first 3 months.



2. Target Fiber Optic Internet Users

   Insight: Customers using fiber optic internet show higher churn rates.

   Action: Investigate potential service issues or pricing dissatisfaction. Offer bundled packages, exclusive discounts, or service guarantees to retain these users.



3. Promote Long-Term Contracts Over Month-to-Month Plans

   Insight: Churn is significantly higher among customers with month-to-month contracts.

   Action: Encourage contract conversions by offering loyalty incentives, lower monthly rates, or added features for long-term plans.



4. Increase Adoption of Online Security and Tech Support

   Insight: Customers not subscribed to online security or tech support are more likely to churn.

   Action: Offer these services at discounted rates or bundle them with popular plans. Run promotional campaigns highlighting their benefits.



5. Improve Engagement for Paperless Billing Customers

   Insight: Paperless billing users show slightly higher churn, possibly due to reduced engagement or missed communications.
   
   Action: Increase touchpoints with digital users through personalized emails, usage reports, and billing summaries.
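
Segment-level claims like the ones above can be checked with a simple group-by on the raw data (before one-hot encoding, while Contract is still a single column). The sketch below shows the idea on a made-up mini-sample; the frame `demo`, its values, and the tenure band edges are assumptions and do not reproduce the notebook's numbers.

```python
import pandas as pd

# Hypothetical mini-sample; values are illustrative, not taken from Customer_data.csv
demo = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month", "One year",
                 "One year", "Two year", "Two year", "Month-to-month"],
    "tenure": [1, 3, 8, 30, 14, 60, 70, 20],
    "Churn": [1, 1, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 target is the churn rate of each segment
churn_by_contract = demo.groupby("Contract")["Churn"].mean()

# Bucket tenure (in months) into bands; include_lowest keeps tenure 0 in the first band
demo["tenure_band"] = pd.cut(demo["tenure"], bins=[0, 12, 24, 72],
                             labels=["0-12", "13-24", "25-72"], include_lowest=True)
churn_by_tenure = demo.groupby("tenure_band", observed=True)["Churn"].mean()

print(churn_by_contract)
print(churn_by_tenure)
```

The same two group-bys applied to the original Customer_data.csv would give the actual churn rates behind insights 1 and 3.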




