# ***STEP:0 _ SETTING UP THE ENVIRONMENT***

In [1]:
import numpy as np
import pandas as pd
from google.colab import files
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier


# ***STEP:1 _ LOADING THE DATASET***

In [2]:
uploaded = files.upload()

Saving Telco_Encoded.csv to Telco_Encoded.csv


In [3]:
df = pd.read_csv('Telco_Encoded.csv')

df.sample(5)

Unnamed: 0,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,InternetService_DSL,...,DeviceProtection_Yes,TechSupport_No,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_No internet service,StreamingMovies_Yes
4356,0,1,0,56,1,1,85.6,4902.8,0,0,...,0,1,0,0,0,0,1,1,0,0
2059,0,1,1,30,1,0,19.55,608.5,0,0,...,0,0,1,0,0,1,0,0,1,0
2861,0,0,0,12,1,1,84.6,1017.35,0,0,...,0,1,0,0,1,0,0,0,0,1
4943,1,1,0,29,1,1,84.3,2357.75,0,0,...,0,1,0,0,1,0,0,0,0,1
4418,0,0,1,16,1,1,79.5,1264.2,0,0,...,0,0,0,1,1,0,0,1,0,0


# ***STEP:2 _ SPLITTING THE DATASET***

In [4]:
X = df.drop('Churn', axis=1)
y = df["Churn"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train Shape : ", X_train.shape, y_train.shape)
print("Test Shape : ", X_test.shape, y_test.shape)

Train Shape :  (5625, 39) (5625,)
Test Shape :  (1407, 39) (1407,)
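
As a quick sanity check on the split, the arithmetic below recomputes the test fraction and the churn rate in the test set. It is only a sketch: the row counts come from the shapes printed above, and the class supports (1033 and 374) are taken from the classification reports further down.

```python
# Shapes printed by the split above.
n_train, n_test = 5625, 1407
# Class supports reported for the test set in the classification reports.
n_test_stay, n_test_churn = 1033, 374

n_total = n_train + n_test
test_fraction = n_test / n_total
test_churn_rate = n_test_churn / n_test

print(f"total rows     : {n_total}")
print(f"test fraction  : {test_fraction:.3f}")
print(f"test churn rate: {test_churn_rate:.3f}")
```

Roughly one test customer in four is a churner, so a model that always predicts "no churn" would already score about 73% accuracy. That is why accuracy alone is a weak yardstick here.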


# ***STEP:3 _ TRAINING ON DIFFERENT MODELS***

# ***(a) TRAINING LOGISTIC REGRESSION***

In [6]:
lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42))
])

lr.fit(X_train, y_train)

In [7]:
y_pred = lr.predict(X_test)

print("Accuracy : \n", accuracy_score(y_test, y_pred))

print("\nClassification Report : \n", classification_report(y_test, y_pred, zero_division=0))

y_proba = lr.predict_proba(X_test)[:, 1]

print("ROC-AUC Score : \n", roc_auc_score(y_test, y_proba))


Accuracy : 
 0.7256574271499645

Classification Report : 
               precision    recall  f1-score   support

           0       0.90      0.70      0.79      1033
           1       0.49      0.79      0.61       374

    accuracy                           0.73      1407
   macro avg       0.70      0.75      0.70      1407
weighted avg       0.79      0.73      0.74      1407

ROC-AUC Score : 
 0.8348432218086567


# summary
`We trained a Logistic Regression model (standardized features, balanced class weights) on the Telco churn dataset. It reached 72.6% accuracy and a ROC-AUC of 0.83. Recall for churners was high (0.79), so the model caught most customers who left, but precision was only 0.49, meaning roughly half of the customers it flagged did not churn. Overall it is a solid baseline with good discrimination that trades precision for recall.`
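
To make the precision/recall trade-off concrete, the sketch below rebuilds approximate confusion-matrix counts from the rounded recall values and class supports in the report above. The helper `approx_confusion` is hypothetical, and because the inputs are rounded to two decimals, the counts can be off by a few customers.

```python
def approx_confusion(recall_pos, recall_neg, n_pos, n_neg):
    """Approximate confusion-matrix counts from per-class recall and support."""
    tp = round(recall_pos * n_pos)  # churners correctly flagged
    tn = round(recall_neg * n_neg)  # non-churners correctly left alone
    fn = n_pos - tp                 # churners that were missed
    fp = n_neg - tn                 # non-churners flagged by mistake
    return tp, fp, fn, tn


# Rounded values from the Logistic Regression classification report.
tp, fp, fn, tn = approx_confusion(0.79, 0.70, n_pos=374, n_neg=1033)
precision = tp / (tp + fp)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision for churn ~ {precision:.2f}")
```

The reconstructed precision agrees with the reported 0.49: the model flags about 600 customers, and about 300 of them are real churners.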

# ***(b) TRAINING ON DECISION TREE***

In [8]:
dt_model = DecisionTreeClassifier(random_state=42)

dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)
y_pred_proba_dt = dt_model.predict_proba(X_test)[:, 1]


print("ACCURACY : \n", accuracy_score(y_test, y_pred_dt))
print("\nCLASSIFICATION REPORT : \n", classification_report(y_test, y_pred_dt))
print("\nROC-AUC SCORE : \n", roc_auc_score(y_test, y_pred_proba_dt))

ACCURACY : 
 0.7341862117981521

CLASSIFICATION REPORT : 
               precision    recall  f1-score   support

           0       0.82      0.82      0.82      1033
           1       0.50      0.51      0.50       374

    accuracy                           0.73      1407
   macro avg       0.66      0.66      0.66      1407
weighted avg       0.74      0.73      0.73      1407


ROC-AUC SCORE : 
 0.6615679889838537


# summary

`The Decision Tree reached about 73% accuracy, marginally above Logistic Regression. It did well on the majority class (non-churn, precision and recall of 0.82) but poorly on churn, with precision and recall of about 0.50. Its ROC-AUC of 0.66 is well below Logistic Regression's 0.83, so it separates churners from non-churners only weakly. The tree was grown without any depth limit, so overfitting is the likely cause, although training-set accuracy was not measured here.`
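
One way to check the overfitting hypothesis is to compare training and test accuracy, with and without a depth limit. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for the Telco features; the `*_demo` and `Xd_*` names and the depth of 5 are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.73, 0.27], flip_y=0.05, random_state=42,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scores = {}
for depth in (None, 5):  # None grows the tree until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    scores[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f} test={scores[depth][1]:.3f}")
```

A large gap between training and test accuracy for the unlimited tree would support the overfitting explanation. Running the same comparison on `X_train` and `X_test` from the notebook would settle it for the real data.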

# ***(c) TRAINING ON RANDOM FOREST***

In [9]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]

print("ACCURACY : \n", accuracy_score(y_test, y_pred))
print("\nCLASSIFICATION REPORT : \n", classification_report(y_test, y_pred))
print("\nROC-AUC SCORE : \n", roc_auc_score(y_test, y_proba))

ACCURACY : 
 0.7803837953091685

CLASSIFICATION REPORT : 
               precision    recall  f1-score   support

           0       0.83      0.88      0.85      1033
           1       0.61      0.50      0.55       374

    accuracy                           0.78      1407
   macro avg       0.72      0.69      0.70      1407
weighted avg       0.77      0.78      0.77      1407


ROC-AUC SCORE : 
 0.8170998752400722


# summary

`The Random Forest reached the highest accuracy so far, 78%. It was strong on the majority class (non-churn, precision 0.83, recall 0.88) but still weak on churn, with 0.61 precision and 0.50 recall. Its ROC-AUC of 0.82 shows good discrimination, well above the Decision Tree (0.66) but slightly below Logistic Regression (0.83). Overall, Random Forest beats both earlier models on accuracy, while Logistic Regression still catches far more churners (recall 0.79 vs. 0.50).`

# ***(d) TRAINING ON GRADIENT BOOSTING***

In [10]:
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)
y_proba = gb.predict_proba(X_test)[:, 1]

print("ACCURACY : \n", accuracy_score(y_test, y_pred))
print("\nCLASSIFICATION REPORT : \n", classification_report(y_test, y_pred))
print("\nROC-AUC SCORE : \n", roc_auc_score(y_test, y_proba))

ACCURACY : 
 0.7960199004975125

CLASSIFICATION REPORT : 
               precision    recall  f1-score   support

           0       0.84      0.89      0.87      1033
           1       0.64      0.53      0.58       374

    accuracy                           0.80      1407
   macro avg       0.74      0.71      0.72      1407
weighted avg       0.79      0.80      0.79      1407


ROC-AUC SCORE : 
 0.838785324919372


# summary

`The Gradient Boosting model achieved the highest accuracy so far at 79.6% with a strong ROC-AUC of 0.84. It performed very well on the non-churn class (precision 0.84, recall 0.89) and showed better precision on churn (0.64) compared to previous models, though recall for churn was still modest at 0.53. Overall, GB provides a stronger balance between accuracy and discrimination, making it the best-performing model up to this point.`

# ***(e) TRAINING ON SVM***

In [11]:
svm = SVC(kernel='rbf', probability=True, random_state=42, class_weight="balanced")
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
y_proba = svm.predict_proba(X_test)[:, 1]

print("ACCURACY : \n", accuracy_score(y_test, y_pred))
print("\nCLASSIFICATION REPORT : \n", classification_report(y_test, y_pred))
print("\nROC-AUC SCORE : \n", roc_auc_score(y_test, y_proba))

ACCURACY : 
 0.6588486140724946

CLASSIFICATION REPORT : 
               precision    recall  f1-score   support

           0       0.82      0.69      0.75      1033
           1       0.40      0.58      0.47       374

    accuracy                           0.66      1407
   macro avg       0.61      0.63      0.61      1407
weighted avg       0.71      0.66      0.67      1407


ROC-AUC SCORE : 
 0.7156418406489586


# summary

`The SVM with balanced class weights reached only 66% accuracy, the lowest of all the models. It handled the majority class moderately well (precision 0.82, recall 0.69) but struggled with churn, with 0.40 precision and 0.58 recall. Its ROC-AUC of 0.72 shows moderate discrimination. Note that this SVC was fit on unscaled features, and RBF kernels are sensitive to feature scale, so the result probably understates what a properly scaled SVM can do. As configured, SVM underperformed Logistic Regression, Random Forest, and Gradient Boosting.`
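
A minimal sketch of a scaled SVM, wrapped in the same kind of `Pipeline` used for Logistic Regression. The data is synthetic (an assumption), with one column multiplied by 1000 to mimic a large-valued feature such as TotalCharges; the names `svm_scaled` and `svm_raw` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one column is inflated to mimic an unscaled charge column.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.73, 0.27], random_state=42,
)
X_demo[:, 0] *= 1000.0
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

svm_raw = SVC(kernel="rbf", class_weight="balanced", random_state=42)
svm_scaled = Pipeline([
    ("scaler", StandardScaler()),  # fitted on the training split only
    ("clf", SVC(kernel="rbf", class_weight="balanced", random_state=42)),
])

svm_raw.fit(Xd_train, yd_train)
svm_scaled.fit(Xd_train, yd_train)

# decision_function provides a ranking score, so probability=True is not needed for ROC-AUC.
auc_raw = roc_auc_score(yd_test, svm_raw.decision_function(Xd_test))
auc_scaled = roc_auc_score(yd_test, svm_scaled.decision_function(Xd_test))
print(f"ROC-AUC unscaled: {auc_raw:.3f}")
print(f"ROC-AUC scaled  : {auc_scaled:.3f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fit. The same wrapper would also apply to the MLP, which is likewise sensitive to feature scale.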

# ***(f) TRAINING ON NEURAL NETWORK***

In [12]:
mlp = MLPClassifier(hidden_layer_sizes=(64,32), activation='relu', solver='adam', max_iter=500, random_state=42)
mlp.fit(X_train, y_train)

y_pred = mlp.predict(X_test)
y_proba = mlp.predict_proba(X_test)[:, 1]

print("ACCURACY : \n", accuracy_score(y_test, y_pred))
print("\nCLASSIFICATION REPORT : \n", classification_report(y_test, y_pred))
print("\nROC-AUC SCORE : \n", roc_auc_score(y_test, y_proba))

ACCURACY : 
 0.8031272210376688

CLASSIFICATION REPORT : 
               precision    recall  f1-score   support

           0       0.84      0.90      0.87      1033
           1       0.66      0.53      0.59       374

    accuracy                           0.80      1407
   macro avg       0.75      0.72      0.73      1407
weighted avg       0.79      0.80      0.80      1407


ROC-AUC SCORE : 
 0.8331167721863012


# summary

`The MLP Neural Network reached the highest accuracy of all the models, 80.3%, with a ROC-AUC of 0.833. It was strong on the majority class (precision 0.84, recall 0.90) and had the best churn precision so far (0.66), though churn recall stayed modest at 0.53. Its accuracy is slightly above Gradient Boosting (79.6%) and Random Forest (78.0%), but its ROC-AUC is marginally below Gradient Boosting (0.839) and Logistic Regression (0.835), so the top models are effectively tied on discrimination.`
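
The modest churn recall partly reflects the default 0.5 decision threshold used by `predict`. The sketch below shows, on a small hand-made example (toy labels and scores, not model output), how lowering the threshold raises recall at the cost of precision. The helper `precision_recall_at` is hypothetical.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall for the positive class at a given probability cut-off."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for t, s in pairs if s >= threshold and t == 1)
    fp = sum(1 for t, s in pairs if s >= threshold and t == 0)
    fn = sum(1 for t, s in pairs if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 1 = churn, scores play the role of predict_proba(...)[:, 1].
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.90]

for cut in (0.5, 0.35):
    p, r = precision_recall_at(toy_labels, toy_scores, cut)
    print(f"threshold={cut:.2f}: precision={p:.2f} recall={r:.2f}")
```

With the notebook's models, the same idea would be applied to `y_proba`, for example `(y_proba >= 0.35).astype(int)`, with the cut-off chosen on a validation split rather than on the test set.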

# ***STEP:4 _ PRESENTING THE RESULTS***

In [13]:
# Metrics copied by hand from the cell outputs above, rounded to three decimals
results = {
    "Model": [
        "Logistic Regression",
        "Decision Tree",
        "Random Forest",
        "Gradient Boosting",
        "SVM",
        "MLP Neural Network"
    ],
    "Accuracy": [
        0.726,   # Logistic Regression
        0.734,   # Decision Tree
        0.780,   # Random Forest
        0.796,   # Gradient Boosting
        0.659,   # SVM
        0.803    # MLP
    ],
    "ROC-AUC": [
        0.835,   # Logistic Regression
        0.662,   # Decision Tree
        0.817,   # Random Forest
        0.839,   # Gradient Boosting
        0.716,   # SVM
        0.833    # MLP
    ]
}

# Convert to DataFrame
df_results = pd.DataFrame(results)

# Display table
print(df_results)

# Sort by Accuracy (pass by="ROC-AUC" to rank by discrimination instead)
df_results = df_results.sort_values(by="Accuracy", ascending=False)
df_results


                 Model  Accuracy  ROC-AUC
0  Logistic Regression     0.726    0.835
1        Decision Tree     0.734    0.662
2        Random Forest     0.780    0.817
3    Gradient Boosting     0.796    0.839
4                  SVM     0.659    0.716
5   MLP Neural Network     0.803    0.833


                 Model  Accuracy  ROC-AUC
5   MLP Neural Network     0.803    0.833
3    Gradient Boosting     0.796    0.839
2        Random Forest     0.780    0.817
1        Decision Tree     0.734    0.662
0  Logistic Regression     0.726    0.835
4                  SVM     0.659    0.716
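
Which model ranks first depends on the metric. The sketch below picks the top model per metric from the values reported above; the churn-recall column is copied from the class-1 rows of the classification reports, and the dictionary layout is only illustrative.

```python
# Values copied from the outputs above (churn_recall = recall of class 1).
reported = {
    "Logistic Regression": {"accuracy": 0.726, "roc_auc": 0.835, "churn_recall": 0.79},
    "Decision Tree":       {"accuracy": 0.734, "roc_auc": 0.662, "churn_recall": 0.51},
    "Random Forest":       {"accuracy": 0.780, "roc_auc": 0.817, "churn_recall": 0.50},
    "Gradient Boosting":   {"accuracy": 0.796, "roc_auc": 0.839, "churn_recall": 0.53},
    "SVM":                 {"accuracy": 0.659, "roc_auc": 0.716, "churn_recall": 0.58},
    "MLP Neural Network":  {"accuracy": 0.803, "roc_auc": 0.833, "churn_recall": 0.53},
}

best_by = {
    metric: max(reported, key=lambda name: reported[name][metric])
    for metric in ("accuracy", "roc_auc", "churn_recall")
}

for metric, name in best_by.items():
    print(f"best {metric:<12}: {name} ({reported[name][metric]})")
```

Three different models come out on top, one per metric, so the choice of "best" model should follow from what a missed churner costs compared with a false alarm.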


# ***STEP:5 _ BEST PERFORMING MODEL***

`The MLP Neural Network was the best model by accuracy (80.3%) with a ROC-AUC of 0.83. It also had the highest churn precision (0.66) of all the models, meaning that customers it flags as churners are the most likely to actually churn. It is not the best on every metric: Gradient Boosting has a slightly higher ROC-AUC (0.839), and Logistic Regression finds far more of the actual churners (recall 0.79 vs. 0.53). If missing a churner costs more than a false alarm, Logistic Regression may be the better choice.`

# ***STEP:6 _ KEY FEATURES DRIVING CHURN***

`Tenure → Customers with shorter tenure are more likely to churn.`

`Contract type → Month-to-month contracts have the highest churn risk compared to yearly contracts.`

`Internet service type → Fiber optic customers often churn more due to higher costs or service dissatisfaction.`

`MonthlyCharges and TotalCharges → Higher monthly charges are linked to higher churn. TotalCharges largely tracks tenure, so churners tend to have lower total charges.`

`OnlineSecurity, TechSupport, DeviceProtection → Lack of these services correlates with higher churn.`

`PaperlessBilling and payment method → Customers on paperless billing, and especially those paying by electronic check, tend to churn more often.`

**summary:**

`Customers with short tenure, month-to-month contracts, high monthly charges, and few additional services are the most at risk of churn. Feature importances from a tree ensemble such as Gradient Boosting can be used to confirm this ranking.`
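
A minimal sketch of how such a ranking could be read off a fitted Gradient Boosting model through its `feature_importances_` attribute. The data and the `feature_i` column names are synthetic placeholders (assumptions), so the printed ranking says nothing about the Telco features themselves.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with hypothetical column names.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_demo = pd.DataFrame(X_demo, columns=feature_names)

gb_demo = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    gb_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.head(5))
```

In the notebook itself the equivalent call would be `pd.Series(gb.feature_importances_, index=X_train.columns)`. Impurity-based importances can be split across correlated one-hot columns, so the individual values are best read as a rough guide.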





# ***END-TO-END SUMMARY OF THIS PROJECT***

1) Loaded the encoded dataset (Telco_Encoded.csv).

2) Split the dataset into training and testing sets.

3) Trained and evaluated a Logistic Regression model.

4) Trained and evaluated a Decision Tree model.

5) Trained and evaluated a Random Forest model.

6) Trained and evaluated a Gradient Boosting model.

7) Trained and evaluated a Support Vector Machine (SVM) model.

8) Trained and evaluated an MLP Neural Network model.

9) Generated evaluation reports (Accuracy, Classification Report, ROC-AUC) for each model.

10) Created a comparison table for all models vs metrics.

11) Selected the best-performing model by accuracy (MLP).
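
The train, evaluate, and compare steps listed above can also be written as one loop that fills the comparison table directly instead of copying numbers by hand. The sketch below uses synthetic data and only two of the six models to stay short; the `models` dictionary and the `Xd_*` names are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=800, n_features=12, n_informative=5,
    weights=[0.73, 0.27], random_state=42,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)),
    ]),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(yd_test, model.predict(Xd_test)),
        "ROC-AUC": roc_auc_score(yd_test, model.predict_proba(Xd_test)[:, 1]),
    })

comparison = (
    pd.DataFrame(rows)
    .sort_values("ROC-AUC", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```

Adding the remaining models to `models` would reproduce the full table while keeping the reported numbers in sync with the fitted estimators.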