# Telco Customer Churn Prediction
### Problem Statement: Predict which customers are likely to churn, based on their behavior and interactions with the company.

### Dataset: https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download

In [23]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [24]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [25]:
# Load the dataset
data = pd.read_csv("/content/drive/My Drive/ML Project/telco_customer_churn.csv")

In [26]:
# Display the first few rows of the dataset
print(data.head())

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

# Data preprocessing

In [27]:
# TODO: Separate dataset into feature matrix and target vector
X = data.drop('Churn', axis=1)
y = data['Churn']
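
One thing worth checking at this step: `X` still contains `customerID`, which the later `select_dtypes(include=['object'])` call treats as a categorical feature. One-hot encoding an identifier produces one column per customer and carries no signal for customers the model has not seen. The following is a minimal sketch on a made-up toy frame (the IDs and values are hypothetical, not rows from the dataset) of dropping the identifier together with the target:

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "customerID": ["A-001", "A-002", "A-003"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

# Every ID is unique, so it cannot generalise to new customers.
print(toy["customerID"].nunique() == len(toy))

# Drop the identifier and the target from the feature matrix.
X_toy = toy.drop(columns=["customerID", "Churn"])
y_toy = toy["Churn"]
print(list(X_toy.columns))
```

Applying the same idea to the notebook would change the fitted models, so the scores reported further down would need to be regenerated.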

In [28]:
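# TODO: Check if there are any missing values
missing_values = data.isnull().sum()

if missing_values.any():
    data = data.dropna()
    # X and y were built before this cell, so rebuild them to stay aligned with the cleaned rows
    X = data.drop('Churn', axis=1)
    y = data['Churn']

print(data.isnull().sum())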

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
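
The zeros above only count true `NaN` values. In some copies of this dataset `TotalCharges` is stored as text and a few entries are blank strings, which `isnull()` does not flag. Whether that applies to this particular file is an assumption that can be checked with `data.dtypes`. A small sketch on a toy frame (values are made up) of how such blanks could be surfaced:

```python
import pandas as pd

# Toy frame; the " " entry mimics a blank TotalCharges value (an assumption).
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# isnull() reports nothing because " " is an ordinary string.
print(toy.isnull().sum().sum())  # 0

# Coerce to numeric: anything that cannot be parsed becomes NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isnull().sum())
print(toy["TotalCharges"].dtype, n_missing)
```

After the conversion the column is numeric, so it would be picked up by the numeric transformer and imputed there instead of being one-hot encoded.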


In [29]:
# TODO: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
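
The classes are imbalanced (the Random Forest confusion matrix further down shows 1036 non-churners against 373 churners in the test split). Passing `stratify=y` keeps that ratio the same in both splits. A minimal sketch with toy labels in an assumed 3:1 ratio:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 75 "No" and 25 "Yes" labels (an assumed ratio, not the real counts).
y_toy = ["No"] * 75 + ["Yes"] * 25
X_toy = [[i] for i in range(len(y_toy))]

# stratify=y_toy preserves the class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```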

In [30]:
# TODO: Define preprocessing steps for numerical and categorical columns
# TODO: Create Numeric Transformer
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# TODO: Create Categorical Transformer
categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [31]:
# TODO: ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
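
To see what this preprocessor produces, here is a small sketch that rebuilds the same two pipelines on a toy frame. The column names are borrowed from the dataset, but the values and the name `toy_preprocessor` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing value per column.
toy = pd.DataFrame({
    "tenure": [1.0, 34.0, np.nan, 45.0],
    "Contract": ["Month-to-month", "One year", "Month-to-month", np.nan],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["tenure"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Contract"]),
])

# One scaled numeric column plus one indicator column per contract type.
out = toy_preprocessor.fit_transform(toy)
print(out.shape)  # (4, 3)
```

Because the preprocessor sits inside the pipeline used for grid search, the imputation and scaling statistics are learned from the training folds only.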

# Define Models (Random Forest, SVM, and KNN)

In [32]:
# TODO: Build a parameter grid
models_grid = {
    "Random Forest": (RandomForestClassifier(), {
        'classifier__n_estimators': [100, 200, 300],
        'classifier__max_depth': [None, 10, 20]
    }),
    "SVM": (SVC(), {
        'classifier__C': [0.1, 1, 10],
        'classifier__gamma': [0.1, 0.01, 0.001],
        'classifier__kernel': ['rbf', 'linear']
    }),
    "KNN": (KNeighborsClassifier(), {
        'classifier__n_neighbors': [3, 5, 7],
        'classifier__weights': ['uniform', 'distance']
    })
}
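
In the SVM grid, `gamma` only affects the `rbf` kernel. With `kernel='linear'` every `gamma` value fits the same model, so part of the search is repeated work. `GridSearchCV` also accepts a list of dictionaries, and a sketch of that alternative (the names `single_grid` and `split_grid` are illustrative) shows the difference in the number of candidates:

```python
from sklearn.model_selection import ParameterGrid

# The grid as written in the notebook: every kernel is crossed with every gamma.
single_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__gamma": [0.1, 0.01, 0.001],
    "classifier__kernel": ["rbf", "linear"],
}

# Alternative: only the rbf kernel gets a gamma axis.
split_grid = [
    {"classifier__kernel": ["rbf"],
     "classifier__C": [0.1, 1, 10],
     "classifier__gamma": [0.1, 0.01, 0.001]},
    {"classifier__kernel": ["linear"],
     "classifier__C": [0.1, 1, 10]},
]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))
print(n_single, n_split)  # 18 12
```

With `cv=5` that is 60 fits instead of 90 for the SVM, before the final refit.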

# Perform grid search for each model

In [33]:
# TODO: Implement Grid Search
for name, (model, param_grid) in models_grid.items():
    print(f"Performing Grid Search for {name}...")
    clf = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    # TODO: Print the best hyperparameters found
    print(f"Best Hyperparameters for {name}: {grid_search.best_params_}")

    # TODO: Print the cross-validation test scores
    print(f"Cross-validation Test Scores for {name}:")
    cv_results = grid_search.cv_results_
    for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
        print(f"{params}: {mean_score:.3f}")
    print("\n")

    # TODO: Evaluate the model with the best hyperparameters
    y_pred = grid_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.2f}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print("\n")

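    print("Best parameters:", grid_search.best_params_)
    # best_score_ is the mean validation-fold accuracy of the best candidate, not a training score
    print("Best mean cross-validation score:", grid_search.best_score_)
    # Both calls below score the refit best estimator on the held-out test set
    print("Test set score (scoring='accuracy'):", grid_search.score(X_test, y_test))
    print("Test set accuracy:", grid_search.best_estimator_.score(X_test, y_test))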

Performing Grid Search for Random Forest...
Best Hyperparameters for Random Forest: {'classifier__max_depth': None, 'classifier__n_estimators': 300}
Cross-validation Test Scores for Random Forest:
{'classifier__max_depth': None, 'classifier__n_estimators': 100}: 0.790
{'classifier__max_depth': None, 'classifier__n_estimators': 200}: 0.787
{'classifier__max_depth': None, 'classifier__n_estimators': 300}: 0.791
{'classifier__max_depth': 10, 'classifier__n_estimators': 100}: 0.734
{'classifier__max_depth': 10, 'classifier__n_estimators': 200}: 0.734
{'classifier__max_depth': 10, 'classifier__n_estimators': 300}: 0.734
{'classifier__max_depth': 20, 'classifier__n_estimators': 100}: 0.768
{'classifier__max_depth': 20, 'classifier__n_estimators': 200}: 0.767
{'classifier__max_depth': 20, 'classifier__n_estimators': 300}: 0.766


Random Forest Accuracy: 0.80
[[949  87]
 [200 173]]
              precision    recall  f1-score   support

          No       0.83      0.92      0.87      1036
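
The Random Forest figures can be re-derived by hand from the confusion matrix printed above. In scikit-learn's layout the rows are true labels and the columns are predictions, with labels in sorted order (`No`, then `Yes`). The sketch below uses plain arithmetic and treats `Yes` (churn) as the positive class:

```python
# Confusion matrix copied from the Random Forest output above.
cm = [[949, 87],
      [200, 173]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp                # 1409 test customers
accuracy = (tn + tp) / total             # 1122 / 1409
no_precision = tn / (tn + fn)            # 949 / 1149
no_recall = tn / (tn + fp)               # 949 / 1036
churn_precision = tp / (tp + fp)         # 173 / 260
churn_recall = tp / (tp + fn)            # 173 / 373
churn_f1 = 2 * tp / (2 * tp + fp + fn)   # 346 / 633

print(f"accuracy={accuracy:.2f}")                                   # 0.80
print(f"No:  precision={no_precision:.2f} recall={no_recall:.2f}")  # 0.83 0.92
print(f"Yes: precision={churn_precision:.2f} recall={churn_recall:.2f} f1={churn_f1:.2f}")  # 0.67 0.46 0.55
```

So although accuracy is 0.80, the model finds fewer than half of the customers who actually churned.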
    

According to the results, the Support Vector Machine (SVM) model works best. After grid search, the SVM with its best hyperparameters achieved the highest accuracy, 82%, outperforming both the Random Forest and KNN models. Best parameters: {'classifier__C': 0.1, 'classifier__gamma': 0.1, 'classifier__kernel': 'linear'}. Because the selected kernel is linear, the gamma value has no effect on this model.

The Random Forest model reached an accuracy of 80%, but its performance metrics for predicting churn were not as good as those of the SVM model. Its precision, recall, and F1-score for the minority class (churned customers) were lower, indicating that it identifies churners less reliably.

The KNN model achieved an accuracy of 78% and also showed lower precision, recall, and F1-score for predicting churn than the SVM model.

The SVM model with its best parameters looks to be a good fit for the customer churn prediction problem.
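
The comparison above rests on precision, recall, and F1 for the churn class, while the grid search itself selected hyperparameters by accuracy. If churn detection is the priority, the search could be scored on the churn class directly. This is a sketch under assumptions, not what the notebook ran: the data is synthetic, and the string labels require an explicit `pos_label`, so a custom scorer is built with `make_scorer`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the churn data (roughly 3:1).
X_syn, y_num = make_classification(
    n_samples=200, n_features=5, weights=[0.75, 0.25], random_state=0
)
y_syn = np.where(y_num == 1, "Yes", "No")

# F1 computed for the "Yes" class only.
churn_f1 = make_scorer(f1_score, pos_label="Yes")

search = GridSearchCV(
    SVC(kernel="linear"),
    {"C": [0.1, 1, 10]},
    cv=5,
    scoring=churn_f1,
)
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

In the notebook's pipeline the same scorer could be passed as `scoring=` in place of `'accuracy'`, which may lead to a different choice of best model.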

I drew on the examples provided in D2L for our course, especially those in the folder named Pre-Processing Methods. I also used the following repository to learn about the coding side (https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb), ChatGPT (https://chat.openai.com/), and the following YouTube videos: (https://www.youtube.com/watch?v=M9Itm95JzL0&t=951s) (https://www.youtube.com/watch?v=i_LwzRVP7bg&t=3362s). I used ChatGPT mostly to go over errors and to understand the code and the problem. When I ran into a few errors, it helped me understand them and finally resolve them.