## Assignment: Predicting Customer Churn
### Objective:
* Build a classification model that predicts whether a customer will churn (stop using the service) from their account features and their interactions with the service.



<span style="font-size: 30px; color: green">Import Libraries</span>

In [30]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

<span style="font-size: 30px; color: green">Load Dataset</span>

In [2]:
data = pd.read_csv("telco_customer_churn.csv")
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


<span style="font-size: 30px; color: green">Data Preprocessing</span>

In [3]:
# check the datatypes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [4]:
# Change the TotalCharges column to float
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors='coerce')
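
The conversion above relies on `errors='coerce'`, which turns any value that cannot be parsed as a number into `NaN` instead of raising an error. The following is a minimal sketch of that behaviour on a hypothetical four-row series; the blank string is an assumption about what the unparseable entries look like, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: three numeric strings and one blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Entries that cannot be parsed become NaN rather than raising ValueError.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())

print(converted.dtype)
print(n_missing)
```

This is why missing values only show up in `TotalCharges` after the conversion: they were present as non-numeric text before it.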

In [5]:
# check for missing values
data.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [6]:
# Handle missing values: fill the 11 missing TotalCharges entries with the column mean
data["TotalCharges"] = data["TotalCharges"].fillna(data["TotalCharges"].mean())

In [7]:
data.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

<span style="font-size: 30px; color: green">Encoding</span>

<span style="font-size: 20px; color: blue">Converting categorical variables to numeric</span>

In [8]:
# Label-encode every object (string) column except the customer identifier
for column in data.select_dtypes(include="object"):
    if column != "customerID":
        # LabelEncoder maps each distinct category to an integer code
        data[column] = LabelEncoder().fit_transform(data[column])
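
Label encoding assigns integer codes in sorted order, which distance-based models such as KNN can read as a ranking between categories. One common alternative for nominal columns is one-hot encoding. The sketch below applies `pd.get_dummies` to a hypothetical miniature frame; the rows are made up for illustration and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame reusing two column names from the dataset.
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 60, 45],
})

# One 0/1 indicator column per category, so no ordering is implied.
encoded = pd.get_dummies(sample, columns=["Contract"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Whether this helps on the churn data is an empirical question; it is shown here only as an option to compare against the label-encoded features.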

In [9]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,1,2,29.85,29.85,0
1,5575-GNVDE,1,0,0,0,34,1,0,0,2,...,2,0,0,0,1,0,3,56.95,1889.5,0
2,3668-QPYBK,1,0,0,0,2,1,0,0,2,...,0,0,0,0,0,1,3,53.85,108.15,1
3,7795-CFOCW,1,0,0,0,45,0,1,0,2,...,2,2,0,0,1,0,0,42.3,1840.75,0
4,9237-HQITU,0,0,0,0,2,1,0,1,0,...,0,0,0,0,0,1,2,70.7,151.65,1


<span style="font-size: 30px; color: green">Data Splitting</span>

In [10]:
x = data.drop(["customerID", "Churn"], axis=1)
y = data["Churn"]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
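
Churn labels are usually imbalanced, so a plain random split can leave the test set with a different churn rate than the training set. A hedged sketch of a stratified split follows, using hypothetical labels with 25% positives rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 negatives and 25 positives.
y_demo = np.array([0] * 75 + [1] * 25)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to passing `stratify=y` to the existing `train_test_split` call.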

<span style="font-size: 20px; color: blue"> Feature scaling</span>

In [11]:
# Min-max normalization: fit the scaler on the training set only, then reuse it on the test set
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
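
`MinMaxScaler` rescales each feature as `(x - min) / (max - min)`, where `min` and `max` are learned from the data passed to `fit`. The sketch below checks this on a hypothetical single-feature array and shows that test values outside the training range fall outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[5.0], [30.0]])

# The minimum and maximum are learned from the training data only.
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

# The same computation written out by hand.
train_min = train.min(axis=0)
train_max = train.max(axis=0)
manual = (test - train_min) / (train_max - train_min)

print(scaled_test.ravel())
print(np.allclose(scaled_test, manual))
```

Fitting on the training set alone, as the cell above does, keeps information about the test set out of the preprocessing step.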

<span style="font-size: 30px; color: green">Model Training</span>

In [12]:
# Train classification models on the scaled training set.
# The cells below fit logistic regression, KNN, and a linear SVM;
# decision trees and random forests are further candidates.

In [13]:
# logistic regression
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
logreg_prediction = logreg.predict(X_test_scaled)

In [14]:
# KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
knn_prediction = knn.predict(X_test_scaled)

In [15]:
# Support Vector Machine (SVM)
svm = SVC(kernel="linear", C=1)
svm.fit(X_train_scaled, y_train)
svm_prediction = svm.predict(X_test_scaled)

<span style="font-size: 30px; color: green">Model Evaluation</span>
* Evaluate the models: Assess the performance of the models using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

In [16]:
# Logistic regression
logreg_accuracy = accuracy_score(y_test, logreg_prediction)
logreg_precision = precision_score(y_test, logreg_prediction)
logreg_recall = recall_score(y_test, logreg_prediction)
logreg_f1 = f1_score(y_test, logreg_prediction)
logreg_roc_auc = roc_auc_score(y_test, logreg_prediction)
# Values recorded from a run in which recall was computed with precision_score (%): 82, 68, 68, 74

In [17]:
# KNN
knn_accuracy = accuracy_score(y_test, knn_prediction)
knn_precision = precision_score(y_test, knn_prediction)
knn_recall = recall_score(y_test, knn_prediction)
knn_f1 = f1_score(y_test, knn_prediction)
knn_roc_auc = roc_auc_score(y_test, knn_prediction)
# Values recorded from a run in which recall was computed with precision_score (%): 75, 53, 53, 51, 67

In [23]:
# Support Vector Machine (SVM)
svm_accuracy = accuracy_score(y_test, svm_prediction)
svm_precision = precision_score(y_test, svm_prediction)
svm_recall = recall_score(y_test, svm_prediction)
svm_f1 = f1_score(y_test, svm_prediction)
svm_roc_auc = roc_auc_score(y_test, svm_prediction)
# Values recorded from a run in which recall was computed with precision_score (%): 82, 68, 68, 63, 74
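
The three evaluation cells repeat the same five calls and pass hard class labels to `roc_auc_score`. ROC-AUC is normally computed from continuous scores, so one possible refactoring is a small helper that uses `predict_proba` where available and `decision_function` otherwise. The `evaluate` function and the synthetic data below are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return five metrics; ROC-AUC is computed from scores, not hard labels."""
    pred = model.predict(X_test)
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]  # probability of class 1
    else:
        scores = model.decision_function(X_test)    # e.g. SVC without probability=True
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, scores),
    }


# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

metrics = evaluate(LogisticRegression().fit(X_tr, y_tr), X_te, y_te)
print(metrics)
```

In the notebook, the equivalent calls would be `evaluate(logreg, X_test_scaled, y_test)` and likewise for `knn` and `svm`. ROC-AUC values obtained this way are not comparable with ones computed from hard labels.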

<span style="font-size: 30px; color: green">Hyperparameter Tuning</span>
* Optimize the models: Perform hyperparameter tuning using techniques like grid search or random search to improve the performance of the models.

In [26]:
# Using Grid Search for SVM
param_grid = {
    'C': [0.1, 1, 10],            # values for the regularization parameter
    'kernel': ['linear', 'rbf'],  # kernel types to try
}
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5)  # cv=5: five-fold cross-validation
grid_search.fit(X_train_scaled, y_train)

best_params = grid_search.best_params_
best_params

{'C': 0.1, 'kernel': 'linear'}
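
The search reports only the best parameters. With the default `refit=True`, `GridSearchCV` also retrains a model with those parameters on the full training set and exposes it as `best_estimator_`, which can then be scored on the held-out test set. The sketch below shows this on synthetic data, so its scores say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the scaled churn features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

# best_estimator_ is already refitted on all of X_tr with the best parameters.
best_model = search.best_estimator_
cv_accuracy = search.best_score_            # mean cross-validated accuracy
test_accuracy = best_model.score(X_te, y_te)

print(search.best_params_)
print(round(cv_accuracy, 3), round(test_accuracy, 3))
```

By default the search ranks candidates by accuracy. Passing `scoring="f1"` or `scoring="recall"` is an option when catching churners matters more than overall accuracy.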

In [27]:
# Using Grid Search for KNN
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],  # Values for the number of neighbors
}

knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

best_params = grid_search.best_params_
best_params

{'n_neighbors': 11}

In [32]:
# Using Random Search for KNN
param_dist = {
    'n_neighbors': np.arange(1, 21),  # Values for the number of neighbors (1 to 20)
}

knn = KNeighborsClassifier()
# n_iter=10: sample 10 of the 20 candidate values of n_neighbors at random
random_search = RandomizedSearchCV(knn, param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X_train_scaled, y_train)

best_params = random_search.best_params_
best_params

{'n_neighbors': 13}
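
The grid search and the random search select different values of `n_neighbors` partly because the random search samples its candidates without a fixed seed. A hedged sketch of a reproducible variant follows. It also wraps the scaler and the classifier in a `Pipeline` so that scaling is refitted inside each cross-validation fold. The pipeline step names and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the unscaled churn features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling happens inside the pipeline, so each fold fits its own scaler.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"knn__n_neighbors": np.arange(1, 21)},
    n_iter=10,
    cv=5,
    random_state=42,  # fixes which 10 candidates are sampled
)
search.fit(X, y)

best_k = int(search.best_params_["knn__n_neighbors"])
print(best_k, round(search.best_score_, 3))
```

The `step__parameter` naming (`knn__n_neighbors`) is how scikit-learn addresses parameters of a step inside a pipeline.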
