# <u><b> Objective </b></u>
## <b>To predict whether a customer will churn, based on the variables available in the Telco customer churn data.</b>


### Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.

### First, logistic regression does not require a linear relationship between the dependent and independent variables.  Second, the error terms (residuals) do not need to be normally distributed.  Third, homoscedasticity is not required.  Finally, the dependent variable in logistic regression is not measured on an interval or ratio scale.

### However, some other assumptions still apply.

### First, binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.

### Second, logistic regression requires the observations to be independent of each other.  In other words, the observations should not come from repeated measurements or matched data.

### Third, logistic regression requires there to be little or no multicollinearity among the independent variables.  This means that the independent variables should not be too highly correlated with each other.

### Fourth, logistic regression assumes linearity between the independent variables and the log odds.  Although this analysis does not require the dependent and independent variables to be related linearly, it does require that the independent variables are linearly related to the log odds.
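
As a quick numerical check of the fourth assumption, the sketch below builds probabilities from a linear predictor and confirms that the log odds recover that line, while the probabilities themselves are not linear in `x`. The intercept and slope are made-up values for illustration, not estimates from the Telco data.

```python
import numpy as np


def sigmoid(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients, chosen only for illustration
intercept, slope = -1.0, 0.5
x = np.linspace(-4, 4, 9)

# Probability of the positive class under a logistic model
p = sigmoid(intercept + slope * x)

# Log odds (logit) of those probabilities
log_odds = np.log(p / (1 - p))

# The log odds equal the linear predictor, up to floating-point error
print(np.allclose(log_odds, intercept + slope * x))

# The probabilities do not change by a constant step, so they are not linear in x
print(np.allclose(np.diff(p), np.diff(p)[0]))
```
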


### <b> In this assignment, you need to do the following: </b>

* ### Remove correlated variables and run Logistic Regression
* ### Also implement regularized logistic regression using the hyperparameter C in the sklearn implementation. Add details about how this hyperparameter affects the learning and performance of the model.
* ### Evaluate your logistic regression models using metrics such as roc_auc, log_loss, precision, recall, accuracy and f-score. You already know these metrics from your assignments in Module 1. Explain what you observe in the metric results.
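
One common way to approach the first task is to drop one feature from every pair whose absolute Pearson correlation exceeds a cutoff. The sketch below illustrates the idea on synthetic data; the helper `drop_correlated`, the column names, and the 0.9 threshold are assumptions made for the example, not part of the assignment.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair with |correlation| above threshold.

    Hypothetical helper for illustration; returns the reduced frame and
    the list of dropped column names.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic features: "a_scaled" is almost a multiple of "a"
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
features = pd.DataFrame(
    {"a": a, "b": b, "a_scaled": 3 * a + rng.normal(scale=0.1, size=n)}
)

reduced, dropped = drop_correlated(features, threshold=0.9)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

The threshold is a judgment call: a lower cutoff removes more features and reduces multicollinearity further, at the cost of possibly discarding useful signal.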

### Let's start solving the assignment!

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, log_loss, precision_score, recall_score, accuracy_score, f1_score

In [5]:
# Import Dataset
data = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [7]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [9]:
data.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [11]:
# Encode the target: 1 if the customer churned, 0 otherwise
data["class"] = (data["Churn"] == "Yes").astype(int)
# Use two numeric columns as predictors
X = data[["tenure", "MonthlyCharges"]].copy()
y = data["class"].copy()

In [22]:
# Split the data into train and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the model
model = LogisticRegression()
# Train the model
model.fit(X_train,y_train)
# Make predictions
y_pred = model.predict(X_test)




In [23]:
# Regularized Logistic Regression with different values of C
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C_val in C_values:
    regularized_model = LogisticRegression(C=C_val)
    regularized_model.fit(X_train, y_train)
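    # Evaluate the model and print performance metrics
    y_pred = regularized_model.predict(X_test)
    print(f"\nEvaluation metrics for C={C_val}:")
    # Note: y_pred holds hard 0/1 labels. ROC-AUC and log loss are usually computed
    # from regularized_model.predict_proba(X_test)[:, 1]; with labels, every
    # misclassified row gets the maximum penalty, hence a log loss of about 7.3.
    print(f"ROC-AUC: {roc_auc_score(y_test, y_pred)}")
    print(f"Log Loss: {log_loss(y_test, y_pred)}")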
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"F1 Score: {f1_score(y_test, y_pred)}")

print("=================================================================================")

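# Evaluate the unregularized-by-choice baseline (default C=1) and print performance metrics
y_pred = model.predict(X_test)
print("\nEvaluation metrics for Logistic Regression:")
# Note: y_pred holds hard 0/1 labels; ROC-AUC and log loss are usually
# computed from model.predict_proba(X_test)[:, 1] instead.
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred)}")
print(f"Log Loss: {log_loss(y_test, y_pred)}")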
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred)}")



Evaluation metrics for C=0.001:
ROC-AUC: 0.6973679443518584
Log Loss: 7.26500891590438
Precision: 0.6642066420664207
Recall: 0.48257372654155495
Accuracy: 0.7984386089425124
F1 Score: 0.5590062111801243

Evaluation metrics for C=0.01:
ROC-AUC: 0.696885318869233
Log Loss: 7.290589933213902
Precision: 0.6617647058823529
Recall: 0.48257372654155495
Accuracy: 0.7977288857345636
F1 Score: 0.5581395348837209

Evaluation metrics for C=0.1:
ROC-AUC: 0.696885318869233
Log Loss: 7.290589933213902
Precision: 0.6617647058823529
Recall: 0.48257372654155495
Accuracy: 0.7977288857345636
F1 Score: 0.5581395348837209

Evaluation metrics for C=1:
ROC-AUC: 0.696885318869233
Log Loss: 7.290589933213902
Precision: 0.6617647058823529
Recall: 0.48257372654155495
Accuracy: 0.7977288857345636
F1 Score: 0.5581395348837209

Evaluation metrics for C=10:
ROC-AUC: 0.696885318869233
Log Loss: 7.290589933213902
Precision: 0.6617647058823529
Recall: 0.48257372654155495
Accuracy: 0.7977288857345636
F1 Score: 0.5581395
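
The metrics above barely move with C. One possible reason is that they are all computed from hard 0/1 predictions, which hides changes in the predicted probabilities. In sklearn, `C` is the inverse of the regularization strength: a smaller `C` means a stronger penalty (L2 by default) and smaller coefficients. The sketch below illustrates this on synthetic data from `make_classification` (an assumption for the example, not the Telco data) and scores ROC-AUC and log loss on `predict_proba` output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: this is not the Telco dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for C_val in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C_val, max_iter=1000)
    clf.fit(X_tr, y_tr)

    # Probability of the positive class, used for ranking and calibration metrics
    proba = clf.predict_proba(X_te)[:, 1]

    results[C_val] = {
        "coef_norm": float(np.linalg.norm(clf.coef_)),
        "roc_auc": float(roc_auc_score(y_te, proba)),
        "log_loss": float(log_loss(y_te, proba)),
    }
    print(
        f"C={C_val}: coef norm={results[C_val]['coef_norm']:.3f}, "
        f"ROC-AUC={results[C_val]['roc_auc']:.3f}, "
        f"log loss={results[C_val]['log_loss']:.3f}"
    )
```

The coefficient norm should grow as `C` increases, since the penalty is relaxed. Whether test metrics improve depends on the data: a very small `C` can underfit, while a very large `C` approaches an unpenalized fit.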

In [24]:
from sklearn.metrics import confusion_matrix, classification_report

In [25]:
# Standardize features (optional but often beneficial for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print results
print("Confusion Matrix:")
print(conf_matrix)

print("\nClassification Report:")
print(classification_rep)

Confusion Matrix:
[[945  91]
 [193 180]]

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.91      0.87      1036
           1       0.66      0.48      0.56       373

    accuracy                           0.80      1409
   macro avg       0.75      0.70      0.71      1409
weighted avg       0.79      0.80      0.79      1409
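
The class-1 figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below assumes sklearn's layout (rows are true labels, columns are predicted labels) and reuses the four counts printed above.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
conf_matrix = np.array([[945, 91],
                        [193, 180]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_matrix.ravel()

precision = tp / (tp + fp)            # of predicted churners, how many churned
recall = tp / (tp + fn)               # of actual churners, how many were caught
accuracy = (tp + tn) / conf_matrix.sum()
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"F1 score:  {f1:.3f}")
```

These round to 0.66, 0.48, 0.80 and 0.56, matching the class-1 row and the accuracy line of the report. The low recall means that roughly half of the customers who actually churned (193 of 373) are missed by the model.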

