## Scenario:


You are working as an analyst for an internet service provider. You have been given historical data about the company's customers and their churn. Your task is to build a machine learning model that identifies the customers who are most likely to churn (default), so that the company can prevent losses from such customers.

## Instructions:
In this lab, we will first take a look at the degree of imbalance in the data and correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

# Round 1

1. Import the required libraries and modules.
2. Read the data into Python and name the dataframe churnData.
3. Check the datatypes of all the columns. The column TotalCharges is of type object; convert it to a numeric type with the pd.to_numeric function.
4. Check for null values in the dataframe and replace them.
5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges.
6. Split the data into a training set and a test set.
7. Scale the features with either a normalizer or a standard scaler.
8. Fit a logistic regression model on the training data.
9. Fit a KNN classifier (not a KNN regressor) on the training data.
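
The modelling steps of this round (split, scale, fit) can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the lab's solution: the helper `make_fake_churn`, its coefficients, and the `Churn` values it generates are hypothetical; only the four feature names come from the instructions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def make_fake_churn(n=500, seed=0):
    """Hypothetical stand-in for churnData with the four lab features."""
    rng = np.random.default_rng(seed)
    tenure = rng.integers(0, 73, size=n)
    monthly = rng.uniform(20, 120, size=n)
    frame = pd.DataFrame({
        "tenure": tenure,
        "SeniorCitizen": rng.integers(0, 2, size=n),
        "MonthlyCharges": monthly,
        "TotalCharges": tenure * monthly,
    })
    # Made-up rule: short tenure and high charges raise the churn probability.
    logit = -0.05 * tenure + 0.02 * monthly - 1.0
    frame["Churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return frame


data = make_fake_churn()
X = data[["tenure", "SeniorCitizen", "MonthlyCharges", "TotalCharges"]]
y = data["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression().fit(X_train_scaled, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)

print("Logistic regression accuracy:", log_reg.score(X_test_scaled, y_test))
print("KNN accuracy:", knn.score(X_test_scaled, y_test))
```

Fitting the scaler on the training rows alone keeps information from the test set out of the preprocessing step. Scaling matters most for KNN, whose distances would otherwise be dominated by the large TotalCharges values.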


# Round 2

1. Fit a decision tree classifier on the training data.
2. Check the accuracy on the test data.
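
A minimal sketch of these two steps, assuming synthetic data from `make_classification` in place of the churn features. The class weights imitate the roughly 73/27 split of the real target, and `max_depth=4` is a hypothetical choice, not a value prescribed by the lab.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four churn features (an assumption, not the lab data).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=11,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11
)

# max_depth is a hypothetical setting to limit overfitting; tune it on real data.
tree = DecisionTreeClassifier(max_depth=4, random_state=11)
tree.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```

Decision trees split on thresholds, so they do not need scaled features. On imbalanced data, accuracy should be read against the share of the majority class, since a model that always predicts "no churn" already scores well.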

# Round 3

Apply K-fold cross-validation to the models fitted so far and check the model scores. Note: So far we have not balanced the data.

# Round 4

1. Fit a random forest classifier on the data and compare the accuracy.
2. Tune the hyperparameters with grid search and check the results.
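
One way to carry out the grid search is scikit-learn's `GridSearchCV`. The sketch below runs on synthetic data, and the parameter grid is a hypothetical, deliberately small example; a real search would use the churn training set and a grid chosen for it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the lab data).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=11,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11
)

# Hypothetical grid, kept small so the search finishes quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=11),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
print("Test accuracy:", round(search.score(X_test, y_test), 3))
```

After fitting, `search` behaves like the best model refitted on the whole training set, so `search.score` and `search.predict` can be used directly.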

# Managing imbalance in the dataset

1. Check for the imbalance.
2. Use the resampling strategies covered in class for upsampling and downsampling to balance the two classes.
3. After each resampling, refit the model and check how its accuracy changes.
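
The upsampling and downsampling steps can be sketched with `sklearn.utils.resample`. The toy frame below is made up for illustration; in the lab, the same calls would be applied to the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy frame with a 0/1 target; the values are made up.
rng = np.random.default_rng(11)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=100),
    "default": [0] * 75 + [1] * 25,
})
majority = toy[toy["default"] == 0]
minority = toy[toy["default"] == 1]

# Upsampling: draw minority rows with replacement until the classes match.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=11
)
upsampled = pd.concat([majority, minority_up])

# Downsampling: draw a majority subset without replacement.
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=11
)
downsampled = pd.concat([majority_down, minority])

print(upsampled["default"].value_counts().to_dict())
print(downsampled["default"].value_counts().to_dict())
```

Upsampling keeps every original row but repeats minority rows, while downsampling discards majority rows and therefore some information.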

In [21]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing, load_breast_cancer, load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

# Our own cleaning helpers (data_frame_overview and null_check, used below,
# are presumably defined in this module)
from clean_data_functions import *

warnings.filterwarnings("ignore", category=FutureWarning)

%matplotlib inline

In [22]:
churnData = pd.read_csv("DATA_Customer-Churn.csv")
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [23]:


churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')

data_frame_overview(churnData)

Column names: 
 Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

Dimensions: (7043, 16)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 

|  | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.7 | 151.65 | Yes |
| 5 | Female | 0 | No | No | 8 | Yes | No | No | Yes | No | Yes | Yes | Month-to-month | 99.65 | 820.5 | Yes |
| 6 | Male | 0 | No | Yes | 22 | Yes | No | Yes | No | No | Yes | No | Month-to-month | 89.1 | 1949.4 | No |
| 7 | Female | 0 | No | No | 10 | No | Yes | No | No | No | No | No | Month-to-month | 29.75 | 301.9 | No |
| 8 | Female | 0 | Yes | No | 28 | Yes | No | No | Yes | Yes | Yes | Yes | Month-to-month | 104.8 | 3046.05 | Yes |
| 9 | Male | 0 | No | Yes | 62 | Yes | Yes | Yes | No | No | No | No | One year | 56.15 | 3487.95 | No |


In [24]:
null_check(churnData)

Total null values per row: 
0       0
1       0
2       0
3       0
4       0
       ..
7038    0
7039    0
7040    0
7041    0
7042    0
Length: 7043, dtype: int64

Total null values per column: 
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
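
The 11 nulls in TotalCharges appear only after the `pd.to_numeric(..., errors='coerce')` call, so they are presumably entries that could not be parsed as numbers. A small sketch of that behaviour, using made-up values in which a blank string stands in for such an entry:

```python
import pandas as pd

# Made-up values; the blank string mimics an unparseable TotalCharges entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 1
```

Inspecting the affected rows, for example with `churnData[churnData['TotalCharges'].isna()]`, is a useful check before choosing how to replace them.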



In [25]:
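# Assign the result back instead of calling fillna(..., inplace=True) on the column,
# which is deprecated in recent pandas versions and may not modify the dataframe
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(churnData['TotalCharges'].mean())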

In [26]:
null_check(churnData)

Total null values per row: 
0       0
1       0
2       0
3       0
4       0
       ..
7038    0
7039    0
7040    0
7041    0
7042    0
Length: 7043, dtype: int64

Total null values per column: 
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64



In [27]:
# Keep the four requested features plus the target; .copy() avoids a SettingWithCopyWarning
churnData = churnData[['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']].copy()

# Encode the target as 1 (churned) and 0 (stayed)
churnData['default'] = churnData['Churn'].map({'Yes': 1, 'No': 0})
churnData


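|  | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | Churn | default |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 29.85 | No | 0 |
| 1 | 0 | 34 | 56.95 | 1889.50 | No | 0 |
| 2 | 0 | 2 | 53.85 | 108.15 | Yes | 1 |
| 3 | 0 | 45 | 42.30 | 1840.75 | No | 0 |
| 4 | 0 | 2 | 70.70 | 151.65 | Yes | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 7038 | 0 | 24 | 84.80 | 1990.50 | No | 0 |
| 7039 | 0 | 72 | 103.20 | 7362.90 | No | 0 |
| 7040 | 0 | 11 | 29.60 | 346.45 | No | 0 |
| 7041 | 1 | 4 | 74.40 | 306.60 | Yes | 1 |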


In [28]:
churnData.drop(columns=['Churn'], inplace=True)

churnData

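|  | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | default |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 29.85 | 0 |
| 1 | 0 | 34 | 56.95 | 1889.50 | 0 |
| 2 | 0 | 2 | 53.85 | 108.15 | 1 |
| 3 | 0 | 45 | 42.30 | 1840.75 | 0 |
| 4 | 0 | 2 | 70.70 | 151.65 | 1 |
| ... | ... | ... | ... | ... | ... |
| 7038 | 0 | 24 | 84.80 | 1990.50 | 0 |
| 7039 | 0 | 72 | 103.20 | 7362.90 | 0 |
| 7040 | 0 | 11 | 29.60 | 346.45 | 0 |
| 7041 | 1 | 4 | 74.40 | 306.60 | 1 |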


# Round 3

Apply K-fold cross-validation to the models and check the model scores.

Note: So far we have not balanced the data.

# Handling Imbalanced Data

In [29]:
# Split the data first to avoid data leakage

X = churnData.drop(columns=['default'])
y = churnData[['default']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=11)

print(X.shape)
print(y.shape)

# List the unique classes in the target
unique_classes = np.unique(y)
print(unique_classes)

(7043, 4)
(7043, 1)
[0 1]
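
With an imbalanced target, a plain random split can leave the two sets with slightly different class shares. A possible refinement, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels with about the same 73/27 imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with roughly the same imbalance as the churn target.
y = np.array([0] * 730 + [1] * 270)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=11, stratify=y
)

print("Churn share in train:", y_train.mean())
print("Churn share in test: ", y_test.mean())
```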


In [30]:
category_0 = churnData[churnData['default'] == 0] # negative class (majority)
category_1 = churnData[churnData['default'] == 1] # positive class (minority)
print(category_0.shape)
print(category_1.shape)

(5174, 5)
(1869, 5)
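
The two shapes above give the degree of imbalance directly. A short worked calculation, using only the printed row counts:

```python
# Class counts taken from the shapes printed above.
n_stayed = 5174   # default == 0
n_churned = 1869  # default == 1
total = n_stayed + n_churned

churn_share = n_churned / total
imbalance_ratio = n_stayed / n_churned

print(f"Churners: {churn_share:.1%} of {total} customers")
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f} : 1")
```

Roughly one customer in four churned, so a model that always predicts "no churn" would already reach about 73.5% accuracy. This is why recall on the churn class is a more informative metric than accuracy here.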


In [31]:
# Upsample the minority class with SMOTE

from imblearn.over_sampling import SMOTE
smote = SMOTE()

In [32]:
y_test.value_counts()

default
0          1031
1           378
Name: count, dtype: int64

In [33]:
X_sm, y_sm = smote.fit_resample(X_train, y_train)
y_sm.value_counts()

default
0          4143
1          4143
Name: count, dtype: int64

In [34]:
y_test.value_counts()

default
0          1031
1           378
Name: count, dtype: int64
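
The training set held 1491 churners (1869 in total minus the 378 in the test set), so SMOTE created 2652 synthetic rows to reach 4143 per class, while the test set was left untouched. SMOTE builds each synthetic row by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea with made-up points; the function `synthetic_point` is hypothetical and is not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(11)

# Made-up minority samples with two features (for example tenure, MonthlyCharges).
minority = np.array([[2.0, 70.0], [4.0, 74.0], [8.0, 99.0], [1.0, 30.0]])


def synthetic_point(samples, index, k, rng):
    """Interpolate between samples[index] and one of its k nearest neighbours."""
    base = samples[index]
    distances = np.linalg.norm(samples - base, axis=1)
    # Position 0 of argsort is the sample itself (distance 0), so skip it.
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbour_ids)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)


new_point = synthetic_point(minority, index=0, k=2, rng=rng)
print(new_point)
```

Because the interpolation uses Euclidean distances, features on very different scales (such as tenure and TotalCharges) influence the choice of neighbours unevenly; scaling before resampling is one way to address that.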

# Model comparison with cross-validation

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Initialize models
model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = DecisionTreeClassifier()

# Store models in a list
models = [model1, model2, model3]

# Define model names
model_names = ['Logistic Regression', 'KNN', 'Classification Tree']

# Dictionary to store mean scores and standard deviations
scores = {}

# Loop over each model
for model, name in zip(models, model_names):
    # Perform cross-validation and calculate recall scores
    # (y_sm is a single-column DataFrame, so flatten it to 1-D for scikit-learn)
    recall_scores = cross_val_score(model, X_sm, y_sm.values.ravel(), cv=5, scoring='recall')
    # Calculate mean score and standard deviation
    mean_score = np.mean(recall_scores)
    std_score = np.std(recall_scores)
    # Store mean score and standard deviation in the dictionary
    scores[name] = {'mean_score': mean_score, 'std_score': std_score}

# Print scores
print(scores)


In [44]:

model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = DecisionTreeClassifier()

models = [model1, model2, model3]
model_names = ['Logistic Regression', 'KNN', 'Classification Tree']
scores = {}
mean_scores = []
std_scores = []

# y_sm is a single-column DataFrame; flatten it to 1-D so scikit-learn
# does not emit a DataConversionWarning on every fold
y_sm_1d = y_sm.values.ravel()

for model, name in zip(models, model_names):
    # 5-fold cross-validation, scored by recall on the churn class
    recall_scores = cross_val_score(model, X_sm, y_sm_1d, cv=5, scoring='recall')
    mean_score = np.mean(recall_scores)
    std_score = np.std(recall_scores)
    # Store mean score and standard deviation in the dictionary
    scores[name] = {'mean_score': mean_score, 'std_score': std_score}
    # Keep the values in lists as well, to build a summary table afterwards
    mean_scores.append(mean_score)
    std_scores.append(std_score)
scores

{'Logistic Regression': {'mean_score': 0.7487436117084201,
  'std_score': 0.025680255099922684},
 'KNN': {'mean_score': 0.8170372895578749, 'std_score': 0.03620983725567911},
 'Classification Tree': {'mean_score': 0.7617701904978351,
  'std_score': 0.05890008906570003}}

In [45]:
df = pd.DataFrame({'Model': model_names, 'Mean Score': mean_scores, 'Standard Deviation': std_scores})

In [46]:
df

|  | Model | Mean Score | Standard Deviation |
|---|---|---|---|
| 0 | Logistic Regression | 0.748744 | 0.02568 |
| 1 | KNN | 0.817037 | 0.03621 |
| 2 | Classification Tree | 0.76177 | 0.0589 |
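
These scores were obtained by cross-validating on X_sm and y_sm, which already contain synthetic rows. Synthetic points derived from one fold's rows can therefore end up in another fold, and the recall values may be optimistic. A common remedy is to resample inside each training fold only. The sketch below shows this on synthetic data, with plain random oversampling swapped in for SMOTE so that it needs only scikit-learn; with imbalanced-learn, a pipeline from `imblearn.pipeline` with a SMOTE step serves the same purpose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data standing in for X_train / y_train (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=11,
)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
fold_recalls = []

for train_idx, val_idx in folds.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Oversample the minority class (label 1) inside the training fold only.
    minority_mask = y_tr == 1
    n_majority = int((~minority_mask).sum())
    X_min_up, y_min_up = resample(
        X_tr[minority_mask], y_tr[minority_mask],
        replace=True, n_samples=n_majority, random_state=11,
    )
    X_bal = np.vstack([X_tr[~minority_mask], X_min_up])
    y_bal = np.concatenate([y_tr[~minority_mask], y_min_up])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # The validation fold keeps its original, imbalanced distribution.
    fold_recalls.append(recall_score(y[val_idx], model.predict(X[val_idx])))

print("Recall per fold:", np.round(fold_recalls, 3))
print("Mean recall:", round(float(np.mean(fold_recalls)), 3))
```

Scoring each model on the untouched test set (X_test, y_test) gives a further check that does not depend on how the folds were built.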
