## CLASSification_Handling_Imbalanced_Data_20.Feb

Lab | Classification, Handling Imbalanced Data
For this lab we will build a model for a customer churn binary classification problem. You will be using this file: https://drive.google.com/drive/folders/1yLmZrS-uQ2BY98vvlkTsJ4UOfKi_vIaH?usp=sharing

Scenario
You are working as an analyst for an internet service provider. You are provided with this historical data about your company's customers and their churn trends. Your task is to build a machine learning model that will help the company identify customers that are more likely to default/churn and thus prevent losses from such customers.

Instructions
In this lab, we will first take a look at the degree of imbalance in the data and correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

Round 1

Import the required libraries and modules that you would need.
Read that data into Python and call the dataframe churnData.
Check the datatypes of all the columns in the data. You will see that the column TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.
Check for null values in the dataframe. Replace the null values.
Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges.
Split the data into a training set and a test set.
Scale the features using either a normalizer or a standard scaler.
Fit a logistic regression model on the training data.
Fit a KNN classifier (NOT a KNN regressor, please!) on the training data.
Round 2

Fit a Decision Tree Classifier on the training data.
Check the accuracy on the test data.
Round 3

Apply K-fold cross validation to the models built so far and check the model scores. Note: So far we have not balanced the data.
Round 4

Fit a random forest classifier on the data and compare the accuracy.
Tune the hyperparameters with grid search and check the results.
Managing imbalance in the dataset

Check for the imbalance.
Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
Each time, fit the model and see how its accuracy changes.
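
The imbalance check and the upsampling/downsampling steps can be sketched as follows. This is a minimal illustration on a small made-up frame, not the lab data: the names `toy`, `majority` and `minority` are hypothetical, and `sklearn.utils.resample` is only one way to resample (the technique used in class may differ).

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for churnData: 8 "No" rows and 2 "Yes" rows (made-up values)
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10, 28, 62],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Check the imbalance
print(toy["Churn"].value_counts())

majority = toy[toy["Churn"] == "No"]
minority = toy[toy["Churn"] == "Yes"]

# Upsampling: draw minority rows with replacement until the classes match
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Downsampling: draw majority rows without replacement down to the minority size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["Churn"].value_counts())
print(downsampled["Churn"].value_counts())
```

In the lab itself, resampling would normally be applied to the training split only, so that the test set keeps the original class ratio.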

In [5]:
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,  ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

%matplotlib inline



In [6]:

df = pd.read_csv('customer_churn.csv')

In [7]:
len(df.columns)  # 16 Columns

16

In [8]:
df.columns


Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [9]:
df.tail()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.8,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.2,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.6,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.4,306.6,Yes
7042,Male,0,No,No,66,Yes,Yes,No,Yes,Yes,Yes,Yes,Two year,105.65,6844.5,No


In [10]:
# Refer to the DataFrame as churnData, as the lab asks (an alias of df, not a copy)
churnData = df
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [11]:
# checking for data type
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [12]:
# Convert the object-typed 'TotalCharges' column to numeric; values that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')


In [13]:
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
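
The effect of `errors='coerce'` can be seen on a small made-up series. This is only a sketch: it assumes the raw column contains strings that cannot be parsed as numbers (such as blanks), which has not been checked against the actual file.

```python
import pandas as pd

# Made-up values: one entry is a blank string, as can happen in a CSV export
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries are turned into NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # the column is now a float dtype
print(converted.isna().sum())  # number of values that could not be parsed
```

Any value coerced this way shows up as a null in the next check, so the conversion and the null handling belong together.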

In [14]:
#checking for null values
null_counts = df.isnull().sum()
print(null_counts)
# Result: 15 of the 16 columns have no null values; only TotalCharges has nulls (11)

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64


In [15]:
# Replace the null values in TotalCharges (the only column with nulls) with the column mean.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())
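
An alternative to filling the whole column at once is to learn the mean from the training rows only and reuse it on the test rows. The sketch below uses scikit-learn's `SimpleImputer` on made-up values; the `train` and `test` frames are hypothetical and this is not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test values with gaps
train = pd.DataFrame({"TotalCharges": [100.0, 200.0, np.nan, 300.0]})
test = pd.DataFrame({"TotalCharges": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # mean is learned from the training rows
test_filled = imputer.transform(test)        # the same training mean is reused here

print(train_filled.ravel())
print(test_filled.ravel())
```

With only 11 nulls in the dataset the difference is likely small, but the pattern keeps test information out of the preprocessing.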

In [16]:
df.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [17]:
df.dtypes


gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

In [18]:
#Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
#Split the data into a training set and a test set.
from sklearn.model_selection import train_test_split

# Assuming 'df' is your DataFrame and null values in 'TotalCharges' have been handled

# Selecting specific features
X = df[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]

# Assuming 'Churn' is the target variable
y = df['Churn']

# Splitting the data into training and test sets
# Adjust the test_size as needed (e.g., 0.2 for 20% test size) and random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

Training set size: 5634
Test set size: 1409


In [19]:
#Scale the features either by using normalizer or a standard scaler
from sklearn.preprocessing import StandardScaler

# Assuming X_train and X_test are already defined as shown in the previous step

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit on the training data
scaler.fit(X_train)

# Transform both the training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
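
What `StandardScaler` does to each column can be checked on a small made-up array. The names `X_train_toy` and `X_test_toy` are hypothetical; the point of the sketch is that the scaler is fitted on the training data and then reused unchanged on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-column data standing in for the training and test features
X_train_toy = np.array([[1.0, 29.85], [34.0, 56.95], [2.0, 53.85], [45.0, 42.30]])
X_test_toy = np.array([[2.0, 70.70]])

toy_scaler = StandardScaler().fit(X_train_toy)
X_train_s = toy_scaler.transform(X_train_toy)
X_test_s = toy_scaler.transform(X_test_toy)

print(X_train_s.mean(axis=0))  # approximately 0 for each column
print(X_train_s.std(axis=0))   # approximately 1 for each column
print(X_test_s)                # scaled with the training mean and std
```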

In [20]:
# Alternate scaler: Normalizer
from sklearn.preprocessing import Normalizer


# Initialize the Normalizer
normalizer = Normalizer()

# Fit on the training data is not needed as Normalizer works on the rows independently

# Transform both the training and test data
X_train_normalized = normalizer.transform(X_train)
X_test_normalized = normalizer.transform(X_test)
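
Unlike `StandardScaler`, `Normalizer` rescales each row (each customer) to unit length rather than each column. A minimal sketch on made-up rows, with `rows` as a hypothetical name:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up rows; the first is chosen so the arithmetic is easy to follow
rows = np.array([[3.0, 4.0, 0.0], [1.0, 29.85, 29.85]])

# The default norm is L2: each row is divided by its own Euclidean length
normalized = Normalizer().transform(rows)

print(normalized[0])                       # approximately [0.6, 0.8, 0.0]
print(np.linalg.norm(normalized, axis=1))  # each row now has length 1
```

Because the two transformers answer different questions, they are alternatives, not successive steps; the models in this notebook are fitted on the `StandardScaler` output.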



In [71]:
#Fitting a logistic Regression model on the training data.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assuming X_train_scaled and y_train are your scaled training features and target variable, respectively

# Initialize the Logistic Regression model
log_reg = LogisticRegression()

# Fit the model on the scaled training data
log_reg.fit(X_train_scaled, y_train)


# Make predictions on the scaled training and test data
y_train_pred = log_reg.predict(X_train_scaled)
y_pred = log_reg.predict(X_test_scaled)

# Evaluate the model's performance using accuracy as an example metric
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {train_accuracy:.4f}")
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")  # test accuracy added for comparison with the training score

Training Accuracy: 0.7875
Test Accuracy: 0.8077
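
On imbalanced data, an accuracy figure is easier to interpret next to a majority-class baseline. The sketch below uses made-up labels with an 80/20 split (the real class ratio has not been checked yet at this point in the notebook) and scikit-learn's `DummyClassifier`; `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 8 "No" and 2 "Yes"
y_toy = np.array(["No"] * 8 + ["Yes"] * 2)
X_toy = np.zeros((10, 1))  # the features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

print(accuracy_score(y_toy, y_base))  # equals the share of the majority class
```

A model is only useful to the extent that it beats this number, which is one reason accuracy alone can mislead before the classes are balanced.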


In [109]:
# Fit a KNN classifier (not a KNN regressor) on the training data
knn_clf = KNeighborsClassifier(n_neighbors=5)

# Fit the model on the scaled training data
knn_clf.fit(X_train_scaled, y_train)

# Predict with the KNN model on both splits. Without the test-set prediction,
# y_pred still holds the logistic regression output, which is why an earlier run
# of this cell printed that model's 0.8077; re-run to get the KNN test accuracy.
y_train_pred = knn_clf.predict(X_train_scaled)
y_pred = knn_clf.predict(X_test_scaled)

# Evaluate the model's performance using accuracy as an example metric
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")  # test accuracy added for comparison with the training score

Training Accuracy: 0.8371
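
The K-fold cross validation asked for in Round 3 could look roughly like this. The sketch runs on synthetic data from `make_classification` rather than the churn file, and wrapping the scaler and the classifier in a pipeline is a design choice made here, not something the lab prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the four churn features (not the lab data)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.75, 0.25], random_state=42,
)

# Scaling lives inside the pipeline, so each fold fits its own scaler
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Five folds, one accuracy score per fold
scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```

Swapping `KNeighborsClassifier` for `LogisticRegression` or a decision tree in the pipeline gives comparable scores for the other models.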


In [None]:
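# Recommendation: judge the models on recall for the churn class ("Yes") rather than accuracy alone, to limit false negatives (churners predicted as staying)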
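
Recall and false negatives can be read off a confusion matrix. A minimal sketch on made-up labels, where `y_true` and `y_hat` are hypothetical names and "Yes" is treated as the positive (churn) class:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true and predicted labels
y_true = ["No", "No", "Yes", "Yes", "Yes", "No"]
y_hat = ["No", "Yes", "Yes", "No", "No", "No"]

# Rows are true labels, columns are predictions, in the order given by `labels`
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(fn)                                            # churners the model missed
print(recall_score(y_true, y_hat, pos_label="Yes"))  # tp / (tp + fn)
```

Passing `pos_label="Yes"` matters here because the target is stored as strings; the default positive label of 1 does not exist in this column.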