## Lab | Handling Data Imbalance in Classification Models

#### Import the libraries and modules that you will need.

In [63]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
import warnings
warnings.filterwarnings("ignore")

#### Read the data into Python and name the DataFrame churnData.


In [2]:
churnData = pd.read_csv('Customer-Churn.csv')
churnData

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


#### Check the data types of all the columns in the data. You will see that the TotalCharges column has object dtype. Convert this column to a numeric type using the pd.to_numeric function.

In [3]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [4]:
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')
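
To see what `errors='coerce'` does, here is a minimal sketch on a made-up Series. The values, including the blank entry, are hypothetical stand-ins for whatever unparseable strings the real column contains. Entries that cannot be parsed become `NaN`, which is why the non-null count drops after the conversion.

```python
import pandas as pd

# Hypothetical sample: numeric strings plus one blank entry that cannot be parsed
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns unparseable entries into NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```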

In [6]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory

#### Check for null values in the dataframe. Replace the null values.


In [8]:
churnData.isna().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [None]:
# TotalCharges has only 11 nulls out of 7043 rows, so we can drop those rows.

In [12]:
churnData = churnData.dropna(subset=['TotalCharges'])
churnData.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
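
The task asks to replace the nulls, while the cell above drops them. With only 11 affected rows either choice is defensible. As a sketch of the replacement option, the snippet below fills missing values with the column median on a small made-up frame; the data and the choice of median are assumptions for illustration, not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing TotalCharges value
df = pd.DataFrame({
    "tenure": [1, 34, 0, 45],
    "TotalCharges": [29.85, 1889.5, np.nan, 1840.75],
})

# Replace NaN with the median of the observed values (assumed strategy)
filled = df.copy()
median_charge = filled["TotalCharges"].median()
filled["TotalCharges"] = filled["TotalCharges"].fillna(median_charge)

print(filled)
```

Unlike `dropna`, this keeps every row, at the cost of inserting values that were never observed.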

#### Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
- 1.1 Scale the features using either a normalizer or a standard scaler.
- 1.2 Split the data into a training set and a test set.
- 1.3 Fit a logistic regression model on the training data.
- 1.4 Check the accuracy on the test data.

In [16]:
X = churnData[['tenure','SeniorCitizen','MonthlyCharges','TotalCharges']].copy()
y = churnData['Churn']

- 1.1 Scale the features using either a normalizer or a standard scaler.

In [19]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

- 1.2 Split the data into a training set and a test set.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
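
The cells above fit the scaler on the full dataset before splitting, so the test rows influence the scaling parameters. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; `X_demo`, `y_demo`, and the `stratify` argument are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four numeric features and a binary target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# 1. Split first (stratify keeps the class ratio similar in both parts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2. Learn min/max from the training rows only, then apply to both parts
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order the scaled test values can fall slightly outside the 0 to 1 range, which is expected.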

- 1.3 Fit a logistic regression model on the training data. 


In [21]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

In [23]:
predictions = lr.predict(X_test)
predictions

array(['No', 'No', 'Yes', ..., 'No', 'Yes', 'No'], dtype=object)

- 1.4 Check the accuracy on the test data.

In [24]:
accuracy = accuracy_score(y_test,predictions)
accuracy

In [26]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          No       0.82      0.91      0.86      1033
         Yes       0.63      0.43      0.51       374

    accuracy                           0.78      1407
   macro avg       0.72      0.67      0.69      1407
weighted avg       0.76      0.78      0.77      1407
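
The figures in the report can be checked by hand. As a small worked example, the snippet below recomputes the F1 score for the `Yes` class and the macro-averaged recall from the rounded precision and recall values printed above.

```python
# Rounded values taken from the classification report above
precision_yes, recall_yes = 0.63, 0.43
recall_no = 0.91

# F1 is the harmonic mean of precision and recall
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Macro average is the unweighted mean over the classes
macro_recall = (recall_no + recall_yes) / 2

print(round(f1_yes, 2))        # 0.51
print(round(macro_recall, 2))  # 0.67
```

The low recall for `Yes` means the model misses more than half of the customers who actually churn, which the overall accuracy hides.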



#### Managing imbalance in the dataset:

- 2.1 Check for the imbalance.
- 2.2 Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- 2.3 Each time, fit the model and check its accuracy.

- 2.1 Check for the imbalance.

In [27]:
churnData['Churn'].value_counts() 

No     5163
Yes    1869
Name: Churn, dtype: int64
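
These counts also give a useful reference point: a model that always predicts the majority class would already reach the majority share as its accuracy. The sketch below computes the class shares and that baseline from the counts printed above.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"No": 5163, "Yes": 1869})

# Share of each class in the data
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a model that always predicts the majority class
majority_baseline = shares.max()
print(round(majority_baseline, 3))  # 0.734
```

Against this baseline, the 0.78 accuracy of the first logistic regression is only a modest improvement.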

- 2.2 Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- 2.3 Each time, fit the model and check its accuracy.

We can use the upsampling technique SMOTE to balance the data.

In [29]:
smote = SMOTE()
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

No     5163
Yes    5163
Name: Churn, dtype: int64

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=100)
classification = LogisticRegression(random_state=100).fit(X_train, y_train)  # multi_class is unnecessary for a binary target and deprecated in recent scikit-learn
s_predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7216844143272023

In [31]:
print(classification_report(y_test, s_predictions))

              precision    recall  f1-score   support

          No       0.77      0.65      0.70      1052
         Yes       0.69      0.79      0.74      1014

    accuracy                           0.72      2066
   macro avg       0.73      0.72      0.72      2066
weighted avg       0.73      0.72      0.72      2066



We can use Tomek Links to downsample the majority class. This narrows the imbalance but does not fully balance the two classes.

In [64]:
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)
y_tl.value_counts()

No     4610
Yes    1869
Name: Churn, dtype: int64

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X_tl, y_tl, test_size=0.2, random_state=100)
classification = LogisticRegression(random_state=100).fit(X_train, y_train)  # multi_class is unnecessary for a binary target and deprecated in recent scikit-learn
t_predictions = classification.predict(X_test)

classification.score(X_test, y_test)

0.7986111111111112

In [68]:
print(classification_report(y_test, t_predictions))

              precision    recall  f1-score   support

          No       0.83      0.90      0.86       909
         Yes       0.71      0.55      0.62       387

    accuracy                           0.80      1296
   macro avg       0.77      0.73      0.74      1296
weighted avg       0.79      0.80      0.79      1296



### Lab | Cross Validation

##### Apply SMOTE for upsampling the data:

- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.


In [70]:
from sklearn.tree import DecisionTreeClassifier

In [75]:
lr = LogisticRegression(random_state=42)
lr.fit(X_sm, y_sm)
lr_predictions = lr.predict(X_test)
accuracy_lr = accuracy_score(y_test, lr_predictions)
accuracy_lr

0.7561728395061729

In [84]:
DTC = DecisionTreeClassifier(random_state=42)
DTC.fit(X_sm, y_sm)
s_predictions_dtc = DTC.predict(X_test)
accuracy_DTC = accuracy_score(y_test, s_predictions_dtc)
accuracy_DTC

0.9884259259259259

In [None]:
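# The decision tree scores higher than logistic regression here, but both models were
# fit on all of X_sm, which already contains the rows in X_test. The tree's 0.99 mostly
# reflects memorized test rows, so it is not a fair estimate of generalization.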

##### Apply TomekLinks for downsampling:

- It is important to remember that Tomek Links does not make the two classes equal; it only removes majority-class points that form Tomek links with minority-class points (pairs of opposite-class points that are each other's nearest neighbor).
- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.
- You can also apply this algorithm one more time and check how the imbalance between the two classes changed from the last time.

In [79]:
lr = LogisticRegression(random_state=42)
lr.fit(X_tl, y_tl)
lr_predictions = lr.predict(X_test)
accuracy_lr = accuracy_score(y_test, lr_predictions)
accuracy_lr

0.7916666666666666

In [89]:
DTC = DecisionTreeClassifier(random_state=42)
DTC.fit(X_tl, y_tl)
tl_predictions_dtc = DTC.predict(X_test)
accuracy_DTC = accuracy_score(y_test, tl_predictions_dtc)
accuracy_DTC

0.9876543209876543

In [None]:
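# The decision tree again scores higher than logistic regression, but both models were
# fit on all of X_tl, which contains the rows in X_test (X_test was split from X_tl).
# The 0.99 is therefore inflated and not a fair estimate of generalization.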