# Lab | Handling Data Imbalance in Classification Models

-----------------------------------------------------------------------------------------------------------
For this lab and the next lessons, we will build a binary classification model for customer churn. You will be using the files_for_lab/Customer-Churn.csv file.

----------------------------------------------------------------------------------------------------

### Scenario

----------------------------------------------------------------------------------------------
You are working as an analyst with an internet service provider. You are provided with historical data about your company's customers and their churn trends. Your task is to build a machine learning model that will help the company identify customers who are more likely to default/churn, and thus prevent losses from such customers.

-------------------------------------------------------------------------------------------

### Instructions

------------------------------------------------------------------------------------------------------------
In this lab, we will first take a look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):
    
----------------------------------------------------------------------------------------------------------

##### 1. Import the libraries and modules you will need.

In [1]:
#Importing necessary libraries
import pandas as pd
import numpy as np

##### 2. Read the data into Python and call the dataframe churnData.

In [2]:
# Importing the csv file into a variable
churnData = pd.read_csv(r"C:\Users\mafal\Documents\ironhack\labs\lab-handling-data-imbalance-classification\files_for_lab\Customer-Churn.csv")

In [3]:
churnData

| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | No | Yes | Yes | Yes | Yes | One year | 84.80 | 1990.5 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | No | Yes | Yes | No | Yes | Yes | One year | 103.20 | 7362.9 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | Yes | No | No | No | No | No | Month-to-month | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | No | No | No | No | No | No | Month-to-month | 74.40 | 306.6 | Yes |


##### 3. Check the datatypes of all the columns in the data. You will see that the column TotalCharges has object type. Convert this column to a numeric type using the pd.to_numeric function.

In [4]:
# Check the data types of all columns
print(churnData.dtypes)

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [5]:
# Convert column 'TotalCharges' from object to numeric
# errors='coerce' turns values that cannot be parsed as numbers into NaN (missing value)
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')

# Display the dtypes after conversion
print("\nDtypes after conversion:")
print(churnData.dtypes)

# Display the DataFrame after conversion
print("\nDataFrame after conversion:")
print(churnData)


Dtypes after conversion:
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

DataFrame after conversion:
      gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0     Female              0     Yes         No       1           No   
1       Male              0      No         No      34          Yes   
2       Male              0      No         No       2          Yes   
3       Male              0      No         No      45           No   
4     Female              0      No         No       2          Yes   
...      ...            ...     ...        ...     ...   

##### 4. Check for null values in the dataframe. Replace the null values.

In [6]:
# Count null values in each column
null_counts = churnData.isnull().sum()
print("Number of null values in each column:")
print(null_counts)

Number of null values in each column:
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64


In [7]:
# Replace null values in column 'TotalCharges' with 0
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(0)

In [8]:
# Count null values in each column
null_counts = churnData.isnull().sum()
print("Number of null values in each column:")
print(null_counts)

Number of null values in each column:
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


##### 5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges to:

###### 5.1 Scale the features either by using normalizer or a standard scaler.

In [9]:
# Scaling the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges

churnData_copy = churnData.copy()

from sklearn.preprocessing import StandardScaler

# Selecting columns to standardize
columns_to_scale = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']

# Creating a scaler object
scaler = StandardScaler()

# Fitting the scaler to the data and transforming it
churnData_copy[columns_to_scale] = scaler.fit_transform(churnData_copy[columns_to_scale])

print(churnData_copy)

      gender  SeniorCitizen Partner Dependents    tenure PhoneService  \
0     Female      -0.439916     Yes         No -1.277445           No   
1       Male      -0.439916      No         No  0.066327          Yes   
2       Male      -0.439916      No         No -1.236724          Yes   
3       Male      -0.439916      No         No  0.514251           No   
4     Female      -0.439916      No         No -1.236724          Yes   
...      ...            ...     ...        ...       ...          ...   
7038    Male      -0.439916     Yes        Yes -0.340876          Yes   
7039  Female      -0.439916     Yes        Yes  1.613701          Yes   
7040  Female      -0.439916     Yes        Yes -0.870241           No   
7041    Male       2.273159     Yes         No -1.155283          Yes   
7042    Male      -0.439916      No         No  1.369379          Yes   

     OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV  \
0                No          Yes              
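
As a rough illustration of what StandardScaler computes, the sketch below standardizes one column by hand with z = (x - mean) / std, using the population standard deviation (the same convention StandardScaler uses). The helper names `fit_standardizer` and `standardize` and the toy tenure values are hypothetical and only serve the example.

```python
import math

def fit_standardizer(values):
    """Return (mean, population standard deviation) of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]

# Toy tenure values in months (made up for illustration)
tenure = [1, 34, 2, 45, 2]
mean, std = fit_standardizer(tenure)
scaled = standardize(tenure, mean, std)

# After scaling, the column has mean 0 and standard deviation 1
print([round(z, 3) for z in scaled])
```

This also explains why the scaled SeniorCitizen column above shows only two distinct values: a 0/1 column maps to exactly two z-scores.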

###### 5.2 Split the data into a training set and a test set.

In [10]:
# Label encoding the gender column
# Use pd.get_dummies() to convert all categorical columns to dummy variables
#churnData_copy_encoded = pd.get_dummies(churnData_copy, drop_first=True)

# Display the resulting DataFrame with encoded categorical columns
#print("\nDataFrame after encoding:")
#print(churnData_copy_encoded)

In [11]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [12]:
from sklearn.model_selection import train_test_split

In [17]:
X = churnData_copy[columns_to_scale]  # Features
y = churnData_copy['Churn']  # Target variable

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
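
With imbalanced classes, a plain random split can leave the test set with a slightly different churn rate than the training set. `train_test_split` accepts a `stratify` argument (for example `stratify=y`) that preserves the class proportions in both parts. The sketch below shows the idea by hand on toy labels; `stratified_split` is a hypothetical helper, not a scikit-learn function.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # shuffle within each class
        n_test = round(len(idxs) * test_size)  # per-class test size
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Toy labels with an 80/20 imbalance (made up for illustration)
labels = ["No"] * 80 + ["Yes"] * 20
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```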

In [19]:
# Displaying the result
print("Training set:")
print(X_train)
print(y_train)
print("\nTest set:")
print(X_test)
print(y_test)

Training set:
        tenure  SeniorCitizen  MonthlyCharges  TotalCharges
2142 -0.463037      -0.439916        0.002935     -0.416007
1623  0.880735      -0.439916        1.078118      1.257246
6074 -1.277445      -0.439916       -1.373033     -0.995434
1362 -1.155283      -0.439916        0.180747     -0.900800
6754 -1.318165      -0.439916       -0.095111     -1.005780
...        ...            ...             ...           ...
3772 -1.277445      -0.439916        1.004999     -0.963867
5191 -0.381597      -0.439916        0.875378     -0.035927
5226 -0.829521      -0.439916       -1.449476     -0.870756
5390 -0.829521       2.273159        1.152899     -0.476294
860  -0.259435      -0.439916       -1.494344     -0.804027

[5634 rows x 4 columns]
2142     No
1623     No
6074    Yes
1362    Yes
6754     No
       ... 
3772    Yes
5191     No
5226     No
5390    Yes
860      No
Name: Churn, Length: 5634, dtype: object

Test set:
        tenure  SeniorCitizen  MonthlyCharges  TotalCharg

###### 5.3 Fit a logistic regression model on the training data.

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the model
model = LogisticRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

###### 5.4 Check the accuracy on the test data.

In [22]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 0.8062455642299503
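
An accuracy of about 0.81 needs context on imbalanced data. The class counts implied by the resampling outputs later in this notebook (5174 No, 1869 Yes) mean that always predicting No would already be right about 73% of the time on the full dataset. A minimal sketch of that baseline calculation, using only those two counts:

```python
# Class counts taken from the value_counts() outputs in this notebook
class_counts = {"No": 5174, "Yes": 1869}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = class_counts[majority_label] / total
churn_rate = class_counts["Yes"] / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Churn rate: {churn_rate:.4f}")
```

The baseline on the 20% test split may differ slightly from the full-dataset figure, so treat this as an approximate reference point rather than an exact comparison.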


# Lab | Cross Validation

### Instructions

##### 1. Apply SMOTE for upsampling the data

---------------------------------------------------------------------------------------------------------

- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.

----------------------------------------------------------------------------------------------------
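
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. In practice the imbalanced-learn package provides this as `imblearn.over_sampling.SMOTE` with a `fit_resample(X, y)` method. The sketch below is only a simplified, pure-Python illustration of the interpolation idea; `smote_sketch` and the toy points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points, each lying on the segment between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy 2-D minority points (hypothetical, e.g. two scaled features)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Resampling is usually applied to the training set only, so that the test set keeps the original class distribution.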

In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Churn is a categorical target ('Yes'/'No'), so a classifier is needed here.
# DecisionTreeRegressor fails with "ValueError: could not convert string to float: 'No'".
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
model.score(X_test, y_test)  # mean accuracy on the test set

##### 2. Apply TomekLinks for downsampling

--------------------------------------------------------------------------------------------------
- It is important to remember that TomekLinks does not make the two classes equal; it only removes majority-class points that are the closest neighbours of minority-class points.
- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.
- You can also apply this algorithm one more time and check how the imbalance between the two classes changed from the last time.

----------------------------------------------------------------------------------------------------------------
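
A Tomek link is a pair of points from opposite classes that are each other's nearest neighbour; the downsampling step removes the majority-class member of each such pair. The imbalanced-learn package implements this as `imblearn.under_sampling.TomekLinks` with a `fit_resample(X, y)` method. The pure-Python sketch below only illustrates the definition on toy 1-D data; the helper names and data are hypothetical.

```python
import math

def nearest_index(points, i):
    """Index of the nearest other point to points[i]."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_link_majority_indices(points, labels, majority_label):
    """Indices of majority points that belong to a Tomek link."""
    to_remove = set()
    for i in range(len(points)):
        j = nearest_index(points, i)
        # Tomek link: opposite classes and mutual nearest neighbours
        if labels[i] != labels[j] and nearest_index(points, j) == i:
            to_remove.add(i if labels[i] == majority_label else j)
    return sorted(to_remove)

# Toy data: the majority point at 5.0 sits right next to the minority point at 5.2
points = [(0.0,), (1.0,), (2.0,), (5.0,), (5.2,), (9.0,)]
labels = ["No", "No", "No", "No", "Yes", "Yes"]

drop = tomek_link_majority_indices(points, labels, majority_label="No")
kept = [(p, lab) for idx, (p, lab) in enumerate(zip(points, labels)) if idx not in drop]
print(drop)
print(kept)
```

Only one of the four majority points is removed here, which shows why the classes generally stay unequal after this step.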

In [32]:
from sklearn.utils import resample

# Upsampling
churn_majority = churnData[churnData['Churn'] == 'No']
churn_minority = churnData[churnData['Churn'] == 'Yes']

# Sample the minority class with replacement until it matches the majority class size
churn_minority_upsampled = resample(churn_minority, replace=True, n_samples=len(churn_majority), random_state=42)

churn_upsampled = pd.concat([churn_majority, churn_minority_upsampled])

print(churn_upsampled['Churn'].value_counts())

Churn
No     5174
Yes    5174
Name: count, dtype: int64


In [33]:
X_upsampled = churn_upsampled[columns_to_scale]
y_upsampled = churn_upsampled['Churn']

X_upsampled = scaler.fit_transform(X_upsampled)

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_upsampled, y_upsampled, test_size=0.2, random_state=42)

model_up = LogisticRegression()
model_up.fit(X_train_up, y_train_up)

y_pred_up = model_up.predict(X_test_up)

accuracy_up = accuracy_score(y_test_up, y_pred_up)
print(accuracy_up)

0.7323671497584541


In [29]:
#Downsampling

churn_majority = churnData[churnData['Churn'] == 'No']
churn_minority = churnData[churnData['Churn'] == 'Yes']

# Sample the majority class without replacement down to the minority class size
churn_majority_downsampled = resample(churn_majority, replace=False, n_samples=len(churn_minority), random_state=42)

churn_downsampled = pd.concat([churn_majority_downsampled, churn_minority])

print(churn_downsampled['Churn'].value_counts())

Churn
No     1869
Yes    1869
Name: count, dtype: int64


In [31]:
X_downsampled = churn_downsampled[columns_to_scale]
y_downsampled = churn_downsampled['Churn']

X_downsampled = scaler.fit_transform(X_downsampled)

X_train_down, X_test_down, y_train_down, y_test_down = train_test_split(X_downsampled, y_downsampled, test_size=0.2, random_state=42)

model_down = LogisticRegression()
model_down.fit(X_train_down, y_train_down)

y_pred_down = model_down.predict(X_test_down)

accuracy_down = accuracy_score(y_test_down, y_pred_down)
print(accuracy_down)

0.7553475935828877
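
The accuracies above were measured on test sets with different class mixes (the resampled test sets are balanced, the original one is not), so they are not directly comparable. Precision and recall for the Yes class are a common complement to accuracy; scikit-learn offers `precision_score`, `recall_score` and `classification_report` for this. The sketch below computes them by hand on made-up labels; `precision_recall` is a hypothetical helper and the numbers are not results from this lab.

```python
def precision_recall(y_true, y_pred, positive="Yes"):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (made up): the model rarely predicts the minority class
y_true = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_pred = ["No", "No", "No", "No", "No", "Yes", "Yes", "No", "No", "No"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the accuracy is 0.60 while only one in four churners is found, which is the kind of gap accuracy alone can hide.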
