# Lab | Handling Data Imbalance in Classification Models

-----------------------------------------------------------------------------------------------------------
In this lab and the following lessons, we will build a model for a binary classification problem: customer churn. You will use the file files_for_lab/Customer-Churn.csv.

----------------------------------------------------------------------------------------------------

### Scenario

----------------------------------------------------------------------------------------------
You are working as an analyst for an internet service provider. You have been given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to default/churn, so that it can prevent losses from such customers.

-------------------------------------------------------------------------------------------

### Instructions

------------------------------------------------------------------------------------------------------------
In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):
    
----------------------------------------------------------------------------------------------------------

##### 1. Import the required libraries and modules that you would need.

In [2]:
#Importing necessary libraries
import pandas as pd
import numpy as np

##### 2. Read that data into Python and call the dataframe churnData.

In [3]:
# Importing the csv file into a variable
churnData = pd.read_csv(r"C:\Users\mafal\Documents\ironhack\labs\lab-handling-data-imbalance-classification\files_for_lab\Customer-Churn.csv")

In [4]:
churnData

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


##### 3. Check the data types of all the columns in the data. You will see that the column TotalCharges has type object. Convert this column to a numeric type using the pd.to_numeric function.

In [5]:
# Check the data types of all columns
print(churnData.dtypes)

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [6]:
# Convert column 'TotalCharges' from object to numeric
# errors='coerce' turns values that cannot be parsed as numbers into NaN (missing values)
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')

# Display the dtypes after conversion
print("\nDtypes after conversion:")
print(churnData.dtypes)

# Display the DataFrame after conversion
print("\nDataFrame after conversion:")
print(churnData)


Dtypes after conversion:
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

DataFrame after conversion:
      gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0     Female              0     Yes         No       1           No   
1       Male              0      No         No      34          Yes   
2       Male              0      No         No       2          Yes   
3       Male              0      No         No      45           No   
4     Female              0      No         No       2          Yes   
...      ...            ...     ...        ...     ...   
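
To see what errors='coerce' does in isolation, here is a minimal sketch on a toy Series. The values are made up for illustration and only stand in for the TotalCharges column; the blank entry plays the role of an unparseable value.

```python
import pandas as pd

# Toy stand-in for TotalCharges: numbers stored as strings,
# plus one blank entry that cannot be parsed as a number.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" replaces unparseable entries with NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # dtype after conversion
print(converted.isna().sum())  # how many entries became NaN
```

Without errors='coerce', the same call would raise an error on the blank entry, which is why the conversion and the null check in the next step go together.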

##### 4. Check for null values in the dataframe. Replace the null values.

In [7]:
# Count null values in each column
null_counts = churnData.isnull().sum()
print("Number of null values in each column:")
print(null_counts)

Number of null values in each column:
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64


In [8]:
# Replace null values in column 'TotalCharges' with 0
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(0)

In [9]:
# Count null values in each column
null_counts = churnData.isnull().sum()
print("Number of null values in each column:")
print(null_counts)

Number of null values in each column:
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
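
Filling with 0 is one reasonable choice, but it helps to look at the affected rows before deciding. The sketch below uses a small made-up DataFrame (not the lab data) to compare filling with 0 against filling with the column median; which option suits the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows have no TotalCharges value.
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 24],
    "MonthlyCharges": [20.0, 50.0, 80.0, 30.0],
    "TotalCharges": [np.nan, 600.0, np.nan, 720.0],
})

# Inspect the affected rows before choosing a fill value.
missing_rows = toy[toy["TotalCharges"].isnull()]
print(missing_rows)

# Option used in this lab: treat a missing total as "nothing charged yet".
filled_zero = toy["TotalCharges"].fillna(0)

# Alternative: fill with the column median (median() skips NaN by default).
filled_median = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

print(filled_zero.tolist())
print(filled_median.tolist())
```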


##### 5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges to:

###### 5.1 Scale the features using either a normalizer or a standard scaler.

In [10]:
# Scaling the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges

churnData_copy = churnData.copy()

from sklearn.preprocessing import StandardScaler

# Selecting columns to standardize
columns_to_scale = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']

# Creating a scaler object
scaler = StandardScaler()

# Fitting the scaler to the data and transforming it
churnData_copy[columns_to_scale] = scaler.fit_transform(churnData_copy[columns_to_scale])

print(churnData_copy)

      gender  SeniorCitizen Partner Dependents    tenure PhoneService  \
0     Female      -0.439916     Yes         No -1.277445           No   
1       Male      -0.439916      No         No  0.066327          Yes   
2       Male      -0.439916      No         No -1.236724          Yes   
3       Male      -0.439916      No         No  0.514251           No   
4     Female      -0.439916      No         No -1.236724          Yes   
...      ...            ...     ...        ...       ...          ...   
7038    Male      -0.439916     Yes        Yes -0.340876          Yes   
7039  Female      -0.439916     Yes        Yes  1.613701          Yes   
7040  Female      -0.439916     Yes        Yes -0.870241           No   
7041    Male       2.273159     Yes         No -1.155283          Yes   
7042    Male      -0.439916      No         No  1.369379          Yes   

     OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV  \
0                No          Yes              

In [11]:
# Using min-max normalization: rescale each feature to the [0, 1] range

churnData_copy2 = churnData.copy()

from sklearn.preprocessing import MinMaxScaler

# Selecting columns to normalize
columns_to_normalize = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']

# Creating a scaler object for Min-Max scaling
min_max_scaler = MinMaxScaler()

# Fitting the scaler to the data and transforming it
churnData_copy2[columns_to_normalize] = min_max_scaler.fit_transform(churnData_copy2[columns_to_normalize])

print(churnData_copy2)

      gender  SeniorCitizen Partner Dependents    tenure PhoneService  \
0     Female            0.0     Yes         No  0.013889           No   
1       Male            0.0      No         No  0.472222          Yes   
2       Male            0.0      No         No  0.027778          Yes   
3       Male            0.0      No         No  0.625000           No   
4     Female            0.0      No         No  0.027778          Yes   
...      ...            ...     ...        ...       ...          ...   
7038    Male            0.0     Yes        Yes  0.333333          Yes   
7039  Female            0.0     Yes        Yes  1.000000          Yes   
7040  Female            0.0     Yes        Yes  0.152778           No   
7041    Male            1.0     Yes         No  0.055556          Yes   
7042    Male            0.0      No         No  0.916667          Yes   

     OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV  \
0                No          Yes              
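
The two scalers above apply different formulas: StandardScaler computes (x - mean) / std per column, and MinMaxScaler computes (x - min) / (max - min). The following sketch checks both against a manual calculation on a made-up single-column array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values for one feature (shape: n_samples x 1).
x = np.array([[1.0], [34.0], [2.0], [45.0], [72.0]])

standardized = StandardScaler().fit_transform(x)
normalized = MinMaxScaler().fit_transform(x)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of numpy's std().
manual_standardized = (x - x.mean(axis=0)) / x.std(axis=0)
manual_normalized = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(standardized, manual_standardized))
print(np.allclose(normalized, manual_normalized))
```

After standardization each column has mean 0 and unit variance, while min-max scaling maps the smallest value to 0 and the largest to 1.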

###### 5.2 Split the data into a training set and a test set.

In [12]:
# One-hot encode every categorical column (not only gender) with pd.get_dummies()
# drop_first=True drops the first category of each column to avoid redundant dummies
churnData_copy_encoded = pd.get_dummies(churnData_copy, drop_first=True)

# Display the resulting DataFrame with encoded categorical columns
print("\nDataFrame after encoding:")
print(churnData_copy_encoded)


DataFrame after encoding:
      SeniorCitizen    tenure  MonthlyCharges  TotalCharges  gender_Male  \
0         -0.439916 -1.277445       -1.160323     -0.992611        False   
1         -0.439916  0.066327       -0.259629     -0.172165         True   
2         -0.439916 -1.236724       -0.362660     -0.958066         True   
3         -0.439916  0.514251       -0.746535     -0.193672         True   
4         -0.439916 -1.236724        0.197365     -0.938874        False   
...             ...       ...             ...           ...          ...   
7038      -0.439916 -0.340876        0.665992     -0.127605         True   
7039      -0.439916  1.613701        1.277533      2.242606        False   
7040      -0.439916 -0.870241       -1.168632     -0.852932        False   
7041       2.273159 -1.155283        0.320338     -0.870513         True   
7042      -0.439916  1.369379        1.358961      2.013897         True   

      Partner_Yes  Dependents_Yes  PhoneService_Yes  \
0    
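
A small illustration of how pd.get_dummies names the encoded columns may be useful here, because the target column is renamed too. The DataFrame below is a made-up three-row sample that reuses some of the lab's column names.

```python
import pandas as pd

# Made-up sample reusing three of the lab's categorical columns.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Churn": ["No", "No", "Yes"],
})

# drop_first=True removes the first (alphabetically sorted) category
# of each column, so "Churn" becomes the single column "Churn_Yes".
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded["Churn_Yes"].tolist())
```

Note that the churn label is no longer called Churn after encoding; it is the Churn_Yes column.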

In [13]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [14]:
from sklearn.model_selection import train_test_split

In [15]:
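# Features: every encoded column except TotalCharges
X = churnData_copy_encoded.drop('TotalCharges', axis=1)
# Target: TotalCharges, a continuous column (the churn label itself is stored in 'Churn_Yes')
y = churnData_copy_encoded['TotalCharges']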

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [17]:
# Displaying the result
print("Training set:")
print(X_train)
print(y_train)
print("\nTest set:")
print(X_test)
print(y_test)

Training set:
      SeniorCitizen    tenure  MonthlyCharges  gender_Male  Partner_Yes  \
6607      -0.439916 -1.277445       -1.311546         True        False   
2598      -0.439916 -1.033122        0.345265        False        False   
2345      -0.439916 -1.155283       -1.486035        False        False   
4093      -0.439916 -0.137274        0.373516        False        False   
693       -0.439916 -1.196004        0.343603        False        False   
...             ...       ...             ...          ...          ...   
3772      -0.439916 -1.277445        1.004999         True         True   
5191      -0.439916 -0.381597        0.875378        False         True   
5226      -0.439916 -0.829521       -1.449476         True         True   
5390       2.273159 -0.829521        1.152899         True        False   
860       -0.439916 -0.259435       -1.494344         True        False   

      Dependents_Yes  PhoneService_Yes  OnlineSecurity_No internet service  \
6607   
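
Since this lab is about class imbalance, it can help to check the class proportions and keep them the same in both splits. The sketch below does this with the stratify argument of train_test_split on synthetic data; the 25% positive rate and the column values are assumptions chosen for illustration, not figures from the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 customers, 100 of them labelled as churned.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=400),
    "MonthlyCharges": rng.uniform(18, 120, size=400),
})
y = pd.Series([1] * 100 + [0] * 300, name="Churn_Yes")

# Degree of imbalance: share of each class in the full data.
print(y.value_counts(normalize=True))

# stratify=y keeps the class proportions equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(y_train.mean(), y_test.mean())
```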

###### 5.3 Fit a logistic regression model on the training data.

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the model
model = LogisticRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.
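
The error above is raised because y holds TotalCharges, which is continuous, while LogisticRegression expects discrete class labels. One way to set this step up, sketched here under assumptions, is to use the four features named in step 5 as X and the encoded churn column (Churn_Yes after pd.get_dummies with drop_first=True) as y. The data below is synthetic so that the example runs on its own; the labelling rule is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled, encoded DataFrame.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "tenure": rng.normal(size=n),
    "SeniorCitizen": rng.normal(size=n),
    "MonthlyCharges": rng.normal(size=n),
    "TotalCharges": rng.normal(size=n),
})
# Invented rule: customers with short tenure are more likely to churn.
toy["Churn_Yes"] = (toy["tenure"] + rng.normal(scale=0.5, size=n)) < -0.5

features = ["tenure", "SeniorCitizen", "MonthlyCharges", "TotalCharges"]
X = toy[features]      # the four features from step 5
y = toy["Churn_Yes"]   # discrete target: churned or not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)   # no ValueError: y is a class label
y_pred = model.predict(X_test)

print(model.classes_)
```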

###### 5.4 Check the accuracy on the test data.

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")
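
On imbalanced data, accuracy alone can be misleading. As a minimal sketch with made-up labels, a "model" that always predicts the majority class reaches high accuracy while finding none of the minority cases, which is why recall and the confusion matrix are worth checking as well.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 90 customers who stayed (0) and 10 who churned (1).
y_true = np.array([0] * 90 + [1] * 10)

# A trivial predictor that always outputs the majority class.
y_majority = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_majority)
recall = recall_score(y_true, y_majority)
matrix = confusion_matrix(y_true, y_majority)

print(accuracy)  # share of correct predictions
print(recall)    # share of churners that were detected
print(matrix)    # rows: true class, columns: predicted class
```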

# Lab | Cross Validation

### Instructions

##### 1. Apply SMOTE for upsampling the data

---------------------------------------------------------------------------------------------------------

- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.

----------------------------------------------------------------------------------------------------

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Note: this fits a regressor, so .score() returns R^2, not classification accuracy
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9971014558943989

##### 2. Apply TomekLinks for downsampling

--------------------------------------------------------------------------------------------------
- Remember that TomekLinks does not make the two classes equal in size. It only removes majority-class points that are close to minority-class points, that is, points whose nearest neighbour belongs to the minority class and vice versa.
- Use logistic regression to fit the model and compute the accuracy of the model.
- Use decision tree classifier to fit the model and compute the accuracy of the model.
- Compare the accuracies of the two models.
- You can also apply this algorithm one more time and check how the imbalance between the two classes changed compared with the first pass.

----------------------------------------------------------------------------------------------------------------