##### Topics Being Covered 

*   Balancing the Data using Resampling Methods
*   Fixing a Problem in the Real World
*   Getting Data Ready for Predictive Modeling
*   Making use of Sklearn Logistic Regression
*   Using Sklearn to apply Random Forest
*   Using Imblearn to implement Random Over Sampling
*   Using Imblearn to implement Random Under Sampling
*   Imblearn implementation of synthetic sampling
*   Using Imblearn to implement a Neighbor-based Sampling that Combines Oversampling and Undersampling
*   Ensemble Models for Imbalanced Data Implementation
*   XG Boost for Imbalanced Data Introduction
*   Comparison of the Outcomes

*   What is an Imbalanced Dataset?

        Imagine you have a box of colored candies. Most of them are red, but there are only a few green candies. That's like having an imbalanced dataset! In the world of data and numbers, a dataset is a collection of information. When we say it's imbalanced, it means one type of thing (like red candies) is much more common than the other (like green candies).

        Now, let's create a simple example using Python and generate an imbalanced dataset.

In [3]:
import pandas as pd
from sklearn.datasets import make_classification

# Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000, 
    n_features=2, 
    n_informative=2, 
    n_redundant=0, 
    n_clusters_per_class=1, 
    weights=[0.95], 
    flip_y=0, 
    random_state=1
)

# Convert to DataFrame for easy visualization
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['Label'] = y

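# Let's see the first few rows of the dataset
print(df.head())

# Save without the index so reading the file back does not add an extra 'Unnamed: 0' column
df.to_csv('imbalanced.csv', index=False)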

We created a dataset with 1000 samples.
There are two features (like the color and size of candies).
We made it imbalanced by setting the weights parameter to [0.95], meaning 95% of the samples belong to class 0 and the remaining 5% to class 1. Because flip_y=0, no labels are randomly flipped, so the split is exactly 950 to 50.
The result is a DataFrame with features ('Feature1' and 'Feature2') and labels ('Label').
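
One practical consequence of a 95/5 split can be sketched with plain Python lists: a model that always predicts the majority class already reaches 95% accuracy while never finding a single minority sample. The labels below are hand-built to mirror the 950/50 split, and `always_majority` is a hypothetical stand-in for a model, not something from this notebook.

```python
# Labels mirroring the 950/50 split created above (hand-built for illustration)
y_true = [0] * 950 + [1] * 50

# Hypothetical "model" that ignores its input and always predicts class 0
always_majority = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, always_majority))
accuracy = correct / len(y_true)

# Recall on class 1: fraction of true minority samples that were found
found = sum(t == 1 and p == 1 for t, p in zip(y_true, always_majority))
minority_recall = found / y_true.count(1)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall on class 1: {minority_recall:.2f}")
```

This is why accuracy alone is a weak yardstick on imbalanced data, and why per-class precision and recall are worth reading as well.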

In [5]:
from imblearn.over_sampling import RandomOverSampler
print("Class Distribution Before Resampling:")
print(df['Label'].value_counts())

# Balance the dataset using Random Over Sampling
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Convert resampled data to DataFrame
df_resampled = pd.DataFrame(X_resampled, columns=['Feature1', 'Feature2'])
df_resampled['Label'] = y_resampled

# Check the class distribution after resampling
print("\nClass Distribution After Resampling:")
print(df_resampled['Label'].value_counts())

Class Distribution Before Resampling:
0    950
1     50
Name: Label, dtype: int64

Class Distribution After Resampling:
0    950
1    950
Name: Label, dtype: int64


We first print the class distribution before resampling to see the imbalance.
Then, we use RandomOverSampler from the imbalanced-learn library to balance the dataset. It picks existing minority-class rows at random, with replacement, and adds the copies to the data.
After resampling, we print the class distribution again to see that both classes now have the same count (950 each).

Imagine you have two baskets of candies. In one basket, you have lots of red candies, and in the other basket, you only have a few green candies. Now, let's say you want to make it fair and have a similar number of both red and green candies because you love both colors.

Here's what oversampling is like:

*   Look at your candy baskets: you have two baskets, one with many red candies and another with only a few green candies.
*   Make more green candies: to make it fair, you add more green candies to the basket that has fewer. You don't change the existing red candies.
*   Now you have balanced baskets: after adding more green candies, both baskets hold a similar number of candies.

In the language of computers and data, oversampling means making more copies of the less common thing (like green candies) so that you have a fair amount of both. This helps computer programs learn about both red and green candies equally well.

In the code example above, we did the same thing with a dataset. We had far more of one class than the other, and we used Random Over Sampling to make extra copies of the less common class so that the model sees both classes equally often. It's like making sure both red and green candies get the attention they deserve!
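
To make the "more copies" idea concrete, here is a minimal sketch of random oversampling written with the standard library only. It is not how imbalanced-learn implements `RandomOverSampler` internally; the toy `rows` list and the variable names are assumptions made for illustration.

```python
import random
from collections import Counter

# Toy labelled rows (feature, label): 8 of class 0 and 2 of class 1
rows = [(float(i), 0) for i in range(8)] + [(100.0, 1), (101.0, 1)]

rng = random.Random(42)  # fixed seed so the run is repeatable

majority = [r for r in rows if r[1] == 0]
minority = [r for r in rows if r[1] == 1]

# Draw minority rows with replacement until both classes are the same size
n_extra = len(majority) - len(minority)
extra = rng.choices(minority, k=n_extra)

balanced = majority + minority + extra
print(Counter(label for _, label in balanced))
```

Every row in `extra` is an exact copy of an existing minority row, so no new information is added; the minority class simply carries more weight during training.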

Now that the data is balanced, we are ready to model it. We will use logistic regression (LR).

In [6]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

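# Split the data into training and testing sets.
# Caution: the data was oversampled before this split, so copies of the same
# minority row can land in both sets and the scores below are likely optimistic.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)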

# Create a logistic regression model
lr_model = LogisticRegression(random_state=42)

# Train the model on the training data
lr_model.fit(X_train, y_train)

# Use the trained model to make predictions on the test data
lr_predictions = lr_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, lr_predictions)
report = classification_report(y_test, lr_predictions)
matrix = confusion_matrix(y_test, lr_predictions)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)
print("Confusion Matrix:")
print(matrix)

Accuracy: 0.9289473684210526
Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       190
           1       0.95      0.91      0.93       190

    accuracy                           0.93       380
   macro avg       0.93      0.93      0.93       380
weighted avg       0.93      0.93      0.93       380

Confusion Matrix:
[[180  10]
 [ 17 173]]


In [7]:
# Using Imblearn to implement Random Over Sampling

from imblearn.over_sampling import RandomOverSampler

df = pd.read_csv(r'imbalanced.csv')
# Check the class distribution before resampling
print("Class Distribution Before Resampling:")
print(df['Label'].value_counts())

# Balance the dataset using Random Over Sampling
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Convert resampled data to DataFrame
df_resampled = pd.DataFrame(X_resampled, columns=['Feature1', 'Feature2'])
df_resampled['Label'] = y_resampled

# Check the class distribution after resampling
print("\nClass Distribution After Resampling:")
print(df_resampled['Label'].value_counts())

Class Distribution Before Resampling:
0    950
1     50
Name: Label, dtype: int64

Class Distribution After Resampling:
0    950
1    950
Name: Label, dtype: int64


In [8]:
# Using Imblearn to implement Random Under Sampling
from imblearn.under_sampling import RandomUnderSampler
df = pd.read_csv(r'imbalanced.csv')
# Check the class distribution before resampling
print("Class Distribution Before Resampling:")
print(df['Label'].value_counts())

# Balance the dataset using Random Under Sampling
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Convert resampled data to DataFrame
df_resampled = pd.DataFrame(X_resampled, columns=['Feature1', 'Feature2'])
df_resampled['Label'] = y_resampled

# Check the class distribution after resampling
print("\nClass Distribution After Resampling:")
print(df_resampled['Label'].value_counts())

Class Distribution Before Resampling:
0    950
1     50
Name: Label, dtype: int64

Class Distribution After Resampling:
0    50
1    50
Name: Label, dtype: int64
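
What random undersampling does can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not imbalanced-learn's internal code; the toy `rows` list and the variable names are assumptions.

```python
import random
from collections import Counter

# Toy labelled rows (feature, label): 8 of class 0 and 2 of class 1
rows = [(float(i), 0) for i in range(8)] + [(100.0, 1), (101.0, 1)]

rng = random.Random(42)  # fixed seed so the run is repeatable

majority = [r for r in rows if r[1] == 0]
minority = [r for r in rows if r[1] == 1]

# Keep a random subset of the majority class, drawn without replacement,
# that is exactly as large as the minority class
kept = rng.sample(majority, k=len(minority))

balanced = kept + minority
print(Counter(label for _, label in balanced))
```

The balanced set is much smaller than the original one, which mirrors the drop from 1000 rows to 100 in the cell above: the rows that were not kept are simply discarded.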


In [9]:
# Imblearn implementation of synthetic sampling
from imblearn.over_sampling import SMOTE
df = pd.read_csv(r'imbalanced.csv')
# Check the class distribution before resampling
print("Class Distribution Before Resampling:")
print(df['Label'].value_counts())

# Balance the dataset using Synthetic Minority Over-sampling Technique (SMOTE)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Convert resampled data to DataFrame
df_resampled = pd.DataFrame(X_resampled, columns=['Feature1', 'Feature2'])
df_resampled['Label'] = y_resampled

# Check the class distribution after resampling
print("\nClass Distribution After Resampling:")
print(df_resampled['Label'].value_counts())

Class Distribution Before Resampling:
0    950
1     50
Name: Label, dtype: int64

Class Distribution After Resampling:
0    950
1    950
Name: Label, dtype: int64
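
Unlike random oversampling, SMOTE does not copy rows. It creates new points on the line segment between a minority sample and one of its nearest minority neighbors. The following is a minimal sketch of that interpolation step, assuming a tiny made-up set of 2D points; the helper names `nearest_neighbours` and `synthesize` are hypothetical and this is not the library's implementation.

```python
import math
import random

# Toy minority-class points in 2D (made up for illustration)
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]

rng = random.Random(42)  # fixed seed so the run is repeatable
k = 2  # number of nearest minority neighbours to choose from

def nearest_neighbours(point, points, k):
    """Return the k points closest to `point`, excluding the point itself."""
    others = [p for p in points if p != point]
    return sorted(others, key=lambda p: math.dist(point, p))[:k]

def synthesize(point, points, k, rng):
    """Create one synthetic point on the segment between `point` and a neighbour."""
    neighbour = rng.choice(nearest_neighbours(point, points, k))
    lam = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(point, neighbour))

# Generate three synthetic points, each starting from a random minority point
synthetic = [synthesize(rng.choice(minority), minority, k, rng) for _ in range(3)]
for p in synthetic:
    print(p)
```

Each synthetic point is a blend of two real minority points, so it always falls inside the region the minority class already occupies.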


In [10]:
# Using Imblearn to implement a Neighbor-based Sampling that Combines Oversampling and Undersampling
from imblearn.combine import SMOTEENN
df = pd.read_csv(r'imbalanced.csv')
# Check the class distribution before resampling
print("Class Distribution Before Resampling:")
print(df['Label'].value_counts())

# Balance the dataset using SMOTEENN (combination of SMOTE and Edited Nearest Neighbors)
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)

# Convert resampled data to DataFrame
df_resampled = pd.DataFrame(X_resampled, columns=['Feature1', 'Feature2'])
df_resampled['Label'] = y_resampled

# Check the class distribution after resampling
print("\nClass Distribution After Resampling:")
print(df_resampled['Label'].value_counts())

Class Distribution Before Resampling:
0    950
1     50
Name: Label, dtype: int64

Class Distribution After Resampling:
1    853
0    837
Name: Label, dtype: int64
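
The counts above are close but not equal because SMOTEENN runs two steps: SMOTE adds synthetic minority points, then Edited Nearest Neighbors (ENN) removes points that do not fit in with their neighbors. Below is a rough sketch of the cleaning step only, assuming a simple majority-vote rule over the 3 nearest neighbors. The toy points and helper names are made up, and imbalanced-learn's own ENN has selection settings that this sketch does not reproduce.

```python
import math
from collections import Counter

# Toy labelled 2D points (made up): class 0 on the left, class 1 on the right,
# plus one class-1 point sitting inside the class-0 cluster
points = [
    ((0.0, 0.0), 0), ((0.5, 0.2), 0), ((0.2, 0.6), 0), ((0.6, 0.7), 0),
    ((5.0, 5.0), 1), ((5.4, 5.1), 1), ((5.2, 5.6), 1), ((4.8, 5.3), 1),
    ((0.3, 0.3), 1),  # surrounded by class 0
]
k = 3  # neighbours consulted for each point

def neighbour_labels(index, points, k):
    """Labels of the k points closest to points[index], excluding itself."""
    target = points[index][0]
    others = [(math.dist(target, xy), label)
              for i, (xy, label) in enumerate(points) if i != index]
    return [label for _, label in sorted(others)[:k]]

def agrees_with_neighbours(index, points, k):
    """True if the point's label matches the majority label of its neighbours."""
    votes = Counter(neighbour_labels(index, points, k))
    return votes.most_common(1)[0][0] == points[index][1]

# Keep only the points whose label agrees with their neighbourhood
cleaned = [p for i, p in enumerate(points) if agrees_with_neighbours(i, points, k)]
print(len(points), "->", len(cleaned))
```

Only the class-1 point sitting inside the class-0 cluster is dropped. Because this step removes points after SMOTE has balanced the classes, the final class counts usually end up slightly different from each other.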


    Random Over Sampling (ROS):
    Imagine you have a box of candies, and most of them are red, but you want more of the rare green candies. So, you make extra copies of the green candies you already have until there are as many green candies as red ones. The red candies stay untouched.

    Random Under Sampling (RUS):
    Picture another box of candies, but this time you have a lot of green candies and just a few red ones. To make it fair, you remove green candies at random until you have an equal number of both colors. The trade-off is that the removed candies, and whatever they could have taught the model, are gone.

    Synthetic Sampling (SMOTE):
    Now, think of a box with mostly red candies and only a few green candies. With synthetic sampling, you create new green candies that sit in between an existing green candy and one of its nearest green neighbors. They are not copies: each new candy is a blend of two real ones, and you keep adding them until you have a balanced mix of red and green candies.

    Neighbor-based Sampling (SMOTEENN):
    This method combines oversampling and undersampling. First, it uses SMOTE to add new green candies. Then it cleans up with Edited Nearest Neighbors (ENN): it looks at the candies closest to each candy and removes the ones whose color does not match their neighbors. Because the clean-up removes candies, the final counts are close but not exactly equal (853 and 837 in the run above).

    So, in simple terms:

    Random Over Sampling: Make more copies of the less common thing.
    Random Under Sampling: Take away some of the more common thing.
    Synthetic Sampling (SMOTE): Create new examples of the less common thing by blending existing ones with their neighbors.
    Neighbor-based Sampling (SMOTEENN): Create synthetic examples with SMOTE, then remove the examples that do not match their neighbors.