<font color="Yellow" size="6">Techniques for handling imbalanced datasets</font>
<P><font color="red" size="5">1_RandomOverSampler</font>

In [None]:
#pip install imbalanced-learn

<b><font color="yellow" size=4>RandomOverSampler</font></b>
<p>
RandomOverSampler, from the imbalanced-learn library, oversamples the minority class of an imbalanced dataset. It picks minority-class rows at random, with replacement, and appends the copies to the data; by default it keeps doing so until the minority class has as many samples as the majority class.
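
The duplication step can be sketched by hand with NumPy. This is only an illustration of the idea, not imbalanced-learn's implementation: `random_oversample` is a hypothetical helper, and the toy arrays are made up for the demo.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Hypothetical helper: copy random rows of each smaller class
    until every class is as large as the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    index_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement (none for the largest class)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        index_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(index_parts)
    return X[idx], y[idx]

# Toy data (assumed for the demo): 8 rows of class 0, 2 rows of class 1
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_demo, y_demo)
print("Class counts after oversampling:", np.bincount(y_bal))  # [8 8]
```

Every row in `X_bal` is an exact copy of a row in `X_demo`; no new data points are created.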

In [2]:
import numpy as np
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from collections import Counter

# Step 1: Create a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before oversampling
print("Class distribution before oversampling:", Counter(y))

# Step 3: Apply RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Step 4: Check the class distribution after oversampling
print("Class distribution after oversampling:", Counter(y_resampled))


Class distribution before oversampling: Counter({0: 898, 1: 102})
Class distribution after oversampling: Counter({0: 898, 1: 898})


<p><b>n_samples=1000:</b> The total number of samples (rows) to generate; here, 1000.
<p><b>n_features=20:</b> The total number of features (columns); here, 20.
<p><b>n_informative=2:</b> The number of informative features, i.e., features that carry the signal separating the classes.
    Here, 2 features are informative, so only they determine the true decision boundary.
<p><b>n_redundant=10:</b> The number of redundant features, which are generated as random linear combinations of the informative features.
    Here, 10 features are redundant: they add no new information beyond the 2 informative ones. The remaining 8 features are random noise.
<p><b>n_classes=2:</b> The number of classes (labels) in the dataset.
    Here, the dataset has 2 classes, i.e., a binary classification problem.
<p><b>weights=[0.9, 0.1]:</b>
    This parameter controls the class distribution: each entry is the proportion of samples assigned to the corresponding class.
    The first value (0.9) is the proportion for class 0, the majority class.
    The second value (0.1) is the proportion for class 1, the minority class.
    About 90% of the samples therefore belong to class 0 and about 10% to class 1, which makes the dataset imbalanced. The split is approximate: the output above shows 898 and 102 samples rather than exactly 900 and 100.
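
The small gap between the requested 90/10 split and the observed 898/102 most likely comes from label noise: make_classification has a flip_y parameter (default 0.01) that assigns a random class to that fraction of samples. The sketch below compares the default with flip_y=0. The `common` dictionary is just a convenience introduced here, and exact counts may vary with the scikit-learn version, so none are quoted.

```python
from collections import Counter
from sklearn.datasets import make_classification

# Same settings as the dataset above, collected in one place (helper dict for this demo)
common = dict(n_samples=1000, n_features=20, n_informative=2,
              n_redundant=10, n_classes=2, weights=[0.9, 0.1],
              random_state=42)

# Default flip_y=0.01: roughly 1% of labels are reassigned at random
_, y_noisy = make_classification(**common)

# flip_y=0 switches the label noise off
_, y_clean = make_classification(flip_y=0, **common)

print("flip_y=0.01 (default):", Counter(y_noisy))
print("flip_y=0:             ", Counter(y_clean))
```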

<p><b>random_state=42:</b>
<ol>
    <li>This is the seed for the random number generator, which makes the dataset generation reproducible.</li>
    <li>With random_state=42, make_classification produces the same dataset every time the code runs, so results can be compared across runs. The same argument is passed to RandomOverSampler so that the same minority rows are duplicated each time.</li></ol>
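
A quick way to check the effect of the seed is to generate the dataset twice with the same random_state and once with a different one. The seed 7 and the variable names below are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(n_samples=1000, n_features=20, n_informative=2,
              n_redundant=10, n_classes=2, weights=[0.9, 0.1])

X_a, y_a = make_classification(random_state=42, **params)
X_b, y_b = make_classification(random_state=42, **params)
X_c, y_c = make_classification(random_state=7, **params)  # arbitrary other seed

same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
different_seed_matches = np.array_equal(X_a, X_c)

print("Same seed, identical data:     ", same_seed_matches)
print("Different seed, identical data:", different_seed_matches)
```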

<font color="pink" size=4>Advantages of Using RandomOverSampler:</font>
<ol>
    <li><font color="orange">Balancing the Dataset:</font> It balances the dataset by increasing the number of minority class samples, without discarding any majority class samples.</li>
    <li><font color="orange">Simple and Effective:</font> It is a simple technique that works well when the class imbalance is not too severe.</li>
    <li><font color="orange">Works Well with Other Algorithms:</font> The resampled data is an ordinary feature matrix and label vector, so after balancing you can train any machine learning algorithm (e.g., logistic regression, decision trees, etc.) on it.</li></ol>

<font color="pink" size=4>Potential Drawbacks:</font>
<ol>
    <li><font color="orange">Overfitting:</font> Since it simply replicates existing minority class samples, the model sees the same points many times and may memorize them instead of learning a boundary that generalizes.</li>
    <li><font color="orange">No New Information:</font> Duplicated rows add no new information about the minority class. In cases of extreme class imbalance, oversampling alone might not be enough to solve the problem and might still result in poor model performance.</li></ol>