# Imbalanced Dataset

An **imbalanced dataset** refers to a situation in machine learning where the classes (or categories) in the target variable are not distributed equally. One class has significantly more samples than the other(s), which can lead to biased models that favor the majority class.

### Key Characteristics of Imbalanced Data:
- **Majority Class**: One class has far more data points than others.
- **Minority Class**: One or more classes have far fewer data points.
- **Model Bias**: Models trained on imbalanced datasets may have poor performance on the minority class since they tend to predict the majority class more often.

### Example 1: Fraud Detection
- **Problem**: Predicting whether a transaction is fraudulent or not.
- **Imbalance**: In a dataset, 99% of the transactions are **legitimate** (non-fraudulent), while only 1% are **fraudulent**.
- **Effect**: A model trained on this data might predict "non-fraudulent" for almost every transaction because it's the dominant class, leading to high accuracy but poor performance on fraud detection (since it's missing the minority fraud cases).

### Example 2: Disease Prediction
- **Problem**: Predicting whether a person has a particular disease (e.g., cancer).
- **Imbalance**: If 95% of people in the dataset are **healthy** and only 5% have the disease, the dataset is imbalanced.
- **Effect**: The model may predict "healthy" for most cases because it's the overwhelming majority, leading to a high accuracy but failing to identify people who have the disease.

### Example 3: Customer Churn Prediction
- **Problem**: Predicting if a customer will leave a service (churn) or stay.
- **Imbalance**: If 80% of customers **stay** and only 20% **churn**, the dataset is imbalanced.
- **Effect**: A model might predict "stay" for almost every customer, since that is the majority class, leading to good accuracy but poor identification of customers at risk of churning.

### Why is it a Problem?
1. **Skewed Metrics**: Common evaluation metrics like **accuracy** can be misleading. A model predicting only the majority class can achieve high accuracy but fail to recognize the minority class.

2. **Poor Performance on Minority Class**: The model may learn to ignore the minority class, which is often the more important class (e.g., fraud detection, disease prediction).
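The gap between accuracy and minority-class performance can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical set of 1,000 transactions with the 99%/1% split from the fraud example and a "model" that always predicts the majority class; the labels are made up for illustration.

```python
# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Count the outcomes needed for accuracy and recall on the fraud class.
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The classifier is right 99% of the time and still catches none of the fraud cases, which is why accuracy alone says little on imbalanced data.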

### Techniques to Handle Imbalanced Datasets:
- **Resampling**:
  - **Oversampling** the minority class (e.g., using SMOTE).
  - **Undersampling** the majority class.
- **Class Weights**: Assign higher weights to the minority class during training.
- **Anomaly Detection**: Treat the minority class as an anomaly and use anomaly detection methods.
- **Evaluation Metrics**: Use metrics like **Precision**, **Recall**, **F1-score**, **ROC-AUC**, which give more insight into how well the model is handling both classes.
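To make the class-weight idea concrete, here is a minimal sketch of one common heuristic: weight each class by `n_samples / (n_classes * class_count)`. This is the formula scikit-learn documents for `class_weight='balanced'`; the 900/100 label list is an assumption chosen for illustration.

```python
from collections import Counter

# Hypothetical target column: 900 samples of class 0, 100 of class 1.
y = [0] * 900 + [1] * 100

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

print(class_weights[1])  # 5.0
```

With these weights, each class contributes the same total weight to the training loss (900 × 0.556 ≈ 100 × 5.0). A dictionary like this can typically be passed to estimators that accept a `class_weight` argument.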

Imbalance in datasets is common in many real-world applications, and handling it correctly is crucial for building fair and effective models.


In [37]:
# Creating an imbalanced dataset
import numpy as np
import pandas as pd

# set the random seed for reproducibility
np.random.seed(42)

# define the class sizes: 90% class 0, 10% class 1
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

# to check the number of values
n_class_0,n_class_1



(900, 100)

In [38]:
# Build one DataFrame per class; both draw features from a standard normal distribution
class_0 = pd.DataFrame({
    'feature1': np.random.normal(0, 1, n_class_0),
    'feature2': np.random.normal(0, 1, n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature1': np.random.normal(0, 1, n_class_1),
    'feature2': np.random.normal(0, 1, n_class_1),
    'target': [1] * n_class_1
})

In [39]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)
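Before resampling, it helps to confirm how skewed the target actually is. The sketch below uses a stand-in frame, `df_check` (a hypothetical name), with the same 900/100 split as `df`, and reports the counts, proportions, and majority-to-minority ratio.

```python
import pandas as pd

# Stand-in for the notebook's df: only the target column matters here.
df_check = pd.DataFrame({'target': [0] * 900 + [1] * 100})

counts = df_check['target'].value_counts()                      # rows per class
proportions = df_check['target'].value_counts(normalize=True)   # share per class
imbalance_ratio = counts.max() / counts.min()                   # majority / minority

print(counts)
print(proportions)
print(f"imbalance ratio = {imbalance_ratio:.1f}")  # imbalance ratio = 9.0
```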

## Up-sampling



In [40]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [46]:
from sklearn.utils import resample

# sample the minority class with replacement until it matches the majority class size
df_minority_up_sampled = resample(
    df_minority,
    replace=True,                # draw with replacement, so rows can repeat
    n_samples=len(df_majority),  # match the majority class size (900)
    random_state=42,
)
df_minority_up_sampled.shape  # (900, 3): the minority class now has 900 rows
df_minority_up_sampled.head()

     feature1  feature2  target
951  1.775311  1.261922       1
992 -0.436386  1.188913       1
914 -0.268531 -1.801058       1
971 -0.214921 -2.940389       1
960 -0.134309 -0.054894       1


In [51]:
# combine the up-sampled minority class with the original majority class
df_up_sampled = pd.concat([df_minority_up_sampled, df_majority]).reset_index(drop=True)
df_up_sampled['target'].value_counts()
# both classes now have 900 data points

target
1    900
0    900
Name: count, dtype: int64
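Up-sampling with replacement only duplicates existing minority rows. SMOTE, mentioned in the techniques list, instead creates new points by interpolating between a minority sample and one of its nearest minority neighbors. The following is a simplified sketch of that idea in plain NumPy, not the reference implementation from any library; the function name `smote_like` and its parameters are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances between minority points.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point

    base_idx = rng.integers(0, n, size=n_new)   # random base point per new sample
    nb_choice = rng.integers(0, k, size=n_new)  # which of its k neighbors to use
    nb_idx = neighbors[base_idx, nb_choice]
    gap = rng.random((n_new, 1))                # position along the connecting segment
    return X_min[base_idx] + gap * (X_min[nb_idx] - X_min[base_idx])

# Hypothetical minority class: 100 points with two features.
X_min = np.random.default_rng(0).normal(0, 1, size=(100, 2))
synthetic = smote_like(X_min, n_new=800)
X_min_balanced = np.vstack([X_min, synthetic])
print(X_min_balanced.shape)  # (900, 2)
```

Each synthetic point lies on the line segment between two real minority points, so the minority class gains variety rather than exact copies. For real projects a maintained implementation is usually preferable to a hand-rolled one.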

## Down-sampling


In [55]:
# Creating an imbalanced dataset
import numpy as np
import pandas as pd

# set the random seed for reproducibility
np.random.seed(42)

# create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = (n_samples - n_class_0)

# to check the number of values
n_class_0, n_class_1

# build one DataFrame per class
class_0 = pd.DataFrame({
    'feature1': np.random.normal(0, 1, n_class_0),
    'feature2': np.random.normal(0, 1, n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature1': np.random.normal(0, 1, n_class_1),
    'feature2': np.random.normal(0, 1, n_class_1),
    'target': [1] * n_class_1
})
df = pd.concat([class_0, class_1]).reset_index(drop=True)
# Down-sampling: split the data by class
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

from sklearn.utils import resample

# draw a subset of the majority class, without replacement, matching the minority class size
df_majority_down_sampled = resample(
    df_majority,
    replace=False,               # no row is picked twice
    n_samples=len(df_minority),  # match the minority class size (100)
    random_state=42,
)
df_majority_down_sampled.shape  # (100, 3): the majority class is reduced to 100 rows
df_majority_down_sampled.head()

# combine the down-sampled majority class with the original minority class
df_down_sampled = pd.concat([df_majority_down_sampled, df_minority]).reset_index(drop=True)
df_down_sampled['target'].value_counts()
# both classes now have 100 data points

target
0    100
1    100
Name: count, dtype: int64
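When down-sampling, drawing without replacement guarantees that every kept majority row is distinct, whereas drawing with replacement can pick the same row more than once and waste part of an already small sample. A minimal sketch of the difference, using stand-in row positions rather than the actual DataFrame (the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
majority_idx = np.arange(900)  # stand-in positions for the 900 majority rows
n_minority = 100

# Draw 100 majority rows each way.
without_repl = rng.choice(majority_idx, size=n_minority, replace=False)
with_repl = rng.choice(majority_idx, size=n_minority, replace=True)

print(len(np.unique(without_repl)))  # 100: every selected row is distinct
print(len(np.unique(with_repl)))     # at most 100; duplicates reduce this count
```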

## Down-sampling is costly because we lose a lot of data points (here, 800 of the 900 majority-class rows are discarded)
