In [26]:
import numpy as np
import pandas as pd


## Generate a synthetic dataset with two classes and an imbalance between the classes

In [49]:
#set a random seed for reproducibility
np.random.seed(123)   #In this way, the code will generate the same random numbers every time.

#set the sizes of the 2 classes
n_samples = 1000
class_0_ratio = 0.9 #Specifies that 90% of the samples belong to class 0.
n_class_0 = int(n_samples * class_0_ratio)    #number of samples in class 0 (900 samples)
n_class_1 = n_samples - n_class_0

In [28]:
n_class_0,n_class_1

(900, 100)

Here we have set up an imbalanced dataset: out of 1000 samples, 900 come from class 0 and the remaining 100 from class 1.

In [30]:
#create DF with imbalanced data set

class_0=pd.DataFrame(
    {
        'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
        'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
        'target':[0] * n_class_0
    })

class_1=pd.DataFrame(
    {
        'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
        'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
        'target':[1] * n_class_1

    })

Here we created DataFrames for class 0 and class 1.

feature_1 and feature_2 for class 0 are generated from a normal distribution with loc=0 (mean) and scale=1 (standard deviation).
The target column is filled with 0 to represent class 0.

Similarly, feature_1 and feature_2 for class 1 are generated with loc=2 (mean shifted to 2) and scale=1.
The target column is filled with 1 to represent class 1.

Note: The loc parameter represents the mean (or "location") of the distribution.
For a normal (Gaussian) distribution, loc is the center or peak of the bell curve.
loc=0 means that the generated data will be centered around 0.

The scale parameter represents the standard deviation (or "spread") of the distribution. It determines how spread out the data is around the mean.
scale=1 means that the standard deviation of the data will be 1, so roughly 68% of the generated values fall within one unit of the mean.
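The effect of loc and scale can be checked empirically. The following is a minimal sketch, not part of the original notebook: it assumes a separate seeded generator (`np.random.default_rng(0)`, an arbitrary choice) so that the notebook's global `np.random.seed(123)` state is left untouched, and it draws a large sample so the measured statistics sit close to the requested values.

```python
import numpy as np

# Separate seeded generator (assumption: seed 0 is arbitrary) so the
# global np.random state used elsewhere in the notebook is not disturbed.
rng = np.random.default_rng(0)

# Large sample so the empirical statistics are close to loc and scale.
samples = rng.normal(loc=2, scale=1, size=100_000)

sample_mean = samples.mean()   # should be close to loc=2
sample_std = samples.std()     # should be close to scale=1

# Fraction of values lying within one standard deviation of the mean.
within_one_std = np.mean(np.abs(samples - 2) <= 1)

print(f"mean           = {sample_mean:.3f}")
print(f"std            = {sample_std:.3f}")
print(f"within one std = {within_one_std:.3f}")
```

The printed fraction should land near 0.68, which is the usual rule of thumb for a normal distribution.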

In [31]:
df=pd.concat([class_0,class_1]).reset_index(drop=True) #reset the index of the DataFrame after concatenation.
df.head()

   feature_1  feature_2  target
0  -1.774224   0.285744       0
1  -1.201377   0.333279       0
2   1.096257   0.531807       0
3   0.861037  -0.354766       0
4  -1.520367  -1.120815       0


When you concatenate two DataFrames using pd.concat([class_0, class_1]), the resulting DataFrame retains the original index of each row from the individual DataFrames.

Because class_0 and class_1 both start their index at 0, the concatenated DataFrame would contain duplicate index labels (here, the labels 0 to 99 would each appear twice).

reset_index(drop=True) assigns a new, continuous index to the concatenated DataFrame, running from 0 up to the total number of rows minus one.

The drop=True parameter ensures that the old index is not added as a new column in the DataFrame.
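The duplicate-index behaviour described above can be seen on a toy example. This is a small sketch using two hypothetical two-row DataFrames (`a` and `b` are stand-ins, not the notebook's data); it also shows `ignore_index=True`, a `pd.concat` option that gives the same fresh index in one step.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data), each with the default index 0..1.
a = pd.DataFrame({'x': [10, 20]})
b = pd.DataFrame({'x': [30, 40]})

kept = pd.concat([a, b])                          # original indices are retained
fresh = pd.concat([a, b]).reset_index(drop=True)  # new continuous index 0..3
same = pd.concat([a, b], ignore_index=True)       # one-step equivalent

print(list(kept.index))    # [0, 1, 0, 1]
print(list(fresh.index))   # [0, 1, 2, 3]
print(len(kept.loc[0]))    # 2, because label 0 matches two rows
```

With duplicate labels, a lookup such as `kept.loc[0]` returns several rows instead of one, which is why the notebook resets the index right after concatenating.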


In [32]:
df.tail()

     feature_1  feature_2  target
995   2.677156   1.092048       1
996   2.963404   0.181955       1
997   1.621476   1.877267       1
998   3.429559   3.794486       1
999   3.532273   1.679490       1


In [33]:
df['target'].value_counts() #shows the number of samples in each class.

target
0    900
1    100
Name: count, dtype: int64

## Upsampling

In [34]:
df_minority=df[df['target']==1]  #includes only the rows where the target column is equal to 1.
df_majority=df[df['target']==0]  #includes only the rows where the target column is equal to 0.

In [35]:
from sklearn.utils import resample

df_minority_upsampled=resample(df_minority,replace=True, #sample with replacement, so rows can repeat
         n_samples=len(df_majority), #target size: match the number of majority samples
         random_state=42      #ensure reproducibility
         )

The resample function can be used to oversample the minority class so that it matches the number of samples in the majority class.

When replace=True, the same sample from the minority class can be selected more than once in the resampled data.

That is, it increases the number of minority samples by duplicating existing rows until the count matches the majority class; no new data points are generated.
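To make the duplication concrete, here is a minimal sketch on a hypothetical five-row minority class (the `minority` frame and the target size of 20 are illustrative assumptions, not the notebook's data). Since 20 rows are drawn from only 5 originals, repeats are unavoidable.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical minority class with only 5 distinct rows.
minority = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0, 5.0],
                         'target': [1] * 5})

# Draw 20 rows with replacement: every drawn row is a copy of an original.
upsampled = resample(minority, replace=True, n_samples=20, random_state=42)

n_rows = len(upsampled)                   # number of rows after upsampling
n_distinct = upsampled.index.nunique()    # distinct original rows that were drawn

print(n_rows, n_distinct)
# How often each original row (identified by its index label) was selected.
print(upsampled.index.value_counts().sort_index())
```

The index labels of the upsampled frame all come from the original five rows, which is also why the notebook's `df_minority_upsampled.head()` shows labels in the 900 to 999 range in shuffled order.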



In [36]:
df_minority_upsampled.shape

(900, 3)

In [37]:
df_minority_upsampled.head()


     feature_1  feature_2  target
951   2.905343   1.495151       1
992   2.000977   1.814833       1
914   1.927957   2.280911       1
971   2.819483   2.964646       1
960   2.456515   1.833528       1


In [38]:
df_sampled=pd.concat([df_majority,df_minority_upsampled])

In [39]:
df_sampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

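The df_minority_upsampled DataFrame has the same number of samples as df_majority (900). Because it is drawn with replacement from only 100 original rows, many of its rows are duplicates. After concatenation, df_sampled is balanced with 900 samples in each class.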

## Downsampling

In [41]:
np.random.seed(123)

#create dataframe with 2 classes
n_samples=1000
class_0_ratio=0.9
n_class_0=int(n_samples * class_0_ratio)
n_class_1=n_samples-n_class_0

class_0=pd.DataFrame(
    {
        'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
        'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
        'target':[0] * n_class_0
    })

class_1=pd.DataFrame(
    {
        'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
        'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
        'target':[1] * n_class_1

    })
df=pd.concat([class_0,class_1]).reset_index(drop=True)
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [42]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [43]:
df_majority_downsampled=resample(df_majority,replace=False,
                                 n_samples=len(df_minority),
                                 random_state=42)

The resample function can also perform downsampling (undersampling): it balances the classes by reducing the number of samples in the majority class.

When replace=False, each sample from the majority class can be selected at most once.
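A small sketch of what sampling without replacement implies, using a hypothetical ten-row majority class (the `majority` frame and the sizes 4 and 11 are illustrative assumptions). Every selected row is unique, and asking for more rows than exist raises a ValueError in scikit-learn's resample.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical majority class with 10 rows.
majority = pd.DataFrame({'feature': range(10), 'target': [0] * 10})

# Draw 4 rows without replacement: no row can appear twice.
downsampled = resample(majority, replace=False, n_samples=4, random_state=42)
print(len(downsampled), downsampled.index.is_unique)

# Requesting more rows than are available is impossible without replacement.
try:
    resample(majority, replace=False, n_samples=11, random_state=42)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

This is why downsampling can only shrink a class: the target size is capped by the number of rows the class already has.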

In [44]:
df_majority_downsampled.shape

(100, 3)

The majority class has been reduced from 900 samples to 100.

In [48]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])
df_downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64

The df_majority_downsampled DataFrame has the same number of samples as df_minority (100) and contains no duplicates, at the cost of discarding the other 800 majority samples.