## Handling Imbalanced Data Set

### 1. Upsampling  2. Downsampling

In [3]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [4]:
n_class_0,n_class_1

(900, 100)

In [5]:
## Create a DataFrame with an imbalanced class distribution
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0,scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0,scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2,scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2,scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [8]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [9]:
df.head()

   feature_1  feature_2  target
0  -1.085631   0.551302       0
1   0.997345   0.419589       0
2   0.282978   1.815652       0
3  -1.506295  -0.252750       0
4  -0.578600  -0.292004       0


In [10]:
df.tail()

     feature_1  feature_2  target
995   1.376371   2.845701       1
996   2.239810   0.880077       1
997   1.131760   1.640703       1
998   2.902006   0.390305       1
999   2.697490   2.013570       1


In [11]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

In [12]:
## Upsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [17]:
from sklearn.utils import resample
df_minority_upsampled=resample(df_minority,
                        replace=True,                 # sample with replacement
                        n_samples=len(df_majority),   # match the majority class size
                        random_state=42
                        )

In [18]:
df_minority_upsampled.shape

(900, 3)

In [19]:
df_minority_upsampled.head()

     feature_1  feature_2  target
951   1.125854   1.843917       1
992   2.196570   1.397425       1
914   1.932170   2.998053       1
971   2.272825   3.034197       1
960   2.870056   1.550485       1


In [22]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [25]:
df_upsampled['target'].value_counts()

0    900
1    900
Name: target, dtype: int64
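
Upsampling with replacement does not create new information: the 900 minority rows are repeated copies of the original 100. The following is a minimal sketch of that point, not the notebook's own code. It rebuilds a hypothetical 100-row minority class and uses pandas `DataFrame.sample` as a stand-in for `sklearn.utils.resample`; the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Hypothetical minority class with 100 rows, mirroring the notebook above.
minority = pd.DataFrame({
    "feature_1": rng.normal(loc=2, scale=1, size=100),
    "feature_2": rng.normal(loc=2, scale=1, size=100),
    "target": 1,
})

# Draw 900 rows with replacement (pandas stand-in for sklearn's resample).
upsampled = minority.sample(n=900, replace=True, random_state=42)

n_rows = len(upsampled)
n_distinct = upsampled.index.nunique()  # how many original rows were drawn

print(f"rows after upsampling: {n_rows}")
print(f"distinct original rows: {n_distinct}")
```

Because every upsampled row is a duplicate, a model can overfit to those repeated points, which is worth keeping in mind when evaluating on a held-out set.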

In [None]:
# Downsampling

In [26]:
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0,scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0,scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2,scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2,scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df=pd.concat([class_0,class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

0    900
1    100
Name: target, dtype: int64


In [27]:
## Downsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [28]:
from sklearn.utils import resample
df_majority_downsampled=resample(df_majority,replace=False,
                        n_samples=len(df_minority),
                        random_state=42
                        )

In [29]:
df_majority_downsampled.shape

(100, 3)

In [30]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])

In [31]:
df_downsampled.target.value_counts()

1    100
0    100
Name: target, dtype: int64

# Imbalanced Data

In [None]:
A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the
data set are called majority classes. Those that make up a smaller proportion are minority classes.

What counts as imbalanced? The answer could range from mild to extreme, as the table below shows.

In [None]:
Degree of imbalance    Proportion of minority class

Mild                   20-40% of the data set
Moderate               1-20% of the data set
Extreme                <1% of the data set

In [None]:
Why look out for imbalanced data? You may need to apply a particular sampling technique if you have a classification task
with an imbalanced data set.

Consider a model that detects fraud. Fraud occurs once per 200 transactions in this
data set, so in the true distribution about 0.5% of the data is positive.

Why would this be problematic? With so few positives relative to negatives, training spends most of its time
on negative examples and the model does not learn enough from positive ones. For example, if your batch size is 128,
many batches will have no positive examples, so the gradients will be less informative.
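
To put a rough number on "many batches", the sketch below computes the chance that a batch of 128 contains no positive example. It assumes each example in a batch is drawn independently with a 0.5% positive rate, which is a simplification of how real batches are formed.

```python
positive_rate = 1 / 200   # fraud occurs once per 200 transactions
batch_size = 128

# Probability that all 128 independently drawn examples are negative.
p_no_positive = (1 - positive_rate) ** batch_size

# Average number of positive examples per batch.
expected_positives = positive_rate * batch_size

print(f"P(batch has no positives) = {p_no_positive:.3f}")
print(f"Expected positives per batch = {expected_positives:.2f}")
```

Under this assumption, roughly half of all batches contain no fraud example at all, and the average batch contains fewer than one.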

If you have an imbalanced data set, first try training on the true distribution. If the model works well and generalizes,
you're done! If not, try the following downsampling and upweighting technique.



# Downsampling

In [None]:
Downsampling (in this context) means training on a disproportionately small subset of the majority class examples.

# Upweighting

In [None]:
Upweighting means adding an example weight to the downsampled class equal to the factor by which you downsampled.

In [None]:
Step 1: Downsample the majority class. Consider again our example of the fraud data set, with 1 positive to 200 negatives.
Downsampling by a factor of 20 improves the balance to 1 positive to 10 negatives (10%). Although the resulting
training set is still moderately imbalanced, the proportion of positives to negatives is much better than the original
extremely imbalanced proportion (0.5%).

In [None]:
Step 2: Upweight the downsampled class. The last step is to add example weights to the downsampled class. Since we
downsampled by a factor of 20, the example weight should be 20.

In [None]:
You may be used to hearing the term weight when it refers to model parameters, like connections in a neural network.
Here we're talking about example weights, which means counting an individual example as more important during training.
An example weight of 10 means the model treats the example as 10 times as important (when computing loss) as it would an
example of weight 1.

The weight should be equal to the factor you used to downsample:

    {example weight} = {original example weight} * {downsampling factor}
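
The two steps can be sketched in pandas as follows. This is an illustration under stated assumptions rather than a definitive implementation: the data is synthetic, and the column name `example_weight` and the variable names are hypothetical choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical fraud-like data: 1 positive per 200 negatives.
n_pos, n_neg = 100, 20_000
df = pd.DataFrame({
    "feature": rng.normal(size=n_pos + n_neg),
    "target": [1] * n_pos + [0] * n_neg,
})
df["example_weight"] = 1.0  # original example weight

downsampling_factor = 20

minority = df[df["target"] == 1]

# Step 1: keep 1 in 20 majority examples, sampled without replacement.
majority = df[df["target"] == 0].sample(
    n=n_neg // downsampling_factor, replace=False, random_state=42
)

# Step 2: example weight = original example weight * downsampling factor.
majority = majority.assign(
    example_weight=majority["example_weight"] * downsampling_factor
)

train = pd.concat([minority, majority]).reset_index(drop=True)

print(train["target"].value_counts())
print(train.groupby("target")["example_weight"].sum())
```

The weight column could then be handed to a training routine that accepts per-example weights; many scikit-learn estimators, for instance, take a `sample_weight` argument in `fit`.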

In [None]:
Why Downsample and Upweight?
It may seem odd to add example weights after downsampling. We were trying to make our model improve on the minority
class -- why would we upweight the majority? These are the resulting changes:

In [None]:
=> Faster convergence: During training, we see the minority class more often, which will help the model converge faster.
=> Disk space: By consolidating the majority class into fewer examples with larger weights, we spend less disk space storing
    them. These savings free up disk space for the minority class, so we can collect a greater number and a wider range
    of examples from that class.
=> Calibration: Upweighting ensures our model is still calibrated; the outputs can still be interpreted as probabilities.
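
The calibration point can be checked with the fraud example's own numbers. The sketch below is plain arithmetic with hypothetical variable names: it compares the positive rate in the original data, in the downsampled data without weights, and in the downsampled data once each kept negative carries a weight of 20.

```python
n_pos, n_neg = 1, 200      # 1 positive to 200 negatives
factor = 20                # downsampling factor

true_rate = n_pos / (n_pos + n_neg)

# After downsampling, 10 negatives remain for each positive.
kept_neg = n_neg // factor
unweighted_rate = n_pos / (n_pos + kept_neg)

# With upweighting, each kept negative counts as 20 examples.
weighted_rate = n_pos / (n_pos + kept_neg * factor)

print(f"true positive rate:       {true_rate:.4f}")
print(f"downsampled, unweighted:  {unweighted_rate:.4f}")
print(f"downsampled, upweighted:  {weighted_rate:.4f}")
```

Without weights the training data overstates how common positives are; with weights, the weighted class proportions match the original distribution, which is why the model's outputs can still be read as probabilities.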