### Part 1. Building the dataset
##### [Ref: Undersampling and oversampling: An old and a new approach](https://medium.com/analytics-vidhya/undersampling-and-oversampling-an-old-and-a-new-approach-4f984a0e8392)

- Undersampling and oversampling are techniques for dealing with **imbalanced classes** in a dataset


- They are used to keep a model from fitting the majority class at the expense of the other classes, whether there is one minority class or several

In [1]:
import pandas as pd
import numpy as np
np.random.seed(2)

In [2]:
## Create a randomly filled dataset of three columns & 100 rows
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=['f1','f2','f3'])

## The columns in this DataFrame constitute the features
df.head(3)

Unnamed: 0,f1,f2,f3
0,40,15,72
1,22,43,82
2,75,7,34


In [3]:
## Create the label column, named 'l1', with class values 1, 2 or 3
df_label = pd.DataFrame(np.random.randint(1, 4, size=(100,1)), columns=['l1'])
df_label.head(3)

Unnamed: 0,l1
0,2
1,2
2,2


In [4]:
## View the class composition
df_label['l1'].value_counts()

3    42
2    32
1    26
Name: l1, dtype: int64

In [5]:
## Combine DataFrames - DataFrame now with three columns acting as features & one as the class column
df['label'] = df_label['l1']
df.head(3)

Unnamed: 0,f1,f2,f3,label
0,40,15,72,2
1,22,43,82,2
2,75,7,34,2


In [6]:
## View the class composition - combined DataFrame
df['label'].value_counts()

3    42
2    32
1    26
Name: label, dtype: int64
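
To put a number on the imbalance, the counts above can be turned into class proportions and an imbalance ratio. The sketch below is a minimal illustration that rebuilds a label column with the same counts (26, 32, 42) rather than reusing `df`; the names `labels`, `proportions`, and `imbalance_ratio` are introduced here for illustration only.

```python
import pandas as pd

# Stand-in label column with the same class counts as above (26 / 32 / 42)
labels = pd.Series([1] * 26 + [2] * 32 + [3] * 42, name="label")

counts = labels.value_counts()
# Share of the dataset held by each class
proportions = labels.value_counts(normalize=True)
# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(proportions.sort_index())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests resampling may not be needed at all; the further it is from 1, the more the majority class dominates.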

---
### Part 2. Undersampling

- Minority class label = 1 (26 instances); the larger classes are randomly sampled down to this size

In [7]:
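## Look up each class count by its label instead of relying on the order returned by value_counts()
counts = df['label'].value_counts()
class_3, class_2, class_1 = counts[3], counts[2], counts[1]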

## DataFrame for each class label
c3 = df[df['label'] == 3]
c2 = df[df['label'] == 2]
c1 = df[df['label'] == 1] ## - Minority class label

## Randomly sample classes 2 and 3 down to the size of the minority class (label = 1)
df2 = c2.sample(class_1)
df3 = c3.sample(class_1)

In [8]:
## Concatenate the new sampled DataFrames with the minority class DataFrame (c1)
undersampled_df = pd.concat([c1, df2, df3], axis=0)
undersampled_df.head(3)

Unnamed: 0,f1,f2,f3,label
5,31,90,20,1
6,37,39,67,1
7,4,42,51,1


In [9]:
## View the class composition - undersampled data
undersampled_df['label'].value_counts()

1    26
2    26
3    26
Name: label, dtype: int64
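
The per-class steps above can be generalized so that the labels and counts are not hard-coded. The following is a minimal sketch, not the referenced article's method: `undersample` is a hypothetical helper, the `demo` frame is a stand-in with the same class counts as `df`, and it assumes pandas 1.1 or later for `DataFrameGroupBy.sample`.

```python
import numpy as np
import pandas as pd


def undersample(frame, label_col, random_state=None):
    """Randomly downsample every class to the size of the smallest class."""
    # Size of the smallest class, found by value rather than by position
    n_min = frame[label_col].value_counts().min()
    # Draw n_min rows without replacement from each class
    return frame.groupby(label_col).sample(n=n_min, random_state=random_state)


# Stand-in data with the same class counts as the notebook (26 / 32 / 42)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f1": rng.integers(0, 100, size=100),
    "label": [1] * 26 + [2] * 32 + [3] * 42,
})

balanced = undersample(demo, "label", random_state=0)
print(balanced["label"].value_counts().sort_index())
```

Passing `random_state` makes the sample reproducible without depending on the global NumPy seed.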

---
### Part 3. Oversampling

- Majority class label = 3 (42 instances); the smaller classes are resampled with replacement up to this size

In [10]:
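## Look up each class count by its label instead of relying on the order returned by value_counts()
counts = df['label'].value_counts()
class_3, class_2, class_1 = counts[3], counts[2], counts[1]

## DataFrame for each class label
c3 = df[df['label'] == 3] ## - Majority class label
c2 = df[df['label'] == 2]
c1 = df[df['label'] == 1]

## Resample classes 1 and 2 with replacement up to the size of the majority class (label = 3)
df1 = c1.sample(class_3, replace=True)
df2_over = c2.sample(class_3, replace=True)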

In [11]:
## Concatenate the new sampled DataFrames with the majority class DataFrame (c3)
oversampled_df = pd.concat([df1, df2_over, c3], axis=0)
oversampled_df.head(3)

Unnamed: 0,f1,f2,f3,label
12,66,80,52,1
5,31,90,20,1
5,31,90,20,1


In [12]:
## View the class composition - oversampled data
oversampled_df['label'].value_counts()

1    42
2    42
3    42
Name: label, dtype: int64
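
Oversampling can be generalized in the same way. The sketch below is one possible variant and differs from the cells above in one respect: instead of redrawing each smaller class entirely with replacement (which can leave out some original rows), it keeps every original row and only adds the duplicates needed to reach the majority size. `oversample` and `demo` are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd


def oversample(frame, label_col, random_state=None):
    """Randomly duplicate rows so every class matches the largest class."""
    n_max = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        extra = n_max - len(group)
        if extra > 0:
            # Keep all original rows and add `extra` duplicates drawn with replacement
            duplicates = group.sample(n=extra, replace=True, random_state=random_state)
            group = pd.concat([group, duplicates])
        parts.append(group)
    return pd.concat(parts)


# Stand-in data with the same class counts as the notebook (26 / 32 / 42)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f1": rng.integers(0, 100, size=100),
    "label": [1] * 26 + [2] * 32 + [3] * 42,
})

oversampled = oversample(demo, "label", random_state=0)
print(oversampled["label"].value_counts().sort_index())
```

Because duplicated rows keep their original index values, the index of the result is not unique; calling `reset_index(drop=True)` afterwards gives a clean index if one is needed.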