# Imbalanced Dataset

- Consider a dataset containing a categorical label whose value is either Positive or Negative. In a balanced dataset, the number of Positive and Negative labels is about equal. However, if one label is more common than the other label, then the dataset is imbalanced. The predominant label in an imbalanced dataset is called the majority class; the less common label is called the minority class.

|Percentage of data belonging to minority class|Degree of imbalance|
|--|--|
|20-40% of the dataset|Mild|
|1-20% of the dataset|Moderate|
|<1% of the dataset|Extreme|
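
The table can be turned into a small helper that reports the minority share of a label column and the matching degree of imbalance. This is only a sketch: the function name `imbalance_degree` is hypothetical, the table does not say which row a share of exactly 20% belongs to (it is treated as Mild here), and the label for shares above 40% is an assumption. The 700/300 split matches the class counts of the credit dataset used later in this notebook.

```python
import pandas as pd

def imbalance_degree(labels: pd.Series) -> tuple:
    """Return the minority-class share (in percent) and the degree from the table."""
    share = labels.value_counts(normalize=True).min() * 100
    if share < 1:
        degree = "Extreme"
    elif share < 20:          # assumption: exactly 20% counts as Mild
        degree = "Moderate"
    elif share <= 40:
        degree = "Mild"
    else:                     # assumption: the table has no row above 40%
        degree = "Balanced or nearly balanced"
    return share, degree

labels = pd.Series(["no"] * 700 + ["yes"] * 300)
share, degree = imbalance_degree(labels)
print(f"{share:.1f}% minority -> {degree}")
```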

## Solving this problem

1. Reduce the majority class: undersampling
2. Increase the minority class: oversampling
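
To make the two options concrete, here is a minimal pandas-only sketch of what random undersampling and random oversampling do to the class counts. The toy frame and the names `df_toy`, `under_toy`, and `over_toy` are made up for illustration; the imblearn samplers used later in the notebook do the same job with more options.

```python
import pandas as pd

# Toy frame: 8 majority rows ("no") and 2 minority rows ("yes").
df_toy = pd.DataFrame({"feature": range(10), "label": ["no"] * 8 + ["yes"] * 2})
majority = df_toy[df_toy["label"] == "no"]
minority = df_toy[df_toy["label"] == "yes"]

# Undersampling: keep a random subset of the majority, as large as the minority.
under_toy = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Oversampling: draw minority rows with replacement until they match the majority.
over_toy = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

print(under_toy["label"].value_counts().to_dict())
print(over_toy["label"].value_counts().to_dict())
```

Undersampling throws away majority rows, so it loses information; oversampling keeps every row but repeats minority rows, which can encourage overfitting to those duplicates.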

### Oversampling

1. Random oversampling
2. SMOTE

- First, let's look at random oversampling and random undersampling.

In [1]:
import pandas as pd 
import numpy as np


In [3]:
df = pd.read_csv("./../Dataset/GermanCreditDataset.csv")

In [4]:
df.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_duration,percent_of_income,years_at_residence,age,other_credit,housing,existing_loans_count,job,dependents,phone,default
0,< 0 DM,6,critical,furniture/appliances,1169,unknown,> 7 years,4,4,67,none,own,2,skilled,1,yes,no
1,1 - 200 DM,48,good,furniture/appliances,5951,< 100 DM,1 - 4 years,2,2,22,none,own,1,skilled,1,no,yes
2,unknown,12,critical,education,2096,< 100 DM,4 - 7 years,2,3,49,none,own,1,unskilled,2,no,no
3,< 0 DM,42,good,furniture/appliances,7882,< 100 DM,4 - 7 years,2,4,45,none,other,1,skilled,2,no,no
4,< 0 DM,24,poor,car,4870,< 100 DM,1 - 4 years,3,4,53,none,other,2,skilled,2,no,yes


In [5]:
!pip install imblearn

Defaulting to user installation because normal site-packages is not writeable
Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Collecting imbalanced-learn (from imblearn)
  Downloading imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Collecting sklearn-compat<1,>=0.1 (from imbalanced-learn->imblearn)
  Downloading sklearn_compat-0.1.3-py3-none-any.whl.metadata (18 kB)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Downloading imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
Downloading sklearn_compat-0.1.3-py3-none-any.whl (18 kB)
Installing collected packages: sklearn-compat, imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.13.0 imblearn-0.0 sklearn-compat-0.1.3





In [6]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

In [10]:
X = df.drop(columns = ["default"])
Y = df[["default"]]

In [11]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
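
With imbalanced labels, a plain random split can shift the class ratio slightly between train and test. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the ratio the same in both parts. The sketch below uses a toy stand-in (`X_toy`, `y_toy` are hypothetical names) with the same 700/300 class counts as the credit data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in with the same 700/300 class counts as the credit data.
X_toy = pd.DataFrame({"feature": range(1000)})
y_toy = pd.Series(["no"] * 700 + ["yes"] * 300, name="default")

# stratify=y_toy preserves the 70/30 class ratio in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```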

In [15]:
under = RandomUnderSampler(random_state=2)

In [16]:
Y.value_counts()

default
no         700
yes        300
Name: count, dtype: int64

In [17]:
x_train_under, y_train_under = under.fit_resample(x_train, y_train)

In [None]:
y_train.value_counts()   # class counts in the 70% training split, before resampling

default
no         486
yes        214
Name: count, dtype: int64

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_duration,percent_of_income,years_at_residence,age,other_credit,housing,existing_loans_count,job,dependents,phone
731,< 0 DM,24,good,furniture/appliances,1987,< 100 DM,1 - 4 years,2,4,21,none,rent,1,unskilled,2,no
716,unknown,30,critical,furniture/appliances,3077,unknown,> 7 years,3,2,40,none,own,2,skilled,2,yes
640,< 0 DM,18,good,education,750,< 100 DM,unemployed,4,1,27,none,own,1,unemployed,1,no
804,1 - 200 DM,12,good,car,7472,unknown,unemployed,1,2,24,none,rent,1,unemployed,1,no
737,< 0 DM,18,good,car,4380,100 - 500 DM,1 - 4 years,3,4,35,none,own,1,unskilled,2,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
767,unknown,10,good,car,2901,unknown,< 1 year,1,4,31,none,rent,1,skilled,1,no
72,< 0 DM,8,critical,car0,1164,< 100 DM,> 7 years,3,4,51,bank,other,2,management,2,yes
908,unknown,15,poor,car,3594,< 100 DM,< 1 year,1,2,46,none,own,2,unskilled,1,no
235,< 0 DM,24,good,furniture/appliances,1823,< 100 DM,unemployed,4,2,30,store,own,1,management,2,no


### SMOTE
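
SMOTE (Synthetic Minority Over-sampling Technique) does not duplicate minority rows. It creates new ones by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that core idea; it is a simplification, not imblearn's implementation, and `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)              # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # factor in [0, 1)

    # Each new point lies on the segment between a base point and its neighbour.
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_sketch(X_min, n_new=5)
print(synthetic.shape)
```

Because the new points come from arithmetic on feature values, this kind of interpolation only makes sense for numeric columns.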

In [22]:
smote = SMOTE(random_state = 2)

In [25]:
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)

ValueError: could not convert string to float: '< 0 DM'