## Imbalanced data
It is common to come across an imbalanced dataset while working on **classification problems** such as fraud detection, spam detection, or mapping natural resource occurrences.
An imbalanced dataset is one that contains an unequal number of samples from each class.

### Severe vs. slight imbalance
Depending on the ratio of samples belonging to the **minority** and **majority** classes, the imbalance can range from slight to severe.

It is reasonable to treat a dataset with a slight imbalance (e.g. 2:3) as if it were normal; however, more severe cases of imbalance (e.g. 1:4 or more) need to be corrected and require extra effort.

When training a model on a dataset with a slight imbalance, class weighting can be used to penalize the model more heavily for misclassifying minority-class samples, which makes it pay greater attention to the minority class.
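As a rough illustration of where such weights can come from, the sketch below computes weights inversely proportional to class frequency, `n_samples / (n_classes * class_count)`. This is the heuristic scikit-learn applies when an estimator is given `class_weight='balanced'`. The helper name `balanced_class_weights` and the 60/40 label list are hypothetical and only mirror the 2:3 example above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels with a slight (2:3) imbalance: 40 minority, 60 majority
labels = [0] * 60 + [1] * 40
weights = balanced_class_weights(labels)

# The minority class (1) receives the larger weight
print({cls: round(w, 2) for cls, w in weights.items()})
```

A dictionary like this can be passed to scikit-learn estimators that accept a `class_weight` argument, although passing `class_weight='balanced'` directly is usually simpler.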

### What does an imbalanced dataset do to a model? 
Because majority-class samples are so abundant in the training dataset, a model trained on an imbalanced dataset is dominated by the characteristics of the majority class.

Learning the characteristics of a class from a limited number of samples is difficult, so the model performs poorly on the minority class. This poor performance may not even show up in a metric like accuracy, an effect known as the **accuracy paradox**.

For imbalanced datasets, metrics such as accuracy can be misleading. Instead, use metrics that are less sensitive to the number of true negatives (where "negative" denotes the dominant class), such as precision, recall, and the F1 score. It is always a good idea to look at the confusion matrix.
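The accuracy paradox can be seen in a small sketch. The data is hypothetical (990 negative and 10 positive samples), and the "model" simply predicts the majority class for every sample. The confusion-matrix counts are computed by hand here; `sklearn.metrics.confusion_matrix` would give the same counts.

```python
# Hypothetical labels: 990 majority (0) and 10 minority (1) samples
y_true = [0] * 990 + [1] * 10
# A trivial "model" that always predicts the majority class
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)
# Recall and precision ignore true negatives; define them as 0 when undefined
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"confusion matrix: [[{tn}, {fp}], [{fn}, {tp}]]")
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, precision={precision:.2f}")
```

Accuracy is 0.99 even though not a single minority sample is detected, which is why recall and the confusion matrix are more informative here.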

## How to tackle an imbalanced dataset?
**This notebook will walk you through the steps for dealing with an imbalanced dataset using an example of a real project that I recently completed.**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
import os
# os.chdir('...\\imbalanced_data')
%config Completer.use_jedi = False
pd.options.display.float_format = "{:,.2f}".format

In [3]:
df=pd.read_csv('imbalanced_data.csv')

The first two columns contain the spatial information (coordinates). The next five columns are predictor variables, and the last column is the target variable.

In [4]:
df.head()

Unnamed: 0,x,y,geo_code,distance,vertical_der,analytic_signal,mag_field,Target
0,603412.5,6181212.4,10,4011.23,-0.02,0.16,28.89,0
1,603512.5,6181212.4,10,4005.0,-0.07,0.18,16.62,0
2,603612.5,6181212.4,10,4001.25,-0.13,0.2,2.78,0
3,603712.5,6181212.4,10,4000.0,-0.15,0.2,-8.25,0
4,603812.5,6181212.4,10,4001.25,-0.19,0.22,-16.52,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38133 entries, 0 to 38132
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   x                38133 non-null  float64
 1   y                38133 non-null  float64
 2   geo_code         38133 non-null  int64  
 3   distance         38133 non-null  float64
 4   vertical_der     38133 non-null  float64
 5   analytic_signal  38133 non-null  float64
 6   mag_field        38133 non-null  float64
 7   Target           38133 non-null  int64  
dtypes: float64(6), int64(2)
memory usage: 2.3 MB


### First step: Determine if imbalance exists and what is causing the imbalance
 
- If the problem is caused by data collection bias or inaccuracy (for example, samples obtained from only one class or samples improperly labelled), improving the sampling methods and measurements can fix the problem by obtaining more data from the minority class.


- In some cases, particularly in the study of natural resources, imbalance is a problem domain characteristic. It means the natural occurrence of one class is more dominant than other classes. Repeating the process or gathering more samples will not solve the problem in this scenario. 

In [6]:
df['Target'].value_counts()

0    37768
1      365
Name: Target, dtype: int64

**Insight:** The dataset is severely imbalanced, with less than 1% of the samples belonging to the target (minority) class. Since the target class represents the locations of natural resource occurrences in an area, such a severe imbalance is expected. As mentioned above, it is a problem domain characteristic.

In [7]:
print('Percentage of the minority class: {:.2f} %'.format((df['Target'].sum()/len(df))*100))

Percentage of the minority class: 0.96 %


### Second step: Resampling the dataset

### a) Undersampling 
If there are enough samples of the minority class (but significantly fewer than the number of majority class samples), some samples from the dominant class can be removed from the training dataset to achieve a balanced class distribution. This is referred to as under-sampling.

First we need to look at the number of samples from the minority class. If there are enough samples to train an algorithm, we can select the same number of samples from the majority class.

**Analysis:** For my dataset, there are only 365 samples from the target class. If I apply undersampling and select 365 samples from the majority class, I would have 730 samples in total, which does not seem enough to train and test an algorithm (considering the size of the majority class). In other words, 365 samples would not be a reliable representative sample of the 37,768 majority class samples.

The following cell describes the process, in case undersampling seems a suitable fix for your dataset.

In [8]:
# number of minority class samples
n_minority=len(df[df['Target']==1])

# keeps all samples from the minority class and draws the same number of samples at random from the majority class

undersampled_balanced_df= df.groupby('Target', as_index=False, group_keys=False ).apply(
    lambda x: x.sample(frac=1) if x.name==1 else x.sample(n=n_minority))

In [9]:
# total number of samples
len(undersampled_balanced_df)

730

In [10]:
# number of samples for each class in the balanced dataset (which is equal to the number of samples of the minority class)
undersampled_balanced_df['Target'].value_counts()

0    365
1    365
Name: Target, dtype: int64

In [11]:
X=undersampled_balanced_df.drop(['x','y','Target'], axis=1)
y=undersampled_balanced_df['Target']
X_train_under, X_test_under, y_train_under, y_test_under= train_test_split(X, y,stratify=y, test_size=0.3)

Now, a model can be trained on **X_train_under, y_train_under** and tested on **X_test_under, y_test_under**.

### b) Over-sampling:
The opposite of under-sampling, known as over-sampling, balances the classes by duplicating some samples from the minority class.

A common recommendation is to apply random undersampling first to reduce the number of samples in the majority class, and then oversample the minority class to balance the class distribution.


**It should be noted that any over-sampling process must be performed on the training data only, never on the test data,** because duplicated samples in the test data artificially inflate the measured accuracy. So, we need to set aside a test dataset first for evaluating the model, and then perform oversampling on the training data only.
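The effect of getting this order wrong can be sketched with plain Python. The example below is a toy illustration, not part of the project code: the ten integer ids stand in for minority-class rows, and names such as `leaked` and `clean` are hypothetical. Duplicating before splitting puts copies of the same row on both sides of the split, while splitting first keeps the test rows unseen.

```python
import random

rng = random.Random(0)
minority_ids = list(range(10))  # hypothetical minority rows, identified by id

# Wrong order: duplicate every row 10 times, then split 75/25
oversampled = minority_ids * 10
rng.shuffle(oversampled)
test_wrong, train_wrong = oversampled[:25], oversampled[25:]
leaked = set(test_wrong) & set(train_wrong)  # rows present in both sets

# Right order: split first, then duplicate the training rows only
ids = minority_ids[:]
rng.shuffle(ids)
test_right, train_right = ids[:3], ids[3:]
train_right_over = train_right * 10
clean = set(test_right) & set(train_right_over)

print(f"rows leaked when oversampling before the split: {len(leaked)}")
print(f"rows leaked when splitting before oversampling: {len(clean)}")
```

In the first case the model is evaluated partly on rows it has already seen during training, so the test score overstates how well it generalizes.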

RandomOverSampler (from imblearn library) creates an object to over-sample the minority class by picking samples at random with replacement.
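To make the idea concrete, here is a minimal pure-Python sketch of random over-sampling with replacement. It is not imblearn's implementation. The function name `random_oversample`, its `seed` parameter, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Sampling with replacement: the same row may be picked more than once
        extra = rng.choices(pool, k=target - count)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical training data: 8 majority rows and 2 minority rows
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

rows_over, labels_over = random_oversample(rows, labels)
print(Counter(labels_over))
```

Every added row is an exact copy of an existing minority row, so no new information is created. The resampling only changes how much weight the minority class carries during training.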

In [12]:
from imblearn.over_sampling import RandomOverSampler

In [13]:
# undersampling the majority class-- selects all samples from the minority class and 60 % of samples of the majority class 
# selecting 60 % was entirely arbitrary, and it can be changed to any fraction based on your dataset

df_u_frac= df.groupby('Target', as_index=False, group_keys=False ).apply(
    lambda x: x.sample(frac=1) if x.name==1 else x.sample(frac=0.6))

In [14]:
# X_test and y_test must remain intact during this process, so they must be created prior to oversampling

X= df_u_frac.drop(['x','y','Target'], axis=1)
y= df_u_frac['Target']

X_train, X_test, y_train, y_test= train_test_split(X, y,stratify=y, test_size=0.3)

In [15]:
# sampling_strategy='minority' specifies only the minority class is resampled. 
rovs= RandomOverSampler(sampling_strategy='minority')

# balanced training data
X_train_over, y_train_over= rovs.fit_resample(X_train, y_train)

Now, a model can be trained on **X_train_over, y_train_over** and tested on **X_test, y_test**.

In [16]:
# the number of samples for each class in the oversampled training data; both classes now match the
# majority class count in the training split (70% of the 60% majority class subsample)
y_train_over.value_counts()

0    15863
1    15863
Name: Target, dtype: int64

### c) Generate synthetic samples:
As an alternative to oversampling by duplication, synthetic samples can be generated for the minority class. These samples are not duplicates of the original samples, but they share the characteristics of the minority class.

**SMOTE** (Synthetic Minority Over-sampling Technique) and **SMOTETomek** are two examples of synthetic data generation algorithms. These algorithms sample attributes at random from relatively similar instances in the minority class.

- **SMOTE** works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and generating a new sample at a point along that line. 


- **SMOTETomek** combines over-sampling the minority class using SMOTE with a cleaning step based on Tomek links. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors in the feature space. Removing the linked samples (only the majority class member, or both members, depending on the configuration) makes the boundaries between classes less noisy or ambiguous.
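Both ideas can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not imblearn's implementation. The helpers `smote_point`, `nearest`, and `tomek_links` and the toy points are hypothetical, and the interpolation factor `gap` is fixed here, whereas SMOTE draws it at random between 0 and 1.

```python
import math


def smote_point(sample, neighbor, gap):
    """Return a synthetic point on the segment from `sample` to `neighbor` (0 <= gap <= 1)."""
    return tuple(s + gap * (n - s) for s, n in zip(sample, neighbor))


def nearest(index, points):
    """Index of the point closest to points[index], by Euclidean distance."""
    others = (j for j in range(len(points)) if j != index)
    return min(others, key=lambda j: math.dist(points[index], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) of mutual nearest neighbors that belong to different classes."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# SMOTE step: interpolate between a minority sample and one of its minority neighbors
synthetic = smote_point((1.0, 2.0), (3.0, 6.0), gap=0.25)
print(synthetic)  # (1.5, 3.0)

# Tomek step: points 2 and 3 are mutual nearest neighbors from opposite classes
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 5.5), (9.0, 9.0), (9.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]
links = tomek_links(points, labels)
print(links)  # [(2, 3)]
```

The synthetic point lies a quarter of the way along the line between the two minority samples, and the single Tomek link marks the pair that sits on the class boundary and is a candidate for removal.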

In [17]:
from imblearn.combine import SMOTETomek

In [18]:
smt = SMOTETomek(sampling_strategy='auto')

In [19]:
# selects all samples from the minority class and 60 % of samples of the majority class 
df_u_frac= df.groupby('Target', as_index=False, group_keys=False ).apply(
    lambda x: x.sample(frac=1) if x.name==1 else x.sample(frac=0.6))

In [20]:
X= df_u_frac.drop(['x','y','Target'], axis=1)
y= df_u_frac['Target']

# X_test and y_test remain intact during this process
X_train, X_test, y_train, y_test= train_test_split(X, y,stratify=y, test_size=0.3)

In [21]:
X_train_smt, y_train_smt = smt.fit_resample(X_train, y_train)

In [22]:
# the number of samples for each class in the training data; this is slightly lower than the majority class
# count in the training split because some samples were removed by the Tomek links step.
y_train_smt.value_counts()

0    15650
1    15650
Name: Target, dtype: int64

Now, a model can be trained on **X_train_smt, y_train_smt** and tested on **X_test, y_test**.