## HANDLING IMBALANCED DATASETS IN MACHINE LEARNING

A dataset is imbalanced when one class has many more samples than another. This can produce models that perform poorly in practice despite very high accuracy scores.

For example, take a dataset with 9,000 samples labelled false and 1,000 labelled true. A model programmed to output false all the time would still score 90% accuracy, even though it is completely bogus.

There are many techniques for addressing this problem; some of the common ones are listed below.
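The arithmetic behind this example can be checked with a few lines of plain Python. This is only a sketch: the label lists are made up to mirror the 9,000/1,000 split, and recall on the true class is included as one metric that exposes the problem.

```python
# Hypothetical labels mirroring the example: 9,000 False and 1,000 True.
y_true = [False] * 9000 + [True] * 1000

# A "model" that ignores its input and always predicts False.
y_pred = [False] * len(y_true)

# Accuracy: fraction of predictions that match the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the True class: fraction of actual positives the model finds.
true_positives = sum(t and p for t, p in zip(y_true, y_pred))
recall_true = true_positives / sum(y_true)

print(f"accuracy      = {accuracy:.2f}")     # accuracy      = 0.90
print(f"recall (True) = {recall_true:.2f}")  # recall (True) = 0.00
```

The accuracy looks good, but the model never finds a single true sample, which is why accuracy alone is misleading on imbalanced data.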

## 1. Undersampling the majority class

This involves reducing the size of the majority class by discarding some of its data points. One way is to randomly pick, from the majority class, as many samples as the minority class has, and train the model on that subset together with the minority class. For example, if the majority class has 9,000 false data points and the minority class has 1,000 true data points, you randomly pick 1,000 of the false samples, combine them with the 1,000 true samples, and train the model on the resulting 2,000.

## 2. Oversampling the minority class

This is the opposite of undersampling the majority class: in this technique we simply duplicate the minority class samples a number of times to increase their occurrence in the overall dataset. For example, if the majority class has 9,000 false data points and the minority class has 1,000 true data points, you repeat the 1,000 true samples 8 or 9 times to get a count close or equal to 9,000 and so balance the dataset.


## 3. Oversampling the minority class using SMOTE

SMOTE stands for Synthetic Minority Over-sampling Technique. It increases the size of the minority class by artificially generating new samples: each synthetic sample is interpolated between an existing minority sample and one of its k nearest neighbours (KNN) within the minority class. Ready-made libraries exist for this, such as [imbalanced-learn](https://pypi.org/project/imbalanced-learn/).
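To make the idea concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE, written in plain Python. It is not the imbalanced-learn implementation: the function name `smote_like_sample`, the toy points and the choice of `k=2` are all assumptions made for illustration.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=random):
    """Create one synthetic point between a random minority point
    and one of its k nearest minority-class neighbours."""
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the segment between base and neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]  # toy numeric features
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a line between two real minority points, the features must be numeric. Categorical columns need to be encoded before a technique like this can be applied.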

## 4. Ensemble method

This involves dividing the majority class into batches that are close to the size of the minority class, then randomly selecting a batch, combining it with the minority class and training a model on it. Repeat this until all batches are used. The predictions of the resulting models can then be combined, for example by majority vote.
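The batching step can be sketched as follows, using the 9,000/1,000 split from the earlier examples. The helper name `balanced_batches` and the placeholder samples are hypothetical. Training a model on each set and combining the models is left out, since that depends on the model being used.

```python
import random

def balanced_batches(majority, minority, seed=0):
    """Yield training sets that pair each majority-class batch
    with the full minority class."""
    rng = random.Random(seed)
    shuffled = majority[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)           # random assignment of samples to batches
    batch_size = len(minority)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size] + minority

# Placeholder samples standing in for real rows.
majority = [("maj", i) for i in range(9000)]
minority = [("min", i) for i in range(1000)]

training_sets = list(balanced_batches(majority, minority))
print(len(training_sets), len(training_sets[0]))  # 9 2000
```

Each majority sample ends up in exactly one training set, so no data is thrown away, while every individual model still sees a balanced set.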

## 5. Focal Loss

This is a special loss function that down-weights easy, well-classified examples (which mostly come from the majority class), so that hard examples, often from the minority class, carry more weight in the loss calculation. [Read more](https://medium.com/analytics-vidhya/how-focal-loss-fixes-the-class-imbalance-problem-in-object-detection-3d2e1c4da8d7#:~:text=Focal%20loss%20is%20very%20useful,is%20simple%20and%20highly%20effective)
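As a rough illustration, the binary form of focal loss for a single example can be written in a few lines of plain Python. The values `gamma=2.0` and `alpha=0.25` are commonly quoted defaults, used here as assumptions. A real training setup would use a vectorised version from a deep learning framework.

```python
import math

def cross_entropy(p, y):
    """Standard binary cross-entropy; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weighting term
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.05, 0)  # confident and correct: true class is 0
hard = focal_loss(0.05, 1)  # confident and wrong: true class is 1

print(f"easy example loss: {easy:.6f}")
print(f"hard example loss: {hard:.6f}")
```

The `(1 - p_t)**gamma` factor shrinks the loss of examples the model already classifies well, so the many easy majority-class samples no longer dominate the total loss. With `gamma=0` the expression reduces to a class-weighted cross-entropy.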

## Implementation in Python

In [40]:
import pandas as pd
import numpy as np

In [41]:
df = pd.read_csv("./database/combined_cleaned/c_to_h.csv")
df.drop(["index"], inplace=True, axis=1)

In [42]:
df.head()

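|   | AGGT | DIV | CIV | HIST | GEO | KISW | ENGL | PHY | CHEM | BIO | B/MATH | Course |
|---|------|-----|-----|------|-----|------|------|-----|------|-----|--------|--------|
| 0 | 19 | II | C | B | C | B | C | F | C | C | C | CBG |
| 1 | 20 | II | C | C | C | B | C | F | C | C | C | CBG |
| 2 | 17 | I | C | B | C | B | C | C | C | C | A | HGL |
| 3 | 25 | III | D | D | D | C | D | F | D | C | C | COMMUNITY DEVELOPMENT |
| 4 | 19 | II | C | C | D | C | C | D | B | B | C | PCB |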


In [43]:
pcm_df = df[df["Course"] == "PCM"]

In [44]:
npcm_df = df[df["Course"] != "PCM"]

In [45]:
pcm_df.shape

(881, 12)

In [46]:
npcm_df.shape

(6618, 12)

In [47]:
df["Course"].value_counts()

PCB                                                                         1322
PCM                                                                          881
CBG                                                                          857
HGL                                                                          585
EGM                                                                          455
                                                                            ... 
MASTER FISHERMAN                                                               1
PIPE WORKS,OIL AND GAS ENGINEERING                                             1
STASHAHADA MAALUMU YA UALIMU WA MASOMO YA SAYANSI (FIZIKIA NA BAIOLOJIA)       1
URBAN AND REGIONAL PLANNING                                                    1
MWANZA                                                                         1
Name: Course, Length: 129, dtype: int64

#### Clearly the dataset is imbalanced, so we need to rebalance it before training to get good models

## Method 1: Undersampling the majority class

DataFrame.sample(n) randomly picks n rows without replacement, so sampling as many non-PCM rows as there are PCM rows gives two groups of equal size.

In [48]:
npcm_df = npcm_df.sample(pcm_df.shape[0])
npcm_df.shape

(881, 12)

In [49]:
method_1 = pd.concat([npcm_df, pcm_df])

In [50]:
method_1.shape

(1762, 12)

## Method 2: Oversampling the minority class

With replace=True, DataFrame.sample() can return more rows than the DataFrame contains: rows are drawn at random with replacement, so some minority samples are duplicated until the requested number is reached. Without replace=True, asking for more rows than exist raises a ValueError.

In [51]:
pcm_df.shape

(881, 12)

In [52]:
pcm_df = pcm_df.sample(1500, replace=True)
pcm_df.shape

(1500, 12)

## Method 3: SMOTE

For this we'll need a library called imbalanced-learn ([installation process](https://pypi.org/project/imbalanced-learn/)).

In [56]:
from imblearn.over_sampling import SMOTE

In [59]:
X = df.drop(["Course"], axis=1)
y = df["Course"]

In [60]:
smote = SMOTE(sampling_strategy='minority')

# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
X_sm, y_sm = smote.fit_resample(X, y)

ValueError: could not convert string to float: 'II'
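
The error indicates that SMOTE expects numeric features, while columns such as DIV hold strings like 'II'. One possible way forward is to map the grades and divisions to integers before resampling. The sketch below does this on a few rows copied from the df.head() output. The `grade_map` and `division_map` dictionaries are assumptions about the grading scheme and should be adjusted to the real data. Any value missing from a map becomes NaN.

```python
import pandas as pd

# A few rows and columns copied from the df.head() output, for illustration.
sample = pd.DataFrame({
    "AGGT": [19, 20, 17],
    "DIV": ["II", "II", "I"],
    "CIV": ["C", "C", "C"],
    "PHY": ["F", "F", "C"],
    "Course": ["CBG", "CBG", "HGL"],
})

# Assumed ordinal encodings (hypothetical; check against the real scheme).
grade_map = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 5}
division_map = {"I": 1, "II": 2, "III": 3, "IV": 4}

X = sample.drop(columns=["Course"])
y = sample["Course"]

# Encode the division column, then every remaining letter-grade column.
X["DIV"] = X["DIV"].map(division_map)
grade_cols = [c for c in X.columns if c not in ("AGGT", "DIV")]
X[grade_cols] = X[grade_cols].apply(lambda col: col.map(grade_map))

print(X.dtypes)
```

Once every feature column is numeric, the SMOTE call can be retried. SMOTE builds new samples from nearest neighbours within a class, so courses with only one or two rows will probably need to be merged or dropped first.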