#### Handling Imbalanced Dataset with Machine Learning

* The ``Imbalanced`` dataset concept applies to ``Classification`` problems, not to Regression.

* An imbalanced dataset is one where the output variable has far more samples of one class than of the other, e.g. a huge difference between the number of ``True`` and ``False`` outputs.

* An imbalanced dataset usually doesn't have a huge impact on ``Ensemble`` techniques.

In [1]:
import pandas as pd
df=pd.read_csv('datasets/creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [2]:
df.isnull().sum()


Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [3]:
df.shape

(284807, 31)

In [4]:
# this shows imbalanced data in output
df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64
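
To put these counts in perspective, the short sketch below computes the minority share and the accuracy of a trivial model that always predicts class 0. It uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
n_majority = 284315  # Class 0
n_minority = 492     # Class 1
n_total = n_majority + n_minority

# Share of the minority class in the whole dataset.
minority_share = n_minority / n_total

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = n_majority / n_total

print(f"minority share:    {minority_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

Because this do-nothing baseline is already more than 99.8% accurate, accuracy alone says little on this dataset; per-class precision and recall are more informative.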

#### Independent and Dependent Features

In [2]:

X = df.drop("Class",axis=1)
y = df.Class

#### Cross Validation Like KFold and Hyperparameter Tuning

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.model_selection import KFold
import numpy as np
from sklearn.model_selection import GridSearchCV

In [8]:
10.0 **np.arange(-2,3)

#[10^-2, 10^-1, 10^0, 10^1, 10^2]

array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])

The base ``10.0`` must be a float, not an integer: NumPy raises a ``ValueError`` for integers raised to negative integer powers.
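
The float-base requirement can be checked directly. The sketch below also shows ``np.logspace`` as an alternative way to build the same grid; the variable names are illustrative.

```python
import numpy as np

exponents = np.arange(-2, 3)

# A float base works: the result is a float array.
c_values = 10.0 ** exponents

# An integer base fails, because an integer result cannot hold 10**-2.
try:
    10 ** exponents
    int_base_failed = False
except ValueError:
    int_base_failed = True

# np.logspace builds the same grid without the float/int pitfall.
c_logspace = np.logspace(-2, 2, num=5)

print(c_values, int_base_failed, np.allclose(c_values, c_logspace))
```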

In [4]:
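log_class=LogisticRegression()
# NOTE: the default 'lbfgs' solver does not support the 'l1' penalty, so the 'l1'
# candidates fail to fit; pass solver='liblinear' or 'saga' to actually search them.
grid={'C':10.0 **np.arange(-2,3),'penalty':['l1','l2']}
cv=KFold(n_splits=5,random_state=None,shuffle=False)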

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7)
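
The split above is random and unstratified, so the number of fraud cases that land in the training set changes from run to run (the outputs in this notebook show 346, 348 and 357). A minimal sketch, on made-up labels rather than the credit card data, of how ``stratify`` and ``random_state`` make the split reproducible and keep the class ratio in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 1000 rows, 5% positives.
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    train_size=0.7,
    stratify=y_toy,     # keep the 5% positive rate in both parts
    random_state=42,    # same split on every run
)

print(y_tr.sum(), y_te.sum())
```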

In [6]:
clf=GridSearchCV(log_class,grid,cv=cv,n_jobs=-1,scoring='f1_micro')
clf.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
                         'penalty': ['l1', 'l2']},
             scoring='f1_micro')
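
The convergence warning above suggests scaling the data or raising ``max_iter``. The following is a sketch of one way to do that, not the notebook's original setup: it uses synthetic data from ``make_classification``, puts ``StandardScaler`` in a pipeline, uses stratified folds, and scores on the F1 of the positive class (for single-label classification, ``f1_micro`` is the same number as plain accuracy). All names and settings here are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)

# Scaling inside the pipeline is re-fit on each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

c_grid = 10.0 ** np.arange(-2, 3)
param_grid = {"logisticregression__C": c_grid}

# Stratified folds keep some positives in every fold.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv_demo, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```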

In [7]:
y_pred=clf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print('\n')
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85269    39]
 [   40    95]]


0.9990754069964772
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85308
           1       0.71      0.70      0.71       135

    accuracy                           1.00     85443
   macro avg       0.85      0.85      0.85     85443
weighted avg       1.00      1.00      1.00     85443
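
The class-1 row of the report can be reproduced by hand from the confusion matrix printed above. This is a plain-Python check; the variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 85269, 39
fn, tp = 40, 95

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted frauds, how many were real
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The accuracy is dominated by the 85,269 correctly classified majority samples, while precision and recall show that roughly 30% of the fraud cases are still missed.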



### Using Random Forest

In [10]:
y_train.value_counts()

0    199016
1       348
Name: Class, dtype: int64

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train,y_train)

In [21]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85292     5]
 [   35   111]]
0.9995318516437859
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.96      0.76      0.85       146

    accuracy                           1.00     85443
   macro avg       0.98      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443



## Under Sampling

* **Under Sampling** is the process of making nearly balanced data by ``Reducing`` data points.

In [11]:
from collections import Counter
Counter(y_train)

Counter({0: 199016, 1: 348})

In [8]:
from collections import Counter
from imblearn.under_sampling import NearMiss
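# sampling_strategy must be passed by keyword in recent imbalanced-learn versions,
# and fit_sample has been replaced by fit_resample
ns = NearMiss(sampling_strategy=0.8)
X_train_ns,y_train_ns = ns.fit_resample(X_train,y_train)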
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))



The number of classes before fit Counter({0: 199007, 1: 357})
The number of classes after fit Counter({0: 446, 1: 357})


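The ``0.8`` passed to ``NearMiss`` is the target ratio of minority to majority samples, so the majority class is cut down to about ``minority / 0.8``. For the run above that is ``357 / 0.8 ≈ 446``. The check below appears to come from a different run of the random split, with 346 minority samples: ``346 / 0.8 ≈ 432``.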

In [31]:
0.8*432  

345.6
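
The arithmetic behind the float ratio can be sketched in plain Python. This is an illustration that is consistent with the counts printed above, not imbalanced-learn's own code; ``majority_after_undersampling`` is a hypothetical helper.

```python
def majority_after_undersampling(n_minority: int, ratio: float) -> int:
    """Majority count to keep so that minority / majority is about `ratio`."""
    return int(n_minority / ratio)

# Counts from the NearMiss output above (357 minority samples).
kept = majority_after_undersampling(357, 0.8)
print(kept, round(357 / kept, 3))

# The run with 346 minority samples used in the manual check.
print(majority_after_undersampling(346, 0.8))
```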

In [9]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [10]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[57091 28217]
 [    9   126]]
0.6696511124375315
              precision    recall  f1-score   support

           0       1.00      0.67      0.80     85308
           1       0.00      0.93      0.01       135

    accuracy                           0.67     85443
   macro avg       0.50      0.80      0.41     85443
weighted avg       1.00      0.67      0.80     85443



### Over Sampling
* **Over Sampling** is the process of making nearly balanced data by ``Increasing`` data points.

In [34]:
from imblearn.over_sampling import RandomOverSampler

In [35]:
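# renamed from `os` to avoid shadowing the standard-library module name;
# sampling_strategy is passed by keyword and fit_sample is now fit_resample
ros = RandomOverSampler(sampling_strategy=0.75, random_state=0)
X_train_ns,y_train_ns = ros.fit_resample(X_train,y_train)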
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))



The number of classes before fit Counter({0: 199018, 1: 346})
The number of classes after fit Counter({0: 199018, 1: 149263})


``149263 ≈ 199018 * 0.75``: minority samples are duplicated until the minority class reaches 75% of the majority class size.
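
What random over-sampling does can be sketched in plain Python: pick minority rows with replacement until the requested ratio is reached. This is a simplified illustration for the binary case, not the ``RandomOverSampler`` implementation; ``random_oversample`` and the toy labels are hypothetical.

```python
import random
from collections import Counter

def random_oversample(labels, ratio, minority=1, seed=0):
    """Return row indices that duplicate minority rows until minority/majority ~ ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    # Number of extra minority rows needed to reach the target ratio.
    n_extra = int(n_majority * ratio - len(minority_idx))
    extra = rng.choices(minority_idx, k=max(n_extra, 0))  # sampling with replacement
    return list(range(len(labels))) + extra

# Toy labels (an assumption): 1000 negatives and 20 positives.
toy_labels = [0] * 1000 + [1] * 20
idx = random_oversample(toy_labels, ratio=0.75)
resampled = Counter(toy_labels[i] for i in idx)
print(resampled)
```

Since the new rows are exact copies, a model can overfit to them, which is one reason to also try synthetic approaches.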

In [36]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [37]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85290     7]
 [   35   111]]
0.9995084442259752
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.94      0.76      0.84       146

    accuracy                           1.00     85443
   macro avg       0.97      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443



## SMOTETomek

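``SMOTETomek`` combines over-sampling with cleaning: SMOTE first creates synthetic minority samples by interpolating between a minority point and its ``nearest`` minority neighbours, then Tomek links (pairs of nearest neighbours from opposite classes) are removed. This is why both class counts change in the output below.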
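
The nearest-neighbour interpolation step can be sketched with NumPy. This is a simplified illustration of the idea on made-up points, not imbalanced-learn's implementation, and it leaves out the Tomek-link cleaning; ``smote_like_samples`` is a hypothetical helper.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances between minority points; ignore self-distance.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest per point

    base = rng.integers(0, len(X_min), size=n_new)              # random minority point
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one of its neighbours
    lam = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Toy minority class (made-up points for illustration).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_samples(X_minority, n_new=10)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.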

In [38]:
from imblearn.combine import SMOTETomek

In [39]:
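# renamed from `os` to avoid shadowing the standard-library module name;
# sampling_strategy is passed by keyword and fit_sample is now fit_resample
smt=SMOTETomek(sampling_strategy=0.75)
X_train_ns,y_train_ns=smt.fit_resample(X_train,y_train)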
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))



The number of classes before fit Counter({0: 199018, 1: 346})
The number of classes after fit Counter({0: 198349, 1: 148594})


In [40]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [41]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[85283    14]
 [   27   119]]
0.9995201479348805
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.89      0.82      0.85       146

    accuracy                           1.00     85443
   macro avg       0.95      0.91      0.93     85443
weighted avg       1.00      1.00      1.00     85443



## Ensemble Techniques

In [42]:
from imblearn.ensemble import EasyEnsembleClassifier

In [44]:
easy=EasyEnsembleClassifier()
easy.fit(X_train,y_train)

EasyEnsembleClassifier()

In [45]:
y_pred=easy.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[82449  2848]
 [   13   133]]
0.9665156888217876
              precision    recall  f1-score   support

           0       1.00      0.97      0.98     85297
           1       0.04      0.91      0.09       146

    accuracy                           0.97     85443
   macro avg       0.52      0.94      0.53     85443
weighted avg       1.00      0.97      0.98     85443



###### Sergey Feldman
Try this and implement it.

Let N be the number of samples in the rare class. Cluster the abundant class into N clusters (agglomerative clustering may be best here), and use the resulting cluster medoids/means as the training data for the abundant class. To be clear, you throw out the original training data from the abundant class, and use the medoids instead. Voila, now your classes are balanced! But your dataset is much smaller, so that might be an issue.
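
A minimal sketch of this suggestion, assuming scikit-learn's ``AgglomerativeClustering`` and cluster means (not medoids) on synthetic data. The data and all variable names are assumptions for illustration, not part of the quoted answer.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification

# Synthetic stand-in data (an assumption): about 10% rare class.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.9], random_state=0
)
X_rare, X_abundant = X_demo[y_demo == 1], X_demo[y_demo == 0]
n_rare = len(X_rare)

# Cluster the abundant class into as many clusters as there are rare samples.
cluster_ids = AgglomerativeClustering(n_clusters=n_rare).fit_predict(X_abundant)

# Replace the abundant class by one mean per cluster.
centroids = np.vstack(
    [X_abundant[cluster_ids == c].mean(axis=0) for c in range(n_rare)]
)

X_balanced = np.vstack([centroids, X_rare])
y_balanced = np.concatenate([np.zeros(n_rare, dtype=int), np.ones(n_rare, dtype=int)])
print(X_balanced.shape, np.bincount(y_balanced))
```

On the full credit card data, agglomerative clustering of roughly 199,000 majority samples is likely to need a lot of memory, so a cheaper clustering method or a subsample may be needed in practice.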