https://www.kaggle.com/mlg-ulb/creditcardfraud

https://pandas-ml.readthedocs.io/en/latest/modelframe.html

https://pandas-ml.readthedocs.io/en/latest/imbalance.html

In [17]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
  Downloading https://files.pythonhosted.org/packages/e5/4c/7557e1c2e791bd43878f8c82065bddc5798252084f26ef44527c02262af1/imbalanced_learn-0.4.3-py3-none-any.whl (166kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.4.3
You should consider upgrading via the 'pip install --upgrade pip' command.


In [18]:
!pip install -U pandas_ml

Requirement already up-to-date: pandas_ml in /anaconda3/lib/python3.6/site-packages (0.6.1)
You should consider upgrading via the 'pip install --upgrade pip' command.


## Introduction to pandas_ml through data handling

#### `ModelFrame` can call other statistics/ML functions in a simpler way
  * Creating a `ModelFrame` is similar to creating a pandas `DataFrame`

In [19]:
import pandas as pd
import numpy as np

import pandas_ml as pdml

## Handling an Imbalanced Dataset
#### Creating a `ModelFrame` whose `target` values are `0` and `1`, in proportions of `80%` and `20%` respectively

In [20]:
mf = pdml.ModelFrame(np.random.randn(100, 5),
                     target = np.array([0, 1]).repeat([80, 20]),
                     columns = list('ABCDE'))

In [21]:
mf.head()

Unnamed: 0,.target,A,B,C,D,E
0,0,0.110225,1.841503,-0.019234,1.351317,0.505961
1,0,-0.540562,-0.367921,1.762666,0.034601,1.591332
2,0,-0.85847,0.297982,0.08431,-0.980489,0.923869
3,0,-0.275151,0.761847,1.342142,-0.406492,0.400381
4,0,-0.065462,0.96148,1.072228,1.699816,-1.319097


In [22]:
type(mf)

pandas_ml.core.frame.ModelFrame

In [23]:
mf.target.value_counts()

0    80
1    20
Name: .target, dtype: int64


## Performing Under Sampling
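* Under-sampling reduces the number of data points in the majority class until it equals the number in the minority class. Because majority-class points are discarded or summarized, the under-sampled dataset loses information. The `ClusterCentroids` sampler used here replaces the majority-class points with k-means cluster centroids rather than dropping rows at random.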
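
To make the idea concrete, here is a minimal, dependency-free sketch of the simplest form of under-sampling: dropping majority-class rows at random. It is an illustration only, not what `ClusterCentroids` does internally; the function name `random_under_sample` and the toy data are assumptions introduced for this example.

```python
import random
from collections import Counter


def random_under_sample(rows, labels, seed=0):
    """Hypothetical helper: keep a random subset of each class so that
    every class ends up as small as the smallest one."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Size of the smallest (minority) class.
    n_min = min(len(group) for group in by_class.values())

    new_rows, new_labels = [], []
    for label, group in by_class.items():
        kept = rng.sample(group, n_min)  # sample without replacement
        new_rows.extend(kept)
        new_labels.extend([label] * n_min)
    return new_rows, new_labels


# Toy data with the same 80/20 split as the ModelFrame in this notebook.
data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
labels = [0] * 80 + [1] * 20

under_rows, under_labels = random_under_sample(rows, labels)
print(Counter(under_labels))
```

Sixty of the eighty majority-class rows are thrown away here, which is exactly the information loss described above.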

https://pandas-ml.readthedocs.io/en/latest/imbalance.html

In [24]:
sampler = mf.imbalance.under_sampling.ClusterCentroids()

In [25]:
sampler

ClusterCentroids(estimator=None, n_jobs=1, random_state=None, ratio=None,
         sampling_strategy='auto', voting='auto')

In [26]:
sampled = mf.fit_sample(sampler)

In [27]:
sampled.head()

Unnamed: 0,.target,A,B,C,D,E
0,0,0.973738,-0.26802,0.46673,1.181708,-0.590465
1,0,-1.42143,-0.561208,0.612815,-0.835204,-0.0673
2,0,-0.967507,-1.037346,-0.338297,1.231185,-0.655855
3,0,-0.083353,-1.332814,0.985812,-0.567888,0.613643
4,0,0.27363,1.744314,0.250174,0.825882,0.178074


In [28]:
sampled.target.value_counts()

1    20
0    20
Name: .target, dtype: int64

## Performing Over Sampling

### Oversampling using SMOTE
* Oversampling creates additional minority-class data points until the minority class is as large as the majority class. Naive oversampling creates the new points by copying existing ones, which may lead to overfitting. `SMOTE` (Synthetic Minority Over-sampling Technique) creates synthetic minority-class points instead: for a minority-class point, it takes the difference between that point's feature vector and the feature vector of one of its nearest minority-class neighbours, multiplies the difference by a random number between 0 and 1, and adds the result to the original point.
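
The interpolation step behind SMOTE can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sketch`, the helper `squared_distance`, and the toy data are all hypothetical, and the neighbour search is a brute-force sort.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_sketch(minority, n_new, k=5, seed=0):
    """Hypothetical helper: build n_new synthetic points by interpolating
    between a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random factor in [0, 1)
        # New point lies on the line segment between base and neighbour.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# 20 minority points with 5 features; 60 new points bring the class to 80.
data_rng = random.Random(1)
minority = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(20)]
synthetic = smote_sketch(minority, n_new=60)
print(len(minority) + len(synthetic))
```

Because the random factor stays between 0 and 1, every synthetic point falls between two existing minority-class points rather than duplicating one of them.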

In [29]:
sampler = mf.imbalance.over_sampling.SMOTE()
sampler

SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1,
   out_step='deprecated', random_state=None, ratio=None,
   sampling_strategy='auto', svm_estimator='deprecated')

In [30]:
sampled = mf.fit_sample(sampler)
sampled.head()

Unnamed: 0,.target,A,B,C,D,E
0,0,0.110225,1.841503,-0.019234,1.351317,0.505961
1,0,-0.540562,-0.367921,1.762666,0.034601,1.591332
2,0,-0.85847,0.297982,0.08431,-0.980489,0.923869
3,0,-0.275151,0.761847,1.342142,-0.406492,0.400381
4,0,-0.065462,0.96148,1.072228,1.699816,-1.319097


In [31]:
sampled.target.value_counts()

1    80
0    80
Name: .target, dtype: int64