# Under-Sampling -- Stacking Model

In [11]:
import pandas as pd
import numpy as np

## 1. Under-Sampling Process

Prepare the bootstrap under-samples for the modeling part. Each under-sample is the union of all fraud rows and an equally sized random draw of legitimate rows.

In [13]:
# Read Prepared dataset
data = pd.read_csv('prepared_data.csv')
data.head()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_binary |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.701805 | -0.357467 | -0.188847 | -0.106389 | -0.403155 | -0.438259 | 1 | 1 |
| 1 | -1.701805 | -0.357467 | -0.188847 | -0.106389 | -0.398142 | -0.438259 | 1 | 0 |
| 2 | -1.701805 | -0.099576 | -0.128591 | -0.106389 | -0.401952 | -0.427245 | 0 | 0 |
| 3 | -1.701805 | -0.115146 | -0.186762 | -0.106389 | -0.397848 | -0.438259 | 0 | 1 |
| 4 | -1.701805 | -0.00659 | -0.146456 | -0.106389 | -0.401672 | 0.143134 | 0 | 1 |


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2770409 entries, 0 to 2770408
Data columns (total 8 columns):
step              float64
amount            float64
oldbalanceOrg     float64
newbalanceOrig    float64
oldbalanceDest    float64
newbalanceDest    float64
isFraud           int64
type_binary       int64
dtypes: float64(6), int64(2)
memory usage: 169.1 MB


In [28]:
# Number of rows in the minority (fraud) class
total_isFraud = np.sum(data.isFraud)
# Indices of the fraud rows (isFraud == 1)
isFraud_indices = data[data.isFraud == 1].index
print("Number of Fraud transactions:", total_isFraud)

Number of Fraud transactions: 8213


In [27]:
# Pick the indices in the majority class
legitimate_indices = data[data.isFraud == 0].index

# Number of under samples
num = int(np.ceil((len(data) - total_isFraud)/total_isFraud))
print("Number of under samples:", num)

Number of under samples: 337
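
The count of 337 follows directly from the figures printed above. As a quick check, the sketch below redoes the arithmetic with the row counts hard-coded from the `data.info()` and fraud-count outputs; the variable names are illustrative and not part of the notebook.

```python
import math

n_total = 2770409            # rows reported by data.info()
n_fraud = 8213               # fraud rows (minority class)
n_legit = n_total - n_fraud  # legitimate rows (majority class)

# Number of fraud-sized draws needed to match the size of the majority class
num_under_samples = math.ceil(n_legit / n_fraud)

print(n_legit)            # 2762196
print(num_under_samples)  # 337
```

Because each draw is made with replacement, 337 draws do not guarantee that every legitimate row is used; the count only matches the total volume of the majority class.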


In [21]:
# Set the random seed to 1 for reproducibility
# Generate `num` (337) under-samples

under_samples = []
indices = []
np.random.seed(1)

for i in range(num):
    
    # Randomly select 8213 indices from the majority class (with replacement)
    random_legitimate_indices = np.random.choice(legitimate_indices, total_isFraud, replace=True)

    # Concatenate the fraud indices and the sampled legitimate indices
    under_sample_indices = np.concatenate([isFraud_indices, random_legitimate_indices])
    under_sample = data.iloc[under_sample_indices, :]
    indices.append(under_sample_indices)
    
    under_samples.append(under_sample)

In [26]:
print("Size of one under-sample:", len(under_samples[0]))

Size of one under-sample: 16426
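
Each under-sample is intended to be balanced: every fraud row plus an equal number of randomly drawn legitimate rows, giving 2 × 8213 = 16426 rows. The following is a minimal sketch of that idea on a small, made-up frame. The `toy` data and the use of `np.random.default_rng` are assumptions for illustration only; the notebook itself uses `np.random.seed` and the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 legitimate rows followed by 4 fraud rows
toy = pd.DataFrame({
    "amount": np.arange(24, dtype=float),
    "isFraud": [0] * 20 + [1] * 4,
})

fraud_idx = toy.index[toy["isFraud"] == 1].to_numpy()
legit_idx = toy.index[toy["isFraud"] == 0].to_numpy()
n_fraud = len(fraud_idx)

rng = np.random.default_rng(1)
# Draw as many legitimate rows as there are fraud rows, with replacement
sampled_legit = rng.choice(legit_idx, size=n_fraud, replace=True)

# Union of all fraud rows and the sampled legitimate rows
balanced = toy.loc[np.concatenate([fraud_idx, sampled_legit])]

print(len(balanced))               # 8, i.e. 2 * n_fraud
print(balanced["isFraud"].mean())  # 0.5, so the classes are balanced
```

Selecting rows with `.loc` uses index labels, so the sketch would also behave correctly if the frame's index were not a plain `RangeIndex`.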


## 2. Stacking

In this section, we prepare three learning models as our first-level classifiers. All three are available in the scikit-learn library:
* Random Forest Classifier
* Extra Trees Classifier
* Gradient Boosting Classifier

There are 337 under-samples, each with 16426 rows. The first-level function processes every under-sample in turn. First, the under-sample is split into an 80% development set and a 20% test set. The development set is then used to train the 3 classifiers with 5-fold cross-validation, and the predictions from the 5 folds form the new training dataset for the second-level model.
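
The first-level step for a single under-sample can be sketched as follows. This is an illustrative outline under stated assumptions, not the notebook's own function: the data come from `make_classification` as a stand-in for one balanced under-sample, the estimators use a small `n_estimators` so the example runs quickly, and the held-out predictions are stored as fraud probabilities (the text does not say whether probabilities or class labels are used).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for one balanced under-sample (hypothetical data)
X, y = make_classification(n_samples=400, n_features=7, random_state=1)

# 80% development / 20% test split, stratified on the label
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

models = [
    RandomForestClassifier(n_estimators=20, random_state=1),
    ExtraTreesClassifier(n_estimators=20, random_state=1),
    GradientBoostingClassifier(n_estimators=20, random_state=1),
]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# One column of held-out-fold predictions per first-level model
meta_train = np.zeros((len(X_dev), len(models)))
for j, model in enumerate(models):
    for train_idx, valid_idx in skf.split(X_dev, y_dev):
        # Fit on four folds, predict the fifth
        model.fit(X_dev[train_idx], y_dev[train_idx])
        meta_train[valid_idx, j] = model.predict_proba(X_dev[valid_idx])[:, 1]

print(meta_train.shape)  # (320, 3): one row per development row, one column per model
```

Every row of `meta_train` is predicted by a model that never saw that row during fitting, which is what makes the matrix usable as training input for a second-level model such as logistic regression.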

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report