
### This notebook is based on exercises from the book Ensemble Machine Learning Cookbook.

StackNet is available under the MIT licence. It's a scalable and analytical framework that resembles a feed-forward neural network, and uses Wolpert's stacked-generalization concept to improve accuracy in machine learning predictive tasks. It uses the notion of meta-learners, in that it uses the predictions of some algorithms as features for other algorithms. StackNet can also generalize stacking on multiple levels. It is, however, computationally intensive. It was originally developed in Java, but a lighter Python version of StackNet, named pystacknet, is now available as well.

Let's think about how StackNet works. In a neural network, the output of one layer is fed as input to the next layer, and an activation function such as sigmoid, tanh, or ReLU is applied. StackNet follows the same layout, except that the activation functions can be replaced with any supervised machine learning algorithm.
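
The layer-to-layer idea can be sketched by hand with plain scikit-learn. This is only an illustration of stacked generalization, not pystacknet's internal code: the synthetic dataset, the two level-0 models, and the names `X_demo`, `level0_out`, and `meta` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Level 0: each "neuron" is a supervised model instead of an activation function.
level0 = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold class-1 probabilities play the role of the layer's outputs.
level0_out = np.column_stack([
    cross_val_predict(m, X_demo, y_demo, cv=4, method="predict_proba")[:, 1]
    for m in level0
])

# Level 1: the meta-learner is trained on the level-0 outputs.
meta = LogisticRegression(max_iter=1000).fit(level0_out, y_demo)
print(level0_out.shape)  # (400, 2)
```

Using out-of-fold predictions keeps the meta-learner from seeing predictions that a level-0 model made on its own training rows.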

The stacking element can be run in two modes: a normal stacking mode and a re-stacking mode. In normal stacking mode, each layer uses only the predictions of the previous layer. In re-stacking mode, each layer uses the predictions of all previous layers together with the original input features.
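
The difference between the two modes shows up in the width of the matrix handed to the next level. The sketch below uses random numbers with 23 input features and three level-0 models, the sizes used later in this notebook. The column order inside the re-stacked matrix is an assumption, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 5

X_input = rng.normal(size=(n_rows, 23))        # 23 original features
level0_preds = rng.uniform(size=(n_rows, 3))   # one column per level-0 model

# Normal stacking: level 1 sees only the level-0 predictions.
normal_input = level0_preds

# Re-stacking: level 1 sees the original features plus the level-0 predictions.
restack_input = np.hstack([X_input, level0_preds])

print(normal_input.shape)   # (5, 3)
print(restack_input.shape)  # (5, 26)
```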

In [7]:
import os
!git clone https://gitlab.com/YannBerthelot/kaggle_pystacknet.git
print(os.listdir("kaggle_pystacknet/pystacknet"))
!pip install "kaggle_pystacknet/pystacknet"
import pystacknet

fatal: destination path 'kaggle_pystacknet' already exists and is not an empty directory.
['pystacknet', 'LICENSE.txt', 'README.md', 'setup.py']
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./kaggle_pystacknet/pystacknet
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: pystacknet
  Building wheel for pystacknet (setup.py) ... done
  Created wheel for pystacknet: filename=pystacknet-0.0.1-py3-none-any.whl size=21925 sha256=0460a9103a33ac6946de328c32bfa232ebdf381d61d4974c01005c2af73b1309
  Stored in directory: /tmp/pip-ephe

In [8]:
#import required libraries

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score,log_loss
from sklearn.model_selection import StratifiedKFold
import joblib
import sys
sys.modules['sklearn.externals.joblib'] = joblib
from pystacknet.pystacknet import StackNetClassifier, StackNetRegressor
from pystacknet.metrics import rmse,mae

In [9]:
# Read the data from GitHub. Use the raw-file URL, which differs from the normal page URL.
import pandas as pd
pd.options.display.max_rows=None
pd.options.display.max_columns=None
url = 'https://raw.githubusercontent.com/PacktPublishing/Ensemble-Machine-Learning-Cookbook/master/Chapter08/UCI_Credit_Card.csv'
df_creditcarddata= pd.read_csv(url)
#df = pd.read_csv(url)
print(df_creditcarddata.head(5))

   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0   1    20000.0    2          2         1   24      2      2     -1     -1   
1   2   120000.0    2          2         2   26     -1      2      0      0   
2   3    90000.0    2          2         2   34      0      0      0      0   
3   4    50000.0    2          2         1   37      0      0      0      0   
4   5    50000.0    1          2         1   57     -1      0     -1      0   

   PAY_5  PAY_6  BILL_AMT1  BILL_AMT2  BILL_AMT3  BILL_AMT4  BILL_AMT5  \
0     -2     -2     3913.0     3102.0      689.0        0.0        0.0   
1      0      2     2682.0     1725.0     2682.0     3272.0     3455.0   
2      0      0    29239.0    14027.0    13559.0    14331.0    14948.0   
3      0      0    46990.0    48233.0    49291.0    28314.0    28959.0   
4      0      0     8617.0     5670.0    35835.0    20940.0    19146.0   

   BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  \
0   

In [10]:
#Drop ID columns
df_creditcarddata.drop(['ID'],axis=1,inplace=True)
#Check shape of data
df_creditcarddata.shape

(30000, 24)

In [11]:
# Create feature and response variables
X=df_creditcarddata.iloc[:,0:23]

Y=df_creditcarddata['default.payment.next.month']
print(X.shape)
print(Y.shape)


(30000, 23)
(30000,)
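
Positional slicing with `iloc[:,0:23]` works here because the target is the last column. A name-based selection is a little more robust if the column order ever changes. The sketch below uses a made-up two-row frame (the values are not from the dataset) and the hypothetical names `demo`, `X_by_name`, and `y_by_name`.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
demo = pd.DataFrame({
    "LIMIT_BAL": [1000.0, 2000.0],
    "AGE": [30, 40],
    "default.payment.next.month": [0, 1],
})

target = "default.payment.next.month"

# Everything except the target becomes the feature matrix.
X_by_name = demo.drop(columns=[target])
y_by_name = demo[target]

print(X_by_name.shape)  # (2, 2)
print(y_by_name.shape)  # (2,)
```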


In [13]:
from sklearn.model_selection import train_test_split
#We first split the dataset into train and test subset
X_train, X_test, Y_train, Y_test=train_test_split(X, Y, test_size=0.2,random_state=1)

#Optionally, carve a validation set out of the train subset (left disabled here)
#X_train, X_val, Y_train,Y_val=train_test_split(X_train,Y_train,test_size=0.2,random_state=1)
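
If the validation split is enabled, passing `stratify` keeps the class ratio the same in every subset, which can matter for an imbalanced target. The following is a minimal sketch on toy arrays. The 80/20 class mix and the names `X_toy`, `y_toy`, `X_tr`, `X_va`, and `X_te` are assumptions for illustration, and the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives, 20 positives.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# First split: hold out 20% as the test set, preserving the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Second split: carve a validation set out of the remaining training rows.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.2, random_state=1, stratify=y_tr
)

print(len(y_tr), len(y_va), len(y_te))  # 64 16 20
```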

In [14]:
#Define models for base learner and meta learner
models=[[DecisionTreeClassifier(criterion='entropy',max_depth=5,max_features=0.5,random_state=1),
         GradientBoostingClassifier(n_estimators=100,learning_rate=0.1,max_depth=5,max_features=0.5,random_state=1),
         LogisticRegression(random_state=1)],
        [RandomForestClassifier(n_estimators=500,criterion='entropy',max_depth=5,max_features=0.5,random_state=1)]]
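
The `LogisticRegression` base learner above runs with default settings on unscaled features, which is a common cause of scikit-learn's "iterations reached limit" warning. One possible remedy is to standardize the inputs and raise `max_iter`, sketched below on synthetic data. Whether pystacknet accepts a `Pipeline` in place of a bare estimator has not been verified here, so treat that substitution as an assumption. The names `X_demo`, `scaled_lr`, and `n_iter` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with very different column scales (illustration only).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo = X_demo * np.logspace(0, 5, num=10)

# Standardize first, then fit; max_iter is raised above the default of 100.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=1),
)
scaled_lr.fit(X_demo, y_demo)

# Number of solver iterations actually used by the logistic regression step.
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print(n_iter)
```

If the solver stops well below `max_iter`, it converged before hitting the iteration cap.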

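We now use StackNetClassifier to build the stacking ensemble. Note that we pass restacking=True, which selects the re-stacking mode: level 1 receives the 23 original features plus the 3 level-0 predictions, for 26 inputs in total. Passing restacking=False would select the normal stacking mode instead.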

In [17]:
model=StackNetClassifier(models, metric="accuracy",folds=4,restacking=True,use_retraining=True,use_proba=True,random_state=12345,n_jobs=1,verbose=1)
model.fit(X_train,Y_train)
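#restacking=True selects the re-stacking mode; restacking=False would select normal stacking.
# Use the meta-learner (level 1) to predict class-1 probabilities for the test set

preds=model.predict_proba(X_test)[:,-1]
# Note: the printed value is ROC AUC, not accuracy, and this run uses restacking=True despite the label
print("Test accuracy without restacking, auc %f " % (roc_auc_score(Y_test, preds)))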

Input Dimensionality 23 at Level 0 
3 models included in Level 0 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Level 0, fold 1/4 , model 0 , accuracy===0.820667 
Level 0, fold 1/4 , model 1 , accuracy===0.818000 
Level 0, fold 1/4 , model 2 , accuracy===0.775833 


Level 0, fold 2/4 , model 0 , accuracy===0.806500 
Level 0, fold 2/4 , model 1 , accuracy===0.809667 
Level 0, fold 2/4 , model 2 , accuracy===0.774333 


Level 0, fold 3/4 , model 0 , accuracy===0.829833 
Level 0, fold 3/4 , model 1 , accuracy===0.828333 
Level 0, fold 3/4 , model 2 , accuracy===0.782500 


Level 0, fold 4/4 , model 0 , accuracy===0.822000 
Level 0, fold 4/4 , model 1 , accuracy===0.822333 
Level 0, fold 4/4 , model 2 , accuracy===0.783833 
Level 0, model 0 , accuracy===0.819750 
Level 0, model 1 , accuracy===0.819583 
Level 0, model 2 , accuracy===0.779125 


Output dimensionality of level 0 is 3 
 level 0 lasted 36.324414 seconds 
Input Dimensionality 26 at Level 1 
1 models included in Level 1 
Level 1, fold 1/4 , model 0 , accuracy===0.821500 
Level 1, fold 2/4 , model 0 , accuracy===0.807333 
Level 1, fold 3/4 , model 0 , accuracy===0.830833 
Level 1, fold 4/4 , model 0 , accuracy===0.825667 
Level 1, model 0 , accuracy===0.821333 
Output dimensionality of level 1 is 1 
 level 1 lasted 221.006297 seconds 
 fit() lasted 257.339550 seconds 
1 estimators included in Level 0 
1 estimators included in Level 1 


  f"X has feature names, but {self.__class__.__name__} was fitted without"


Test accuracy without restacking, auc 0.782402 
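
The figure printed above comes from `roc_auc_score`, so it is an AUC rather than an accuracy. The two metrics can differ considerably on the same predictions, as the small hand-made example below shows. The five labels and probabilities are invented for illustration, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.45, 0.8])

# AUC measures ranking quality and needs no threshold.
auc = roc_auc_score(y_true, proba)

# Accuracy needs hard labels, here obtained with a 0.5 cut-off.
acc = accuracy_score(y_true, (proba >= 0.5).astype(int))

print(round(auc, 4), acc)  # 0.8333 0.6
```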
