<a href="https://colab.research.google.com/github/Chirag314/Stacked-creditcarddata/blob/main/Stacked_creditcarddata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This notebook is based on exercises from the book Ensemble Machine Learning Cookbook.

Stacked generalization is an ensemble of a diverse group of models that introduces the concept of a meta-learner. A meta-learner is a second-level machine learning algorithm that learns how best to combine the predictions of the base learners.

The steps for stacking are as follows:

1. Split your dataset into a training set and a testing set.
2. Train several base learners on the training set.
3. Apply the base learners on the testing set to make predictions.
4. Use the predictions as inputs and the actual responses as outputs to train a higher-level learner.

In this notebook, a separate validation set plays the role of the testing set in steps 3 and 4, so that the final test set stays unseen until evaluation.

Stacked generalization is used mainly to minimize the generalization error of the base learners, and can be seen as a refined version of cross-validation. Where cross-validation takes a winner-takes-all approach to choosing among models, stacking uses a more sophisticated strategy to combine the predictions from the base learners.

In this section, we'll look at how to implement stacked generalization from scratch.

We will carry out the following steps to get started:

1. Build three base learners for stacking.
2. Combine the predictions from each of the base learners.
3. Build the meta-learner using another algorithm.
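The three steps above can be sketched end to end on synthetic data before touching the credit card dataset. This is only an illustrative sketch: the data comes from `make_classification`, and the names `X_demo`, `stack_predictions`, and `meta_demo` are hypothetical and do not appear in the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set, then carve a validation set out of the remainder
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Step 1: fit three base learners on the training subset
base_models = [GaussianNB(), KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)]
for model in base_models:
    model.fit(X_tr, y_tr)

# Step 2: combine the base-learner predictions, one column per learner
def stack_predictions(models, X):
    return np.column_stack([m.predict(X) for m in models])

meta_X_va = stack_predictions(base_models, X_va)
meta_X_te = stack_predictions(base_models, X_te)

# Step 3: train the meta-learner on the validation predictions,
# then evaluate it on the stacked test predictions
meta_demo = LogisticRegression().fit(meta_X_va, y_va)
demo_accuracy = accuracy_score(y_te, meta_demo.predict(meta_X_te))
print(meta_X_va.shape, meta_X_te.shape, demo_accuracy)
```

The meta-learner only ever sees three columns, one per base learner, regardless of how many features the original data has.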

In [2]:
# Import required libraries

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from xgboost import XGBClassifier
from xgboost import plot_tree
from xgboost import plot_importance
from sklearn.feature_selection import SelectFromModel

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split,KFold,cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score,r2_score,roc_curve, auc,accuracy_score
from sklearn.preprocessing import MinMaxScaler
import itertools

from sklearn import metrics

In [1]:
# Read the data from GitHub. Use the raw-file URL, which differs from the normal page URL.
import pandas as pd
pd.options.display.max_rows = None
pd.options.display.max_columns = None
url = 'https://raw.githubusercontent.com/PacktPublishing/Ensemble-Machine-Learning-Cookbook/master/Chapter08/UCI_Credit_Card.csv'
df_creditcarddata = pd.read_csv(url)
print(df_creditcarddata.head(5))

   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0   1    20000.0    2          2         1   24      2      2     -1     -1   
1   2   120000.0    2          2         2   26     -1      2      0      0   
2   3    90000.0    2          2         2   34      0      0      0      0   
3   4    50000.0    2          2         1   37      0      0      0      0   
4   5    50000.0    1          2         1   57     -1      0     -1      0   

   PAY_5  PAY_6  BILL_AMT1  BILL_AMT2  BILL_AMT3  BILL_AMT4  BILL_AMT5  \
0     -2     -2     3913.0     3102.0      689.0        0.0        0.0   
1      0      2     2682.0     1725.0     2682.0     3272.0     3455.0   
2      0      0    29239.0    14027.0    13559.0    14331.0    14948.0   
3      0      0    46990.0    48233.0    49291.0    28314.0    28959.0   
4      0      0     8617.0     5670.0    35835.0    20940.0    19146.0   

   BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  \
0   

In [3]:
# Check the shape of the data
df_creditcarddata.shape

(30000, 25)

In [4]:
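# Create feature and response variables
# Note: iloc[:, 0:23] keeps columns 0-22, so it includes ID and leaves out PAY_AMT6 (column 23)
X = df_creditcarddata.iloc[:, 0:23]

Y = df_creditcarddata['default.payment.next.month']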
print(X.shape)
print(Y.shape)


(30000, 23)
(30000,)
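Positional slicing makes it easy to keep an identifier column or drop a real feature by accident. One alternative is to select features by name. The sketch below uses a toy frame with hypothetical values; only the column names are taken from the dataset, and `toy`, `X_toy`, and `y_toy` are illustrative names.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [10000.0, 20000.0, 30000.0],
    "PAY_AMT6": [0.0, 500.0, 1000.0],
    "default.payment.next.month": [1, 0, 0],
})

target_col = "default.payment.next.month"

# Drop the identifier and the response by name; everything else is a feature
X_toy = toy.drop(columns=["ID", target_col])
y_toy = toy[target_col]

print(list(X_toy.columns))
print(X_toy.shape, y_toy.shape)
```

Applying the same idea to `df_creditcarddata` would change the feature set, so the accuracies reported later in this notebook would no longer be directly comparable.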


In [5]:
# Check for missing values
df_creditcarddata.isnull().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64

In [7]:
# First split the dataset into train and test subsets (10% held out for testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=1)

# Then carve a validation set out of the train subset (20% of the remaining rows)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=1)

In [8]:
# Check the dimensions of each subset to make sure the split is correct
# Dimensions of the train subsets
print(X_train.shape)
print(Y_train.shape)

# Dimensions of the validation subsets
print(X_val.shape)
print(Y_val.shape)

# Dimensions of the test subsets
print(X_test.shape)
print(Y_test.shape)

(21600, 23)
(21600,)
(5400, 23)
(5400,)
(3000, 23)
(3000,)
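The subset sizes follow directly from the two `test_size` values. As a quick arithmetic check (the variable names here are illustrative only):

```python
n_rows = 30000                  # rows reported by df_creditcarddata.shape
n_test = round(n_rows * 0.1)    # first split: test_size=0.1
n_rest = n_rows - n_test        # rows left for train + validation
n_val = round(n_rest * 0.2)     # second split: test_size=0.2 of the remainder
n_train = n_rest - n_val

print(n_train, n_val, n_test)   # 21600 5400 3000
```

So the effective proportions are 72% train, 18% validation, and 10% test.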


In [9]:
# Import required libraries
# For the base learners
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# For the meta-learner
from sklearn.linear_model import LogisticRegression

In [10]:
# Create instances of the base learners and fit them on our training data
# The base learners
model_1 = GaussianNB()
model_2 = KNeighborsClassifier(n_neighbors=1)
model_3 = DecisionTreeClassifier()

# Now we train each of the models
base_learner_1 = model_1.fit(X_train, Y_train)
base_learner_2 = model_2.fit(X_train, Y_train)
base_learner_3 = model_3.fit(X_train, Y_train)

In [11]:
# We then use the models to make predictions on the validation set
val_prediction_base_learner_1=base_learner_1.predict(X_val)
val_prediction_base_learner_2=base_learner_2.predict(X_val)
val_prediction_base_learner_3=base_learner_3.predict(X_val)

In [12]:
# And then use the predictions to create a new stacked dataset
import numpy as np
# np.dstack on 1-D arrays returns an array of shape (1, n_samples, n_learners)
prediction_val_stack = np.dstack([val_prediction_base_learner_1, val_prediction_base_learner_2, val_prediction_base_learner_3])

# Now we stack the actual outcomes as a fourth column
final_train_stack = np.dstack([prediction_val_stack, Y_val])

We convert the final_train_stack stacked array to a DataFrame and add column names to each of the columns. Verify the dimensions and take a look at the first few rows:

In [13]:
stacked_train_dataframe=pd.DataFrame(final_train_stack[0,0:5400],columns='NB_val KNN_val DT_val Y_val'.split())

print(stacked_train_dataframe.shape)
print(stacked_train_dataframe.head())

(5400, 4)
   NB_val  KNN_val  DT_val  Y_val
0       1        0       0      0
1       1        1       0      1
2       1        0       0      0
3       1        0       1      1
4       1        0       0      0
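The `[0, 0:5400]` indexing is needed because `np.dstack` adds a leading axis of length 1 when given 1-D arrays. The small sketch below, using three hypothetical prediction vectors, shows the resulting shape and compares it with `np.column_stack`, which yields the two-dimensional layout directly.

```python
import numpy as np

# Three hypothetical prediction vectors for four samples
p1 = np.array([1, 1, 0, 1])
p2 = np.array([0, 1, 0, 0])
p3 = np.array([0, 0, 1, 1])

depth_stacked = np.dstack([p1, p2, p3])          # shape (1, 4, 3): extra leading axis
column_stacked = np.column_stack([p1, p2, p3])   # shape (4, 3): one row per sample

print(depth_stacked.shape, column_stacked.shape)
# Dropping the leading axis of the dstack result gives the same matrix
print(np.array_equal(depth_stacked[0], column_stacked))
```

Either form works as input to `pd.DataFrame` once it is two-dimensional, with one column per base learner.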


In [15]:
# Train the meta-learner on the stacked validation predictions
meta_learner = LogisticRegression()
meta_learner_model = meta_learner.fit(stacked_train_dataframe.iloc[:, 0:3], stacked_train_dataframe['Y_val'])

In [16]:
# Take the test data (new data)
# Apply the base learners on this new data to make predictions
# We now use the models to make predictions on the test data and create a new stacked dataset

test_prediction_base_learner_1=base_learner_1.predict(X_test)
test_prediction_base_learner_2=base_learner_2.predict(X_test)
test_prediction_base_learner_3=base_learner_3.predict(X_test)

# Create the stacked data
final_test_stack=np.dstack([test_prediction_base_learner_1,test_prediction_base_learner_2,test_prediction_base_learner_3])

Convert the final_test_stack stacked array to a DataFrame and add column names to each of the columns. Verify the dimensions and take a look at the first few rows:

In [18]:
stacked_test_dataframe=pd.DataFrame(final_test_stack[0,0:3000],columns='NB_TEST KNN_TEST DT_TEST'.split())
print(stacked_test_dataframe.shape)
print(stacked_test_dataframe.head())

(3000, 3)
   NB_TEST  KNN_TEST  DT_TEST
0        1         0        0
1        1         1        0
2        0         0        0
3        1         0        1
4        1         0        0


In [19]:
# Check the accuracy of each base learner on our original test data
test_prediction_base_learner_1=base_learner_1.predict(X_test)
test_prediction_base_learner_2=base_learner_2.predict(X_test)
test_prediction_base_learner_3=base_learner_3.predict(X_test)

print("Accuracy from GaussianNB :",accuracy_score(Y_test,test_prediction_base_learner_1))
print("Accuracy from KNN :", accuracy_score(Y_test, test_prediction_base_learner_2))
print("Accuracy from Decision tree: ",accuracy_score(Y_test, test_prediction_base_learner_3))

Accuracy from GaussianNB : 0.398
Accuracy from KNN : 0.696
Accuracy from Decision tree:  0.7433333333333333
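Accuracy is simply the fraction of predictions that match the true labels. As a small worked illustration on hypothetical labels (not taken from the dataset), `confusion_matrix` shows where the errors fall alongside the single accuracy figure:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = [0, 1, 1, 0]   # hypothetical true labels
y_pred_demo = [0, 1, 0, 0]   # hypothetical predictions

# 3 of the 4 predictions match, so accuracy is 0.75
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print(acc_demo)
print(cm_demo)
```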


In [20]:
# Use the meta-learner on the stacked test data and check the accuracy
# The meta-learner was fitted on columns named NB_val, KNN_val, DT_val, so rename the
# test columns to match; otherwise scikit-learn reports a feature-name mismatch
meta_features_test = stacked_test_dataframe.rename(
    columns={'NB_TEST': 'NB_val', 'KNN_TEST': 'KNN_val', 'DT_TEST': 'DT_val'})
test_predictions_meta_learner = meta_learner_model.predict(meta_features_test)
print("Accuracy from meta learner:", accuracy_score(Y_test, test_predictions_meta_learner))

Accuracy from meta learner: 0.7746666666666666

The meta-learner applied to the stacked test data returns the output above. This accuracy is higher than that of any individual base learner.
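For comparison, scikit-learn also ships a `StackingClassifier` that wires base learners and a meta-learner together. The sketch below runs on synthetic data and is not a drop-in replacement for the manual procedure in this notebook: it trains the final estimator on cross-validated predictions rather than on a single validation split, and the names `X_syn`, `stack`, and `stack_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=1)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=1)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,                    # out-of-fold predictions feed the final estimator
    stack_method="predict",  # pass hard class labels, as in the manual version
)
stack.fit(X_syn_train, y_syn_train)

stack_accuracy = stack.score(X_syn_test, y_syn_test)
print(stack.transform(X_syn_test).shape, stack_accuracy)
```

Because the stacked features are built inside the estimator, there are no separately named train and test columns to keep in sync.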

