<a href="https://colab.research.google.com/github/Chirag314/OOF-validation/blob/main/Stacked_model_for_binary_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

OOF predictions play an important role in machine learning in both estimating the performance of a model and in development of ensemble model.
During the k-fold cross-validation, redictions are made on test sets comprised of data not used to train the model. These predictions are referred to as out of fold predictions, a type of out of sample predictions.

The k-fold cross-validation procedure involves splitting a training dataset into k groups, then using each of the k groups of examples on a test set while the remaining examples are used as a training set.

This means that k different models are trained and evaluated. The performance of the model is estimated using the predictions by the models made across all k-folds.

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the holdout set 1 time and used to train the model k-1 times.

In [53]:
from numpy import hstack
from numpy import array
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#create a meta dataset
def create_meta_dataset(data_x,yhat1,yhat2):
  yhat1=array(yhat1).reshape((len(yhat1),1))
  yhat2=array(yhat2).reshape((len(yhat2),1))
  #stack as separate columns
  meta_X=hstack((data_x,yhat1,yhat2))
  return meta_X

#make predictions with stacked model
def stack_prediction(model1,model2,meta_model,X):
  #make predictions
  yhat1=model1.predict_proba(X)[:,0]
  yhat2=model2.predict_proba(X)[:,0]
  #create impurt dataset
  meta_X=create_meta_dataset(X,yhat1,yhat2)
  return meta_model.predict(meta_X)



In [54]:
#create inputs and outputs dummy dataset
X,y=make_blobs(n_samples=1000,centers=2,n_features=100,cluster_std=20)
print(X.shape)
print(y.shape)

(1000, 100)
(1000,)


In [55]:
#Split
X,X_val,y,y_val=train_test_split(X,y,test_size=0.33)

In [56]:
data_x,data_y,knn_yhat,cart_yhat=list(),list(),list(),list()
kfold=KFold(n_splits=10,shuffle=True)

In [57]:
for train_ix,test_ix in kfold.split(X):
  #get data
  train_X, test_X=X[train_ix],X[test_ix]
  train_y,test_y=y[train_ix],y[test_ix]
  data_x.extend(test_X)
  data_y.extend(test_y)
  #fit and predict with different models
  model1=DecisionTreeClassifier()
  model1.fit(train_X,train_y)
  yhat1=model1.predict_proba(test_X)[:,0]
  cart_yhat.extend(yhat1)
  #Fit and predict with KNN
  model2=KNeighborsClassifier()
  model2.fit(train_X,train_y)
  yhat2=model2.predict_proba(test_X)[:,0]
  knn_yhat.extend(yhat2)

In [58]:
#meta dataset
meta_X=create_meta_dataset(data_x,knn_yhat,cart_yhat)
#fit final submodels
model1=DecisionTreeClassifier()
model1.fit(X,y)
model2=KNeighborsClassifier()
model2.fit(X,y)

In [59]:
#Construct meta classifier
meta_model=LogisticRegression(solver='liblinear')
meta_model.fit(meta_X,data_y)


In [60]:
#Evaluate submodels on holdout
acc1=accuracy_score(y_val,model1.predict(X_val))
acc2=accuracy_score(y_val, model2.predict(X_val))
print('Model1 Accuracy: %.3f, Model2 Accuracy: %.3f' % (acc1, acc2))


Model1 Accuracy: 0.724, Model2 Accuracy: 0.924


In [63]:
#Evaluate meta model on holdout set
yhat=stack_prediction(model1,model2,meta_model,X_val)
yhat.shape

(330,)

In [62]:
acc=accuracy_score(y_val,yhat)
print('Meta  model accuracy : %.3f' %(acc))

Meta  model accuracy : 0.942
