# Out Of Fold Meta-Model Stacking Pipeline 🥳
This notebook is a continuation of my [previous notebook](https://www.kaggle.com/amoghjrules/intro-to-stacking-averaging-base-models).
I use a custom dataset here. If you want to see how it was made, check out my notebook [here](https://www.kaggle.com/amoghjrules/encode-like-there-s-no-tomorrow).

### What are out of fold predictions?
The main idea behind the structure of a stacked generalization is to use one or more first level models, make predictions using these models and then use these predictions as features to fit one or more second level models on top. To avoid overfitting, we use cross-validation to predict the OOF (out-of-fold) part of the training set. [source](https://towardsdatascience.com/automate-stacking-in-python-fc3e7834772e)

![image](https://i.postimg.cc/T32XpjhC/stacking2.png)
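
To make the idea concrete, here is a minimal sketch on synthetic data. The dataset and the choice of `LogisticRegression` are assumptions for illustration only, not this notebook's data or models. scikit-learn's `cross_val_predict` returns, for every training row, a prediction from a model that did not see that row while fitting, which is exactly what an out-of-fold prediction is.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (assumption: not the competition dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row's probability comes from the fold in which that row was held out
oof_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]

print(oof_proba.shape)  # (200,): one out-of-fold prediction per training row
```

A column like `oof_proba` from each first-level model becomes one feature for the second-level model.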

## Importing pre-encoded data from my public dataset

Since our data is pre-encoded, we can start working on it right away.

In [None]:
import numpy as np
import pandas as pd
test = pd.read_csv("../input/catindat2encoded/test_encoded.csv")
train = pd.read_csv("../input/catindat2encoded/train_encoded.csv")
test_id = pd.read_csv("../input/cat-in-the-dat-ii/sample_submission.csv")['id']

In [None]:
target = train['target']
train.drop(['target'], axis = 1, inplace = True)
# test.drop(['id'], axis = 1, inplace = True)

## Importing libraries

In [None]:
from sklearn.model_selection import StratifiedKFold, KFold, cross_validate
from sklearn.linear_model import LogisticRegression, ElasticNet, SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from tqdm import tqdm  # progress bar used inside the pipeline function

## Defining the function which will be our pipeline
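Here we build an out-of-fold training set and a matching test set for the meta-model.

Using a stratified k-fold split, each base model is fit on the training folds and predicts the held-out fold; stacking these held-out predictions gives the meta-model's training set. Each base model is then refit on the full training data, and its predictions on the real test set form the meta-model's test set.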

In [None]:
def generate_oof_trainset(train, test, target, strat_kfold, models):
    """Return (oof_train, oof_test) meta-features for a list of classifiers.

    oof_train: one row per training sample; one column per model plus the target as the last column.
    oof_test: one row per model, one column per test sample (transpose before predicting).
    """
    oof_train = pd.DataFrame()  # accumulates the out-of-fold predictions fold by fold

    print(train.shape, target.shape)

    for fold, (fit_idx, holdout_idx) in enumerate(strat_kfold.split(train, target), start=1):
        print("Current fold number is :", fold)
        xtrain, xtest = train.iloc[fit_idx], train.iloc[holdout_idx]
        ytrain, ytest = target.iloc[fit_idx], target.iloc[holdout_idx]

        # One entry per model, plus a final entry for the held-out targets
        curr_split = [None] * (len(models) + 1)

        for i in tqdm(range(len(models))):
            model = models[i]
            model.fit(xtrain, ytrain)
            # Probability of the positive class for the held-out rows
            curr_split[i] = model.predict_proba(xtest)[:, 1]

        curr_split[-1] = ytest.values  # plain array, so rows align by position
        oof_train = pd.concat([oof_train, pd.DataFrame(curr_split).T], ignore_index=True)

    # Refit every model on the full training data and predict the real test set
    oof_test = [None] * len(models)
    for i, model in enumerate(models):
        model.fit(train, target)
        oof_test[i] = model.predict_proba(test)[:, 1]
    oof_test = pd.DataFrame(oof_test)
    return oof_train, oof_test
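
As a possible alternative layout, the sketch below preallocates NumPy arrays and writes each fold's predictions back at the held-out row positions. The helper name `oof_features`, the synthetic data, and the two small models are all assumptions for illustration; this is not the function used in the rest of this notebook. The benefit is that the out-of-fold rows stay in the original order, so the target does not need to be carried along as an extra column, and both outputs share the same samples-by-models shape.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def oof_features(X, y, X_test, models, cv):
    """Hypothetical helper: returns (n_train, n_models) and (n_test, n_models) arrays."""
    oof = np.zeros((X.shape[0], len(models)))
    test_meta = np.zeros((X_test.shape[0], len(models)))
    for j, model in enumerate(models):
        for fit_idx, hold_idx in cv.split(X, y):
            model.fit(X[fit_idx], y[fit_idx])
            # Write predictions back at the held-out row positions
            oof[hold_idx, j] = model.predict_proba(X[hold_idx])[:, 1]
        # Refit on all training rows before predicting the test set
        model.fit(X, y)
        test_meta[:, j] = model.predict_proba(X_test)[:, 1]
    return oof, test_meta


# Synthetic stand-in data (assumption: not the competition dataset)
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=0)
X, y, X_test = X_all[:200], y_all[:200], X_all[200:]

models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

oof, test_meta = oof_features(X, y, X_test, models, cv)
print(oof.shape, test_meta.shape)  # (200, 2) (100, 2)
```

With this layout a meta-model could be fit with `meta.fit(oof, y)` and applied with `meta.predict_proba(test_meta)`, with no transpose needed.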
    

Now we define the base models and use them to generate the out-of-fold training set and the test-set meta-features.

In [None]:
from tqdm import tqdm
strat_kfold = StratifiedKFold(n_splits=2, shuffle=True)

log_reg = LogisticRegression(random_state = 0)
gbr = GradientBoostingClassifier(
        max_depth=6,
        n_estimators=35,
        warm_start=False,
        random_state=42)
adar = AdaBoostClassifier(n_estimators=100, random_state=0)

models = [ log_reg, gbr, adar ]
train_generated, test_generated = generate_oof_trainset( train, test, target, strat_kfold, models)

Then we cross-validate the meta-classifier on the generated training set.

In [None]:
lr_clf = LogisticRegression()
target = train_generated[train_generated.columns[-1]]
train_generated.drop([train_generated.columns[-1]], axis = 1 , inplace = True)

cv_results = cross_validate(lr_clf,
                            train_generated.values,
                            target.values,
                            cv = 3,
                            scoring = 'roc_auc',
                            verbose = 1,
                            return_train_score = True,
                            return_estimator = True)

print("Fit time :", cv_results['fit_time'].sum(),"secs")
print("Score time :", cv_results['score_time'].sum(),"secs")
print("Train score :", cv_results['train_score'].mean())
print("Test score :", cv_results['test_score'].mean())   
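
For comparison, scikit-learn (version 0.22 and later) ships `StackingClassifier`, which builds the out-of-fold features internally. The sketch below is an assumption-laden illustration on synthetic data with two small stand-in base models; it is not the pipeline used in this notebook. Cross-validating the whole stack this way scores the base models and the meta-model together, rather than the meta-model alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the competition dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("log_reg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=3,                                  # inner folds used to build out-of-fold features
    stack_method="predict_proba",
)

# Outer cross-validation scores the full stack on rows it never trained on
scores = cross_val_score(stack, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```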

In [None]:
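# Fit the meta-classifier on all out-of-fold features, then predict the test set.
# test_generated has one row per model, so transpose it to one row per sample.
lr_clf.fit(train_generated.values, target.values)
# Keep probabilities rather than hard labels, since the CV metric above is ROC AUC
preds = lr_clf.predict_proba(test_generated.T.values)[:, 1]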
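
The predictions can then be paired with the ids loaded at the top of the notebook and written to a file. The sketch below uses small stand-in values so that it runs on its own; the column names `id` and `target` and the file name `submission.csv` are assumptions, so check them against the sample submission before using this.

```python
import numpy as np
import pandas as pd

# Stand-in values (assumption): in the notebook these would be `test_id` and `preds`
test_id = pd.Series([0, 1, 2], name="id")
preds = np.array([0.12, 0.80, 0.45])

# One row per test sample: its id and the predicted score
submission = pd.DataFrame({"id": test_id.values, "target": preds})
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # (3, 2)
```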

## If you learnt something from this notebook, do upvote the kernel 😁