## End-to-End Ensemble Methods

### I design an end-to-end ensemble method below. It can automatically process data without any human involvement at all. As long as you provide the orignal data frame and the position information about categorical and interval features, it can automatically run and return the prediction directly.
### It is actually a customizing pipeline with three phases from feature encoding, dimension deduction to final classifiers(regressors). Also you can plug in more phases into the pipeline if you want. The advantage of this method than Sklearn pipeline(also my innovation) is I create a new ordinal encoder class so that training set and testing set can be processed independently at the very begining!
### There are some annoying issuses of Sklearn Encoding class: 1. Label encoder can only be applied on a single feature and at the trsanform stage, any new value without encoding at fit stage is not allowed, whic means you have to perform label encoder on the entire dataset. Once new samples come in, it will fail to transform. 2. Ordinal encoder is a bit better than label encoder for handling multiple features simultaneously. However, since it do not have the input parameter handl_unknown like Onehot encoder, it is still not able to process new values in features at the transform stage. Onehot encdoer is the most powerful encoder since it is able to handle unknown values and can handle features simultaneously. BUT the most frustrating point is Onehot encoder can be used only on the features with integer or float data type. So you have to use the former two encoder to convert string features to numeric features! This will greatly increase the difficulty of your automation and pipeline.
### My new OrdinalEncoder solves the above isses and make the entire automation come true! Some codes in this notebook are referenced the paper Customer Segmentation based  on Financial Behavior. I am one of the coauthor. 

In [1]:
import pickle
with open('data.pickle','rb') as load:
    data=pickle.load(load)
with open('y.pickle','rb') as load:
    y=pickle.load(load)
with open('train_test_index.pickle','rb') as load:
    train_test_index=pickle.load(load)
with open('feature_final.pickle','rb') as load:
    feature_final=pickle.load(load)

In [2]:
## Customizing Ensemble Method 
## Reference Paper Customer Segemntation based on Financial Behavior, Author: Kecheng Xu, etc.
import numpy as np
from numpy.random import choice
from numpy.random import choice
from sklearn.decomposition import TruncatedSVD as SVD
from sklearn.decomposition import NMF
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from tqdm import tqdm_notebook as tqdm 

## Three class we have: OrdinalEncoder, estimator, ensemble.
## You only need to generate the instance of ensemble, input the estimetor amount, dataframe and categorcial features position value into ensemble instance. Then it will directly return the prediction.
## An instance of estimator contains three phases: encoding, dimension deduction and ensemble classifiers.  
## An instance of ensemble is consisted of many independent estimators, it will calculate the mean value of each prediction(probability) generated by a single estimator. 
## You can add more alternatives in each stage(layer) of estimator without any adjustment as long as the estimator align the fit, fit_transorm and transform pattern.

In [3]:
class OrdinalEncoder():
    def __init__(self):
        self.dicts=[]
        
    def fit(self,df,line):
        self.line=line
        df_output=df.copy()
        for i in range(line):
            dic=np.unique(df.iloc[:,i])
            dic=dict([(i,index) for index, i in enumerate(dic)])
            self.dicts.append(dic)
            
    def fit_transform(self,df,line):
        self.line=line
        df_output=df.copy()
        for i in range(line):
            dic=np.unique(df.iloc[:,i])
            dic=dict([(i,index) for index, i in enumerate(dic)])
            self.dicts.append(dic)
            df_output.iloc[:,i]=df.iloc[:,i].apply(lambda x: dic[x])
        return df_output
        
    def transform(self,df):
        df_output=df.copy()
        for i in range(self.line):
            dic=self.dicts[i]
            df_output.iloc[:,i]=df.iloc[:,i].apply(self.unknown_value,args=(dic,))
        return df_output
    
    def unknown_value(self,value,dic): # It will set up a new interger for unknown values!
        try:
            return dic[i]
        except:
            return len(dic)
                
class estimator():
    
    def __init__(self,model_dic):# Choose an alternative in each layer!
        self.layer0=choice(model_dic['Encoding'],1)[0]
        self.layer1=choice(model_dic['Dimdeduct'],1)[0]
        self.layer2=choice(model_dic['Classifer'],1)[0]
        
        
    def fit(self,X,y,K,ord_output,ohe_output):#Three layer fitting process
    
        if self.layer0.__name__=='OrdinalEncoder':
            layer0_output=ord_output
        else:
            layer0_output=ohe_output
        self.layer1_=self.layer1(n_components=K)
        layer1_output=self.layer1_.fit_transform(layer0_output)
        self.layer2_=self.layer2()
        self.layer2_.fit(layer1_output,y)
    
        
    def predict(self,ord_output,ohe_output):#Three layer fitting process
        if self.layer0.__name__=='OrdinalEncoder':
            layer0_output=ord_output
        else:
            layer0_output=ohe_output
        layer1_output=self.layer1_.transform(layer0_output)
        return self.layer2_.predict_proba(layer1_output)
        
class ensemble():

    def __init__(self,n_estimators,model_dic):
        self.n_setimators=n_estimators
        self.estimators_=[estimator(model_dic) for i in range(self.n_setimators)]
        self.cate_int_line=model_dic['categorical_interval_line']
        self.est_complete=0
    
    def fit(self,X,y,K):
        self.OrdinalEncoder_=OrdinalEncoder()
        OrdinalEncoder_output=self.OrdinalEncoder_.fit_transform(X,self.cate_int_line)
        self.OnehotEncoder_=OneHotEncoder(categorical_features=list(range(self.cate_int_line)),handle_unknown='ignore')
        OnehotEncoder_output=self.OnehotEncoder_.fit_transform(OrdinalEncoder_output)
        
        for est in tqdm(self.estimators_):
            est.fit(X,y,K,OrdinalEncoder_output,OnehotEncoder_output)
            self.est_complete+=1
            print('Complete '+str(self.est_complete)+' estimator!')
    
    def predict(self,X):
        prob=[]
        ord_output=self.OrdinalEncoder_.transform(X)
        ohe_output=self.OnehotEncoder_.transform(ord_output)
        for est in tqdm(self.estimators_):
            prob.append(est.predict(ord_output,ohe_output))
        prob=np.array(prob)
        return np.mean(prob,axis=0)
            

## Training and predicting process only need below code

In [4]:
model_dic={'Encoding':[OrdinalEncoder,OneHotEncoder],
            'Dimdeduct':[SVD,NMF],
            'Classifer':[RandomForestClassifier,
                         XGBClassifier,
                         AdaBoostClassifier,
                         GradientBoostingClassifier],
            'categorical_interval_line':22 }

x_train=data.iloc[train_test_index['x_train'],:]
y_train=y[train_test_index['y_train']]
e=ensemble(3,model_dic)
e.fit(x_train,y_train,10)



HBox(children=(IntProgress(value=0, max=3), HTML(value='')))



Complete 1 estimator!
Complete 2 estimator!
Complete 3 estimator!



In [5]:
x_test=data.iloc[train_test_index['x_test'],:]
y_train=y[train_test_index['y_test']]
y_predict=e.predict(x_test)
y_predict

HBox(children=(IntProgress(value=0, max=3), HTML(value='')))




array([[0.9276315 , 0.0723685 ],
       [0.91814008, 0.08185992],
       [0.96044548, 0.03955452],
       ...,
       [0.85566963, 0.14433037],
       [0.85260338, 0.14739662],
       [0.95337461, 0.04662539]])