# Best Question Author Prediction - Enigma CodeFest - Analytics Vidhya

## Problem Statement
* An online Q&A platform has hired you as a data scientist to **identify the best question authors** on the platform.
* Why? Identifying these authors gives more insight into how to increase user engagement.
* How? Given the tag of the question, the number of views and answers it received, and the username and reputation of its author, **predict the upvote count that the question will receive**.

## Data Dictionary

| Variable   | Definition                                     |
|------------|------------------------------------------------|
| ID         | Question ID                                    |
| Tag        | Anonymised tags representing question category |
| Reputation | Reputation score of question author            |
| Answers    | Number of times question has been answered     |
| Username   | Anonymised user id of question author          |
| Views      | Number of times question has been viewed       |
| Upvotes    | (Target) Number of upvotes for the question    |

## Evaluation Metric

The evaluation metric for this competition is RMSE (root mean squared error).
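
As a rough illustration of the metric, the sketch below computes RMSE with the standard library only. The helper name `rmse` and the sample numbers are made up for the example and are not taken from the competition data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical upvote counts and predictions, for illustration only
actual = [42.0, 1175.0, 60.0]
predicted = [40.0, 1100.0, 65.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single badly predicted high-upvote question can dominate the score.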

## Tags

**Regression**

In [158]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

# FunctionTransformer to select specific columns from pandas DataFrame
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

In [18]:
sns.set_style('whitegrid')

## Load Data

In [19]:
train = pd.read_csv('data/train_NIR5Yl1.csv', index_col='ID')
print('Train Data Size :',train.shape)
train.head()

Train Data Size : (330045, 6)


ID,Tag,Reputation,Answers,Username,Views,Upvotes
52664,a,3942.0,2.0,155623,7855.0,42.0
327662,a,26046.0,12.0,21781,55801.0,1175.0
468453,c,1358.0,4.0,56177,8067.0,60.0
96996,a,264.0,3.0,168793,27064.0,9.0
131465,c,4271.0,4.0,112223,13986.0,83.0


In [20]:
test = pd.read_csv('data/test_8i3B3FC.csv', index_col='ID')
print('Test Data Size :',test.shape)
test.head()

Test Data Size : (141448, 5)


ID,Tag,Reputation,Answers,Username,Views
366953,a,5645.0,3.0,50652,33200.0
71864,c,24511.0,6.0,37685,2730.0
141692,i,927.0,1.0,135293,21167.0
316833,i,21.0,6.0,166998,18528.0
440445,i,4475.0,10.0,53504,57240.0


## Pre-process Train Data

In [21]:
# Null-sanity check
train.isnull().sum() # It is clean

Tag           0
Reputation    0
Answers       0
Username      0
Views         0
Upvotes       0
dtype: int64

In [22]:
ytrain = train.pop('Upvotes')
xtrain = train
train = None
print(xtrain.head(), '\n\n\n', ytrain.head())

       Tag  Reputation  Answers  Username    Views
ID                                                
52664    a      3942.0      2.0    155623   7855.0
327662   a     26046.0     12.0     21781  55801.0
468453   c      1358.0      4.0     56177   8067.0
96996    a       264.0      3.0    168793  27064.0
131465   c      4271.0      4.0    112223  13986.0 


 ID
52664       42.0
327662    1175.0
468453      60.0
96996        9.0
131465      83.0
Name: Upvotes, dtype: float64


In [23]:
xtrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 330045 entries, 52664 to 300553
Data columns (total 5 columns):
Tag           330045 non-null object
Reputation    330045 non-null float64
Answers       330045 non-null float64
Username      330045 non-null int64
Views         330045 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 15.1+ MB


In [24]:
xtrain.Tag.value_counts()

c    72458
j    72232
p    43407
i    32400
a    31695
s    23323
h    20564
o    14546
r    12442
x     6978
Name: Tag, dtype: int64

In [25]:
xtrain.Tag = xtrain.Tag.astype('category')
xtrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 330045 entries, 52664 to 300553
Data columns (total 5 columns):
Tag           330045 non-null category
Reputation    330045 non-null float64
Answers       330045 non-null float64
Username      330045 non-null int64
Views         330045 non-null float64
dtypes: category(1), float64(3), int64(1)
memory usage: 12.9 MB


In [26]:
# DataFrame.info() prints its report and returns None, so call it directly instead of printing its result
print('Test Data Info (before):')
test.info()
test.Tag = test.Tag.astype('category')
print()
print('Test Data Info (after):')
test.info()

Test Data Info (before):
<class 'pandas.core.frame.DataFrame'>
Int64Index: 141448 entries, 366953 to 107271
Data columns (total 5 columns):
Tag           141448 non-null object
Reputation    141448 non-null float64
Answers       141448 non-null float64
Username      141448 non-null int64
Views         141448 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 6.5+ MB

Test Data Info (after):
<class 'pandas.core.frame.DataFrame'>
Int64Index: 141448 entries, 366953 to 107271
Data columns (total 5 columns):
Tag           141448 non-null category
Reputation    141448 non-null float64
Answers       141448 non-null float64
Username      141448 non-null int64
Views         141448 non-null float64
dtypes: category(1), float64(3), int64(1)
memory usage: 5.5 MB


In [103]:
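# Ref.: https://stackoverflow.com/a/48929642
class ModifiedLabelEncoder(LabelEncoder):
    """LabelEncoder that can sit inside a Pipeline: it tolerates the extra
    arguments a Pipeline passes and returns a 2-D column for OneHotEncoder."""

    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)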

In [28]:
cat_features = ['Tag']
mle = ModifiedLabelEncoder()
tr_values = mle.fit_transform(xtrain[cat_features])
print('LabelEncoder Classes :', list(mle.classes_))
print('LabelEncoder Classes Count :', len(list(mle.classes_)))

LabelEncoder Classes : ['a', 'c', 'h', 'i', 'j', 'o', 'p', 'r', 's', 'x']
LabelEncoder Classes Count : 10


In [29]:
print('LabelEncoder Transform Classes :', tr_values[:5])
print('LabelEncoder Inverse-Transform Classes :', mle.inverse_transform(tr_values)[:5])

LabelEncoder Transform Classes : [[0]
 [0]
 [1]
 [0]
 [1]]
LabelEncoder Inverse-Transform Classes : [['a']
 ['a']
 ['c']
 ['a']
 ['c']]


In [30]:
ohe = OneHotEncoder(sparse=False)
# res = ohe.fit_transform(xtrain[cat_features])
res = ohe.fit_transform(tr_values)
# res.todense().shape
res

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [32]:
# xtrain

In [66]:
def numeric_columns(df):
    return df.select_dtypes(include=['int64','float64'])

def categorical_columns(df):
    return df.select_dtypes(include=['object', 'category'])

transformer_list = [
    ('numeric_features', make_pipeline(FunctionTransformer(numeric_columns, validate=False),
                                       StandardScaler())
    ),
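    ('categorical_features', make_pipeline(FunctionTransformer(categorical_columns, validate=False),
                                          # LabelEncoder.fit_transform accepts only `y`, so the Pipeline's
                                          # fit_transform(X, y) call raises the TypeError shown in the output
                                          LabelEncoder(),
                                          OneHotEncoder(sparse=False))
    )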
]
fu = FeatureUnion(transformer_list=transformer_list, 
                  transformer_weights=None)

steps = [
    # Use FeatureUnion to combine the features
    ('union', fu),
    # Use data-model
    # ('lr_model', LinearRegression())
]

pipe = Pipeline(steps)
pdf = pipe.fit_transform(xtrain)
pdf

TypeError: fit_transform() takes 2 positional arguments but 3 were given
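
The `TypeError` comes from a signature mismatch: a pipeline hands both the features and the target to each step, while `LabelEncoder.fit_transform` accepts a single argument. The sketch below reproduces the same failure without scikit-learn. `TargetOnlyEncoder`, `TolerantEncoder` and `call_like_pipeline` are hypothetical stand-ins written for this illustration, not library classes.

```python
class TargetOnlyEncoder:
    """Hypothetical stand-in for LabelEncoder: fit_transform takes only y."""

    def fit_transform(self, y):
        classes = sorted(set(y))
        return [classes.index(label) for label in y]


class TolerantEncoder(TargetOnlyEncoder):
    """Accepts and ignores extra arguments, and returns a 2-D column."""

    def fit_transform(self, y, *args, **kwargs):
        return [[code] for code in super().fit_transform(y)]


def call_like_pipeline(step, X, y=None):
    # A pipeline passes both the features and the target to each step
    return step.fit_transform(X, y)


tags = ['a', 'a', 'c', 'a', 'c']

try:
    call_like_pipeline(TargetOnlyEncoder(), tags)
    failed = False
except TypeError as exc:
    failed = True
    print('TypeError:', exc)

encoded = call_like_pipeline(TolerantEncoder(), tags)
print(encoded)
```

Accepting `*args, **kwargs` in the override is the same idea the `ModifiedLabelEncoder` subclass relies on.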

## Model Data

In [64]:
# Minimal reproduction: LabelEncoder alone as the final pipeline step
mp2 = make_pipeline(FunctionTransformer(categorical_columns, validate=False), LabelEncoder())
mp2

Pipeline(memory=None,
     steps=[('functiontransformer', FunctionTransformer(accept_sparse=False,
          func=<function categorical_columns at 0x0000018994A3E8C8>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=False)), ('labelencoder', LabelEncoder())])

In [65]:
mp2.fit_transform(xtrain, y=ytrain)

TypeError: fit_transform() takes 2 positional arguments but 3 were given

In [51]:
# xtrain.select_dtypes(include=['object', 'category'])

In [88]:
# Ref.: https://stackoverflow.com/questions/48994618/unable-to-use-featureunion-to-combine-processed-numeric-and-categorical-features
from sklearn.base import BaseEstimator, TransformerMixin

# Transformer that selects the given columns from a DataFrame by name
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)

numeric_cols = ['Reputation', 'Answers', 'Username', 'Views'] # list of numeric column names
categorical_cols = ['Tag'] # list of categorical column names

# Testing
print(ColumnSelector(columns=numeric_cols).fit_transform(xtrain).head())
print(ColumnSelector(columns=categorical_cols).fit_transform(xtrain).head())

        Reputation  Answers  Username    Views
ID                                            
52664       3942.0      2.0    155623   7855.0
327662     26046.0     12.0     21781  55801.0
468453      1358.0      4.0     56177   8067.0
96996        264.0      3.0    168793  27064.0
131465      4271.0      4.0    112223  13986.0
       Tag
ID        
52664    a
327662   a
468453   c
96996    a
131465   c


In [92]:
tmp = ColumnSelector(columns=numeric_cols).fit_transform(xtrain)
StandardScaler().fit_transform(tmp)

array([[-0.14157253, -0.53573597,  1.5072655 , -0.26915833],
       [ 0.67523751,  2.25794312, -1.21226978,  0.32308687],
       [-0.23705919,  0.02299985, -0.51337753, -0.26653963],
       ...,
       [-0.05894553, -0.53573597,  0.20843454, -0.33588566],
       [-0.2839526 , -0.53573597, -0.0243399 , -0.34015957],
       [-0.21329838,  0.02299985,  1.48834852, -0.33463807]])

In [99]:
tmp = ColumnSelector(columns=categorical_cols).fit_transform(xtrain)
print(tmp.head())
LabelEncoder().fit_transform(tmp)

       Tag
ID        
52664    a
327662   a
468453   c
96996    a
131465   c


array([0, 0, 1, ..., 1, 4, 4], dtype=int64)

In [104]:
tmp = ColumnSelector(columns=categorical_cols).fit_transform(xtrain)
print(tmp.head())
ModifiedLabelEncoder().fit_transform(tmp)

       Tag
ID        
52664    a
327662   a
468453   c
96996    a
131465   c


array([[0],
       [0],
       [1],
       ...,
       [1],
       [4],
       [4]], dtype=int64)

In [102]:
tmp = ColumnSelector(columns=categorical_cols).fit_transform(xtrain)
tmp = LabelEncoder().fit_transform(tmp) # Convert String to Number, for OneHotEncoding
OneHotEncoder(sparse=False).fit_transform(tmp.reshape(-1, 1))

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [107]:
tmp = ColumnSelector(columns=categorical_cols).fit_transform(xtrain)
tmp = ModifiedLabelEncoder().fit_transform(tmp) # Convert String to Number, for OneHotEncoding
tmp = OneHotEncoder(sparse=False).fit_transform(tmp)
print(tmp.shape)

(330045, 10)


In [110]:
numeric_cols_pipe = make_pipeline(ColumnSelector(columns=numeric_cols),StandardScaler())
categorical_cols_pipe = make_pipeline(ColumnSelector(columns=categorical_cols), ModifiedLabelEncoder(), OneHotEncoder(sparse=False))
fu = make_union(numeric_cols_pipe, categorical_cols_pipe)

trans_vec = fu.fit_transform(xtrain)
print(trans_vec.shape)
print(trans_vec[:5])

(330045, 14)
[[-0.14157253 -0.53573597  1.5072655  -0.26915833  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.67523751  2.25794312 -1.21226978  0.32308687  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.23705919  0.02299985 -0.51337753 -0.26653963  0.          1.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.27748582 -0.25636806  1.7748667  -0.03188227  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.12941498  0.02299985  0.62542101 -0.19342614  0.          1.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]]
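
For comparison, newer scikit-learn releases ship `ColumnTransformer`, which can express the same numeric-plus-categorical preprocessing without custom selector or encoder classes. The sketch below is an alternative, not what this notebook ran. It assumes a scikit-learn version that provides `sklearn.compose`, and it uses a five-row frame copied from the head of the training data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# First five training rows, copied from the output of train.head()
demo = pd.DataFrame({
    'Tag': ['a', 'a', 'c', 'a', 'c'],
    'Reputation': [3942.0, 26046.0, 1358.0, 264.0, 4271.0],
    'Answers': [2.0, 12.0, 4.0, 3.0, 4.0],
    'Username': [155623, 21781, 56177, 168793, 112223],
    'Views': [7855.0, 55801.0, 8067.0, 27064.0, 13986.0],
})

ct = ColumnTransformer(
    [
        ('numeric', StandardScaler(), ['Reputation', 'Answers', 'Username', 'Views']),
        # OneHotEncoder handles string categories directly in these versions
        ('categorical', OneHotEncoder(handle_unknown='ignore'), ['Tag']),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = ct.fit_transform(demo)
print(features.shape)
```

On the full training set this layout would give the same 14 columns as the union built here (4 scaled numeric columns plus 10 one-hot tag columns), provided all ten tags are present.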


In [115]:
# Split train data-set
x_train, x_test, y_train, y_test = train_test_split(trans_vec, ytrain.values, train_size = 0.75, random_state = 42)
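
The split created here is not used by the later cells, which fit and score on the full training set. A minimal sketch of how a held-out split could be scored with the competition metric follows. It runs on synthetic data (`X_demo`, `y_demo` are made up for the example) rather than on `trans_vec`, so the printed number says nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the transformed features
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=1000)

X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=42
)

# Fit on 75% of the rows, score on the remaining 25%
model = LinearRegression().fit(X_fit, y_fit)
holdout_rmse = np.sqrt(mean_squared_error(y_hold, model.predict(X_hold)))
print('Hold-out RMSE:', holdout_rmse)
```

Scoring on rows the model has not seen gives an estimate that is closer to what the leaderboard measures than a score on the training rows.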



## Prediction with LinearRegression

### Util Methods

In [126]:
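def conv2df(preds):
    """Build the submission frame: one row per test ID with an integer upvote prediction."""
    df = pd.DataFrame(data={
        'ID': test.index.values,
        'Upvotes': preds
    })
    # astype(int) truncates toward zero rather than rounding
    df['Upvotes'] = df.Upvotes.astype(int)
    return df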

In [137]:
test_fu = fu.transform(test)
test_fu.shape

(141448, 14)

In [127]:
# Predict using linear regression
lr_clf = LinearRegression()
lr_clf.fit(trans_vec, ytrain.values)
# score() returns R^2, computed here on the training data itself rather than on a held-out split
print('Base-level Score', lr_clf.score(trans_vec, ytrain.values))
preds = lr_clf.predict(test_fu)
conv2df(preds).to_csv('data/output/lr.csv', index=False) # Got score of 3543.8523122425 (RMSE) on the public leaderboard

Base-level Score 0.2556657989092912


In [156]:
sgdr_clf = SGDRegressor(max_iter=101, learning_rate='optimal', alpha=0.11, random_state=42)
sgdr_clf.fit(trans_vec, ytrain.values)
print('SGDR Score : ', sgdr_clf.score(trans_vec, ytrain.values)) # SGDR Score :  0.25240081600832254
preds = sgdr_clf.predict(test_fu)
conv2df(preds).to_csv('data/output/sgdr.csv', index=False)

SGDR Score :  0.25240081600832254


## Prediction with BaggingRegressor

In [134]:
dtr = DecisionTreeRegressor()
br = BaggingRegressor(dtr, n_jobs=-1, random_state=42)
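params = {
    'n_estimators': [10, 15, 20, 25, 30],
    # Floats are fractions of the training set; 1.0 means all samples, whereas the int 1 would mean a single sample
    'max_samples': [.5, .7, 1.0],
    'max_features': [.7, .8, .9, 1.0],
    # The recorded output (120 candidates, 'warm_start' in the best params) comes from a run with this line enabled
#     'warm_start': [True, False]
}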
gsv = GridSearchCV(br, params, cv=2, verbose=5, n_jobs=-1)
gsv.fit(trans_vec, ytrain.values)
preds = gsv.predict(fu.transform(test))
conv2df(preds).to_csv('data/output/bagging_regressor.csv', index=False) # Got best score of 1016.7805765708 in Public LeaderBoard

Fitting 2 folds for each of 120 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 20.0min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 30.9min finished


In [135]:
print('Best Score : ', gsv.best_score_)
print('Best Params : ', gsv.best_params_)
#Best Score :  0.8579086577431149, Best Params :  {'max_features': 1.0, 'max_samples': 0.7, 'n_estimators': 25, 'warm_start': False}, LB-Score: 1016.7805765708
# Best Score :  0.8560505987116979, Best Params :  {'max_features': 1.0, 'max_samples': 0.7, 'n_estimators': 25}, LB-Score: 1100.3336222340

gsv.best_estimator_

Best Score :  0.8579086577431149
Best Params :  {'max_features': 1.0, 'max_samples': 0.7, 'n_estimators': 25, 'warm_start': False}


BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=0.7, n_estimators=25, n_jobs=-1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)
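
`GridSearchCV` ranks candidates by the estimator's default score, which is R² for regressors, while the leaderboard uses RMSE. A minimal sketch of searching on a squared-error criterion instead is shown here. It uses synthetic data and a small decision-tree grid chosen only to keep the example fast, not the bagging grid from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(300, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': [2, 4, 6]},
    cv=3,
    scoring='neg_mean_squared_error',  # higher is better, so MSE is negated
)
search.fit(X_demo, y_demo)

# Undo the negation and take the square root to report RMSE
best_rmse = np.sqrt(-search.best_score_)
print('Best params:', search.best_params_)
print('Cross-validated RMSE:', best_rmse)
```

With this scoring, `best_score_` is on the same scale as the competition metric, which makes it easier to compare against leaderboard results.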

## Prediction with GradientBoostingRegressor

In [172]:
gbr_est = GradientBoostingRegressor(random_state=42, n_estimators=100, learning_rate=0.25, max_depth=7)
gbr_est.fit(trans_vec, ytrain.values)
print('GBR Score : ', gbr_est.score(trans_vec, ytrain.values)) # R^2 on the training data; another run gave 0.9580485461976558
preds = gbr_est.predict(test_fu)
conv2df(preds).to_csv('data/output/gbr.csv', index=False)
# Got LB score of 1177.7464239328351

GBR Score :  0.9942304795977295


In [None]:
params = {
    'n_estimators': [100,150],
    'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
    'max_depth': [3,5,7,9],
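    # alpha is only used when loss='huber' or loss='quantile'; with the default loss it triples the grid without changing the model
    'alpha': [0.9,0.7,0.5],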
}
gsv = GridSearchCV(GradientBoostingRegressor(random_state=42), params, cv=2, verbose=1, n_jobs=-1)
gsv.fit(trans_vec, ytrain.values)
preds = gsv.predict(fu.transform(test))
conv2df(preds).to_csv('data/output/gbr_gsv.csv', index=False)

Fitting 2 folds for each of 144 candidates, totalling 288 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 30.2min


## Prediction with XGBRegressor

In [None]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

xgbr_est = XGBRegressor(random_state=42, n_jobs=-1, silent=False, objective='reg:tweedie')
scores = cross_val_score(xgbr_est, trans_vec, ytrain.values, cv=3)
scores
# [0.89959257, 0.85637644, 0.82874587] : default params : gamma=[0-5], max_depth=3
# [0.88980702, 0.85439139, 0.84157185] : max_depth=7, subsample=.6, objective=reg:tweedie
# [0.86491963, 0.87424075, 0.8331381 ] : max_depth=3, subsample=.6, objective='reg:tweedie'
# [0.90543795, 0.8284715 , 0.7976946 ] : max_depth=7
# [0.91093361, 0.84257485, 0.81340752] : max_depth=7, subsample=.6
# [0.90094576, 0.8222576 , 0.76629597] : max_depth=9, gamma=[0-5] and other defaults
# [0.89565491, 0.81797702, 0.75333005] : max_depth=9, learning_rate=0.25, gamma=[0-7]
# [0.89589926, 0.82686846, 0.76230525] : max_depth=9, learning_rate=0.05, gamma=[0-3]
# [0.64866719, 0.68249395, 0.65918265] : max_depth=9, learning_rate=[0.01,0.02], gamma=0-10

In [72]:
# Disabled cell: draws a stratified 30% sample of the training data (xsample, ysample)
'''
from sklearn.model_selection import StratifiedShuffleSplit
sss_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
# sampleg = sss_cv.split(trans_vec,xtrain.Tag)
sample_in = []
tags = None
# tags = xtrain.Tag.values.tolist()
for train_in, test_in in sss_cv.split(xtrain, xtrain.Tag):
#     print('train_in :', train_in, ' whose length is ', len(train_in))
    print('test_in :', test_in, ' whose length is ', len(test_in))
    sample_in = test_in # 30% of sample
    break

xsample = xtrain.iloc[sample_in]
xsample = fu.transform(xsample)
ysample = ytrain.iloc[sample_in].values.tolist()
print(sample_in)
print()
print(xsample)
print(ysample)
print(len(xsample), len(ysample))
'''

In [65]:
xgbr_est = XGBRegressor(random_state=42, n_jobs=-1)

params = {
    'min_child_weight':[4,5], 
    'gamma':[i/10.0 for i in range(3,6)],  
    'subsample':[i/10.0 for i in range(6,11)],
    'colsample_bytree':[i/10.0 for i in range(6,11)], 
    'max_depth': [3,5,7]
}
gsv = GridSearchCV(xgbr_est, params, cv=3, verbose=1)
gsv.fit(xsample, ysample)

preds = gsv.predict(test_fu)
conv2df(preds).to_csv('data/output/xgbr_gs.csv', index=False)

Fitting 3 folds for each of 450 candidates, totalling 1350 fits


ValueError: could not convert string to float: 'c'

## Prediction using Ensemble of Ensembles

In [20]:
class ModelTransformer(TransformerMixin):

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict(X))

In [21]:
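from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

dtr = DecisionTreeRegressor(random_state=42)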
br_est = BaggingRegressor(dtr, n_jobs=-1, random_state=42, max_samples=.7, n_estimators=50)
gbr_est = GradientBoostingRegressor(random_state=42, n_estimators=250, learning_rate=0.15)
rf_est = RandomForestRegressor(random_state=42, n_jobs=-1, verbose=1, n_estimators=100, max_depth=9)
etr_est = ExtraTreesRegressor(random_state=42, n_jobs=-1, verbose=1, n_estimators=100, max_depth=9)
xgbr_est = XGBRegressor(random_state=42, n_jobs=-1)

efu = FeatureUnion([
        ('gbr', ModelTransformer(gbr_est)),
        ('br', ModelTransformer(br_est)),
        ('rfr', ModelTransformer(rf_est)),
        ('etr', ModelTransformer(etr_est)),
        ('xgb', ModelTransformer(xgbr_est))
    ])

pipe_est = Pipeline([
    ('estimators', efu),
    ('estimator', KNeighborsRegressor(n_neighbors=3))
])

In [22]:
pipe_est.fit(trans_vec, ytrain.values)

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   21.7s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   49.6s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.8s finished
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   19.4s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   48.3s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.9s finished


Pipeline(memory=None,
     steps=[('estimators', FeatureUnion(n_jobs=1,
       transformer_list=[('gbr', <__main__.ModelTransformer object at 0x0000018DA22A2518>), ('br', <__main__.ModelTransformer object at 0x0000018DA8E01B38>), ('rfr', <__main__.ModelTransformer object at 0x0000018DA8E01CF8>), ('etr', <__main__.ModelTransfo...nkowski',
          metric_params=None, n_jobs=1, n_neighbors=3, p=2,
          weights='uniform'))])

In [23]:
preds = pipe_est.predict(test_fu)
conv2df(preds).to_csv('data/output/ensembles.csv', index=False)

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.3s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.3s finished
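
In this pipeline each `ModelTransformer` is fitted on the training rows and then predicts those same rows, so the final `KNeighborsRegressor` learns from in-sample predictions. A common variant feeds the meta-model out-of-fold predictions instead. The sketch below shows that variant on synthetic data with two base models. It is an illustration under those assumptions, not the method used in this notebook, and all names (`X_demo`, `oof_preds`, `meta_model`) are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, for illustration only
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 3.0 + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

base_models = [
    RandomForestRegressor(n_estimators=20, random_state=42),
    LinearRegression(),
]

# Each column holds predictions for rows the model did not see while fitting
oof_preds = np.column_stack(
    [cross_val_predict(model, X_demo, y_demo, cv=3) for model in base_models]
)

# The meta-model is trained on the out-of-fold predictions
meta_model = KNeighborsRegressor(n_neighbors=3).fit(oof_preds, y_demo)

# For new rows: refit each base model on all data, then stack its predictions
X_new = rng.normal(size=(5, 4))
new_level1 = np.column_stack(
    [model.fit(X_demo, y_demo).predict(X_new) for model in base_models]
)
final_preds = meta_model.predict(new_level1)
print(final_preds.shape)
```

Training the meta-model on out-of-fold predictions keeps it from simply trusting whichever base model memorised the training rows best.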
