# Best Question Author Prediction - Enigma CodeFest - Analytics Vidya

## Problem Statement
* An online QnA platform has hired you as a data scientist to **identify the best questioning authors** on the platform. 
* Why? This identification will bring more insight into increasing the user engagement. 
* How? Given the tag of the question, number of views received, number of answers, username and reputation of the question author, the problem requires you to **predict the upvote count that the question will receive**.

## Data Dictionary

  | Variable    | Definition                                        |                             
  |-------------|---------------------------------------------------|
  | ID        	| Question ID                                       |                            
  | Tag       	| Anonymised tags representing question category    | 
  | Reputation	| Reputation score of question author               |      
  | Answers   	| Number of times question has been answered        | 
  | Username  	| Anonymised user id of question author             |    
  | Views     	| Number of times question has been viewed          | 
  | Upvotes   	| (Target) Number of upvotes for the question       | 

## Evaluation Metric

The evaluation metric for this competition is RMSE (root mean squared error)

## Tags

**Regression**

In [28]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

In [2]:
sns.set_style('whitegrid')

## Load Data

In [3]:
train = pd.read_csv('data/train_NIR5Yl1.csv', index_col='ID')
print('Train Data Size :',train.shape)
train.head()

Train Data Size : (330045, 6)


Unnamed: 0_level_0,Tag,Reputation,Answers,Username,Views,Upvotes
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
52664,a,3942.0,2.0,155623,7855.0,42.0
327662,a,26046.0,12.0,21781,55801.0,1175.0
468453,c,1358.0,4.0,56177,8067.0,60.0
96996,a,264.0,3.0,168793,27064.0,9.0
131465,c,4271.0,4.0,112223,13986.0,83.0


In [4]:
test = pd.read_csv('data/test_8i3B3FC.csv', index_col='ID')
print('Test Data Size :',test.shape)
test.head()

Test Data Size : (141448, 5)


Unnamed: 0_level_0,Tag,Reputation,Answers,Username,Views
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
366953,a,5645.0,3.0,50652,33200.0
71864,c,24511.0,6.0,37685,2730.0
141692,i,927.0,1.0,135293,21167.0
316833,i,21.0,6.0,166998,18528.0
440445,i,4475.0,10.0,53504,57240.0


## Pre-process Train Data

In [5]:
ytrain = train.pop('Upvotes')
xtrain = train
train = None
print(xtrain.head(), '\n\n\n', ytrain.head())

       Tag  Reputation  Answers  Username    Views
ID                                                
52664    a      3942.0      2.0    155623   7855.0
327662   a     26046.0     12.0     21781  55801.0
468453   c      1358.0      4.0     56177   8067.0
96996    a       264.0      3.0    168793  27064.0
131465   c      4271.0      4.0    112223  13986.0 


 ID
52664       42.0
327662    1175.0
468453      60.0
96996        9.0
131465      83.0
Name: Upvotes, dtype: float64


In [6]:
xtrain.Tag = xtrain.Tag.astype('category')
xtrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 330045 entries, 52664 to 300553
Data columns (total 5 columns):
Tag           330045 non-null category
Reputation    330045 non-null float64
Answers       330045 non-null float64
Username      330045 non-null int64
Views         330045 non-null float64
dtypes: category(1), float64(3), int64(1)
memory usage: 12.9 MB


In [7]:
test.Tag = test.Tag.astype('category')
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141448 entries, 366953 to 107271
Data columns (total 5 columns):
Tag           141448 non-null category
Reputation    141448 non-null float64
Answers       141448 non-null float64
Username      141448 non-null int64
Views         141448 non-null float64
dtypes: category(1), float64(3), int64(1)
memory usage: 5.5 MB


In [8]:
# Ref.: https://stackoverflow.com/a/48929642
class ModifiedLabelEncoder(LabelEncoder):

    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)

In [9]:
# Ref.: https://stackoverflow.com/questions/48994618/unable-to-use-featureunion-to-combine-processed-numeric-and-categorical-features
from sklearn.base import BaseEstimator, TransformerMixin

# Class that identifies Column type
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    
    def fit (self, X, y=None, **fit_params):
        return self
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)

numeric_cols = ['Reputation', 'Answers', 'Username', 'Views'] # list of numeric column names
categorical_cols = ['Tag'] # list of categorical column names

# Testing
print(ColumnSelector(columns=numeric_cols).fit_transform(xtrain).head())
print(ColumnSelector(columns=categorical_cols).fit_transform(xtrain).head())

        Reputation  Answers  Username    Views
ID                                            
52664       3942.0      2.0    155623   7855.0
327662     26046.0     12.0     21781  55801.0
468453      1358.0      4.0     56177   8067.0
96996        264.0      3.0    168793  27064.0
131465      4271.0      4.0    112223  13986.0
       Tag
ID        
52664    a
327662   a
468453   c
96996    a
131465   c


In [10]:
numeric_cols_pipe = make_pipeline(ColumnSelector(columns=numeric_cols),StandardScaler())
categorical_cols_pipe = make_pipeline(ColumnSelector(columns=categorical_cols), ModifiedLabelEncoder(), OneHotEncoder(sparse=False))

fu = make_union(numeric_cols_pipe, categorical_cols_pipe)

trans_vec = fu.fit_transform(xtrain)
print(trans_vec.shape)
print(trans_vec[:5])

  y = column_or_1d(y, warn=True)


(330045, 14)
[[-0.14157253 -0.53573597  1.5072655  -0.26915833  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.67523751  2.25794312 -1.21226978  0.32308687  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.23705919  0.02299985 -0.51337753 -0.26653963  0.          1.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.27748582 -0.25636806  1.7748667  -0.03188227  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.12941498  0.02299985  0.62542101 -0.19342614  0.          1.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]]


In [11]:
pca = PCA(n_components=5)
principalComponents = pca.fit(trans_vec)

In [12]:
pca.explained_variance_ratio_

array([0.31133663, 0.21337905, 0.1967692 , 0.10323501, 0.04480853])

In [13]:
pca_vec = pca.transform(trans_vec)

In [14]:
test_fu_vec = fu.transform(test)
test_fu_vec.shape

  y = column_or_1d(y, warn=True)


(141448, 14)

In [15]:
test_pca_vec = pca.transform(test_fu_vec)
test_pca_vec.shape

(141448, 5)

In [16]:
# Split train data-set
x_train, x_test, y_train, y_test = train_test_split(pca_vec, ytrain.values, train_size = 0.75, random_state = 42)



In [17]:
def conv2df(preds):
    df = pd.DataFrame(data={
        'ID': test.index.values,
        'Upvotes': preds
    })
    df['Upvotes'] = df.Upvotes.astype(int)
    return df

In [31]:
dtr = DecisionTreeRegressor()
br_est = BaggingRegressor(dtr, n_jobs=-1, random_state=42)
br_est.fit(pca_vec, ytrain.values)
print('BR(PCA) Score : ', br_est.score(pca_vec, ytrain.values))
preds = br_est.predict(test_pca_vec)
conv2df(preds).to_csv('data/output/bagging_regressor_pca.csv', index=False)

BR(PCA) Score :  0.9450033019853505


In [32]:
br_est = BaggingRegressor(DecisionTreeRegressor(), n_jobs=-1, random_state=42)
params = {
    'n_estimators': [10, 15, 30],
    'max_samples': [.7, .8, 1],
    'max_features': [.8, 1.0],
}
gsv = GridSearchCV(br_est, params, cv=3, verbose=5)
gsv.fit(pca_vec, ytrain.values)
preds = gsv.predict(test_pca_vec)
conv2df(preds).to_csv('data/output/bagging_regressor_gsv_pca.csv', index=False)

Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV] max_features=0.8, max_samples=0.7, n_estimators=10 ..............
[CV]  max_features=0.8, max_samples=0.7, n_estimators=10, score=0.6720704043027315, total=  16.8s
[CV] max_features=0.8, max_samples=0.7, n_estimators=10 ..............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.2s remaining:    0.0s


[CV]  max_features=0.8, max_samples=0.7, n_estimators=10, score=0.7093606870039354, total=  17.1s
[CV] max_features=0.8, max_samples=0.7, n_estimators=10 ..............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   43.5s remaining:    0.0s


[CV]  max_features=0.8, max_samples=0.7, n_estimators=10, score=0.690257097801859, total=  19.6s
[CV] max_features=0.8, max_samples=0.7, n_estimators=15 ..............


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.1min remaining:    0.0s


[CV]  max_features=0.8, max_samples=0.7, n_estimators=15, score=0.6950455799608812, total=  24.7s
[CV] max_features=0.8, max_samples=0.7, n_estimators=15 ..............


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.6min remaining:    0.0s


[CV]  max_features=0.8, max_samples=0.7, n_estimators=15, score=0.7144236074218909, total=  22.6s
[CV] max_features=0.8, max_samples=0.7, n_estimators=15 ..............
[CV]  max_features=0.8, max_samples=0.7, n_estimators=15, score=0.6904127204003673, total=  21.9s
[CV] max_features=0.8, max_samples=0.7, n_estimators=30 ..............
[CV]  max_features=0.8, max_samples=0.7, n_estimators=30, score=0.7119904930690566, total=  43.0s
[CV] max_features=0.8, max_samples=0.7, n_estimators=30 ..............
[CV]  max_features=0.8, max_samples=0.7, n_estimators=30, score=0.7525918443939215, total=  41.8s
[CV] max_features=0.8, max_samples=0.7, n_estimators=30 ..............
[CV]  max_features=0.8, max_samples=0.7, n_estimators=30, score=0.7160691773876554, total=  34.8s
[CV] max_features=0.8, max_samples=0.8, n_estimators=10 ..............
[CV]  max_features=0.8, max_samples=0.8, n_estimators=10, score=0.7555065579376714, total=  16.3s
[CV] max_features=0.8, max_samples=0.8, n_estimators=10 .

[CV]  max_features=1.0, max_samples=1, n_estimators=30, score=-0.003628222544123983, total=   7.5s


[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed: 25.3min finished


In [33]:
print('Best Score:', gsv.best_score_)
print('Best Params : ', gsv.best_params_)

Best Score: 0.7492535199303325
Best Params :  {'max_features': 1.0, 'max_samples': 0.7, 'n_estimators': 30}


In [34]:
br_est = BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_jobs=-1, random_state=42, n_estimators=100)
br_est.fit(pca_vec, ytrain.values)
print('BR Score : ', br_est.score(pca_vec, ytrain.values))

preds = br_est.predict(test_pca_vec)
conv2df(preds).to_csv('data/output/br1.csv', index=False)

BR Score :  0.8250594799880281


In [39]:
br = BaggingRegressor(DecisionTreeRegressor(), n_jobs=-1, random_state=42)
params = {
    'n_estimators': [25, 50, 75],
    'max_samples': [.7],
}
gsv = GridSearchCV(br, params, cv=3, verbose=5)
gsv.fit(trans_vec, ytrain.values)
print('BR-GSV Score : ', br_est.score(trans_vec, ytrain.values))
preds = gsv.predict(test_fu_vec)
conv2df(preds).to_csv('data/output/bagging_regressor_gsv.csv', index=False)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] max_samples=0.7, n_estimators=25 ................................
[CV]  max_samples=0.7, n_estimators=25, score=0.84809742413122, total=  33.3s
[CV] max_samples=0.7, n_estimators=25 ................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.8s remaining:    0.0s


[CV]  max_samples=0.7, n_estimators=25, score=0.8596348903958209, total=  30.8s
[CV] max_samples=0.7, n_estimators=25 ................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV]  max_samples=0.7, n_estimators=25, score=0.8330459801276503, total=  28.3s
[CV] max_samples=0.7, n_estimators=50 ................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.9min remaining:    0.0s


[CV]  max_samples=0.7, n_estimators=50, score=0.8562313425378281, total=  58.2s
[CV] max_samples=0.7, n_estimators=50 ................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.2min remaining:    0.0s


[CV]  max_samples=0.7, n_estimators=50, score=0.8657241411038565, total= 1.0min
[CV] max_samples=0.7, n_estimators=50 ................................
[CV]  max_samples=0.7, n_estimators=50, score=0.8293827773373756, total= 1.0min
[CV] max_samples=0.7, n_estimators=75 ................................
[CV]  max_samples=0.7, n_estimators=75, score=0.8433411604010346, total= 1.4min
[CV] max_samples=0.7, n_estimators=75 ................................
[CV]  max_samples=0.7, n_estimators=75, score=0.86733836975034, total= 1.4min
[CV] max_samples=0.7, n_estimators=75 ................................
[CV]  max_samples=0.7, n_estimators=75, score=0.8291891910901801, total= 1.8min
[CV] max_samples=0.7, n_estimators=100 ...............................
[CV]  max_samples=0.7, n_estimators=100, score=0.8376999560661825, total= 2.2min
[CV] max_samples=0.7, n_estimators=100 ...............................
[CV]  max_samples=0.7, n_estimators=100, score=0.8698878397735448, total= 2.3min
[CV] max_sampl

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 20.1min finished


BR-GSV Score :  -0.02170623333284505


In [41]:
gsv.best_params_

{'max_samples': 0.7, 'n_estimators': 50}

In [22]:
br_est = BaggingRegressor(DecisionTreeRegressor(max_depth=9), n_jobs=-1, random_state=42, n_estimators=25, verbose=1)
abr_est = AdaBoostRegressor(br_est, n_estimators=25, learning_rate=.45)
abr_est.fit(trans_vec, ytrain.values)
print('AdaBoost Score : ', abr_est.score(trans_vec, ytrain.values))
preds = abr_est.predict(test_fu_vec)
conv2df(preds).to_csv('data/output/adaboost_regressor.csv', index=False)

[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:   16.1s remaining:   16.1s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:   16.9s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.3s remaining:    3.3s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.5s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:   16.3s remaining:   16.3s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:   17.0s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.4s remaining:    3.4s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.6s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:   17.1s remaining:   17.1s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:   17.8s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.5s remaining:    3.5s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.9s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:   16.4s remaining:   16.4s
[Parallel(n_jobs=4)]

[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.4s remaining:    3.4s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.6s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.6s remaining:    3.6s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.7s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.3s remaining:    3.3s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.4s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.3s remaining:    3.3s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.4s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.3s remaining:    3.3s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.4s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.1s remaining:    3.1s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.3s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    3.2s remaining:    3.2s
[Parallel(n_jobs=4)]

AdaBoost Score :  0.9897399847860748


[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    2.8s remaining:    2.8s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    3.9s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    2.8s remaining:    2.8s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    3.9s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    2.7s remaining:    2.7s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    3.8s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    2.7s remaining:    2.7s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    3.9s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    2.9s remaining:    2.9s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    4.0s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    2.6s remaining:    2.6s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    3.8s finished
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    2.7s remaining:    2.7s
[Parallel(n_jobs=4)]

In [32]:
# Ref.: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
class ModelTransformer(TransformerMixin):

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict(X))

In [52]:
dtr = DecisionTreeRegressor(random_state=42)
br_est = BaggingRegressor(dtr, n_jobs=-1, random_state=42)
gbr_est = GradientBoostingRegressor(random_state=42, n_estimators=100, learning_rate=0.25, max_depth=7)
knr = KNeighborsRegressor(n_neighbors=5)

efu = FeatureUnion([
        ('knn', ModelTransformer(knr)),
        ('gbr', ModelTransformer(gbr_est)),
        ('br', ModelTransformer(br_est)),
#         ('etr', ModelTransformer(ExtraTreesRegressor())),
#         ('rfr', ModelTransformer(RandomForestRegressor())),
#         ('par', ModelTransformer(PassiveAggressiveRegressor())),
#         ('en', ModelTransformer(ElasticNet())),
#         ('cluster', ModelTransformer(KMeans(n_clusters=2)))
    ])

pipe_est = Pipeline([
    ('estimators', efu),
    ('estimator', KNeighborsRegressor(n_neighbors=1))
])

In [53]:
pipe_est.fit(trans_vec, ytrain.values)

Pipeline(memory=None,
     steps=[('estimators', FeatureUnion(n_jobs=1,
       transformer_list=[('knn', <__main__.ModelTransformer object at 0x00000196BF1EDE48>), ('gbr', <__main__.ModelTransformer object at 0x00000196BF1ED6A0>), ('br', <__main__.ModelTransformer object at 0x00000196BF1EDE80>)],
       transformer_weights=None)), ('estimator', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=1, p=2,
          weights='uniform'))])

In [51]:
preds = pipe_est.predict(test_fu_vec)
conv2df(preds).to_csv('data/output/ensembles.csv', index=False)