# Best Question Author Prediction - Enigma CodeFest - Analytics Vidhya

## Problem Statement
* An online QnA platform has hired you as a data scientist to **identify the best question authors** on the platform.
* Why? Identifying these authors gives insight into how to increase user engagement.
* How? Given a question's tag, the number of views and answers it received, and the username and reputation of its author, **predict the number of upvotes the question will receive**.

## Data Dictionary

  | Variable   | Definition                                     |
  |------------|------------------------------------------------|
  | ID         | Question ID                                    |
  | Tag        | Anonymised tags representing question category |
  | Reputation | Reputation score of question author            |
  | Answers    | Number of times question has been answered     |
  | Username   | Anonymised user id of question author          |
  | Views      | Number of times question has been viewed       |
  | Upvotes    | (Target) Number of upvotes for the question    |

## Evaluation Metric

The evaluation metric for this competition is RMSE (root mean squared error).
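
For reference, RMSE is the square root of the mean of the squared differences between the true and predicted upvote counts. The helper below is a minimal sketch of that definition; the function name `rmse` and the toy values are illustrative assumptions, not part of the competition kit.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only, not competition data
example_score = rmse([42, 1175, 60], [40, 1100, 65])
print('Example RMSE:', example_score)
```

Because the errors are squared before averaging, a single badly mispredicted high-upvote question can dominate the score.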

## Tags

**Regression**

In [61]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

from xgboost import XGBRegressor

In [62]:
sns.set_style('whitegrid')

## Load Data

In [63]:
train = pd.read_csv('data/train_NIR5Yl1.csv', index_col='ID')
# train = pd.read_csv('data/preprocessed/train.csv', index_col='ID')
print('Train Data Size :',train.shape)
train.head()

Train Data Size : (330045, 6)


ID,Tag,Reputation,Answers,Username,Views,Upvotes
52664,a,3942.0,2.0,155623,7855.0,42.0
327662,a,26046.0,12.0,21781,55801.0,1175.0
468453,c,1358.0,4.0,56177,8067.0,60.0
96996,a,264.0,3.0,168793,27064.0,9.0
131465,c,4271.0,4.0,112223,13986.0,83.0


In [64]:
test = pd.read_csv('data/test_8i3B3FC.csv', index_col='ID')
print('Test Data Size :',test.shape)
test.head()

Test Data Size : (141448, 5)


ID,Tag,Reputation,Answers,Username,Views
366953,a,5645.0,3.0,50652,33200.0
71864,c,24511.0,6.0,37685,2730.0
141692,i,927.0,1.0,135293,21167.0
316833,i,21.0,6.0,166998,18528.0
440445,i,4475.0,10.0,53504,57240.0


In [60]:
'''
# Removing Outliers
# train.groupby('Tag').describe().T
print('Before Removing Outliers, Shape :', train.shape)
outliers = train[
    (train.Upvotes > train.Views) | 
    (train.Answers > 70) | 
    (train.Views > 500000) |
    (train.Reputation > 1000000)
]
train = train[~train.index.isin(outliers.index)]
print('After Removing Outliers, Shape :', train.shape)
'''

## Pre-process Train Data

In [65]:
ytrain = train.pop('Upvotes')
xtrain = train
train = None
print(xtrain.head(), '\n\n\n', ytrain.head())

       Tag  Reputation  Answers  Username    Views
ID                                                
52664    a      3942.0      2.0    155623   7855.0
327662   a     26046.0     12.0     21781  55801.0
468453   c      1358.0      4.0     56177   8067.0
96996    a       264.0      3.0    168793  27064.0
131465   c      4271.0      4.0    112223  13986.0 


 ID
52664       42.0
327662    1175.0
468453      60.0
96996        9.0
131465      83.0
Name: Upvotes, dtype: float64


In [48]:
# xtrain = xtrain.drop(columns=['Username'])
# xtrain.head(3)

In [49]:
# test = test.drop(columns=['Username'])
# test.head(3)

In [66]:
xtrain.Reputation = xtrain.Reputation.astype(int)
xtrain.Answers = xtrain.Answers.astype(int)
xtrain.Views = xtrain.Views.astype(int)
xtrain.Username = xtrain.Username.astype('category')
print(xtrain.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 330045 entries, 52664 to 300553
Data columns (total 5 columns):
Tag           330045 non-null object
Reputation    330045 non-null int32
Answers       330045 non-null int32
Username      330045 non-null category
Views         330045 non-null int32
dtypes: category(1), int32(3), object(1)
memory usage: 16.2+ MB
None


In [67]:
test.Reputation = test.Reputation.astype(int)
test.Answers = test.Answers.astype(int)
test.Views = test.Views.astype(int)
test.Username = test.Username.astype('category')
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141448 entries, 366953 to 107271
Data columns (total 5 columns):
Tag           141448 non-null object
Reputation    141448 non-null int32
Answers       141448 non-null int32
Username      141448 non-null category
Views         141448 non-null int32
dtypes: category(1), int32(3), object(1)
memory usage: 7.4+ MB


In [68]:
xtrain.Tag = xtrain.Tag.astype('category')
xtrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 330045 entries, 52664 to 300553
Data columns (total 5 columns):
Tag           330045 non-null category
Reputation    330045 non-null int32
Answers       330045 non-null int32
Username      330045 non-null category
Views         330045 non-null int32
dtypes: category(2), int32(3)
memory usage: 14.0 MB


In [69]:
test.Tag = test.Tag.astype('category')
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141448 entries, 366953 to 107271
Data columns (total 5 columns):
Tag           141448 non-null category
Reputation    141448 non-null int32
Answers       141448 non-null int32
Username      141448 non-null category
Views         141448 non-null int32
dtypes: category(2), int32(3)
memory usage: 6.5 MB


In [70]:
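class ModifiedLabelEncoder(LabelEncoder):
    """LabelEncoder that returns a 2-D column so it can sit inside a Pipeline
    ahead of OneHotEncoder, which expects 2-D input."""

    def fit_transform(self, y, *args, **kwargs):
        # Pipeline calls fit_transform(X, y); the extra arguments are ignored
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)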

In [71]:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)

numeric_cols = ['Reputation', 'Answers', 'Username', 'Views'] # columns sent to StandardScaler; Username is an anonymised id scaled as if numeric
# numeric_cols = ['Reputation', 'Answers', 'Views']
categorical_cols = ['Tag'] # list of categorical column names

In [72]:
numeric_cols_pipe = make_pipeline(ColumnSelector(columns=numeric_cols),StandardScaler())
categorical_cols_pipe = make_pipeline(ColumnSelector(columns=categorical_cols), ModifiedLabelEncoder(), OneHotEncoder(sparse=False))
fu = make_union(numeric_cols_pipe, categorical_cols_pipe)

trans_vec = fu.fit_transform(xtrain)
print(trans_vec.shape)
print(trans_vec[:5])

  y = column_or_1d(y, warn=True)


(330045, 14)
[[-0.14157253 -0.53573597  1.5072655  -0.26915833  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.67523751  2.25794312 -1.21226978  0.32308687  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.23705919  0.02299985 -0.51337753 -0.26653963  0.          1.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.27748582 -0.25636806  1.7748667  -0.03188227  1.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [-0.12941498  0.02299985  0.62542101 -0.19342614  0.          1.
   0.          0.          0.          0.          0.          0.
   0.          0.        ]]


In [16]:
# Split train data-set
# x_train, x_test, y_train, y_test = train_test_split(trans_vec, ytrain.values, train_size = 0.75, random_state = 42)
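
The commented-out split above hints at a local hold-out check, which would give an RMSE estimate without a leaderboard submission. The following is a minimal sketch of that idea on synthetic data; `X_demo` and `y_demo` are made-up stand-ins for `trans_vec` and `ytrain.values`, and the shallow decision tree is only a placeholder model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for trans_vec and ytrain.values (assumption: toy data only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=1000)

# Same split parameters as the commented-out line: 75% train, 25% validation
x_tr, x_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=42
)

model = DecisionTreeRegressor(random_state=42, max_depth=5)
model.fit(x_tr, y_tr)

# RMSE on the held-out quarter, matching the competition metric
val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(x_val)))
print('Hold-out RMSE:', val_rmse)
```

In the notebook itself, the same pattern would apply with `trans_vec` and `ytrain.values` in place of the synthetic arrays.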

In [73]:
test_fu = fu.transform(test)
test_fu.shape

  y = column_or_1d(y, warn=True)


(141448, 14)

In [74]:
def conv2df(preds):
    df = pd.DataFrame(data={
        'ID': test.index.values,
        'Upvotes': preds
    })
    df['Upvotes'] = df.Upvotes.astype(int)
    return df

In [75]:
dtr = DecisionTreeRegressor(random_state=42)
br = BaggingRegressor(dtr, n_jobs=-1, random_state=42)
br.fit(trans_vec, ytrain.values)
preds = br.predict(test_fu)
conv2df(preds).to_csv('data/output/bagging_regressor.csv', index=False)

In [27]:
dtr = DecisionTreeRegressor(random_state=42)
br = BaggingRegressor(dtr, n_jobs=-1, random_state=42)
params = {
    'n_estimators': [50,100],
    'max_samples': [.7],
    'max_features': [1.0],
}
gsv = GridSearchCV(br, params, cv=3, verbose=1, n_jobs=-1)
gsv.fit(trans_vec, ytrain.values)
preds = gsv.predict(test_fu)
conv2df(preds).to_csv('data/output/bagging_regressor.csv', index=False)

In [32]:
knr_est = KNeighborsRegressor(n_neighbors=5, n_jobs=-1)
knr_est.fit(trans_vec, ytrain.values)
preds = knr_est.predict(test_fu)
conv2df(preds).to_csv('data/output/knr5.csv', index=False)

In [42]:
gbr_est = GradientBoostingRegressor(random_state=42, n_estimators=250, learning_rate=0.15)
gbr_est.fit(trans_vec, ytrain.values)
preds = gbr_est.predict(test_fu)
conv2df(preds).to_csv('data/output/gbr.csv', index=False)

In [55]:
rf_est = RandomForestRegressor(random_state=42, n_jobs=-1, verbose=1, n_estimators=100, max_depth=7)
rf_est.fit(trans_vec, ytrain.values)
preds = rf_est.predict(test_fu)
conv2df(preds).to_csv('data/output/rf.csv', index=False)

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   20.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   45.2s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.3s finished


In [59]:
etr_est = ExtraTreesRegressor(random_state=42, n_jobs=-1, verbose=1, n_estimators=100, max_depth=9)
etr_est.fit(trans_vec, ytrain.values)
preds = etr_est.predict(test_fu)
conv2df(preds).to_csv('data/output/etr.csv', index=False)

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   20.5s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   47.2s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.3s finished


In [19]:
xgbr_est = XGBRegressor(random_state=42, n_jobs=-1)
xgbr_est.fit(trans_vec, ytrain.values)
preds = xgbr_est.predict(test_fu)
preds = preds.astype(np.int64)
# Shift every prediction up by |min(preds)| so that no predicted upvote count is negative
preds = preds + np.abs(np.min(preds))
np.min(preds), np.max(preds)
conv2df(preds).to_csv('data/output/xgbr.csv', index=False) # LB Score 1266
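
The XGBoost cell removes negative predictions by adding `np.abs(np.min(preds))` to every value, which also raises predictions that were already non-negative. The sketch below contrasts that shift with clipping at zero on made-up numbers; `clip_to_non_negative` is a hypothetical alternative, not what produced the leaderboard score above, and whether it scores better was not tested here.

```python
import numpy as np

def shift_to_non_negative(preds):
    """The notebook's approach: truncate to integers, then add |min| to every value."""
    preds = np.asarray(preds).astype(np.int64)
    return preds + np.abs(np.min(preds))

def clip_to_non_negative(preds):
    """Hypothetical alternative: floor negatives at zero, leave other values alone."""
    return np.clip(np.asarray(preds), 0, None).astype(np.int64)

raw = np.array([-3.2, 0.5, 10.9])  # made-up raw model outputs
shifted = shift_to_non_negative(raw)
clipped = clip_to_non_negative(raw)
print('shifted:', shifted)
print('clipped:', clipped)
```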