## a) Given a user’s activity, predict the likelihood of the user clicking on a search result.

In [95]:
import pandas as pd
import numpy as np

In [110]:
df = pd.read_table('user-ct-test-collection-02.txt')

In [111]:
# Rename the columns to lower case, since mixed case is harder to type.
# 'query' becomes 'query_' because it would otherwise collide with the pandas method DataFrame.query.
df.columns = ['anonid', 'query_', 'querytime', 'itemrank', 'clickurl']
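
The name collision mentioned in the comment can be seen on a tiny frame. This is only an illustrative sketch: the one-row frame and its `ItemRank` column are made up, not taken from the data file.

```python
import pandas as pd

# Hypothetical one-row frame with a column literally named "query".
df = pd.DataFrame({"query": ["family guy"], "ItemRank": [1.0]})

# Attribute access resolves to the DataFrame.query method, not to the column.
print(callable(df.query))

# After renaming, attribute access reaches the column unambiguously.
df.columns = ["query_", "itemrank"]
print(df.query_.tolist())
```

Bracket access (`df["query"]`) would also work without renaming; the rename simply keeps attribute-style access usable.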

In [112]:
df.querytime = pd.to_datetime(df.querytime)

In [113]:
df.head(5)

Unnamed: 0,anonid,query_,querytime,itemrank,clickurl
0,479,family guy,2006-03-01 16:01:20,,
1,479,also sprach zarathustra,2006-03-02 14:48:55,,
2,479,family guy movie references,2006-03-03 22:37:46,1.0,http://www.familyguyfiles.com
3,479,top grossing movies of all time,2006-03-03 22:42:42,1.0,http://movieweb.com
4,479,top grossing movies of all time,2006-03-03 22:42:42,2.0,http://www.imdb.com


## b) Feature engineering is a big part of a data science role. You will need to identify/derive additional features for this task.

The features are generated as follows.

In [114]:
def feature_engineer(x):
    # Take the rows of one (anonid, query_, querytime) group and return its features as a pd.Series.
    ret = dict()
    # Number of rows in the group.
    ret['n_click'] = len(x)
    # Target: 1 if every row in the group has a click URL, otherwise 0.
    ret['target'] = int(x.clickurl.notnull().all())
    return pd.Series(ret)

# The information available in advance is anonid, query_ and querytime.
# Group the dataframe by these keys and generate additional features per group.
# The only feature that can be derived from itemrank and clickurl is the target.
# Anything else would leak the label: both columns are always null if the user didn't click.
df = df.groupby(['anonid','query_','querytime']).apply(feature_engineer).reset_index()
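
To make the grouping step concrete, here is a minimal sketch on a made-up four-row log. It uses named aggregation rather than the notebook's `apply`, and the frame `raw` and its URLs are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the raw log: user 1 issues query "a" without a click
# and query "b" with two clicks; user 2 issues query "c" without a click.
raw = pd.DataFrame({
    "anonid": [1, 1, 1, 2],
    "query_": ["a", "b", "b", "c"],
    "querytime": pd.to_datetime([
        "2006-03-01 10:00:00", "2006-03-01 10:05:00",
        "2006-03-01 10:05:00", "2006-03-02 09:00:00",
    ]),
    "clickurl": [None, "http://x.com", "http://y.com", None],
})

keys = ["anonid", "query_", "querytime"]
events = (
    raw.groupby(keys)["clickurl"]
       .agg(n_click="size", target=lambda s: int(s.notnull().all()))
       .reset_index()
)
print(events[["query_", "n_click", "target"]])
```

In this toy example every group with more than one row has `target == 1`, so a row count such as `n_click` may itself carry label information; that is worth checking on the real data before keeping it as a feature.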

In [115]:
# Generate further features from columns such as query_ and querytime.
# They are computed per anonid and merged back into the dataframe afterwards.

def get_tally(x):
    ret = dict()
    # Number of query events for this user.
    ret['n_query'] = len(x)
    # Query length in characters.
    q_len = x.query_.str.len()
    ret['query_len_sum'] = q_len.sum()
    ret['query_len_mean'] = q_len.mean()
    # Query length in space-separated words.
    q_word_len = x.query_.str.split(' ').str.len()
    ret['query_word_sum'] = q_word_len.sum()
    ret['query_word_mean'] = q_word_len.mean()
    # First and last query time for this user.
    t_min, t_max = x.querytime.min(), x.querytime.max()
    # NOTE: .seconds is only the seconds component of the timedelta (whole days are dropped);
    # .total_seconds() would give the full span.
    ret['time_span'] = (t_max - t_min).seconds
    ret['weekday_min'] = t_min.weekday()
    ret['day_min'] = t_min.day
    ret['month_min'] = t_min.month
    ret['weekday_max'] = t_max.weekday()
    ret['day_max'] = t_max.day
    ret['month_max'] = t_max.month
    return pd.Series(ret)

tally = df.groupby('anonid').apply(get_tally)
df = df.merge(tally, on='anonid', how='left')
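
The `time_span` feature relies on the `.seconds` attribute of a timedelta, which behaves differently from `.total_seconds()`. The sketch below uses two timestamps that appear in the sample rows for user 479; they are not necessarily that user's first and last queries.

```python
import pandas as pd

first = pd.Timestamp("2006-03-01 16:01:20")
last = pd.Timestamp("2006-04-28 22:19:18")
span = last - first

print(span.days)             # whole days in the span
print(span.seconds)          # seconds left over once whole days are removed
print(span.total_seconds())  # the full span expressed in seconds
```

If the intent is "how long was this user active", `.total_seconds()` is probably the better choice, since `.seconds` never exceeds one day.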

In [116]:
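# Flag queries containing the literal text 'www.' (regex=False, so '.' is not treated as a wildcard).
df['url_in_query'] = df.query_.str.contains('www.', regex=False).astype(int)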
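
`Series.str.contains` treats its pattern as a regular expression by default, so the dot in `'www.'` matches any character. A small sketch with made-up queries shows the difference:

```python
import pandas as pd

# Hypothetical queries, chosen so that the two modes disagree on the second one.
queries = pd.Series(["www.google.com", "wwwx files", "family guy"])

as_regex = queries.str.contains("www.")                 # '.' matches any character
as_literal = queries.str.contains("www.", regex=False)  # '.' must be a literal dot

print(as_regex.tolist())
print(as_literal.tolist())
```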

In [117]:
df

Unnamed: 0,anonid,query_,querytime,n_click,target,n_query,query_len_sum,query_len_mean,query_word_sum,query_word_mean,time_span,weekday_min,day_min,month_min,weekday_max,day_max,month_max,url_in_query
0,479,6 6 06,2006-04-28 22:19:18,1,0,88.0,1635.0,18.579545,254.0,2.886364,31588.0,2.0,1.0,3.0,6.0,28.0,5.0,0
1,479,allegory of the cave,2006-03-06 22:03:19,3,1,88.0,1635.0,18.579545,254.0,2.886364,31588.0,2.0,1.0,3.0,6.0,28.0,5.0,0
2,479,also sprach zarathustra,2006-03-02 14:48:55,1,0,88.0,1635.0,18.579545,254.0,2.886364,31588.0,2.0,1.0,3.0,6.0,28.0,5.0,0
3,479,average tax refund in 2005,2006-04-07 01:54:56,1,1,88.0,1635.0,18.579545,254.0,2.886364,31588.0,2.0,1.0,3.0,6.0,28.0,5.0,0
4,479,bose,2006-03-03 23:30:11,1,1,88.0,1635.0,18.579545,254.0,2.886364,31588.0,2.0,1.0,3.0,6.0,28.0,5.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2868603,24969423,my space. com,2006-05-31 19:03:32,1,1,7.0,155.0,22.142857,24.0,3.428571,809.0,2.0,31.0,5.0,2.0,31.0,5.0,0
2868604,24969423,my space. com 3131560415,2006-05-31 19:02:36,1,0,7.0,155.0,22.142857,24.0,3.428571,809.0,2.0,31.0,5.0,2.0,31.0,5.0,0
2868605,24969423,my space. com 3131560415,2006-05-31 19:03:16,1,0,7.0,155.0,22.142857,24.0,3.428571,809.0,2.0,31.0,5.0,2.0,31.0,5.0,0
2868606,24969423,my space.com,2006-05-31 19:12:00,1,0,7.0,155.0,22.142857,24.0,3.428571,809.0,2.0,31.0,5.0,2.0,31.0,5.0,0


## c) How will you choose an appropriate model?

I used LightGBM, a gradient-boosted decision tree library that performs well in competitions on the data science platform Kaggle. It also does not require feature scaling or imputation of missing values.

In [94]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score

In [118]:
X = df.drop(['anonid', 'query_', 'querytime', 'target'], axis=1)
y = df.target

## d) Cross-validation is an important part of developing models. How will you cross-validate the model?
Split the dataset with StratifiedKFold, train and evaluate the model on each split, and collect the predictions for every held-out fold into a single out-of-fold array.
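
Because several features are computed per `anonid`, one thing worth checking is whether the same user ends up on both sides of a split. The sketch below is not what this notebook does: it compares `StratifiedKFold` with scikit-learn's `GroupKFold` on synthetic data, where the users, labels and features are all made up.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 5)      # 20 hypothetical users, 5 queries each
y = rng.integers(0, 2, size=len(groups))  # hypothetical click labels
X = rng.normal(size=(len(groups), 3))     # hypothetical features

def shared_users(splitter, **kwargs):
    """Count the users present in both the train and test part of each fold."""
    return [
        len(set(groups[train]) & set(groups[test]))
        for train, test in splitter.split(X, y, **kwargs)
    ]

stratified = shared_users(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
grouped = shared_users(GroupKFold(n_splits=5), groups=groups)
print(stratified)
print(grouped)
```

With `GroupKFold` no user is shared between train and test, which gives a stricter estimate of performance on unseen users. It does not preserve the class ratio per fold the way `StratifiedKFold` does.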

## e) What metrics will you consider when reporting the reliability of the model?

Precision. I am not sure what the objective of predicting the click likelihood is. However, if the goal is to get users to click on advertisements, then the users predicted to click on a search result should actually click on it. In that case, precision is the most suitable metric.
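
As a reminder of what precision measures, here is a small worked sketch in plain NumPy. The labels and probabilities are made up, and `precision_recall` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical true labels and predicted click probabilities.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.7, 0.4, 0.3, 0.2, 0.1])

def precision_recall(y_true, y_prob, threshold):
    """Return (precision, recall) after thresholding the probabilities."""
    y_hat = (y_prob > threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true == 1)).sum())
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.5, 0.75):
    print(threshold, precision_recall(y_true, y_prob, threshold))
```

Raising the threshold increases precision here at the cost of recall, so a precision figure is easier to interpret when the threshold and the recall are reported with it.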

In [119]:
skf = StratifiedKFold(n_splits=5)
oof = np.zeros(len(y))  # out-of-fold predictions, filled in fold by fold

# LightGBM hyperparameters
lgbm_params = {
    # binary classification problem
    'objective': 'binary',
}

for train_index, test_index in skf.split(X, y):
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]

    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

    # Train with early stopping on the held-out fold.
    # (LightGBM 4.0 and later expect callbacks=[lgb.early_stopping(100)]
    # in place of the early_stopping_rounds argument.)
    model = lgb.train(lgbm_params, lgb_train,
                      valid_sets=lgb_eval,
                      num_boost_round=10000,
                      early_stopping_rounds=100)

    # Predict click probabilities for the held-out fold and store them.
    y_pred = model.predict(X_test, num_iteration=model.best_iteration)
    oof[test_index] = y_pred

# Threshold the out-of-fold probabilities at 0.5 and compute precision.
oof_label = (oof > 0.5).astype(int)
score = precision_score(y, oof_label)

[LightGBM] [Info] Number of positive: 961197, number of negative: 1333689
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1636
[LightGBM] [Info] Number of data points in the train set: 2294886, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.418843 -> initscore=-0.327525
[LightGBM] [Info] Start training from score -0.327525
[1]	valid_0's binary_logloss: 0.662311
Training until validation scores don't improve for 100 rounds
[2]	valid_0's binary_logloss: 0.649562
[3]	valid_0's binary_logloss: 0.638912
[4]	valid_0's binary_logloss: 0.630717
[5]	valid_0's binary_logloss: 0.624077
[6]	valid_0's binary_logloss: 0.618363
[7]	valid_0's binary_logloss: 0.614543
[8]	valid_0's binary_logloss: 0.610594
[9]	valid_0's binary_logloss: 0.607319
[10]	valid_0's binary_logloss: 0.605476
[11]	valid_0's binary_logloss: 0.603046
[12]	valid_0's binary_lo

In [120]:
score

0.7642989173488249

The out-of-fold precision at a 0.5 threshold is about 0.76, which is a moderate score.
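
One way to put this number in context is to compare it with the share of positive examples, which is the precision a model would get by predicting a click for every row. The sketch below takes the class counts from the LightGBM log of the first training fold, so it only approximates the rate for the full dataset.

```python
# Class counts reported in the LightGBM log for the first training fold.
n_pos, n_neg = 961_197, 1_333_689

base_rate = n_pos / (n_pos + n_neg)  # precision of an "always click" predictor
oof_precision = 0.7642989173488249   # out-of-fold precision reported above
lift = oof_precision / base_rate

print(round(base_rate, 4), round(lift, 2))
```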