# Notebook for producing Kaggle submissions

This notebook is used only to produce Kaggle predictions. It has few comments because the same code is explained in the other notebooks.

In [1]:
from PrepareData import prepare_data
from lightgbm.sklearn import LGBMRanker
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.naive_bayes import GaussianNB
from rankers.Stacker import Stacker
from rankers.Ranker import Ranker
import pandas as pd

In [2]:
# 10 weeks in total are used for training; validation is done on the last 5 weeks.
# This is due to memory constraints. Ideally, nr_validation_weeks would equal nr_training_weeks, so that the validation scores represent the ability of the rankers as finally trained.
nr_training_weeks = 10
nr_validation_weeks = 5
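
The split described in the comments above can be sketched as follows. This is only an illustration: it assumes weeks are numbered consecutively and that the validation weeks are the most recent ones in the training window. `split_weeks` and the value `last_week=104` are hypothetical and are not part of `prepare_data` or `Stacker`.

```python
def split_weeks(last_week, nr_training_weeks, nr_validation_weeks):
    """Return (base_training_weeks, validation_weeks) as lists of week numbers.

    Assumption: the validation weeks are the most recent ones in the window.
    """
    first_week = last_week - nr_training_weeks + 1
    all_weeks = list(range(first_week, last_week + 1))
    cut = nr_training_weeks - nr_validation_weeks
    return all_weeks[:cut], all_weeks[cut:]


# last_week=104 is a made-up value used only for this example.
base_weeks, val_weeks = split_weeks(last_week=104, nr_training_weeks=10, nr_validation_weeks=5)
print(base_weeks)  # [95, 96, 97, 98, 99]
print(val_weeks)   # [100, 101, 102, 103, 104]
```

Under these assumptions, the base rankers see only 5 weeks while validation scores are computed, but 10 weeks once they are retrained on the full training set.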

In [3]:
train, test, train_baskets, bestsellers_previous_week = prepare_data(kaggle_submission=True, nr_training_weeks=nr_training_weeks)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
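
The message above is pandas' `SettingWithCopyWarning`. It typically appears when a column is assigned on a frame that was itself obtained by filtering another frame. A common way to avoid it is to take an explicit copy before assigning, as in this small sketch. The frame and column names are made up for illustration and do not come from `prepare_data`.

```python
import pandas as pd

df = pd.DataFrame({"week": [1, 1, 2, 2], "score": [0.2, 0.9, 0.5, 0.1]})

# An explicit copy makes it unambiguous that `recent` is a new object,
# so adding a column to it does not trigger the warning.
recent = df[df["week"] == 2].copy()
recent["rank"] = recent["score"].rank(ascending=False)

print(recent)
print("rank" in df.columns)  # False: the original frame is untouched
```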


In [4]:
columns_to_use = ['article_id', 'product_type_no', 'graphical_appearance_no', 'colour_group_code', 'perceived_colour_value_id',
'perceived_colour_master_id', 'department_no', 'index_code',
'index_group_no', 'section_no', 'garment_group_no', 'FN', 'Active',
'club_member_status', 'fashion_news_frequency', 'age', 'postal_code', 'bestseller_rank']

In [5]:
test_X = test  # full frame; columns_to_use is passed separately when predicting

In [6]:
lgbm_ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=1,
    importance_type='gain',
    verbose=0
)

In [7]:
adaboost_ranker = Ranker(AdaBoostClassifier())

In [8]:
gnb_ranker = Ranker(GaussianNB())

## With GNB

### Metamodel (using AdaBoost ranker as metamodel)

In [9]:
stacker = Stacker([lgbm_ranker, gnb_ranker, adaboost_ranker], Ranker(AdaBoostClassifier()), use_groups=[True, False, False])
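
The `use_groups` flags presumably mark which base rankers need query-group information. LightGBM's ranking interface expects the rows to be sorted by query and takes a list of group sizes through the `group` argument of `fit`. Here is a minimal sketch of computing such sizes, assuming a query is a `(week, customer_id)` pair. `group_sizes` is a hypothetical helper, not part of `Stacker`.

```python
from itertools import groupby


def group_sizes(query_keys):
    """Sizes of consecutive runs of equal keys; rows must already be sorted by key."""
    return [sum(1 for _ in run) for _, run in groupby(query_keys)]


# Made-up (week, customer_id) keys, one per candidate row.
keys = [(95, "c1"), (95, "c1"), (95, "c2"), (96, "c1"), (96, "c1"), (96, "c1")]
print(group_sizes(keys))  # [2, 1, 3]
```

The sizes sum to the number of rows, which is what a group-aware ranker needs in order to know where one customer's candidate list ends and the next begins.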

In [10]:
stacker.fit(train, columns_to_use, nr_validation_weeks=nr_validation_weeks)

computing validation predictions for each of the base rankers...
training metamodel


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_no_val[f"train{i}"] = train.groupby(['week', 'customer_id'])[f"ranker{i}"].rank(ascending=False)              #ascending so "best rank" is always the same number (1) - same done when predicting

metamodel training shape: (5049557, 3)
Computing scores on validatation...
retraining base rankers on full training set...


<rankers.Stacker.Stacker at 0x7fc0b07bfbb0>

In [11]:
test['ranker_meta_model'] = stacker.predict(test_X, columns_to_use, weighting="metamodel")

Predicting with metamodel
Prediction matrix shape: (6610150, 3)
prediction matrix:
[[ 1.   6.   1. ]
 [ 3.5  9.   2. ]
 [ 3.5  8.   4. ]
 ...
 [12.5  1.  15. ]
 [12.5  8.  13. ]
 [12.5  4.  16. ]]
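
Each column of the prediction matrix holds one base ranker's rank for a candidate, with 1 as the best. Fractional values such as 3.5 are consistent with tied scores sharing the mean of the positions they occupy, which is the default tie handling of pandas' `rank`. The sketch below reproduces that behaviour in plain Python. `average_ranks` is a hypothetical helper written for illustration, not code from `Stacker`.

```python
def average_ranks(scores):
    """Rank scores in descending order (best score gets rank 1).

    Tied scores share the mean of the positions they occupy, which mirrors
    the default method="average" of pandas' rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all items tied with the one at `pos`.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based positions
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


print(average_ranks([0.9, 0.8, 0.5, 0.5]))  # [1.0, 2.0, 3.5, 3.5]
```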


### Metamodel (using AdaBoost regressor as metamodel)

In [12]:
stacker = Stacker([lgbm_ranker, gnb_ranker, adaboost_ranker], AdaBoostRegressor(), use_groups=[True, False, False])

In [13]:
stacker.fit(train, columns_to_use, nr_validation_weeks=nr_validation_weeks)

computing validation predictions for each of the base rankers...
training metamodel



metamodel training shape: (5049557, 3)
Computing scores on validatation...
retraining base rankers on full training set...


<rankers.Stacker.Stacker at 0x7fbf5f26b5b0>

In [14]:
test['regressor_meta_model'] = stacker.predict(test_X, columns_to_use, weighting="metamodel")

Predicting with metamodel
Prediction matrix shape: (6610150, 3)
prediction matrix:
[[ 1.   6.   1. ]
 [ 3.5  9.   2. ]
 [ 3.5  8.   4. ]
 ...
 [12.5  1.  15. ]
 [12.5  8.  13. ]
 [12.5  4.  16. ]]


### Weighted rank aggregation (no metamodel)

In [15]:
# Predict with the base rankers unweighted (all rankers count equally)
test['unweighted'] = stacker.predict(test_X, columns_to_use, weighting=None)

Predicting with None weighting
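
One simple way to aggregate ranks without a metamodel is a weighted mean of the per-ranker ranks. The sketch below shows that idea under stated assumptions; it is not necessarily how `Stacker.predict` combines the rankers. `aggregate_ranks` and the example weights are hypothetical.

```python
def aggregate_ranks(rank_rows, weights=None):
    """Combine per-ranker ranks into one score per candidate (higher is better).

    rank_rows: one list per candidate holding its rank under each base ranker
               (1 = best), like a row of the prediction matrix.
    weights:   one non-negative weight per ranker; None means equal weights.
    """
    n_rankers = len(rank_rows[0])
    if weights is None:
        weights = [1.0] * n_rankers
    total = sum(weights)
    # Negate the weighted mean rank so that a larger value means a better candidate.
    return [-sum(w * r for w, r in zip(weights, row)) / total for row in rank_rows]


rows = [[1.0, 6.0, 1.0], [3.5, 9.0, 2.0], [3.5, 8.0, 4.0]]
print(aggregate_ranks(rows))                   # equal weights
print(aggregate_ranks(rows, [0.6, 0.1, 0.3]))  # hypothetical weights
```

With `weights=None` every ranker counts equally. Passing validation scores as weights gives more say to the rankers that validated better.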


In [16]:
test['MRR_weighted'] = stacker.predict(test_X, columns_to_use, weighting="MRR")

Predicting with MRR weighting
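
MRR (mean reciprocal rank) averages, over customers, the reciprocal of the position of the first correctly predicted article. The sketch below is a plain-Python illustration with made-up data; the function name and input layout are assumptions rather than the scoring code used inside `Stacker`.

```python
def mean_reciprocal_rank(predictions, purchases):
    """MRR over customers: mean of 1 / position of the first relevant item.

    predictions: dict customer -> ranked list of article ids (best first)
    purchases:   dict customer -> set of article ids actually bought
    Customers with no hit contribute 0.
    """
    total = 0.0
    for customer, bought in purchases.items():
        for position, article in enumerate(predictions.get(customer, []), start=1):
            if article in bought:
                total += 1.0 / position
                break
    return total / len(purchases) if purchases else 0.0


preds = {"a": [10, 20, 30], "b": [40, 50, 60], "c": [70, 80, 90]}
truth = {"a": {20}, "b": {40, 60}, "c": {99}}
print(mean_reciprocal_rank(preds, truth))  # (1/2 + 1 + 0) / 3 = 0.5
```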


In [17]:
test['MAPk_weighted'] = stacker.predict(test_X, columns_to_use, weighting="MAPk")

Predicting with MAPk weighting
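
Assuming `MAPk` refers to mean average precision at k, with k = 12 to match the 12 articles kept per customer in the submission, the metric can be sketched as follows. The function names and the toy data are hypothetical.

```python
def average_precision_at_k(predicted, actual, k=12):
    """Average precision of one ranked list against a set of relevant items."""
    if not actual:
        return 0.0
    hits = 0
    precision_sum = 0.0
    counted = set()
    for position, article in enumerate(predicted[:k], start=1):
        if article in actual and article not in counted:
            counted.add(article)
            hits += 1
            precision_sum += hits / position
    return precision_sum / min(len(actual), k)


def map_at_k(predictions, purchases, k=12):
    """Mean of the per-customer average precision; customers come from `purchases`."""
    if not purchases:
        return 0.0
    scores = [
        average_precision_at_k(predictions.get(customer, []), bought, k)
        for customer, bought in purchases.items()
    ]
    return sum(scores) / len(scores)


preds = {"a": [1, 2, 3, 4], "b": [5, 6]}
truth = {"a": {1, 3}, "b": {6}}
print(map_at_k(preds, truth))  # ((1 + 2/3) / 2 + 1/2) / 2, about 0.667
```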


In [18]:
test['naive_bayes'] = gnb_ranker.predict(test_X[columns_to_use])

## Without GNB

### Metamodel (using AdaBoost ranker as metamodel)

In [19]:
stacker = Stacker([lgbm_ranker, adaboost_ranker], Ranker(AdaBoostClassifier()), use_groups=[True, False])

In [20]:
stacker.fit(train, columns_to_use, nr_validation_weeks=nr_validation_weeks)

computing validation predictions for each of the base rankers...
training metamodel




metamodel training shape: (5049557, 2)
Computing scores on validatation...
retraining base rankers on full training set...


<rankers.Stacker.Stacker at 0x7fbf90d007f0>

In [21]:
test['no_GNB_ranker_meta_model'] = stacker.predict(test_X, columns_to_use, weighting="metamodel")

Predicting with metamodel
Prediction matrix shape: (6610150, 2)
prediction matrix:
[[ 1.   1. ]
 [ 3.5  2. ]
 [ 3.5  4. ]
 ...
 [12.5 15. ]
 [12.5 13. ]
 [12.5 16. ]]


### Metamodel (using AdaBoost regressor as metamodel)

In [22]:
stacker = Stacker([lgbm_ranker, adaboost_ranker], AdaBoostRegressor(), use_groups=[True, False])

In [23]:
stacker.fit(train, columns_to_use, nr_validation_weeks=nr_validation_weeks)

computing validation predictions for each of the base rankers...
training metamodel




metamodel training shape: (5049557, 2)
Computing scores on validatation...
retraining base rankers on full training set...


<rankers.Stacker.Stacker at 0x7fbf7420d840>

In [24]:
test['no_GNB_unweighted'] = stacker.predict(test_X, columns_to_use, weighting=None)

Predicting with None weighting


In [25]:
test['no_GNB_regressor_meta_model'] = stacker.predict(test_X, columns_to_use, weighting="metamodel")

Predicting with metamodel
Prediction matrix shape: (6610150, 2)
prediction matrix:
[[ 1.   1. ]
 [ 3.5  2. ]
 [ 3.5  4. ]
 ...
 [12.5 15. ]
 [12.5 13. ]
 [12.5 16. ]]


In [26]:
test['no_GNB_MRR_weighted'] = stacker.predict(test_X, columns_to_use, weighting="MRR")

Predicting with MRR weighting


In [27]:
test['no_GNB_MAPk_weighted'] = stacker.predict(test_X, columns_to_use, weighting="MAPk")

Predicting with MAPk weighting


In [28]:
pred_cols = ['unweighted', 'MRR_weighted', 'MAPk_weighted', 'ranker_meta_model', 'regressor_meta_model']
pred_cols = pred_cols + [f'no_GNB_{i}' for i in pred_cols] + ['naive_bayes']

# Create submissions

In [30]:
bestsellers_last_week = \
    bestsellers_previous_week[bestsellers_previous_week.week == bestsellers_previous_week.week.max()]['article_id'].tolist()

In [31]:
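def hex_id_to_int(hex_id):
    # Parse the last 16 hex digits (64 bits) of the id as an integer.
    return int(hex_id[-16:], 16)

def customer_hex_id_to_int(series):
    # Element-wise conversion of a pandas Series of hex customer ids.
    return series.str[-16:].apply(hex_id_to_int)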
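
The conversion above keeps only the last 16 hexadecimal digits of a customer id, which always fit in an unsigned 64-bit integer. Here is a small standalone sketch of the same idea. The id is made up, and `last_16_hex_to_int` is a hypothetical name chosen so it does not clash with the notebook's helpers.

```python
def last_16_hex_to_int(hex_id):
    """Parse the last 16 hex digits of an id string as an integer."""
    return int(hex_id[-16:], 16)


# A made-up 64-character id: 48 filler characters followed by 16 hex digits.
example_id = "ab" * 24 + "00000000000000ff"
print(len(example_id))                 # 64
print(last_16_hex_to_int(example_id))  # 255

# The largest possible value is 2**64 - 1, so results fit in an unsigned 64-bit integer.
print(last_16_hex_to_int("f" * 16) == 2**64 - 1)  # True
```

Because the leading digits are discarded, two different ids could in principle map to the same integer. The notebook relies on this not happening for the customer ids in the data.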

In [32]:
for preds_name in pred_cols:
    sub = pd.read_csv('../../../Data/sample_submission.csv')

    # For each customer, list the candidate articles ordered by this prediction column (highest value first).
    c_id2predicted_article_ids = test \
        .sort_values(['customer_id', preds_name], ascending=False) \
        .groupby('customer_id')['article_id'].apply(list).to_dict()

    preds = []
    for c_id in customer_hex_id_to_int(sub.customer_id):
        pred = c_id2predicted_article_ids.get(c_id, [])
        # Pad with last week's bestsellers, then keep at most 12 articles.
        pred = pred + bestsellers_last_week
        preds.append(pred[:12])

    # Article ids are written with a leading zero and separated by spaces.
    preds = [' '.join(['0' + str(p) for p in ps]) for ps in preds]
    sub.prediction = preds

    sub_name = f'../../Submissions_EnsembleOfEnsembles/stacker/{preds_name}'
    sub.to_csv(f'{sub_name}.csv.gz', index=False)
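
The loop above pads each customer's ranked articles with last week's bestsellers and keeps the first 12. It does not remove duplicates, so a bestseller that is already among the ranked articles can appear twice. The sketch below is a variant that deduplicates while padding. It is an illustration rather than the code used for the submissions: the helper names and article ids are made up, and it assumes article ids are integers that become 10-digit strings once zero-padded.

```python
def build_prediction(ranked_articles, fallback_articles, k=12):
    """Take ranked articles, pad with fallback items, and keep k unique ids."""
    seen = set()
    result = []
    for article_id in list(ranked_articles) + list(fallback_articles):
        if article_id not in seen:
            seen.add(article_id)
            result.append(article_id)
        if len(result) == k:
            break
    return result


def format_prediction(article_ids):
    """Space-separated ids, zero-padded to 10 characters (assumed id width)."""
    return " ".join(str(a).zfill(10) for a in article_ids)


ranked = [108775015, 111565001]    # illustrative ids
fallback = [111565001, 706016001]  # illustrative bestsellers
picked = build_prediction(ranked, fallback, k=3)
print(picked)                     # [108775015, 111565001, 706016001]
print(format_prediction(picked))  # 0108775015 0111565001 0706016001
```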


In [34]:
for preds_name in pred_cols:
    !kaggle competitions submit -c h-and-m-personalized-fashion-recommendations -f '../../Submissions_EnsembleOfEnsembles/stacker/{preds_name}.csv.gz' -m {preds_name}

100%|███████████████████████████████████████| 59.0M/59.0M [02:19<00:00, 443kB/s]
100%|███████████████████████████████████████| 58.6M/58.6M [02:22<00:00, 431kB/s]
100%|███████████████████████████████████████| 58.4M/58.4M [02:17<00:00, 445kB/s]
100%|███████████████████████████████████████| 58.4M/58.4M [02:17<00:00, 444kB/s]
100%|███████████████████████████████████████| 57.9M/57.9M [02:17<00:00, 441kB/s]
100%|███████████████████████████████████████| 57.8M/57.8M [02:17<00:00, 439kB/s]
100%|███████████████████████████████████████| 57.8M/57.8M [02:21<00:00, 428kB/s]
100%|███████████████████████████████████████| 57.8M/57.8M [02:22<00:00, 426kB/s]
100%|███████████████████████████████████████| 57.8M/57.8M [02:16<00:00, 443kB/s]
100%|███████████████████████████████████████| 57.9M/57.9M [02:16<00:00, 445kB/s]
100%|███████████████████████████████████████| 58.3M/58.3M [02:15<00:00, 451kB/s]
Successfully submitted to H&M Personalized Fashion Recommendations