# Carts Model
This notebook trains a model that predicts which aids a user is going to add to the cart. The same notebook was also used to cross-validate the carts model. Unlike the clicks model, the carts prediction is made in a separate notebook: on the Kaggle platform, notebooks with a GPU have less memory available, and it was hard to fit all the required data into the 13 GB of available RAM, so I had to move prediction to a different notebook without GPU support but with 30 GB of RAM.

This notebook uses input from two "parallel" notebooks that produce w2vec features for carts: one for the cross-validation set and half of the test set, and the other for the other half of the test set.

I tried both CatBoost and LGBM models to predict carts, and LGBM showed better results, so I used the LGBM model to produce the final results. Unlike in the clicks notebook, I removed the CatBoost code here to make the notebook shorter and clearer.
## Imports and definitions

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import gc
from humanize import naturalsize
from sklearn.model_selection import GroupKFold
from lightgbm.sklearn import LGBMRanker
import joblib

# functions and classes common for several notebooks of current project
import otto_common

In [2]:
# This function was used to test new features before adding them to the pipeline.
# Now it only deletes the day_of_week column, which is used to construct some features.
def prepare_df(df):
    del df['day_of_week']
    return df

## Load and prepare data

In [3]:
# Load the train/cross-validation data.
df_train = pd.read_parquet('/kaggle/input/otto-carts-w2vec/train_features_with_w2v_cv1.parquet')

In [4]:
# A few checks and preparations.
df_train = prepare_df(df_train)
gc.collect()

assert len(df_train[df_train.duplicated(subset=['session','cart_predictions'], keep=False)]) == 0

size = df_train.memory_usage(deep=True).sum()
print(naturalsize(size))

1.7 GB
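
The two checks in the cell above can be illustrated on a toy frame. This is only a sketch: the column names follow this notebook, but the data is made up.

```python
import pandas as pd

# Toy candidate table: one row per (session, candidate aid) pair.
toy = pd.DataFrame({
    "session": [1, 1, 2, 2],
    "cart_predictions": [10, 11, 10, 10],  # the last two rows collide
})

# keep=False marks every member of a duplicated group, not only the repeats.
dup_mask = toy.duplicated(subset=["session", "cart_predictions"], keep=False)
n_duplicates = int(dup_mask.sum())

# deep=True (the boolean) also counts the payload of object columns.
n_bytes = int(toy.memory_usage(deep=True).sum())
```

On the real training frame the assertion requires the duplicate count to be zero, so that each candidate is ranked at most once per session.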


In [5]:
# Set the LGBM model's parameters.
parameters = {
    "objective" : "lambdarank",
    "metric" : "ndcg",
    "boosting_type" : "gbdt",
    'min_child_samples' : 100,
    "n_estimators" : 250,
    "num_leaves" : 128,
    "importance_type" : 'gain',
    'max_depth' : 8,
    'learning_rate' : 0.07,
    'random_state' : 22,
    'device': 'gpu',
    'gpu_platform_id': 0,
    'gpu_device_id': 0,
}
model = LGBMRanker(**parameters)

print('model_defined')


model_defined


In [6]:
# A few global parameters, used both for creating the submission and for cross-validation.
CROSS_VALIDATE = False # Set to True to run cross-validation; False skips it when producing the submission.
frac = 0.65 # fraction of records with target==False to be dropped from train to reduce memory usage
x_cols = list(df_train.columns[3:])
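
The helper `otto_common.remove_frac` is not shown in this notebook. A minimal sketch of what such negative downsampling could look like is given below, assuming positional row indices and a boolean `target` column; the function name `drop_negative_fraction` and its `seed` argument are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_negative_fraction(index, df, frac, seed=25):
    """Hypothetical stand-in for otto_common.remove_frac (not the real code).

    Takes positional row indices and returns them with `frac` of the
    negative rows removed, keeping the original order.
    """
    index = np.asarray(index)
    is_negative = ~df["target"].to_numpy(dtype=bool)[index]
    negatives = index[is_negative]
    rng = np.random.default_rng(seed)
    n_drop = int(round(len(negatives) * frac))
    dropped = rng.choice(negatives, size=n_drop, replace=False)
    return index[~np.isin(index, dropped)]

# Toy data: 4 positives followed by 20 negatives.
toy = pd.DataFrame({"target": [True] * 4 + [False] * 20})
kept = drop_negative_fraction(np.arange(len(toy)), toy, frac=0.65)
```

All positives survive and only the negatives are thinned, which reduces memory usage while keeping every session's positive examples available for training.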

## Cross-validation

In [7]:
%%time
# Cell for cross-validation.

if CROSS_VALIDATE:
    # Define the splits and prepare a column to save results.
    n_splits = 4
    groups_by_session = df_train['session'].copy().tolist()
    group_kfold = GroupKFold(n_splits=n_splits)
    df_importances = pd.DataFrame({'columns':x_cols})
    df_train['cv_prediction'] = -1
    df_train['cv_prediction'] = df_train['cv_prediction'].astype(np.float32)
    # Fit the model and save the results.
    for i, (train_index, test_index) in enumerate(group_kfold.split(df_train[x_cols], df_train['target'], groups_by_session)):
        train_index = otto_common.remove_frac(train_index, df_train, frac)
        gc.collect()
        print('start_fitting')

        model.fit(
            df_train[x_cols].iloc[train_index],
            df_train.iloc[train_index, 2].astype(np.int8),
            group=df_train.iloc[train_index].groupby('session').size(),
        )
        column_name = 'imp_' + str(i)
        df_importances[column_name] = model.feature_importances_
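        # Assign through a single .iloc call; chained indexing may write to a temporary copy.
        df_train.iloc[test_index, df_train.columns.get_loc('cv_prediction')] = model.predict(df_train[x_cols].iloc[test_index]).astype(np.float32)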
        gc.collect()
    del groups_by_session
    gc.collect()
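    # Average only the numeric importance columns; the 'columns' column holds feature names.
    df_importances['imp_avg'] = df_importances.mean(axis=1, numeric_only=True)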

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 12.6 µs
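
The cross-validation cell splits by session so that all candidate rows of one session land in the same fold. A small sketch on toy data (the session ids here are made up) shows that `GroupKFold` never shares a group between the train and test parts of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 8 toy sessions with 3 candidate rows each.
sessions = np.repeat(np.arange(8), 3)
X = np.zeros((len(sessions), 1))
y = np.zeros(len(sessions))

shared_sessions = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=sessions):
    # Sessions that appear on both sides of the split would leak information.
    overlap = set(sessions[train_idx]) & set(sessions[test_idx])
    shared_sessions.append(len(overlap))
```

This matters for a ranker because each session is a query group, and splitting a session across folds would both break the group structure and leak its candidates into training.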


In [8]:
# View feature_importances. Two cells were used to print feature importances so that it would be possible to compare values between two runs.
#df_importances

In [9]:
#df_importances

In [10]:
# Check the cross-validation results.
if CROSS_VALIDATE:
    otto_common.calculate_recall(df_train, 'cv_prediction', 567353)
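
The implementation of `otto_common.calculate_recall` is not part of this notebook. The sketch below shows one simplified way a recall@k could be computed from the per-row predictions, assuming the third argument is the total number of ground-truth carts; the function `recall_at_k` and the toy data are hypothetical and may differ from the real helper.

```python
import pandas as pd

def recall_at_k(df, score_col, n_ground_truth, k=20):
    """Hypothetical recall@k; NOT the otto_common.calculate_recall code."""
    # Keep the k highest-scored candidates of every session.
    top_k = (
        df.sort_values(["session", score_col], ascending=[True, False])
          .groupby("session")
          .head(k)
    )
    # A hit is a kept candidate whose target is True.
    hits = int(top_k["target"].sum())
    return hits / n_ground_truth

toy = pd.DataFrame({
    "session": [1, 1, 1, 2, 2, 2],
    "target": [True, False, True, False, False, True],
    "cv_prediction": [0.9, 0.8, 0.1, 0.7, 0.6, 0.5],
})
recall = recall_at_k(toy, "cv_prediction", n_ground_truth=4, k=2)
```

Dividing by an externally supplied total, rather than by the positives present in the frame, also penalizes ground-truth items that the candidate generation never produced.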

## Fit the test model and save it to file

In [11]:
# Remove a fraction of negative samples.
if frac > 0:
    remove_index = df_train.loc[df_train['target'] == False].sample(frac=frac, random_state=25).index
    df_train = df_train.drop(remove_index)
    del remove_index
    gc.collect()

# Fit the model.
model.fit(df_train[x_cols],
          df_train.iloc[:,2].astype(np.int8),
          group=df_train.groupby('session').size())

del df_train
gc.collect()

8
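
LightGBM reads the `group` argument as the sizes of consecutive blocks of rows, so the sizes must follow the row order of the training frame. The fit calls above pass `groupby('session').size()`, which returns sizes ordered by sorted session id; that lines up only if the rows are already sorted by session, which this notebook appears to assume. A toy sketch of the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    "session": [7, 7, 7, 3, 3, 9],  # contiguous per session, but not sorted
    "target": [1, 0, 0, 0, 1, 0],
})

# Default groupby sorts the keys, so sizes come out for sessions 3, 7, 9.
sizes_sorted = toy.groupby("session").size().tolist()

# sort=False keeps first-appearance order (7, 3, 9), matching the rows.
sizes_in_row_order = toy.groupby("session", sort=False).size().tolist()

# Sorting the frame by session first makes the default groupby safe.
toy_sorted = toy.sort_values("session", kind="stable")
sizes_after_sort = toy_sorted.groupby("session").size().tolist()
```

If the input parquet is not guaranteed to be sorted by session, either sorting it first or using `sort=False` on contiguous sessions keeps the group sizes aligned with the rows.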

In [12]:
# Save the model to file.
joblib.dump(model, 'lgb.pkl')

['lgb.pkl']
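
The prediction notebook can restore the saved model with `joblib.load`. The round trip is sketched below with a plain dictionary standing in for the fitted `LGBMRanker` and a temporary directory standing in for the notebook's output folder; both are placeholders for illustration.

```python
import os
import tempfile

import joblib

# Placeholder object; in this notebook the dumped object is the fitted LGBMRanker.
stand_in_model = {"n_estimators": 250, "learning_rate": 0.07}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lgb.pkl")
    joblib.dump(stand_in_model, path)   # returns the list of written file names
    restored = joblib.load(path)        # unpickles the object back into memory
```

Because the file is a pickle, the loading notebook should have a compatible `lightgbm` version installed when the real model is restored.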