# Making and combining predictions for orders
This notebook predicts which aids (article IDs) a user is going to order. It takes input from the "Orders Model" notebook, where the orders models are fitted, and from two "parallel" notebooks that produce w2vec features for orders, each preparing features for one of the cross-validation datasets and one chunk of the test dataset.

It was impossible to fit the models and make predictions in the same notebook because of the limits of the Kaggle platform. Kaggle notebooks with a GPU have less memory available, and it was hard to fit all the required data into 13 GB of RAM, so I moved prediction to a separate notebook without GPU support but with 30 GB of RAM.

Two models are used to predict the orders, LGBM and CatBoost, each trained on a different cross-validation dataset. Both models rank the candidates; the ranking scores are then scaled and combined to produce the final prediction.

## Imports and definitions

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import gc
from lightgbm.sklearn import LGBMRanker
from catboost import CatBoostRanker, Pool    
from sklearn.preprocessing import StandardScaler
import joblib

# functions and classes common for several notebooks of current project
import otto_common

In [2]:
# This function was used to test new features before adding them to the pipeline.
# Now it only deletes the day_of_week column, which is used to construct some features.
def prepare_df(df):
    del df['day_of_week']
    return df

## Load the models and make predictions

In [3]:
# Load the LGBM model.
model = joblib.load('/kaggle/input/otto-model-orders/lgb.pkl')

In [4]:
# Make predictions using the LGBM model.
file_path_part_0 = '/kaggle/input/otto-orders-w2vec/train_features_with_w2v_part_0.parquet'
file_path_part_1 = '/kaggle/input/otto-orders-w2vec-part1/train_features_with_w2v_part_1.parquet'

# Load and prepare the data.
for i in range(2):
    print('Start predicting '+ str(i))
    j_max = 3
    for j in range(j_max):
        if i == 0:
            df_test = pd.read_parquet(file_path_part_0)
        else:
            df_test = pd.read_parquet(file_path_part_1)
        df_test = otto_common.divide_df_by_column(df_test, j_max, j, 'session')
        df_test = prepare_df(df_test)
        x_cols = list(df_test.columns[2:])
        
        # Prediction itself.
        df_test['gbdt_prediction'] = model.predict(df_test[x_cols])
        
        # Remove the features and combine the predictions for chunks of test data into a single dataframe.
        df_test = df_test[['session','order_predictions','gbdt_prediction']]
        gc.collect()
        if i == 0 and j == 0:
            df_cv1 = df_test
        else:
            df_cv1 = pd.concat([df_cv1, df_test])
    print('Predictions made '+ str(i))
del df_test, model
gc.collect()

Start predicting 0
Predictions made 0
Start predicting 1
Predictions made 1


0
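
The helper `otto_common.divide_df_by_column` is not shown in this notebook. A minimal sketch of how such a helper might work, assuming it splits the rows by the values of one column so that no session is divided between chunks. The name `divide_df_by_column_sketch` and the splitting rule are hypothetical, not the project's actual implementation.

```python
import numpy as np
import pandas as pd


def divide_df_by_column_sketch(df, n_parts, part_idx, column):
    """Hypothetical stand-in for otto_common.divide_df_by_column."""
    # Split the sorted unique key values into n_parts nearly equal groups.
    unique_values = np.sort(df[column].unique())
    value_groups = np.array_split(unique_values, n_parts)
    # Keep only the rows whose key belongs to the requested group.
    return df[df[column].isin(value_groups[part_idx])].copy()


# Toy data: five sessions with one or two candidates each.
toy = pd.DataFrame({
    'session': [1, 1, 2, 2, 3, 4, 4, 5],
    'order_predictions': [10, 11, 10, 12, 13, 10, 14, 15],
})
parts = [divide_df_by_column_sketch(toy, 3, j, 'session') for j in range(3)]
```

Under this assumption every row ends up in exactly one chunk, so concatenating the per-chunk predictions reproduces the full set of candidates.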

In [5]:
# Load the catboost model.

model_catboost = CatBoostRanker()

model_catboost.load_model("/kaggle/input/otto-model-orders/model")

<catboost.core.CatBoostRanker at 0x7fa1363032d0>

In [6]:
# Make predictions using the catboost model.

# Load and prepare the data.
for i in range(2):
    print('Start predicting '+ str(i))
    j_max = 5
    for j in range(j_max):
        if i == 0:
            df_test = pd.read_parquet(file_path_part_0)
        else:
            df_test = pd.read_parquet(file_path_part_1)
        df_test = otto_common.divide_df_by_column(df_test, j_max, j, 'session')
        df_test = prepare_df(df_test)
        x_cols = list(df_test.columns[2:])
        test_pool = Pool(
            data=df_test[x_cols],
            group_id=df_test['session']
        )
        gc.collect()
        
        # Prediction itself.
        df_test['from_cv2_prediction'] = model_catboost.predict(test_pool)
        
        # Remove the features and combine the predictions for chunks of test data into a single dataframe.
        df_test = df_test[['session','order_predictions','from_cv2_prediction']]
        gc.collect()
        if i == 0 and j == 0:
            df_cv2 = df_test
        else:
            df_cv2 = pd.concat([df_cv2, df_test])
    print('Predictions made '+ str(i))
del df_test, test_pool, model_catboost
gc.collect()

Start predicting 0
Predictions made 0
Start predicting 1
Predictions made 1


0

In [7]:
# Merge the predictions made by both models into a single dataframe
df_total = pd.merge(df_cv1, df_cv2, how='outer', on=['session', 'order_predictions'])

del df_cv1, df_cv2
gc.collect()

21
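
With `how='outer'`, a candidate scored by only one model would stay in the result with a missing value in the other model's column. A small sketch on made-up data of how this could be checked with the `indicator` option of `pd.merge`; the toy frames are illustrative and are not the notebook's data.

```python
import pandas as pd

# Toy predictions: the pair (2, 10) exists only on the left, (2, 12) only on the right.
left = pd.DataFrame({
    'session': [1, 1, 2],
    'order_predictions': [10, 11, 10],
    'gbdt_prediction': [0.5, -0.2, 1.1],
})
right = pd.DataFrame({
    'session': [1, 1, 2],
    'order_predictions': [10, 11, 12],
    'from_cv2_prediction': [3.0, 1.0, 2.0],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(left, right, how='outer',
                  on=['session', 'order_predictions'], indicator=True)

# Rows that were scored by only one of the two models.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```

If `unmatched` is empty, both models scored the same set of candidates and the combined score has no missing values.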

## Scale the results and calculate the average prediction

In [8]:
# Scale the results.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_total[['gbdt_prediction', 'from_cv2_prediction']].values)
scaled_results = pd.DataFrame(scaled_data)
scaled_results = scaled_results.rename(columns={0:'lgbm', 1:'cat'})
df_total = pd.concat([df_total, scaled_results], axis=1)

In [9]:
# Calculate the combined prediction using hand-picked coefficients.
df_total['sum'] = 0.7 * df_total['lgbm'] + 0.3 * df_total['cat']
df_total = df_total[['session','order_predictions','sum']]
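
The two models produce scores on different scales, so they are standardized before the weighted sum. A minimal sketch of the same arithmetic on made-up scores, using plain pandas instead of `StandardScaler`: with its default settings, `StandardScaler` subtracts the column mean and divides by the population standard deviation, which the hypothetical `standardize` helper below reproduces.

```python
import pandas as pd

# Toy scores: the second model's raw outputs are on a much larger scale.
toy = pd.DataFrame({
    'session': [1, 1, 1, 2, 2],
    'order_predictions': [10, 11, 12, 10, 13],
    'gbdt_prediction': [0.2, -1.5, 3.0, 0.7, -0.4],
    'from_cv2_prediction': [12.0, 3.0, 25.0, 9.0, 6.0],
})


def standardize(col):
    # Zero mean and unit variance, using the population std (ddof=0).
    return (col - col.mean()) / col.std(ddof=0)


toy['lgbm'] = standardize(toy['gbdt_prediction'])
toy['cat'] = standardize(toy['from_cv2_prediction'])

# Same hand-picked weights as in the notebook.
toy['sum'] = 0.7 * toy['lgbm'] + 0.3 * toy['cat']
```

Without the scaling step, the model with the larger raw score range would dominate the sum regardless of the chosen weights.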

## Final formatting and export to file

In [10]:
# Select the top 20 candidates and format the prediction as required by the organizers.
df_total = otto_common.select_top_20_and_format(df_total, 'order_predictions','sum')
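
The helper `otto_common.select_top_20_and_format` is also not shown here. A rough sketch of the selection step, assuming it keeps the highest-scoring candidates per session. The function name `select_top_k_sketch`, the `labels` column and the list output are hypothetical; the real helper may format its result differently.

```python
import pandas as pd


def select_top_k_sketch(df, candidate_col, score_col, k=20):
    """Hypothetical stand-in for otto_common.select_top_20_and_format."""
    # Order candidates inside each session from best to worst score.
    ranked = df.sort_values(['session', score_col], ascending=[True, False])
    # Keep at most k rows per session.
    top = ranked.groupby('session').head(k)
    # Collect the surviving candidates of each session into one list.
    return (top.groupby('session')[candidate_col]
               .apply(list)
               .reset_index(name='labels'))


toy = pd.DataFrame({
    'session': [1, 1, 1, 2, 2],
    'order_predictions': [10, 11, 12, 10, 13],
    'sum': [0.1, 0.9, 0.5, -0.3, 0.4],
})
result = select_top_k_sketch(toy, 'order_predictions', 'sum', k=2)
```

Here `k=2` is used only to keep the toy example small; the notebook keeps 20 candidates per session.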

In [11]:
# Save the final predictions to a parquet file.
df_total.to_parquet('gbdt_predictions_from_both_cvs.parquet')