# 0. Overview
**Tutorial**: sequential-model-fixed-missing-last-item

**Author**: [astrung](https://github.com/astrung)

**Original link**: [notebook](https://www.kaggle.com/code/astrung/sequential-model-fixed-missing-last-item)

**Edit**:
* In my previous notebooks ([here](https://www.kaggle.com/code/astrung/lstm-sequential-modelwith-item-features-tutorial) and [here](https://www.kaggle.com/code/astrung/lstm-sequential-modelwith-item-features-tutorial)), we used `test_data` with `full_sort_topk`, but because of a limitation of `full_sort_topk` the last item of each user's history was left out of the input for the submitted recommendations. Someone asked how to use all items as input features for recommendation in this [comment](https://www.kaggle.com/code/astrung/recbole-lstm-sequential-for-recomendation-tutorial/comments#1723707).
* So I created a notebook [here](https://www.kaggle.com/code/astrung/recbole-using-all-items-for-prediction) to address those questions in detail. This notebook is an improved version of my [previous notebook](https://www.kaggle.com/code/astrung/lstm-sequential-modelwith-item-features-tutorial) that applies the new function (using all items as input features, without `full_sort_topk`) to this competition.
* I also created an improved version that adds item features to the model in this [notebook](https://www.kaggle.com/astrung/lstm-model-with-item-infor-fix-missing-last-item). Adding item features improved the score slightly.

- - -


This notebook demonstrates how to use an LSTM for a recommendation system.
I use RecBole, an open-source library, because it has many built-in recommendation models (CNN, GRU/LSTM, context-aware, graph). In this notebook, we use a GRU/LSTM model to test the effect of a sequential model on recommendation.

Because of memory limits and to keep experiments fast, we only use data from 2020.

If you want to use all interactions from the full period, I have created an atomic dataset for you here: https://www.kaggle.com/astrung/hm-atomic-interation

There is one more limitation: we only train and predict for users and items with enough interactions. The configuration in section 3 keeps users with at least 30 interactions (`user_inter_num_interval`) and items with at least 40 interactions (`item_inter_num_interval`).

We will follow the steps below to create the model:

1. To use RecBole, we create an atomic file from the interaction data.
2. Because the RecBole model only predicts for users with enough purchases, all other users need default recommendations. We use the most viewed items in the last month as the default recommendation.
3. We create the dataset and train the model in RecBole.
4. We generate predictions with the trained model.
5. We combine the default recommendations (most viewed items in the last month) with the predictions of the RecBole model.

I explain each step in more detail in the following cells.



In [1]:
!pip install recbole

Collecting recbole
  Downloading recbole-1.0.1-py3-none-any.whl (2.0 MB)
     |████████████████████████████████| 2.0 MB 618 kB/s            
Collecting colorlog==4.7.2
  Downloading colorlog-4.7.2-py2.py3-none-any.whl (10 kB)
Collecting scipy==1.6.0
  Downloading scipy-1.6.0-cp37-cp37m-manylinux1_x86_64.whl (27.4 MB)
     |████████████████████████████████| 27.4 MB 125 kB/s             
Installing collected packages: scipy, colorlog, recbole
  Attempting uninstall: scipy
    Found existing installation: scipy 1.7.3
    Uninstalling scipy-1.7.3:
      Successfully uninstalled scipy-1.7.3
  Attempting uninstall: colorlog
    Found existing installation: colorlog 6.6.0
    Uninstalling colorlog-6.6.0:
      Successfully uninstalled colorlog-6.6.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.

# 1. Create atomic file

In [2]:
import pandas as pd
import gc
df = pd.read_csv(r"/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv", 
                 dtype={'article_id': 'str'})
df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687004,0.016932,2


In [3]:
df['t_dat'] = pd.to_datetime(df['t_dat'], format="%Y-%m-%d")
df

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687004,0.016932,2
...,...,...,...,...,...
31788319,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,0929511001,0.059305,2
31788320,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,0891322004,0.042356,2
31788321,2020-09-22,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,0918325001,0.043203,1
31788322,2020-09-22,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,0833459002,0.006763,1


In [4]:
import numpy as np
df['timestamp'] = df.t_dat.values.astype(np.int64) // 10 ** 9
df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,timestamp
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0663713001,0.050831,2,1537401600
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0541518023,0.030492,2,1537401600
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0505221004,0.015237,2,1537401600
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687003,0.016932,2,1537401600
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,0685687004,0.016932,2,1537401600


**We keep only data from 2020 (timestamp > 1585620000, i.e. transactions from 1 April 2020 onward) and create the inter file**

If you need an introduction to the inter file format, please check the links below:
* https://recbole.io/docs/user_guide/data_intro.html
* https://recbole.io/docs/user_guide/data/atomic_files.html
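
To see which dates the threshold keeps, the cutoff can be converted back to a calendar date. This is a small standard-library sketch; the helper `day_to_timestamp` and the constant name `CUTOFF` are illustrative and not part of the notebook.

```python
from datetime import datetime, timezone

CUTOFF = 1585620000  # threshold used in the timestamp filter

# Convert the cutoff back to a UTC date-time.
cutoff_dt = datetime.fromtimestamp(CUTOFF, tz=timezone.utc)
print(cutoff_dt.isoformat())  # 2020-03-31T02:00:00+00:00


def day_to_timestamp(day: str) -> int:
    """Unix seconds at midnight UTC, matching how t_dat was converted."""
    dt = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


print(day_to_timestamp("2020-03-31") > CUTOFF)  # False: 31 March is dropped
print(day_to_timestamp("2020-04-01") > CUTOFF)  # True: 1 April is the first day kept
```

Because `t_dat` contains dates without a time of day, every row of a given day shares the same timestamp, so the filter acts on whole days.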

In [5]:
temp = df[df['timestamp'] > 1585620000][['customer_id', 'article_id', 'timestamp']].rename(
    columns={'customer_id': 'user_id:token', 'article_id': 'item_id:token', 'timestamp': 'timestamp:float'})
temp

Unnamed: 0,user_id:token,item_id:token,timestamp:float
23934157,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0727808001,1585699200
23934158,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0727808007,1585699200
23934159,000563485cbb7850b0a93c6606f89c5b961c6647d1bd48...,0567532015,1585699200
23934160,000563485cbb7850b0a93c6606f89c5b961c6647d1bd48...,0706104009,1585699200
23934161,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,0783504004,1585699200
...,...,...,...
31788319,fff2282977442e327b45d8c89afde25617d00124d0f999...,0929511001,1600732800
31788320,fff2282977442e327b45d8c89afde25617d00124d0f999...,0891322004,1600732800
31788321,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,0918325001,1600732800
31788322,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,0833459002,1600732800


We save the atomic file in the dataset layout that RecBole expects.

In [6]:
!mkdir -p /kaggle/working/recbox_data
temp.to_csv('/kaggle/working/recbox_data/recbox_data.inter', index=False, sep='\t')
del temp
gc.collect()

160
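
For reference, the resulting `.inter` file is a tab-separated text file whose header encodes each column as `name:type`. The sketch below writes two made-up rows to an in-memory buffer to show the layout; the user IDs are placeholders, not real data.

```python
import csv
import io

# Made-up rows for illustration only; real customer IDs are long hashes.
rows = [
    ("user_a", "0727808001", 1585699200),
    ("user_b", "0567532015", 1585699200),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# Header: field name, a colon, then the RecBole field type.
writer.writerow(["user_id:token", "item_id:token", "timestamp:float"])
writer.writerows(rows)

print(buffer.getvalue())
```

Keeping `article_id` as a string when loading the CSV is what preserves the leading zero in the `item_id:token` column.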

# 2. Create default recommendations for users the sequential model cannot predict
I use the approach from this notebook: https://www.kaggle.com/hervind/h-m-faster-trending-products-weekly (check it for more details). I only copy the code here.

In [7]:
import os
import numpy as np
import pandas as pd

In [8]:
sub0 = pd.read_csv('../input/hm-pre-recommendation/submissio_byfone_chris.csv').sort_values('customer_id').reset_index(drop=True)
sub1 = pd.read_csv('../input/hm-pre-recommendation/submission_trending.csv').sort_values('customer_id').reset_index(drop=True)
sub2 = pd.read_csv('../input/hm-pre-recommendation/submission_exponential_decay.csv').sort_values('customer_id').reset_index(drop=True)

sub0.shape, sub1.shape, sub2.shape

((1371980, 2), (1371980, 2), (1371980, 2))

In [9]:
sub0.columns = ['customer_id', 'prediction0']
sub0['prediction1'] = sub1['prediction']
sub0['prediction2'] = sub2['prediction']
del sub1, sub2
gc.collect()
sub0.head()

Unnamed: 0,customer_id,prediction0,prediction1,prediction2
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0656719005 0745232001 07...,0568601043 0568601006 0656719005 0745232001 07...,0568601043 0924243001 0924243002 0918522001 07...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0800436010 0739590027 0723529001 08...,0826211002 0800436010 0739590027 0723529001 08...,0924243001 0924243002 0918522001 0751471001 04...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0852643003 0858883002 07...,0794321007 0852643001 0852643003 0858883002 07...,0794321007 0924243001 0924243002 0918522001 07...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0573085028 0751471001 0706016001 06...,0448509014 0573085028 0751471001 0706016001 06...,0924243001 0924243002 0918522001 0751471001 04...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0791587015 0896152002 0818320001 09...,0730683050 0791587015 0896152002 0818320001 09...,0924243001 0924243002 0918522001 0751471001 04...


In [10]:
def cust_blend(dt, W = [1,1,1]):
    # W holds the global ensemble weights, one per prediction column

    # Create a list of all model predictions
    REC = []
    REC.append(dt['prediction0'].split())
    REC.append(dt['prediction1'].split())
    REC.append(dt['prediction2'].split())

    # Create a dictionary of recommended items.
    # Each item gets weight / rank from every list in which it appears
    res = {}
    for M in range(len(REC)):
        for n, v in enumerate(REC[M]):
            if v in res:
                res[v] += (W[M]/(n+1))
            else:
                res[v] = (W[M]/(n+1))

    # Sort items by descending score
    res = list(dict(sorted(res.items(), key=lambda item: -item[1])).keys())

    # Return only the top 12 items
    return ' '.join(res[:12])

sub0['prediction'] = sub0.apply(cust_blend, W = [1.05,1.00,0.95], axis=1)
sub0.head()

Unnamed: 0,customer_id,prediction0,prediction1,prediction2,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0656719005 0745232001 07...,0568601043 0568601006 0656719005 0745232001 07...,0568601043 0924243001 0924243002 0918522001 07...,0568601043 0568601006 0656719005 0745232001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0800436010 0739590027 0723529001 08...,0826211002 0800436010 0739590027 0723529001 08...,0924243001 0924243002 0918522001 0751471001 04...,0826211002 0800436010 0924243001 0739590027 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0852643003 0858883002 07...,0794321007 0852643001 0852643003 0858883002 07...,0794321007 0924243001 0924243002 0918522001 07...,0794321007 0852643001 0852643003 0858883002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0573085028 0751471001 0706016001 06...,0448509014 0573085028 0751471001 0706016001 06...,0924243001 0924243002 0918522001 0751471001 04...,0448509014 0573085028 0924243001 0751471001 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0791587015 0896152002 0818320001 09...,0730683050 0791587015 0896152002 0818320001 09...,0924243001 0924243002 0918522001 0751471001 04...,0730683050 0791587015 0924243001 0896152002 08...
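
The blending rule in `cust_blend` gives each item a score of weight divided by rank, summed over the lists that contain it. Here is a minimal standalone sketch of the same rule on toy data; the function name `blend` and the single-letter item IDs are made up for illustration.

```python
def blend(rankings, weights, top_n=12):
    """Score each item by the sum of weight / rank over all lists it appears in."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + weight / (position + 1)
    # Highest score first; ties keep first-seen order because sorted() is stable.
    ordered = sorted(scores, key=lambda item: -scores[item])
    return ordered[:top_n]


# Toy rankings with short made-up item IDs.
rankings = [["A", "B", "C"], ["A", "C", "D"], ["D", "A", "B"]]
weights = [1.05, 1.00, 0.95]

print(blend(rankings, weights))  # ['A', 'D', 'C', 'B']
```

Item `A` wins because it is ranked first or second in all three lists (1.05 + 1.00 + 0.475 = 2.525), while `D` comes second thanks to its first place in the third list.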


In [11]:
del sub0['prediction0']
del sub0['prediction1']
del sub0['prediction2']
gc.collect()
sub0.to_csv('submission.csv', index=False)

In [12]:
del sub0
del df
gc.collect()

21

# 3. Create dataset and train model with Recbole

If you need the documentation, please check this link: https://recbole.io/docs/user_guide/usage/use_modules.html

In [13]:
import logging
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.sequential_recommender import GRU4Rec
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger

In [14]:
parameter_dict = {
    'data_path': '/kaggle/working',
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'TIME_FIELD': 'timestamp',
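    'user_inter_num_interval': "[30,inf)",  # keep users with at least 30 interactions
    'item_inter_num_interval': "[40,inf)",  # keep items with at least 40 interactions
    'load_col': {'inter': ['user_id', 'item_id', 'timestamp']},
    'neg_sampling': None,
    'epochs': 50,
    'eval_args': {
        'split': {'RS': [10, 0, 0]},  # ratio split: all interactions go to training
        'group_by': 'user',
        'order': 'TO',  # order interactions by time
        'mode': 'full'}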
}

config = Config(model='GRU4Rec', dataset='recbox_data', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()
# Create handlers
c_handler = logging.StreamHandler()
c_handler.setLevel(logging.INFO)
logger.addHandler(c_handler)

# write config info into log
logger.info(config)


General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = /kaggle/working/recbox_data
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 50
train_batch_size = 2048
learner = adam
learning_rate = 0.001
neg_sampling = None
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [10, 0, 0]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}
repeatable = True
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separator = 	
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
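
As far as I understand the RecBole documentation, the lower bounds in `user_inter_num_interval` and `item_inter_num_interval` are applied as k-core filtering, meaning the filter is repeated until every remaining user and item satisfies its bound. The sketch below illustrates that idea in plain Python; it is not RecBole's implementation, and `k_core_filter` is a hypothetical helper.

```python
from collections import Counter


def k_core_filter(interactions, min_user_inter, min_item_inter):
    """Repeatedly drop rows whose user or item has too few interactions."""
    kept = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in kept)
        item_counts = Counter(item for _, item in kept)
        filtered = [
            (user, item)
            for user, item in kept
            if user_counts[user] >= min_user_inter
            and item_counts[item] >= min_item_inter
        ]
        if len(filtered) == len(kept):  # nothing was removed, so we are stable
            return filtered
        kept = filtered


toy = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
       ("u2", "i2"), ("u3", "i1"), ("u3", "i3")]
print(k_core_filter(toy, min_user_inter=2, min_item_inter=2))
```

In this toy example, item `i3` is removed first, which leaves user `u3` with a single interaction, so `u3` is removed in the next pass. This cascading effect is why the number of remaining users and items can be lower than a single-pass filter would suggest.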


In [15]:
dataset = create_dataset(config)
logger.info(dataset)

recbox_data
The number of users: 38916
Average actions of users: 47.47241423615572
The number of items: 10962
Average actions of items: 168.54201259009216
The number of inters: 1847389
The sparsity of the dataset: 99.56694768867584%
Remain Fields: ['user_id', 'item_id', 'timestamp']


In [16]:
# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

[Training]: train_batch_size = [2048] negative sampling: [None]
[Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'RS': [10, 0, 0]}, 'group_by': 'user', 'order': 'TO', 'mode': 'full'}]


In [17]:
# model loading and initialization
model = GRU4Rec(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data)

GRU4Rec(
  (item_embedding): Embedding(10962, 64, padding_idx=0)
  (emb_dropout): Dropout(p=0.3, inplace=False)
  (gru_layers): GRU(64, 128, bias=False, batch_first=True)
  (dense): Linear(in_features=128, out_features=64, bias=True)
  (loss_fct): CrossEntropyLoss()
)
Trainable parameters: 783552
epoch 0 training [time: 29.21s, train loss: 7608.2684]
Saving current: saved/GRU4Rec-Mar-20-2022_02-28-47.pth
epoch 1 training [time: 26.44s, train loss: 7102.8474]
Saving current: saved/GRU4Rec-Mar-20-2022_02-28-47.pth
epoch 2 training [time: 25.82s, train loss: 6864.3110]
Saving current: saved/GRU4Rec-Mar-20-2022_02-28-47.pth
epoch 3 training [time: 25.81s, train loss: 6658.3106]
Saving current: saved/GRU4Rec-Mar-20-2022_02-28-47.pth
epoch 4 training [time: 25.81s, train loss: 6516.7922]
Saving current: saved/GRU4Rec-Mar-20-2022_02-28-47.pth
epoch 5 training [time: 25.82s, train loss: 6418.4797]
Saving current: saved/GRU4Rec-Mar-20-2022_02-28-47.pth
epoch 6 training [time: 25.55s, train loss

# 4. Create recommendation result from trained model

For anyone who wants to customize this step, the documentation is here: https://recbole.io/docs/user_guide/usage/case_study.html

In [18]:
# The first element of the array is the '[PAD]' token (RecBole default), so we remove it
external_user_ids = dataset.id2token(
    dataset.uid_field, list(range(dataset.user_num)))[1:]

In [19]:
import torch
from recbole.data.interaction import Interaction

def add_last_item(old_interaction, last_item_id, max_len=50):
    """Append the user's most recent item to their latest input sequence."""
    new_seq_items = old_interaction['item_id_list'][-1]
    seq_len = old_interaction['item_length'][-1].item()
    if seq_len < max_len:
        # Padding is still available: write the item into the first free slot
        new_seq_items[seq_len] = last_item_id
    else:
        # The sequence is full: drop the oldest item and put the new one last
        new_seq_items = torch.roll(new_seq_items, -1)
        new_seq_items[-1] = last_item_id
    return new_seq_items.view(1, len(new_seq_items))

def predict_for_all_item(external_user_id, dataset, model):
    """Score every item for one user, using the full history as input."""
    model.eval()
    with torch.no_grad():
        # Select all interaction rows that belong to this user
        uid_series = dataset.token2id(dataset.uid_field, [external_user_id])
        index = np.isin(dataset[dataset.uid_field].numpy(), uid_series)
        input_interaction = dataset[index]
        # item_id of the last row is the user's most recent item; it is not
        # part of that row's item_id_list, so we append it here
        last_len = input_interaction['item_length'][-1].item()
        test = {
            'item_id_list': add_last_item(input_interaction,
                                          input_interaction['item_id'][-1].item(), model.max_seq_length),
            'item_length': torch.tensor([min(last_len + 1, model.max_seq_length)])
        }
        new_inter = Interaction(test)
        new_inter = new_inter.to(config['device'])  # uses the global config
        new_scores = model.full_sort_predict(new_inter)
        new_scores = new_scores.view(-1, dataset.item_num)
        new_scores[:, 0] = -np.inf  # set scores of [pad] to -inf
    return torch.topk(new_scores, 10)
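
The idea behind `add_last_item` is easier to see without tensors. Each stored row pairs a padded history (`item_id_list`) with the next item as its target (`item_id`), so the newest purchase never appears in any history and has to be appended by hand. The following plain-Python sketch mirrors the two cases; `append_item` and the item IDs are illustrative assumptions, not RecBole code.

```python
PAD = 0  # padding ID; index 0 is reserved for padding


def append_item(padded_seq, seq_len, new_item):
    """Return (sequence, length) after appending new_item to a 0-padded sequence."""
    max_len = len(padded_seq)
    seq = list(padded_seq)  # copy so the input is left untouched
    if seq_len < max_len:
        # Free slot available: fill the first padding position.
        seq[seq_len] = new_item
        return seq, seq_len + 1
    # Sequence is full: drop the oldest item and shift everything left.
    return seq[1:] + [new_item], max_len


print(append_item([11, 12, PAD, PAD], 2, 13))  # ([11, 12, 13, 0], 3)
print(append_item([11, 12, 13, 14], 4, 15))    # ([12, 13, 14, 15], 4)
```

The second case corresponds to the `torch.roll(..., -1)` branch: the oldest item is discarded so that the sequence length never exceeds the model's maximum.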

In [20]:
predict_for_all_item('0109ad0b5a76924a1b58be677409bb601cc8bead9a87b8ce5b08a4a1f5bc71ef', 
                     dataset, model)

torch.return_types.topk(
values=tensor([[7.9712, 7.7557, 6.3152, 6.0824, 6.0296, 5.8736, 5.8550, 5.8297, 5.8106,
         5.7406]], device='cuda:0'),
indices=tensor([[6713, 6663, 8766,  496, 8749, 2763, 3097, 2117,  643, 2838]],
       device='cuda:0'))

In [21]:
topk_items = []
for external_user_id in external_user_ids:
    _, topk_iid_list = predict_for_all_item(external_user_id, dataset, model)
    last_topk_iid_list = topk_iid_list[-1]
    external_item_list = dataset.id2token(dataset.iid_field, last_topk_iid_list.cpu()).tolist()
    topk_items.append(external_item_list)
print(len(topk_items))

38915


In [22]:
external_item_str = [' '.join(x) for x in topk_items]
result = pd.DataFrame(external_user_ids, columns=['customer_id'])
result['prediction'] = external_item_str
result.head()

Unnamed: 0,customer_id,prediction
0,0010e8eb18f131e724d6997909af0808adbba057529edb...,0372860001 0706016003 0706016001 0610776002 08...
1,0064cd1ee810d4caabd1182a8f177479b82b18961bd76b...,0894956001 0907527001 0905957001 0769748014 07...
2,00ce4f170d9fe36d0aacca94addfc3b07f70f81dc7bde3...,0881244001 0867966009 0889652001 0750422039 07...
3,00d7ebd46f6a6d53630d41386b6ef6a505cdc4c80011ff...,0918522001 0915526001 0751592001 0924243001 09...
4,00eebac2c2e37626461e74e8395711964c4e01a7afa643...,0866731001 0875350003 0915526001 0933891001 08...


In [23]:
del external_item_str
del topk_items
del external_user_ids
del train_data
del valid_data
del test_data
del model
del Trainer
del logger
gc.collect()

21

In [24]:
del dataset
gc.collect()

21

# 5. Combine results from the most-bought items and the GRU model

In [25]:
submit_df = pd.read_csv('submission.csv')
submit_df.shape

(1371980, 2)

In [26]:
submit_df.head()

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0656719005 0745232001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0800436010 0924243001 0739590027 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0852643003 0858883002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0573085028 0924243001 0751471001 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0791587015 0924243001 0896152002 08...


In [27]:
submit_df = pd.merge(submit_df, result, on='customer_id', how='outer')
submit_df.head()

Unnamed: 0,customer_id,prediction_x,prediction_y
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0656719005 0745232001 09...,
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0800436010 0924243001 0739590027 07...,
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0852643003 0858883002 09...,
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0573085028 0924243001 0751471001 07...,
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0791587015 0924243001 0896152002 08...,


In [28]:
submit_df = submit_df.fillna(-1)
submit_df['prediction'] = submit_df.apply(
    lambda x: x['prediction_y'] if x['prediction_y'] != -1 else x['prediction_x'], axis=1)
submit_df.head()

Unnamed: 0,customer_id,prediction_x,prediction_y,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0656719005 0745232001 09...,-1,0568601043 0568601006 0656719005 0745232001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0800436010 0924243001 0739590027 07...,-1,0826211002 0800436010 0924243001 0739590027 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0852643003 0858883002 09...,-1,0794321007 0852643001 0852643003 0858883002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0573085028 0924243001 0751471001 07...,-1,0448509014 0573085028 0924243001 0751471001 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0791587015 0924243001 0896152002 08...,-1,0730683050 0791587015 0924243001 0896152002 08...
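
The fallback rule applied here is: use the model's prediction when the customer has one, otherwise keep the default. A minimal plain-Python sketch of the same rule, with made-up customer and item IDs:

```python
# Made-up customers and item IDs, for illustration only.
default_preds = {"c1": "A B C", "c2": "D E F", "c3": "G H I"}
model_preds = {"c2": "X Y Z"}  # only customers that passed the filters

# Prefer the model's list; fall back to the default when the customer is missing.
final_preds = {
    customer: model_preds.get(customer, default)
    for customer, default in default_preds.items()
}
print(final_preds)
```

Only `c2` receives the model's recommendations; `c1` and `c3` keep their defaults, just as rows with `prediction_y == -1` keep `prediction_x` in the DataFrame version.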


In [29]:
submit_df[submit_df['prediction_y'] != -1]

Unnamed: 0,customer_id,prediction_x,prediction_y,prediction
13,00009d946eec3ea54add5ba56d5210ea898def4b46c685...,0891899004 0562245099 0797892001 0516859008 07...,0568597007 0568601007 0831450002 0881244001 05...,0568597007 0568601007 0831450002 0881244001 05...
38,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,0734592001 0888024005 0572998013 0909869004 08...,0891591001 0933706001 0919499007 0911214001 09...,0891591001 0933706001 0919499007 0911214001 09...
169,0006d3ff0caf0cb4d4e0615ee5cb7d268622364d483335...,0930829001 0915529001 0870525005 0751471041 05...,0884319006 0928088001 0915529001 0832307007 09...,0884319006 0928088001 0915529001 0832307007 09...
175,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,0860285001 0863595006 0824526004 0751471022 08...,0824526004 0874819002 0893059005 0893059004 08...,0824526004 0874819002 0893059005 0893059004 08...
195,00080403a669b3b89d1bef1ec73ea466d95e39698d6dde...,0825771007 0784053005 0924243001 0914886003 08...,0914319001 0876147001 0868038003 0914319002 08...,0914319001 0876147001 0868038003 0914319002 08...
...,...,...,...,...
1371778,fff624f63f0279200646a4f8bf27e5150096212d50fdd0...,0399256001 0873045001 0842755001 0885870002 08...,0842755001 0865917002 0869397001 0811900002 08...,0842755001 0865917002 0869397001 0811900002 08...
1371787,fff673307d4cdbf688e4a0bcfe7f671036033dbe7eba01...,0865086004 0881916001 0711053003 0791587015 07...,0865086004 0754238024 0894956001 0852584001 09...,0865086004 0754238024 0894956001 0852584001 09...
1371876,fffabaebcc10efa0e613b58de37901e04fa25a2f90a0a8...,0652924004 0894400002 0924243001 0573937001 08...,0756904015 0854777001 0739533002 0844874012 05...,0756904015 0854777001 0739533002 0844874012 05...
1371879,fffae8eb3a282d8c43c77dd2ca0621703b71e90904dfde...,0865624003 0396135007 0797892001 0817472007 07...,0729928025 0817472004 0729928001 0865624003 08...,0729928025 0817472004 0729928001 0865624003 08...


In [30]:
submit_df = submit_df.drop(columns=['prediction_y', 'prediction_x'])
submit_df.head()

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0568601006 0656719005 0745232001 09...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0826211002 0800436010 0924243001 0739590027 07...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0852643001 0852643003 0858883002 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0448509014 0573085028 0924243001 0751471001 07...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0730683050 0791587015 0924243001 0896152002 08...


In [31]:
submit_df.to_csv('submission.csv', index=False)