# W2vec model for clicks

In this notebook, all known sessions from the full history are used to build a w2vec model. Information about event type and event time is removed, so the sequence of aids is the only information kept. Because training a w2vec model takes time (roughly three and a half hours of wall time for the cross-validation and test models together, in the run recorded below), it is done in a separate notebook. For the OTTO project, two w2vec models are built with slightly different parameters. This model uses a shorter window (window=3) and is used only to generate features for the clicks model, while another w2vec model with window=4 is used to build features for both the carts and orders models.

The hash function is the same for both models. It has been moved to otto_common because it is needed in every notebook that uses either of the models.
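
The body of `otto_common.simple_hash_function` is not shown in this notebook. As a rough illustration of why a custom `hashfxn` is passed at all: Python's built-in `hash()` for strings is salted per interpreter process unless `PYTHONHASHSEED` is set, so it gives different values between runs. The sketch below uses a hypothetical `example_hash_function` based on `zlib.crc32`; it is an assumption for illustration only, not the function used in this project.

```python
import zlib


def example_hash_function(text):
    """Hypothetical deterministic hash (NOT the real otto_common.simple_hash_function).

    Returns the same unsigned 32-bit integer for the same input in every
    interpreter session, unlike the built-in hash() on strings.
    """
    return zlib.crc32(str(text).encode("utf-8"))


# The same key always maps to the same integer, run after run.
first = example_hash_function("1234")
second = example_hash_function("1234")
print(first == second)
```

Note that gensim's documentation also states that a fully deterministic run requires `workers=1`; with `workers=4`, as used here, thread scheduling can still introduce small differences between runs.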
## Imports and definitions

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import gc

from gensim.models import Word2Vec

# functions and classes common for several notebooks of current project
import otto_common

In [2]:
!pip install polars
import polars as pl

Collecting polars
  Downloading polars-0.16.14-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.2 MB)
Installing collected packages: polars
Successfully installed polars-0.16.14

In [3]:
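def prepare_sentences(sessions_path):
    """Return one list of aids per session, dropping event type and timestamp."""
    df = pl.read_parquet(sessions_path)
    # Collect each session's aids into a single list (one w2vec "sentence" per session).
    df = df.groupby('session').agg(pl.col('aid').alias('sentence'))
    return df['sentence'].to_list()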
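
To make the shape of the "sentences" concrete, here is a minimal plain-Python sketch of the same grouping on a toy event log. The rows, the column order, and the helper name `build_sentences` are illustrative assumptions; the sketch also assumes that rows are already in chronological order within each session, as the parquet files are presumed to be.

```python
from collections import defaultdict

# Toy event log: (session, aid, ts, type) rows, assumed to be time-ordered per session.
toy_events = [
    (1, 101, 10, "clicks"),
    (1, 102, 20, "carts"),
    (2, 201, 15, "clicks"),
    (1, 101, 30, "orders"),
    (2, 202, 25, "clicks"),
]


def build_sentences(events):
    """Group aids by session, discarding timestamp and event type (hypothetical helper)."""
    sentences_by_session = defaultdict(list)
    for session, aid, _ts, _event_type in events:
        sentences_by_session[session].append(aid)
    return list(sentences_by_session.values())


toy_sentences = build_sentences(toy_events)
print(toy_sentences)  # [[101, 102, 101], [201, 202]]
```

Each inner list plays the role of a sentence and each aid plays the role of a word, which is the input format `Word2Vec` expects.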

## W2vec model for cross-validation dataset

In [4]:
# Load the sessions available for cross-validation and transform them into sequences of aids.
sessions_path_cv = '/kaggle/input/otto-prepare-cv/cv_train.parquet'

sentences = prepare_sentences(sessions_path_cv)

In [5]:
%%time
# Train and save the w2vec model for cross-validation.
w2vec = Word2Vec(sentences=sentences, vector_size=64, window=3, negative=8, ns_exponent=0.2, sg=1,
                 min_count=1, workers=4, seed=1, hashfxn=otto_common.simple_hash_function)

w2vec.save("word2vec_cv.wordvectors")
del sentences, w2vec
gc.collect()

CPU times: user 5h 33min 7s, sys: 31.1 s, total: 5h 33min 38s
Wall time: 1h 28min 24s


0

## W2vec model for test dataset

In [6]:
# Load the sessions available for test (this means full data) and transform them into sequences of aids.
sessions_path_test = '/kaggle/input/otto-prepare-cv/train_full.parquet'

sentences = prepare_sentences(sessions_path_test)

In [7]:
%%time
# Train and save the w2vec model for test.

w2vec_test = Word2Vec(sentences=sentences, vector_size=64, window=3, negative=8, ns_exponent=0.2, sg=1,
                      min_count=1, workers=4, seed=1, hashfxn=otto_common.simple_hash_function)
w2vec_test.save("word2vec_test.wordvectors")

CPU times: user 7h 46min 35s, sys: 41.8 s, total: 7h 47min 17s
Wall time: 2h 2min 58s
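
The `%%time` outputs report both CPU time and wall time. As a quick sanity check on the timings recorded in this notebook, the sketch below converts them to seconds and computes the CPU-to-wall ratio for each run plus the combined wall time. A ratio close to 4 would be consistent with the four worker threads being busy most of the time, although this is only a rough reading of the numbers.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/min/s triple to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Totals copied from the %%time outputs above.
runs = {
    "cv": {"cpu": to_seconds(5, 33, 38), "wall": to_seconds(1, 28, 24)},
    "test": {"cpu": to_seconds(7, 47, 17), "wall": to_seconds(2, 2, 58)},
}

for name, run in runs.items():
    ratio = run["cpu"] / run["wall"]
    print(f"{name}: CPU/wall = {ratio:.2f}")

total_wall = sum(run["wall"] for run in runs.values())
hours, remainder = divmod(total_wall, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"total wall time: {hours}h {minutes}min {seconds}s")  # 3h 31min 22s
```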
