# W2vec features for clicks

This notebook builds the w2vec features for the clicks model, using the w2vec model from the "W2vec model for clicks" notebook. The clicks model uses two w2vec features: the w2vec similarity between the candidate and the last aid in the session (also called "first_feature"), and the w2vec similarity between the candidate and the aid before last in the session (also called "second_feature").

## Imports and definitions

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import gc
from humanize import naturalsize
from gensim.models import Word2Vec
from pandarallel import pandarallel

# functions and classes common for several notebooks of current project
import otto_common

In [2]:
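# Calculate the w2vec similarity between the aid in `col_name` and the candidate in `click_predictions`.
# Rows with a negative value in `col_name` (presumably a placeholder for a missing aid) get -1.
# Returns a dataframe that contains only the `col_name_result` column.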
def w2v_similarity(features_path, model_w2v_path, col_name, col_name_result):
    df_w2v = pd.read_parquet(features_path)
    df_w2v = df_w2v[[col_name, 'click_predictions']]
    gc.collect()
    model = Word2Vec.load(model_w2v_path)
    pandarallel.initialize(nb_workers=4)
    df_w2v[col_name_result] = df_w2v.parallel_apply(
        lambda x: model.wv.similarity(x[col_name], x.click_predictions) if x[col_name] >= 0 else -1, axis=1
    )
    del df_w2v[col_name], df_w2v['click_predictions']
    return df_w2v
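
For reference, gensim's `model.wv.similarity` returns the cosine similarity between the two word vectors. The following is a minimal pure-Python sketch of that computation together with the `-1` fallback used above. The toy vectors and the helper names `cosine_similarity` and `similarity_or_sentinel` are hypothetical and are not part of the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_or_sentinel(context_aid, candidate_aid, vectors):
    """Mirror the lambda in w2v_similarity: negative aids map to -1."""
    if context_aid < 0:
        return -1
    return cosine_similarity(vectors[context_aid], vectors[candidate_aid])


# Hypothetical 2-dimensional embeddings keyed by aid (illustration only).
toy_vectors = {10: [1.0, 0.0], 11: [0.0, 1.0], 12: [2.0, 0.0]}

same_direction = similarity_or_sentinel(10, 12, toy_vectors)  # parallel vectors
orthogonal = similarity_or_sentinel(10, 11, toy_vectors)      # perpendicular vectors
missing = similarity_or_sentinel(-1, 11, toy_vectors)         # no context aid
```

Note that cosine similarity itself ranges from -1 to 1, so the `-1` fallback is indistinguishable from a pair of exactly opposite vectors. Whether that matters depends on the downstream model.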

## W2vec features for cross-validation dataset

In [3]:
# Define the paths, calculate the w2vec features and join them to the dataframe with all the other features.
features_train = '/kaggle/input/otto-feature-engineering-clicks/cv1_features.parquet'
model_train = '/kaggle/input/otto-word2vec/word2vec_cv.wordvectors'

first_feature = w2v_similarity(features_train, model_train, 'first_aid', 'similarity_first')
second_feature = w2v_similarity(features_train, model_train, 'second_aid', 'similarity_second')
df_train = pd.read_parquet(features_train)
df_train = pd.concat([df_train, first_feature, second_feature], axis=1)


INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [4]:
# Check the in-memory size of the dataframe and export it to a file.
size = df_train.memory_usage(deep=True).sum()
print(naturalsize(size))
df_train.to_parquet('cv1_features_with_w2v.parquet')

del df_train, first_feature, second_feature
gc.collect()

4.6 GB


0

## W2vec features for the test dataset

In [5]:
# The features for the test dataset are split between two chunks, so we build the features for each chunk separately.
# First chunk is processed in this cell.
features_test1 = '/kaggle/input/otto-feature-engineering-clicks/test_features_cart_part_0.parquet'
features_test2 = '/kaggle/input/otto-feature-engineering-clicks/test_features_cart_part_1.parquet'
model_test = '/kaggle/input/otto-word2vec/word2vec_test.wordvectors'

first_feature = w2v_similarity(features_test1, model_test, 'first_aid', 'similarity_first')
second_feature = w2v_similarity(features_test1, model_test, 'second_aid', 'similarity_second')
df_test1 = pd.read_parquet(features_test1)
df_test1 = pd.concat([df_test1, first_feature, second_feature], axis=1)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [6]:
# Check the in-memory size of the dataframe and export it to a file.
size = df_test1.memory_usage(deep=True).sum()
print(naturalsize(size))
df_test1.to_parquet('test_features_with_w2v_part_0.parquet')

del df_test1, first_feature, second_feature
gc.collect()

3.6 GB


21

In [7]:
# Same for the second chunk.
first_feature = w2v_similarity(features_test2, model_test, 'first_aid', 'similarity_first')
second_feature = w2v_similarity(features_test2, model_test, 'second_aid', 'similarity_second')
df_test2 = pd.read_parquet(features_test2)
df_test2 = pd.concat([df_test2, first_feature, second_feature], axis=1)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [8]:
# Check the in-memory size of the dataframe and export it to a file.
size = df_test2.memory_usage(deep=True).sum()
print(naturalsize(size))
df_test2.to_parquet('test_features_with_w2v_part_1.parquet')

3.6 GB
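
Both test chunks go through the same steps, so the per-chunk cells could be folded into one loop that frees each chunk before loading the next. The following is a minimal sketch of that pattern, not the notebook's actual code. The function `process_chunks` and its `build_features` and `save` arguments are hypothetical, and the stand-in callables below only record what would be written.

```python
import gc


def process_chunks(chunk_paths, build_features, save):
    """Build and save features chunk by chunk, releasing memory in between."""
    written = []
    for i, path in enumerate(chunk_paths):
        df = build_features(path)  # e.g. read the chunk and join the w2vec columns
        out_name = f"test_features_with_w2v_part_{i}.parquet"
        save(df, out_name)
        written.append(out_name)
        del df  # drop the reference before the next chunk is loaded
        gc.collect()
    return written


# Stand-in callables so the sketch runs without the real data (assumptions).
saved = {}
written = process_chunks(
    ["chunk_a", "chunk_b"],
    build_features=lambda path: {"source": path},
    save=lambda df, out_name: saved.update({out_name: df}),
)
```

In the notebook, `build_features` would wrap the two `w2v_similarity` calls and the `pd.concat`, and `save` would call `to_parquet`.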
