# Feature: Word Mover's Distance

Based on the pre-trained word embeddings, we'll compute the Word Mover's Distance between each tokenized question pair.

## Imports

The `pygoose` utility package imports `numpy`, `pandas`, `matplotlib`, and a helper `kg` module into the root namespace.

In [1]:
from pygoose import *

In [2]:
from gensim.models.wrappers.fasttext import FastText

In [3]:
from gensim.models import KeyedVectors

## Config

Automatically discover the paths to various data folders and compose the project structure.

In [4]:
project = kg.Project.discover()

Identifier for storing these features on disk and referring to them later.

In [5]:
feature_list_id = 'wmd'

## Read Data

Preprocessed and tokenized questions.

In [6]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_test.pickle')

In [7]:
tokens = tokens_train + tokens_test

Pretrained fastText word vectors, stored in the word2vec text format.

In [8]:
#embedding_model = FastText.load_word2vec_format(project.aux_dir + 'fasttext_vocab.vec')

In [9]:
embedding_model = KeyedVectors.load_word2vec_format(project.aux_dir + "fasttext_vocab.vec")
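
For reference, the word2vec text format read by `load_word2vec_format` starts with a header line holding the vocabulary size and the vector dimensionality, followed by one line per word with its space-separated components. Below is a minimal sketch of a parser for that layout. The `read_vec` helper and the toy file contents are made up for illustration and are not part of this project or of gensim.

```python
import io

import numpy as np


def read_vec(handle):
    """Parse word2vec text format: a 'count dim' header, then one word per line."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        vector = np.array([float(v) for v in values], dtype=np.float32)
        if vector.shape != (dim,):
            raise ValueError('Unexpected vector length for word: ' + word)
        vectors[word] = vector
    if len(vectors) != count:
        raise ValueError('Header count does not match the number of words read')
    return vectors


# Toy file contents, invented for this example.
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_vectors = read_vec(toy_file)
print(sorted(toy_vectors), toy_vectors['world'].shape)
```

In the notebook itself we rely on `KeyedVectors`, which also builds the vocabulary index that `wmdistance` needs.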

## Build Features

In [10]:
def wmd(pair):
    return embedding_model.wmdistance(pair[0], pair[1])
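
To build some intuition for what this distance measures, here is a sketch of the *relaxed* Word Mover's Distance, a cheap lower bound on the exact value. It is not what `wmdistance` computes (that solves a full optimal transport problem). In the relaxation, every token simply moves all of its weight to the nearest token of the other question, and the larger of the two directions is taken. The 2-D embeddings and the helper names below are hypothetical.

```python
import numpy as np

# Hypothetical 2-D embeddings, invented for this illustration.
toy_embeddings = {
    'cat': np.array([1.0, 0.0]),
    'kitten': np.array([0.9, 0.1]),
    'car': np.array([0.0, 1.0]),
    'truck': np.array([0.1, 0.9]),
}


def one_sided_cost(source, target, embeddings):
    """Average distance from each source token to its nearest target token."""
    src = np.array([embeddings[w] for w in source])
    tgt = np.array([embeddings[w] for w in target])
    # Pairwise Euclidean distances, shape (len(source), len(target)).
    dists = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())


def relaxed_wmd(doc1, doc2, embeddings):
    """Lower bound on WMD: the larger of the two one-sided relaxations."""
    return max(
        one_sided_cost(doc1, doc2, embeddings),
        one_sided_cost(doc2, doc1, embeddings),
    )


near = relaxed_wmd(['cat', 'car'], ['kitten', 'truck'], toy_embeddings)
far = relaxed_wmd(['cat', 'kitten'], ['car', 'truck'], toy_embeddings)
print('similar pair:', near, 'dissimilar pair:', far)
```

Pairs whose tokens sit close together in embedding space get a small distance even when they share no words, which is the property we want from this feature.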

In [11]:
wmds = kg.jobs.map_batch_parallel(
    tokens,
    item_mapper=wmd,
    batch_size=1000,
)

Batches: 100%|██████████| 2751/2751 [3:36:11<00:00,  4.72s/it]  


In [12]:
wmds = np.array(wmds).reshape(-1, 1)
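
One thing worth checking at this point: to the best of my knowledge, gensim's `wmdistance` returns `inf` when one of the two documents has no in-vocabulary tokens, so the array may contain non-finite values. The sketch below counts them and caps them on a toy array. Replacing them with the largest finite distance is only one hypothetical option, not a step this notebook performs, and `toy_wmds` stands in for the real `wmds`.

```python
import numpy as np

# Toy distances standing in for the real `wmds`; the values are made up.
toy_wmds = np.array([0.42, np.inf, 1.37, np.inf]).reshape(-1, 1)

finite_mask = np.isfinite(toy_wmds)
print('Non-finite entries:', int((~finite_mask).sum()))

# Hypothetical choice: cap non-finite entries at the largest finite distance.
largest_finite = toy_wmds[finite_mask].max()
cleaned = np.where(finite_mask, toy_wmds, largest_finite)
print(cleaned.ravel())
```

Tree-based models can often cope with a large sentinel value, while linear models and scalers generally need finite inputs, so the right treatment depends on what consumes the feature.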

In [13]:
X_train = wmds[:len(tokens_train)]
X_test = wmds[len(tokens_train):]

In [14]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)

X_train: (404290, 1)
X_test:  (2345796, 1)


## Save Features

In [15]:
feature_names = [
    'wmd',
]

In [16]:
project.save_features(X_train, X_test, feature_names, feature_list_id)