# Word2Vec experiments

## Introduction

For this project, we need to learn word embeddings with the Word2Vec algorithm.
The embeddings are the weights of the single hidden layer of a sequential neural network.
In this notebook, we focus only on the methods used to create these embeddings.
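
To make the link between weights and embeddings concrete, here is a minimal NumPy sketch. The sizes and the random matrix `W` are illustrative assumptions, not values from this project. Multiplying a one-hot vector by the weight matrix of a linear layer selects a single row, and that row is the embedding of the word.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3                   # toy sizes, for illustration only
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))   # stand-in for learned layer weights

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# A linear layer without bias applied to a one-hot input is a row lookup.
embedding = one_hot @ W
print(np.allclose(embedding, W[word_index]))
```

The check prints `True`: the layer output for word 2 is exactly row 2 of `W`.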



## Skip-gram

Skip-gram, as opposed to CBOW, takes a word as input and predicts the words in its context.
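
As a rough illustration of the training examples Skip-gram relies on, the sketch below lists every (input word, context word) pair for a short token list. The function name `skipgram_pairs` and the `window` parameter are hypothetical and chosen for this example only.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
print(pairs)
```

With a window of 1, "quick" produces the pairs ("quick", "the") and ("quick", "brown"). CBOW would go the other way and predict "quick" from "the" and "brown".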

TODO: explain method

## One-Hot encoding

TODO: explain

## Dataset

TODO: find a dataset

## First approach: use Gensim to learn the embeddings

TODO: gensim

## Using Keras to fit a hand-crafted Word2Vec model

To better understand the underlying structure of the algorithm, we decided to implement our own neural network using Keras.

We plan to build the vocabulary and the contexts from our dataset, use the contexts to train a Keras neural network that will serve as our Word2Vec model, and then compare it with other Word2Vec models.

In [None]:
from text_preprocessing import NLTKTokenizer

import keras
from keras import Sequential
from keras.layers import Dense, Activation
from itertools import chain
import pandas as pd
import numpy as np

### Preparing the context

In this part, we build the contexts used to train our Word2Vec model. The first step is to split our texts into tokens. Once this is done, we have a list of words that is processed as a stream by our build_context function.
This function also builds the vocabulary while processing the context of each word.

Here, we only look at the word before and the word after the current word to define its context. In the output, each word is represented by its index in the vocabulary, and each tuple starts with the current word followed by its neighbors.

For example, with the sentence "This is a test. a", the vocabulary is "{'This': 0, 'is': 1, 'a': 2, 'test': 3, '.': 4}" and the context is "[(0, 1), (1, 0, 2), (2, 1, 3), (3, 2, 4), (4, 3, 2), (2, 4)]", meaning that the word "This" (index 0) has index 1 as its context, which is the word "is" in our vocabulary.
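
The same scheme can be tried on a toy input without any tokenizer. The sketch below is a simplified, non-streaming stand-in: the helper `toy_context` and the hard-coded token list are assumptions made for illustration, and a real tokenizer may split text differently.

```python
def toy_context(tokens):
    """Return (vocab, contexts) using one neighbour on each side of every token."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))  # index = order of first appearance
    ids = [vocab[token] for token in tokens]

    contexts = []
    for i, current in enumerate(ids):
        # Left neighbour (if any) followed by right neighbour (if any).
        neighbours = ids[max(0, i - 1):i] + ids[i + 1:i + 2]
        contexts.append((current, *neighbours))
    return vocab, contexts


# Hypothetical pre-tokenized sentence: "a dog is a dog"
vocab, contexts = toy_context(["a", "dog", "is", "a", "dog"])
print(vocab)     # {'a': 0, 'dog': 1, 'is': 2}
print(contexts)  # [(0, 1), (1, 0, 2), (2, 1, 0), (0, 2, 1), (1, 0)]
```

Repeated words reuse their existing index, so the vocabulary has three entries even though the sentence has five tokens. The first and last tuples only have one neighbor.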

In [None]:
PATH = './data'

train_3 = f'{PATH}/data_train_3.csv'
test_3 = f'{PATH}/data_test_3.csv'
train_7 = f'{PATH}/data_train_7.csv'
train_16m_3 = f'{PATH}/training.1600000.processed.noemoticon.csv'

tweets = pd.read_csv(train_3, sep='\t', names=['ID', 'Class', 'Tweet'])
sample = tweets.Tweet.head(500)

In [None]:
str1 = 'a a This is a test. a'
str2 = 'The quick brown fox jumped over the lazy dog.'
str3 = 'Another text with a dog.'

# Series.append was removed in pandas 2.0; pd.concat is the replacement
tweets = pd.concat([pd.Series(str1), pd.Series(str2), pd.Series(str3)])

tokenizer = NLTKTokenizer(str2)

dataset = pd.Series(iter(tokenizer))
onehot = pd.get_dummies(dataset)


def build_context(stream):
    '''
    Yields one tuple of vocabulary indices per token of the stream:
    the current word first, then its left and right neighbors.
    The vocabulary is shared across calls through build_context.vocab.
    '''
    # Local sliding window, so that no state leaks from one text to the next
    window = []
    for token in stream:
        if token not in build_context.vocab:
            build_context.vocab[token] = build_context.count
            build_context.count += 1
        window.append(build_context.vocab[token])
        if len(window) > 3:
            window.pop(0)
        if len(window) == 2:
            # First word of the text: it only has a right neighbor
            yield (window[0], window[1])
        elif len(window) == 3:
            # Middle word: (current, previous, next)
            yield (window[1], window[0], window[2])
    if len(window) >= 2:
        # Last word of the text: it only has a left neighbor
        yield (window[-1], window[-2])


build_context.vocab = {}
build_context.count = 0

contexts = []
for t in sample:
    currTok = NLTKTokenizer(t)
    contexts.append(list(build_context(iter(currTok))))
    
contexts = list(chain.from_iterable(contexts))
print(sample)
print('current context: ', contexts)
print('vocabulary: ', build_context.vocab)
print('voc size: ', build_context.count)

### Keras NN model

The layer sizes below are arbitrary for now and can probably still be optimized.

TODO: desc
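
As a rough picture of what such a network computes, the sketch below reproduces a forward pass in NumPy: a linear hidden layer followed by a softmax over the vocabulary. The toy sizes, random weights and variable names are assumptions for illustration only. They are not trained values.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


vocab_size, hidden_dim = 6, 4                       # toy sizes; the notebook uses 300 hidden units
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, hidden_dim))    # input -> hidden weights
b_in = np.zeros(hidden_dim)
W_out = rng.normal(size=(hidden_dim, vocab_size))   # hidden -> output weights
b_out = np.zeros(vocab_size)

x = np.zeros((1, vocab_size))
x[0, 3] = 1.0                                       # one-hot input for the word with index 3

hidden = x @ W_in + b_in                            # linear activation: no non-linearity
probs = softmax(hidden @ W_out + b_out)             # one probability per vocabulary word

print(probs.shape)
```

The output has one row per input word and one column per vocabulary entry, and each row sums to 1.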

In [None]:
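model = Sequential()
# Hidden layer: its weights are the word embeddings (300 dimensions)
model.add(Dense(300, input_dim=build_context.count))
model.add(Activation("linear"))
# Output layer: one probability per vocabulary word
model.add(Dense(build_context.count))
model.add(Activation("softmax"))

# The targets are probability distributions over the vocabulary,
# so categorical cross-entropy matches the softmax output.
model.compile(optimizer=keras.optimizers.RMSprop(), loss='categorical_crossentropy', metrics=['accuracy'])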


### Training dataset

In [None]:
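def index_to_onehot(index, n):
    '''Returns a (1, n) one-hot row vector with a 1 at position `index`.'''
    res = np.zeros((1, n))
    res[0, index] = 1
    return res


def context_to_onehot(neighbors, n):
    '''
    Returns a (1, n) target vector with equal weight on each distinct
    neighbor, normalized so that the row sums to 1.
    '''
    res = np.zeros((1, n))
    for index in neighbors:
        res[0, index] = 1
    return res / len(set(neighbors))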


X_train = np.array([index_to_onehot(x[0], build_context.count)[0] for x in contexts])
Y_train = np.array([context_to_onehot(x[1:], build_context.count)[0] for x in contexts])


print(Y_train)
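
As a worked example of this encoding, here is a compact standalone sketch. The helper `to_training_pair` is hypothetical and mirrors the two functions above on a single context tuple with a toy vocabulary of 5 words.

```python
import numpy as np


def to_training_pair(context, n):
    """Turn (word, neighbour, ...) indices into a one-hot input and a normalized target."""
    word, neighbours = context[0], context[1:]
    x = np.zeros(n)
    x[word] = 1.0
    y = np.zeros(n)
    y[list(neighbours)] = 1.0
    # Divide by the number of distinct neighbours so the target sums to 1.
    return x, y / len(set(neighbours))


x, y = to_training_pair((2, 1, 3), n=5)
print(x)  # input: word 2
print(y)  # target: half of the mass on word 1, half on word 3
```

When both neighbors are the same word, for example the tuple `(3, 2, 2)`, the target puts all of its mass on that single index.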


### Training the model

In [None]:
epoch = 15
batch_size = 32

model.summary()
model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size)

In [None]:
def context_of_word(word, nb):
    '''
    Prints the nb words that the model considers most likely
    to appear in the context of the given word
    '''
    if word not in build_context.vocab:
        print('No such word in vocabulary')
        return

    index = build_context.vocab[word]
    x_test = index_to_onehot(index, build_context.count)
    y_test = model.predict(x_test, verbose=1)

    # Indices of the nb highest probabilities, most likely first
    closest_words = np.argsort(y_test[0])[::-1][:nb]

    # Invert the vocabulary to map indices back to words
    index_to_word = {v: k for k, v in build_context.vocab.items()}
    for i in closest_words:
        print('closest word in context is: ', index_to_word[int(i)])

context_of_word('how', 4)