# Word2Vec experiments

## Introduction

For this project we need to learn word embeddings with the Word2Vec algorithm.
The embeddings are the weights of the single hidden layer of a sequential neural network.
In this notebook we focus only on the methods used to create these embeddings.
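
To make the link between layer weights and embeddings concrete, here is a minimal sketch in plain Python. The weight values and the helper names `one_hot` and `hidden_activation` are made up for illustration, and the layer is assumed to be linear with no bias. Under those assumptions, feeding a one-hot vector through the layer returns exactly one row of the weight matrix, and that row is the word's embedding.

```python
# Toy weight matrix of the hidden layer: one row per vocabulary word,
# one column per embedding dimension (values are made up for illustration).
weights = [
    [0.1, 0.2, 0.3],   # word 0
    [0.4, 0.5, 0.6],   # word 1
    [0.7, 0.8, 0.9],   # word 2
    [1.0, 1.1, 1.2],   # word 3
]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def hidden_activation(x, w):
    """Multiply the row vector `x` by the matrix `w` (linear layer, no bias)."""
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]


word_index = 2
embedding = hidden_activation(one_hot(word_index, len(weights)), weights)

# The one-hot product simply selects row `word_index` of the weight matrix.
print(embedding == weights[word_index])
```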



## Skip-gram

Skip-gram takes a word as input and predicts the words in its context. CBOW does the opposite: it predicts a word from its context.
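
As a rough illustration of the training data this implies, the sketch below lists the (input word, context word) pairs for a short token list. The function name `skipgram_pairs` and its `window` parameter are hypothetical and are not taken from any library.

```python
def skipgram_pairs(tokens, window=1):
    """Yield (center, context) pairs for every token and each neighbor
    at most `window` positions away."""
    for i, center in enumerate(tokens):
        start = max(i - window, 0)
        stop = min(i + window + 1, len(tokens))
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


pairs = list(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
print(pairs)
```

With a window of 1, each inner word contributes two pairs and each boundary word contributes one.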

TODO: explain method

## One-Hot encoding

TODO: explain

## Used dataset

TODO: find a dataset

## First approach: use Gensim to learn the embeddings

TODO: gensim

## Using Keras to fit a hand-crafted Word2Vec model

To better understand the underlying structure of the algorithm, we decided to implement our own neural network using Keras.

We plan to build the vocabulary and the contexts from our dataset, train a Keras neural network on those contexts as our own Word2Vec model, and compare it with other Word2Vec models.

In [33]:
from text_preprocessing import NLTKTokenizer

import keras
from keras import Sequential
from keras.layers import Dense, Activation
from itertools import chain
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
import json

In [34]:
# %load decompress_dataset.py
import json
import subprocess
import glob
#import mysql.connector
#from tweet_scraper import insert_db


def tweet_generator():
    tar_files = glob.glob('./*.tar')

    for tar_file in tar_files:
        subprocess.call(['tar', '-xf', tar_file])
        bz2_files = glob.glob('./*/*/*/*/*.json.bz2')
        for bz2_file in bz2_files:
            subprocess.call(['bzip2', '-d', bz2_file])
            # `bzip2 -d` drops the '.bz2' suffix and leaves the plain JSON file
            file = bz2_file[:-len('.bz2')]
            with open(file, "r") as ins:
                for line in ins:
                    yield json.loads(line)
            subprocess.call(["rm", file])


def get_tweet_text(tweet):
    if "extended_tweet" in tweet:
        return tweet["extended_tweet"]["full_text"]
    return tweet["text"]


class Tweet:
    def __init__(self, id, text):
        self.id = id
        self.text = text

def decompress():
    '''
    mySQLdb = mysql.connector.connect(
        host="localhost",
        user="nicolas",
        passwd="nicolas",
        database="tweets",
    )
'''
    tweets = filter(lambda x: "lang" in x, tweet_generator())
    tweets = filter(lambda x: x["lang"] == "en", tweets)
    tweets = filter(lambda x: "retweeted_status" not in x, tweets)
    tweets = map(lambda x: Tweet(x["id"], get_tweet_text(x)), tweets)
    tweets = filter(lambda x: '…' not in x.text, tweets)
    return tweets
    #insert_db(mySQLdb, tweets)

tweets = decompress()


In [35]:
val = []
for _ in range(100):
    try:
        val.append(NLTKTokenizer(next(tweets).text))
    except StopIteration:
        break

In [36]:
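tweet_list = []

for tweet in val:
    # `val` already holds NLTKTokenizer objects (built in the previous cell),
    # so we iterate over them directly. Wrapping them in NLTKTokenizer a
    # second time is the likely cause of the error this cell used to raise:
    # "TypeError: expected string or bytes-like object".
    tweet_list.append(list(tweet))

# gensim < 4.0 API: from gensim 4.0 on, `size` is named `vector_size`
model = Word2Vec(tweet_list, size=300, sg=1, window=1, min_count=1)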

### Preparing the context

In this part, we build the contexts used to train our Word2Vec model. The first step is to split each text into tokens. This gives a list of words that our build_context function processes as a stream.
This function also builds the vocabulary while it computes the context of each word.

Here, the context of a word is only the word before it and the word after it. In the output, each word is represented by its index in the vocabulary, and each tuple lists the current word first, followed by its neighbors.

For example, with the sentence "This is a test a.", the vocabulary is "{'This': 0, 'is': 1, 'a': 2, 'test': 3, '.': 4}" and the context is "[(0, 1), (1, 0, 2), (2, 1, 3), (3, 2, 2), (2, 3, 4), (4, 2)]". The first tuple means that the word "This" (index 0) has index 1 as its context, which is the word "is" in our vocabulary.
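
The indexing scheme can be checked on a hand-tokenized sequence. The sketch below is a simplified stand-in for the notebook's function, not the function itself: the name `build_vocab_and_context` is hypothetical, and the token order (with "." last) is an assumption chosen to reproduce the vocabulary and context tuples quoted above.

```python
def build_vocab_and_context(tokens):
    """Return (vocab, contexts) for a list of tokens.

    Each context tuple is (word, previous word, next word) as vocabulary
    indices; the first and last tokens have a single neighbor.
    """
    vocab = {}
    for token in tokens:
        # setdefault assigns the next free index on first appearance only
        vocab.setdefault(token, len(vocab))
    indices = [vocab[token] for token in tokens]

    contexts = []
    for i, index in enumerate(indices):
        neighbors = indices[max(i - 1, 0):i] + indices[i + 1:i + 2]
        contexts.append((index, *neighbors))
    return vocab, contexts


# Hand-tokenized input (assumed order: the "." comes last)
vocab, contexts = build_vocab_and_context(["This", "is", "a", "test", "a", "."])
print(vocab)
print(contexts)
```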

In [2]:
PATH = './data'

train_3 = f'{PATH}/data_train_3.csv'
test_3 = f'{PATH}/data_test_3.csv'
train_7 = f'{PATH}/data_train_7.csv'
train_16m_3 = f'{PATH}/training.1600000.processed.noemoticon.csv'

tweets = pd.read_csv(train_3, sep='\t', names=['ID', 'Class', 'Tweet'])
sample = tweets.Tweet.head(500)

tweets_dir = f'{PATH}/2017'
# Each line of the file is one JSON-encoded tweet
with open(f'{tweets_dir}/00.json', encoding='utf-8') as f:
    tweets = [json.loads(line) for line in f]

for key in tweets[0]:
    print(key)

print(tweets[0]['text'])

created_at
id
id_str
text
source
truncated
in_reply_to_status_id
in_reply_to_status_id_str
in_reply_to_user_id
in_reply_to_user_id_str
in_reply_to_screen_name
user
geo
coordinates
place
contributors
is_quote_status
quote_count
reply_count
retweet_count
favorite_count
entities
favorited
retweeted
filter_level
lang
timestamp_ms
捜しやすいリハ着ありがとー(๑´ლ`๑)ﾌ°ﾌ°♡


In [3]:
tweet_list = []

for tweet in sample:
    currTok = NLTKTokenizer(tweet)
    sentence = []
    for token in currTok:
        sentence.append(token)
    tweet_list.append(sentence)

model = Word2Vec(tweet_list, size=300, sg=1, window=1, min_count=1)
print(model.predict_output_word(['How', 'are', 'you']))

[('eyes\\u002c', 0.00026797055), ('tube', 0.00026796898), ('Reason', 0.00026796872), ('biscuit', 0.00026796845), ('Anybody', 0.000267968), ('delete', 0.00026796773), ('summary', 0.00026796758), ('drinking', 0.00026796758), ('MMFlint', 0.00026796694), ('Ethan_Hammer', 0.0002679666)]


In [3]:
str1 = 'a a This is a test. a'
str2 = 'The quick brown fox jumped over the lazy dog.'
str3 = 'Another text with a dog.'

# Series.append was removed in pandas 2.0, so build the Series in one call
tweets = pd.Series([str1, str2, str3])

tokenizer = NLTKTokenizer(str2)

dataset = pd.Series(iter(tokenizer))
onehot = pd.get_dummies(dataset)


def build_context(stream):
    '''
    Yields one tuple of vocabulary indices per token: (word, previous, next).
    The first token has no previous word and the last token has no next word,
    so their tuples hold a single neighbor. New tokens are added to
    build_context.vocab as they are met.
    '''
    # A fresh list per call: the former mutable default argument was shared
    # between calls, and texts of one or two tokens made the function crash.
    indices = []
    for token in stream:
        if token not in build_context.vocab:
            build_context.vocab[token] = build_context.count
            build_context.count += 1
        indices.append(build_context.vocab[token])

    for i, index in enumerate(indices):
        neighbors = indices[max(i - 1, 0):i] + indices[i + 1:i + 2]
        # A one-token text has no neighbors, hence no context to learn from
        if neighbors:
            yield (index, *neighbors)


build_context.vocab = {}
build_context.count = 0

contexts = []
for t in sample:
    currTok = NLTKTokenizer(t)
    contexts.append(list(build_context(iter(currTok))))
    
contexts = list(chain.from_iterable(contexts))

print(sample)
print('current context: ', contexts)
print('vocabulary: ', build_context.vocab)
print('voc size: ', build_context.count)


The
quick
brown
fox
jumped
over
the
lazy
dog
.


In [None]:
model = Word2Vec(size=300, sg=1, window=1)

### Keras NN model

The layer sizes and the optimizer below are arbitrary for the moment and can probably still be tuned.

TODO: desc

In [4]:
model = Sequential()
model.add(Dense(300, input_dim=build_context.count))
model.add(Activation("linear"))
model.add(Dense(build_context.count))
model.add(Activation("softmax"))


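# The softmax output is compared with a target distribution over the
# vocabulary, so categorical cross-entropy is the matching loss.
model.compile(optimizer=keras.optimizers.RMSprop(), loss='categorical_crossentropy', metrics=['accuracy'])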


### Training dataset

In [5]:
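def index_to_onehot(index, n):
    '''
    Returns a (1, n) array of zeros with a 1 at position index
    '''
    res = np.zeros((1, n))
    res[0, index] = 1
    return res


def context_to_onehot(neighbors, n):
    '''
    Returns a (1, n) array that spreads a total weight of 1 equally
    over the distinct indices in neighbors
    '''
    res = np.zeros((1, n))
    for index in neighbors:
        res[0, index] = 1
    return res / len(set(neighbors))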


X_train = np.array([index_to_onehot(x[0], build_context.count)[0] for x in contexts])
Y_train = np.array([context_to_onehot(x[1:], build_context.count)[0] for x in contexts])


print(Y_train)


[[0.  1.  0.  ... 0.  0.  0. ]
 [0.5 0.  0.5 ... 0.  0.  0. ]
 [0.  0.5 0.  ... 0.  0.  0. ]
 ...
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0.5]
 [0.  0.  0.  ... 0.  0.  0. ]]
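
The 0.5 and 1 values printed above come from how the neighbors of a word are encoded. The plain-Python sketch below mirrors that encoding on a vocabulary of 5 words. The name `context_distribution` is hypothetical, and it returns a list where the notebook's function returns a NumPy array.

```python
def context_distribution(neighbors, n):
    """Spread a total weight of 1 equally over the distinct neighbor indices."""
    distinct = set(neighbors)
    return [1 / len(distinct) if i in distinct else 0.0 for i in range(n)]


print(context_distribution((0, 2), 5))  # two distinct neighbors
print(context_distribution((1,), 5))    # first or last word: one neighbor
print(context_distribution((2, 2), 5))  # same neighbor on both sides
```

Each target row therefore sums to 1, which lets it be read as a probability distribution over the vocabulary.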


### Training the model

In [6]:
epoch = 15
batch_size = 32

model.summary()
model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 300)               1119900   
_________________________________________________________________
activation_1 (Activation)    (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 3732)              1123332   
_________________________________________________________________
activation_2 (Activation)    (None, 3732)              0         
Total params: 2,243,232
Trainable params: 2,243,232
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f3b3aea4d68>
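
The parameter counts in the summary follow from the layer shapes: a fully connected layer has one weight per (input, unit) pair plus one bias per unit. The short check below recomputes them from the two widths shown in the summary. The helper name `dense_params` is ours.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


vocab_size = 3732   # output width reported for dense_2 in the summary
hidden_size = 300   # output width reported for dense_1 in the summary

hidden = dense_params(vocab_size, hidden_size)
output = dense_params(hidden_size, vocab_size)
print(hidden, output, hidden + output)
```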

In [7]:
def context_of_word(word, nb):
    '''
    Prints the nb words with the highest predicted context probability
    for a word, in vocabulary order
    '''
    if word not in build_context.vocab:
        print('No such word in vocabulary')
        return

    index = build_context.vocab[word]
    x_test = index_to_onehot(index, build_context.count)
    y_test = model.predict(x_test, verbose=1)
    # Indices of the nb largest probabilities, sorted by vocabulary index
    closest_words = np.sort(np.argpartition(y_test[0], -nb)[-nb:])

    index_to_word = {v: k for k, v in build_context.vocab.items()}
    for i in closest_words:
        print('closest word in context is: ', index_to_word[int(i)])

context_of_word('how', 4)

closest word in context is:  a
closest word in context is:  and
closest word in context is:  you
closest word in context is:  I
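
The words above are listed in vocabulary order, not by probability. One possible variant, sketched here in plain Python, ranks them from most to least probable. The name `top_k_words` is hypothetical, and the vocabulary and probabilities are made up for illustration; they are not model output.

```python
import heapq


def top_k_words(probabilities, vocab, k):
    """Return the k (word, probability) pairs with the highest probability,
    most probable first."""
    index_to_word = {index: word for word, index in vocab.items()}
    best = heapq.nlargest(k, range(len(probabilities)), key=probabilities.__getitem__)
    return [(index_to_word[i], probabilities[i]) for i in best]


# Made-up vocabulary and model output, for illustration only
toy_vocab = {"how": 0, "are": 1, "you": 2, "today": 3}
toy_probabilities = [0.05, 0.55, 0.30, 0.10]

print(top_k_words(toy_probabilities, toy_vocab, 2))
```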
