## Introduction to Word2Vec
#### The model

Word2Vec is a simple word embedding neural network with a single hidden layer, implemented here as described in the study of Le & Mikolov (2014).

In [None]:
import seaborn as sb
import numpy as np
import pandas as pd
%matplotlib inline

Word embeddings from plain text.

The model assumes the *Distributional Hypothesis*: words are characterized by the words they hang out with. This idea is used to estimate the probability of two words occurring near each other.
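
As a rough illustration of the idea (not part of the model itself), the sketch below counts how often two words appear within a fixed window of each other in a toy corpus. The helper name `cooccurrence_counts`, the two toy documents and the window size are assumptions made for this example only.

```python
from collections import Counter

def cooccurrence_counts(docs, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            # look only to the right so each pair is counted once per occurrence
            for neighbour in tokens[i + 1:i + 1 + window]:
                counts[tuple(sorted((word, neighbour)))] += 1
    return counts

toy_docs = ['the prince loves the park', 'the princess loves the park']
pair_counts = cooccurrence_counts(toy_docs, window=2)
print(pair_counts[('loves', 'the')])
print(pair_counts[('prince', 'princess')])
```

Words that never share a window, such as *prince* and *princess* here, get a count of zero, while frequent neighbours get high counts. Word2Vec replaces these raw counts with a learned probability.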

## Introduction

In NLP, the most common fixed-length feature representation for text is **bag-of-words**.

However bag-of-words features have two major weaknesses: 
- they lose the ordering of the words
- they also ignore semantics of the words. 
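
The first weakness is easy to see in a small sketch. The helper `bag_of_words` below is a hypothetical name for this example; it simply counts words with `collections.Counter`.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its word counts, discarding word order."""
    return Counter(text.split())

bow_a = bag_of_words('the prince loves the princess')
bow_b = bag_of_words('the princess loves the prince')

# Both sentences contain the same words, so their representations are identical
print(bow_a == bow_b)
```

The two sentences mean different things, yet a bag-of-words model cannot tell them apart.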

In their paper, **Le & Mikolov (2014)** propose a distributed representation of sentences and documents, which they call *Paragraph Vector*.

It's an unsupervised algorithm that learns fixed-length feature representations from **variable-length pieces of text**, such as sentences, paragraphs, and documents. The algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.

Empirical results show that **Paragraph Vectors outperform bag-of-words models** as well as other techniques for text representations. The authors report new **state-of-the-art** results on several **text classification** and **sentiment analysis** tasks.


## Motivation

After training, words with similar meaning are mapped to similar positions in the vector space. The differences between word vectors also carry meaning. For example, the word vectors can be used to answer analogy questions using simple vector algebra: “King” - “man” + “woman” = “Queen”.
These properties of word vectors are useful in many NLP tasks, such as language modelling, language understanding and machine translation.
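
The analogy arithmetic can be sketched with hand-made vectors. The 2-d vectors below are invented for illustration (the axes loosely stand for "royalty" and "gender") and are not trained embeddings. The helper names `cosine` and `analogy` are assumptions of this example.

```python
import numpy as np

# Hand-made toy vectors, NOT trained embeddings
toy_vectors = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy('king', 'man', 'woman', toy_vectors))
```

With trained embeddings the match is only approximate, which is why the nearest neighbour by cosine similarity is used rather than an exact equality test.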

## Language Model

In the Paragraph Vector model, the paragraph vector is concatenated with several word vectors from a paragraph to predict the following word in the given context. Both word vectors and paragraph vectors are trained by stochastic gradient descent and backpropagation.

The proposed neural network language model works as follows:

 - each word is represented by a one-hot vector
 - if we use a multi-word context, then the vectors are averaged
 - this context vector is the input of a neural network, which tries to predict the next word.
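
The three steps above can be sketched as a single forward pass. This is a minimal illustration under stated assumptions: the five-word vocabulary, the randomly initialized matrices `W_in` and `W_out`, and the helper names `one_hot` and `predict_next` are all made up for this example, and no training takes place.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab = ['<s>', 'the', 'prince', 'loves', 'skateboarding']
V, N = len(toy_vocab), 3                         # vocabulary size, hidden layer size
W_in = (rng.random_sample((V, N)) - 0.5) / N     # input weights (untrained)
W_out = (rng.random_sample((N, V)) - 0.5) / V    # output weights (untrained)

def one_hot(word):
    """One-hot vector of length V for a word of the toy vocabulary."""
    x = np.zeros(V)
    x[toy_vocab.index(word)] = 1.0
    return x

def predict_next(context):
    """Forward pass: average one-hot context vectors -> hidden layer -> softmax."""
    x = np.mean([one_hot(w) for w in context], axis=0)   # averaged context vector
    h = x.dot(W_in)                                      # hidden layer, no non-linearity
    scores = h.dot(W_out)
    exp_scores = np.exp(scores - scores.max())           # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

probs = predict_next(['the', 'prince'])
print(dict(zip(toy_vocab, probs.round(3))))
```

Because the weights are untrained, the predicted distribution is close to uniform. Training adjusts `W_in` and `W_out` so that the actual next word receives a high probability.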

### Results

After training, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations.

In the study they extend the model to go beyond word level to achieve phrase-level or sentence-level representations. 

Paragraph Vector is less complex and outperforms other methods that have tried to achieve similar representations, such as **word weighting functions** (which require task-specific tuning) and **parse trees**.

Paragraph Vector takes a general approach. It is capable of constructing representations of input sequences of any length.

## One-Word Context

As mentioned before, every word is mapped to a unique vector, represented by a column in a matrix W. These vectors are then used as features for predicting the next word in a sentence.
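
Mapping a word to its vector is just a table lookup. The sketch below uses a small made-up matrix to show that multiplying a one-hot vector by the weight matrix selects the vector of that word. Note one assumption: the matrix here stores one word per row, as the code in this notebook does, whereas the paper describes columns.

```python
import numpy as np

lookup = np.arange(12, dtype=float).reshape(4, 3)   # toy matrix: 4 words, 3 features per word
word_index = 2

one_hot_vec = np.zeros(4)
one_hot_vec[word_index] = 1.0

# Multiplying by a one-hot vector selects the row that belongs to the word
selected = one_hot_vec.dot(lookup)
print(np.array_equal(selected, lookup[word_index]))
```

This is why the implementations in this notebook index the weight matrix directly instead of building one-hot vectors.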

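Given a sequence of training words $w_1, \dots, w_T$, the objective with a one-word context is to maximize the average log probability of each word given the word before it:

$$\frac{1}{T-1} \sum_{t=2}^{T} \log P(w_t|w_{t-1})$$

In the paper hierarchical softmax is used for faster training, but for simplicity I will just use a regular softmax regression as the multiclass classifier:

$$P(w_t|w_{t-1}) = \frac{\exp(y_{w_t})}{\sum_{i} \exp(y_i)}$$

where $y_i$ is the unnormalized score of output word $i$.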


#### Softmax Regression

Word2Vec is a very simple neural network with a single hidden layer.

In [1]:
sentences = ['<s> the prince loves skateboarding in the park </s>', 
             '<s> the princess loves the prince but the princess hates skateboarding </s>',
             '<s> skateboarding in the park is popular </s>',
             '<s> the prince is popular but the prince hates attention </s>',
             '<s> the princess loves attention but the princess hates the park </s>']

In [None]:
import numpy as np
from collections import OrderedDict, defaultdict

class Word2Vec_1WordContext(object):
    def __init__(self, sentences, learning_rate = 1.0, nodes_HL = 3):
        self.sentences = sentences
        self.N = nodes_HL # number of nodes in Hidden Layer
        self.V = None # Vocabulary size
        self.WI = None
        self.WO = None
        self.vocabulary = None
        self.learning_rate = learning_rate
    
    def Vocabulary(self):
        dictionary = defaultdict()
        # len of dictionary gives a unique integer to each new word
        dictionary.default_factory = lambda: len(dictionary) 
        return dictionary

    def docs2bow(self, docs, dictionary):
        """Transforms a list of strings into a list of lists where 
        each unique item is converted into a unique integer."""
        for doc in docs:
            yield [dictionary[word] for word in doc.split()] # returns a generator
    
    def sentences2bow(self):
        self.vocabulary = self.Vocabulary()
        bow = list(self.docs2bow(self.sentences, self.vocabulary))
        return bow
    
    def random_init(self):
        self.V = len(self.vocabulary)
        
        # random initialization of weights between [-0.5 , 0.5] normalized by number of nodes mapping to.
        self.WI =(np.random.random((self.V, self.N)) - 0.5) / self.N # input weights
        self.WO =(np.random.random((self.N, self.V)) - 0.5) / self.V # output weights
    
    def softmax_regression(self, word, h):
        # posterior probability P(word | context)
        return (np.exp(h.dot(self.WO.T[self.vocabulary[word]])) / 
                sum(np.exp(h.dot(self.WO.T[self.vocabulary[w]])) for w in self.vocabulary))
    
    def backprop(self, context, target):
        # hidden layer: a copy of the input vector of the context word
        h = self.WI[self.vocabulary[context]].copy()
        # all probabilities are computed before any weight changes
        probs = {word: self.softmax_regression(word, h) for word in self.vocabulary}
        EH = np.zeros(self.N) # prediction error propagated back to the hidden layer

        for word in self.vocabulary:
            j = self.vocabulary[word]
            t = 1 if word == target else 0
            err = t - probs[word] # error

            EH += err * self.WO.T[j] # uses the output vector before its update
            # weight update using stochastic gradient descent on E = -log P(target | context);
            # the gradient is -err * h, so the update brings the output vector closer to h
            # if word = target, and pushes it away otherwise.
            self.WO.T[j] += self.learning_rate * err * h

        # update only weights for input word
        self.WI[self.vocabulary[context]] += self.learning_rate * EH
    
    def train(self):

        bow = self.sentences2bow()
        # visualize bag-of-word sentence conversion
        # print bow
        
        self.random_init()
        
        for sentence in self.sentences:
            prev_word = None
            for word in sentence.split():
                if prev_word is not None:
                    target = word
                    context = prev_word
                    self.backprop(context, target)
                prev_word = word
        return self.WI, OrderedDict(sorted(self.vocabulary.items(), key=lambda t: t[1]))

In [None]:
model = Word2Vec_1WordContext(sentences)
W, vocab = model.train()

Using the dot product $W_I \cdot W'^T_O$ we compute an unnormalized similarity score between an input word and an output word (for example *princess* and *hates* from the toy corpus).

Now using softmax regression, we can compute the posterior probability $P(w_O|w_I)$:

$$ P(w_O|w_I) = y_i = \frac{\exp(W_I \cdot W'^T_O)}{\sum^V_{j=1} \exp(W_I \cdot W'^T_j)} $$
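
A vectorized sketch of this formula is shown below. It assumes word vectors are stored as rows of the input matrix and as columns of the output matrix, as in the class above. The function name `posterior`, the random toy matrices and the chosen input index are assumptions for illustration.

```python
import numpy as np

def posterior(W_I, W_O, input_idx):
    """Return P(w_j | w_I) for every output word j as a vector of length V."""
    h = W_I[input_idx]                  # input vector of the context word
    scores = h.dot(W_O)                 # dot product with every output vector
    scores = scores - scores.max()      # stabilises exp() without changing the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.RandomState(1)
toy_WI = (rng.random_sample((14, 3)) - 0.5) / 3     # untrained toy weights, V = 14, N = 3
toy_WO = (rng.random_sample((3, 14)) - 0.5) / 14
p = posterior(toy_WI, toy_WO, input_idx=8)
print(p.round(3))
```

Subtracting the maximum score before exponentiating is a common safeguard against overflow; it cancels out in the ratio, so the probabilities are unchanged.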

### Updating the hidden-to-output layer weights

The loss function to minimize is: $E = -\log P(w_O|w_I)$

The error is computed with $e_j = t_j - P(w_j|w_I)$, where $t_j$ is 1 if $w_j$ is the actual output word, otherwise $t_j$ is 0.

The gradient of $E$ with respect to the hidden-to-output weights of word $w_j$ is $-e_j \cdot h$, where $h$ is a copy of the vector corresponding to the input word (this only holds with a context of a single word). Finally, using stochastic gradient descent with a learning rate $\nu$, we obtain the weight update equation for the hidden-to-output layer weights:

$$W'^{T (t)}_j = W'^{T (t-1)}_j + \nu \cdot e_j \cdot h$$
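
Signs are easy to get wrong in this step, so a finite-difference check is a useful sanity test. The sketch below compares the analytic gradient of $E = -\log P(w_O|w_I)$ with respect to the output weights, $(P(w_j|w_I) - t_j) \cdot h$ (that is, minus the error times $h$), against a numerical estimate. The toy sizes, random values and the helper name `loss` are assumptions for this example.

```python
import numpy as np

def loss(W_O, h, target_idx):
    """E = -log P(w_O | w_I) under a softmax over the columns of W_O."""
    scores = h.dot(W_O)
    return -(scores[target_idx] - np.log(np.exp(scores).sum()))

rng = np.random.RandomState(2)
h = rng.random_sample(3) - 0.5          # hidden vector (copy of the input word vector)
W_O = rng.random_sample((3, 5)) - 0.5   # output weights: N = 3, V = 5
target_idx = 1

# Analytic gradient: dE/dW'_j = (P(w_j | w_I) - t_j) * h
scores = h.dot(W_O)
P = np.exp(scores) / np.exp(scores).sum()
t = np.zeros(5)
t[target_idx] = 1.0
analytic = np.outer(h, P - t)

# Numerical gradient by central differences
numeric = np.zeros_like(W_O)
eps = 1e-6
for i in range(W_O.shape[0]):
    for j in range(W_O.shape[1]):
        W_plus, W_minus = W_O.copy(), W_O.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss(W_plus, h, target_idx) - loss(W_minus, h, target_idx)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
# A small step against the gradient should lower the loss
print(loss(W_O - 0.1 * analytic, h, target_idx) < loss(W_O, h, target_idx))
```

If the two gradients agree and the loss decreases after the step, the update direction is consistent with minimizing $E$.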


### Updating the input-to-hidden layer weights

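Next, we backpropagate the prediction errors to the input-to-hidden weights. Only the vector of the input word is updated, since it is the only one that contributed to $h$.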
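
Spelled out for the one-word context, where the hidden vector $h$ is a copy of the input word vector $W_I$, the chain rule gives the gradient of $E$ with respect to $h$ as a sum over all output words. This is a sketch derived from the loss and softmax defined above, using the same symbols:

$$\frac{\partial E}{\partial h} = \sum_{j=1}^{V} \left( P(w_j|w_I) - t_j \right) \cdot W'^T_j$$

A gradient descent step with learning rate $\nu$ then updates the vector of the input word:

$$W_I^{(t)} = W_I^{(t-1)} - \nu \cdot \frac{\partial E}{\partial h}$$

With a context of several averaged words, each context word vector would receive this step scaled by one over the number of averaged vectors.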

### Multi-word context

In [None]:
# hidden layer for a two-word context: the average of the two input word vectors
h = (WI[vocabulary[context[0]]] + WI[vocabulary[context[1]]]) / 2.0

In [None]:
import numpy as np
from collections import OrderedDict, defaultdict

class Word2Vec_nWordContext(object):
    def __init__(self, sentences, learning_rate = 1.0, context_size = 3, nodes_HL = 3):
        self.sentences = sentences
        self.N = nodes_HL # number of nodes in Hidden Layer
        self.V = None # Vocabulary size
        self.WI = None # input weight matrix
        self.WO = None # output weight matrix
        self.vocabulary = None
        self.learning_rate = learning_rate
        self.context_size = context_size
    
    def Vocabulary(self):
        dictionary = defaultdict()
        # len of dictionary gives a unique integer to each new word
        dictionary.default_factory = lambda: len(dictionary) 
        return dictionary

    def docs2bow(self, docs, dictionary):
        """Transforms a list of strings into a list of lists where 
        each unique item is converted into a unique integer."""
        for doc in docs:
            yield [dictionary[word] for word in doc.split()] # returns a generator
    
    def sentences2bow(self):
        self.vocabulary = self.Vocabulary()
        bow = list(self.docs2bow(self.sentences, self.vocabulary))
        return bow
    
    def random_init(self):
        self.V = len(self.vocabulary)
        
        # random initialization of weights between [-0.5 , 0.5] normalized by number of nodes mapping to.
        self.WI =(np.random.random((self.V, self.N)) - 0.5) / self.N # input weights
        self.WO =(np.random.random((self.N, self.V)) - 0.5) / self.V # output weights
        
    def average_context_vec(self, context):
        # mean of the input vectors of all context words
        return np.mean([self.WI[self.vocabulary[word]] for word in context], axis=0)
    
    def softmax_regression(self, word, h):
        # posterior probability P(word | context)
        return (np.exp(h.dot(self.WO.T[self.vocabulary[word]])) / 
                sum(np.exp(h.dot(self.WO.T[self.vocabulary[w]])) for w in self.vocabulary))
    
    def backprop(self, context, target):
        # hidden layer: average of the context word vectors
        h = self.average_context_vec(context)
        # all probabilities are computed before any weight changes
        probs = {word: self.softmax_regression(word, h) for word in self.vocabulary}
        EH = np.zeros(self.N) # prediction error propagated back to the hidden layer

        for word in self.vocabulary:
            j = self.vocabulary[word]
            t = 1 if word == target else 0
            err = t - probs[word] # error

            EH += err * self.WO.T[j] # uses the output vector before its update
            # weight update using stochastic gradient descent on E = -log P(target | context);
            # the gradient is -err * h, so the update brings the output vector closer to h
            # if word = target, and pushes it away otherwise.
            self.WO.T[j] += self.learning_rate * err * h

        for input_word in context:
            # update only weights for context words; each contributes 1/len(context) to h
            self.WI[self.vocabulary[input_word]] += (1. / len(context)) * self.learning_rate * EH
    
    def train(self):

        bow = self.sentences2bow()
        self.random_init()
        
        for sentence in self.sentences:
            word_tuple =  tuple(sentence.split())
            count = 1
            context = []
            for i, word in enumerate(word_tuple):
                if word != '<s>':
                    target = word
                    if count > self.context_size:
                        context = context[1:]
                    context.append(word_tuple[i-1])
                    self.backprop(context, target)
                    if word == '</s>':
                        for n in range(len(context) - 1, 0, -1):
                            context = context[-n:]
                            self.backprop(context, target)
                    count += 1
                    
        return self.WI, OrderedDict(sorted(self.vocabulary.items(), key=lambda t: t[1]))

In [None]:
word_tuple = tuple(['<s>', 'a', 'b', 'c', 'd', 'e', '</s>'])
print(word_tuple)
context_size = 4
count = 1
context = []

for i, word in enumerate(word_tuple):
    if word != '<s>':
        target = word
        if count > context_size:
            context = context[1:]
        context.append(word_tuple[i-1])         
        print("context: {}, target: {}".format(context, target))
        if word == '</s>':
            for n in range(len(context)-1, 0, -1):
                context = context[-n:]
                print("context: {}, target: {}".format(context, target))
        count += 1

In [None]:
model = Word2Vec_nWordContext(sentences, learning_rate = 1.0, context_size = 4)
W, vocab = model.train()

In [None]:
len(vocab)

In [None]:
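# Note: this overwrites the trained W with random, untrained weights of the same shape (V = 14, N = 3)
W =(np.random.random((14, 3)) - 0.5) / 3.0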

In [None]:
import plotly.graph_objects as go

In [None]:
# 3-d scatter plot of the word vectors, one labelled marker per vocabulary word
trace1 = go.Scatter3d(
    x=W.T[0],
    y=W.T[1],
    z=W.T[2],
    mode='markers+text',
    text=list(vocab.keys()),
    marker=dict(
        size=8,
        line=dict(
            color='rgba(217, 217, 217, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=[trace1], layout=layout)
fig.show()

### Paragraph Vector

In [2]:
import numpy as np
from collections import OrderedDict, defaultdict

class Doc2Vec_nWordContext(object):
    def __init__(self, sentences, learning_rate = 1.0, context_size = 3, nodes_HL = 3):
        self.sentences = sentences
        self.N = nodes_HL # number of nodes in Hidden Layer
        self.V = None # Vocabulary size
        self.P = None # number of paragraph/sentence in text
        self.WI = None # input weight matrix
        self.WO = None # output weight matrix
        self.D = None # paragraph/sentence weight matrix
        self.vocabulary = None
        self.learning_rate = learning_rate
        self.context_size = context_size # number of words in context vector
    
    def Vocabulary(self):
        """ Instantiates a default dictionary with its 
        length as default factory """
        dictionary = defaultdict()
        # len of dictionary gives a unique integer to each new word
        dictionary.default_factory = lambda: len(dictionary) 
        return dictionary

    def docs2bow(self, docs, dictionary):
        """Transforms a list of strings into a list of lists where 
        each unique item is converted into a unique integer."""
        for doc in docs:
            yield [dictionary[word] for word in doc.split()] # returns a generator
    
    def sentences2bow(self):
        """ Creates the dictionary of the text's vocabulary 
        and returns the text with each words replaced by their unique integer"""
        self.vocabulary = self.Vocabulary()
        bow = list(self.docs2bow(self.sentences, self.vocabulary))
        return bow
    
    def random_init(self):
        """ initializes  weight matrices for neural network """
        self.V = len(self.vocabulary)
        self.P = len(self.sentences)
        
        # random initialization of weights between [-0.5 , 0.5] normalized by number of nodes mapping to.
        self.WI = (np.random.random((self.V, self.N)) - 0.5) / self.N # input weights
        self.WO = (np.random.random((self.N, self.V)) - 0.5) / self.V # output weights
        self.D = (np.random.random((self.P, self.N)) - 0.5) / self.N # paragraph/sentence weights
        
    def average_context_vec(self, context, num):
        """ Takes the average of the context word vectors plus the paragraph/sentence vector
        and returns a new vector for the context"""
        c = len(context)
        context_sum = np.sum([self.WI[self.vocabulary[word]] for word in context], axis=0)
        return (context_sum + self.D[num]) / float(c + 1)
    
    def softmax_regression(self, word, h):
        """ returns posterior probability P(word | context) """
        return (np.exp(h.dot(self.WO.T[self.vocabulary[word]])) / 
                sum(np.exp(h.dot(self.WO.T[self.vocabulary[w]])) for w in self.vocabulary))
    
    def backprop(self, context, target, num):
        """ Computes backpropagation of errors to weight matrices,
        using stochastic gradient descent """
        h = self.average_context_vec(context, num) # context word weight vector
        # posterior probabilities P(word | context), computed before any weight changes
        probs = {word: self.softmax_regression(word, h) for word in self.vocabulary}
        EH = np.zeros(self.N) # prediction error propagated back to the hidden layer

        for word in self.vocabulary:
            j = self.vocabulary[word]
            t = 1 if word == target else 0
            err = t - probs[word] # error

            EH += err * self.WO.T[j] # uses the output vector before its update
            # weight update using stochastic gradient descent on E = -log P(target | context);
            # the gradient is -err * h, so the update brings the output vector closer to h
            # if word = target, and pushes it away otherwise.
            self.WO.T[j] += self.learning_rate * err * h

        # h averages len(context) word vectors and one paragraph/sentence vector
        scale = 1. / (len(context) + 1)
        for input_word in context:
            # update only weights for context words
            self.WI[self.vocabulary[input_word]] += scale * self.learning_rate * EH
        # update the paragraph/sentence vector once per training example
        self.D[num] += scale * self.learning_rate * EH
    
    def train(self):
        """ trains text and returns trained word matrix
        and ordered dictionary of vocabulary"""

        bow = self.sentences2bow()
        self.random_init()
        
        # runs context window across sentence
        # applies window expansion and reduction 
        # at the begin and end of sentence respectively
        for num, sentence in enumerate(self.sentences):
            word_tuple =  tuple(sentence.split())
            count = 1
            context = []
            for i, word in enumerate(word_tuple):
                if word != '<s>':
                    target = word
                    if count > self.context_size:
                        context = context[1:]
                    context.append(word_tuple[i-1])
                    self.backprop(context, target, num)
                    if word == '</s>':
                        for n in range(len(context) - 1, 0, -1):
                            context = context[-n:]
                            self.backprop(context, target, num)
                    count += 1

        return self.WI, OrderedDict(sorted(self.vocabulary.items(), key=lambda t: t[1]))

Backpropagation

In [3]:
model = Doc2Vec_nWordContext(sentences, learning_rate = 1.0, context_size = 2)
W, vocab = model.train()