# Intent Detection: Sum of Word2Vec

The purpose of this Python notebook is to study the use of word vector representations to categorize sentences.  
My idea is to sum the vectors of every word in a sentence, and then compute the similarity between two sentence vectors to test whether the sentences belong to the same category.
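
As a sketch, the idea can be written in symbols. The notation is introduced here for illustration only: $v(w)$ is the vector of word $w$, and $s$ is the vector of a sentence made of the words $w_1, \dots, w_n$.

$$
s = \sum_{i=1}^{n} v(w_i), \qquad \mathrm{sim}(s_1, s_2) = \frac{s_1 \cdot s_2}{\lVert s_1 \rVert \, \lVert s_2 \rVert}
$$

The closer $\mathrm{sim}(s_1, s_2)$ is to 1, the more likely the two sentences are to belong to the same category.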

First, to use vector representations of words, we need to import the [gensim library](https://radimrehurek.com/gensim/index.html).

In [1]:
import gensim

Instead of training a model on a huge data set, we will use pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean, "Distributed Representations of Words and Phrases and their Compositionality", in Proceedings of NIPS, 2013.  
The archive is available here: [GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

In [2]:
# Load the pre-trained vectors (in gensim >= 1.0 this loader lives on KeyedVectors)
model = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

To check that the model loaded correctly, compute the similarity between the words 'woman' and 'man':

In [3]:
model.similarity('woman', 'man')

0.76640122309953518

To get the similarity between two vectors, we need a function that computes the cosine similarity (cf. https://en.wikipedia.org/wiki/Cosine_similarity).

In [4]:
import math

def cosine_similarity(v1, v2):
    """Compute the cosine similarity of v1 and v2: (v1 dot v2) / (||v1|| * ||v2||)."""
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]
        y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)
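
Since the word vectors are NumPy arrays, the same quantity can also be computed with vectorized operations instead of an explicit loop. Here is a minimal sketch, assuming NumPy is installed; the name `cosine_similarity_np` is hypothetical and chosen only to avoid clashing with the function above.

```python
import numpy as np

def cosine_similarity_np(v1, v2):
    """Cosine similarity of v1 and v2: (v1 . v2) / (||v1|| * ||v2||)."""
    # Convert to float64 arrays so that lists and float32 vectors are both accepted.
    v1 = np.asarray(v1, dtype=np.float64)
    v2 = np.asarray(v2, dtype=np.float64)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Two vectors at a 45 degree angle: the cosine is 1/sqrt(2).
print(cosine_similarity_np([1.0, 0.0], [1.0, 1.0]))  # about 0.7071
```

Both versions raise a division error or return `nan` for a zero vector, so the inputs are assumed to be non-zero.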

Let's test this function in comparison with model.similarity():

In [5]:
model.similarity('woman', 'man') - cosine_similarity(model['woman'], model['man'])

-1.1102230246251565e-16

The results are not exactly the same: they differ by about 1.1e-16, which is at the level of double-precision rounding error and negligible for our purposes.
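
A gap of this size is what floating-point rounding typically produces. The sketch below uses only the standard library. It checks that the difference printed above equals 2**-53, the spacing between adjacent double-precision numbers in the interval [0.5, 1), and shows that changing the order of additions alone can change the last digit of a result. It illustrates rounding in general and is not a trace of what gensim computes internally.

```python
import sys

# Difference reported above between model.similarity() and cosine_similarity().
diff = -1.1102230246251565e-16

# Doubles in [0.5, 1) are spaced 2**-53 apart, so the two results
# are adjacent representable numbers.
print(abs(diff) == 2.0 ** -53)                  # True
print(abs(diff) == sys.float_info.epsilon / 2)  # True

# The same terms summed in a different order can round differently.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)           # False
```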

Next, let's compare three sentences to find out whether they belong to the same category:
1. I'm starving
2. I want to eat pizza
3. What are the news today?

First, we need to sum the vectors of each word to get a vector representing each sentence.

In [6]:
sentence1 = model['I']+model['m']+model['starving']


*Note:* 'to' is a stopword, so it is left out of the sum for the second sentence.

In [7]:
sentence2 = model['I']+model['want']+model['eat']+model['pizza']
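
The manual sums above can be wrapped in a small helper that skips stopwords. This is a minimal sketch, assuming NumPy is installed. The names `sentence_vector`, `STOPWORDS` and `toy_vectors` are hypothetical, the stopword list is deliberately tiny, and a plain dictionary of 3-dimensional vectors stands in for the loaded model.

```python
import numpy as np

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"to", "a", "of", "and"}

def sentence_vector(words, vectors, stopwords=STOPWORDS):
    """Sum the vectors of all words that are not stopwords and are known to `vectors`."""
    kept = [vectors[w] for w in words if w not in stopwords and w in vectors]
    if not kept:
        raise ValueError("no known words in sentence")
    return np.sum(kept, axis=0)

# Toy 3-dimensional vectors standing in for the pre-trained model.
toy_vectors = {
    "I": np.array([1.0, 0.0, 0.0]),
    "want": np.array([0.0, 1.0, 0.0]),
    "eat": np.array([0.0, 0.0, 1.0]),
    "pizza": np.array([0.0, 1.0, 1.0]),
}

print(sentence_vector(["I", "want", "to", "eat", "pizza"], toy_vectors))  # [1. 2. 2.]
```

Passing the loaded `model` instead of `toy_vectors` should give the same vector as the manual sum, assuming the model object supports `in` lookups and indexing by word, as gensim's word-vector objects do.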

Then, let's compute the similarity between "I'm starving" and "I want to eat pizza".

In [8]:
cosine_similarity(sentence1, sentence2)

0.54241427587268343

One last sentence:

In [9]:
sentence3 = model['What']+model['are']+model['the']+model['news']+model['today']

Compute the similarity between the last sentence and the two previous ones:

In [10]:
cosine_similarity(sentence1, sentence3)

0.315094763327604

In [11]:
cosine_similarity(sentence2, sentence3)

0.3448005429570985

**Conclusion:** "I'm starving" is closer to "I want to eat pizza" (similarity 0.54) than to "What are the news today?" (similarity 0.32).
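
The comparison above suggests a simple way to assign a category to a new sentence: compare its vector with one reference vector per category and keep the best match. The following is a minimal sketch of that idea, using only the standard library and toy 2-dimensional vectors in place of real sentence vectors. `closest_category` and the `references` dictionary are hypothetical names, and this is only one possible way to use the similarity scores.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of v1 and v2: (v1 . v2) / (||v1|| * ||v2||)."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2)

def closest_category(query, references):
    """Return the label in `references` whose vector is most similar to `query`."""
    return max(references, key=lambda label: cosine_similarity(query, references[label]))

# Toy 2-dimensional vectors; real sentence vectors would have 300 dimensions.
references = {
    "food": [0.9, 0.1],
    "news": [0.1, 0.9],
}

print(closest_category([0.8, 0.3], references))  # food
```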