## Word vectors and sentence vectors

Natively fasttext only computes sentence vectors during supervised learning.

Depending on the task, simply average word embeddings of all words in the sentence should suffc. (If doing so, you should normalize the word vectors first, so that they all have a norm equal to one.)

According to Kenter et al. 2016, this approach "has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks.

However, according to Le and Mikolov, this method performs poorly for sentiment analysis tasks and/or long texts, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm".

In fasttext unsupervised case they average the normalized word embeddings (not sure what you mean by element-wise normalization, but they use plain L2 normalization of the vector as you can see in Vector::norm() of fastText/src/vector.cc). https://github.com/facebookresearch/fastText/blob/d647be03243d2b83d0b4659a9dbfb01e1d1e1bf7/src/vector.cc#L28

So what you can do is that take the word vectors. COmpute the l2 normalisation and find the documents vector for this case. Please keep in mind that I do not really recommend this method right now because of unavailable benchmarks right now so use this method right now at your own discretion. The aim is to show you how such methods can be coded and hence with experience you will be able to create and implement your own methods.

In [1]:
import numpy as np
from gensim.models.fasttext import FastText
from numpy.linalg import norm
from scipy.spatial.distance import cosine
import numpy as np
from sklearn import preprocessing

In [2]:
def sentence_vector(sentence, ft_model):
    sentence = 'night is black'
    sentence = sentence.lower().split()
    if len(sentence) == 1:
        return ft_model[sentence[0]]
    vecs = [ft_model.wv[x] for x in sentence]
    X = np.asarray(vecs, dtype=np.float) # Float is needed.
    X_normalized = preprocessing.normalize(X, norm='l2') # l2-normalize the samples (rows). 
    return np.mean(X_normalized, axis=0)

Download if file not present

In [3]:
%%bash
wget -nc https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.zip
unzip -o wiki.simple.zip

Archive:  wiki.simple.zip
  inflating: wiki.simple.vec         
  inflating: wiki.simple.bin         


File 'wiki.simple.zip' already there; not retrieving.



Load the fasttext model and compute the sentence vectors.

In [4]:
print("Loading word2vec model ...\n")
modelpath = "wiki.simple.bin"
ft_model = FastText.load_fasttext_format(modelpath)
pattern_1 = 'founder and ceo'
pattern_2 = 'co-founder and former chairman'

p1 = sentence_vector(pattern_1, ft_model)
p2 = sentence_vector(pattern_2, ft_model)
print ("\nSUM")
print ("dot(vec1,vec2)", np.dot(p1,p2))
print ("norm(p1)", norm(p1))
print ("norm(p2)", norm(p2))
print ("dot((norm)vec1,norm(vec2))", np.dot(norm(p1),norm(p2)))
print ("cosine(vec1,vec2)", np.divide(np.dot(p1,p2),np.dot(norm(p1),norm(p2))))

Loading word2vec model ...


SUM
dot(vec1,vec2) 0.45278367773655714
norm(p1) 0.672892025317998
norm(p2) 0.672892025317998
dot((norm)vec1,norm(vec2)) 0.4527836777365572
cosine(vec1,vec2) 0.9999999999999999


In [5]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

p1 = sentence_vector(sentence_obama, ft_model)
p2 = sentence_vector(sentence_president, ft_model)
print ("\nSUM")
print ("dot(vec1,vec2)", np.dot(p1,p2))
print ("norm(p1)", norm(p1))
print ("norm(p2)", norm(p2))
print ("dot((norm)vec1,norm(vec2))", np.dot(norm(p1),norm(p2)))
print ("cosine(vec1,vec2)",     np.divide(np.dot(p1,p2),np.dot(norm(p1),norm(p2))))


SUM
dot(vec1,vec2) 0.45278367773655714
norm(p1) 0.672892025317998
norm(p2) 0.672892025317998
dot((norm)vec1,norm(vec2)) 0.4527836777365572
cosine(vec1,vec2) 0.9999999999999999


In [6]:
# Word Movers distance
sentence_obama = 'founder and ceo'.lower().split()
sentence_president = 'co-founder and former chairman'.lower().split()

# Remove their stopwords.
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stopwords]
sentence_president = [w for w in sentence_president if w not in stopwords]

# Compute WMD.
distance = ft_model.wv.wmdistance(sentence_obama, sentence_president)
distance

3.9839733449425827

The cosing to be 1 means that the angle between them is 0. which means have the same meaning.

In [7]:
# Word Movers distance
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

# Remove their stopwords.
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stopwords]
sentence_president = [w for w in sentence_president if w not in stopwords]

# Compute WMD.
distance = ft_model.wv.wmdistance(sentence_obama, sentence_president)
print(distance)

4.969142709901333


As a final note the supervised case takes the vector of EOS : </s> also as a bias. That is not done here. If you want you can include that in your analysis.