# Semantics and Word Vectors with spaCy

In [1]:
import spacy

nlp = spacy.load('en_core_web_lg')

The large spaCy model (`en_core_web_lg`) already includes a vocabulary of 300-dimensional word vectors that we can use right away.

For example, let's explore the word 'lion':

In [2]:
nlp(u'lion').vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

In [3]:
nlp(u'lion').vector.shape

(300,)

As we can see, 'lion' is represented by a vector with 300 components (a one-dimensional array of shape `(300,)`). This kind of representation is called a word vector, or word embedding; Word2Vec is one well-known family of algorithms for training such vectors.

Word vectors are not the only thing spaCy's models provide. A document also has a vector ('Document to Vector'): by default, it is the average of the vectors of all the tokens in that document.

In [4]:
nlp(u'The quick brown fox jumped').vector.shape

(300,)
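
To make the averaging concrete, here is a minimal sketch using invented 3-dimensional toy vectors instead of the real 300-dimensional ones. The names `toy_vectors` and `average_vector` are hypothetical helpers for illustration; this is not how spaCy implements it internally.

```python
# Toy 3-dimensional "word vectors"; the values are invented for illustration.
toy_vectors = {
    "quick": [1.0, 0.0, 2.0],
    "brown": [0.0, 2.0, 2.0],
    "fox":   [2.0, 4.0, 2.0],
}

def average_vector(words, vectors):
    """Return the component-wise mean of the vectors for the given words."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

doc_vector = average_vector(["quick", "brown", "fox"], toy_vectors)
print(doc_vector)       # [1.0, 2.0, 2.0]
print(len(doc_vector))  # 3: same dimensionality as each word vector
```

The averaged vector has the same number of components as each word vector, which is why the document vector above also has shape `(300,)`.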

## Checking the token similarity

Thanks to the pre-calculated vectors, we can measure the similarity of words (cosine similarity) in a given document.
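
As a reminder of what cosine similarity measures, here is a minimal pure-Python sketch. The function name `cosine_sim` is a hypothetical helper, and the sketch assumes neither vector is all zeros (otherwise the division fails).

```python
import math

def cosine_sim(v1, v2):
    """Cosine of the angle between v1 and v2: dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Only the direction of the vectors matters, not their length: scaling a vector by a positive factor leaves the score unchanged.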

In [5]:
tokens = nlp('lion cat pet')

For example, we can expect a semantic relationship between a lion and a cat, since both are felines, and also between a cat and a pet, since a cat is usually a pet. Now let's check this assumption against the corresponding vector for each word.

In [9]:
def print_similarity(tokens):
    for token in tokens:
        sim = {t:token.similarity(t) for t in tokens}
        print(f'Similarities with word {token}:\n{sim}\n')

print_similarity(tokens)

Similarities with word lion:
{lion: 1.0, cat: 0.5265437, pet: 0.39923772}

Similarities with word cat:
{lion: 0.5265437, cat: 1.0, pet: 0.7505456}

Similarities with word pet:
{lion: 0.39923772, cat: 0.7505456, pet: 1.0}



Naturally, each word has a similarity of 1.0 with itself. The closer the score is to 1, the more similar the two words are. Notice that 'lion' has a similarity of about 0.53 with 'cat' but a lower similarity (about 0.40) with 'pet'. By contrast, 'cat' has a similarity of about 0.75 with 'pet'.

So the similarity between vectors can indeed reveal the semantic relationship between the words they represent.

However, some words have clearly different meanings yet are commonly used in similar contexts, so their word vectors can still end up similar to each other.

Let's take a look at another set of words.

In [10]:
tokens = nlp('like love hate')

We know that 'love' and 'hate' are antonyms, i.e., words with opposite meanings. But since they are often used in the same contexts, their vectors tend to be similar. This means that word vectors are good at capturing the contexts in which words appear, but not, on their own, at capturing the exact meaning of the words.

In [11]:
print_similarity(tokens)

Similarities with word like:
{like: 1.0, love: 0.65790397, hate: 0.6574652}

Similarities with word love:
{like: 0.65790397, love: 1.0, hate: 0.6393099}

Similarities with word hate:
{like: 0.6574652, love: 0.6393099, hate: 1.0}



Inspecting spaCy's vocabulary, we can see the number of vectors and their dimensions.

In [12]:
nlp.vocab.vectors.shape

(684831, 300)

We have a total of $684,831$ word vectors with $300$ dimensions each.

A word that is not present in the vocabulary won't have a vector: `has_vector` is `False`, its vector norm is 0, and `is_oov` (out of vocabulary) is `True`.

In [13]:
tokens = nlp('dog cat nargle')
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
nargle False 0.0 True
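
The behaviour shown above can be mimicked with a small lookup table. This is only an illustrative sketch, not spaCy's implementation: `toy_table`, `DIMENSIONS`, and `lookup` are hypothetical names, and the vectors are invented.

```python
import math

# Hypothetical miniature vector table; the values are invented.
toy_table = {"dog": [3.0, 4.0], "cat": [6.0, 8.0]}
DIMENSIONS = 2

def lookup(word):
    """Return (vector, has_vector, norm, is_oov) for a word."""
    has_vector = word in toy_table
    # Unknown words fall back to an all-zero vector, so their norm is 0.
    vector = toy_table.get(word, [0.0] * DIMENSIONS)
    norm = math.sqrt(sum(x * x for x in vector))
    return vector, has_vector, norm, not has_vector

for word in ["dog", "cat", "nargle"]:
    _, has_vector, norm, is_oov = lookup(word)
    print(word, has_vector, norm, is_oov)
# dog True 5.0 False
# cat True 10.0 False
# nargle False 0.0 True
```

Because cosine similarity divides by the vector norms, a zero-norm vector has no meaningful similarity score, so it is worth checking `has_vector` before comparing words.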


## Calculating a new vector
We can also apply linear algebra operations to existing word vectors to compute a new vector, and then look for the words whose vectors are closest to it.

In [14]:
from scipy import spatial

def cosine_similarity(v1, v2):
    # SciPy returns the cosine *distance*, so the similarity is 1 minus that.
    return 1 - spatial.distance.cosine(v1, v2)

In [15]:
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

Now we can perform simple operations, namely addition and subtraction of vectors, and see where that leaves us.

In [16]:
# For king, we subtract the 'man' features and add the 'woman' features.
# We expect the resulting vector to be similar to queen, princess, highness, etc.
nv = king - man + woman

In [37]:
computed_similarities = {word.text:cosine_similarity(nv,word.vector) 
                         for word in nlp.vocab if word.has_vector and word.is_lower and word.is_alpha
                        }


In [38]:
computed_similarities = sorted(computed_similarities.items(), key=lambda kv: kv[1], reverse=True)

In [39]:
computed_similarities[:10]

[('king', 0.8024259805679321),
 ('queen', 0.7880843877792358),
 ('prince', 0.6401076912879944),
 ('kings', 0.6208544373512268),
 ('princess', 0.6125636100769043),
 ('royal', 0.5800970792770386),
 ('throne', 0.5787012577056885),
 ('queens', 0.5743793845176697),
 ('monarch', 0.563362181186676),
 ('kingdom', 0.5520980954170227)]

Apart from 'king' itself, the closest word is 'queen', with a similarity of about 0.79, which is exactly the word we were aiming for.
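
The same arithmetic and ranking can be reproduced on a handful of hand-made toy vectors, which makes it easier to see why the result lands on 'queen'. This is a sketch under stated assumptions: the three dimensions (royalty, male, female) and all values are invented, and `cosine_sim` and `toy` are hypothetical names. Since 'king' itself ranked first in the output above, this sketch also leaves the three input words out of the ranking, which is one possible design choice.

```python
import math

def cosine_sim(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms

# Hand-made vectors with dimensions (royalty, male, female); not real embeddings.
toy = {
    "king":     [1.0, 1.0, 0.0],
    "queen":    [1.0, 0.0, 1.0],
    "prince":   [0.8, 1.0, 0.0],
    "princess": [0.8, 0.0, 1.0],
    "man":      [0.0, 1.0, 0.0],
    "woman":    [0.0, 0.0, 1.0],
}

# king - man + woman, computed component by component.
new_vector = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the remaining words by similarity, skipping the three input words.
inputs = {"king", "man", "woman"}
ranked = sorted(
    ((word, cosine_sim(new_vector, vec)) for word, vec in toy.items() if word not in inputs),
    key=lambda kv: kv[1],
    reverse=True,
)
print([word for word, _ in ranked])  # ['queen', 'princess', 'prince']
```

Subtracting 'man' removes the male component of 'king' and adding 'woman' introduces the female one, so the new vector points in the same direction as the toy 'queen' vector.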