![](./resources/images/header_logo.png)

# Chapter 3 Understanding word vectors

This section looks at how computers make sense of plain text. We already know that we use a language model to analyse text. But what features does a model need in order to capture the meaning of particular words, or even whole sentences?

The answer is math, namely vectors.

Each word is represented by a vector of numbers. We call these **word vectors** or **word embeddings**.

Let's see some real examples. We will continue to use the spaCy `en_core_web_md` model.

In [57]:
# Imports
import spacy
nlp = spacy.load("en_core_web_md")

In [58]:
# Check the vector behind the word "cat"
nlp.vocab["cat"].vector

array([-0.15067  , -0.024468 , -0.23368  , -0.23378  , -0.18382  ,
        0.32711  , -0.22084  , -0.28777  ,  0.12759  ,  1.1656   ,
       -0.64163  , -0.098455 , -0.62397  ,  0.010431 , -0.25653  ,
        0.31799  ,  0.037779 ,  1.1904   , -0.17714  , -0.2595   ,
       -0.31461  ,  0.038825 , -0.15713  , -0.13484  ,  0.36936  ,
       -0.30562  , -0.40619  , -0.38965  ,  0.3686   ,  0.013963 ,
       -0.6895   ,  0.004066 , -0.1367   ,  0.32564  ,  0.24688  ,
       -0.14011  ,  0.53889  , -0.80441  , -0.1777   , -0.12922  ,
        0.16303  ,  0.14917  , -0.068429 , -0.33922  ,  0.18495  ,
       -0.082544 , -0.46892  ,  0.39581  , -0.13742  , -0.35132  ,
        0.22223  , -0.144    , -0.048287 ,  0.3379   , -0.31916  ,
        0.20526  ,  0.098624 , -0.23877  ,  0.045338 ,  0.43941  ,
        0.030385 , -0.013821 , -0.093273 , -0.18178  ,  0.19438  ,
       -0.3782   ,  0.70144  ,  0.16236  ,  0.0059111,  0.024898 ,
       -0.13613  , -0.11425  , -0.31598  , -0.14209  ,  0.0281

In [59]:
# How many numbers represent the word vector for "cat"?
print(len(nlp.vocab["cat"].vector))

300


Yes, 300 numbers is a lot for a single word, but each of them, alone or in combination with others, carries some information about the cat, such as "it's an animal", "it's similar to a lion" or "it drinks milk". The bigger the dataset we train on, the more information the vectors can capture, and the more purposes we can use them for.

# Cosine similarity

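Let's explore the idea of similarity between two vectors of equal length in a 2-D plane. In the interactive plot below you can move the start point of each arrow and change its angle. How similar the two arrows are depends only on the angle between them, not on where they start.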

In [60]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, BoundedIntText, IntSlider
from random import randrange
from math import sin, cos, radians


def make_coordinate_box(label):
    # Integer input for one start coordinate of an arrow, with a random initial value
    return BoundedIntText(
        value=randrange(0, 100),
        min=-100,
        max=100,
        step=5,
        description=label,
    )


def make_angle_slider(label):
    # Slider for the direction of an arrow, in degrees
    return IntSlider(
        value=0,
        min=0,
        max=360,
        step=1,
        description=label,
        continuous_update=False,
    )


@interact(
    x1=make_coordinate_box('x1:'),
    y1=make_coordinate_box('y1:'),
    x2=make_coordinate_box('x2:'),
    y2=make_coordinate_box('y2:'),
    angle_1=make_angle_slider('angle 1:'),
    angle_2=make_angle_slider('angle 2:'),
)
def plot(x1, y1, x2, y2, angle_1, angle_2):
    # Both arrows share the same fixed length; only start point and direction vary
    length = 50
    dx1 = cos(radians(angle_1)) * length
    dy1 = sin(radians(angle_1)) * length
    dx2 = cos(radians(angle_2)) * length
    dy2 = sin(radians(angle_2)) * length
    plt.figure(figsize=(8, 8), dpi=100)
    ax = plt.axes()
    # First arrow in black, second arrow in blue
    ax.arrow(x1, y1, dx1, dy1, head_width=5, head_length=2, fc='black', ec='black')
    ax.arrow(x2, y2, dx2, dy2, head_width=5, head_length=2, fc='blue', ec='blue')
    plt.xlim(-100, 100)
    plt.ylim(-100, 100)
    plt.show()

interactive(children=(BoundedIntText(value=98, description='x1:', min=-100, step=5), BoundedIntText(value=9,…
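To attach a number to what the plot shows, here is a minimal sketch that computes the cosine of the angle between the two arrows from the same angle and length values the plot uses. The helper name `cosine_from_angles` is made up for this illustration and is not part of the notebook. It uses the standard formula cos(θ) = (a · b) / (|a| |b|). Note that the start points of the arrows do not appear anywhere: only the directions matter.

```python
from math import cos, sin, radians, hypot


def cosine_from_angles(angle_1, angle_2, length=50):
    """Cosine similarity of two 2-D arrows given by their angles in degrees.

    Hypothetical helper for illustration only.
    """
    # Direction components, built the same way as dx/dy in the plot
    dx1, dy1 = cos(radians(angle_1)) * length, sin(radians(angle_1)) * length
    dx2, dy2 = cos(radians(angle_2)) * length, sin(radians(angle_2)) * length
    # Dot product divided by the product of the two lengths
    dot = dx1 * dx2 + dy1 * dy2
    return dot / (hypot(dx1, dy1) * hypot(dx2, dy2))


# Same direction, right angle, opposite direction, and 45 degrees apart
for a1, a2 in [(0, 0), (0, 90), (0, 180), (30, 75)]:
    print(a1, a2, round(cosine_from_angles(a1, a2), 3))
```

The result is close to 1 when the arrows point the same way, close to 0 when they are perpendicular, and close to -1 when they point in opposite directions. Changing `length` does not change the result, because the lengths cancel out in the division.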

The same idea and formula extend to multi-dimensional vectors, including our 300-dimensional word vectors.
Cosine similarity gives real insight into how similar two vectors are, but it is not a great scoring system.
That is why we usually use **similarity distance** (also known as cosine distance) instead of the raw cosine.

distance = 1 - cos(v1, v2)
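As a rough sketch of how this distance could be computed for vectors of any dimension, the snippet below uses plain Python lists so that every step is visible. The function names `cosine_similarity` and `cosine_distance` are chosen for this example only, and the sketch assumes neither vector is all zeros (otherwise the division would fail).

```python
from math import sqrt


def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors of the same dimension."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm_1 = sqrt(sum(a * a for a in v1))
    norm_2 = sqrt(sum(b * b for b in v2))
    # Assumption: both norms are non-zero
    return dot / (norm_1 * norm_2)


def cosine_distance(v1, v2):
    """distance = 1 - cos(v1, v2), as in the formula above."""
    return 1 - cosine_similarity(v1, v2)


v = [1.0, 2.0, 3.0, 4.0]
print(cosine_distance(v, v))                        # same direction
print(cosine_distance(v, [2 * x for x in v]))       # same direction, twice as long
print(cosine_distance([1, 0, 0, 0], [0, 1, 0, 0]))  # perpendicular
print(cosine_distance(v, [-x for x in v]))          # opposite direction
```

With this definition the distance is approximately 0 for vectors pointing the same way, 1 for perpendicular vectors and 2 for opposite vectors, so smaller values mean "more similar".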

## Calculating similarity between two words

Now that we know how vector similarity works, we can use it to calculate similarity between two words.

In [61]:
# Compare each word with the next one in the list:
# cat vs. lion, lion vs. pet, pet vs. milk
cat = nlp.vocab["cat"]
lion = nlp.vocab["lion"]
pet = nlp.vocab["pet"]
milk = nlp.vocab["milk"]
words = [cat, lion, pet, milk]
# Stop one element early so that words[i + 1] never runs past the end of the list
for i in range(len(words) - 1):
    print(words[i].similarity(words[i + 1]))

0.5265437960624695
0.39923766255378723
0.27374202013015747

# Exercise 1

Find how similar the following pairs of words are:
* "brother" and "sister"
* "strong" and "weak"
* "cactus" and "piano"

In [None]:
# Write your code here

In [62]:
answer_a = ""
answer_b = ""
answer_c = ""

In [63]:
from answers import check
check(answer_a)
check(answer_b)
check(answer_c)

As you can see from this exercise, antonyms (words with opposite meanings) can also be similar, because they share a common semantic feature. For example, "short" and "tall" both relate to the category "height". The same holds for many other groups of words.

## Word Vector Arithmetic

Because words are stored as vectors, we can perform ordinary vector operations on them, such as adding two vectors or multiplying a vector by a scalar.
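To spell out what these operations do, here is a small sketch with made-up 3-dimensional vectors written as plain Python lists. The helper names `add` and `scale` are hypothetical and exist only for this example. The `.vector` attribute in spaCy is a NumPy array, so there the same element-wise operations are written directly with `+` and `*`.

```python
def add(v1, v2):
    """Element-wise sum of two vectors of the same length."""
    return [a + b for a, b in zip(v1, v2)]


def scale(v, factor):
    """Multiply every component of a vector by a scalar."""
    return [factor * a for a in v]


# Toy vectors, not real word embeddings
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

print(add(a, b))              # [5.0, 7.0, 9.0]
print(scale(a, 2))            # [2.0, 4.0, 6.0]
print(scale(add(a, b), 0.5))  # [2.5, 3.5, 4.5], the average of a and b
```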

Have you ever wondered what you get if you combine "yellow" and "fruit"? Perhaps "banana". Let's see!

In [65]:
new_vector = nlp.vocab["yellow"].vector + nlp.vocab["fruit"].vector
# Evaluate the variable on the last line so the notebook displays the result
new_vector

array([-6.47009969e-01,  5.60670018e-01,  9.13359970e-02, -1.23790002e+00,
        3.25237423e-01,  4.32835996e-01, -3.67932022e-01,  3.93610001e-01,
       -7.75529981e-01,  2.20079994e+00, -7.61919975e-01, -6.41800016e-02,
       -1.08894992e+00, -3.79496992e-01,  2.77195007e-01, -2.71789998e-01,
       -1.12000108e-02,  3.48930001e+00,  4.71540034e-01, -1.09525001e+00,
        8.95690024e-01, -3.02999973e-01, -2.49027997e-01,  4.59549993e-01,
        2.39600003e-01, -8.05019975e-01,  3.25670004e-01, -1.23650014e-01,
       -2.48000026e-03, -7.25495994e-01, -6.18510008e-01, -6.85140043e-02,
        6.19000018e-01, -5.08849978e-01, -5.03639996e-01,  7.26449966e-01,
        1.23558998e-01, -2.19999999e-01, -4.23379987e-01,  6.70194983e-01,
        3.83930027e-01,  1.36059999e-01,  5.42799756e-03, -3.93188000e-01,
        1.29553008e+00,  1.77930027e-01, -5.25689960e-01,  3.96849990e-01,
        4.20414001e-01, -2.55299807e-02, -6.96550012e-01, -2.41649985e-01,
        5.99190071e-02, -