![](./resources/images/header_logo.png)

# Chapter 3 Understanding word vectors

This chapter looks at how computers understand plain text.

We already know that we use a language model to analyse text. But what does a machine need in order to understand the meaning behind particular words or even sentences?

The answer is maths, namely vectors.

## Word vectors
Each word is usually represented as a **vector**, that is, an ordered list of numbers. We call these **word vectors** or **word embeddings**.

Let's see some real examples. We will continue to use the spaCy `en_core_web_md` model.

In [1]:
import spacy
nlp = spacy.load("en_core_web_md")

Let's take the word `cat` from our vocabulary and check how it is represented as a vector.

In [2]:
# Check the vector behind the word "cat"
nlp.vocab["cat"].vector

array([-0.15067  , -0.024468 , -0.23368  , -0.23378  , -0.18382  ,
        0.32711  , -0.22084  , -0.28777  ,  0.12759  ,  1.1656   ,
       -0.64163  , -0.098455 , -0.62397  ,  0.010431 , -0.25653  ,
        0.31799  ,  0.037779 ,  1.1904   , -0.17714  , -0.2595   ,
       -0.31461  ,  0.038825 , -0.15713  , -0.13484  ,  0.36936  ,
       -0.30562  , -0.40619  , -0.38965  ,  0.3686   ,  0.013963 ,
       -0.6895   ,  0.004066 , -0.1367   ,  0.32564  ,  0.24688  ,
       -0.14011  ,  0.53889  , -0.80441  , -0.1777   , -0.12922  ,
        0.16303  ,  0.14917  , -0.068429 , -0.33922  ,  0.18495  ,
       -0.082544 , -0.46892  ,  0.39581  , -0.13742  , -0.35132  ,
        0.22223  , -0.144    , -0.048287 ,  0.3379   , -0.31916  ,
        0.20526  ,  0.098624 , -0.23877  ,  0.045338 ,  0.43941  ,
        0.030385 , -0.013821 , -0.093273 , -0.18178  ,  0.19438  ,
       -0.3782   ,  0.70144  ,  0.16236  ,  0.0059111,  0.024898 ,
       -0.13613  , -0.11425  , -0.31598  , -0.14209  ,  0.0281

In [3]:
# How many numbers are used to represent the word vector for "cat"?
print(len(nlp.vocab["cat"].vector))

300


Surprised? `300` is a lot of numbers to represent a single word.

But these numbers, individually or in groups, carry information about the meaning of `cat`. For example, some of them may encode something like "it's an animal", "it's similar to a lion" or "it drinks milk".

**Can we, as humans, see these relations in a vector?**

Of course **not**. We do not perceive objects as series of numbers the way machines do. It is too hard to infer anything just by looking at these vectors.

**Can a computer really know all things about a `cat`?**

**No, even people don't know it all**. Each word representation holds a limited amount of information, bounded by the size of the vector (its dimensionality) and by the constraints of the training process. Usually, the bigger the dataset we train on, the more information the model can capture and make available for various purposes. Training such models requires enormous amounts of data, storage and computing power.

### (optional) How are word vectors created?

One of the best-known techniques for creating word vectors is called **word2vec**. It takes text as input and outputs a set of vectors, one for each word in that text. How does it work?

**Word2vec** uses the context of individual words to create their numerical representations. It trains each word against the other words in its neighbourhood (context). One of two methods can be used here:
- **continuous bag of words (CBOW)** uses context to predict a target word
- **skip-gram** uses a word to predict a target context
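
To make the difference between the two methods concrete, here is a minimal sketch of how training examples could be built from a sentence. It only prepares the (input, label) pairs and does not train anything; the helper `training_pairs`, the example sentence and the window size are made up for illustration and are not part of any word2vec library.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


tokens = "the cat drinks warm milk".split()

# CBOW: the context words together are the input, the middle word is the label
cbow_examples = list(training_pairs(tokens, window=1))

# Skip-gram: the middle word is the input, each context word is a separate label
skipgram_examples = [(target, c) for context, target in cbow_examples for c in context]

print(cbow_examples[1])       # (['the', 'drinks'], 'cat')
print(skipgram_examples[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'drinks')]
```

A real implementation would feed pairs like these into a small neural network and keep the learned weights as the word vectors.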

If you're interested in learning more about these approaches, check out https://en.wikipedia.org/wiki/Word2vec

## Measuring word similarity


### Cosine similarity
To understand how we can compare the semantics (meanings) of different words or even sentences, we first have to understand some basic maths behind it. One of the simplest yet effective methods of comparing two vectors is to measure **the angle** between them [1].

[1] Cosine similarity - Statistics for Machine Learning [Book, Image] [Internet]. [cited 2022 Mar 10]. Available from: https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/eb9cd609-e44a-40a2-9c3a-f16fc4f5289a.xhtml

![](./resources/images/cosine_similarity.png)

Let's explore the idea of **cosine similarity** using two vectors of equal length in a 2-D plane.

In [9]:
# Requires matplotlib and ipywidgets to be installed
import matplotlib.pyplot as plt
from ipywidgets import interact, BoundedIntText, IntSlider
from random import randrange
from math import sin, cos, radians


def make_box(label):
    """Integer text box for one coordinate of a vector's starting point."""
    return BoundedIntText(
        value=randrange(0, 100),
        min=-100,
        max=100,
        step=5,
        description=label,
        disabled=False
    )


def make_slider(label):
    """Slider for the direction of a vector, in degrees."""
    return IntSlider(
        value=0,
        min=0,
        max=360,
        step=1,
        description=label,
        disabled=False,
        continuous_update=False,
        orientation='horizontal',
        readout=True,
        readout_format='d'
    )


# The decorator builds the widgets and redraws the plot whenever a value changes
@interact(x1=make_box('x1:'), y1=make_box('y1:'),
          x2=make_box('x2:'), y2=make_box('y2:'),
          angle_1=make_slider('Angle 1:'), angle_2=make_slider('Angle 2:'))
def plot(x1, y1, x2, y2, angle_1, angle_2):
    length = 50  # both arrows have the same length, so only their direction differs
    dx1 = cos(radians(angle_1)) * length
    dy1 = sin(radians(angle_1)) * length
    dx2 = cos(radians(angle_2)) * length
    dy2 = sin(radians(angle_2)) * length
    plt.figure(figsize=(8, 8), dpi=100)
    ax = plt.axes()
    ax.arrow(x1, y1, dx1, dy1, head_width=5, head_length=2, fc='black', ec='black')
    ax.arrow(x2, y2, dx2, dy2, head_width=5, head_length=2, fc='blue', ec='blue')
    plt.xlim(-100, 100)
    plt.ylim(-100, 100)
    # The cosine of the angle between the arrows depends only on their directions
    plt.title(f"cosine similarity = {cos(radians(angle_1 - angle_2)):.2f}")
    plt.show()


An angle on its own is not a convenient numerical measure, so we use the cosine of the angle instead, which gives a value between -1 (opposite directions) and 1 (same direction).
The same idea and formula extend to multi-dimensional vectors, including our 300-dimensional word vectors.

**Cosine similarity** tells us how alike two vectors are: the higher the value, the more similar they are.
When we need a score where smaller means closer, we can use the **cosine distance** instead:

distance = 1 - cos(v1, v2)
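
As a rough illustration, the cosine of the angle between two vectors can be computed as their dot product divided by the product of their lengths. The sketch below uses plain Python so that every step is visible; the function names `cosine_similarity` and `cosine_distance` are our own, and the code assumes that neither vector consists only of zeros.

```python
from math import sqrt


def cosine_similarity(v1, v2):
    """Cosine of the angle between v1 and v2 (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def cosine_distance(v1, v2):
    """0 for vectors pointing the same way, 2 for opposite vectors."""
    return 1 - cosine_similarity(v1, v2)


same_direction = cosine_similarity([1, 2], [2, 4])
perpendicular = cosine_similarity([1, 0], [0, 3])
opposite = cosine_similarity([1, 2], [-1, -2])

print(round(same_direction, 2))  # 1.0
print(round(perpendicular, 2))   # 0.0
print(round(opposite, 2))        # -1.0
```

Note that `[1, 2]` and `[2, 4]` have different lengths but the same direction, so their similarity is still 1: only the angle matters.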

### Calculating similarity between two words

Now that we know how vector similarity works, we can use it to calculate the similarity between two words.

In [22]:
# Is cat similar to lion?
# Is cat similar to pet?
# Is cat similar to milk?
# Compare every word with every other word (and with itself)
cat = nlp.vocab["cat"]
lion = nlp.vocab["lion"]
pet = nlp.vocab["pet"]
milk = nlp.vocab["milk"]
words = [cat, lion, pet, milk]
for word_1 in words:
    for word_2 in words:
        print(f"{word_1.text} is similar to {word_2.text} with {word_1.similarity(word_2)}")

cat is similar to cat with 1.0
cat is similar to lion with 0.5265437960624695
cat is similar to pet with 0.7505456805229187
cat is similar to milk with 0.30299440026283264
lion is similar to cat with 0.5265437960624695
lion is similar to lion with 1.0
lion is similar to pet with 0.39923766255378723
lion is similar to milk with 0.23199084401130676
pet is similar to cat with 0.7505456805229187
pet is similar to lion with 0.39923766255378723
pet is similar to pet with 1.0
pet is similar to milk with 0.27374202013015747
milk is similar to cat with 0.30299440026283264
milk is similar to lion with 0.23199084401130676
milk is similar to pet with 0.27374202013015747
milk is similar to milk with 1.0


## Exercise 3

Find how similar the following pairs of words are:

a. "brother" and "sister"

b. "strong" and "weak"

c. "cactus" and "piano"

Round the calculated similarity to 2 decimal places using the `round()` function.

Example usage: `round(number, decimal_places)`, e.g. `round(5.247, 2)` returns `5.25`.

In [104]:
# Write your code here
brother = nlp.vocab["brother"]
sister = nlp.vocab["sister"]
strong = nlp.vocab["strong"]
weak = nlp.vocab["weak"]
cactus = nlp.vocab["cactus"]
piano = nlp.vocab["piano"]
print(round(brother.similarity(sister), 2))
print(round(weak.similarity(strong), 2))
print(round(cactus.similarity(piano), 2))

0.75
0.66
0.13


In [105]:
# Replace zeros with your answers
answer_a = 0.75
answer_b = 0.66
answer_c = 0.13

In [106]:
# DO NOT CHANGE THIS CELL
# RUN THIS CELL TO CHECK YOUR ANSWERS
%run resources/python/check_ex_2.py $answer_a $answer_b $answer_c

Correct! Well Done!


As you can see from this exercise, antonyms (words with opposite meanings) can also be quite similar, because they share some semantic features, e.g. "short" and "tall" both relate to the category of "height" to some extent.

## Word Vector Arithmetic

Because words are stored as vectors, we can perform ordinary vector operations on them, for example vector addition, multiplication of a vector by a scalar, and many more.
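
Here is a small sketch of these operations on toy vectors. The numbers are made up and have only three dimensions, so they are not real word embeddings; the helpers `add` and `scale` are written by hand for illustration.

```python
def add(v1, v2):
    """Element-wise sum of two vectors of the same size."""
    return [a + b for a, b in zip(v1, v2)]


def scale(v, factor):
    """Multiply every element of a vector by a scalar."""
    return [factor * a for a in v]


# Made-up 3-dimensional "word vectors" (real ones have 300 dimensions)
yellow = [3, 1, 0]
fruit = [1, 4, 2]

combined = add(yellow, fruit)    # vector addition
doubled = scale(combined, 2)     # multiplication by a scalar
average = scale(combined, 0.5)   # the point halfway between the two words

print(combined)  # [4, 5, 2]
print(doubled)   # [8, 10, 4]
print(average)   # [2.0, 2.5, 1.0]
```

Scaling changes the length of a vector but not its direction, so `combined`, `doubled` and `average` all have the same cosine similarity to any other vector.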

Have you ever wondered what you get if you combine "yellow" and "fruit"? Maybe "banana". Let's see!

In [10]:
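import numpy as np

# Add the two word vectors to get a new point in the same vector space
new_vector = nlp.vocab["yellow"].vector + nlp.vocab["fruit"].vector

# Find the 10 entries in the model's vector table that are closest to the new vector
keys, _, scores = nlp.vocab.vectors.most_similar(np.asarray([new_vector]), n=10)
for key, score in zip(keys[0], scores[0]):
    # Keys are hashes, so look up the matching text in the string store
    print(nlp.vocab.strings[int(key)], round(float(score), 2))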

## Quiz 3

In [11]:
from jupyterquiz import display_quiz
display_quiz("resources/quizzes/questions3.json")




# [Next Chapter](4.%20Let's%20Learn%20NLP%20-%20Sentiment%20Analysis.ipynb)