# Word Vectors and Similarity
Word vectors are numerical representations of words and documents. Because words that are similar in meaning and context get vectors that lie close together, the numeric form captures some of a word's semantics and can be used for NLP tasks such as classification.
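
To make "closer together" concrete, here is a minimal sketch that compares a few toy vectors with cosine similarity. The three-dimensional vectors are invented for illustration (real spaCy vectors have hundreds of dimensions), and cosine_similarity is a hypothetical helper written here, not a spaCy function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a missing (all-zero) vector as "no similarity"
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "good":  [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
}

close_score = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
far_score = cosine_similarity(toy_vectors["good"], toy_vectors["table"])

print(f"good vs great: {close_score:.3f}")
print(f"good vs table: {far_score:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while vectors that point in different directions score much lower.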

spaCy's medium (md) and large (lg) models ship with built-in word vectors, which can be accessed directly through the attributes of Token and Doc.

First, load a spaCy model of your choice. Here, I am using the medium English model, en_core_web_md. Next, process your text with the model's nlp object to tokenize it.

You can check whether a token has a built-in vector through the Token.has_vector attribute.

In [1]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [2]:
# Check if a word vector is available
import spacy

# Load a spaCy model that ships with word vectors
nlp = spacy.load("en_core_web_md")
doc = nlp("I am an excellent cook")

for token in doc:
    print(token.text, ' ', token.has_vector)

I   True
am   True
an   True
excellent   True
cook   True


In [3]:
# Made-up words such as "lolXD" have no vector
doc = nlp("I wish to go to hogwarts lolXD")
for token in doc:
    print(token.text, ' ', token.has_vector)

I   True
wish   True
to   True
go   True
to   True
hogwarts   True
lolXD   False


In [4]:
# Print the L2 norm of each token's vector
doc = nlp("I wish to go to hogwarts lolXD")
for token in doc:
    print(token.text, ' ', token.vector_norm)

I   6.4231944
wish   5.1652417
to   4.74484
go   5.05723
to   4.74484
hogwarts   7.4110312
lolXD   0.0


The vector itself is available as token.vector, and vector_norm is its L2 (Euclidean) norm. Notice that when a token has no vector, its vector_norm is 0.
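
As a rough illustration of what vector_norm measures, the sketch below computes an L2 norm by hand. The l2_norm helper is hypothetical and written for this example. The 300-element zero vector stands in for a token without a vector, on the assumption that such tokens get an all-zero vector of the model's width (300 for en_core_web_md).

```python
import math

def l2_norm(vector):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))

# A small vector whose length is easy to verify by hand (3-4-5 triangle).
known_norm = l2_norm([3.0, 4.0])

# Assumption: a token without a vector is represented by all zeros.
missing_norm = l2_norm([0.0] * 300)

print(known_norm)
print(missing_norm)
```

A norm of 0 is therefore a quick signal that there is nothing to compare, which is why similarity scores involving such tokens are not meaningful.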

Measuring how similar two words, tokens, or sentences are is the basis of many everyday NLP tasks, such as text classification and recommendation systems. Knowing how similar two sentences are lets you group them into the same category or into opposite ones. In spaCy, Token and Doc objects provide a similarity() method that returns a score, where a higher value means more similar.

In [5]:
# Compute similarity between two single-word docs
doc_1 = nlp("bad")
doc_2 = nlp("terrible")

similarity_score = doc_1.similarity(doc_2)
print(similarity_score)

0.773919122839211


In [6]:
# Compare pairs of short reviews using their document vectors
review_1 = nlp(' The food was amazing')
review_2 = nlp('The food was excellent')
review_3 = nlp('I did not like the food')
review_4 = nlp('It was very bad experience')

score_1 = review_1.similarity(review_2)
print('Similarity between review 1 and 2', score_1)

score_2 = review_3.similarity(review_4)
print('Similarity between review 3 and 4', score_2)

Similarity between review 1 and 2 0.9566213306804962
Similarity between review 3 and 4 0.8461898618188776
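
It helps to know roughly how a score for whole sentences comes about. As far as the spaCy documentation describes it, the default Doc.similarity is the cosine similarity of the averaged token vectors. The sketch below reproduces that idea in plain Python. The two-dimensional token vectors are invented, and mean_vector and cosine_similarity are hypothetical helpers, not spaCy functions.

```python
import math

def mean_vector(token_vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 2-dimensional token vectors, invented for illustration.
text_a = [[1.0, 0.0], [0.0, 1.0]]  # two tokens
text_b = [[0.0, 1.0], [1.0, 0.0]]  # the same two tokens in reverse order

doc_vector_a = mean_vector(text_a)
doc_vector_b = mean_vector(text_b)
order_score = cosine_similarity(doc_vector_a, doc_vector_b)

print(doc_vector_a, doc_vector_b, order_score)
```

Because averaging discards word order, both toy texts end up with the same document vector and a score of about 1. This is one reason to read sentence-level scores as a measure of shared vocabulary and topic rather than of agreement in meaning.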


You can also check whether two tokens or docs are related at all or completely unrelated. Related words include both near-synonyms and opposites, since both tend to occur in similar contexts.

In [7]:
# Compute similarity between related and unrelated words
pizza = nlp('pizza')
burger = nlp('burger')
chair = nlp('chair')

print('Pizza and burger  ', pizza.similarity(burger))
print('Pizza and chair  ', pizza.similarity(chair))

Pizza and burger   0.7269758060509472
Pizza and chair   0.1917965994241628
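
If you want to turn such scores into a simple related/unrelated decision, one possible approach is a small helper like the one below. This is only a sketch: relatedness is a hypothetical function, the 0.5 threshold is an arbitrary assumption that you would need to tune on your own data, and the helper simply expects objects that behave like spaCy Doc or Token objects (a vector_norm attribute and a similarity() method).

```python
def relatedness(a, b, threshold=0.5):
    """Label a pair as 'related', 'unrelated', or 'unknown'.

    a and b are expected to behave like spaCy Doc or Token objects.
    threshold is a hypothetical cut-off, not a spaCy default.
    """
    # A zero norm means no vector is available, so no score is meaningful.
    if a.vector_norm == 0 or b.vector_norm == 0:
        return "unknown"
    score = a.similarity(b)
    return "related" if score >= threshold else "unrelated"

# Usage with a loaded model (not executed here):
# nlp = spacy.load("en_core_web_md")
# print(relatedness(nlp("pizza"), nlp("burger")))
# print(relatedness(nlp("pizza"), nlp("chair")))
```

With the scores printed above (about 0.73 and 0.19), a threshold of 0.5 would label pizza and burger as related and pizza and chair as unrelated.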
