Learn spaCy, part 2
----------------------

In this part we again use a medium-sized model, which ships with word vectors. The code below loads `en_core_web_md`, the English model; a French model such as `fr_core_news_md` could be used in the same way for French text.

In [6]:
# let's import spacy and numpy
import spacy
import numpy as np

In [35]:
# create an nlp object
nlp = spacy.load("en_core_web_md")

In [36]:
with open("data/wiki_us.txt", "r") as f:
    text = f.read()

In [37]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]

In [38]:
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


#### Word and sentence similarities with vectorization

Let's print the 10 words most similar to "country". To do so, we look up the vector of the word and ask the model's vector table for its nearest neighbours.

In [39]:
word = "country"

# look up the vector of `word` and query the vector table for its 10 nearest neighbours
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=10
)

In [40]:
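# ms is a tuple (keys, best_rows, scores); ms[0][0] holds the keys for our single query
words = [nlp.vocab.strings[w] for w in ms[0][0]]
# ms[2] holds similarity scores (higher means closer), not distances
scores = ms[2]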

print(words)

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']
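
Conceptually, `most_similar` ranks the rows of the vector table by their cosine similarity to the query vector and returns the top `n`. The sketch below reproduces that idea on a tiny hand-made table. The words and three-dimensional vectors are invented for illustration and do not come from `en_core_web_md`, and `top_n_similar` is a hypothetical helper, not a spaCy function. Unlike the spaCy call, it skips the query word itself.

```python
import math

# Toy vector table: invented 3-d vectors, NOT taken from a spaCy model.
toy_vectors = {
    "country": [0.9, 0.1, 0.0],
    "nation": [0.8, 0.2, 0.1],
    "state": [0.7, 0.3, 0.0],
    "banana": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(word, vectors, n=3):
    """Hypothetical helper: rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


neighbours = top_n_similar("country", toy_vectors, n=2)
print([w for w, _ in neighbours])
```

In this toy table, "nation" and "state" point in almost the same direction as "country", so they rank ahead of "banana".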


Let's compare some sentences.

In [45]:
# the following two sentences are about roughly the same topic
# (a fast food context)
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very dangerous.")

In [46]:
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very dangerous. 0.6739822816641421


We obtain a fairly high similarity (about 0.67), or, more precisely, a small distance between the vectors of the two sentences.
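
`Doc.similarity` returns a cosine similarity by default, so "high similarity" and "low distance" are two views of the same number: cosine distance is 1 minus cosine similarity. Here is a minimal sketch of that relationship, using made-up two-dimensional vectors rather than real document vectors (the function names are illustrative, not part of spaCy):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance: 0.0 for the same direction, 1.0 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)


# Invented vectors: `near` points almost the same way as `ref`, `far` is orthogonal.
ref = [1.0, 0.0]
near = [1.0, 0.1]
far = [0.0, 1.0]

print(cosine_distance(ref, near) < cosine_distance(ref, far))
```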

In [51]:
# the following sentence is not about fast food
doc3 = nlp("The Empire State Building is in New York.")

In [52]:
print(doc1, "<->", doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. <-> The Empire State Building is in New York. 0.1766669125394067


As expected, the similarity between the first sentence and this one is low.

In [55]:
doc4 = nlp("I eat a lot of fruits.")
doc5 = nlp("I eat a lot of legumes.")



In [56]:
print(doc4, "<->", doc5, doc4.similarity(doc5))

I eat a lot of fruits. <-> I eat a lot of legumes. 0.992280000395795


The last example illustrates that the similarity is calculated over the whole sentence: both sentences share the same structure and all but one word, so the score is close to 1.
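
By default, a `Doc` vector is the average of its token vectors, so two sentences that share most of their tokens end up with nearly identical averages. The sketch below illustrates this effect with invented two-dimensional word vectors (not values from `en_core_web_md`); `mean_vector` and the `toy` table are hypothetical names used only for this illustration.

```python
import math

# Invented 2-d word vectors, NOT taken from a spaCy model.
toy = {
    "I": [1.0, 0.0],
    "eat": [0.8, 0.2],
    "a": [1.0, 0.1],
    "lot": [0.9, 0.1],
    "of": [1.0, 0.0],
    "fruits": [0.2, 1.0],
    "burgers": [1.0, 0.2],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(tokens, vectors):
    """Hypothetical helper: average the vectors of the given tokens."""
    dims = len(next(iter(vectors.values())))
    return [
        sum(vectors[tok][d] for tok in tokens) / len(tokens)
        for d in range(dims)
    ]


sent_a = ["I", "eat", "a", "lot", "of", "fruits"]
sent_b = ["I", "eat", "a", "lot", "of", "burgers"]

# Similarity of the two differing words on their own.
word_sim = cosine_similarity(toy["fruits"], toy["burgers"])
# Similarity of the averaged sentence vectors.
sentence_sim = cosine_similarity(
    mean_vector(sent_a, toy), mean_vector(sent_b, toy)
)

print(round(word_sim, 3), round(sentence_sim, 3))
```

Even though the two differing words are far apart in this toy space, the five shared words dominate the averages and pull the sentence-level score close to 1.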

Let's try a clearer example by comparing "I eat a lot of fruits" with "I eat a lot of burgers" and see whether we get a lower similarity.

In [57]:
doc6 = nlp("I eat a lot of burgers.")
print(doc4, "<->", doc6, doc4.similarity(doc6))

I eat a lot of fruits. <-> I eat a lot of burgers. 0.9877901604375569


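This seems reasonable: the two sentences have the same structure and differ in a single word, so the score stays very high. It drops only slightly (0.988 versus 0.992), which fits the intuition that burgers are a different kind of food from fruits, although a burger may or may not contain legumes.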