# Examination of the spaCy package
spaCy is a popular Python library for natural language processing. In this notebook, we explore its APIs, strengths and weaknesses, and possible ways to work around the latter.

In [11]:
import spacy as sp
english = sp.load('en_core_web_sm') #Load the English language model

## Containers
spaCy has a few central container objects that hold large quantities of data. They include:
* `Language` the processing pipeline for a language, holding its vocabulary and the components that analyse text;
* `Doc` an object representing a piece of text in said language, with a variety of utilities.
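
The relationship between these containers can be sketched as follows. This is a minimal example under a few assumptions: it uses `spacy.blank("en")`, a tokeniser-only pipeline, instead of the `en_core_web_sm` model loaded above so that it runs without a downloaded model, and the sample sentence is made up for illustration. It also touches two related containers not listed above, `Token` and `Span`.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline (tokeniser only) stands in for a trained model.
nlp = spacy.blank("en")
doc = nlp("Chocolate is a food made from cacao.")

print(isinstance(nlp, Language))   # the pipeline itself is a Language object
print(isinstance(doc, Doc))        # calling it on a string returns a Doc
print(isinstance(doc[0], Token))   # indexing a Doc gives a single Token
print(isinstance(doc[0:3], Span))  # slicing a Doc gives a Span of tokens
print(doc[0:3].text)
```

A blank pipeline only tokenises; attributes such as `pos_`, `dep_` and `ents` need the trained components that come with a model like `en_core_web_sm`.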

### Doc Container
To tokenise text, we load a `Language` object (`english`) and call it on a string. This returns a `Doc` object that holds composite data about each token. Some of the data we can access include:
* `text` the original text of the token;
* `pos_` the part of speech;
* `dep_` the syntactic role in the sentence;
* `label_` the category label of a named entity;
* `lemma_` the base form of the word;
* `ents` the named entities in the document;
* `sents` the individual sentences of the document.

In [12]:
with open('sample.txt', 'r') as f:
    text = f.read() #Text used from Wikipedia https://en.wikipedia.org/wiki/Chocolate

chocolate = english(text) #Process the text
#Doc is a linear sequence of tokens (words, punctuation, etc.) and their attributes.
for token in chocolate[:18]:
    print(token)
#Text is not just split by spaces, but by punctuation and other symbols.

Chocolate
is
a
food
made
from
roasted
and
ground
cacao
seed
kernels
that
is
available
as
a
liquid


In [13]:
#We can also read a variety of properties from the tokens:
for token in chocolate[:18]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Chocolate chocolate NOUN NN nsubj Xxxxx True False
is be AUX VBZ ROOT xx True True
a a DET DT det x True True
food food NOUN NN attr xxxx True False
made make VERB VBN acl xxxx True True
from from ADP IN prep xxxx True True
roasted roasted ADJ JJ amod xxxx True False
and and CCONJ CC cc xxx True True
ground ground NOUN NN conj xxxx True False
cacao cacao NOUN NN compound xxxx True False
seed seed NOUN NN compound xxxx True False
kernels kernels PROPN NNP pobj xxxx True False
that that PRON WDT nsubj xxxx True True
is be AUX VBZ relcl xx True True
available available ADJ JJ acomp xxxx True False
as as ADP IN prep xx True True
a a DET DT det x True True
liquid liquid ADJ JJ pobj xxxx True False


## Named Entity Recognition
spaCy does well at identifying named entities in text and sorting them into categories:
* **GPE** geopolitical entity;
* **CARDINAL** numeric value;
* **PERSON** personal name;
* **DATE** date or period;
* **LOC** location;
* **EVENT** named event;
* **PERCENT** percentage;
* **ORG** organisation...

The model is trained well overall, but it does make mistakes: in the output below, for example, `cacao` is labelled GPE and `Easter` is labelled ORG.
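
The label abbreviations do not have to be memorised. A small sketch using `spacy.explain`, which looks a label up in spaCy's built-in glossary and returns a short description; the labels in the list are simply taken from this notebook's own output and include `NORP`, which is not in the list above.

```python
import spacy

# Labels copied from the token and entity output in this notebook.
labels = ["GPE", "NORP", "CARDINAL", "PERCENT", "nsubj", "pobj"]

for label in labels:
    # spacy.explain returns a description string for labels it knows.
    print(f"{label:10} {spacy.explain(label)}")
```

The same call works for part-of-speech tags and dependency labels, which makes it handy when reading the token table above.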

In [14]:
for entity in chocolate.ents:
    print(entity.text, entity.label_)
#As you can see, each entity is a span of one or more tokens with a category label.

cacao GPE
Cacao GPE
19th-11th century DATE
BCE ORG
Mesoamerican NORP
Maya PERSON
Aztecs PERSON
cacao GPE
two CARDINAL
Powdered ORG
dutch NORP
today DATE
one CARDINAL
Western holidays DATE
Christmas DATE
Easter ORG
Valentine PERSON
Americas LOC
West African NORP
Ghana ORG
the 21st century DATE
some 60% PERCENT
some two million CARDINAL
West Africa GPE
2018 DATE


## Data visualisation
spaCy can visualise the syntactic connections between words and their hierarchy directly in the notebook. To do this, we use the `displacy` sub-module and its `render()` function.

In [19]:
#Visualising the dependency tree.
sentence = list(chocolate.sents)[0]
sp.displacy.render(sentence, style='dep', jupyter=True)

In [17]:
#Visualising the named entities.
sp.displacy.render(chocolate, style='ent', jupyter=True)
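
Outside a notebook, the same function can return the markup as a string instead of displaying it. A hedged sketch, assuming `displacy.render` is called with `jupyter=False` and `manual=True`; the sentence and its entity offsets are hand-written for illustration (they are not model output), so no trained model is needed.

```python
from spacy import displacy

# Hand-written example in displaCy's manual entity format (not model output).
example = {
    "text": "Ghana is a major cocoa producer.",
    "ents": [{"start": 0, "end": 5, "label": "GPE"}],
    "title": None,
}

# jupyter=False makes render() return the HTML instead of displaying it inline.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.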

## Word vectors
Word vectors, or word embeddings, are a technique used in NLP to represent the meaning of words. Computers cannot understand the meaning behind words the way humans do, so an intermediate numerical representation is needed before they can operate on meaning. In the past, each word was mapped to a unique integer, much like an enum value. Nowadays data scientists assign each word an array of numbers that captures its meaning and semantics. We will explore word vectors in the `en_core_web_md` model.
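
To make the contrast concrete, here is a toy sketch in plain Python. The three-dimensional vectors are invented for illustration (real model vectors have hundreds of dimensions and are learned from data), and `cosine` is a helper defined here, not a spaCy function.

```python
import math

# Integer IDs are arbitrary labels: the gap between 0 and 1 says nothing about meaning.
word_ids = {"chocolate": 0, "candy": 1, "politician": 2}

# Toy vectors (made up): related words point in similar directions.
word_vectors = {
    "chocolate": [0.9, 0.8, 0.1],
    "candy": [0.8, 0.9, 0.2],
    "politician": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

related = cosine(word_vectors["chocolate"], word_vectors["candy"])
unrelated = cosine(word_vectors["chocolate"], word_vectors["politician"])
print(related, unrelated)  # the related pair scores much higher
```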

In [20]:
vector = sp.load('en_core_web_md') #Load the English language model with vectors
chocolate = vector(text) #Process the text

spaCy allows us to evaluate `Doc` objects in a variety of ways, including:
* meaning comparison;


In [24]:
#Evaluate the similarity between various text samples.
sample1 = "Chocolate is a healthy sweet."
sample2 = "Chocolate is a delicious sweet."
sample3 = "Chocolate is a terrible sweet."
sample4 = "Chocolate cannot burn fat."
sample5 = "Chocolate Magnate Vladyslav Korol is a Ukrainian businessman and politician."
print(vector(sample1).similarity(vector(sample2))) #Comparing semantically similar sentences
print(vector(sample1).similarity(vector(sample3)))
print(vector(sample2).similarity(vector(sample3)))
print("-------------------")
print(vector(sample1).similarity(vector(sample4))) #Comparing unrelated sentences
print(vector(sample1).similarity(vector(sample5)))
print(vector(sample4).similarity(vector(sample5)))

0.9838006734337593
0.9782059508625104
0.9869280032429322
-------------------
0.5412843641434292
0.8045995912346176
0.39264233824859496


From the results we see that the `en_core_web_md` model rates sentences that follow a similar pattern as highly similar (above 0.97). It still registers some difference in meaning: the score drops by roughly half a percentage point, from 0.984 for samples 1 and 2 to 0.978 for samples 1 and 3, where *healthy* is set against *terrible*. However, sample 2 is less similar to sample 1 (0.984) than to sample 3 (0.987), even though *delicious* and *terrible* are opposites, so the score does not reliably reflect sentiment. The unrelated sentences score much lower, between 0.39 and 0.80.
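
One likely reason the first three scores sit so close together is that, according to spaCy's documentation, the default `Doc.similarity` is a cosine similarity over the average of the word vectors. When two sentences share most of their words, a single differing word is diluted by the average. The sketch below illustrates this with invented toy vectors in plain Python; `mean_vector` and `cosine` are helpers defined here, not spaCy functions, and the numbers are not taken from any model.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors (made up) for the four words both sentences share.
shared = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1], [0.9, 0.0, 0.2], [1.0, 0.3, 0.1]]
# Toy vectors for the one word that differs, deliberately pointing in opposite directions.
delicious = [0.1, 1.0, 0.0]
terrible = [0.1, -1.0, 0.0]

word_level = cosine(delicious, terrible)
sentence_level = cosine(mean_vector(shared + [delicious]),
                        mean_vector(shared + [terrible]))
print(word_level, sentence_level)  # the sentence score stays high despite the opposite words
```

In this toy setup the two adjectives are almost exactly opposed, yet the averaged sentence vectors remain close, which mirrors the pattern in the scores above.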