# Examination of the spaCy package
spaCy is a popular Python library for natural language processing. In this notebook, we will explore its API, its strengths and weaknesses, and possible ways to address the latter.

In [81]:
import spacy as sp
english = sp.load('en_core_web_sm') #Load the English language model

# Containers
spaCy has a few central container objects that hold large quantities of data. They include:
* `Language`, an object that describes the language itself, including its grammar, words and syntactic relations;
* `Doc`, an object that represents a piece of text in that language and offers a variety of utilities.

## Language container
The `Language` object holds a detailed description of the language we are dealing with and is usually initialised once, when the model is loaded with `spacy.load()`. It contains:
* the vocabulary of the language;
* the binary weights of the trained model;
* stop words (words that are ignored in most cases);
* tokenisation exceptions (for example, dotted abbreviations such as `J.A.R.V.I.S` are treated as one lexeme instead of being split into several sentences);
* punctuation rules used to divide sentences, describe syntactic relations and form phrases;
* character classes (Latin characters, quotation marks, numeric literals) and their uses;
* the lemmatiser (which returns the base form of a word), etc.

From there, it produces a `Doc` object whenever source text is passed to it by calling it like a function.
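
As a rough illustration of the list above, the sketch below inspects a few parts of a `Language` object. It uses `spacy.blank("en")`, an assumption made here so that the snippet runs without downloading a trained model: a blank pipeline carries the language data (stop words, tokenisation exceptions) but no trained weights. The names `nlp` and `tokens` are illustrative.

```python
import spacy

# A blank English pipeline: language data only, no trained components.
nlp = spacy.blank("en")

# Stop words ship with the language defaults.
print("the" in nlp.Defaults.stop_words)

# Tokenisation exceptions keep "Dr." together and split "isn't" in two.
tokens = [token.text for token in nlp("Dr. Smith isn't here.")]
print(tokens)

# Calling the Language object on a string is what produces a Doc.
print(type(nlp("Chocolate")).__name__)
```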

## Doc Container
To tokenise text, spaCy loads a `Language` object (`english`) and calls it on the source string, which returns a `Doc`ument object holding composite data about each lexeme. Per-token attributes are read from each `Token`, `label_` is read from spans such as entities, and `ents` and `sents` belong to the `Doc` itself. Some of the data we can access include:
* `text` the text exactly as it appears in the source;
* `pos_` the part of speech;
* `dep_` the syntactic role in the sentence;
* `label_` the linguistic label of a span, such as an entity type;
* `lemma_` the base form of the word;
* `ents` the named entities;
* `sents` the individual sentences.

In [82]:
from spacy.tokens import Doc
with open('sample.txt', 'r') as f:
    source = f.read() #Text used from Wikipedia https://en.wikipedia.org/wiki/Chocolate

chocolate = english(source) #Process the text
#Doc is a linear sequence of tokens (words, punctuation, etc.) and their attributes.
#Note text is not just split by spaces, but by punctuation and other symbols.
for token in chocolate[:18]:
    print(token)

Chocolate
is
a
food
made
from
roasted
and
ground
cacao
seed
kernels
that
is
available
as
a
liquid


In [83]:
#We can also read a variety of properties from the tokens:
for token in chocolate[:18]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Chocolate chocolate NOUN NN nsubj Xxxxx True False
is be AUX VBZ ROOT xx True True
a a DET DT det x True True
food food NOUN NN attr xxxx True False
made make VERB VBN acl xxxx True True
from from ADP IN prep xxxx True True
roasted roasted ADJ JJ amod xxxx True False
and and CCONJ CC cc xxx True True
ground ground NOUN NN conj xxxx True False
cacao cacao NOUN NN compound xxxx True False
seed seed NOUN NN compound xxxx True False
kernels kernels PROPN NNP pobj xxxx True False
that that PRON WDT nsubj xxxx True True
is be AUX VBZ relcl xx True True
available available ADJ JJ acomp xxxx True False
as as ADP IN prep xx True True
a a DET DT det x True True
liquid liquid ADJ JJ pobj xxxx True False


# Named Entity Recognition
spaCy identifies named entities in text and sorts them into categories such as:
* **GPE** geopolitical entity;
* **CARDINAL** numeric value;
* **PERSON** personal name;
* **DATE** date or period of time;
* **LOC** location;
* **EVENT** historical event;
* **PERCENT** percentage;
* **ORG** organisation...
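
When a label is unfamiliar, spaCy's built-in glossary can be queried directly. The short sketch below looks up the labels that appear in this notebook; the `labels` list and the `descriptions` dictionary are illustrative names, and no trained model is needed for the lookup.

```python
import spacy

# Entity labels that appear in this notebook's output.
labels = ["GPE", "CARDINAL", "PERSON", "DATE", "LOC",
          "EVENT", "PERCENT", "ORG", "NORP"]

# spacy.explain looks a label up in spaCy's glossary
# and returns None for labels it does not know.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:10} {description}")
```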

In [84]:
for entity in chocolate.ents:
    print(entity.text, entity.label_)
#As you can see, entities are not always single words: they are spans
#that can group several tokens, including numbers and symbols.

cacao GPE
Cacao GPE
19th-11th century DATE
BCE ORG
Mesoamerican NORP
Maya PERSON
Aztecs PERSON
cacao GPE
two CARDINAL
Powdered ORG
dutch NORP
today DATE
one CARDINAL
Western holidays DATE
Christmas DATE
Easter ORG
Valentine PERSON
Americas LOC
West African NORP
Ghana ORG
the 21st century DATE
some 60% PERCENT
some two million CARDINAL
West Africa GPE
2018 DATE
one CARDINAL
the Gulf Coast LOC
Veracruz GPE
Mexico GPE
1750 DATE
Pacific LOC
Chiapas GPE
Mexico GPE
Mokaya GPE
Aztec NORP
1440–1521 DATE
Brooklyn Museum ORG
Classic PRODUCT
460–480 AD MONEY
Maya PERSON
Rio Azul GPE
Maya PERSON
Maya PERSON
Maya PERSON
Maya PERSON
cacao GPE
the 15th century DATE
Aztecs ORG
Mesoamerica PERSON
Quetzalcoatl ORG
one CARDINAL
Maya PERSON
Aztecs ORG
Cymbopetalum ORG
Aztecs ORG
Mexican NORP
Aztecs ORG
Aztecs ORG
one CARDINAL
100 CARDINAL
one CARDINAL
three CARDINAL
Maya PERSON
Aztecs PERSON
Spanish NORP
Gonzalo Fernández de Oviedo y Valdés PERSON
Nicaragua GPE
1528 DATE
the 16th century DATE
European NOR

## Entity ruler
Language models are trained with machine learning, and given the complexity of natural language, they are prone to mistakes, sometimes significant ones. spaCy offers a way to correct a model through the `entity_ruler` pipe, which matches text against custom patterns and labels the matches accordingly. Let's use this pipe to correct the named entity recognition.

In [85]:
#Patterns is a list of dictionaries of the token text and the expected label.
patterns = [{"label": "NATION", "pattern": "Maya"}, 
            {"label": "NATION", "pattern": "Aztecs"},
            {"label": "DATE", "pattern": "BCE"}]
#The ruler is a rule-based component; placed before "ner", its entities take precedence.
ruler = english.add_pipe("entity_ruler", before="ner")
ruler.add_patterns(patterns)
#We can now see the new entities in the text.
chocolate = english(source)
for entity in chocolate.ents:
    print(entity.text, entity.label_)

cacao GPE
Cacao GPE
19th-11th century DATE
BCE DATE
Mesoamerican NORP
Maya NATION
Aztecs NATION
cacao GPE
two CARDINAL
Powdered ORG
dutch NORP
today DATE
one CARDINAL
Western holidays DATE
Christmas DATE
Easter ORG
Valentine PERSON
Americas LOC
West African NORP
Ghana ORG
the 21st century DATE
some 60% PERCENT
some two million CARDINAL
West Africa GPE
2018 DATE
one CARDINAL
the Gulf Coast LOC
Veracruz GPE
Mexico GPE
1750 DATE
Pacific LOC
Chiapas GPE
Mexico GPE
Mokaya GPE
Aztec NORP
1440–1521 DATE
Brooklyn Museum ORG
Classic PRODUCT
460–480 AD MONEY
Maya NATION
Rio Azul GPE
Maya NATION
Maya NATION
Maya NATION
Maya NATION
cacao GPE
the 15th century DATE
Aztecs NATION
Mesoamerica PERSON
Quetzalcoatl ORG
one CARDINAL
Maya NATION
Aztecs NATION
Cymbopetalum ORG
Aztecs NATION
Mexican NORP
Aztecs NATION
Aztecs NATION
one CARDINAL
100 CARDINAL
one CARDINAL
three CARDINAL
Maya NATION
Aztecs NATION
Spanish NORP
Gonzalo Fernández de Oviedo y Valdés PERSON
Nicaragua GPE
1528 DATE
the 16th century D

From the output we can see that _Aztecs_ and _Maya_ are now tagged as nations. This approach allows us to override statistically trained systems and impose our own definitions. However, since it is rule-based, it applies to every occurrence of the _patterns_. Consequently, it should be used only when the pattern is unambiguous.
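
The ambiguity risk can be seen in a small sketch. It assumes a blank English pipeline with only an entity ruler (so no trained model is involved) and an invented two-sentence text; the same pattern fires on the people and on a personal name alike.

```python
import spacy

# Blank pipeline with a single rule: every "Maya" token is a NATION.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "NATION", "pattern": "Maya"}])

# The second "Maya" is a first name, but the rule cannot tell.
doc = nlp("The Maya made chocolate. Maya Angelou wrote poetry.")
found = [(ent.text, ent.label_, ent.start) for ent in doc.ents]
print(found)
```

Both occurrences come back labelled `NATION`, which is why such rules are safest for strings that have one meaning in the corpus at hand.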

The entity ruler is just one of many pipes. In spaCy, you can design your own custom pipes that process language according to your specific needs and add them to the `Language` object, which will then produce `Doc`s that reflect this functionality.

In our example, we see that `cacao` is recognised as a _geopolitical entity_ (`GPE`). Let's modify our language model so that it no longer treats it as a named entity.

In [86]:
#spaCy uses the @Language.component decorator to register a function as a component.
from spacy.language import Language
@Language.component("filter")
def remove_cacao(doc):
    """Removes cacao from the named entities."""
    #Keep every entity except the two spellings of "cacao";
    #the function is not called "filter" to avoid shadowing the built-in.
    doc.ents = [entity for entity in doc.ents
                if entity.text not in ("cacao", "Cacao")]
    return doc

#We can add our custom component to the pipeline and see it in action.
english.add_pipe("filter", after="ner")
chocolate = english(source)
for entity in chocolate.ents:
    print(entity.text, entity.label_)

19th-11th century DATE
BCE DATE
Mesoamerican NORP
Maya NATION
Aztecs NATION
two CARDINAL
Powdered ORG
dutch NORP
today DATE
one CARDINAL
Western holidays DATE
Christmas DATE
Easter ORG
Valentine PERSON
Americas LOC
West African NORP
Ghana ORG
the 21st century DATE
some 60% PERCENT
some two million CARDINAL
West Africa GPE
2018 DATE
one CARDINAL
the Gulf Coast LOC
Veracruz GPE
Mexico GPE
1750 DATE
Pacific LOC
Chiapas GPE
Mexico GPE
Mokaya GPE
Aztec NORP
1440–1521 DATE
Brooklyn Museum ORG
Classic PRODUCT
460–480 AD MONEY
Maya NATION
Rio Azul GPE
Maya NATION
Maya NATION
Maya NATION
Maya NATION
the 15th century DATE
Aztecs NATION
Mesoamerica PERSON
Quetzalcoatl ORG
one CARDINAL
Maya NATION
Aztecs NATION
Cymbopetalum ORG
Aztecs NATION
Mexican NORP
Aztecs NATION
Aztecs NATION
one CARDINAL
100 CARDINAL
one CARDINAL
three CARDINAL
Maya NATION
Aztecs NATION
Spanish NORP
Gonzalo Fernández de Oviedo y Valdés PERSON
Nicaragua GPE
1528 DATE
the 16th century DATE
European NORP
Central American NORP
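
The filter above hard-codes the word it removes. One possible generalisation, sketched here under assumptions, is a configurable component registered with `@Language.factory`. The factory name `entity_blocklist`, its `blocked` setting and the demo sentence are all hypothetical, and a blank pipeline with an entity ruler stands in for the trained model.

```python
import spacy
from spacy.language import Language

# Hypothetical configurable component: drops entities whose text is blocked.
@Language.factory("entity_blocklist", default_config={"blocked": []})
def create_entity_blocklist(nlp, name, blocked):
    blocked_lower = {text.lower() for text in blocked}

    def entity_blocklist(doc):
        # Keep only entities whose lower-cased text is not blocked.
        doc.ents = [ent for ent in doc.ents
                    if ent.text.lower() not in blocked_lower]
        return doc

    return entity_blocklist

# Stand-in pipeline: the ruler plays the role of the trained "ner" pipe.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "cacao"},
                    {"label": "GPE", "pattern": "Mexico"}])
nlp.add_pipe("entity_blocklist", config={"blocked": ["cacao"]})

doc = nlp("Mexico grows cacao.")
kept = [(ent.text, ent.label_) for ent in doc.ents]
print(kept)
```

The blocked words then live in the pipeline configuration rather than in the function body, so the same component can be reused with a different list.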


# Matcher
One of the great strengths of spaCy is its ability to retrieve pieces of text that correspond to a custom pattern, which can refer to parts of speech and sentence structure. The matcher can evaluate a variety of token attributes, including:
* `ORTH` matches the exact spelling;
* `LENGTH` matches tokens of the specified length;
* `ENT_TYPE` matches tokens with the specified entity type;
* `IS_SENT_START` checks whether the token begins a sentence;
* `IS_LOWER`, `IS_UPPER`, `IS_TITLE` check whether the token conforms to a casing convention;
* `IS_ALPHA`, `IS_DIGIT`, `IS_ASCII` check whether the token is composed of the given set of characters;
* `IS_PUNCT`, `IS_SPACE`, `IS_STOP` check whether the token is punctuation, whitespace or a stop word;
* `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` check whether the token resembles a number, a URL or an e-mail address;
* `POS`, `TAG`, `LEMMA`, `DEP`, `SHAPE`, `MORPH` match tokens by their morphological and syntactic annotations.

Patterns are represented as lists of dictionaries, one dictionary per token. Each dictionary pairs attributes with their expected values and may carry a _quantifier_ under the `OP` key that sets how many times the token pattern must occur:
* `!` negates the pattern by requiring it to match 0 times;
* `?` makes the pattern optional by letting it match 0 or 1 times;
* `+` requires the pattern to match 1 or more times;
* `*` allows the pattern to match 0 or more times;
* `{n}` requires the pattern to match _exactly_ n times;
* `{n,}` requires the pattern to match _at least_ n times;
* `{,m}` requires the pattern to match _at most_ m times.
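
To make the attribute and quantifier lists concrete, here is a minimal sketch on a blank English pipeline (an assumption, so that no trained model is required). The pattern names `KIND` and `NUMBER` and the sample sentence are invented for the illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# An optional modifier ("?") followed by the word "chocolate".
kind = [{"LOWER": {"IN": ["dark", "milk", "white"]}, "OP": "?"},
        {"LOWER": "chocolate"}]
# One or more ("+") purely numeric tokens, such as a year.
number = [{"IS_DIGIT": True, "OP": "+"}]
matcher.add("KIND", [kind])
matcher.add("NUMBER", [number])

doc = nlp("Dark chocolate, milk chocolate and chocolate were sold in 1750.")
# as_spans=True returns Span objects labelled with the pattern name;
# filter_spans keeps the longest of any overlapping matches.
spans = filter_spans(matcher(doc, as_spans=True))
found = [(span.label_, span.text) for span in spans]
print(found)
```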

With the matcher, spaCy can filter a language sample and detect a wide range of themes in it, such as spam, particular topics, political or religious views, personality traits or the authorship of indirect speech, and it can help correct many kinds of mistakes in the text. To showcase it, let's build a script that prints every sentence of the sample text that contains a date.

In [87]:
#Matcher is a rule-based matcher that works with language vocabulary.
from spacy.matcher import Matcher
matcher = Matcher(english.vocab)
patterns = [{"ENT_TYPE": "DATE", "OP": "+"}] #Search for runs of one or more DATE tokens
matcher.add("Search dates", [patterns]) #Add the pattern to the matcher
matches = matcher(chocolate) #Find all matches in the text
spans = list()
#Match is a tuple of the match ID, start and end indexes.
for match_id, start, end in matches: 
    spans.append(chocolate[start:end])
#The matcher returns overlapping matches, so we keep only the longest ones.
spans = sp.util.filter_spans(spans) #Delete shorter copies
for span in spans:
    print(str(span.sent).replace("\n", " "))

Cacao has been consumed in some form since at least the Olmec civilization (19th-11th century BCE), and the majority of Mesoamerican people ─ including the Maya and Aztecs ─ made chocolate beverages.  
Much of the chocolate consumed today is in the form of sweet chocolate, a combination of cocoa solids, cocoa butter or added vegetable oils, and sugar.
Gifts of chocolate molded into different shapes (such as eggs, hearts, coins) are traditional on certain Western holidays, including Christmas, Easter, Valentine's Day, and Hanukkah.
Gifts of chocolate molded into different shapes (such as eggs, hearts, coins) are traditional on certain Western holidays, including Christmas, Easter, Valentine's Day, and Hanukkah.
Although cocoa originated in the Americas, West African countries, particularly Côte d'Ivoire and Ghana, are the leading producers of cocoa in the 21st century, accounting for some 60% of the world cocoa supply.  
A 2018 report argued that international attempts to improve condit

It is furthermore possible to pair the spaCy matcher with regular expressions to achieve an even more accurate and deeper analysis. This is useful for patterns that follow strict writing rules, such as names that begin with Mr. or Mrs., or professional titles such as Dr. While the matcher works with the linguistic annotations of tokens, regular expressions are great for detecting patterns in the raw string. In the example below I will create a pipe that recognises named entities by the title that precedes them.

In [88]:
import re
from spacy.tokens import Span
from spacy.util import filter_spans
@Language.component("addresses")
def search_names(doc: Doc) -> Doc:
    """Searches the text for names preceded by a title and adds them to the named entities."""
    #The expression matches the titles Mr., Mrs., Ms., Dr. and Prof. followed by a capitalised word.
    pattern = re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.?\s[A-Z][a-z]+")
    spans = (doc.char_span(match.start(), match.end(), label="PERSON")
             for match in pattern.finditer(doc.text))
    #char_span returns None when a match does not align with token boundaries, so skip those.
    names = [span for span in spans if span is not None]
    #Add the names to the named entities; filter_spans resolves overlaps in favour of the longest span.
    doc.ents = tuple(filter_spans(list(doc.ents) + names))
    return doc

english.add_pipe("addresses")
chocolate = english(source)
for entity in chocolate.ents:
    if entity.label_ == "PERSON":
        print(entity.text, entity.label_)

Valentine PERSON
Mesoamerica PERSON
Gonzalo Fernández de Oviedo y Valdés PERSON
Christopher Columbus PERSON
José de Acosta PERSON
Vanilla PERSON
Alexander VII PERSON
Coenraad van Houten PERSON
Joseph Fry PERSON
Daniel Peter PERSON
Henri Nestlé PERSON
Tuke PERSON
Milton S. Hershey PERSON
Baker PERSON
Dr. James PERSON
John Hannon PERSON
Yucatec Mayan PERSON


In the example above, I built a custom component that uses regular expressions to match entities in the text and added it to the language pipeline. It is applied every time text is processed with this `Language` object, and its results are reflected in every `Doc` object it produces.
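
One detail of the component is worth a closer look: `Doc.char_span()` maps character offsets back to tokens, and with its default strict alignment it returns `None` when the offsets fall inside a token. The sketch below shows this on a blank pipeline with an invented sentence and a deliberately sloppy expression; the `alignment_mode="expand"` option is shown as one possible way to snap such a match to whole tokens.

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("Dr. James Baker sold chocolate.")

# A deliberately sloppy expression that stops in the middle of "Baker".
match = re.search(r"James Bak", doc.text)

# Default ("strict") alignment: offsets must sit exactly on token boundaries.
strict = doc.char_span(match.start(), match.end(), label="PERSON")
# "expand" alignment: snap outwards to the enclosing token boundaries.
expanded = doc.char_span(match.start(), match.end(), label="PERSON",
                         alignment_mode="expand")

print(strict)
print(expanded.text, expanded.label_)
```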

This example is not flawless: it captures only the first word after the title and does not anticipate cases where the title is part of another name, such as a film title. For cases like these, it is possible to involve machine learning or to add further pipes that deal with the specific cases.
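
The first limitation could be approached with the matcher itself, which accepts a token-level `REGEX` operator. The sketch below is one possible variant, not a drop-in replacement for the component: it assumes a blank English pipeline, an invented sentence, and that any run of capitalised tokens after a title belongs to the name.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# A title token, checked with a token-level regular expression,
# followed by one or more capitalised tokens.
titled_name = [
    {"TEXT": {"REGEX": r"^(Mr|Mrs|Ms|Dr|Prof)\.?$"}},
    {"IS_TITLE": True, "OP": "+"},
]
matcher.add("PERSON", [titled_name])

doc = nlp("Dr. James Baker founded the company with Mr. Hannon.")
# Keep the longest match for each name.
names = [span.text for span in filter_spans(matcher(doc, as_spans=True))]
print(names)
```

Because `+` is greedy only after filtering, the shorter candidate "Dr. James" is discarded in favour of the full name.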

## Data visualisation
spaCy provides ways to represent the syntactic connections between words and to visualise their hierarchy right in the notebook. To do this, we use the `displacy` sub-module and its `render()` function.

In [89]:
#Visualising the dependency tree.
sentence = list(chocolate.sents)[0]
sp.displacy.render(sentence, style='dep', jupyter=True)

In [90]:
#Visualising the named entities.
sp.displacy.render(chocolate, style='ent', jupyter=True)
#Note that all applied components are working as intended.
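
Outside a notebook, the same function can return the markup as a string, which can then be saved or embedded in a page. The sketch below assumes a blank pipeline with a one-pattern entity ruler and an invented sentence, so it runs without a trained model.

```python
import spacy
from spacy import displacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Mexico"}])
doc = nlp("Cacao was cultivated in Mexico.")

# jupyter=False makes render() return the markup instead of displaying it;
# page=True wraps it in a complete HTML page.
html = displacy.render(doc, style="ent", jupyter=False, page=True)
print(len(html))
```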

The model succeeded in classifying expressions like _"some two million"_ as cardinal and _"West Africa"_ as a geopolitical entity. However, it mistakenly classified _"Powdered"_ as an organisation and, before the entity ruler was added, _"Aztecs"_ as a person, because it failed to grasp the context. This shows that models trained with supervised machine learning do not yield perfect accuracy, although they can be improved over time by retraining them on more high-quality data.

# Word vectors
Word vectors, or word embeddings, are a technique used in NLP to represent the meaning of words. Since computers do not think the way humans do and cannot understand the meaning behind words, an intermediate representation is needed to map meaning into a form on which computers can perform operations. In the past, each word corresponded to a unique integer, much like an enum, whereas nowadays data scientists assign each word an array of numbers that captures its meaning and usage. We will explore the word vectors of the `en_core_web_md` model.

spaCy allows us to compare `Doc` objects in a variety of ways, including:
* meaning comparison;
* structural comparison.

In [91]:
#Evaluate the similarity between variable text samples.
vector = sp.load('en_core_web_md') #Load the English language model with vectors
text = vector(source)
sample1 = "Chocolate is a healthy sweet."
sample2 = "Chocolate is a delicious sweet."
sample3 = "Chocolate is a terrible sweet."
sample4 = "Chocolate cannot burn fat."
sample5 = "Chocolate Magnate Vladyslav Korol is a Ukrainian businessman and politician."
print(vector(sample1).similarity(vector(sample2))) #Comparing semantically similar sentences
print(vector(sample1).similarity(vector(sample3)))
print(vector(sample2).similarity(vector(sample3)))
print("-------------------")
print(vector(sample1).similarity(vector(sample4))) #Comparing unrelated sentences
print(vector(sample1).similarity(vector(sample5)))
print(vector(sample4).similarity(vector(sample5)))

0.9838006734337593
0.9782059508625104
0.9869280032429322
-------------------
0.5412843641434292
0.8045995912346176
0.39264233824859496


From the results we see that the `en_core_web_md` model rates sentences that follow a similar pattern as highly similar (above 97%). The scores still react to differences in meaning: the similarity of samples 1 and 3 is about half a percentage point lower than that of samples 1 and 2 because of the difference *healthy - terrible*. Note, however, that samples 2 and 3 (*delicious - terrible*) receive the highest score of all (0.987), although their meanings are opposite. The unrelated sentences score much lower (0.39 to 0.80).

From the example above we see that word embeddings are good at evaluating how close texts are in structure and topic, but they are less sensitive to opposite meanings expressed in otherwise identical sentences.
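
One way to read these scores: by default, `Doc.similarity` compares the averages of the token vectors using cosine similarity, so sentences that share most of their words end up close together. The sketch below reproduces that effect with made-up two-dimensional vectors; the numbers are toy values chosen for the illustration, not taken from `en_core_web_md`.

```python
import math

def average(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors (invented for this sketch, not from any model).
toy = {
    "chocolate": [1.0, 0.0], "is": [1.0, 0.0],
    "a": [1.0, 0.0], "sweet": [1.0, 0.0],
    "healthy": [0.0, 1.0], "terrible": [0.0, -1.0],
}

def sentence_vector(sentence):
    """Average of the toy vectors of the words in the sentence."""
    return average([toy[word] for word in sentence.split()])

# The two adjectives point in opposite directions...
word_similarity = cosine(toy["healthy"], toy["terrible"])
# ...yet the averaged sentences remain close, since four of five words are shared.
sentence_similarity = cosine(
    sentence_vector("chocolate is a healthy sweet"),
    sentence_vector("chocolate is a terrible sweet"),
)
print(word_similarity, sentence_similarity)
```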

## Spacy Pipelines
spaCy enables developers to create custom text-processing models and to apply layers of functions to them. A pipeline is organised as a sequence of *pipes*, which are functions that process language in one way or another and yield a certain result. This is useful when we only need to perform a single task, such as counting the sentences in a text: we can initialise an empty model and add only the pipes we need. Let's compare the performance of a custom and a pre-trained model on the same task:

In [92]:
#Initialise a new NLP model to perform sentence counting.
custom = sp.blank('en') #Create a blank Language class
custom.add_pipe('sentencizer') #Add the sentencizer component to the pipeline

<spacy.pipeline.sentencizer.Sentencizer at 0x7fad56c68e80>

In [93]:
%%time
doc = custom(text) 
print(len(list(doc.sents))) #Print the number of sentences
custom.analyze_pipes() #Print the model statistics and architecture


94
CPU times: user 2.1 ms, sys: 0 ns, total: 2.1 ms
Wall time: 2.09 ms


{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}

In [94]:
%%time
doc = english(text)
print(len(list(doc.sents)))
english.analyze_pipes()

87
CPU times: user 302 ms, sys: 0 ns, total: 302 ms
Wall time: 301 ms


{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

From the results above, we see that the custom blank model is much more lightweight and takes significantly less time (about 2 ms versus about 300 ms). However, its rule-based sentencizer was never trained on data, so it splits sentences less accurately (94 sentences versus 87). Trained models take more time but make more accurate predictions, which leaves developers with a trade-off.
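
A possible middle ground is to keep one pipeline and switch off the pipes a task does not need. The sketch below shows the mechanism with `select_pipes` on a blank pipeline; the two components and the sample text are stand-ins chosen so that the snippet runs without a trained model.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Mexico"}])
all_pipes = list(nlp.pipe_names)

text = "Cacao grows in Mexico. Chocolate is made from it!"

# Inside the block only the sentencizer runs; the ruler is restored afterwards.
with nlp.select_pipes(disable=["entity_ruler"]):
    active_pipes = list(nlp.pipe_names)
    doc = nlp(text)

sentence_count = len(list(doc.sents))
print(active_pipes, sentence_count, doc.ents)
print(nlp.pipe_names == all_pipes)
```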

# Recap and conclusions
spaCy is an open-source, MIT-licensed natural language processing library that aims to be an industry-standard package. It allows NLP engineers to work with languages by loading a variety of pre-trained models, which turn a piece of text into a highly detailed linguistic data structure, the `Doc`. A `Doc` object breaks the source text into tokens, and meaningful groups of tokens form _spans_ that can be a single word or a phrase. `Doc`s also give access to the individual sentences and named entities in the text and let us visualise the syntactic relations. On top of that, spaCy equips linguists with a matcher engine that returns all spans matching a specific pattern, and it lets them build their own functionality by adding custom pipes to the language models and overriding the existing capabilities.

Among its issues, I can note a slight lack of integration with regular expressions in pattern matching: a match over the raw string has to be mapped back to a token span by hand with `Doc.char_span()`, and one of the things that could be improved is making this step more convenient.

The spaCy library is a great starting point for a variety of natural-language tasks, including theme detection, text summarisation (via word embeddings), text editing, text parsing and machine translation. In the hands of a skilled developer and an insightful scholar, this library can power some of the most impressive linguistic software.

# Sources and references
spaCy official documentation: 
1. _https://spacy.io/api/language_
2. _https://spacy.io/api/Matcher_
3. _https://spacy.io/usage/linguistic-features#language-data_
