## Installation

Before starting, install spaCy. Open a console, terminal, or PowerShell (from Anaconda Navigator) and run `conda install -c conda-forge spacy`, then download the medium-sized English model with `python -m spacy download en_core_web_md`.

Also install tqdm with `conda install tqdm`; it provides a progress bar for long operations.

You might need to restart this notebook afterwards (`Kernel` -> `Restart` above).

## Load a model

In [1]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_md')

In [2]:
# Processing some text
doc = nlp('Andrea Palladio was an Italian architect. This course is supposed to be about natural language processing.')

## Tokenization / Lemmatization

In [3]:
for token in doc:
    print(token.text, token.lemma_, sep=' -> ')

Andrea -> Andrea
Palladio -> Palladio
was -> be
an -> an
Italian -> italian
architect -> architect
. -> .
This -> this
course -> course
is -> be
supposed -> suppose
to -> to
be -> be
about -> about
natural -> natural
language -> language
processing -> processing
. -> .


In [4]:
# Note: printing a token shows only its text, but a token is a much richer object with many attributes
token = doc[4]
print(token)
print(type(token))
print(token.vocab)

Italian
<class 'spacy.tokens.token.Token'>
<spacy.vocab.Vocab object at 0x7f75981f1cc8>


### Aside: Bag-of-Words

In [5]:
doc = nlp('''
In computing, stop words are words which are filtered out before
or after processing of natural language data (text).
''')

set(token.lemma_ for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space)

{'computing',
 'datum',
 'filter',
 'language',
 'natural',
 'processing',
 'stop',
 'text',
 'word'}
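
The set above records only which lemmas occur. A bag-of-words usually also keeps how often each one occurs. The sketch below shows that counting step in plain Python so that it runs without spaCy. The regex tokenizer and the tiny `STOP_WORDS` set are stand-ins assumed for illustration, not spaCy's tokenizer or stop list, and no lemmatization is done (so `words` stays plural).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list (spaCy's own list is much longer)
STOP_WORDS = {"in", "are", "which", "out", "before", "or", "after", "of"}

def bag_of_words(text):
    """Count lowercase word forms, skipping stop words and punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

text = """
In computing, stop words are words which are filtered out before
or after processing of natural language data (text).
"""
bow = bag_of_words(text)
print(bow.most_common(3))
```

With real spaCy tokens, the same `Counter` could be fed with `token.lemma_` values filtered exactly as in the cell above.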

## Parsing the text

In [6]:
doc_large = nlp('Andrea Palladio was an Italian architect. This course is supposed to be about natural language processing.')

### Sentences

In [7]:
for sent in doc_large.sents:
    print(sent)

Andrea Palladio was an Italian architect.
This course is supposed to be about natural language processing.


In [8]:
# A sentence is a Span, which is just a slice of a Doc
type(sent)

spacy.tokens.span.Span

In [9]:
# Like many structures in spaCy, the sentences are exposed as a generator, not a list.
# So you can iterate over them, but you cannot index a given element directly.
doc_large.sents

<generator at 0x7f7571e20558>

In [10]:
# ... but you can always convert them easily
list(doc_large.sents)

[Andrea Palladio was an Italian architect.,
 This course is supposed to be about natural language processing.]
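
The same behaviour can be seen with any Python generator. The sketch below uses a hypothetical `fake_sents` generator as a stand-in for `doc_large.sents` (an assumption, so that it runs without spaCy). It shows that indexing fails, and that `itertools.islice` fetches only the first few items without building the full list.

```python
from itertools import islice

def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "Andrea Palladio was an Italian architect."
    yield "This course is supposed to be about natural language processing."

gen = fake_sents()
try:
    gen[0]  # generators do not support indexing
    indexable = True
except TypeError:
    indexable = False
print(indexable)

# Take only the first sentence; the remaining ones are never produced
first = next(islice(fake_sents(), 1))
print(first)

# Converting to a list gives random access, at the cost of memory
as_list = list(fake_sents())
print(as_list[1])
```

For a long book, `islice` is the cheaper option when only the first few sentences are needed.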

In [11]:
# From a token you can always access the corresponding doc, sent, ...
token = doc_large[1]
token.sent

Andrea Palladio was an Italian architect.

### POS

In [12]:
doc = nlp('Andrea Palladio was an Italian architect')
displacy.render(doc)

In [13]:
# Each token has a coarse POS tag, a fine-grained tag, and a dependency label
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, sep='      \t')

Andrea      	PROPN      	NNP      	compound
Palladio      	PROPN      	NNP      	nsubj
was      	AUX      	VBD      	ROOT
an      	DET      	DT      	det
Italian      	ADJ      	JJ      	amod
architect      	NOUN      	NN      	attr


In [14]:
# Each token has a dependency relation to its head, plus children and ancestors in the parse tree
for token in doc:
    print(token.text, token.dep_, token.head, list(token.children), list(token.ancestors), sep='      \t')

Andrea      	compound      	Palladio      	[]      	[Palladio, was]
Palladio      	nsubj      	was      	[Andrea]      	[was]
was      	ROOT      	was      	[Palladio, architect]      	[]
an      	det      	architect      	[]      	[architect, was]
Italian      	amod      	architect      	[]      	[architect, was]
architect      	attr      	was      	[an, Italian]      	[was]
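
The `children` and `ancestors` columns can both be derived from the `head` column alone. The sketch below rebuilds them in plain Python from the heads printed above. Keying the tree by word text is a simplification assumed here, and it only works because every word in this sentence is unique (spaCy itself tracks token positions).

```python
# Head of each token, copied from the output above (the root is its own head)
HEADS = {
    "Andrea": "Palladio",
    "Palladio": "was",
    "was": "was",
    "an": "architect",
    "Italian": "architect",
    "architect": "was",
}

def ancestors(word):
    """Follow head links upwards until the root is reached."""
    chain = []
    while HEADS[word] != word:  # the root points to itself
        word = HEADS[word]
        chain.append(word)
    return chain

def children(word):
    """Tokens whose head is `word` (the root itself is excluded)."""
    return [w for w, h in HEADS.items() if h == word and w != word]

print(ancestors("Andrea"))
print(children("was"))
```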


**The explanation of these terms can be consulted [here](https://spacy.io/api/annotation#pos-tagging) or one can use `spacy.explain`.**

In [15]:
# spacy.explain returns a short description of a tag or dependency label
print(spacy.explain('NNP'))
print(spacy.explain('amod'))
print(spacy.explain('JJ'))

noun, proper singular
adjectival modifier
adjective


## Load Xenotheka data

The file `texts_english.bin` contains the pre-parsed content of 558 English-language books from Xenotheka. It is a large file because it stores all the annotations (POS tags, lemmas, dependencies), and it took a few hours to compute.

In [16]:
from spacy.tokens import DocBin
from tqdm.auto import tqdm

# Efficient structure for saving lots of parsed documents
doc_bin = DocBin(attrs=["HEAD", "TAG", "LEMMA", "DEP", "POS"], store_user_data=True)

# Reading the binary file, 558 pre-parsed texts (376MB)
with open('texts_english.bin', 'rb') as f:
    loaded_data = doc_bin.from_bytes(f.read())

In [17]:
# WARNING: loading all the 558 books takes roughly 6GB of RAM
# You might want to try on a small subset first
N_BOOKS = 5

docs = []
N_BOOKS = min(N_BOOKS, len(loaded_data))
for i, doc in zip(tqdm(range(N_BOOKS)), loaded_data.get_docs(nlp.vocab)):
    docs.append(doc)





In [18]:
# Each book is just a large spaCy Doc that also records the name of its original text file
docs[0].user_data['file']

'Abbott__Flatland.txt'

## Rule based matching

For more examples and explanations, head over to https://spacy.io/usage/rule-based-matching

In [19]:
from spacy.matcher import Matcher
# Create a matcher object
matcher = Matcher(nlp.vocab)

matcher.add("myVeryOwnMacthingRule", None, [{"LEMMA": "describe"}])

In [20]:
# Try to find matches in the first book
doc = docs[0]
matches = matcher(docs[0])
for match_id, start, end in matches:
    # match_id is the id of the matching rule, this line is to get back the name 'myVeryOwnMacthingRule'
    string_id = nlp.vocab.strings[match_id]
    # Get the token from the starting position in the document
    t = doc[start]
    print(start, t)
    print(t.sent)
    print('---')

234 described
In such a country, you will perceive at once that it is impossible that
there should be anything of what you call a "solid" kind; but I dare
say you will suppose that we could at least distinguish by sight the
Triangles, Squares, and other figures, moving about as I have described
them.  
---
893 describe
You may perhaps ask how under these disadvantagous circumstances we are
able to distinguish our friends from one another: but the answer to
this very natural question will be more fitly and easily given when I
come to describe the inhabitants of Flatland.  
---
9992 described
The children of the
poor are therefore allowed to "feel" from their earliest years, and
they gain thereby a precocity and an early vivacity which contrast at
first most favourably with the inert, undeveloped, and listless
behaviour of the half-instructed youths of the Polygonal class; but
when the latter have at last completed their University course, and are
prepared to put their theory into practi

In [21]:
d = nlp('I describe it.')
displacy.render(d)
for t in d:
    print(t, t.lemma_, t.pos_, t.tag_)

I -PRON- PRON PRP
describe describe VERB VBP
it -PRON- PRON PRP
. . PUNCT .


In [22]:
displacy.render(nlp('I was describing it.'))

In [23]:
displacy.render(nlp('I ask you to describe it.'))

In [24]:
d = nlp('Let me explain.')
displacy.render(d)
for t in d:
    print(t, t.lemma_, t.pos_, t.tag_)

Let let VERB VB
me -PRON- PRON PRP
explain explain VERB VB
. . PUNCT .


In [25]:
# Try to find matches in the first book
matcher = Matcher(nlp.vocab)
matcher.add("myVeryOwnMacthingRule", None, [
    {"LEMMA": {'IN':["describe", 'explain', 'define']}, 'POS': 'VERB'}
])
doc = docs[0]
matches = matcher(docs[0])
for match_id, start, end in matches:
    # match_id is the id of the matching rule, this line is to get back the name 'myVeryOwnMacthingRule'
    string_id = nlp.vocab.strings[match_id]
    # Get the token from the starting position in the document
    t = doc[start]
    # Print only if the matched verb has a direct subject
    if any(c.dep_ == 'nsubj' and c.text in ('me', 'I') for c in t.children):  # Direct subject, first person singular
        print('Author explaining')
        print(t.sent)
        print('---')
    elif any(c.dep_ == 'nsubj' for c in t.children):  # Direct subject, other
        print('Not author explaining')
        print(t.sent)
        print('---') 

Author explaining
In such a country, you will perceive at once that it is impossible that
there should be anything of what you call a "solid" kind; but I dare
say you will suppose that we could at least distinguish by sight the
Triangles, Squares, and other figures, moving about as I have described
them.  
---
Author explaining
But let me explain my meaning, without further eulogies on
this beneficent Element.


---
Author explaining
Now therefore the artful Irregular whom I described above as the real
author of this diabolical Bill, determined at one blow to lower the
status of the Hierarchy by forcing them to submit to the pollution of
Colour, and at the same time to destroy their domestic opportunities of
training in the Art of Sight Recognition, so as to enfeeble their
intellects by depriving them of their pure and colourless homes.  
---
Author explaining
The loss
of a few sides in a highly-developed Polygon is not easily noticed, and
is sometimes compensated by a successful opera

### Named Entity Recognition

In [26]:
doc = nlp('Le Corbusier was a French architect. He built Notre-Dame de Ronchamp.')
displacy.render(doc, style='ent')

In [27]:
for ent in doc.ents:
    print(ent, ent.label_)

Le Corbusier PERSON
French NORP
Notre-Dame de Ronchamp ORG


## Word Vectors 

In [28]:
doc = nlp('Architect is a beautiful job')
token = doc[0]

In [29]:
# Accessing the word vector for a token
token.vector

array([ 4.3056e-01, -2.5697e-01,  7.2521e-02, -3.4186e-01, -6.9907e-02,
        2.1462e-01,  1.6019e-01,  4.6952e-02, -3.0886e-01,  2.4907e+00,
        2.5334e-03, -3.3628e-01, -6.5238e-01,  3.7404e-01,  4.7670e-01,
        1.3775e-01,  1.0689e-01,  6.7676e-01,  6.3706e-01, -7.1364e-02,
        1.1170e+00, -3.7098e-01, -2.5129e-01, -5.2586e-01, -3.8282e-01,
       -5.2030e-01, -3.2476e-01,  6.1889e-01,  1.3549e-01,  7.1957e-01,
        1.7008e-01,  4.0052e-02,  3.8600e-01,  2.0393e-01,  5.9692e-02,
       -5.2871e-01, -4.9954e-03, -3.2586e-01, -5.7786e-02, -5.3843e-01,
        2.0995e-01, -2.3496e-02,  5.8118e-02, -2.5206e-01, -6.4122e-01,
       -4.9476e-01, -1.9597e-01, -7.9340e-01,  6.2612e-01,  6.1045e-01,
        4.8935e-01, -1.6363e-01, -1.7103e-01,  1.6363e-01,  7.6102e-02,
       -1.7962e-01, -1.8494e-01,  4.9449e-01,  7.9426e-01, -4.3375e-01,
       -3.7887e-01, -2.5459e-01,  2.8804e-01,  6.0464e-01,  1.9814e-01,
        3.4301e-01,  1.0328e-01, -3.0353e-01, -8.9278e-01, -8.34

In [30]:
# Function that will print the most similar words according to their word vectors
def print_most_similar(word, n=10):
    v = nlp.vocab[word].vector
    keys, _, scores = nlp.vocab.vectors.most_similar(v[None], n=n)
    for k, s in zip(keys[0], scores[0]):
        print(nlp.vocab[k].text, s)
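
To make the ranking step less opaque, here is a minimal brute-force sketch of a nearest-neighbour search by cosine similarity, which is the kind of score this lookup reports. It is plain Python with a hypothetical `TOY_VECTORS` table of made-up 3-dimensional vectors. It does not reproduce spaCy's implementation or its real vectors.

```python
import math

# Hypothetical 3-dimensional vectors; real models use hundreds of dimensions
TOY_VECTORS = {
    "describe": [0.9, 0.1, 0.0],
    "explain": [0.8, 0.3, 0.1],
    "define": [0.7, 0.0, 0.4],
    "banana": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, n=3):
    """Return the n (word, score) pairs closest to the vector of `word`."""
    query = TOY_VECTORS[word]
    scored = [(other, cosine(query, vec)) for other, vec in TOY_VECTORS.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

for other, score in most_similar("describe"):
    print(other, round(score, 4))
```

As in the real lookup, the query word is its own nearest neighbour, with a score of about 1.0.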

In [31]:
print_most_similar('describe')

DESCRIBE 1.0
DESCRIBING 0.805
explain 0.7498
EXPLAIN 0.7498
DESCRIBES 0.7362
RELATE 0.7206
define 0.6901
DEFINE 0.6901
UNDERSTAND 0.6882
understand 0.6882


In [32]:
print_most_similar('architect', n=20)

co-designer 1.0
SURVEYOR 0.637
engineer 0.637
ARCHITECTURAL 0.6084
HOMEBUILDER 0.6013
COUTURIER 0.5686
CONSULTANT 0.5339
REMODELER 0.4933
DEVELOPER 0.4884
BUIDING 0.488
building 0.488
deisgn 0.4816
design 0.4816
Watchmakers 0.4762
artist 0.4741
CERAMICIST 0.4741
PLASTERERS 0.4734
Outplacement 0.4701
GENEALOGIST 0.4621
engineering 0.4598


In [33]:
my_words = nlp('beautiful building')
# A Doc also has a vector, which is the mean of the vectors of its individual tokens
my_words.has_vector

True
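
The averaging mentioned in the comment above can be sketched in a few lines of plain Python. The `TOKEN_VECTORS` values are made up for illustration (an assumption, not taken from the model). The cosine score at the end mirrors what `similarity` computes by default on such averaged vectors.

```python
import math

# Hypothetical 3-dimensional token vectors (real models use far more dimensions)
TOKEN_VECTORS = {
    "beautiful": [0.2, 0.8, 0.0],
    "building": [0.6, 0.0, 0.4],
    "house": [0.5, 0.1, 0.5],
}

def mean_vector(tokens):
    """Component-wise mean of the vectors of the given tokens."""
    vectors = [TOKEN_VECTORS[t] for t in tokens]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = mean_vector(["beautiful", "building"])
candidate = mean_vector(["beautiful", "house"])
print(query)
print(round(cosine(query, candidate), 4))
```

Because the mean ignores word order, "beautiful building" and "building beautiful" would get exactly the same vector.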

In [34]:
# Order the sentences by their similarity to the query words
# Search in the first 10 books (or fewer, if fewer were loaded)
sents = sorted((sent for doc in docs[:10] for sent in doc.sents),
                key=lambda sent: my_words.similarity(sent) if sent.has_vector else 0.0,
                reverse=True)

In [35]:
# Display the top 10 sentences
print(*sents[:10], sep='\n--\n')

Together with villas and garden architecture, gates were classified in the genre called Rustic, partly for the reason that the grandest of all ancient Roman city gates was the rusticated Porta Maggiore.
--
The spirit of the Early Christian church survived with its splendidly decorated inner walls and simple geometric exteriors.
--
The choir is charming, far more charming than the nave, as the beautiful woman is more charming than the elderly man.
--
No such building was constructed in the Renaissance.
--
Like the Cortile del Belvedere, which was built to rival the great villas of antiquity, the Campidoglio was a monumental symbol in which the haunting dream of ancient grandeur became concrete.
--
Decorative landscapes painted by Paolo Veronese within this villa (fig. 8.19) invite the owner and visitors to look through the architecture onto ideal landscapes that would be Plinian were it not for the appearance of Roman ruins.
--
The windows are set into walls built to close the fifteen