# Intro to spaCy and Dependency Parsing

Modified from [Natural Language Processing With spaCy in Python](https://realpython.com/natural-language-processing-spacy-python/) by Taranjeet Singh.


## Getting started with spaCy 

Load the language model instance in spaCy:

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In case you get the following error:
```Python
OSError: [E053] Could not read config file from /usr/local/lib/python3.7/dist-packages/en_core_web_sm/en_core_web_sm-2.2.5/config.cfg
```
Try running
```
!pip install spacytextblob
!python -m textblob.download_corpora
!python -m spacy download en_core_web_sm
```


### Check the attributes of the object called `nlp`.

### Read and tokenize a string

In [None]:
text = 'This tutorial is about Natural Language Processing using spaCy.'
doc = nlp(text)
# Extract tokens in the given doc
print([token.text for token in doc])

### Read and tokenize a text file

#### Write the text file

In [None]:
text = ("This tutorial is about Natural Language Processing using spaCy. "
        "But, you know that already. "
        "Notice how the file is written differently than it is read.")

fname = 'text.txt'
with open(fname, 'w') as f:
  f.write(text)

#### Read the text file

In [None]:
text_from_file = open(fname).read()
doc_from_file = nlp(text_from_file)
# Extract tokens from the text file.
print([token.text for token in doc_from_file])

## Sentence detection

In [None]:
text = ("This tutorial is about Natural Language Processing using spaCy. "
        "But, you know that already. 654. "
        "However, I'm just going to keep adding text here because I have "
        "nothing better to do with my life. It is getting a bit predictable "
        "but I'm still adding this final sentence.")
doc = nlp(text)
sentences = list(doc.sents)

print('There are', len(sentences), 'sentences in the doc.\n')

print('Sentences:')
for sentence in sentences:
    print('\t', sentence)

### Custom sentence delimiter

In [None]:
from spacy.language import Language

@Language.component('custom_boundaries')
def set_custom_boundaries(doc):
    # Adds support to use `...` as the delimiter for sentence detection

    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i+1].is_sent_start = True
    return doc


ellipsis_text = ('Hi, can you, ... never mind, I forgot'
                 ' what I was saying. So, do you think'
                 ' we should ...')

# Load a new model instance
custom_nlp = spacy.load('en_core_web_sm')
# Add the custom component to the language model pipeline.
custom_nlp.add_pipe('custom_boundaries', before='parser')
# Process the doc
custom_ellipsis_doc = custom_nlp(ellipsis_text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)

print('Custom sentence tokenization:')
for sentence in custom_ellipsis_sentences:
    print('\t', sentence)

# Sentence tokenization with no customization
ellipsis_doc = nlp(ellipsis_text)
ellipsis_sentences = list(ellipsis_doc.sents)
print('\nDefault sentence tokenization:')
for sentence in ellipsis_sentences:
    print('\t', sentence)


## Tokenization


In spaCy, you can print tokens by iterating over the `Doc` object. Each token's character offset in the original string is available as `token.idx`.

In [None]:
print('index\ttoken\n-------------')
for token in doc:
    print(f'{token.idx}\t{token}')

`token.idx` is the starting character position of the token in the original string (the token's position within the `Doc` is `token.i`). The character offset is useful, for example, when replacing words.
Example:

In [None]:
print(f'token index: {doc[8].idx}, token: {doc[8]}')
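
To make the word-replacement use case concrete, here is a minimal plain-Python sketch. It assumes a simple regular expression as a stand-in tokenizer (not spaCy's tokenization rules), and `tokenize_with_offsets` and `replace_word` are hypothetical helper names. With spaCy, `token.idx` and `len(token.text)` would supply the same offsets.

```python
import re

def tokenize_with_offsets(text):
    # Regex stand-in for a tokenizer (an assumption, not spaCy's rules):
    # runs of word characters, or single punctuation marks.
    return [(m.start(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def replace_word(text, target, replacement):
    # Work from the last token to the first so earlier offsets stay valid.
    for start, word in reversed(tokenize_with_offsets(text)):
        if word == target:
            text = text[:start] + replacement + text[start + len(word):]
    return text

sample = "This tutorial is about spaCy. I like this tutorial."
offsets = tokenize_with_offsets(sample)
replaced = replace_word(sample, "tutorial", "notebook")
print(offsets[:3])
print(replaced)
```

Replacing from right to left keeps the offsets of the untouched tokens valid, since only text after them changes length.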

The token class has more attributes.

In [None]:
for token in doc:
    print(token, token.idx, token.text_with_ws,
          token.is_alpha, token.is_punct, token.is_space,
          token.shape_, token.is_stop)

Some of the commonly used attributes:

 * **`text_with_ws`**: token text with trailing space (if present).
 * **`is_alpha`**: detects whether the token consists of alphabetic characters or not.
 * **`is_punct`**: whether the token is a punctuation symbol or not.
 * **`is_space`**: whether the token is a space or not.
 * **`shape_`**: the shape of the word.
 * **`is_stop`**: whether the token is a stop word or not.
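
The idea behind `shape_` can be sketched in plain Python. The function below is an approximation written for illustration, not spaCy's own code: it assumes uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, other characters are kept, and runs of the same shape character are capped at four.

```python
def word_shape(text):
    # Approximate word shape (an illustrative assumption, not spaCy's code).
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Count how long the current run of identical shape characters is.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)

shapes = {word: word_shape(word)
          for word in ["Toronto", "spaCy", "2022", "519", "short-term"]}
print(shapes)
```

Shapes like `ddd` are what the `SHAPE` key refers to in the rule-based matching section later in this notebook.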


<!-- ### Customized tokenization -->

In [None]:
custom_tokenization_text = ("Toronto-based and short-term are added to "
                            "illustrate custom tokenization.")

import re
import spacy
from spacy.tokenizer import Tokenizer

custom_nlp = spacy.load('en_core_web_sm')
prefix_re = spacy.util.compile_prefix_regex(custom_nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(custom_nlp.Defaults.suffixes)
infix_re = re.compile(r'''[-~]''')

def custom_tokenizer(nlp):
    # Adds support to use `-` as the delimiter for tokenization
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None
                     )


custom_nlp.tokenizer = custom_tokenizer(custom_nlp)

print('With custom tokenization:')
custom_tokenizer_doc = custom_nlp(custom_tokenization_text)
print([token.text for token in custom_tokenizer_doc])

print('\nWithout custom tokenization:')
doc = nlp(custom_tokenization_text)
print([token.text for token in doc])

## Stop words

spaCy has a list of stop words for the English language.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS
print(len(spacy_stopwords))

for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

You can remove stop words from the input text:

In [None]:
for token in doc:
    if not token.is_stop:
        print(token)

You can also create a list of tokens not containing stop words:

In [None]:
no_stopword_doc = [token for token in doc if not token.is_stop]
print(no_stopword_doc)

## Lemmatization

spaCy has the attribute `lemma_` on the Token class. This attribute has the lemmatized form of a token:

In [None]:
conference_text = ('Rose is helping organize a developer '
                   'conference on Legal Applications of Natural Language'
                   ' Processing. She keeps organizing local Python meetups'
                   ' and several internal talks at her workplace.')
conference_doc = nlp(conference_text)
print('token\tlemma\n-------------------')
for token in conference_doc:
    print(f'{token}\t{token.lemma_}')

## Word Frequency

You can now convert a given text into tokens and perform statistical analysis over it. This analysis can give you various insights about word patterns, such as common words or unique words in the text:

In [None]:
from collections import Counter
complete_text = ('Eira is a Python developer currently'
                 ' working for a Toronto-based Legaltech company.'
                 ' She is interested in learning Natural Language Processing.'
                 ' There is a developer conference happening on 21 July'
                 ' 2022 in Toronto. It is titled "Applications of Natural'
                 ' Language Processing in Law". There is a helpline number '
                 ' available at +1-1234567891. Eira is helping organize it.'
                 ' She keeps organizing local Python meetups and several'
                 ' internal talks at her workplace. Eira is also presenting'
                 ' a talk. The talk will introduce the reader about "Use'
                 ' cases of Natural Language Processing in Legaltech".'
                 ' Apart from her work, she is very passionate about music.'
                 ' Eira is learning to play the Clave. She has enrolled '
                 ' herself in the weekend batch of Great Clave Academy.'
                 ' Great Clave Academy is situated in Kingston and Toronto'
                 ' and has world-class clave instructors.')

complete_doc = nlp(complete_text)
# Remove stop words and punctuation symbols
words = [token.text for token in complete_doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)

# 5 commonly occurring words with their frequencies
print('Top-5 most common words:')
common_words = word_freq.most_common(5)
print('\t', common_words)

# Unique words
print('All words that occur only once in the text:')
print('\t', [word for (word, freq) in word_freq.items() if freq == 1])

## Part of Speech Tagging

A part of speech (POS) is a grammatical role that describes how a particular word is used in a sentence. Eight common parts of speech are:

 * Noun
 * Pronoun
 * Adjective
 * Verb
 * Adverb
 * Preposition
 * Conjunction
 * Interjection

Part of speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.

In spaCy, POS tags are available as an attribute on the Token object:

In [None]:
for token in doc:
    print(f'{token}\t{token.tag_}\t{token.pos_}\t{spacy.explain(token.tag_)}')

Here, two attributes of the Token class are accessed:

 * **`tag_`**: fine-grained part of speech.
 * **`pos_`**: coarse-grained part of speech.

`spacy.explain` gives descriptive details about a particular POS tag. spaCy provides a complete tag list along with an explanation for each tag.

Using POS tags, you can extract a particular category of words:

In [None]:
nouns = []
adjectives = []
for token in complete_doc:
    if token.pos_ == 'NOUN':
        nouns.append(token)
    if token.pos_ == 'ADJ':
        adjectives.append(token)

print('Nouns:')
print('\t', nouns)

print('Adjectives:')
print('\t', adjectives)

These lists can be used to derive insights: for example, you can remove the most common nouns or see which adjectives are used with a particular noun.

## Visualization: Using displaCy

spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

You can use displaCy to find POS tags for tokens. By default, displaCy spins up a simple web server so that you can see the visualization by opening http://127.0.0.1:5000 in your browser. However, this only works if you run the code on your local machine.


In [None]:
from spacy import displacy
interest_text = ('We are interested in learning Natural Language Processing.')
interest_doc = nlp(interest_text)
displacy.serve(interest_doc, style='dep')

If you are running the code on Google Colab (or some other cloud-based platform), you need to render the figure inline in the notebook instead.

In [None]:
from spacy import displacy

interest_text = ('We are interested in learning Natural Language Processing.')
interest_doc = nlp(interest_text)
displacy.render(interest_doc, style='dep', jupyter=True, options={'distance': 150})

In the image above, each token is assigned a POS tag written just below the token.

## Preprocessing Functions

A preprocessing function converts text to an analyzable format.
You can create a preprocessing function that takes text as input and applies the following operations:

 * Lowercases the text
 * Lemmatizes each token
 * Removes punctuation symbols
 * Removes stop words

 Here’s an example:

In [None]:
def is_token_allowed(token):
    '''
    Only allow valid tokens which are not stop words
    and punctuation symbols.
    '''
    if (not token or not token.text.strip() or
        token.is_stop or token.is_punct):
        return False
    return True

def preprocess_token(token):
    # Reduce token to its lowercase lemma form
    return token.lemma_.strip().lower()

complete_filtered_tokens = [preprocess_token(token)
    for token in complete_doc if is_token_allowed(token)]
    
complete_filtered_tokens

Note that `complete_filtered_tokens` does not contain any stop words or punctuation symbols and consists of lemmatized, lowercase tokens.
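
One reason to lemmatize before counting is that inflected forms collapse into a single entry. The sketch below illustrates this with hand-written `(token, lemma)` pairs, which are hypothetical data for illustration. In the notebook they would come from `token.text` and `token.lemma_`.

```python
from collections import Counter

# Hypothetical (token, lemma) pairs, written by hand for illustration.
pairs = [("organize", "organize"), ("organizing", "organize"),
         ("meetups", "meetup"), ("meetup", "meetup"),
         ("talks", "talk"), ("talk", "talk"), ("talk", "talk")]

# Counting surface forms keeps inflected variants apart...
surface_counts = Counter(text.lower() for text, _ in pairs)
# ...while counting lemmas merges them.
lemma_counts = Counter(lemma for _, lemma in pairs)

print(surface_counts.most_common(2))
print(lemma_counts.most_common(2))
```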


## Rule-Based Matching Using spaCy

Rule-based matching is one of the steps in extracting information from unstructured text. It is used to identify and extract tokens and phrases according to patterns (such as lowercase) and grammatical features (such as part of speech).

Rule-based matching can use regular expressions to extract entities (such as phone numbers) from an unstructured text. It is different from extracting text using regular expressions only in the sense that regular expressions don’t consider the lexical and grammatical attributes of the text.

With rule-based matching, you can extract a first name and a last name, which are always proper nouns:

In [None]:
from spacy.matcher import Matcher

text = ('Hjalmar Turesson has never helped to organize anything, '
        'especially not a developer conference on Legal Applications of '
        'Natural Language Processing.')

matcher = Matcher(nlp.vocab)

def extract_full_name(doc):
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    matcher.add('FULL_NAME', [pattern])
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        return span.text

doc = nlp(text)
extract_full_name(doc)

In this example, `pattern` is a list of dictionaries, one per token to be matched. Both require the POS tag `PROPN` (proper noun), so the pattern matches two consecutive proper nouns. The pattern is then added to the `Matcher` under the key `FULL_NAME`, which becomes the `match_id` of any match. Finally, each match is returned with its start and end token indexes.

You can also use rule-based matching to extract phone numbers:

In [None]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

conference_org_text = ('There is a developer conference '
                       'happening on 21 July 2022 in London, Ontario. '
                       'It is titled "Applications of Natural Language '
                       'Processing". There is a helpline number available '
                       'at (519) 123-456')

def extract_phone_number(nlp_doc):
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'},
               {'ORTH': ')'}, {'SHAPE': 'ddd'},
               {'ORTH': '-', 'OP': '?'},
               {'SHAPE': 'ddd'}]
    matcher.add('PHONE_NUMBER', [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text

conference_org_doc = nlp(conference_org_text)
extract_phone_number(conference_org_doc)


This example follows the same structure as the previous one; only the pattern changes so that it matches phone numbers. Here, some additional token attributes are used:

 * **`ORTH`** gives the exact text of the token.
 * **`SHAPE`** transforms the token string to show orthographic features.
 * **`OP`** defines operators. Using `?` as a value means that the pattern is optional, meaning it can match `0` or `1` times.
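
For comparison, the same idea can be written as a character-level regular expression, where `?` plays the same role as the `OP` value above. This is only an illustration of the optional quantifier. The regex works on characters while the `Matcher` works on tokens, so the two are not interchangeable, and `find_phone` is a hypothetical helper name.

```python
import re

# Character-level counterpart of: '(' ddd ')' ddd '-'? ddd
phone_re = re.compile(r"\(\d{3}\) ?\d{3}-?\d{3}")

def find_phone(text):
    # Return the first phone-like substring, or None if there is none.
    match = phone_re.search(text)
    return match.group() if match else None

samples = ["call (519) 123-456 today",   # hyphen present
           "call (519) 123456 today",    # hyphen absent, still matches
           "call 519 123-456"]           # no parentheses, no match
found = [find_phone(sample) for sample in samples]
print(found)
```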

## Dependency Parsing Using spaCy

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between *headwords* and their *dependents*. The head of a sentence has no dependency and is called the *root* of the sentence. The verb is usually the head of the sentence. All other words are directly or indirectly linked to the headword.

The dependencies can be mapped in a directed acyclic graph (DAG) representation:

 * Words are the nodes.
 * The grammatical relationships are the edges.

Dependency parsing helps you know what role a word plays in the text and how different words relate to each other. It is also used in shallow parsing and *named entity recognition* (NER).
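
The graph view can be sketched with plain Python data structures. The parse below is written by hand for illustration (an assumption, not actual parser output), with each word storing the index of its head and the relation label. The root points to itself, which mirrors how spaCy represents the root token's `head`.

```python
# Hand-written parse for illustration (an assumption, not parser output):
# each entry is (word, index of head, relation); the root is its own head.
parse = [
    ("Liv", 3, "nsubj"),
    ("is", 3, "aux"),
    ("not", 3, "neg"),
    ("learning", 3, "ROOT"),
    ("to", 5, "aux"),
    ("dance", 3, "xcomp"),
    ("salsa", 5, "dobj"),
]

# The root is the only node whose head is itself.
root = next(word for i, (word, head, _) in enumerate(parse) if head == i)

# Every other word contributes one labelled edge from its head to itself.
edges = [(parse[head][0], rel, word)
         for i, (word, head, rel) in enumerate(parse) if head != i]

print("root:", root)
for head_word, rel, word in edges:
    print(f"{head_word} --{rel}--> {word}")
```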

Here is how you can use dependency parsing to see the relationships between words:

In [None]:
text = 'Liv is not learning to dance salsa'
doc = nlp(text)
for token in doc:
    print(token.text, token.tag_, token.head.text, token.dep_, '||' , spacy.explain(token.tag_))

Depending on the model version, the relationships in this example may include:

 * **`nsubj`**: nominal subject, the subject of the sentence. Its headword is a verb.
 * **`aux`**: auxiliary word. A function word whose headword is a verb. It expresses categories such as tense, mood, aspect, voice or evidentiality. It is often a verb (which may have non-auxiliary uses as well).
 * **`dobj`**: direct object of the verb. Its headword is a verb.
 * **`neg`**: negation modifier, such as `not`.
 * **`xcomp`**: open clausal complement, a clause without its own subject that complements a verb.
 * **`npadvmod`**: noun phrase used as an adverbial modifier.


The [Stanford typed dependencies manual](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf) is a detailed list of relationships with descriptions. You can use displaCy to visualize the dependency tree:

In [None]:
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True)

This image shows you that the subject of the sentence is the proper noun `Liv`, which is linked to the verb `learning`, and that `learning` is in turn linked to `dance`.

### Navigating the Tree and Subtree

The dependency parse tree has all the properties of a [tree](https://en.wikipedia.org/wiki/Tree_(data_structure)). This tree contains information about sentence structure and grammar and can be traversed in different ways to extract relationships.

spaCy provides attributes like `children`, `lefts`, `rights`, and `subtree` to navigate the parse tree:

In [None]:
salsa_text = ('Liv Ernerdahl is a professional salsa dancer '
              'currently performing in the kitchen')
salsa_doc = nlp(salsa_text)
# Extract children of `performing` (token 8)
print(f'Children of {salsa_doc[8]}:', [token.text for token in salsa_doc[8].children])

# Extract previous neighboring node of `performing`
print(f'One neighbour of {salsa_doc[8]}:',salsa_doc[8].nbor(-1))

# Extract next neighboring node of `performing`
print(salsa_doc[8].nbor())

# Extract all tokens on the left of `performing`
print([token.text for token in salsa_doc[8].lefts])

# Extract tokens on the right of `performing`
print([token.text for token in salsa_doc[8].rights])

# Print subtree of `performing`
print(list(salsa_doc[8].subtree))


In [None]:
displacy.render(salsa_doc, style='dep', jupyter=True, options={'distance': 120})

You can construct a function that takes a subtree as an argument and returns a string by merging words in it:

In [None]:
def flatten_tree(tree):
    return ''.join([token.text_with_ws for token in list(tree)]).strip()

# Print flattened subtree of `performing`
print(flatten_tree(salsa_doc[8].subtree))


You can use this function to print all the tokens in a subtree.
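
To see what collecting a subtree involves, here is a plain-Python sketch over a list of head indices. The heads are written by hand for illustration (not parser output), and `subtree_indices` and `flatten` are hypothetical helper names. A token belongs to the subtree of a node if following its heads upward eventually reaches that node.

```python
words = ["Liv", "is", "not", "learning", "to", "dance", "salsa"]
# Hand-written head indices; the root (index 3) is its own head.
heads = [3, 3, 3, 3, 5, 3, 5]

def subtree_indices(heads, index):
    # A token is in the subtree if walking up its heads reaches `index`.
    members = []
    for i in range(len(heads)):
        node = i
        while node != index and heads[node] != node:
            node = heads[node]
        if node == index:
            members.append(i)
    return members

def flatten(words, indices):
    # Join the subtree's words in their original sentence order.
    return " ".join(words[i] for i in indices)

dance_subtree = flatten(words, subtree_indices(heads, 5))
print(dance_subtree)
```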

### Shallow Parsing

Shallow parsing, or chunking, is the process of extracting phrases from unstructured text. Chunking groups adjacent tokens into phrases on the basis of their POS tags. There are some standard, well-known chunks such as noun phrases, verb phrases, and prepositional phrases.
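
As a rough illustration of grouping adjacent tokens by POS tag, the toy chunker below collects runs of determiners, adjectives, and nouns and keeps those that end in a noun. The tagged tokens are written by hand, and `toy_noun_chunks` is a hypothetical helper. spaCy's own `noun_chunks`, covered next, relies on the dependency parse rather than on this rule.

```python
NOUN_TAGS = {"NOUN", "PROPN"}
CHUNK_TAGS = NOUN_TAGS | {"DET", "ADJ"}

def toy_noun_chunks(tagged):
    chunks, run = [], []
    # A sentinel entry flushes the final run.
    for word, pos in tagged + [("", "END")]:
        if pos in CHUNK_TAGS:
            run.append((word, pos))
            continue
        # Run ended: drop trailing determiners and adjectives,
        # then keep the run only if something (a noun) remains.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            chunks.append(" ".join(w for w, _ in run))
        run = []
    return chunks

# Hand-tagged tokens for illustration (not tagger output).
tagged = [("There", "PRON"), ("is", "AUX"), ("a", "DET"),
          ("developer", "NOUN"), ("conference", "NOUN"),
          ("happening", "VERB"), ("in", "ADP"), ("the", "DET"),
          ("old", "ADJ"), ("town", "NOUN"), (".", "PUNCT")]
chunks = toy_noun_chunks(tagged)
print(chunks)
```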

#### Noun Phrase Detection

A *noun phrase* is a phrase that has a noun as its head. It could also include other kinds of words, such as adjectives, ordinals, determiners. Noun phrases are useful for explaining the context of the sentence. They help you infer what is being talked about in the sentence.

spaCy has the property `noun_chunks` in the `Doc` object. You can use it to extract noun phrases:

In [None]:
conference_text = ('There is a developer conference '
                   'happening on 21 July 2022 in London, Ontario. ')
conference_doc = nlp(conference_text)
# Extract Noun Phrases
for chunk in conference_doc.noun_chunks:
    print(chunk)

By looking at noun phrases, you can get information about your text. For example, `a developer conference` indicates that the text mentions a conference, while the date `21 July` lets you know that the conference is scheduled for 21 July. You can figure out whether the conference is in the past or the future. `London` tells you that the conference is in London.

#### Verb Phrase Detection

A *verb phrase* is a syntactic unit composed of at least one verb. This verb can be followed by other chunks, such as noun phrases. Verb phrases are useful for understanding the actions that nouns are involved in.

spaCy has no built-in functionality to extract verb phrases, so you’ll need a library called [textacy](https://textacy.readthedocs.io/en/latest/):

```
pip install textacy thinc
```

With `textacy` installed, you can use it to extract verb phrases based on grammar rules:

In [None]:
import textacy
# text = "All living things are made of cells. Cells have organelles."
talk_text = ('The talk will introduce the reader to use '
             'cases of Natural Language Processing in Legaltech')
verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}], 
                 [{"POS": "VERB"}, {"POS":"NOUN"}],
                 [{"POS":"VERB"}]]

talk_doc = textacy.make_spacy_doc(talk_text, lang='en_core_web_sm')
verb_phrases = textacy.extract.token_matches(talk_doc, verb_patterns)

# Print all Verb Phrase
print('Verb phrases:')
for chunk in verb_phrases:
    print(f'\t{chunk.text}')

# Extract Noun Phrase to explain what nouns are involved
print('Noun phrases:')
for chunk in talk_doc.noun_chunks:
    print(f'\t{chunk}')

In this example, the verb phrase `introduce` indicates that something will be introduced. By looking at the noun phrases, you can see that `The talk` (NP) will `introduce` (VP) `the reader` (NP) to `use cases` (VP) of `Natural Language Processing` (NP) in `Legaltech` (NP).

The code above extracts verb phrases using token patterns over POS tags, not a regular expression over the raw text. You can tweak the patterns for verb phrases depending on your use case.