# Finding linguistic patterns using spaCy

This section teaches you to find linguistic patterns using spaCy, a natural language processing library for Python.

If you are unfamiliar with the linguistic annotations produced by spaCy or need to refresh your memory, revisit [Part II](../part_ii/03_basic_nlp.ipynb) before working through this section.

After reading this section, you should:

 - know how to search for patterns among Tokens and their sequences
 - know how to search for patterns among morphological features

## Finding patterns using spaCy Matchers

Linguistic annotations, such as part-of-speech tags, syntactic dependencies and morphological features, help impose structure on written language. Crucially, linguistic annotations allow searching for structural patterns instead of individual words or phrases. This allows defining search patterns in a flexible way.

In the spaCy library, the capability for pattern search is provided by various components named Matchers.

spaCy provides three types of Matchers:

1. A [Matcher](https://spacy.io/api/matcher), which allows defining rules that search for particular **words or phrases** by examining *Token* attributes.  
2. A [DependencyMatcher](https://spacy.io/api/dependencymatcher), which allows searching parse trees for **syntactic patterns**.
3. A [PhraseMatcher](https://spacy.io/api/phrasematcher), a fast method for matching spaCy *Doc* objects to *Doc* objects.

### Finding words or phrases

To get started with the *Matcher*, let's import the spaCy library and load a small language model for English.

In [22]:
# Import the spaCy library into Python
import spacy

# Load a small language model for English; assign the result under 'nlp'
nlp = spacy.load('en_core_web_sm')

To have some data to work with, let's load some text from a Wikipedia article.

To do so, we use Python's `open()` function to open the file for reading, providing the `file`, `mode` and `encoding` arguments, as instructed in [Part II](../part_ii/01_basic_text_processing.ipynb#Loading-plain-text-files-into-Python).

We then call the `read()` method to read the file contents, and store the result under the variable `text`.

In [23]:
# Use the open() function to open the file for reading, followed by the
# read() method to read the contents of the file.
text = open(file='data/occupy.txt', mode='r', encoding='utf-8').read()

This returns a Python string object that contains the article in plain text, which is available under the variable `text`.

Next, we feed this object to the language model under the variable `nlp` as instructed in [Part II](../part_ii/03_basic_nlp.ipynb#Performing-basic-NLP-tasks-using-spaCy).

We also use Python's `len()` function to count the number of *Tokens* in the resulting *Doc* object.

In [24]:
# Feed the string object to the language model
doc = nlp(text)

# Use the len() function to check length of the Doc object to count 
# how many Tokens are contained within.
len(doc)

14867
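Keep in mind that *Tokens* are not the same thing as words, because punctuation marks are *Tokens* too. As a minimal sketch of the difference, the example below uses a blank English pipeline created with `spacy.blank()`, which contains only a tokenizer, and a short sentence invented for illustration. Neither is part of the data used in this section.

```python
import spacy

# Create a blank English pipeline: it contains only the tokenizer,
# so no trained language model needs to be installed.
blank_nlp = spacy.blank('en')

# Tokenize a short example sentence (invented for illustration)
example = blank_nlp('Hello, world!')

# len() counts every Token, including punctuation marks
n_tokens = len(example)

# Count only the Tokens that are not punctuation marks
n_words = len([token for token in example if not token.is_punct])

print(n_tokens, n_words)
```

The same list comprehension could be applied to the *Doc* object under `doc` to get a rough word count.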

Now that we have a spaCy *Doc* object with nearly 15 000 *Tokens*, we can continue to import the *Matcher* class from the `matcher` submodule of spaCy.

In [25]:
# Import the Matcher class
from spacy.matcher import Matcher

Importing the *Matcher* class from spaCy's `matcher` submodule allows creating *Matcher* objects.

When creating a *Matcher* object, you must provide it with the vocabulary of the language model that is used for finding matches.

The reason for this is that spaCy stores strings, such as attribute values and the names of patterns, in the vocabulary. The *Matcher* must therefore share its vocabulary with the *Doc* objects that it searches.

The vocabulary of a model is stored in a [*Vocab*](https://spacy.io/api/vocab) object. The *Vocab* object can be found under the attribute `vocab` of a spaCy *Language* object, which was introduced in [Part II](../part_ii/03_basic_nlp.ipynb#Performing-basic-NLP-tasks-using-spaCy).

In this case, we have the *Language* object that contains a small language model for English stored under the variable `nlp`, which means we can access its *Vocab* object by calling `nlp.vocab`.

We then call the *Matcher* **class** and provide the vocabulary under `nlp.vocab` to the `vocab` argument to create a *Matcher* object. We store the resulting object under the variable `matcher`.

In [26]:
# Create a Matcher and provide model vocabulary; assign result under the variable 'matcher'
matcher = Matcher(vocab=nlp.vocab)

# Call the variable to examine the object
matcher

<spacy.matcher.matcher.Matcher at 0x14ee96cc0>

The *Matcher* object is now ready to store the patterns that we want to search for.

These patterns, or more specifically, *pattern rules*, are created using a [specific format](https://spacy.io/api/matcher#patterns) defined in spaCy.

Each pattern consists of a Python list, which is populated by Python dictionaries. 

Each dictionary describes the pattern for matching a single spaCy *Token*. 

If you wish to match a sequence of *Tokens*, you must define multiple dictionaries within a single list, whose order follows that of the pattern to be matched.

Let's start by defining a simple pattern for matching sequences of pronouns and verbs, which we store under the variable `pronoun_verb`.

This pattern consists of a list, as marked by the surrounding brackets `[]`, which contains two dictionaries, marked by curly braces `{}` and separated by a comma. The key and value pairs in a dictionary are separated by a colon.

 - The dictionary key determines which *Token* attribute should be searched for matches. The attributes supported by the *Matcher* can be found [here](https://spacy.io/api/matcher#patterns).

 - The value under the dictionary key determines the specific value for the attribute.

In this case, we define a pattern that searches for a sequence of two coarse part-of-speech tags (`POS`), which were introduced in [Part II](../part_ii/03_basic_nlp.ipynb#Part-of-speech-tagging), namely pronouns (`PRON`) and verbs (`VERB`).

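Note that the attribute names used as keys are written in uppercase letters, as are the coarse part-of-speech tags used as values here. In general, the values must be written exactly as they appear in spaCy's output.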

In [27]:
# Define a list with nested dictionaries that contains the pattern to be matched
pronoun_verb = [{'POS': 'PRON'}, {'POS': 'VERB'}]

Now that we have defined the pattern using a list and dictionaries, we can add it to the *Matcher* object under the variable `matcher`.

This can be achieved using the `add()` method, which requires two inputs:

 1. A Python string object that defines a name for the pattern. This is simply for purposes of identification.
 2. A list containing the pattern(s) to be searched for. A single rule for matching patterns can contain multiple patterns, hence the input must be a *list of lists*, e.g. `[pattern_1]`.

In [28]:
# Add the pattern to the matcher under the name 'pronoun+verb'
matcher.add("pronoun+verb", patterns=[pronoun_verb])

To search for matches in the *Doc* object stored under the variable `doc`, we feed the *Doc* object to the *Matcher* and store the result under the variable `result`.

We also set the optional argument `as_spans` to `True`, which instructs spaCy to return the results as *Span* objects.

In [29]:
# Apply the Matcher to the Doc object under 'doc'; provide the argument
# 'as_spans' and set its value to True to get Spans as output
result = matcher(doc, as_spans=True)

# Call the variable to examine the output
result

[It aimed,
 It formed,
 it organizes,
 who designed,
 He wrote,
 They promoted,
 It refers,
 they saw,
 they argued,
 they called,
 it takes,
 they called,
 who comment,
 them using,
 they belong,
 himself warned,
 he said,
 they think,
 them gain,
 they wished,
 they blamed,
 I support,
 I saw,
 It showed,
 who gave,
 they refused,
 they saw,
 who caused,
 We are,
 who sought,
 who were,
 who were,
 who made,
 who criticized,
 it returned,
 its proposed,
 They received,
 there have,
 it came,
 it gained,
 He claimed,
 they presented,
 they call,
 It consists,
 there were,
 there was,
 What started,
 it is,
 they began,
 they perceived,
 they say,
 We agree,
 we see,
 it's,
 who are,
 who say,
 what's,
 they do,
 they reflect,
 He mentioned,
 We regard,
 who participated,
 he wrote,
 we have,
 who dislike,
 they employ,
 they have,
 there is,
 it stall,
 who emerged,
 It pushes,
 who called]

The output is a list of spaCy *Span* objects that match the requested pattern. Let's examine the first object in the list of matches in greater detail.

In [30]:
result[0]

It aimed

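The *Span* object has various useful attributes, including `start` and `end`. These attributes contain the indices that indicate where in the *Doc* object the *Span* starts and finishes. As with Python slices, `start` is the index of the first *Token* in the *Span*, whereas `end` is the index that follows its last *Token*.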

In [31]:
result[0].start, result[0].end

(36, 38)

Another useful attribute is `label`, which contains the name that we gave to the pattern. Let's take a closer look at this attribute.

In [32]:
result[0].label

12298179334642351811

The value stored under the `label` attribute is actually an integer. This integer is a hash value, which spaCy uses to look up the corresponding entry, a [*Lexeme*](https://spacy.io/api/lexeme) object, in the language model's vocabulary.

This *Lexeme* contains the name that we gave to the search pattern above, namely `pronoun+verb`.

We can easily verify this by using the integer to fetch the *Lexeme* from the *Vocab* object under `nlp.vocab` and examining its `text` attribute.

In [33]:
nlp.vocab[12298179334642351811].text

'pronoun+verb'

The information under the `label` attribute is useful for disambiguating between patterns, especially if the same *Matcher* object contains multiple different patterns, as we will see shortly below.
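The steps so far can also be tried out without the Wikipedia article or a trained language model. The following is a minimal sketch, assuming spaCy v3: it builds a tiny *Doc* object by hand, with words and part-of-speech tags invented for illustration, and applies the same `pronoun+verb` pattern to it. The variable names prefixed with `toy_` are hypothetical. The sketch also uses the `label_` attribute (with a trailing underscore), which gives the name of the pattern directly as a string.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a tiny Doc by hand; the words and part-of-speech tags are
# invented for illustration and are not predicted by a language model.
vocab = Vocab()
toy_doc = Doc(vocab,
              words=['It', 'aimed', 'high', 'and', 'they', 'agreed'],
              pos=['PRON', 'VERB', 'ADV', 'CCONJ', 'PRON', 'VERB'])

# The Matcher shares its vocabulary with the Doc it is applied to
toy_matcher = Matcher(vocab)
toy_matcher.add('pronoun+verb', [[{'POS': 'PRON'}, {'POS': 'VERB'}]])

# With as_spans=True, the Matcher returns a list of Span objects
toy_spans = toy_matcher(toy_doc, as_spans=True)

for span in toy_spans:
    # 'label' holds an integer hash; 'label_' holds the same name as a string
    print(span.text, span.start, span.end, span.label, span.label_)

# Without as_spans, the Matcher returns (match_id, start, end) tuples
toy_tuples = toy_matcher(toy_doc)
```

Building the *Doc* by hand keeps the example small and predictable, but for real data the annotations should come from a language model, as in the examples above.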

Looking at the matches above, the pattern we defined is quite restrictive, as the pronoun and the verb must follow each other.

We cannot, for example, match patterns where the verb is preceded by auxiliary verbs.

spaCy allows increasing the flexibility of pattern rules using operators. These operators are defined by adding the key `OP` to the dictionary that defines a pattern for a single *Token*. spaCy supports the following operators:

 - `!`: Negate the pattern; the pattern can occur exactly zero times.
 - `?`: Make the pattern optional; the pattern may occur zero or one times.
 - `+`: Require the pattern to occur one or more times.
 - `*`: Allow the pattern to match zero or more times.

Let's explore the use of operators by defining another pattern rule, which extends the scope of our *Matcher*.

To do so, we define another pattern for a *Token* between the pronoun and the verb. This *Token* must have the coarse part-of-speech tag `AUX`, which indicates an auxiliary verb. 

In addition, we add another key and value pair to the dictionary for this *Token*, which contains the key `OP` with the value `+`. This means that the *Token* corresponding to an auxiliary verb must occur *one or more times*.

We store the resulting list with nested dictionaries under the variable `pronoun_aux_verb`, and add the pattern to the existing *Matcher* object stored under the variable `matcher`.

In [34]:
# Define a list with nested dictionaries that contains the pattern to be matched
pronoun_aux_verb = [{'POS': 'PRON'}, {'POS': 'AUX', 'OP': '+'}, {'POS': 'VERB'}]

# Add the pattern to the matcher under the name 'pronoun+aux+verb'
matcher.add('pronoun+aux+verb', patterns=[pronoun_aux_verb])

# Apply the Matcher to the Doc object under 'doc'; provide the argument 'as_spans'
# and set its value to True to get Spans as output. Store the result under the
# variable 'results'.
results = matcher(doc, as_spans=True)

Just as above, the *Matcher* returns a list of spaCy *Span* objects.

Let's loop over each item in the list `results`. We use the variable `result` to refer to the *Span* objects in the list, which contain our matches.

We first retrieve the integer stored under `result.label`, which we use to fetch the corresponding *Lexeme* object from the language model's *Vocab* under `nlp.vocab`. 

As we learned above, this *Lexeme* corresponds to the name that we gave to the pattern rule, whose human-readable form can be found under the attribute `text`.

We then print a tabulator character to insert some space between the name of the pattern and the *Span* object containing the match.

In [35]:
# Loop over each Span object in the list 'results'
for result in results:
    
    # Print out the name of the pattern rule, a tabulator character, and the matching Span
    print(nlp.vocab[result.label].text, '\t', result)

pronoun+verb 	 It aimed
pronoun+verb 	 It formed
pronoun+verb 	 it organizes
pronoun+verb 	 who designed
pronoun+verb 	 He wrote
pronoun+verb 	 They promoted
pronoun+verb 	 It refers
pronoun+aux+verb 	 they did have
pronoun+verb 	 they saw
pronoun+verb 	 they argued
pronoun+verb 	 they called
pronoun+verb 	 it takes
pronoun+aux+verb 	 they were working
pronoun+aux+verb 	 there had been
pronoun+aux+verb 	 who had lost
pronoun+verb 	 they called
pronoun+aux+verb 	 themselves be informed
pronoun+verb 	 who comment
pronoun+verb 	 them using
pronoun+aux+verb 	 anyone can join
pronoun+aux+verb 	 what is called
pronoun+verb 	 they belong
pronoun+verb 	 himself warned
pronoun+verb 	 he said
pronoun+verb 	 they think
pronoun+aux+verb 	 they will change
pronoun+aux+verb 	 it can help
pronoun+verb 	 them gain
pronoun+verb 	 they wished
pronoun+verb 	 they blamed
pronoun+aux+verb 	 It was organized
pronoun+verb 	 I support
pronoun+verb 	 I saw
pronoun+verb 	 It showed
pronoun+aux+verb 	 It was ren

The output shows that the pattern we added to the *Matcher* matches patterns that contain one (e.g. "we *can* build") or more (e.g. "they *have been* protesting") auxiliaries!
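The other operators work along the same lines. As a minimal sketch, assuming spaCy v3, the example below uses the `?` operator to make the auxiliary optional, so that a single pattern covers sequences both with and without an auxiliary. The *Doc* object is built by hand with invented words and tags, and the names `op_doc`, `op_matcher` and `pronoun+(aux)+verb` are hypothetical.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a small Doc by hand; words and tags are invented for illustration
vocab = Vocab()
op_doc = Doc(vocab,
             words=['They', 'will', 'change', 'it', 'and', 'we', 'agree'],
             pos=['PRON', 'AUX', 'VERB', 'PRON', 'CCONJ', 'PRON', 'VERB'])

# The operator '?' makes the auxiliary optional: it may occur zero or one times
op_matcher = Matcher(vocab)
op_matcher.add('pronoun+(aux)+verb',
               [[{'POS': 'PRON'}, {'POS': 'AUX', 'OP': '?'}, {'POS': 'VERB'}]])

# Collect the text of each match and sort the result alphabetically
op_matches = sorted(span.text for span in op_matcher(op_doc, as_spans=True))

print(op_matches)
```

Note that a single optional auxiliary would not cover sequences with two or more auxiliaries; for those, the operator `*` would be the counterpart of `+` that also allows zero occurrences.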

### Finding morphological features

As introduced in [Part II](../part_ii/03_basic_nlp.ipynb#Morphological-analysis), spaCy can also perform morphological analysis for individual *Tokens*, whose results are stored under the attribute `morph` of a *Token* object.

The `morph` attribute contains a *MorphAnalysis* object. When printed out as a string, each morphological feature is separated by a vertical bar `|`, as illustrated below.

```
We 	 Case=Nom|Number=Plur|Person=1|PronType=Prs
```

As you can see, particular types of morphological features, e.g. *Case*, and their values, e.g. *Nom* (for the nominative case), are separated by equal signs `=`.
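Individual features can also be retrieved without splitting the string by hand. The following is a minimal sketch, assuming spaCy v3, in which the object under `morph` provides the methods `get()` and `to_dict()`. The *Doc* object is built by hand using the features shown above, and the variable names are hypothetical.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a one-Token Doc by hand and set its morphological features manually
morph_doc = Doc(Vocab(), words=['We'],
                morphs=['Case=Nom|Number=Plur|Person=1|PronType=Prs'])

# Get the first (and only) Token
token = morph_doc[0]

# The string form separates features using vertical bars
morph_string = str(token.morph)

# get() returns the value(s) of a single feature as a list
case_values = token.morph.get('Case')

# to_dict() returns all features as a dictionary
feature_dict = token.morph.to_dict()

print(morph_string, case_values, feature_dict)
```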

Let's begin exploring how we can define pattern rules that match morphological features.

To get started, we create a new *Matcher* object named `morph_matcher`.

In [36]:
# Create a Matcher and provide model vocabulary; assign result under the variable 'morph_matcher'
morph_matcher = Matcher(vocab=nlp.vocab)

We then define a new pattern with rules for two *Tokens*:

 1. Tokens that have a fine-grained part-of-speech tag `NNP` (proper noun), which can occur one or more times (operator: `+`)
 2. Tokens that have a coarse part-of-speech tag `VERB` and have precisely the following morphological features (`MORPH`): `Number=Sing|Person=Three|Tense=Pres|VerbForm=Fin`
 
We define the pattern using two dictionaries in a list, which we assign under the variable `propn_3rd_finite`.

In [37]:
# Define a list with nested dictionaries that contains the pattern to be matched
propn_3rd_finite = [{'TAG': 'NNP', 'OP': '+'},
                    {'POS': 'VERB', 'MORPH': 'Number=Sing|Person=Three|Tense=Pres|VerbForm=Fin'}]

We then add the pattern to the newly-created *Matcher* stored under the variable `morph_matcher` using the `add()` method.

We also provide the value `LONGEST` to the optional argument `greedy` for the `add()` method.

The `greedy` argument filters overlapping matches, which arise when a pattern includes operators such as `+` that can match a varying number of *Tokens*.

By setting the value to `LONGEST`, spaCy keeps the longest of the overlapping matches instead of returning every match.

In [38]:
# Add the pattern to the matcher under the name 'sing_3rd_pres_fin'
morph_matcher.add('sing_3rd_pres_fin', patterns=[propn_3rd_finite], greedy='LONGEST')
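To see what the `greedy` argument does, the following minimal sketch, assuming spaCy v3, applies the same pattern with and without `greedy='LONGEST'` to a four-*Token* *Doc* object built by hand. To keep the example short, the pattern omits the morphological features, and the names `greedy_doc`, `all_matcher` and `longest_matcher` are hypothetical.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a small Doc by hand; the tags are set manually for illustration
vocab = Vocab()
greedy_doc = Doc(vocab,
                 words=['Occupy', 'Wall', 'Street', 'uses'],
                 tags=['NNP', 'NNP', 'NNP', 'VBZ'],
                 pos=['PROPN', 'PROPN', 'PROPN', 'VERB'])

# One or more proper nouns followed by a verb
pattern = [{'TAG': 'NNP', 'OP': '+'}, {'POS': 'VERB'}]

# A Matcher without the 'greedy' argument returns every possible match
all_matcher = Matcher(vocab)
all_matcher.add('all', [pattern])

# A Matcher with greedy='LONGEST' keeps only the longest overlapping match
longest_matcher = Matcher(vocab)
longest_matcher.add('longest', [pattern], greedy='LONGEST')

all_texts = sorted(span.text for span in all_matcher(greedy_doc, as_spans=True))
longest_texts = [span.text for span in longest_matcher(greedy_doc, as_spans=True)]

print(all_texts)
print(longest_texts)
```

Without filtering, the shorter sequences "Street uses" and "Wall Street uses" are returned alongside the full name, which is rarely what we want when searching for names.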

We then apply the *Matcher* to the data stored under the variable `doc`.

In [39]:
# Apply the Matcher to the Doc object under 'doc'; provide the argument 'as_spans'
# and set its value to True to get Spans as output. Store the result under the
# variable 'morph_results'.
morph_results = morph_matcher(doc, as_spans=True)

# Loop over each Span object in the list 'morph_results'
for result in morph_results:

    # Print result
    print(result)

Occupy Wall Street uses
Rolling Jubilee claims
Noam Chomsky argues
Rolling Jubilee reports
Information Act requests
Jodi Dean argues


As you can see, the matches are relatively few in number, because we defined that the verb should have quite specific morphological features.

The question is, then, how can we match just *some* morphological features?

To loosen the criteria for morphological features by focusing on [tense](https://en.wikipedia.org/wiki/Grammatical_tense) only, we need to use a dictionary with the key `MORPH`, but instead of a string object, we provide a dictionary as the value.

For this dictionary, we use the string `IS_SUPERSET` as the key. `IS_SUPERSET` is one of the attributes defined in the spaCy [pattern format](https://spacy.io/api/matcher#patterns).

Before proceeding any further, let's unpack the logic behind `IS_SUPERSET` a bit: 

We can think of morphological features associated with a given Token as a [set](https://en.wikipedia.org/wiki/Set_(mathematics)). To exemplify, a set could consist of the following four items:

```
Number=Sing, Person=Three, Tense=Pres, VerbForm=Fin
```

If we had *another set* with just one item, `Tense=Pres`, we could describe the relationship between the two sets by stating that the first set (with four items) is a superset of the second set (with one item).

In other words, the larger (super)set contains the smaller (sub)set.

This is also how matching using `IS_SUPERSET` works: spaCy retrieves the morphological features for a given *Token*, and examines whether these features are a superset of the features defined in the search pattern.

The morphological features to be searched for are provided as a list of Python strings.

These strings, in turn, define particular morphological features, e.g. `Tense=Past`, as defined in the [Universal Dependencies](https://universaldependencies.org/u/overview/morphology.html) schema for describing morphology.

This list is then used as the value for the key `IS_SUPERSET`.
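The superset relation itself can be illustrated using Python's built-in sets. The following sketch only mirrors the logic described above and is not how spaCy implements the comparison internally; the variable names are hypothetical.

```python
# Split a string of morphological features into a set of individual features
token_features = set('Number=Sing|Person=Three|Tense=Pres|VerbForm=Fin'.split('|'))

# The features that we want to search for
required = {'Tense=Pres'}

# The Token's features are a superset of the required features: True
is_match = token_features.issuperset(required)

# A Token in the present tense does not contain the feature Tense=Past: False
is_past = token_features.issuperset({'Tense=Past'})

print(is_match, is_past)
```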

Let's now proceed to search for verbs in the past tense and add them to the *Matcher* object under `morph_matcher`.

In [40]:
# Define a list with nested dictionaries that contains the pattern to be matched
past_tense = [{'TAG': 'NNP', 'OP': '+'},
              {'POS': 'VERB', 'MORPH': {'IS_SUPERSET': ['Tense=Past']}}]

# Add the pattern to the matcher under the name 'past_tense'
morph_matcher.add('past_tense', patterns=[past_tense], greedy='LONGEST')

# Apply the Matcher to the Doc object under 'doc'; provide the argument 'as_spans'
# and set its value to True to get Spans as output. Overwrite previous matches by
# storing the result under the variable 'morph_results'.
morph_results = morph_matcher(doc, as_spans=True)

Let's loop over the results and print out the name of the pattern, the *Span* object containing the match, and the morphological features of the final *Token* in the match, which corresponds to the verb.

In [41]:
# Loop over each Span object in the list 'morph_results'
for result in morph_results:
    
    # Print out the name of the pattern rule, a tabulator character, and the matching Span.
    # Finally, print another tabulator character, followed by the morphological features of the
    # last Token in the match (a verb).
    print(nlp.vocab[result.label].text, '\t', result, '\t', result[-1].morph)

past_tense 	 Community Environmental Legal Defense Fund released 	 Tense=Past|VerbForm=Fin
past_tense 	 Oakland Police Chief Howard Jordan expressed 	 Tense=Past|VerbForm=Fin
past_tense 	 U.S. Vice President Al Gore called 	 Tense=Past|VerbForm=Fin
past_tense 	 Los Angeles City Council became 	 Tense=Past|VerbForm=Fin
past_tense 	 Judge Jed S. Rakoff sided 	 Tense=Past|VerbForm=Fin
past_tense 	 Finance Minister Jim Flaherty expressed 	 Tense=Past|VerbForm=Fin
past_tense 	 Prime Minister Manmohan Singh described 	 Tense=Past|VerbForm=Fin
past_tense 	 Supreme Leader Ayatollah Khamenei voiced 	 Tense=Past|VerbForm=Fin
past_tense 	 Prime Minister Gordon Brown said 	 Tense=Past|VerbForm=Fin
past_tense 	 Anti-Defamation League stated 	 Tense=Past|VerbForm=Fin
past_tense 	 Occupy Wall Street endorsed 	 Tense=Past|VerbForm=Fin
past_tense 	 New York Times reported 	 Tense=Past|VerbForm=Fin
past_tense 	 Occupy Wall Street said 	 Tense=Past|VerbForm=Fin
past_tense 	 Lieutenant John Pike used 	 Te

As you can see, the `past_tense` pattern can match objects based on a single morphological feature, although most matches share another morphological feature, namely a finite form of the verb. 
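The list under `IS_SUPERSET` can also contain several features, in which case a *Token* must have all of them. The following is a minimal sketch, assuming spaCy v3, which compares a pattern that requires `Tense=Past` with one that requires both `Tense=Past` and `VerbForm=Fin`. The *Doc* object is built by hand with invented words, tags and features, and the names `tense_doc`, `loose_matcher`, `strict_matcher` and `past_finite` are hypothetical.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a small Doc by hand: 'said' is a finite verb in the past tense,
# whereas 'occupied' is annotated as a past participle.
vocab = Vocab()
tense_doc = Doc(vocab,
                words=['Brown', 'said', 'Street', 'occupied'],
                tags=['NNP', 'VBD', 'NNP', 'VBN'],
                pos=['PROPN', 'VERB', 'PROPN', 'VERB'],
                morphs=['Number=Sing', 'Tense=Past|VerbForm=Fin',
                        'Number=Sing', 'Aspect=Perf|Tense=Past|VerbForm=Part'])

# Require the past tense only
loose = [{'TAG': 'NNP', 'OP': '+'},
         {'POS': 'VERB', 'MORPH': {'IS_SUPERSET': ['Tense=Past']}}]

# Require both the past tense and a finite verb form
strict = [{'TAG': 'NNP', 'OP': '+'},
          {'POS': 'VERB', 'MORPH': {'IS_SUPERSET': ['Tense=Past', 'VerbForm=Fin']}}]

loose_matcher = Matcher(vocab)
loose_matcher.add('past_tense', [loose], greedy='LONGEST')

strict_matcher = Matcher(vocab)
strict_matcher.add('past_finite', [strict], greedy='LONGEST')

loose_texts = sorted(span.text for span in loose_matcher(tense_doc, as_spans=True))
strict_texts = sorted(span.text for span in strict_matcher(tense_doc, as_spans=True))

print(loose_texts)
print(strict_texts)
```

Adding features to the list narrows the search step by step, which sits between matching a single feature and matching the full set of features as a string.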

## Examining matches in context using concordances

We can examine matches in their context of occurrence using *concordances*. In corpus linguistics, concordances are often understood as lines of text that show a match in its context of occurrence.

These concordance lines can help the analyst to understand the context in which a particular token or structure occurs, and to develop further hypotheses.

To create concordance lines using spaCy, let's start by importing the Printer class from wasabi, which is a small [Python library](https://pypi.org/project/wasabi/) that spaCy uses for colouring and formatting messages. We will use wasabi to highlight the matches in the concordance lines.

We first initialise a *Printer* object, which we then assign under the variable `match`. Next, we test the *Printer* object by printing some text in red colour.

In [50]:
# Import the Printer class from wasabi
from wasabi import Printer

# Initialise a Printer object; assign the object under the variable 'match'
match = Printer()

# Use the Printer to print out some text in red colour
match.text("Hello world!", color="red")

Hello world!


We then proceed to loop over the results returned by the *Matcher* object `morph_matcher`. As we learned above, the results consist of *Span* objects in a list, which are stored under the variable `morph_results`.

We loop over items in this list and use the `enumerate()` function to keep track of their count. We also provide the argument `start` with the value 1 to the `enumerate()` function to start counting from the number 1.

During the loop, we refer to this count using the variable `i` and to the *Span* object as `result`. The number under `i` is incremented with every *Span* object.

We then print out the following output for each *Span* object in the list `morph_results`:

 1. `i`: The number of the item in the list.
 2. `doc[result.start - 7: result.start]`: A slice of the *Doc* object stored under the variable `doc`, which we searched for matches. As usual, we define a slice using brackets and separate the start and end of a slice using a colon. We take a slice that begins 7 *Tokens* before the start of the match (`result.start - 7`), and terminates at the start of the match `result.start`.
 3. `match.text(result, color="red", no_print=True)`: The matching *Span* object, rendered using the wasabi *Printer* object `match` in red colour. We also set the argument `no_print` to `True` to prevent wasabi from printing the output on a new line.
 4. `doc[result.end: result.end + 7]`: Another slice of the *Doc* object stored under the variable `doc`. Here we take a slice that begins at the end of the match `result.end` and terminates 7 *Tokens* after the end of the match (`result.end + 7`).
 
Essentially, we use the indices available under `start` and `end` attributes of each *Span* to retrieve the linguistic context in which the *Span* occurs.  

In [25]:
# Loop over the matches in 'morph_results' and keep count of items
for i, result in enumerate(morph_results, start=1):
    
    # Print following information for each match
    print(i,   # Item number being looped over
          doc[result.start - 7: result.start],   # The slice of the Doc preceding the match
          match.text(result, color="red", no_print=True),   # The match, rendered in red colour using wasabi
          doc[result.end: result.end + 7])    # The slice of the Doc following the match

1 unity among the "99%".The Community Environmental Legal Defense Fund released a model community bill of rights,
2 raid was chaotic and violent, but Oakland Police Chief Howard Jordan expressed his pleasure concerning the operation because neither
3 process.In March 2012, former U.S. Vice President Al Gore called on activists to "occupy democracy"
4 few demands. On 12 October 2011 Los Angeles City Council became one of the first governmental bodies in
5 by bullhorn, after reviewing it, Judge Jed S. Rakoff sided with plaintiffs, saying, "a
6 other countries."
Canada— Finance Minister Jim Flaherty expressed sympathy with the protests, stating "
7 of that."
8 of governance".
Iran— Supreme Leader Ayatollah Khamenei voiced his support for the Occupy Movement saying
9 —On 21 October 2011, former Prime Minister Gordon Brown said the protests were about fairness. "
10 Abraham Foxman, nation

This returns a set of concordance lines showing the matches in their context of occurrence.

In some cases, the preceding or following *Tokens* consist of line breaks indicating a paragraph break, which causes the output to jump a row or two.
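One edge case is worth keeping in mind: if a match starts fewer than 7 *Tokens* from the beginning of the text, `result.start - 7` becomes a negative number, which Python interprets as counting from the end of the sequence. The following minimal sketch illustrates the problem using a plain Python list, assuming that slices of a *Doc* object resolve negative indices in the same way. The function `concordance_window` and the example data are hypothetical.

```python
def concordance_window(tokens, start, end, width=7):
    """Return the left context, the match and the right context as slices."""
    # Use max() to prevent the start of the left context from going below zero
    left = tokens[max(0, start - width): start]

    # Slices that run past the end of a sequence are truncated automatically
    right = tokens[end: end + width]

    return left, tokens[start:end], right


# A short sequence of tokens, invented for illustration
tokens = ['Protesters', 'said', 'they', 'saw', 'it', '.']

# Without clamping, the negative index -5 counts from the end of the list,
# which drops 'Protesters' from the left context
naive_left = tokens[2 - 7: 2]

# With clamping, the left context covers everything before the match
left, match, right = concordance_window(tokens, start=2, end=4)

print(naive_left, left, match, right)
```

The same `max()` call could be placed inside the slice `doc[result.start - 7: result.start]` in the loop above.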