# Finding linguistic patterns

This section introduces you to finding linguistic patterns using spaCy.

If you are unfamiliar with the linguistic annotations produced by spaCy or need to refresh your memory, revisit [Part II](../part_ii/03_basic_nlp.ipynb) before working through this section.

After reading this section, you should:

 - know how to use spaCy Matchers to search for linguistic patterns

## Finding patterns using spaCy Matchers

spaCy provides three types of Matchers:

1. A [Matcher](https://spacy.io/api/matcher), which allows defining rules that search for particular **words or phrases** by examining *Token* attributes.  
2. A [DependencyMatcher](https://spacy.io/api/dependencymatcher), which allows searching parse trees for **syntactic patterns**.
3. A [PhraseMatcher](https://spacy.io/api/phrasematcher), a fast method for matching long lists of terms, each defined as a spaCy *Doc* object, against a *Doc* object.

### Using the Matcher to find words or phrases

To get started with the *Matcher*, let's import the spaCy library and load a small language model for English.

In [None]:
# Import the spaCy library into Python
import spacy

# Load a small language model for English; assign the result under 'nlp'
nlp = spacy.load('en_core_web_sm')

To have some data to work with, let's load text from a Wikipedia article, as instructed in [Part II](../part_ii/01_basic_text_processing.ipynb#Loading-plain-text-files-into-Python).

First, we use the `open()` function to open the file for reading. 

We then call the `read()` method to read the file contents, and store the result under the variable `text`.

In [None]:
# Use the open() function to open the file for reading, followed by the
# read() method to read the contents of the file.
text = open(file='data/occupy.txt', mode='r', encoding='utf-8').read()

This gives us a Python string object that contains the article.
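Note that the file opened above is never explicitly closed. A common alternative is to open the file using the `with` statement, which closes the file automatically. The following is a minimal sketch that runs on its own: the temporary file and the names `sample_path`, `text_file` and `sample_text` are stand-ins made up for illustration, not part of the data used in this section.

```python
import tempfile
from pathlib import Path

# Create a temporary directory and a small stand-in file, so that the
# sketch does not depend on 'data/occupy.txt' being available.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("This is a stand-in for the article text.", encoding="utf-8")

    # The 'with' statement closes the file as soon as the indented block ends
    with open(file=sample_path, mode="r", encoding="utf-8") as text_file:
        sample_text = text_file.read()

# The result is an ordinary Python string, and the file has been closed
print(type(sample_text), text_file.closed)
```

The result is the same kind of string object as before, so the rest of the workflow does not change.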

Next, we feed the text to the language model under the variable `nlp` as instructed in [Part II](../part_ii/03_basic_nlp.ipynb#Performing-basic-NLP-tasks-using-spaCy).

In [None]:
# Feed the string object to the language model under 'nlp'
doc = nlp(text)

# Use the len() function to check the length of the Doc object, that is,
# how many Tokens it contains.
len(doc)

Now that we have a *Doc* with nearly 15 000 *Tokens*, we can move on to importing the *Matcher* class from the `matcher` submodule of spaCy.

In [None]:
# Import the Matcher class
from spacy.matcher import Matcher

Importing the *Matcher* class allows creating *Matcher* objects.

When creating a *Matcher* object, you must provide it with the vocabulary of the language model that will be used for finding matches.

The model vocabulary is stored in a [*Vocab*](https://spacy.io/api/vocab) object. The *Vocab* object is available under the attribute `vocab` of a spaCy *Language* object, which was discussed in [Part II](../part_ii/03_basic_nlp.ipynb#Performing-basic-NLP-tasks-using-spaCy).

In this case, we have the *Language* object stored under the variable `nlp`, which means we can access the *Vocab* object by calling `nlp.vocab`.

We then call the *Matcher* **class** and provide the vocabulary under `nlp.vocab` to the `vocab` argument to create a *Matcher* object. We store the resulting object under the variable `matcher`.

In [None]:
# Create a Matcher and provide model vocabulary; assign result under the variable 'matcher'
matcher = Matcher(vocab=nlp.vocab)

# Call the variable to examine the object
matcher

The *Matcher* object is now ready to store the patterns to be searched for.

These patterns are created using a [specific format](https://spacy.io/api/matcher#patterns) defined in spaCy.

Each pattern consists of a Python list, which is populated by dictionaries. Each dictionary describes the pattern for matching a single *Token*. If you wish to match a sequence of *Tokens*, you must define multiple dictionaries that follow the order of the pattern.

Let's start by defining a simple pattern, which we store under the variable `pattern_1`.

This pattern consists of a list, as marked by the surrounding brackets `[]`, which contains two dictionaries, marked by curly braces `{}` and separated by a comma. As usual, the key and value pairs in each dictionary are separated by a colon:

 - The dictionary key determines which *Token* attribute should be searched for matches. The attributes supported by the *Matcher* can be found [here](https://spacy.io/api/matcher#patterns).

 - The value under the dictionary key determines the specific value for the attribute.

In this case, we define a pattern that searches for a sequence of two *Tokens* based on their coarse part-of-speech tags (`POS`), which were introduced in [Part II](../part_ii/03_basic_nlp.ipynb#Part-of-speech-tagging): a pronoun (`PRON`) followed by a verb (`VERB`).

In [None]:
# Define a list with nested dictionaries that contains the pattern
pattern_1 = [{"POS": "PRON"}, {"POS": "VERB"}]
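Patterns are not limited to part-of-speech tags. As a rough sketch, the two patterns below use other *Token* attributes listed in the spaCy documentation for the *Matcher* (`LOWER`, `LEMMA` and `IS_PUNCT`). The patterns themselves and the variable names are hypothetical examples invented for illustration; the loop simply prints what each dictionary asks of the *Token* at each position.

```python
# Hypothetical pattern: the lowercase form 'occupy' followed by a proper noun
pattern_lower = [{"LOWER": "occupy"}, {"POS": "PROPN"}]

# Hypothetical pattern: any form of the lemma 'protest' followed by punctuation
pattern_lemma = [{"LEMMA": "protest"}, {"IS_PUNCT": True}]

# Each dictionary describes one Token; its position in the list determines
# the position of the Token in the matched sequence.
for pattern in (pattern_lower, pattern_lemma):
    for position, token_spec in enumerate(pattern):
        for attribute, value in token_spec.items():
            print(position, attribute, value)
```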

Now that we have defined the pattern using a list and dictionaries, we can add it to the *Matcher* object under the variable `matcher`.

This can be achieved using the `add()` method, which requires two inputs:

 1. A Python string object that defines a name for the pattern.
 2. A list containing the pattern(s) to be searched for. A single rule for matching patterns can contain multiple patterns, hence the input must be a *list of lists*, e.g. `[pattern_1]`.

In [None]:
# Add the pattern to the matcher
matcher.add("pronoun+verb", patterns=[pattern_1])
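To illustrate why the input is a list of lists, the sketch below stores two alternative patterns under a single rule. It is self-contained and rests on a few assumptions that are not part of this section: it uses a blank English pipeline created with `spacy.blank()` instead of the trained model, and the names `nlp_blank`, `demo_matcher`, `pattern_a`, `pattern_b` and the rule name `subject+verb` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline provides a vocabulary without a trained model
nlp_blank = spacy.blank("en")

# Create a separate Matcher for this sketch
demo_matcher = Matcher(vocab=nlp_blank.vocab)

# Two alternative patterns: a pronoun or a proper noun, followed by a verb
pattern_a = [{"POS": "PRON"}, {"POS": "VERB"}]
pattern_b = [{"POS": "PROPN"}, {"POS": "VERB"}]

# Both patterns are stored under the same rule name
demo_matcher.add("subject+verb", patterns=[pattern_a, pattern_b])

# The Matcher holds a single rule, which can be looked up by its name
print(len(demo_matcher))
print("subject+verb" in demo_matcher)
```

Any sequence that satisfies either pattern is then reported under the same rule name.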

To search for matches in the *Doc* object stored under the variable `doc`, we feed the *Doc* object to the *Matcher* and store the result under `matches_1`.

In [None]:
# Apply the Matcher to the Doc object under 'doc'
matches_1 = matcher(doc)

# Call the variable to examine the output
matches_1

The output is a list that contains *tuples* with three items. 

The first item is a long integer. This is a hash value that identifies a spaCy [*Lexeme*](https://spacy.io/api/lexeme) object, which corresponds to an entry in the language model's vocabulary. This entry contains the name that we gave to the search pattern above.

We can easily verify this by fetching this *Lexeme* from the *Vocab* object under `nlp.vocab` and examining its `text` attribute.

In [None]:
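# Use the hash value from the first item of a match tuple to fetch the
# Lexeme from the Vocab; the 'text' attribute contains the pattern name
nlp.vocab[12298179334642351811].text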
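The pattern name can also be looked up in the string store under the `strings` attribute of the *Vocab* object, which maps hash values back to strings. The sketch below is self-contained and makes assumptions that are not part of this section: it uses a blank English pipeline and a hand-built *Doc* with manually assigned part-of-speech tags, so that no trained model is needed. The names `nlp_blank`, `toy_doc` and `toy_matcher` are illustrative only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank English pipeline provides a vocabulary without a trained model
nlp_blank = spacy.blank("en")

# Build a Doc by hand; the part-of-speech tags are assigned manually
toy_doc = Doc(nlp_blank.vocab,
              words=["They", "protested", "peacefully"],
              pos=["PRON", "VERB", "ADV"])

# Create a Matcher and add the same kind of pattern as above
toy_matcher = Matcher(vocab=nlp_blank.vocab)
toy_matcher.add("pronoun+verb", patterns=[[{"POS": "PRON"}, {"POS": "VERB"}]])

# Each match is a tuple of three integers: a hash value and two indices
toy_matches = toy_matcher(toy_doc)

for match_id, start_ix, end_ix in toy_matches:
    # Look up the hash value in the string store to recover the pattern name
    print(nlp_blank.vocab.strings[match_id], start_ix, end_ix)
```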

This information is mainly useful for disambiguating between matches if the same *Matcher* object contains multiple different patterns.

The next two items in the three-tuple are *Token* indices that mark where the match starts and ends in the *Doc* object. As with Python slices, the *Token* at the end index is not included in the match.

To inspect the matches, we must retrieve them from the *Doc* object.

In [None]:
# Loop over the list of matches, assigning the three items in the tuple to variables
# 'pattern_name', 'start_ix' and 'end_ix'.
for pattern_name, start_ix, end_ix in matches_1:
    
    # Use the brackets and a colon to access a slice of the Doc object under the 
    # variable 'doc'. The 'start_ix' and 'end_ix' variables determine where the
    # slice starts and ends.
    print(doc[start_ix: end_ix])
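In spaCy version 3, the *Matcher* can also return the matches directly as *Span* objects if the `as_spans` argument is set to `True`, which removes the need to slice the *Doc* object manually. The following sketch assumes spaCy v3 and, to stay self-contained, uses a blank English pipeline and a hand-built *Doc* with manually assigned part-of-speech tags instead of the Wikipedia article. The names `nlp_blank`, `toy_doc`, `toy_matcher` and `toy_spans` are illustrative only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank English pipeline provides a vocabulary without a trained model
nlp_blank = spacy.blank("en")

# Build a Doc by hand; the part-of-speech tags are assigned manually
words = ["They", "occupied", "the", "park", "while", "we", "watched"]
pos = ["PRON", "VERB", "DET", "NOUN", "SCONJ", "PRON", "VERB"]
toy_doc = Doc(nlp_blank.vocab, words=words, pos=pos)

# Create a Matcher and add a pattern for a pronoun followed by a verb
toy_matcher = Matcher(vocab=nlp_blank.vocab)
toy_matcher.add("pronoun+verb", patterns=[[{"POS": "PRON"}, {"POS": "VERB"}]])

# With as_spans=True, each match is a Span labelled with the pattern name
toy_spans = toy_matcher(toy_doc, as_spans=True)

for span in toy_spans:
    print(span.label_, span.start, span.end, span.text)
```

Each *Span* carries the pattern name under its `label_` attribute, which makes it easy to tell matches apart when several rules are in use.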