# Finding linguistic patterns

This section introduces you to finding linguistic patterns using spaCy.

If you are unfamiliar with the linguistic annotations produced by spaCy or need to refresh your memory, revisit [Part II](../part_ii/03_basic_nlp.ipynb) before working through this section.

After reading this section, you should:

 - 1
 - 2
 - 3

## Finding patterns using spaCy Matchers

spaCy provides three types of Matchers:

1. A [Matcher](https://spacy.io/api/matcher), which allows defining rules that search for particular **words or phrases** by examining *Token* attributes.  
2. A [DependencyMatcher](https://spacy.io/api/dependencymatcher), which allows searching parse trees for **syntactic patterns**.
3. A [PhraseMatcher](https://spacy.io/api/phrasematcher), which efficiently matches long lists of phrases, given as *Doc* objects, against a *Doc* object.

### Using the Matcher

To get started with the *Matcher*, let's import the spaCy library and load a small language model for English.

In [2]:
# Import the spaCy library into Python
import spacy

# Load a small language model for English; assign the result under 'nlp'
nlp = spacy.load('en_core_web_sm')

To give us some data to work with, let's load some text extracted from a Wikipedia article and process it using the language model under the variable `nlp`.

In [3]:
# Open the file 'occupy.txt' and use the read() method to read the contents.
# The 'with' statement closes the file once the contents have been read.
with open(file='data/occupy.txt', mode='r', encoding='utf-8') as f:
    text = f.read()

# Feed the result to the language model under 'nlp'.
doc = nlp(text)

# Check the length of the Doc object, that is, how many Tokens are contained within.
len(doc)

14867

Now that we have a *Doc* with nearly 15 000 *Tokens*, we can go on to import the *Matcher* class from the `matcher` submodule of spaCy.

In [4]:
# Import the Matcher class
from spacy.matcher import Matcher

Importing the *Matcher* class allows creating *Matcher* objects, which must be initialised by providing the vocabulary object of the language model that will be used for finding matches.

This vocabulary is stored in a [*Vocab*](https://spacy.io/api/vocab) object, which is available under the attribute `vocab` of a *Language* object.

In [5]:
# Create a Matcher and provide model vocabulary; assign result under the variable 'matcher'
matcher = Matcher(nlp.vocab)

# Call the variable to examine the object
matcher

<spacy.matcher.matcher.Matcher at 0x161deca40>

This creates a *Matcher* object, which stores the patterns to be searched for.

The patterns to be matched must follow a [specific format](https://spacy.io/api/matcher#patterns) defined by spaCy.

Each pattern is a Python list of dictionaries, in which each dictionary describes a single *Token* by pairing attribute names with the values to match.

If you wish to match a sequence of *Tokens*, the dictionaries must appear in the list in the same order as the *Tokens* appear in the text.
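To make the format more concrete, the following is a minimal sketch of the idea in plain Python. It is a conceptual illustration only and does not reflect how spaCy implements matching: the *Tokens* are modelled as hand-written dictionaries, and the names `tokens`, `pattern` and `find_matches` are hypothetical.

```python
# Conceptual sketch only: this is NOT how spaCy implements matching.
# Tokens are modelled as plain dictionaries of hand-written attribute values.
tokens = [
    {"TEXT": "They", "POS": "PRON"},
    {"TEXT": "called", "POS": "VERB"},
    {"TEXT": "the", "POS": "DET"},
    {"TEXT": "police", "POS": "NOUN"},
    {"TEXT": "when", "POS": "SCONJ"},
    {"TEXT": "it", "POS": "PRON"},
    {"TEXT": "started", "POS": "VERB"},
]

# One dictionary per Token, in the order in which the Tokens must appear
pattern = [{"POS": "PRON"}, {"POS": "VERB"}]


def find_matches(tokens, pattern):
    """Return (start, end) index pairs where every dictionary in the
    pattern agrees with the Token at the corresponding position."""
    matches = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token.get(key) == value
               for token, spec in zip(window, pattern)
               for key, value in spec.items()):
            matches.append((start, start + len(pattern)))
    return matches


toy_matches = find_matches(tokens, pattern)

# Join the texts of the matching Tokens for inspection
matched_texts = [" ".join(t["TEXT"] for t in tokens[s:e]) for s, e in toy_matches]
print(matched_texts)
```

Because the two dictionaries are checked against adjacent positions, swapping their order in `pattern` would look for a verb followed by a pronoun instead.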

In [6]:
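# Define a pattern for a pronoun immediately followed by a verb.
# Each dictionary matches one Token by its coarse part-of-speech tag.
pattern_1 = [{"POS": "PRON"},
             {"POS": "VERB"}]

# Add the pattern to the Matcher under the name 'PRON+VERB'
matcher.add("PRON+VERB", [pattern_1])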
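The `add()` method takes a name for the rule and a list of patterns. As a quick way of checking what a *Matcher* currently holds, the sketch below counts the stored rules before and after adding one. It assumes a blank English pipeline created with `spacy.blank("en")` instead of `en_core_web_sm`, so that it runs without a downloaded model; the names `blank_nlp`, `demo_matcher` and the rule name `HELLO` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because only
# the shared vocabulary is needed to create a Matcher.
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)

# A newly created Matcher holds no rules
rules_before = len(demo_matcher)

# Add one rule that matches the word 'hello' regardless of case
demo_matcher.add("HELLO", [[{"LOWER": "hello"}]])

# The rule is now stored and can be looked up by its name
rules_after = len(demo_matcher)
has_rule = "HELLO" in demo_matcher

print(rules_before, rules_after, has_rule)
```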

In [7]:
# Apply the Matcher to the Doc; this returns a list of (match_id, start, end) tuples
matches_1 = matcher(doc)

In [8]:
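# Loop over the matches; each match is a tuple of (match_id, start, end)
for match_id, start, end in matches_1:

    # Slice the Doc using the start and end indices to print the matching Span
    print(doc[start:end])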

It aimed
It formed
We are
it organizes
who designed
He wrote
They promoted
It refers
they saw
they argued
they called
it takes
they called
who comment
them using
they belong
himself warned
he said
they think
them gain
they wished
they blamed
I support
It showed
who gave
they refused
they saw
who caused
We are
who sought
who were
who were
who made
who criticized
it returned
its proposed
They received
there have
it came
it gained
He claimed
they presented
they call
It consists
there were
there was
What started
it is
We are
they began
they perceived
they say
We agree
we see
it's
who are
who say
what's
they do
they reflect
He mentioned
We regard
who participated
he wrote
we have
who dislike
they employ
they have
there is
it stall
who emerged
It pushes
who called
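Each match also carries the identifier of the rule that produced it, stored as a hash value. The sketch below shows one way of turning that hash back into the rule name through the string store of the shared vocabulary. It assumes a blank English pipeline, which has a tokenizer but no part-of-speech tagger, so the pattern uses the lexical attributes `LOWER` and `IS_ALPHA` rather than `POS`; the example sentence and the names `blank_nlp`, `toy_doc` and `toy_matcher` are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) stands in for
# 'en_core_web_sm' so that the sketch runs without downloading a model.
blank_nlp = spacy.blank("en")
toy_doc = blank_nlp("They called it a protest, and they saw it grow.")

# LOWER and IS_ALPHA are lexical attributes, so no tagger is needed
toy_matcher = Matcher(blank_nlp.vocab)
toy_matcher.add("THEY+WORD", [[{"LOWER": "they"}, {"IS_ALPHA": True}]])

results = []

for match_id, start, end in toy_matcher(toy_doc):

    # Look up the hash in the string store to recover the rule name
    rule_name = blank_nlp.vocab.strings[match_id]

    # Collect the rule name, the Token indices and the matching text
    results.append((rule_name, start, end, toy_doc[start:end].text))

print(results)
```

Recovering the rule name becomes useful once several rules are stored in the same *Matcher*, because the name tells you which rule produced a given match.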


## Building your own concordancer