# Proof-of-Concept 4: Create Rules for Detecting Syntax Patterns

## How to use this PoC:
After you run it, you may have to scroll back up to the top.

To run it: in the drop-down menu, click **Kernel --> Restart & Run All --> Restart and Run All Cells**

    or

To run it: in the icon toolbar, click **the Fast-Forward button --> Restart and Run All Cells**.

## Attribution:
**Author**: Steven Kyle Crawford

Special thanks to the spaCy team, the NLTK team, and numerous authors.

## Description:
This notebook illustrates creating custom spaCy rules for somewhat accurately detecting grammatical and syntactic patterns.

This notebook demonstrates:
* detecting simple past tense
* comparing simple past and present perfect sentences
* analyzing sentences from authentic news articles, science publications, and editorials in NLTK's Brown and ABC corpora

## Helpful links:
* [spaCy linguistic features glossary](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py#L20)

## Procedure:

### Step 0) Install the dependencies
* spaCy and its dependencies
* NLTK and its dependencies
* tabulate for pretty tables

In [1]:
# # Run this only once to avoid unnecessary redownloading
# # To enable or disable, <Ctrl> + a then <Ctrl> + /

# !pip install -U spacy
# !pip install -U spacy-lookups-data
# !python -m spacy download en_core_web_sm
# !pip install -U nltk
# !python -m nltk.downloader all-corpora # This will install only the corpora (no grammars or trained models)
# !pip install -U tabulate

### Step 1) Decide on a pattern to detect
Example pattern used below: **simple past verb tense**.

Some patterns are easier to detect reliably than others. Passive voice can be formed with most transitive verbs, so a rule for it does not depend on which verb appears. Reported speech, by contrast, relies on a limited set of reporting verbs (say, tell, report, etc.), so a rule for detecting it must account for each of those verbs.
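As a rough illustration of what accounting for each reporting verb could look like, the sketch below lists a few verbs once and lets a single token pattern match any of them by lemma. The names `REPORTING_VERB_LEMMAS` and `reported_speech_rule` are hypothetical, the verb list is deliberately incomplete, and matching a reporting verb at the ROOT is only a first approximation of reported speech.

```python
# Hypothetical, incomplete list of reporting-verb lemmas; extend as needed.
REPORTING_VERB_LEMMAS = ["say", "tell", "report", "claim", "announce"]

# One dict describes one token. spaCy's Matcher supports the "IN" operator,
# so this token matches when its lemma is any member of the list.
reported_speech_rule = [
    {"LEMMA": {"IN": REPORTING_VERB_LEMMAS}, "DEP": "ROOT"},
]

print(reported_speech_rule)
```

Matching on the lemma rather than the surface form means inflected forms such as "said" or "tells" should be covered without listing each one, assuming the pipeline's lemmatizer handles them. A rule like this can be registered with a spaCy `Matcher` like any other token pattern.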

In [2]:
import spacy


nlp = spacy.load("en_core_web_sm")

### Step 2) Prepare to print a token table
Borrowed from Proof of Concept 3.

In [3]:
from tabulate import tabulate


def print_token_table(sentence, pos=False, tag=True, dependency=True, lemma=False):
    """Pretty-print the linguistic features of each word in a sentence.
    If pos is True, then print the part-of-speech (POS). Defaults to False.
    If tag is True, then print the tag. Defaults to True.
    If dependency is True, then print the dependencies. Defaults to True.
    If lemma is True, then print the lemma. Defaults to False.

    Given a string, return None.
    Depends on tabulate.
    """

    # Print the sentence
    print(sentence + "\n")

    # Create the table headers
    headers = []
    headers.append("Word")
    if pos:
        headers.append("POS")
        headers.append("POS Definition")
    if tag:
        headers.append("Tag")
        headers.append("Tag Definition")
    if dependency:
        headers.append("Dep.")
        headers.append("Dep. Definition")
    if lemma:
        headers.append("Lemma")

    # Create the table data
    doc = nlp(sentence)
    data = []
    for word in doc:
        entry = []
        entry.append(word.text)
        if pos:
            entry.append(word.pos_)
            entry.append(spacy.explain(word.pos_))
        if tag:
            entry.append(word.tag_)
            entry.append(spacy.explain(word.tag_))
        if dependency:
            entry.append(word.dep_)
            entry.append(spacy.explain(word.dep_))
        if lemma:
            entry.append(word.lemma_)
        data.append(entry)

    # Print the table
    print(tabulate(data, headers=headers, tablefmt="github") + "\n\n")

### Step 3) Create example sentences using the pattern

In [4]:
# Source: https://englishstudyhere.com/tenses/20-sentences-in-simple-past-tense/
simple_past_sentences = [
    "Two boys played with a ball.",
    "An old lady walked with her cat.",
    "A nurse brought a little baby girl to the park.",
    "An old man sat down and read his book.",
    "A large truck came around the corner.",
]

# Source: https://englishstudyhere.com/grammar/100-sentences-of-present-perfect-tense-examples-of-present-perfect-tense/
present_perfect_sentences = [
    "My sister has made a big cake.",
    "You have grown since the last time I saw you.",
    "It hasn't drunk the water.",
    "I have seen that movie.",
    "We haven't received any mail since we were retired.",
]

### Step 3.5) If necessary, format the sentence to a single string
The NLTK corpora's Gutenberg books don't require this, but Brown does.

In [5]:
def convert_word_list_to_sentence(word_list):
    """Join a word list into a single sentence string.
    A word list is a tokenized sentence, such as:
        ['This', 'is', 'a', 'word', 'list', '.']
    The nlp pipeline used for rule-based matching here takes a single string.
    TODO: fix punctuation improperly surrounded by whitespace

    Given a list of strings, return a string.
    """

    return ' '.join(word_list).strip()


def convert_word_lists_to_sentences(word_lists):
    """Join each word list in a collection into its own sentence string.
    Each [] being its own list, a collection of word lists looks like:
        [
            ['This', 'is', 'one', 'sentence', '.'],
            ['This', 'is', 'another', 'sentence', '.'],
        ]
    The nlp pipeline used for rule-based matching here takes a single string per sentence.

    Given a list of lists of strings, return a list of strings.
    """

    return [convert_word_list_to_sentence(word_list) for word_list in word_lists]
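One possible way to address the TODO about whitespace around punctuation is a small regex pass after joining. This is only a sketch: `tidy_punctuation` is a hypothetical helper, it handles common closing punctuation and brackets only, and it leaves Brown's `` and '' quote tokens untouched.

```python
import re


def tidy_punctuation(sentence):
    """Remove the stray spaces that ' '.join() leaves around punctuation.

    Given a string, return a string.
    """
    # Drop whitespace before , . ; : ! ? and closing brackets
    sentence = re.sub(r"\s+([,.;:!?)\]])", r"\1", sentence)
    # Drop whitespace after opening brackets
    sentence = re.sub(r"([(\[])\s+", r"\1", sentence)
    return sentence


words = ['This', 'is', 'a', 'word', 'list', '.']
print(tidy_punctuation(' '.join(words)))  # This is a word list.
```

Whether this cleanup changes the tagger's output is not something this notebook measures, so it may be worth comparing results with and without it.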

### Step 4) Find commonalities in the tokenized example sentences

In [6]:
for sentence in simple_past_sentences[:3]:
    print_token_table(sentence, pos=False)

Two boys played with a ball.

| Word   | Tag   | Tag Definition                            | Dep.   | Dep. Definition        |
|--------|-------|-------------------------------------------|--------|------------------------|
| Two    | CD    | cardinal number                           | nummod | numeric modifier       |
| boys   | NNS   | noun, plural                              | nsubj  | nominal subject        |
| played | VBD   | verb, past tense                          | ROOT   |                        |
| with   | IN    | conjunction, subordinating or preposition | prep   | prepositional modifier |
| a      | DT    | determiner                                | det    | determiner             |
| ball   | NN    | noun, singular or mass                    | pobj   | object of preposition  |
| .      | .     | punctuation mark, sentence closer         | punct  | punctuation            |


An old lady walked with her cat.

| Word   | Tag   | Tag Definition                            

#### Example 1

Two boys played with a ball.


| Word   | Tag   | Tag Definition                            | Dep.   | Dep. Definition        |
|--------|-------|-------------------------------------------|--------|------------------------|
| played | VBD   | verb, past tense                          | ROOT   |                        |


#### Example 2

An old lady walked with her cat.


| Word   | Tag   | Tag Definition                            | Dep.   | Dep. Definition        |
|--------|-------|-------------------------------------------|--------|------------------------|
| walked | VBD   | verb, past tense                          | ROOT   |                        |


#### Example 3

A nurse brought a little baby girl to the park.


| Word    | Tag   | Tag Definition                            | Dep.     | Dep. Definition       |
|---------|-------|-------------------------------------------|----------|-----------------------|
| brought | VBD   | verb, past tense                          | ROOT     |                       |


#### The pattern

In each sentence, a NN (singular noun) or NNS (plural noun) is followed by a VBD (past tense verb), and that past tense verb is the ROOT dependency of the sentence. This means it is not necessary to look for nouns; the rule only needs to find a single token that is both tagged VBD and labeled ROOT.
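The logic of that check can be sketched in plain Python, without spaCy, over (word, tag, dependency) rows. The first list is copied from the token table for Example 1. The second is a hand-written annotation (an assumption, not model output) for a present perfect sentence, and `has_past_tense_root` is a hypothetical helper.

```python
# (word, tag, dep) rows copied from the token table for Example 1
simple_past_rows = [
    ("Two", "CD", "nummod"),
    ("boys", "NNS", "nsubj"),
    ("played", "VBD", "ROOT"),
    ("with", "IN", "prep"),
    ("a", "DT", "det"),
    ("ball", "NN", "pobj"),
    (".", ".", "punct"),
]

# Hand-written annotation (an assumption, not model output) for
# "I have seen that movie.": the ROOT is a past participle (VBN).
present_perfect_rows = [
    ("I", "PRP", "nsubj"),
    ("have", "VBP", "aux"),
    ("seen", "VBN", "ROOT"),
    ("that", "DT", "det"),
    ("movie", "NN", "dobj"),
    (".", ".", "punct"),
]


def has_past_tense_root(rows):
    """Return True if a single token is tagged VBD and is also the ROOT."""
    return any(tag == "VBD" and dep == "ROOT" for _word, tag, dep in rows)


print(has_past_tense_root(simple_past_rows))      # True
print(has_past_tense_root(present_perfect_rows))  # False
```

Both conditions have to hold on the same token. A sentence with a VBD somewhere other than the ROOT (for example, inside a subordinate clause) would not pass this check.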

### Step 5) Create the spaCy rule

In [7]:
simple_past_rule = [
    {'TAG': 'VBD', 'DEP': 'ROOT'},
]

### Step 6) Put it all together

In [8]:
from spacy.matcher import Matcher


# Build the matcher once so that it is reused for every sentence
simple_past_matcher = Matcher(nlp.vocab)
simple_past_matcher.add('SimplePast', [simple_past_rule])


def is_simple_past_tense(sentence):
    """Return True if a sentence's main verb is in the simple past tense. Otherwise, return False.

    Given a string, return a boolean.
    """

    doc = nlp(sentence)
    matches = simple_past_matcher(doc)

    return bool(matches)


def print_sentence_and_whether_simple_past(sentence):
    """Print a sentence and whether the main verb is in the simple past tense.

    Given a string, return None.
    """

    if is_simple_past_tense(sentence):
        print("YES =>", sentence + "\n")
    else:
        print("NO  =>", sentence + "\n")


def print_sentences_and_whether_simple_past(sentences):
    """Print sentences and whether their main verbs are in the simple past tense.

    Given a list of strings, return None.
    """

    for sentence in sentences:
        print_sentence_and_whether_simple_past(sentence)

### Step 7) Use it on the example sentences

In [9]:
print_sentences_and_whether_simple_past(simple_past_sentences)
print_sentences_and_whether_simple_past(present_perfect_sentences)

YES => Two boys played with a ball.

YES => An old lady walked with her cat.

YES => A nurse brought a little baby girl to the park.

YES => An old man sat down and read his book.

YES => A large truck came around the corner.

NO  => My sister has made a big cake.

NO  => You have grown since the last time I saw you.

NO  => It hasn't drunk the water.

NO  => I have seen that movie.

NO  => We haven't received any mail since we were retired.
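To put a number on "somewhat accurately", the YES/NO results above can be tallied against the labels implied by the two example lists. The sketch below hard-codes the ten outcomes printed above rather than re-running spaCy, and `summarize` is a hypothetical helper.

```python
# (expected, predicted) pairs transcribed from the output above:
# five simple past sentences (expected True, all printed YES) and
# five present perfect sentences (expected False, all printed NO).
results = [(True, True)] * 5 + [(False, False)] * 5


def summarize(pairs):
    """Count true/false positives and negatives, and compute accuracy."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for expected, predicted in pairs:
        if predicted:
            counts["tp" if expected else "fp"] += 1
        else:
            counts["fn" if expected else "tn"] += 1
    accuracy = (counts["tp"] + counts["tn"]) / len(pairs)
    return counts, accuracy


counts, accuracy = summarize(results)
print(counts, accuracy)  # {'tp': 5, 'fp': 0, 'tn': 5, 'fn': 0} 1.0
```

Ten hand-picked textbook sentences are too few to generalize from. Scoring the rule on corpus sentences would first require labeling them by hand.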



## Interactive Example:

### Try changing these settings
Press Ctrl + Enter to rerun the cell (code block).

In [10]:
# Change this: don't forget the ""
sentence = "Who first seduced them to that foul revolt?"


# Don't change this
print_sentence_and_whether_simple_past(sentence)

YES => Who first seduced them to that foul revolt?



## Other Examples:

### Example 1: 20 sentences from news articles

In [11]:
from nltk.corpus import brown


news_articles_word_lists = brown.sents(categories=['news'])[:20]
news_articles_sentences = convert_word_lists_to_sentences(news_articles_word_lists)

print_sentences_and_whether_simple_past(news_articles_sentences)

YES => The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

YES => The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .

NO  => The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .

YES => `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .

YES => The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .

YES => It recomm

### Example 2: 20 sentences from Australian scientific publications

In [12]:
from nltk.corpus import abc


word_lists = abc.sents(fileids="science.txt")[:20]
sentences = convert_word_lists_to_sentences(word_lists)

print_sentences_and_whether_simple_past(sentences)

NO  => Cystic fibrosis affects 30 , 000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers , although side effects include a nasty coughing fit and a harsh taste .

NO  => That ' s the conclusion of two studies published in this week ' s issue of The New England Journal of Medicine .

YES => They found that inhaling a mist with a salt content of 7 or 9 % improved lung function and , in some cases , produced less absenteeism from school or work .

NO  => Cystic fibrosis , a progressive and frequently fatal genetic disease that affects about 30 , 000 young adults and children in the US alone , is marked by a thickening of the mucus which makes it harder to clear the lungs of debris and bacteria .

NO  => The salt water solution " really opens up a new avenue for approaching patients with cystic fibrosis and how to treat them ," says Dr Gail Weinmann , of the US National Heart , Lu

### Example 3: Find past tense sentences in 50 sentences from editorials

In [13]:
from nltk.corpus import brown


word_lists = brown.sents(categories=["editorial"])[:50]
sentences = convert_word_lists_to_sentences(word_lists)

for sentence in sentences:
    if is_simple_past_tense(sentence):
        print(sentence + "\n")

Assembly session brought much good

There followed the historic appropriations and budget fight , in which the General Assembly decided to tackle executive powers .

The final decision went to the executive but a way has been opened for strengthening budgeting procedures and to provide legislators information they need .

The legislature expended most of its time on the schools and appropriations questions .

Fortunately it spared us from the usual spate of silly resolutions which in the past have made Georgia look like anything but `` the empire state of the South '' .

`` If once they become inattentive to the public affairs '' , Jefferson said , `` you and I , and Congress and assemblies , judges and governors , shall all become wolves '' .

The danger lay not in believing that our own A-bombs would deter Russia's use of hers ; ;

that theory was and is sound .

The danger lay in the American delusion that nuclear deterrence was enough .

By limiting American strength too much to nu