# Testing spaCy Patterns

The purpose of this notebook is to build and test patterns that match
English verb constructions and label them with informative tags.

We will use spaCy's `Matcher` class for this, alongside the dependency parser:

https://spacy.io/usage/rule-based-matching

To-do list of primary English tense constructions, curated from:

https://en.wikipedia.org/wiki/English_verbs#Expressing_tenses,_aspects_and_moods

```
simple present            writes
simple past               wrote
present progressive       is writing
past progressive          was writing
present perfect           has written
past perfect              had written
present perf. progress.   has been writing
past perf. progress.      had been writing
future                    will write
future perfect            will have written
future perf. progress.    will have been writing
```

Secondary constructions:

```
imperative               write
future-in-past           would write
do-support               does write
be-going-to future       is going to write
```

Many of these can be found by parsing the sentence and applying spaCy's
`Matcher` with a few rules. Ideally, the constructional combinations would be
identified modularly, so that the 'perfect' in a past perfect progressive is
matched in the same way as the 'perfect' in a simple past perfect.
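
One way to get this modularity is to assemble each pattern from shared token-level fragments, so that the perfect auxiliary is defined once and reused. The following is only a minimal sketch: the fragment names (`HAVE_PRES`, `HAVE_PAST`, `BEEN`, `PARTICIPLE`, `GERUND`) and the `compose` helper are hypothetical and not part of spaCy; only the token dictionaries follow the `Matcher` pattern format.

```python
# Hypothetical building blocks: each one is a list of Matcher token patterns.
HAVE_PRES = [{'TAG': {'IN': ['VBZ', 'VBP']}, 'LEMMA': 'have'}]
HAVE_PAST = [{'TAG': 'VBD', 'LEMMA': 'have'}]
BEEN = [{'TAG': 'VBN', 'LEMMA': 'be'}]
PARTICIPLE = [{'TAG': 'VBN'}]  # main verb of a perfect
GERUND = [{'TAG': 'VBG'}]      # main verb of a progressive

def compose(*fragments):
    """Concatenate fragments into one token-pattern list (shallow-copies each dict)."""
    return [dict(token) for fragment in fragments for token in fragment]

# The 'perfect' part is now the same fragment in every rule that needs it.
modular_rules = {
    'present perfect': compose(HAVE_PRES, PARTICIPLE),
    'past perfect': compose(HAVE_PAST, PARTICIPLE),
    'present perfect progressive': compose(HAVE_PRES, BEEN, GERUND),
    'past perfect progressive': compose(HAVE_PAST, BEEN, GERUND),
}
```

Each value in `modular_rules` is an ordinary pattern list, so it could be registered with the matcher in the same way as a hand-written rule.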

We can consider dividing these constructions into three columns, one each for
tense, aspect, and modality. If a construction contributes to one of these
categories, that column is filled; otherwise it is left empty.

```
"has been writing"

tense      aspect                 modality
-----      ------                 --------
present    perfect progressive
```
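
The column idea can be sketched as a small record type plus a helper that splits a construction label into its parts. This is an illustration under stated assumptions: `TAM` and `decompose` are hypothetical names, 'future' is treated as a tense value purely for simplicity, and the modality column is always left empty because the labels used here carry no modality keyword.

```python
from typing import NamedTuple

class TAM(NamedTuple):
    """One row of the tense / aspect / modality table (hypothetical helper type)."""
    tense: str = ''
    aspect: str = ''
    modality: str = ''

TENSES = ('present', 'past', 'future')  # assumption: 'future' counted as a tense
ASPECTS = ('perfect', 'progressive')

def decompose(label):
    """Split a label such as 'past perfect progressive' into TAM columns."""
    words = label.split()
    tense = next((w for w in words if w in TENSES), '')
    aspect = ' '.join(w for w in words if w in ASPECTS)
    return TAM(tense=tense, aspect=aspect)

print(decompose('present perfect progressive'))
# TAM(tense='present', aspect='perfect progressive', modality='')
```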

In [69]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token, Span
from spacy.util import filter_spans # nice tip: https://stackoverflow.com/a/63303480/8351428
import collections

In [59]:
spacy.explain('VBP')

'verb, non-3rd person singular present'

In [94]:
test_sentences = '''\
He writes. He wrote. She is writing. She was writing.
He has written. He had written. She has been writing.
She had been writing. He will write. He will have written.
She will have been writing. Write. She would write. He does write.
He did write.
He is going to write.
Let it be written.
Let there be writing.
'''

nlp = spacy.load('en_core_web_sm')

# a set of rules to match tense-aspect-modality constructions in English
# NB: the order of the patterns matters
tam_rules = [
    
    (
        'imperative', # NB: this must come first so it can be over-written by longer patterns
        [
            {'TAG': 'VB', 'DEP':'ROOT'},
        ]
    ),
    (
        'future',
        [
            {'TAG': 'MD', 'LEMMA': 'will'},
            {'TAG': 'VB', 'DEP': {'IN': ['ROOT']}},
        ]
    ),
    (
        'present', 
        [
            {'TAG':{'IN':['VBZ', 'VBP']}, 'DEP': {'NOT_IN': ['aux']}},
        ]
    ),
    (
        'present perfect progressive',
        [
            {'TAG': {'IN': ['VBZ', 'VBP']}, 'LEMMA': 'have'},
            {'TAG': 'VBN', 'LEMMA': 'be'},
            {'TAG': 'VBG'},
        ]
    ),
    (
        'past perfect progressive',
        [
            {'TAG': {'IN': ['VBD']}, 'LEMMA': 'have'},
            {'TAG': 'VBN', 'LEMMA': 'be'},
            {'TAG': 'VBG'},
        ]
    ),
    (
        'present perfect',
        [
            {'TAG': {'IN': ['VBZ', 'VBP']}, 'LEMMA': 'have'},
            {'TAG': 'VBN', 'DEP': {'NOT_IN': ['aux']}},
        ]
    ),
    (
        'past perfect',
        [
            {'TAG': {'IN': ['VBD']}, 'LEMMA': 'have'},
            {'TAG': 'VBN', 'DEP': {'IN': ['ROOT']}},
        ]
    ),
    (
        'future perfect',
        [
            {'TAG': 'MD', 'LEMMA': 'will'},
            {'TAG': {'IN': ['VB']}, 'LEMMA': 'have'},
            {'TAG': 'VBN', 'DEP': {'IN': ['ROOT']}},
        ]
    ),
    (
        'future perfect progressive',
        [
            {'TAG': 'MD', 'LEMMA': 'will'},
            {'TAG': {'IN': ['VB']}, 'LEMMA': 'have'},
            {'TAG': 'VBN', 'LEMMA': 'be'},
            {'TAG': 'VBG'},
        ]
    ),
    (
        'present progressive', 
        [
            {'TAG': {'IN':['VBZ', 'VBP']}, 'LEMMA':'be'}, 
            {'TAG':'VBG', 'LEMMA': {'NOT_IN':['go']}},
        ]
    ),
    (
        'past progressive',
        [
            {'TAG':'VBD', 'LEMMA':'be'},
            {'TAG': 'VBG'},
        ]
    ),
    (
        'future-in-past', # habitual?
        [
            {'LOWER': 'would', 'DEP': {'IN': ['aux']}},
            {'TAG':'VB'}
        ]
    ),
    (
        'do-support present',
        [
            {'TAG': {'IN': ['VBZ', 'VBP']}, 'LEMMA': 'do'},
            {'TAG': 'VB'},
        ]
    ),
    (
        'past perfect (did)', # NB: misnomer; "did write" is do-support past, not a perfect
        [
            {'TAG': {'IN': ['VBD']}, 'LEMMA': 'do'},
            {'TAG': 'VB'},
        ]
    ),
    (
        'be-going-to future',
        [
            {'TAG': {'IN':['VBZ', 'VBP']}, 'LEMMA':'be'}, 
            {'TAG': 'VBG', 'LEMMA': 'go'},
            {'TAG': 'TO'},
            {'TAG': 'VB'},
        ]
    ),
    (
        'MODAL-there-be',
        [
            {'TAG': {'IN':['VB', 'MD']}, 'LOWER': {'IN':['let', 'may']}},
            {'TAG': {'IN':['EX', 'PRP']}}, # EX = 'existential there'
            {'TAG': 'VB', 'LOWER': 'be'},
            {'TAG': 'VBN', 'OP': '?'}
        ]
    ),
]

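# every match found so far, keyed by its (start, end) token offsets
tam_matches = collections.defaultdict(set)

def on_match(matcher, doc, i, matches):
    # spaCy calls this once per match; `i` indexes the current match in `matches`
    match_id, begin, end = matches[i]
    tam_matches[(begin, end)].add(matches[i])

# custom attribute that stores the construction label on each matched span
Span.set_extension('tam', default='', force=True)
matcher = Matcher(nlp.vocab)

for tag, rules in tam_rules:
    # spaCy v3 signature; on spaCy v2 this was matcher.add(tag, on_match, rules)
    matcher.add(tag, [rules], on_match=on_match)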

In [95]:
spacy.explain('VBN')

'verb, past participle'

In [96]:
parse = nlp(test_sentences)

In [97]:
matches = matcher(parse)
spans = []

# tag all spans with tam tag
for mid, start, end in matches:
    span = parse[start:end]
    span._.tam = nlp.vocab.strings[mid]
    spans.append(span)
    
# filter out overlapping spans
filtered_spans = filter_spans(spans)

for span in filtered_spans:
    print(span._.tam)
    print('\t', span)
    print()

present
	 writes

present progressive
	 is writing

past progressive
	 was writing

present perfect
	 has written

past perfect
	 had written

present perfect progressive
	 has been writing

past perfect progressive
	 had been writing

future
	 will write

future perfect
	 will have written

future perfect progressive
	 will have been writing

imperative
	 Write

future-in-past
	 would write

do-support present
	 does write

past perfect (did)
	 did write

be-going-to future
	 is going to write

MODAL-there-be
	 Let it be written

MODAL-there-be
	 Let there be

