# Grammar-Detector

Note: This package is in alpha.

## About This Notebook

This notebook demonstrates how to install and use the [grammar-detector package](https://pypi.org/project/grammar-detector/). It does not cover configuring custom grammatical features through patternset YAML files, because of technical limitations with Jupyter Notebooks.

## Author

Steven Kyle Crawford

## Description

A tool for detecting grammatical features in sentences, clauses, and phrases in just a few lines of code. It is one piece of a larger project to facilitate the creation of reading exercises for language instruction, and it is designed to determine whether a text contains sentences relevant to a desired grammatical feature. In theory, any language supported by spaCy is supported.

The patterns for these grammatical features are defined in YAML files called patternsets rather than in code, and the input text is compared against the patterns in those patternsets. Supporting a new grammatical feature therefore means adding a patternset, not writing more code. It also means that inaccurate results arise from inaccurate patterns and not from the code itself. To mitigate such errors, unit tests can be defined in the patternsets.
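The idea of keeping patterns as data can be illustrated with a small, self-contained sketch. This is a conceptual illustration only: the rule dictionary, the attribute names, and the `find_rule` helper are hypothetical and do not reflect the package's patternset schema or internals. The tokens are tagged by hand here, whereas in practice spaCy would supply the tags.

```python
# Conceptual sketch only: NOT the grammar-detector patternset schema.
from typing import Optional

Token = dict[str, str]
Pattern = list[dict[str, str]]

# Hypothetical rules, expressed as data rather than code.
RULES: dict[str, Pattern] = {
    "past simple": [{"TAG": "VBD"}],
    "present continuous": [{"LEMMA": "be", "TAG": "VBZ"}, {"TAG": "VBG"}],
}


def token_matches(token: Token, spec: dict[str, str]) -> bool:
    """Return True if every attribute in the spec equals the token's value."""
    return all(token.get(attr) == value for attr, value in spec.items())


def find_rule(tokens: list[Token], rules: dict[str, Pattern]) -> Optional[str]:
    """Return the name of the first rule whose pattern matches a contiguous run."""
    for name, pattern in rules.items():
        width = len(pattern)
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if all(token_matches(t, s) for t, s in zip(window, pattern)):
                return name
    return None


# Hand-tagged tokens standing in for spaCy output.
sentence = [
    {"TEXT": "The", "TAG": "DT", "LEMMA": "the"},
    {"TEXT": "dog", "TAG": "NN", "LEMMA": "dog"},
    {"TEXT": "chased", "TAG": "VBD", "LEMMA": "chase"},
]
print(find_rule(sentence, RULES))
```

Adding a new feature in this sketch means adding an entry to `RULES`; the matching functions stay untouched.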

## Installation

This notebook uses the smallest available dataset instead of the default.

The default dataset, [en_core_web_lg](https://spacy.io/models/en#en_core_web_lg) (560 MB), can be substituted with another [spaCy dataset](https://spacy.io/usage/models#languages), such as [en_core_web_sm](https://spacy.io/models/en#en_core_web_sm) (12 MB) or [en_core_web_md](https://spacy.io/models/en#en_core_web_md) (40 MB).

### Installing the Package

In [1]:
pip install grammar-detector

Note: you may need to restart the kernel to use updated packages.


### Downloading the Language Model

In [2]:
# Workaround for Jupyter Notebooks
from sys import executable
!{executable} -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
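Because spaCy models install as ordinary Python packages, it is possible to check which of the models mentioned above is already available before constructing the detector. The following is a minimal sketch using only the standard library; the helper name `first_installed` is hypothetical and not part of grammar-detector or spaCy.

```python
from importlib.util import find_spec
from typing import Optional


def first_installed(candidates: list[str]) -> Optional[str]:
    """Return the first candidate importable as a top-level package, else None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


# Preference order: smallest first. These are the model names mentioned above.
preferred = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg"]
dataset = first_installed(preferred)
print(dataset or "No model found; run the download cell first.")
```

If a name is returned, it can be passed as the `dataset` keyword argument when constructing the `GrammarDetector`.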


## Usage

### Usage: 0) Constructing the GrammarDetector

In [17]:
from grammardetector import GrammarDetector


# Default values
settings = {  
    "builtins": True,
    # "dataset": "en_core_web_lg", # Too big for the notebook (560 MB)
    "patternset_path": "",  # Custom patternsets
    "verbose": False,
    "very_verbose": False,
}
grammar_detector = GrammarDetector(**settings, dataset="en_core_web_sm")  # **settings is optional; it only restates the defaults

### Usage: 1) Running the GrammarDetector

In [18]:
from grammardetector import Match
from typing import Union   


ResultsType = dict[str, Union[str, list[Match]]]


results: ResultsType = grammar_detector("The dog chased a cat into the house.")
results

{'input': 'The dog chased a cat into the house.',
 'voices': [<active: chased>],
 'tense_aspects': [<past simple: chased>],
 'persons': [<3rd: dog>, <3rd: cat>, <3rd: house>],
 'determiners': [<definite: The dog>,
  <indefinite: a cat>,
  <definite: the house>],
 'transitivity': [<ditransitive: dog chased a cat into the house>]}
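The results dictionary mixes one plain string (`'input'`) with lists of matches, so a small helper can make it easier to scan. The sketch below is an illustration under assumptions: `StubMatch` is a hypothetical stand-in that models only the `rulename` attribute, and `rulenames_by_feature` is not part of the package.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class StubMatch:
    """Hypothetical stand-in for grammardetector.Match; only `rulename` is modeled."""
    rulename: str


def rulenames_by_feature(results: dict[str, Union[str, list]]) -> dict[str, list[str]]:
    """Map each feature to the rule names of its matches, skipping the 'input' string."""
    return {
        feature: [match.rulename for match in matches]
        for feature, matches in results.items()
        if isinstance(matches, list)
    }


# Stub data mirroring part of the output above.
stub_results = {
    "input": "The dog chased a cat into the house.",
    "voices": [StubMatch("active")],
    "determiners": [StubMatch("definite"), StubMatch("indefinite"), StubMatch("definite")],
}
print(rulenames_by_feature(stub_results))
```

The same function should work on real results as long as each match exposes a `rulename` attribute.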

### Usage: 2) Interpreting the Results

In [5]:
from grammardetector import Match


feature: str = "tense_aspects"
verb_tense: Match = results[feature][0]

In [6]:
verb_tense  # Match

<past simple: chased>

In [7]:
verb_tense.rulename  # str

'past simple'

In [8]:
verb_tense.span_features  # dict[str, typing.Union[str, spacy.tokens.Span]]

{'span': chased,
 'phrase': 'chased',
 'root': 'chased',
 'root_head': 'chased',
 'pos': 'VERB',
 'tag': 'VBD',
 'dep': 'ROOT',
 'phrase_lemma': 'chase',
 'root_lemma': 'chase',
 'pos_desc': 'verb',
 'tag_desc': 'verb, past tense',
 'dep_desc': 'root'}

Each `Match` also includes the [`spacy.tokens.Span`](https://spacy.io/api/span) and its properties.

In [9]:
verb_tense.span  # spacy.tokens.Span

chased

In [10]:
verb_tense.span.doc  # spacy.tokens.Doc

The dog chased a cat into the house.

In [11]:
verb_tense.span.vector  # numpy.ndarray[ndim=1, dtype=float32]

array([ 0.05253243,  0.7584752 , -1.4021966 , -0.2234522 , -1.3500535 ,
        1.3414049 ,  1.2377656 , -0.14931394,  0.7069486 ,  2.752092  ,
       -0.9106793 , -0.99602413, -0.13224733, -0.82341045,  0.5429548 ,
        0.08999176,  0.5639274 , -1.0829873 , -0.27505642,  0.6702783 ,
        0.21955457,  0.61367756,  0.55740964, -0.01022789, -0.8495478 ,
       -0.1037328 ,  0.582757  , -0.57657075, -0.70535654, -0.59675705,
        0.25138372, -0.39082566,  1.4249784 ,  0.07549933, -0.73862785,
       -0.5933064 ,  1.6329055 , -1.1838139 , -0.9827094 ,  0.3182447 ,
       -0.9244643 , -0.14059615,  0.27800757, -0.24165767,  0.46711537,
       -0.21405753, -0.6099269 ,  0.03935669, -0.13197665,  1.2914371 ,
       -0.22493592,  0.14974567,  0.5813708 ,  2.4722009 , -0.3984077 ,
       -0.09895566,  0.16179523, -0.48209858, -1.0198215 , -0.5356641 ,
       -0.9093542 , -0.6016121 , -0.10648981, -0.11024137, -0.3472551 ,
       -1.259284  ,  0.20666468,  0.08367673,  0.03628533, -1.40

In [12]:
verb_tense.span.vector_norm  # float

8.386418
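spaCy documents `vector_norm` as the L2 norm of the span's vector, and such norms are the denominator in cosine similarity. The sketch below reproduces both calculations on toy vectors with the standard library only; the function names are illustrative and the toy values are not taken from any model.

```python
import math


def l2_norm(vector: list[float]) -> float:
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two L2 norms."""
    denominator = l2_norm(a) * l2_norm(b)
    if denominator == 0:
        return 0.0  # Convention chosen for this sketch when a vector is all zeros.
    return sum(x * y for x, y in zip(a, b)) / denominator


toy = [3.0, 4.0]
print(l2_norm(toy))  # 5.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
```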

### Usage: Printing Token Tables

Token tables list the part of speech (`POS`), fine-grained tag (`TAG`), dependency label (`DEP`), and lemma of each `spacy.tokens.Token`, along with a short definition of each label.

Note: The token tables are not formatted well by Jupyter Notebooks. See the [README](https://pypi.org/project/grammar-detector/) for an accurate example.

In [13]:
table = grammar_detector.token_table("The dog chased a cat into the house.")  # (input: str) -> str
print(table)

| Word   | POS   | POS Definition   | Tag   | Tag Definition                            | Dep.   | Dep. Definition        | Lemma.   |
|--------|-------|------------------|-------|-------------------------------------------|--------|------------------------|----------|
| The    | DET   | determiner       | DT    | determiner                                | det    | determiner             | the      |
| dog    | NOUN  | noun             | NN    | noun, singular or mass                    | nsubj  | nominal subject        | dog      |
| chased | VERB  | verb             | VBD   | verb, past tense                          | ROOT   | root                   | chase    |
| a      | DET   | determiner       | DT    | determiner                                | det    | determiner             | a        |
| cat    | NOUN  | noun             | NN    | noun, singular or mass                    | dobj   | direct object          | cat      |
| into   | ADP   | adposition       | IN    | conjuncti
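Since `token_table` returns a plain string, it can be convenient to turn the table back into structured rows. The following is a rough sketch that assumes the pipe-delimited layout shown above, with a header row followed by a dashed separator row; `parse_token_table` is a hypothetical helper, and the sample is an abridged table with fewer columns.

```python
def parse_token_table(table: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited table into one dict per data row."""
    rows = []
    for line in table.strip().splitlines():
        # Drop the outer pipes, then split on the inner ones.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the dashed separator row
    return [dict(zip(header, row)) for row in body]


# Abridged sample in the same layout as the output above.
sample = """
| Word   | POS   | Tag   | Lemma.   |
|--------|-------|-------|----------|
| The    | DET   | DT    | the      |
| dog    | NOUN  | NN    | dog      |
"""
tokens = parse_token_table(sample)
print(tokens[1]["Lemma."])
```

This sketch would break if a cell itself contained a `|` character, which does not occur in the output shown here.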

### Usage: Running Tests

Each patternset YAML file contains tests to validate the accuracy of the patterns.

In [14]:
grammar_detector.run_tests(builtin_tests=True)  # (builtin_tests: bool) -> None

FAIL: [tense_aspects: I am going to have run.]
----------------------------------------------------------------------
ACTUAL vs EXPECTED

Lists differ: ['present continuous'] != ['future perfect continuous']

First differing element 0:
'present continuous'
'future perfect continuous'

- ['present continuous']
+ ['future perfect continuous']
----------------------------------------------------------------------

FAIL: [persons: Biden and Harris are running for President and Vice-President.]
----------------------------------------------------------------------
ACTUAL vs EXPECTED

Lists differ: ['3rd', '3rd', '3rd', '3rd', '3rd', '3rd'] != ['3rd', '3rd', '3rd', '3rd']

First list contains 2 additional elements.
First extra element 4:
'3rd'

- ['3rd', '3rd', '3rd', '3rd', '3rd', '3rd']
?                       --------------

+ ['3rd', '3rd', '3rd', '3rd']
----------------------------------------------------------------------

FAIL: [determiners: the book that was on a shelf]
-------------
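The failure messages above have the same shape as the diff produced by the standard library's `unittest.TestCase.assertListEqual`. Whether the package calls that method internally is an assumption; the sketch below only shows how comparing a list of actual rule names against the expected list yields a message of this form. The helper `check_rulenames` is hypothetical.

```python
import unittest


def check_rulenames(actual: list[str], expected: list[str]) -> str:
    """Return an empty string on success, or unittest's diff message on failure."""
    case = unittest.TestCase()
    try:
        case.assertListEqual(actual, expected)
    except AssertionError as error:
        return str(error)
    return ""


# Values taken from the first failure shown above.
message = check_rulenames(["present continuous"], ["future perfect continuous"])
print(message)
```

Reading the diff this way, the `-` line is the actual result and the `+` line is what the patternset's test expected.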

### Usage: Troubleshooting

The `GrammarDetector` constructor's keyword arguments `verbose` and `very_verbose` expose INFO-level and DEBUG-level logs, respectively. These can be helpful when debugging custom patternset YAML files. If both are set, `very_verbose` takes priority over `verbose`. Both default to `False`.

In [15]:
detector_with_debug = GrammarDetector(dataset="en_core_web_sm", verbose=True, very_verbose=False)  # Use the small model downloaded above

In [16]:
detector_with_debug("The dog chased a cat into the house.")

{'input': 'The dog chased a cat into the house.',
 'voices': [<active: chased>],
 'tense_aspects': [<past simple: chased>],
 'persons': [<3rd: dog>, <3rd: cat>, <3rd: house>],
 'determiners': [<definite: The dog>,
  <indefinite: a cat>,
  <definite: the house>],
 'transitivity': [<ditransitive: dog chased a cat into the house>]}
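The priority rule between `very_verbose` and `verbose` described in this section can be sketched with the standard `logging` module. This is not the package's implementation: `resolve_log_level` is a hypothetical function, and the `WARNING` fallback is an assumed quiet default.

```python
import logging


def resolve_log_level(verbose: bool = False, very_verbose: bool = False) -> int:
    """Pick a log level; very_verbose wins over verbose."""
    if very_verbose:
        return logging.DEBUG
    if verbose:
        return logging.INFO
    return logging.WARNING  # Assumed default for this sketch, not confirmed by the package.


for flags in [(False, False), (True, False), (True, True)]:
    level = resolve_log_level(*flags)
    print(flags, logging.getLevelName(level))
```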