A common task in natural language processing is parsing, the process of determining the structure of a sentence. Knowing the structure of a sentence can help a computer to better understand the meaning of the sentence, and it can also help the computer extract information out of a sentence. In particular, it’s often useful to extract noun phrases out of a sentence to get an understanding for what the sentence is about.
In this project, we’ll use the context-free grammar formalism to parse English sentences to determine their structure. In a context-free grammar, we repeatedly apply rewriting rules to transform symbols into other symbols. The objective is to start with a nonterminal symbol S (representing a sentence) and repeatedly apply context-free grammar rules until we generate a complete sentence of terminal symbols (i.e., words). The rule S -> N V, for example, means that the S symbol can be rewritten as N V (a noun followed by a verb). If we also have the rule N -> "Holmes" and the rule V -> "sat", we can generate the complete sentence "Holmes sat.".
Noun phrases and verbs phrases might not always be as simple as a single word like "Holmes" or "sat. We might have noun phrases like "my companion" or "a country walk" or "the day before Thursday", which require more complex rules to account for. To account for the phrase "my companion", for example, we might imagine a rule like: NP -> N | Det N. In this rule, we say that an NP (a “noun phrase”) could be either just a noun (N) or a determiner (Det) followed by a noun, where determiners include words like "a", "the", and "my". The vertical bar (|) just indicates that there are multiple possible ways to rewrite an NP, with each possible rewrite separated by a bar.
To incorporate this rule into how we parse a sentence (S), we need to modify the S -> N V rule to S -> NP V, to allow for noun phrases (NP) as the subject of our sentence. To account for more complex types of noun phrases, we need more rules to complement the grammar even further. The more rules are implemented, the more complex sentences we can parse.
By defining this set of rules, the CYK algorithm, used by nltk's parse, is able to take a sentence (terminal symbols) and figure out the syntax tree (the structure of the non-terminal symbols).
According to how well the rules are established, this algorithm can detect non-well formed sentences, but it's incapable to detect some sentences that may not be semantically well-formed (non-sense sentences).
Since English grammar is inherently ambiguous, the same sentences can produce more than one syntax tree.
Each text file in the sentences directory contains an English sentence that our AI will parse.
In parser.py, the main function first accepts a sentence as input, either from a file or via user input. The sentence is preprocessed (via the preprocess function) and then parsed according to the context-free grammar defined by the file. The resulting trees are printed out, and all of the “noun phrase chunks” are printed as well (via the np_chunk function).
-
The
preprocessfunction accepts a sentence (a string) as input and return a lowercased list of its words. -
It uses nltk’s
word_tokenizefunction to perform tokenization. -
Any word that doesn’t contain at least one alphabetic character (e.g. . or 28) is excluded from the returned list.
-
The global variable
TERMINALSgenerates terminal symbols.Adjis a nonterminal symbol that generates adjectives,Advgenerates adverbs,Conjgenerates conjunctions,Detgenerates determiners,Ngenerates nouns,Pgenerates prepositions, andVgenerates verbs. -
The global variable
NONTERMINALSgenerates nonterminal symbols. The first rule begin withS ->sinceS(representing a sentence) is the starting symbol.NPrepresents noun phrases,VPrepresents verb phrases,AdjPrepresents adjective phrases,AdvPrepresents adverb phrases,PPrepresents preposition phrases, andConjPrepresents conjunctions phrases. -
Each rule include the
->characters to denote which symbol is being replaced, and may optionally include|symbols if there are multiple ways to rewrite a symbol. -
When combined both set of rules, we can parse all the sentences in the
sentencesdirectory.
Some constraints:
-
Avoiding over-generation of sentences, sentences like "Armchair on the sat Holmes" aren't accepted by the parser.
-
Avoiding under-generation of sentences. A very long and specific rule like
(S -> N V Det Adj Adj Adj N P Det N P Det N)would technically successfully generate sentence 10, but isn't useful or generalizable. -
Trying to be as general as possible without over-generating. The parser accepts the sentences: “Holmes sat in the armchair”, “Holmes sat in the red armchair” and “Holmes sat in the little red armchair”, but not the sentence: “Holmes sat in the the armchair”.
-
The
np_chunkfunction accepts a tree representing the syntax of a sentence (anltk.treeobject whose label isS) and return a list of all the noun phrase chunks in the sentence (a list ofnltk.treeobjects, where each element has the labelNP). -
A noun phrase chunk is defined as a subtree of the original tree whose label is
NPthat does not itself contain any other noun phrases as subtrees. -
For example, if "the home" is a noun phrase chunk, then "the armchair in the home" is not a noun phrase chunk, because the latter contains the former as a subtree.
- Language - Lecture 6 - CS50's Introduction to Artificial Intelligence with Python 2020
- Analyzing English Grammar
- NLTK 3.5 documentation
Inside of the parser directory:
pip3 install -r requirements.txt| Install this project’s dependency: nltk for natural language processing.
Inside of the parser directory:
-
python3 parser.py sentences/10.txt| Accepts the sentence via a file. -
python3 parser.py| Accepts the sentence via user input.
Luis Sanchez 2020.
A project from the course CS50's Introduction to Artificial Intelligence with Python 2020 from HarvardX.
