# Context Free Grammar Parsing using the CYK Algorithm
## Muya Guoji and Evelyn Kessler
## December 17, 2023

Outline

    A 10-15 minutes presentation during final events period
    Code for your implementation (when relevant) and instructions for how to run the code
    A 5 pages write up describing the work and walking through the code

- introduction; what is the CYK algorithm, our inspiration for the project, basic premise of CFG parsing
- introducing the CYK algorithm; what is Chomsky Normal Form (unary terminal and binary non-terminals) and why do we need it/how to write in it, basic architecture of the algorithm, possible uses/real world examples, theory walkthrough with lots of visuals (parsing trees)
- step by step; detailed step by step through the algorithm (practical walkthrough), explain each line, use print outs or other visuals to explain what the loops are doing
- using CYK parsing on natural language; transformers fail to get semantics while parsing can clarify ambiguous language, what are parts of speech, examples with real sentences (both parsing tree and run through our algorithm)

- showing our algorithm with comments/docstring/etc
- short paragraph of how to run it what the print outs mean
- some example CFGs that you can run and let user pick the sentence (try and create true and false sentences)

### Table of Contents
1. [Introduction](#introduction) - Evelyn
    1. [What is CFG Parsing?](#subintroduction1) - Evelyn
    2. [CFG Parsing in the Real World](#subintroduction2) - Evelyn
2. [The CYK Algorithm](#paragraph1)
    1. [Chomsky Normal Form](#subparagraph11) - Muya
    2. [Basic Code Architecture](#subparagraph12) - Muya
    3. [Uses in the World](#subparagraph13) - Evelyn
    4. [Visual Walkthrough](#subparagraph14) - Evelyn
3. [Code Walkthrough](#paragraph2) - Muya

4. [CYK Parsing on Natural Language](#paragraph3)
    1. [Benefits of CYK over Transformers](#subparagraph31) - Muya
    2. [What are Parts of Speech?](#subparagraph32) - Muya
    3. [Writing CNF CFGs for Natural Language Parsing](#subparagraph33) - Evelyn (geeks for geeks reference)
    4. [Visual Walkthrough: Parsing a Sentence](#subparagraph34) - Evelyn
5. [Appendix](#appendix)
    1. [CYK Implementation](#subappendix1)
    2. [Instructions for Running the Algorithm](#subappendix2)
    3. [Example CFGs and strings to parse](#subappendix3)
    4. [How to create your own CFG and strings to parse](#subappendix4)
6. [Sources](#sources)

#### Introduction <a name="introduction"></a>
A context free grammar, or CFG, is a grammar that is used to generate all possible strings in a given language. A context free grammar, G, can be defined by a tuple 

$$
G = (\Sigma, \Gamma, R, S)
$$

where $$\Sigma$$ is a set of terminal symbols, $$\Gamma$$ is a set of non-terminal symbols, R is a set of rules of the form $$x \to \mu$$ with $$x \in \Gamma$$ and $$\mu \in (\Gamma \cup \Sigma)^*$$, and $$S \in \Gamma$$ is the start symbol. One way of defining a language is to write a CFG for that language. Then, for any string, we can decide whether it is in the language by checking whether the rules of the CFG can generate it. In this method we begin with the start symbol S and repeatedly apply rules, each time replacing a non-terminal with the right-hand side of one of its rules, until only terminals remain. If some sequence of rule applications produces the desired string, the string is in the language; if no sequence does, the string is not valid for the language described by that CFG. For an arbitrary CFG this forward search has no built-in stopping point, because rules such as $$A \to B$$ or $$A \to \varepsilon$$ let a derivation continue without making the string any longer. Once the grammar is in Chomsky Normal Form (introduced below), every derivation of a string of length $$n \geq 1$$ uses exactly $$2n - 1$$ rule applications, so only finitely many derivations need to be checked.
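To make the forward, generate-from-S method concrete, here is a minimal Python sketch. The toy grammar (for the language of n a's followed by n b's) and the names `RULES`, `generate`, and `derives` are our own illustrative choices, not part of the project code. The sketch assumes no rule has an empty right-hand side, so intermediate forms never shrink and the search can stop at the target length.

```python
from collections import deque

# Hypothetical toy grammar G = (Sigma, Gamma, R, S) for a^n b^n with n >= 1.
SIGMA = {"a", "b"}                               # terminal symbols
GAMMA = {"S"}                                    # non-terminal symbols
RULES = {"S": [("a", "S", "b"), ("a", "b")]}     # S -> aSb | ab
START = "S"


def generate(rules, start, max_len):
    """Return every terminal string of length <= max_len derivable from start.

    Assumption: no rule has an empty right-hand side, so a form longer
    than max_len can never lead to a string of length <= max_len.
    """
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Index of the leftmost non-terminal, or None if only terminals remain.
        idx = next((i for i, sym in enumerate(form) if sym in rules), None)
        if idx is None:
            results.add("".join(form))
            continue
        for rhs in rules[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results


def derives(rules, start, target):
    """True if target can be generated from the start symbol."""
    return target in generate(rules, start, len(target))


print(sorted(generate(RULES, START, 4)))
print(derives(RULES, START, "aabb"), derives(RULES, START, "abab"))
```

The length bound is what keeps this search finite. Without it, or with rules that have empty right-hand sides, the loop could run forever on a string that is not in the language.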

##### What is CFG Parsing? <a name="subintroduction1"></a>
CFG parsing is another way to find whether a given string is valid for a language as described by a CFG. Instead of starting with S and building up to the given string using the rules, CFG parsing works backwards, starting with the given string and seeing if we can reduce it to S using a combination of the given rules.

##### CFG Parsing in the Real World <a name="subintroduction2"></a>



#### The CYK Algorithm <a name="paragraph1"></a>

##### Chomsky Normal Form <a name="subparagraph11"></a>
Before delving into the algorithm, we would like to first touch on the concept of Chomsky Normal Form (CNF). It is a prerequisite for understanding the CYK algorithm, as the algorithm specifically requires grammars to be in CNF in order to be processed. 

CNF, developed by the well-known linguist Noam Chomsky, simplifies the rules of context-free grammars for more efficient parsing and algorithmic analysis. It is a restricted way of writing CFGs in which every production rule takes one of two forms: a non-terminal symbol producing exactly two non-terminal symbols, or a non-terminal symbol producing a single terminal symbol. The one exception is that the start symbol may produce epsilon, the empty string, when the language contains the empty string; in that case the start symbol does not appear on the right-hand side of any rule.

$$
A \to BC
$$
$$
A \to a
$$
$$
S \to \varepsilon
$$
*(Uppercase letters represent non-terminal symbols, and lowercase letters represent terminal symbols)*


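Restricting production rules to this specific and clean form greatly simplifies the design and analysis of parsing algorithms like CYK: any substring longer than one symbol must come from a rule with exactly two non-terminals on the right, so it can always be split into exactly two shorter parts.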
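The three rule shapes listed above can be checked mechanically. The following is a small sketch under our own assumptions: a grammar is stored as a dictionary from each non-terminal to a list of right-hand sides (tuples of symbols), and any symbol that is not a key of that dictionary is treated as a terminal. The function name `is_cnf` is hypothetical.

```python
def is_cnf(rules, start):
    """Check whether every rule has one of the CNF shapes.

    rules: dict mapping a non-terminal to a list of right-hand sides,
           each a tuple of symbols; () stands for epsilon.
    start: the start symbol.
    """
    nonterminals = set(rules)
    # Does the start symbol occur on any right-hand side?
    start_on_rhs = any(start in rhs for rhss in rules.values() for rhs in rhss)
    for lhs, rhss in rules.items():
        for rhs in rhss:
            if len(rhs) == 2 and all(sym in nonterminals for sym in rhs):
                continue  # A -> BC
            if len(rhs) == 1 and rhs[0] not in nonterminals:
                continue  # A -> a
            if len(rhs) == 0 and lhs == start and not start_on_rhs:
                continue  # S -> epsilon, with S never on a right-hand side
            return False
    return True


cnf_grammar = {"S": [("A", "B")], "A": [("a",)], "B": [("b",)]}
non_cnf_grammar = {"S": [("a", "S", "b"), ("a", "b")]}
print(is_cnf(cnf_grammar, "S"), is_cnf(non_cnf_grammar, "S"))
```

Both grammars here describe strings of a's followed by b's, but only the first is in a form the CYK algorithm can process directly.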


##### Basic Code Architecture <a name="subparagraph12"></a>
Let's begin the code run-through of the algorithm with a high-level overview of the CYK algorithm's architecture:

1. Input and Grammar Preparation: The algorithm takes an input string, here a sentence broken down into words, and a set of grammar rules in Chomsky Normal Form (CNF).

2. Table Initialization: A triangular table (matrix) is created whose size depends on the length of the input string, with one cell for each contiguous substring. This table stores the non-terminals that can derive each substring of the input.

3. Table Filling: Starting from the bottom level, which holds the single words, the algorithm fills in the table and moves upward gradually. Each cell in the table represents one substring of the sentence and is filled with every non-terminal that can generate that substring.

4. Combining Substrings: As it moves up the table, the algorithm splits each larger substring into two smaller substrings whose cells are already filled. Whenever a rule has a left-hand side A and a right-hand side BC, with B in the cell of the left part and C in the cell of the right part, A is added to the cell of the larger substring.

5. Final Verification: Once the table is filled, the algorithm checks the top cell, which represents the entire string (sentence). If this cell contains the start symbol of the grammar, the string is considered derivable from the given grammar.

6. Result and Visualization: Visualization of the table after each iteration can be used to enhance clarity and explainability of the processes for model implementers.
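The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the appendix. The grammar encoding (`TERMINAL_RULES` and `BINARY_RULES` dictionaries), the function names, and the toy grammar are all our own assumptions. The sketch also rejects the empty string outright rather than handling a rule that produces epsilon.

```python
# Hypothetical toy CNF grammar:
#   S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a
TERMINAL_RULES = {"a": {"A", "C"}, "b": {"B"}}       # terminal -> non-terminals
BINARY_RULES = {                                     # (B, C) -> non-terminals A
    ("A", "B"): {"S", "C"},
    ("B", "C"): {"S"},
    ("B", "A"): {"A"},
    ("C", "C"): {"B"},
}


def cyk_table(words, terminal_rules, binary_rules):
    """Fill the CYK table.

    table[length - 1][start] is the set of non-terminals that derive
    words[start:start + length].
    """
    n = len(words)
    table = [[set() for _ in range(n - length + 1)] for length in range(1, n + 1)]
    # Bottom row: substrings of length 1 come from terminal rules.
    for start, word in enumerate(words):
        table[0][start] = set(terminal_rules.get(word, ()))
    # Longer substrings: try every split into a left and a right part.
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            for split in range(1, length):
                left = table[split - 1][start]
                right = table[length - split - 1][start + split]
                for b in left:
                    for c in right:
                        table[length - 1][start] |= binary_rules.get((b, c), set())
    return table


def cyk_accepts(words, terminal_rules, binary_rules, start_symbol="S"):
    """True if the top cell of the table contains the start symbol."""
    if not words:
        return False  # assumption: the empty string is not handled here
    table = cyk_table(words, terminal_rules, binary_rules)
    return start_symbol in table[-1][0]


# Print the table from the top cell down to the single-symbol row.
table = cyk_table(list("baaba"), TERMINAL_RULES, BINARY_RULES)
for length in range(len(table), 0, -1):
    print(length, [sorted(cell) for cell in table[length - 1]])
print(cyk_accepts(list("baaba"), TERMINAL_RULES, BINARY_RULES))
```

The three nested loops over length, start position, and split point give the algorithm its cubic running time in the length of the input.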

#### Code Walkthrough <a name="paragraph2"></a>

#### CYK Parsing on Natural Language <a name="paragraph3"></a>

##### The Benefits of CYK in NLP and a Comparative Analysis with Transformers <a name="subparagraph31"></a>
Natural Language Processing (NLP) is an interdisciplinary subfield of computer science and linguistics that is primarily concerned with giving computers the ability to support and manipulate human language (*Wikipedia*). It encompasses a range of algorithmic tools that enable computers to understand and process human language. Among these tools are transformers and the CYK algorithm.

With the development and hype surrounding many large language models, transformers have garnered significant worldwide attention, possibly a byproduct of their attention mechanisms. While transformers are powerful tools in NLP, their approach to understanding language is based on statistical learning from large datasets. They are highly effective in many contexts *but might not always align perfectly with the complexities of human language understanding*. On the other hand, the CYK algorithm, with its basis in formal grammar theory, offers a more rule-based approach to parsing language. It can break down sentences into their grammatical components, making CYK particularly valuable in applications where precise syntax parsing is crucial, such as in compiling programming languages or in certain aspects of natural language understanding where the exact structure of sentences is significant. 

Which brings us to the next topic: parts of speech. 

##### Part of Speech <a name="subparagraph32"></a>
In NLP, as in any area of linguistics, part of speech is a key term. A part of speech is a category of words with similar grammatical properties. Words in the same part of speech play similar roles in sentences and follow similar rules of grammar. The main parts of speech in English include nouns, pronouns, verbs, adjectives, prepositions and so on. Identifying the parts of speech and assigning each word of the input to its part of speech are the main preparations that enable the CYK algorithm to analyze sentence structure.

For instance, with a suitable context-free grammar provided in Chomsky Normal Form (CNF) that shows 'blue' as an adjective, 'bird' as a noun, 'sings' as a verb, 'happily' as an adverb, 'in' as a preposition, 'the' as a definite article, and 'spring' as a noun, the CYK algorithm can determine whether a sentence like 'The blue bird sings happily in the spring' is structurally valid according to that grammar.
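As an illustration of the example above, here is one possible CNF grammar for that sentence together with a compact CYK recognizer. The phrase-level categories (`NP`, `Nom`, `VP`, `PP`), the specific rules, and the function name `recognizes` are our own assumptions; many other grammars would accept the same sentence. Input is lowercased and split on whitespace, which is a simplification rather than real tokenization.

```python
# Part-of-speech (lexical) rules: word -> parts of speech.
LEXICON = {
    "the": {"Det"}, "blue": {"Adj"}, "bird": {"N"}, "spring": {"N"},
    "sings": {"V"}, "happily": {"Adv"}, "in": {"P"},
}
# Binary rules: (left, right) -> non-terminals that produce that pair.
BINARY = {
    ("NP", "VP"): {"S"},       # S   -> NP VP
    ("Det", "N"): {"NP"},      # NP  -> Det N
    ("Det", "Nom"): {"NP"},    # NP  -> Det Nom
    ("Adj", "N"): {"Nom"},     # Nom -> Adj N
    ("V", "Adv"): {"VP"},      # VP  -> V Adv
    ("VP", "PP"): {"VP"},      # VP  -> VP PP
    ("P", "NP"): {"PP"},       # PP  -> P NP
}


def recognizes(sentence, lexicon=LEXICON, binary=BINARY, start="S"):
    """CYK recognition over word spans; spans[(i, j)] covers words[i:j]."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    spans = {}
    # Spans of one word get that word's parts of speech.
    for i, word in enumerate(words):
        spans[(i, i + 1)] = set(lexicon.get(word, ()))
    # Longer spans are built from two adjacent shorter spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):
                for left in spans[(i, k)]:
                    for right in spans[(k, j)]:
                        cell |= binary.get((left, right), set())
            spans[(i, j)] = cell
    return start in spans[(0, n)]


print(recognizes("The blue bird sings happily in the spring"))
print(recognizes("blue the bird sings happily"))
```

Because this small grammar only allows a verb phrase of the form verb plus adverb, a sentence such as "The bird sings" is rejected. Accepting it would require adding a rule that lets a verb phrase be a single verb.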

#### Sources <a name="sources"></a>