# Parse Tree Readability

This Jupyter notebook is for developing, explaining, and experimenting with parse tree readability, a measure of readability I'm developing. It rests on the premise that word length and sentence length are not the most important qualities of a piece of text when it comes to determining how difficult it is to understand.

## Libraries and Imports

We'll need several libraries for the corpora, for simple text analysis, and for parsing sentences.

In addition, we'll set up this notebook's graphing displays and create shared global objects, such as the American English hyphenation dictionary and the spaCy pipeline.

In [None]:
%matplotlib inline
from math import sqrt
import nltk
from nltk import Tree
import textacy
import pyphen  # hyphenation library
import spacy

pyphen.language_fallback('en_US')
dic = pyphen.Pyphen(lang='en_US')

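# The 'en' shortcut works on older spaCy releases; spaCy 3+ needs a full
# model name such as 'en_core_web_sm'.
en_nlp = spacy.load('en')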

The first thing we'll do is pull in our corpus. The corpus needs to contain a variety of reading levels within it.

Additionally, we need an already-parsed corpus. We can use the Penn Treebank for this.

In [None]:
wsj = nltk.corpus.treebank.parsed_sents('wsj_0012.mrg')
# Let's look at a sample sentence
w = wsj[0]
print(w)  # the tree structure
print(w.pos())  # a list of tagged words

In [None]:
production = w.productions()[0]
print(production)

If I'm able to, I'd like to use tagged, but non-parsed sentences.

TODO: get some small texts which are tagged with Penn Treebank tags


So, what are the reading levels of these texts according to SMOG, the current gold standard?

In [None]:
# TODO: get the SMOG scores of each one

Recap of the SMOG scores.

[[ TODO: if I'm going to not use individual works, but per-genre works, I'll need to be politic about leaping to conclusions. ]]

Next, let's begin to create our parsing functions. These are going to start simply, and get more complicated.

## Textual metrics

Let's begin with text-based metrics: sentence length and syllabification. We'll use the syllabification library `pyphen` to count the syllables in each word.

We can build from these to create the standard text-based metrics, such as SMOG, Flesch-Kincaid, etc.

Also note that every one of these takes a POS-tagged sentence, such as those in the Treebank corpus, arranged as a list of (word/punctuation, POS) tuples. They do not take raw text, so if you want to use them on text, you must tag it beforehand. This can be done with, e.g., spaCy:

```python
doc = en_nlp(u'One morning, when Gregor Samsa woke from troubled dreams, \
he found himself transformed in his bed into a horrible vermin. He lay on \
his armour-like back, and if he lifted his head a little he could see his \
brown belly, slightly domed and divided by arches into stiff sections.')
[[(word.text, word.tag_) for word in sent] for sent in doc.sents]
```

In [None]:
def depunctuate(tagged_sentence):
    """From a tagged sentence (as a list), returns a sentence without punctuation"""
    # Penn Treebank punctuation tags, including quotes and the -LRB-/-RRB- bracket tags
    punct_tags = {'(', ')', '-LRB-', '-RRB-', ',', '--', '.', ':', "''", '``'}
    return [(word, tag) for word, tag in tagged_sentence if tag not in punct_tags]
    
    
def n_words(tagged_sentence):
    """
    From a tagged sentence (as a list), returns the number of words in it,
    sans punctuation.
    """
    return len(depunctuate(tagged_sentence))

def word_lengths(tagged_sentence):
    """
    From a tagged sentence (as a list), returns a list of non-punctuation
    word lengths
    """
    return [len(word) for word, tag in depunctuate(tagged_sentence)]

def avg_word_length(tagged_sentence):
    """The average word length of the sentence (0.0 if it has no words)"""
    lengths = word_lengths(tagged_sentence)
    return sum(lengths) / len(lengths) if lengths else 0.0

def syllables(tagged_sentence):
    """
    From a tagged sentence (as a list), returns a list of syllable counts,
    one per non-punctuation word.
    """
    # pyphen returns the hyphenation points; syllables = points + 1
    return [len(dic.positions(word)) + 1 for word, tag in depunctuate(tagged_sentence)]

def n_monosyllable_words(tagged_sentence):
    """Returns the number of one syllable words in the (word, tag)-list sentence"""
    return len([sylls for sylls in syllables(tagged_sentence) if sylls == 1])

def n_polysyllable_words(tagged_sentence):
    """Returns the number of 3+ syllable words in the (word, tag)-list sentence"""
    return len([sylls for sylls in syllables(tagged_sentence) if sylls >= 3])
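
To see what these helpers return, here is a minimal, self-contained sketch on a hand-tagged sentence. It re-implements the word-level helpers in simplified form and, so that it runs without `pyphen`, swaps in a crude vowel-group heuristic (`rough_syllables`, a hypothetical stand-in) for the dictionary-based syllable count. Its numbers will not always agree with `pyphen`'s.

```python
import re

# Hand-tagged example in the (word, tag) format used above.
sentence = [('The', 'DT'), ('committee', 'NN'), ('approved', 'VBD'),
            ('the', 'DT'), ('proposal', 'NN'), (',', ','),
            ('reluctantly', 'RB'), ('.', '.')]

PUNCT_TAGS = {'(', ')', ',', '--', '.', ':'}  # punctuation tags to skip


def words_only(tagged_sentence):
    """Drop (word, tag) pairs whose tag marks punctuation."""
    return [word for word, tag in tagged_sentence if tag not in PUNCT_TAGS]


def rough_syllables(word):
    """ASSUMPTION: count vowel groups as a stand-in for pyphen's hyphenation."""
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))


words = words_only(sentence)
lengths = [len(word) for word in words]
sylls = [rough_syllables(word) for word in words]

print(len(words))                       # number of non-punctuation words
print(sum(lengths) / len(lengths))      # average word length
print(sum(1 for s in sylls if s >= 3))  # words with 3+ syllables (heuristic)
```

The heuristic over-counts words such as "approved", which is exactly why the real functions lean on a hyphenation dictionary instead.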

## Standard Readability Metrics

With these building blocks in place, we can create our own versions of the standard readability metrics.

In [None]:
def SMOG(text):
    """
    Computes the SMOG score of a piece of text, in this case, a list of tagged sentences.
    There must be at least 30 sentences in the text or a ValueError is raised.
    
    McLaughlin, G. Harry (May 1969). "SMOG Grading — a New Readability Formula" (PDF).
    Journal of Reading. 12 (8): 639–646.
    """
    if len(text) < 30:
        raise ValueError('There must be at least 30 sentences in the input for an '
                         'accurate readability score.')
    
    n_polysyllables = sum(n_polysyllable_words(sentence) for sentence in text)
    grade = 1.0430 * sqrt(n_polysyllables * (30 / len(text))) + 3.1291
    return grade
    
def flesch_kincaid_grade_level(text):
    """
    Computes the Flesch-Kincaid grade level of a piece of text, in this case,
    a list of tagged sentences.

    Kincaid JP, Fishburne RP Jr, Rogers RL, Chissom BS (February 1975).
    "Derivation of new readability formulas (Automated Readability Index,
    Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel".
    Research Branch Report 8-75, Millington, TN: Naval Technical Training,
    U. S. Naval Air Station, Memphis, TN.
    """
    
    total_words = sum(n_words(sentence) for sentence in text)
    total_syllables = sum(sum(syllables(sentence)) for sentence in text)
    grade = 0.39 * (total_words / len(text)) + 11.8 * (total_syllables / total_words) - 15.59
    return grade
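
As a sanity check on the arithmetic, here is a small sketch that applies the same two formulas directly to raw counts rather than to tagged sentences. The function names `smog_from_counts` and `fk_grade_from_counts` and the counts themselves are hypothetical, chosen only to make the numbers easy to follow.

```python
from math import sqrt


def smog_from_counts(n_polysyllables, n_sentences):
    """SMOG grade from a polysyllable count, normalised to 30 sentences."""
    return 1.0430 * sqrt(n_polysyllables * (30 / n_sentences)) + 3.1291


def fk_grade_from_counts(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw word, sentence and syllable counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# Hypothetical counts, not taken from any real text.
smog = smog_from_counts(n_polysyllables=30, n_sentences=30)
fk = fk_grade_from_counts(n_words=100, n_sentences=5, n_syllables=150)

print(round(smog, 2))  # 1.0430 * sqrt(30) + 3.1291
print(round(fk, 2))    # 0.39 * 20 + 11.8 * 1.5 - 15.59
```

Both grades depend only on a handful of surface counts, which is the limitation the parse-tree metrics later in this notebook are meant to address.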

In [None]:
# TODO: test these against the built-in versions in textacy

For the next round of text analysis functions, let's dig a little deeper into the POS analysis. Remember that this is still sans parse-tree-ification.

## Tag-based Metrics

These use the POS tags. Since we're using the Penn Treebank, we need to make sure that whatever tags we're using are within this tagset; some tagged texts from the Brown Corpus, for example, use other tagsets.
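
One way to guard against a stray tagset is to check each sentence against the list of Penn Treebank tags before computing anything. The sketch below is an assumption about how that check might look: `PTB_TAGS` and `unknown_tags` are hypothetical names, and the last three entries (`-LRB-`, `-RRB-`, `-NONE-`) are extras that I believe NLTK's Treebank sample uses rather than part of the 36 core POS tags.

```python
# The 36 Penn Treebank POS tags, the punctuation tags, and (as an assumption)
# the bracket and trace tags that appear in NLTK's Treebank sample.
PTB_TAGS = set("""
CC CD DT EX FW IN JJ JJR JJS LS MD NN NNS NNP NNPS PDT POS PRP PRP$
RB RBR RBS RP SYM TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB
# $ '' `` ( ) , . : -LRB- -RRB- -NONE-
""".split())


def unknown_tags(tagged_sentence):
    """Return the set of tags in the sentence that are not in PTB_TAGS."""
    return {tag for word, tag in tagged_sentence if tag not in PTB_TAGS}


treebank_style = [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('.', '.')]
brown_style = [('The', 'AT'), ('dog', 'NN'), ('barked', 'VBD'), ('.', '.')]

print(unknown_tags(treebank_style))  # empty: every tag is recognised
print(unknown_tags(brown_style))     # flags the Brown article tag 'AT'
```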

In [None]:
def n_POSs(tagged_sentence):
    """Returns the number of unique POS's in the tagged sentence"""
    return len(set(pos for word, pos in tagged_sentence))

# TODO: what else?
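
One possible direction for the TODO above, offered only as a sketch: count how often each tag occurs and what share of the tokens falls into a tag family. `tag_counts` and `tag_proportion` are hypothetical helpers, not part of the metrics defined so far, and whether they say anything useful about readability is untested.

```python
from collections import Counter


def tag_counts(tagged_sentence):
    """Frequency of each POS tag in a (word, tag)-list sentence."""
    return Counter(tag for word, tag in tagged_sentence)


def tag_proportion(tagged_sentence, prefix):
    """Fraction of tokens whose tag starts with `prefix`, e.g. 'NN' or 'VB'."""
    if not tagged_sentence:
        return 0.0
    matching = sum(1 for word, tag in tagged_sentence if tag.startswith(prefix))
    return matching / len(tagged_sentence)


example = [('The', 'DT'), ('old', 'JJ'), ('dogs', 'NNS'), ('chased', 'VBD'),
           ('the', 'DT'), ('cat', 'NN'), ('.', '.')]

print(tag_counts(example))            # per-tag frequencies
print(tag_proportion(example, 'NN'))  # share of noun tags (NN, NNS, ...)
```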

## Tree-based Metrics

These use the structure of the parse trees, not the productions (yet!), to measure things. We'll start with simple measurements, and build on top of those.

First, of course, we need to get the parse trees. [ TODO ]

In [None]:
def get_trees(sentence):
    """
    Returns all possible parse trees from a sentence.
    The sentence is POS-tagged, and this returns a list of nltk.Trees
    """
    pass  # TODO

def tree_depth(tree):
    """Returns the tree depth of an individual parse tree. O(n)"""
    return 0  # TODO - there may be a built-in method for this

def max_tree_depth(trees):
    """Returns the max tree depth over a collection of parse trees"""
    return max((tree_depth(tree) for tree in trees))

def avg_tree_depth(trees):
    """Returns the average tree depth over a collection of parse trees"""
    return sum(tree_depth(tree) for tree in trees) / len(trees)
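
For the `tree_depth` stub, a recursive height calculation is probably enough. The sketch below works on plain nested lists so that it runs without NLTK; since `nltk.Tree` subclasses `list`, I'd expect the same recursion to apply to real parse trees, and `Tree.height()` looks like the built-in equivalent. The leaf-counts-as-1 convention is my assumption, chosen to mirror `Tree.height()`.

```python
def tree_depth(node):
    """Height of a tree given as nested lists, counting a bare leaf as 1."""
    if not isinstance(node, list):
        return 1  # a leaf token (a plain string)
    # One level for this node plus the deepest child; an empty node has height 1.
    return 1 + max((tree_depth(child) for child in node), default=0)


# (S (NP (DT The) (NN dog)) (VP (VBD barked))) with the labels omitted:
# each list is a constituent, each string is a word.
sentence_tree = [[['The'], ['dog']], [['barked']]]

print(tree_depth(sentence_tree))
```

Each node is visited once, so this matches the O(n) note in the stub's docstring.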

## Parse-tree-based Metrics

These use the parse trees to measure things. We'll start with simple measurements, and build on top of those.
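
To make the idea concrete, here is a rough sketch of counting constituents by label. Every non-terminal node corresponds to one production with that label on its left-hand side, so counting nodes labelled `NP` counts `NP` productions. The sketch parses Treebank-style bracketed strings itself so that it runs without NLTK; `parse_bracketed`, `base_label`, and `count_label` are hypothetical names. Using `SBAR` as a proxy for subordinate clauses is an assumption on my part.

```python
import re


def parse_bracketed(text):
    """Parse '(S (NP (DT The)) ...)' into nested (label, children) tuples."""
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                  # consume '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                  # consume ')'
        return (label, children)

    return parse_node()


def base_label(label):
    """Strip function tags and indices: 'NP-SBJ-1' -> 'NP'."""
    if label.startswith('-'):     # leave tags like -NONE- untouched
        return label
    return re.split(r'[-=]', label)[0]


def count_label(node, target):
    """Count the nodes in the tree whose base label equals `target`."""
    label, children = node
    own = 1 if base_label(label) == target else 0
    return own + sum(count_label(child, target)
                     for child in children if isinstance(child, tuple))


tree = parse_bracketed(
    '(S (NP-SBJ (PRP He)) (VP (VBD said) (SBAR (IN that) '
    '(S (NP-SBJ (DT the) (NN dog)) (VP (VBD barked))))) (. .))')

print(count_label(tree, 'NP'))    # noun phrases
print(count_label(tree, 'SBAR'))  # candidate subordinate clauses
```

With real `nltk.Tree` objects, filtering `tree.subtrees()` by label should give the same counts without a hand-written parser.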

In [None]:
def n_productions(parse_tree, production):
    """Returns the number of productions of type `production` in the parse_tree"""
    return 0  # TODO
    
def n_noun_phrases(parse_tree):
    """Returns the number of noun phrases in the parse tree"""
    return n_productions(parse_tree, 'NP')

def n_subordinate_clauses(parse_tree):
    """Returns the number of subordinate clauses in the parse tree"""
    return 0  # TODO

def possible_pos_tags(sentence):
    """
    Words may be parsed as multiple parts of speech; for example, "parts" can
    be a plural noun or a present singular verb.
    Takes a tagged sentence, ignoring the current tags, and maps the list to
    a list of the count of POS's available for each word.
    E.g.: [("Parts", 'NNS'), ('of', 'IN')...] -> [2, 1, ...]
    """
    return [1 for word_pair in sentence]  # TODO

def possible_taggings(sentence):
    """
    Similar to the above, but sees how many of these possible POS's were
    _actually_ considered by the CKY algorithm (or whatever the library I'm
    using to tag these uses).
    """
    return [1 for word_pair in sentence]  # TODO
    # I think I can't use spaCy for this; I'll look into CoreNLP or SyntaxNet or nltk

def n_negations(sentence):
    """
    Returns the number of negations in the sentence
    """
    return 0  # TODO -- how!?
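
For `possible_pos_tags`, one option that needs no tagger internals is to build a lexicon from an already-tagged corpus and look up how many distinct tags each word has been seen with. This is only a sketch under that assumption: `build_tag_lexicon` and `possible_pos_counts` are hypothetical names, the training data is made up, and words missing from the lexicon are simply counted as having one tag.

```python
from collections import defaultdict


def build_tag_lexicon(tagged_sentences):
    """Map each lower-cased word to the set of tags it was seen with."""
    lexicon = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word.lower()].add(tag)
    return lexicon


def possible_pos_counts(sentence, lexicon):
    """Number of tags seen for each word; unseen words count as 1."""
    return [max(1, len(lexicon.get(word.lower(), ())))
            for word, tag in sentence]


# Tiny, made-up training data for illustration only.
training = [
    [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN')],
    [('She', 'PRP'), ('parts', 'VBZ'), ('her', 'PRP$'), ('hair', 'NN')],
]
lexicon = build_tag_lexicon(training)

print(possible_pos_counts([('Parts', 'NNS'), ('of', 'IN')], lexicon))
```

The counts are only as good as the corpus behind the lexicon, so a larger tagged corpus such as the full Treebank sample would be the natural source.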

## Readability

In this part I:

1. Calculate all of these stats for the pieces of the corpus

2. Calculate the SMOG scores

3. Figure out some relationship $f$ as above.

In [None]:
def combined_textual_stats(corpus):
    """
    For a corpus (a list of POS-tagged sentences), determine all statistics as above
    and return a dictionary of these.
    """
    n_w = sum(n_words(sentence) for sentence in corpus)
    a_w_l = sum(sum(word_lengths(sentence)) for sentence in corpus) / n_w
    s = sum(sum(syllables(sentence)) for sentence in corpus)
    n_m_w = sum(n_monosyllable_words(sentence) for sentence in corpus)
    n_p_w = sum(n_polysyllable_words(sentence) for sentence in corpus)

    return {
        'n_words': n_w,
        'avg_word_length': a_w_l,
        'syllables': s,
        'n_monosyllable_words': n_m_w,
        'n_polysyllable_words': n_p_w,
    }