# _Working with different kinds of ‘text’ in the Digital Humanities_
## Singapore University of Technology and Design, 18-19 March, 2019
## Introduction to NLP: From Counting to Language Models

Welcome to From Counting to Language Models!
This is a hands-on workshop focusing on various foundational techniques for Natural Language Processing and their applications in Digital Humanities and beyond. If anything, it's a methods workshop more than a critical or theoretical workshop: the emphasis is put on the how rather than the why or what for.

<!--
The workshop will be split into 4 sections with 10 minute breaks in-between. The sections get incrementally more advanced, building on concepts and methods from the previous sections.
-->


To follow along, you can run the script portions piecemeal, in order, as we progress through the workshop material. Up to you. Familiarity with programming concepts and Python is required; NumPy and Jupyter are desirable.

Instructor:


<figure>
    <img src="http://postdata.linhd.uned.es/wp-content/uploads/2019/02/javierweb.jpg"
         alt="Javier's picture">
    <figcaption>
        <div align="center">
        <strong>Javier de la Rosa</strong>
        <br/>
        <em>versae@linhd.uned.es</em>, <em><a href="https://twitter.com/versae">@versae</a></em>
        <br/>
        NLP Postdoctoral Fellow at <a href="http://postdata.linhd.uned.es/">UNED's POSTDATA Project</a>
       </div>
    </figcaption>
</figure>


## What are we covering today?
- What is NLP
- NLP in Python
- Tokenization
- Part of Speech Tagging
- Named Entity Recognition and Relation Detection
- Word transformations
- Keywords in context
- Counting
- TF-IDF and Document-Term Matrices
- Topic Models
- Clustering and PCA
- ~~Word-word matrices~~
- ~~Word embeddings~~
- ~~Language models~~

Use cases:
- Readability indices
- Corpus level statistics

## NLP in Python

Python ships with a very mature regular expression library, `re`, which is a basic building block of natural language processing. More advanced tasks, however, need dedicated libraries. In the Python ecosystem, the Natural Language Toolkit, abbreviated as `NLTK`, was until recently the only practical choice. Unfortunately, the library has not aged well: even though it has been updated to work with newer versions of Python, it does not provide the speed we might need to process large corpora, as it was designed primarily for teaching.

A more recent alternative is `spaCy`, which is much faster since it is written in Cython, a C-like superset of Python optimized for speed. See the [documentation](https://spacy.io/usage/models) for details.

Both libraries are complex, so wrappers exist to simplify their APIs. The two most popular are `TextBlob` for NLTK and CLiPS Pattern, and `textacy` for spaCy. In this workshop we will be using spaCy with a touch of textacy thrown in at the very end.

In [None]:
%%capture --no-stderr
import sys
!pip install Cython
!pip install spacy nltk textacy textblob requests matplotlib scikit-learn
!python -m spacy download en
!python -m spacy download es
!python -m nltk.downloader all
print("All done!", file=sys.stderr)

In [None]:
%matplotlib inline

In [None]:
import spacy

Let's load the English data for now. Support for other [languages is available as well](https://spacy.io/usage/models), although some features might not work. 

In [None]:
nlp = spacy.load('en')

We're also going to need a couple of helper functions to retrieve some texts from US presidents' State of the Union speeches.

In [None]:
# helper functions
import requests

def get_text(url):
    return requests.get(url).text

def get_speech(url):
    page = get_text(url)
    full_text = page.split('\n')
    return " ".join(full_text[2:])

In [None]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/clinton2000.txt"
clinton_speech = get_speech(clinton_url)
print(clinton_speech[:500],  "...")

Now, let's create a spaCy `Doc` from the text.

In [None]:
doc = nlp(clinton_speech)

## Tokenization

While basic, some cleaning has been done already. Compare these 2 texts:

In [None]:
get_text(clinton_url)[:500]

In [None]:
clinton_speech[:500]

In NLP, the act of splitting text is called tokenization, and each of the individual chunks is called a token. Therefore, we can talk about word tokenization or sentence tokenization depending on what it is that we need to divide the text into.
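To get a feel for what a tokenizer does before handing the job to spaCy, here is a minimal sketch that uses only the standard `re` module. The function names `word_tokenize` and `sentence_tokenize` and the sample string are illustrative assumptions, and the rules are far cruder than spaCy's (abbreviations such as "Dr." would be split incorrectly).

```python
import re

def word_tokenize(text):
    # A token is either a run of word characters/apostrophes
    # or a single punctuation mark.
    return re.findall(r"[\w']+|[^\w\s]", text)

def sentence_tokenize(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Well, I am not 90 years old. Are you?"
print(word_tokenize(sample))
print(sentence_tokenize(sample))
```

Real tokenizers layer language-specific exceptions on top of rules like these, which is why the rest of the workshop relies on spaCy instead.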

In [None]:
# word level
for token in doc[:20]:
    print(token.text)

In [None]:
# sentence level
for sent in list(doc.sents)[:10]:
    print("- ", sent.text)

spaCy also makes it easy to extract noun phrases, which can be useful at times.

In [None]:
# noun phrases
for phrase in list(doc.noun_chunks)[:10]:
    print(phrase)

## Part of Speech Tagging

SpaCy also allows you to perform Part-Of-Speech tagging, a kind of grammatical chunking, out of the box. For POS, SpaCy follows the Universal Dependencies tag set.

In [None]:
# simple part of speech tag
for token in doc[:20]:
    print(token.text, token.pos_, sep="\t")

Detailed information can also be obtained if available. In these cases, the format will depend on the language and corpus used. For English, [MBSP tags](http://www.clips.ua.ac.be/pages/mbsp-tags) are used, while in Spanish, the [Universal Feature inventory](https://universaldependencies.org/u/feat/index.html) is available.

In [None]:
# detailed tag
# For what these tags mean, you might check out http://www.clips.ua.ac.be/pages/mbsp-tags
for token in doc[:20]:
    print(token.text, token.tag_, sep="\t")

A syntactic dependency is a relation between two words in a sentence, in which one word (the head) governs the other (the dependent).

In [None]:
# syntactic dependency
for token in doc[:20]:
    print(token.text, token.dep_, sep="\t")

However, it's easier to understand with a tree.

In [None]:
# visualizing the sentence
from spacy import displacy

In [None]:
first_sent = list(doc.sents)[0]
first_sent

In [None]:
single_doc = nlp(str(first_sent))
options = {"compact": True, 'bg': 'white',
           'color': 'black', 'font': 'Source Sans Pro'}
displacy.render(single_doc, style="dep", jupyter=True, options=options)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `count_chars(text)` that receives `text` and returns the total number of characters ignoring spaces and punctuation marks. For example, `count_chars("Well, I am not 90 years old.")` should return `20`.
<br/>
<em><strong>Hint</strong>: You could count the characters in the words.</em>
</p>
</div>

In [None]:
def count_chars(text):
    doc = ...
    words = [... for token in doc if ... != 'PUNCT']
    return ...

count_chars("Well, I am not 90 years old.")

## Named Entity Recognition

Named Entity Recognition (NER) is a popular technique used in information extraction to identify and segment the named entities and classify or categorize them under various predefined classes.

For English, spaCy uses the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus, which is sufficiently rich and specific regarding the [information it can capture](https://spacy.io/api/annotation#named-entities).

In [None]:
for ent in doc.ents[:20]:
    print(ent.text, ent.label_, sep="\t")

If you're working on tokens, you can still access the entity type. Notice, though, that multi-word entities are broken up here because we're iterating over tokens.

In [None]:
for token in doc[:150]:
    if token.ent_type_ != '':  # compare by value; `is not` tests identity
        print(token.text, token.ent_type_, f"({spacy.explain(token.ent_type_)})", sep="\t")

spaCy comes with built-in entity visualization.

In [None]:
displacy.render(single_doc, style="ent", jupyter=True)

In [None]:
%%capture --no-display
for sent in list(doc.sents)[:10]:
    displacy.render(nlp(sent.text), style="ent", jupyter=True)

It is possible to train your own entity recognition model, and to train other types of models in SpaCy, but you need sufficient labeled data to make it work well.

## Word transformations

Lemmas

In [None]:
for token in doc[:20]:
    print(token.text, token.lemma_, sep="\t")

In [None]:
for token in nlp('here are octopi'):
    print(token.lemma_)

In [None]:
for token in nlp('There have been many mice and geese surrounding the pond.'):
    print(token, token.lemma_, sep="\t")

Say we just want to lemmatize verbs, here those carrying the detailed tag `VBP` (non-third-person singular present).

In [None]:
for token in doc[:1500]:
    if token.tag_ == "VBP":
        print(token.text, token.lemma_, sep="\t")

If you're using the simple part of speech instead of the detailed tags, filter on `pos_`:

In [None]:
for token in doc[:250]:
    if token.pos_ == "VERB":
        print(token.text, token.lemma_, sep="\t")

Lowercasing

In [None]:
for token in doc[:20]:
    print(token.text, token.lower_, sep="\t")

## Keyword in Context (KWIC)

"A KWIC index, [the most common format for concordance lines], is formed by sorting and aligning the words within an article title to allow each word (except the stop words) in titles to be searchable alphabetically in the index." -- https://en.wikipedia.org/wiki/Key_Word_in_Context.

It also allows for a quick exploration of how specific words are being used and in what context. One quick (but potentially very resource-intensive) way of computing KWIC is by using n-grams. N-grams are contiguous sequences of _n_ tokens; thus a 2-gram (bigram) is a group of 2 words, and a 3-gram a group of 3. They are built as follows.

```
This is a sentence
```

If we extract all bi-grams, we get

`This is`, `is a`, `a sentence`.

And if we now focus on, for example, the context of `a`, we can see very quickly that it is being used as follows:
```
is a
   a sentence.
```
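The idea above can be sketched in plain Python. The helper names `ngrams` and `kwic`, and the `window` parameter, are assumptions made for illustration; they are not part of spaCy or textacy.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, keyword, window=2):
    # For every occurrence of keyword, keep `window` tokens on each side.
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

tokens = "This is a sentence".split()
print(ngrams(tokens, 2))
for left, kw, right in kwic(tokens, "a", window=1):
    print(f"{left:>10} [{kw}] {right}")
```

Storing every n-gram of a long text is what makes the n-gram route resource-intensive; the `kwic` helper avoids that by scanning the tokens once.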

spaCy does not natively support splitting by n-grams, but its wrapper `textacy` does, so all we need to do is reconstruct a basic search over the n-grams with textacy.

In [None]:
import textacy

In [None]:
list(textacy.extract.ngrams(nlp("This is a sentence"), 2, filter_stops=False))

However, textacy already includes KWIC out of the box.

In [None]:
textacy.text_utils.KWIC(doc.text, "people")

## Counting

Counting is at the heart of Natural Language Processing, and in some sub-disciplines it is still the king of methods. Let's see a couple of approaches to counting.

First, we will use the builtin `Counter()` class and a sample document containing a couple of sentences.

In [None]:
from collections import Counter

In [None]:
sample_sents = "One fish, two fish, red fish, blue fish. One is less than two."

Create a list of the words without the punctuation.

In [None]:
new_doc = nlp(sample_sents)
words = [token.text for token in new_doc if token.pos_ != 'PUNCT']
words

In [None]:
counter = Counter(words)

All the distinct words in a document or a corpus make up what we call its vocabulary or lexicon.

In [None]:
counter.keys()

And the frequency of each term in a document can be then determined.

In [None]:
counter.most_common()

In [None]:
counter["fish"]

This is the basis of what is known as bag of words (BoW), a widely used technique to transform text into numbers (hence: vectorization) suitable for machine learning algorithms. It's also supported in textacy out of the box (with some caveats).

In [None]:
tdoc = textacy.Doc(nlp(sample_sents))

In [None]:
tdoc.to_bag_of_words(normalize=None, as_strings=True)

The main difference is that textacy always removes stop words. It should actually be optional.

In [None]:
tdoc.count("fish")
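To make the "text into numbers" step concrete, here is a small sketch with only the standard library. The two token lists in `toy_docs` are made-up examples, and the variable names are illustrative. Each document becomes a vector of counts over a shared, sorted vocabulary.

```python
from collections import Counter

# Two already-tokenized, lowercased toy documents (assumed input).
toy_docs = [
    ["one", "fish", "two", "fish"],
    ["red", "fish", "blue", "fish"],
]

# Shared vocabulary: every distinct word across all documents, in a fixed order.
vocabulary = sorted({word for doc in toy_docs for word in doc})

# One count vector per document; position i holds the count of vocabulary[i].
vectors = []
for doc in toy_docs:
    counts = Counter(doc)
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is lost in this representation, which is where the "bag" in bag of words comes from.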

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Let's define the lexicon of a person as the set of different words she uses to speak. Write a function `get_lexicon(text, n)` that receives `text` and `n` and returns the lemmas of nouns, verbs, and adjectives that are used at least `n` times.
<br/>
</p>
</div>

In [None]:
def get_lexicon(text, n):
    doc = nlp(text)
    # collect the lemmas of nouns, verbs, and adjectives
    words = [... for token in doc if token.pos_ in ...]
    # count the words     
    counter = Counter(...)
    # filter by number
    filtered_words = [word for word in counter if ...]
    return sorted(filtered_words)
    
get_lexicon(clinton_speech, 30)

## TF-IDF and Document-Term Matrices

From our intuition, we think that the words which appear more often should have a greater weight in textual data analysis, but that's not always the case. Words such as “the”, “will”, and “you” —stopwords— appear the most in a corpus of text, but are of very little significance. Instead, the words which are rare are the ones that actually help in distinguishing between the data, and carry more weight.

TF-IDF stands for “Term Frequency-Inverse Document Frequency”; it is a vectorization algorithm that assigns weights based on the relative importance of a word within a document and within the corpus it belongs to.

- Term Frequency (tf): gives the frequency of the word ($t$) in each document ($d$) in the corpus ($D$). It is the ratio of the number of times the word appears in a document to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Each document has its own tf.
- Inverse Document Frequency (idf): used to calculate the weight of rare words across all documents in the corpus. Words that occur rarely in the corpus have a high idf score.

$$idf( t, D ) = \log \frac{ | D | }{ 1 + | \{ d \in D : t \in d \} | }$$

Combining these two we come up with the TF-IDF score for a word in a document in the corpus. It is the product of tf and idf.

$$tfidf( t, d, D ) = tf( t, d ) \times idf( t, D )$$
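The two formulas above can be worked through by hand with a few lines of Python. This is a sketch under stated assumptions: the function names and the three-document `toy_corpus` are illustrative, and the idf uses exactly the $1 +$ smoothing shown above. Libraries often apply slightly different smoothing variants, so their numbers may not match these.

```python
import math

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(|D| / (1 + number of documents containing the term))
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + n_containing))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Hypothetical, already-tokenized documents.
toy_corpus = [
    ["sky", "blue"],
    ["sun", "bright"],
    ["sun", "sky", "bright"],
]

print(tfidf("blue", toy_corpus[0], toy_corpus))  # rare term: positive weight
print(tfidf("sun", toy_corpus[1], toy_corpus))   # in 2 of 3 docs: idf is log(1) = 0
```

Note that with this particular smoothing, a term found in two of the three documents already gets a weight of zero, and a term found in every document would get a slightly negative one.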

Let's now compile a tiny corpus to illustrate.

In [None]:
raw_corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

In [None]:
corpus = textacy.Corpus('en', texts=raw_corpus)
corpus

In [None]:
from textacy.vsm.vectorizers import Vectorizer

In [None]:
vectorizer = Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth', apply_dl=False)

In [None]:
vectorizer.fit_transform([
    doc.to_terms_list(normalize=None, as_strings=True, ngrams=(1,), filter_stops=False)
    for doc in corpus.docs
]).todense().T

The matrix above is the document-term matrix. Before transposition, rows represent documents and columns represent terms, with tf-idf weights as cell values; because of the `.T` applied here, it is displayed with terms as rows and documents as columns.

In [None]:
vectorizer.vocabulary_terms

A doc-term matrix with raw counts (tf) can be obtained instead.

In [None]:
vectorizer = Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False)
vectorizer.fit_transform([
    doc.to_terms_list(normalize=None, as_strings=True, ngrams=(1,), filter_stops=False)
    for doc in corpus.docs
]).todense().T

In [None]:
vectorizer.vocabulary_terms

Let's now try with a bigger corpus of US presidents' speeches.

In [None]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/clinton2000.txt"
bush_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/bush2008.txt"
obama_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/obama2016.txt"
trump_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/trump.txt"

In [None]:
clinton_speech = get_speech(clinton_url)
bush_speech = get_speech(bush_url)
obama_speech = get_speech(obama_url)
trump_speech = get_speech(trump_url)

In [None]:
speeches = textacy.Corpus(
    'en',
    texts=[clinton_speech, bush_speech, obama_speech, trump_speech],
    metadatas=[{"name": "clinton"}, {"name": "bush"}, {"name": "obama"}, {"name": "trump"}]
)

In [None]:
speeches

In [None]:
vectorizer = Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth', apply_dl=False)  # tf-idf
terms_list = [
    doc.to_terms_list(normalize=None, as_strings=True, ngrams=(1,), filter_stops=True)
    for doc in speeches.docs
]
doc_term_matrix = vectorizer.fit_transform(terms_list)

In [None]:
doc_term_matrix

In [None]:
vectorizer.terms_list[250:275]

## Topic Models

Once we have our weighted document-term matrix, it is easy to calculate the most prominent topics using topic modeling.

In [None]:
model = textacy.tm.TopicModel('lsa', n_topics=20)
model.fit(doc_term_matrix)
model

In [None]:
doc_topic_matrix = model.transform(doc_term_matrix)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=range(4)):
    print('topic', topic_idx, ':', '   '.join(top_terms))

Visualize the model

In [None]:
model.termite_plot(doc_term_matrix, vectorizer.id_to_term,
                   topics=range(4),  n_terms=25, sort_terms_by='seriation')

## Clustering with PCA

It is now also possible to project the documents onto two dimensions with PCA, based on their tf-idf weights, and see how they group together.

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [None]:
X = doc_term_matrix.toarray()  # dense NumPy array for PCA
labels_color_map = {
    'Clinton': '#20b2aa', 'Bush': '#ff7373', 'Obama': '#005073', 'Trump': '#F0926E'
}
labels = list(labels_color_map.keys())
reduced_data = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots(figsize=(16, 8))
for index, instance in enumerate(reduced_data):
    pca_comp_1, pca_comp_2 = reduced_data[index]
    color = labels_color_map[labels[index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
ax.legend(labels);

## Readability indices

Readability indices are ways of assessing how easy or complex it is to read a particular text based on the words and sentences it has. They usually output scores that correlate with grade levels.

A couple of indices that are relatively easy to calculate are the [Automated Readability Index (ARI)](https://en.wikipedia.org/wiki/Automated_readability_index) and the [Coleman-Liau Index](https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index):

$$
ARI = 4.71\frac{\text{chars}}{\text{words}} + 0.5\frac{\text{words}}{\text{sentences}} - 21.43
$$
$$
CL = 0.0588 \times \frac{\text{letters}}{\text{words}} \times 100 - 0.296 \times \frac{\text{sentences}}{\text{words}} \times 100 - 15.8
$$
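As a quick arithmetic check of both formulas, the sketch below computes them directly from raw counts. The helper names `ari_from_counts` and `cl_from_counts` and the counts themselves (450 letters, 100 words, 5 sentences) are made up for illustration; extracting those counts from a real text is a separate step.

```python
def ari_from_counts(chars, words, sentences):
    # Automated Readability Index from raw counts.
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

def cl_from_counts(letters, words, sentences):
    # Coleman-Liau works on averages per 100 words.
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# Hypothetical text: 450 letters, 100 words, 5 sentences.
print(round(ari_from_counts(450, 100, 5), 3))  # 4.71*4.5 + 0.5*20 - 21.43
print(round(cl_from_counts(450, 100, 5), 3))   # 0.0588*450 - 0.296*5 - 15.8
```

Both scores land near grade 9 to 10 for these counts, which shows how the two indices are calibrated to the same scale even though they weigh sentence length differently.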


In [None]:
# problem: the tokens in spacy include punctuation. to get this right, we should remove punct
# we then have to make sure our functions handle lists of words rather than spacy doc objects

def coleman_liau_index(doc, words):
    # letters are counted over the punctuation-free word list, sentences over the doc
    return (0.0588 * letters_per_100(words)) - (0.296 * sentences_per_100(doc, words)) - 15.8

def count_chars(words):
    return sum(len(w) for w in words)

def sentences_per_100(doc, words):
    return (len(list(doc.sents)) / len(words)) * 100

def letters_per_100(words):
    return (count_chars(words) / len(words)) * 100

In [None]:
# To get just the words, without punctuation tokens
def return_words(doc):
    return [token.text for token in doc if token.pos_ != 'PUNCT']

In [None]:
fancy_doc = nlp("Regional ontology, clearly defined by Heidegger, equals, if not surpasses, the earlier work of Heidegger's own mentor, Husserl")
fancy_words = return_words(fancy_doc)
fancy_words

In [None]:
coleman_liau_index(fancy_doc, fancy_words)

In [None]:
doc = nlp(clinton_speech)
clinton_speech_words = return_words(doc)
coleman_liau_index(doc, clinton_speech_words)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `auto_readability_index(doc)` that receives a spaCy `Doc` and returns the Automated Readability Index (ARI) score as defined above.
<br/>
<em><strong>Hint</strong>: Feel free to use functions we've defined before.</em>
   
</p>
</div>

In [None]:
def auto_readability_index(doc):
    words = ...
    chars = ...
    words = ...
    sentences = ...
    return (4.71 * (chars / words)) + (0.5 * (words / sentences)) - 21.43

In [None]:
auto_readability_index(fancy_doc)

In [None]:
auto_readability_index(doc)

In [None]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/clinton2000.txt"
bush_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/bush2008.txt"
obama_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/obama2016.txt"
trump_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/master/data/trump.txt"

In [None]:
clinton_speech = get_speech(clinton_url)
bush_speech = get_speech(bush_url)
obama_speech = get_speech(obama_url)
trump_speech = get_speech(trump_url)

In [None]:
speeches = {
    "clinton": nlp(clinton_speech),
    "bush": nlp(bush_speech),
    "obama": nlp(obama_speech),
    "trump": nlp(trump_speech),
}

In [None]:
print("Name", "Chars", "Words", "Unique", "Sentences", sep="\t")
for speaker, speech in speeches.items():
    words = return_words(speech)
    print(speaker, count_chars(words), len(words), len(set(words)), len(list(speech.sents)), sep="\t")

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `avg_sentence_length(doc)` that receives a spaCy `doc` and returns the average number of words in a sentence for the doc. You might need to use our `return_words` function.
</p>
</div>

In [None]:
# average sentence length
def avg_sentence_length(doc):
    return ... / len(list(doc.sents))

In [None]:
for speaker, speech in speeches.items():
    print(speaker, avg_sentence_length(speech))

We might stop to ask why Obama's speech seems to have shorter sentences. Is it a deliberate rhetorical choice? Or could it be an issue with the data itself?

In this case, if we look closely at the txt file, we can see that the transcription of the speech included the word 'applause' as a one-word sentence throughout the text. Let's see what happens if we filter that out.

In [None]:
obama_clean_speech = obama_speech.replace("(Applause.)", "")

In [None]:
# Let's compare lengths of the texts. We should see a difference.

len(obama_speech), len(obama_clean_speech)

In [None]:
# Now let's recheck the average sentence length of Obama's speech.
avg_sentence_length(nlp(obama_clean_speech))

In [None]:
speeches = {
    "clinton": nlp(clinton_speech),
    "bush": nlp(bush_speech),
    "obama": nlp(obama_clean_speech),
    "trump": nlp(trump_speech),
}

Let's write a quick function to get the most common words used by each person.

In [None]:
def most_common_words(doc, n):
    words = return_words(doc)
    c = Counter(words)
    return c.most_common(n)

In [None]:
for speaker, speech in speeches.items():
    print(speaker, most_common_words(speech, 10))

You can see quickly that we need to remove some of these most common words. To do this, we'll use common lists of stopwords.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
print(list(STOP_WORDS)[:100])

In [None]:
# to make sure we've got all the punctuation out and to remove some contractions, we'll have a custom stoplist
custom_stopwords = [',', '-', '.', '’s', '-', ' ', '(', ')', '--', '---', 'n’t', ';', "'s", "'ve", "  ", "’ve"]

In [None]:
def most_common_words(doc, n):
    words = [token.text for token in doc if token.pos_ != 'PUNCT'
             and token.lower_ not in STOP_WORDS and token.text not in custom_stopwords]
    c = Counter(words)
    return c.most_common(n)

In [None]:
for speaker, speech in speeches.items():
    print(speaker, ": ", most_common_words(speech, 10), "\n")

This sort of exploratory work is often the first step in figuring out how to clean a text for text analysis. 

Let's assess the lexical richness, defined as the ratio of the number of unique words to the total number of words.

In [None]:
def lexical_richness(doc):
    words = return_words(doc)
    return len(set(words)) / len(words)

In [None]:
for speaker, speech in speeches.items():
    print(speaker, lexical_richness(speech))

Let's look at the readability scores for all four speeches now.

For the Automated Readability Index, you can get the appropriate grade level here: https://en.wikipedia.org/wiki/Automated_readability_index

In [None]:
for speaker, speech in speeches.items():
    words = return_words(speech)
    print(speaker, "ARI:", auto_readability_index(speech), "CL:", coleman_liau_index(speech, words))

To get some comparison, let's also look at some stats calculated through Textacy. We'll see the ARI and CL scores, which use the same formulas we used. However, you might notice that the scores are different. To understand why, you have to dig into the source code for Textacy, where you'll find that it filters out punctuation in creating the word list, which affects the number of characters. It also lowercases the punctuation-filtered words before creating the set of unique words, decreasing that number as well compared to how we calculated it here. These changes affect both the ARI and CL scores.

In [None]:
# https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
# https://en.wikipedia.org/wiki/Automated_readability_index
txt_speeches = [clinton_speech, bush_speech, obama_clean_speech, trump_speech]
corpus = textacy.Corpus('en', txt_speeches)
for doc in corpus:
    stats = textacy.text_stats.TextStats(doc)
    print({
        "ARI": stats.automated_readability_index,
        "CL": stats.coleman_liau_index,
        "stats": stats.basic_counts
    })
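The effect of lowercasing on the unique-word count can be seen in a tiny standard-library sketch. The word list `sample_words` is invented for illustration and has nothing to do with textacy's internals.

```python
# Hypothetical punctuation-free word list with mixed capitalization.
sample_words = ["The", "people", "and", "the", "People", "of", "The", "world"]

unique_raw = set(sample_words)                       # case-sensitive
unique_lower = {word.lower() for word in sample_words}  # case-insensitive

print(len(unique_raw), len(unique_raw) / len(sample_words))
print(len(unique_lower), len(unique_lower) / len(sample_words))
```

Lowercasing merges "The"/"the" and "People"/"people", so both the unique count and any lexical richness ratio built on it go down.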

Why do we have such a significant difference in the CL scores? Let's look quickly at the textacy implementation: https://github.com/chartbeat-labs/textacy/blob/5927d539dd989c090f8a0b0c06ba40bb204fce82/textacy/text_stats.py#L277

In [None]:
print("Name", "Chars", "Words", "Unique", "Sentences", sep="\t")
for speaker, speech in speeches.items():
    words = return_words(speech)
    print(speaker, count_chars(words), len(words), len(set(words)), len(list(speech.sents)), sep="\t")


## Corpus level statistics

In [None]:
# clinton, bush, obama, trump
for doc in corpus:
    stats = textacy.text_stats.TextStats(doc)
    print({
        "stats": stats.basic_counts
    })