Joseph Caguioa

Spring 2020

DS 7337: Natural Language Processing

Section 404 (Tuesday 2030-2200)

HW2 Due: Date of Live Session 4 (1/28/20)

---

# Homework 2

## <u><a name="toc">Table of Contents:</a></u>
* [Question 1](#question1)
* [Question 2](#question2)
* [Question 3](#question3)

---

In [1]:
from nltk import *
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### <a name="question1">Question 1</a> 

<b>In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. (Various methods will be discussed in the live session.)</b> <sub>[(back to top)](#toc)</sub>

The Bird-Klein-Loper text contains a code snippet for calculating vocabulary size: it counts the set of all words after lowercasing them, excluding numbers and punctuation.

In [2]:
len(set(word.lower() for word in text1 if word.isalpha())) # Should return 16948

16948

Note that this differs from the lexical diversity score defined earlier in the chapter. That score is built on the count of unique tokens (types), which the book describes as an indirect indicator of vocabulary size, but it also counts punctuation and does not deduplicate by case.

One reasonable method for normalizing the vocabulary score is to divide the vocabulary size by the total number of words in the text. That is, strip away all tokens containing non-alphabetic characters to obtain the words, then find the proportion of unique words within that total. Because every unique word occurs at least once, the vocabulary can never exceed the word total, so the score always falls between 0 and 1.

In [3]:
def vocabulary(text):
    '''
    Function to obtain unique vocabulary from text.
    Adapted definition from pg25 of Bird-Klein-Loper's NLP with Python.
    
    Parameters: 
        text (list): Text as list of tokens
    
    Returns:
        set: Set of unique lowercased words
    '''
    
    vocab = set(word.lower() for word in text if word.isalpha())
    return vocab

def words_only(text):
    '''
    Function to remove non-alphabetic tokens.
    
    Parameters:
        text (list): Text as list of tokens
    
    Returns:
        list: List of all words in text, including duplicates
    '''
    
    words = list(word for word in text if word.isalpha())
    return words

def vocab_diversity(text):
    '''
    Function to get percent of unique vocabulary
    
    Parameters:
        text (list): Text as list of tokens
    
    Returns:
        float: Percentage of unique vocabulary in word total
    '''
    
    return len(vocabulary(text))/len(words_only(text))

In [4]:
example_sentence_1 = ['A', 'a', 'An', 'an', 'The', 'the', ';', '.', 'ABC123']
example_sentence_2 = ['A', 'a', 'antidisestablishmentarianism', '.']

print(f"The vocabulary size of example 1 is {vocab_diversity(example_sentence_1)}.")
print(f"The vocabulary size of example 2 is {vocab_diversity(example_sentence_2)}.")

The vocabulary size of example 1 is 0.5.
The vocabulary size of example 2 is 0.6666666666666666.
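
As a check on the first result, the arithmetic can be traced by hand. The following is a minimal standalone sketch (it repeats the filtering inline rather than calling the functions above, and the variable names are chosen only for illustration):

```python
# Worked check of the Question 1 normalization on the first example list.
tokens = ['A', 'a', 'An', 'an', 'The', 'the', ';', '.', 'ABC123']

# Keep purely alphabetic tokens: drops ';', '.', and 'ABC123'.
words = [t for t in tokens if t.isalpha()]

# Lowercase before deduplicating, so 'A' and 'a' count once.
vocab = {w.lower() for w in words}

score = len(vocab) / len(words)
print(len(words), len(vocab), score)  # 6 3 0.5
```

Six alphabetic tokens reduce to three unique lowercased words, which gives the 0.5 printed above.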


---

### <a name="question2">Question 2</a> 

<b>After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.</b> <sub>[(back to top)](#toc)</sub>

Bird-Klein-Loper 3.2 identifies long words as those with more than 15 characters.

In [5]:
text1_long_words = [word for word in text1 if len(word) > 15]
print(text1_long_words)

['CIRCUMNAVIGATION', 'uncomfortableness', 'cannibalistically', 'circumnavigations', 'superstitiousness', 'apprehensiveness', 'indiscriminately', 'indiscriminately', 'superstitiousness', 'comprehensiveness', 'circumnavigating', 'preternaturalness', 'circumnavigation', 'apprehensiveness', 'indiscriminately', 'simultaneousness', 'indispensableness', 'apprehensiveness', 'undiscriminating', 'irresistibleness', 'Physiognomically', 'physiognomically', 'physiognomically', 'circumnavigation', 'hermaphroditical', 'circumnavigating', 'characteristically', 'comprehensiveness', 'comprehensiveness', 'uncompromisedness', 'uninterpenetratingly', 'responsibilities', 'supernaturalness', 'subterraneousness', 'apprehensiveness', 'simultaneousness']


Note that the example list above contains some repeated words after taking capitalization and plurality into account (e.g., circumnavigation, comprehensiveness). In order to score long-word vocabulary size, it would make sense to consider the percentage of the set of unique long words (that is, ignoring case) out of the total vocabulary. Doing this gives a better sense for vocabulary complexity than using the word total or token total as the normalizing quantity.

In [6]:
def long_vocabulary(text):
    '''
    Function to get set of "long" words.
    
    Parameters:
        text (list): Text as list of tokens
    
    Returns:
        list: Unique lowercased words with more than 15 characters
    '''
    
    vocab = vocabulary(text)
    # Use a distinct local name so the function name is not shadowed.
    long_words = [word for word in vocab if len(word) > 15]
    return long_words
    
def long_vocabulary_size(text):
    '''
    Function to get the share of long words in the vocabulary.
    
    Parameters:
        text (list): Text as list of tokens
        
    Returns:
        float: Percentage of long words out of set of vocabulary
    '''

    return len(long_vocabulary(text))/len(vocabulary(text))

In [7]:
print(f"The long-word vocabulary size of example 1 is {long_vocabulary_size(example_sentence_1)}.")
print(f"The long-word vocabulary size of example 2 is {long_vocabulary_size(example_sentence_2)}.")

The long-word vocabulary size of example 1 is 0.0.
The long-word vocabulary size of example 2 is 0.5.
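
To illustrate why the vocabulary, rather than the word total, serves as the denominator here, the sketch below compares the two on a made-up token list. The list and variable names are assumptions for illustration only and are not part of the assignment.

```python
# Hypothetical toy text: one short word repeated often, plus one long word.
tokens = ['the'] * 8 + ['cat', 'antidisestablishmentarianism', '.']

words = [t for t in tokens if t.isalpha()]       # 10 alphabetic tokens
vocab = {w.lower() for w in words}               # 3 unique words
long_words = [w for w in vocab if len(w) > 15]   # 1 long word

by_vocab = len(long_words) / len(vocab)  # denominator used in this notebook
by_words = len(long_words) / len(words)  # alternative: total word count

print(by_vocab, by_words)
```

Adding more copies of `'the'` would lower `by_words` further but leave `by_vocab` unchanged, so the vocabulary-based ratio is less sensitive to repetition of common words.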


---

### <a name="question3">Question 3</a> 

<b>Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to the same graded texts you used in homework 1.</b> <sub>[(back to top)](#toc)</sub>

In [8]:
# Lexical diversity. Definition from pg9 of Bird/Klein/Loper's NLP with Python
def lexical_diversity(text):
    return len(set(text)) / len(text)

print(lexical_diversity(text3)) # Should return 0.06230453

0.06230453042623537


In [9]:
# Import package needed to access texts as urls
import urllib3

http = urllib3.PoolManager()

def tokenize_url(url):
    """
    Function to obtain text from a web address and return list of tokens.
    
    Parameters:
        url (string): Url of interest
    
    Returns:
        list: Text from url as list of tokens
    """
    
    response = http.request('GET', url)
    raw = response.data.decode('utf-8')
    return word_tokenize(raw)

In [10]:
fourth_url = 'http://www.gutenberg.org/cache/epub/14880/pg14880.txt'
fourth_tokens = tokenize_url(fourth_url)

fifth_url = 'http://www.gutenberg.org/cache/epub/15040/pg15040.txt'
fifth_tokens = tokenize_url(fifth_url)

sixth_url = 'http://www.gutenberg.org/cache/epub/16751/pg16751.txt'
sixth_tokens = tokenize_url(sixth_url)

The code above for retrieving the texts is taken from my Homework 1 notebook. Weighting the three scores equally amounts to taking their average.

In [11]:
def text_difficulty_score(text):
    '''
    Function to calculate text difficulty using equal weights of lexical diversity, 
    vocabulary size, and long-word vocabulary size.
    
    Parameters:
        text (list): Text as list of tokens 
    
    Returns:
        float: Average of three text difficulty parameters
    '''
    
    return (lexical_diversity(text) + vocab_diversity(text) + long_vocabulary_size(text))/3

In [12]:
print(f"The text difficulty of McGuffey's Fourth Eclectic Reader is \
{round(text_difficulty_score(fourth_tokens),5)}.")
print(f"The text difficulty of McGuffey's Fifth Eclectic Reader is \
{round(text_difficulty_score(fifth_tokens),5)}.")
print(f"The text difficulty of McGuffey's Sixth Eclectic Reader is \
{round(text_difficulty_score(sixth_tokens),5)}.")

The text difficulty of McGuffey's Fourth Eclectic Reader is 0.08133.
The text difficulty of McGuffey's Fifth Eclectic Reader is 0.07462.
The text difficulty of McGuffey's Sixth Eclectic Reader is 0.06678.


Again, perhaps unexpectedly, this "text difficulty score" decreases as the reading level increases. Why might this be? Applying len() to the outputs of the vocabulary(), words_only(), and long_vocabulary() functions defined above gives the counts in the table below.

Score | Fourth Reader | Fifth Reader | Sixth Reader
-|-|-|-
Word Total | 63404 | 97807 | 138387
Vocabulary | 7629 | 10836 | 13713
Vocab Diversity | 0.12032 | 0.11079 | 0.09909
Long Words | 1 | 2 | 5

Given the nature of these introductory reading textbooks, the long-word vocabulary size has little bearing: each book contains no more than a handful of long words, which are drops in the bucket compared to the overall vocabulary (5 of 13713 in the Sixth Reader). Of more interest is vocabulary size, which grows with each reader but is outstripped by the growth in total word count: from the Fourth to the Sixth Reader the word total more than doubles (63404 to 138387), while the vocabulary grows by a factor of about 1.8 (7629 to 13713). This is expected, as readers typically repeat the words they introduce.
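
The rows of the table above can be gathered in one pass with a small helper. This is a minimal sketch: `difficulty_breakdown` is a hypothetical name, it repeats the filtering logic inline so that it runs on its own, and it is demonstrated on a toy list rather than the downloaded readers.

```python
def difficulty_breakdown(tokens, long_len=15):
    """Return the counts behind the scores for a list of tokens."""
    words = [t for t in tokens if t.isalpha()]            # alphabetic tokens only
    vocab = {w.lower() for w in words}                    # unique lowercased words
    long_words = [w for w in vocab if len(w) > long_len]  # unique long words
    return {
        'word_total': len(words),
        'vocabulary': len(vocab),
        # Guard against an empty text to avoid division by zero.
        'vocab_diversity': round(len(vocab) / len(words), 5) if words else 0.0,
        'long_words': len(long_words),
    }

sample = ['A', 'a', 'antidisestablishmentarianism', '.']
print(difficulty_breakdown(sample))
```

Calling the helper on `fourth_tokens`, `fifth_tokens`, and `sixth_tokens` in the notebook would be expected to give one table column each.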

As with lexical diversity, this attempt at normalization still does not account well for text size. Methods that look at difficulty measures beyond word counts, such as sentence structure, may give better insight.
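
One simple sentence-structure measure is the average number of words per sentence. The sketch below is an illustration only and was not applied to the readers in this notebook. The function name and the choice of `.`, `!`, and `?` as sentence-ending tokens are assumptions, and it expects punctuation to arrive as separate tokens (as `word_tokenize` produces).

```python
def average_sentence_length(tokens, enders=('.', '!', '?')):
    """Mean number of alphabetic tokens per sentence in a token list."""
    lengths = []
    current = 0
    for tok in tokens:
        if tok in enders:
            # Close the sentence, skipping empty ones (e.g. '...' runs).
            if current:
                lengths.append(current)
            current = 0
        elif tok.isalpha():
            current += 1
    # Count a trailing sentence that has no closing punctuation.
    if current:
        lengths.append(current)
    return sum(lengths) / len(lengths) if lengths else 0.0

toy = ['The', 'cat', 'sat', '.', 'It', 'purred', 'loudly', 'at', 'me', '!']
print(average_sentence_length(toy))  # 4.0
```

Unlike the ratios above, this value does not shrink simply because a text is longer, although whether it tracks the reading level of these particular readers would need to be checked.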