### SMU-MSDS-7337-Natural Language Processing HomeWork 2

##### By-Rashmi Patel

In [1]:
import nltk
from nltk.book import *
from nltk.corpus import words
import re
import time
import requests

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
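# function for removing the Project Gutenberg header and footer
def removeMeta(t):
    # the asterisks are escaped so they match literally instead of acting as quantifiers
    start_line = r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK [A-Z ]+\*\*\*"
    end_line = r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK [A-Z ]+\*\*\*"
    # [-1] keeps the whole text when the start marker is absent
    return re.split(end_line, re.split(start_line, t)[-1])[0]

# function for splitting the text into lowercase words
def splitTxt(t):
    # the pattern only keeps a-z and apostrophes, so lowercase first; drop empty strings
    return [w for w in re.split("[^a-z']+", t.lower()) if w]

# function for vocabulary count (number of unique tokens)
def vocabCount(t):
    # raw strings are cleaned and split; token sequences such as nltk.Text are used as-is
    if isinstance(t, str):
        t = splitTxt(removeMeta(t))
    return len(set(t))

# function for lexical diversity (unique tokens / total tokens)
def lexicalDiversity(t):
    if isinstance(t, str):
        t = splitTxt(removeMeta(t))
    return len(set(t)) / len(t)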
    


### Question 1: In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. (Various methods will be discussed in the live session.)

In [3]:
all_words = words.words()

def vocabScore(t):
    return vocabCount(t) / len(all_words)



In [4]:
max_words = max([vocabScore(text1), vocabScore(text2), vocabScore(text3), vocabScore(text4),
                 vocabScore(text5), vocabScore(text6), vocabScore(text7), vocabScore(text8),
                 vocabScore(text9)])
print(max_words)

0.08159722222222222


In [5]:
print(text1.name)
print(vocabScore(text1))
print('=============================')
print(text2.name)
print(vocabScore(text2))
print('=============================')
print(text3.name)
print(vocabScore(text3))

Moby Dick by Herman Melville 1851
0.08159722222222222
Sense and Sensibility by Jane Austen 1811
0.02886337523655042
The Book of Genesis
0.011781055690727224


The vocabulary score compares the number of unique tokens in a given text against the total number of entries in the nltk.corpus.words.words() object, which is 236736.

In other words, the vocabulary score of a given text is its unique token count divided by that value. In this case, Moby Dick by Herman Melville 1851 has a unique token count equal to roughly 8% of the size of nltk.corpus.words.words().
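
As a minimal sketch of this normalization, the example below reproduces the idea without NLTK. The names vocab_score and reference_words are hypothetical, and the ten-word list is only a toy stand-in for words.words().

```python
def vocab_score(tokens, reference_size):
    """Unique-token count divided by the size of a reference word list."""
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    return len(set(tokens)) / reference_size

# Toy stand-in for words.words(); the real list is far larger (assumption for illustration).
reference_words = ["a", "cat", "dog", "mat", "on", "ran", "sat", "the", "to", "tree"]
tokens = "the cat sat on the mat".split()

# 5 unique tokens divided by 10 reference words
score = vocab_score(tokens, len(reference_words))
print(score)
```

Note that this kind of score is not strictly capped at 1: tokens that do not appear in the reference list (punctuation, proper nouns, capitalized variants) still count toward the numerator.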



### Question 2: After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

In [6]:
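def longWordCnt(t, min_length = 10):
    # raw strings are cleaned and split first; token sequences are used as-is
    if isinstance(t, str):
        t = splitTxt(removeMeta(t))

    # keep only the words with at least min_length characters
    longWords = [w for w in t if len(w) >= min_length]

    # count each distinct long word once
    return vocabCount(longWords)

def longWordScore(t, min_length = 10):
    # number of words in words.words() that meet the length threshold
    count = sum(1 for w in all_words if len(w) >= min_length)

    return float(longWordCnt(t, min_length)) / count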

In [7]:
print('There are',longWordCnt(text4),
      'long words found in the text with the default minimum length of 10 and the long word score is'
      ,longWordScore(text4))

There are 2419 long words found in the text with the default minimum length of 10 and the long word score is 0.021070694400891956


The function longWordCnt() takes a text as its parameter and counts the distinct words of at least the specified length (the default is 10 characters, which is a reasonable threshold for a "long word" in English).

The function longWordScore() walks through the words() object, finds the total number of words of at least the specified length, and then divides the longWordCnt of the specified text by that value.

For example, if the words() object contains 60 words of the appropriate length while the specified text has only 12, the returned value would be 0.2 (12 / 60).
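
The same ratio can be sketched on toy data. This is an illustration only: long_word_score and both word lists below are hypothetical stand-ins, not the NLTK objects used above.

```python
def long_word_score(tokens, reference_words, min_length=10):
    """Distinct long words in the text divided by long words in the reference list."""
    long_in_text = {w for w in tokens if len(w) >= min_length}
    long_in_reference = sum(1 for w in reference_words if len(w) >= min_length)
    if long_in_reference == 0:
        return 0.0  # avoid dividing by zero when the reference has no long words
    return len(long_in_text) / long_in_reference

# Toy reference list with five words of 10 letters and two short ones.
reference_words = ["understand", "vocabulary", "difficulty", "dictionary",
                   "normalized", "cat", "score"]
# "vocabulary" appears twice but is counted once.
tokens = ["the", "vocabulary", "is", "a", "vocabulary"]

score = long_word_score(tokens, reference_words)
print(score)
```

Here one distinct long word in the text is compared against five long words in the reference list, which mirrors the 12 / 60 calculation on a smaller scale.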



### Question 3: Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to same graded texts you used in homework 1.

In [8]:
txtFile1 = './data/1.txt'
txtFile3 = './data/3.txt'
txtFile5 = './data/5.txt'

In [9]:
# read a file, tokenize it, and wrap the tokens in an nltk.Text object
# ('rU' mode is deprecated and was removed in Python 3.11, so plain 'r' is used)
def loadText(path):
    with open(path, 'r') as f:
        return nltk.Text(nltk.word_tokenize(f.read()))

txt1 = loadText(txtFile1)
txt3 = loadText(txtFile3)
txt5 = loadText(txtFile5)


In [10]:
def txtDiffScore(text):
    return ((1/3)*lexicalDiversity(text) + (1/3)*vocabScore(text) + (1/3)*longWordScore(text))

In [11]:
print('The text difficulty score for first grade reader is',txtDiffScore(txt1))
print('The text difficulty score for third grade reader is',txtDiffScore(txt3))
print('The text difficulty score for fifth grade reader is',txtDiffScore(txt5))

The text difficulty score for first grade reader is 0.05913730461208604
The text difficulty score for third grade reader is 0.0488844200899054
The text difficulty score for fifth grade reader is 0.06370412477093979


The lexical diversity function used here differs slightly from the one in homework 1: the earlier version did not remove non-alphabetic characters, so tokens like 'how' and 'how?' were not counted as the same word.
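
To illustrate the effect of that cleanup, here is a minimal sketch. It assumes whitespace-style tokens where punctuation stays attached to the word; normalize_tokens and lexical_diversity are hypothetical helper names, not the functions defined earlier.

```python
import re

def normalize_tokens(tokens):
    """Lowercase each token and drop every character except a-z and apostrophes."""
    cleaned = []
    for tok in tokens:
        word = re.sub(r"[^a-z']+", "", tok.lower())
        # skip tokens that end up empty or contain only apostrophes
        if word.strip("'"):
            cleaned.append(word)
    return cleaned

def lexical_diversity(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

raw_tokens = ["How", "are", "you", "how?", "Fine", "."]
clean = normalize_tokens(raw_tokens)

print(lexical_diversity(raw_tokens))  # every raw token is distinct
print(lexical_diversity(clean))       # 'How' and 'how?' now collapse to 'how'
```

Without the cleanup, 'How' and 'how?' inflate the unique count, so the diversity of the raw tokens comes out higher than that of the cleaned tokens.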

To be clear, the files I'm using here are McGuffey's Eclectic Readers, specifically the first, third and fifth grade readers.

These values indicate that the first grade reader (about 0.059) scores as slightly more difficult than the third grade reader (about 0.049), while the fifth grade reader (about 0.064) is the most difficult of the set.
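
For reference, the equal weighting behind these scores is just the mean of the three components. The sketch below uses made-up component values purely for illustration; they are not the actual components of any reader, and text_difficulty is a hypothetical name.

```python
def text_difficulty(lexical_diversity, vocab_score, long_word_score):
    """Equal-weight combination: the plain average of the three component scores."""
    components = (lexical_diversity, vocab_score, long_word_score)
    return sum(components) / len(components)

# Hypothetical component values, chosen only to show the arithmetic.
example = text_difficulty(lexical_diversity=0.15, vocab_score=0.02, long_word_score=0.01)
print(round(example, 4))
```

Because the three components are normalized against different denominators, one of them may sit on a larger scale than the others and contribute most of the combined score, which is worth keeping in mind when comparing the readers.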
