### Cameron Stewart
# Project 2

### 1.	In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. 
Some relevant resources that you can leverage:                   
https://docs.tibco.com/pub/spotfire/6.5.0/doc/html/norm/norm_scale_between_0_and_1.htm
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [1]:
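def vocabulary_normalizer(vocab_size):
    #Estimated bounds: 2,500 (lower-bound estimate) and 30,030 (Ulysses)
    min_val=2500
    max_val=30030
    #Clamp to [min_val, max_val] so the result always lies between 0 and 1
    clean_vocab_size=max(min(vocab_size,max_val),min_val)
    return (clean_vocab_size - min_val) / (max_val - min_val)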

To normalize vocabulary size between 0 and 1, we would ideally have a large sample of book vocabulary sizes to reference. Since we do not, I searched the internet for estimates of the min and max. According to this site (https://www.grammarly.com/blog/7-novels-to-read-for-a-better-vocabulary/#:~:text=Ulysses,English%20novels%20of%20all%20time.), Ulysses is one of the most difficult novels to read and, at 30,030 unique words, sits near the 'max' for unique words in an English novel. We will use this as the max of our vocabulary size scale. I did not find a good source for the minimum, but did find that the minimum word count for a novel is ~40,000. Based on this, I will estimate the lower bound of vocabulary size to be 2,500. The studied book's vocabulary size is then min-max normalized, (vocab_size - min) / (max - min), between these two values. There is also a boundary check that clamps rare books outside this range, so their score cannot fall outside 0 and 1.
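
As a quick check of the arithmetic, the sketch below applies the same clamp-then-scale formula to a vocabulary of 6,665 words (the size measured for The Phoenix and the Carpet further down) and to two out-of-range values. The helper name `min_max_clamp` is hypothetical and used only for illustration; the bounds are the ones chosen above.

```python
def min_max_clamp(value, min_val, max_val):
    """Clamp value to [min_val, max_val], then scale it to [0, 1]."""
    clamped = max(min(value, max_val), min_val)
    return (clamped - min_val) / (max_val - min_val)

# Bounds chosen above: estimated lower bound and the Ulysses count
VOCAB_MIN, VOCAB_MAX = 2500, 30030

# (6665 - 2500) / (30030 - 2500)
print(round(min_max_clamp(6665, VOCAB_MIN, VOCAB_MAX), 2))  # 0.15
# Values outside the range are clamped to the ends of the scale
print(min_max_clamp(1000, VOCAB_MIN, VOCAB_MAX))   # 0.0
print(min_max_clamp(50000, VOCAB_MIN, VOCAB_MAX))  # 1.0
```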

### 2.	After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

In [2]:
from nltk.probability import FreqDist

#Function to count unique long words over 7 letters that appear over 7 times
def unique_long_word_counter(tokenized_text):
    fdist = FreqDist(tokenized_text)
    #Iterating over fdist yields each distinct word once
    return sum(1 for word in fdist if len(word) > 7 and fdist[word] > 7)

#Function to normalize unique long word count
def long_word_vocabulary_normalizer(lw_vocab_size):
    min_val=40
    max_val=800
    clean_lw_vocab_size=max(min(lw_vocab_size,max_val),min_val)  
    return (clean_lw_vocab_size - min_val) / (max_val - min_val)

Using Bird-Klein's example, we define a 'long word' as any word over 7 characters that appears more than 7 times. The first function counts the unique words that meet these criteria. Once again, for the normalization we would prefer a large sample to help define the min and max long-word vocabulary sizes. Without such a sample, and with no clear reference list online, I made a subjective decision to set the min to 40 and the max to 800. These values should be updated when more data is collected. One potential concern to monitor is that hyphenated words are currently included in the count.
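
To illustrate the hyphenation concern, here is a minimal sketch of the same counting rule with an optional switch that skips hyphenated tokens. It is not the function used in this notebook: the name `long_word_count` and the `skip_hyphenated` flag are hypothetical, and `collections.Counter` stands in for NLTK's `FreqDist` so the example runs without NLTK installed.

```python
from collections import Counter

def long_word_count(tokens, min_len=7, min_freq=7, skip_hyphenated=False):
    """Count distinct tokens longer than min_len that occur more than min_freq times."""
    counts = Counter(tokens)  # stand-in for nltk's FreqDist
    return sum(
        1
        for word, freq in counts.items()
        if len(word) > min_len
        and freq > min_freq
        and not (skip_hyphenated and "-" in word)
    )

# Toy token list: two qualifying long words, one of them hyphenated
tokens = ["carpeting"] * 8 + ["well-known"] * 8 + ["phoenix"] * 20 + ["adventure"] * 3
print(long_word_count(tokens))                        # 2
print(long_word_count(tokens, skip_hyphenated=True))  # 1
```

Comparing the two counts on a real book would show how much hyphenated tokens contribute before deciding whether to exclude them.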

### 3.	Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to same graded texts you used in homework 1.

Load in libraries

In [3]:
#Load Required Functions
from urllib import request
import nltk
from nltk import word_tokenize
from nltk.probability import FreqDist

Load in HW1 lexical diversity function to get vocab size/token count

In [4]:
#Defined function to calculate lexical diversity from last HW
def lexical_diversity(text):
    return len(set(text)) / len(text)

Create a function that prints the key statistics from HW1 and the normalized values requested for HW2. The function trims the raw text to the contents of the story, dropping the extraneous information before and after it, and incorporates the lexical_diversity, unique_long_word_counter, long_word_vocabulary_normalizer, and vocabulary_normalizer functions.

In [5]:
#Defined function to print out all statistics from HW1 and requested normalized values for HW2
def print_book_statistics(url,reading_level,start_text,end_text):
    #Print Reading Level
    print('Reading Level:',reading_level)

    #Fetch text through the URL and tokenize the text
    response = request.urlopen(url)
    raw = response.read().decode('utf8')
    tokens = word_tokenize(raw)
    print('Original Token Count:',len(tokens))

    #Manually identify start and end phrases of book and use their index to clean the raw file of unnecessary text
    start=raw.find(start_text)
    end=raw.rfind(end_text)
    clean_raw=raw[start:end]
    clean_tokens = word_tokenize(clean_raw)
    print('Cleaned Token Count:',len(clean_tokens),'\n')

    #Create NLTK Text so we can use NLTK methods on the text (verified token count remained the same)
    text = nltk.Text(clean_tokens)

    #Calculate Vocabulary Size
    vocab_size=len(set(text))
    print('Vocabulary Size:',vocab_size)
    
    #Long Word Vocabulary Size
    unique_lw_count=unique_long_word_counter(clean_tokens)
    print('Long Word Vocabulary Size',unique_lw_count,'\n')
    
    #Normalized Values
    lex_div=lexical_diversity(text)
    norm_vocab_size=vocabulary_normalizer(vocab_size)
    norm_lw_vocab_size=long_word_vocabulary_normalizer(unique_lw_count)
    print('Lexical Diversity:',round(lex_div,2))
    print('Normalized Vocabulary Size:', round(norm_vocab_size,2))
    print('Normalized Long Word Vocabulary Size:',round(norm_lw_vocab_size,2))
    print('Text Difficulty Score:', round((lex_div+norm_vocab_size+norm_lw_vocab_size)/3,2))
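
One caveat with the cleaning step in the function above: `str.find` and `str.rfind` return -1 when a phrase is absent, so a mistyped start or end phrase would silently produce the wrong slice. The sketch below shows one way this could be guarded; the helper name `extract_body` and the sample string are hypothetical and not part of the analysis that follows.

```python
def extract_body(raw, start_text, end_text):
    """Return the text from the first start_text up to the last end_text."""
    start = raw.find(start_text)
    end = raw.rfind(end_text)
    # str.find and str.rfind return -1 when the phrase is absent
    if start == -1 or end == -1:
        raise ValueError("start or end marker not found in text")
    if end < start:
        raise ValueError("end marker appears before start marker")
    return raw[start:end]

# Made-up sample text with a header and a footer around the story
sample = "License header. CHAPTER I Once upon a time. End of the book. Footer."
print(extract_body(sample, "CHAPTER I", "End of the book"))
```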

### Below are the 3 books analyzed from HW1 with updated analysis for HW2:

#### The Phoenix and the Carpet by E. Nesbit (Reading Level 5-6)

In [6]:
url = "https://www.gutenberg.org/files/836/836-0.txt"
reading_level='5-6'
start_text="CHAPTER 1. THE EGG"
end_text="End of the Project Gutenberg EBook"
print_book_statistics(url,reading_level,start_text,end_text)

Reading Level: 5-6
Original Token Count: 82874
Cleaned Token Count: 79211 

Vocabulary Size: 6665
Long Word Vocabulary Size 77 

Lexical Diversity: 0.08
Normalized Vocabulary Size: 0.15
Normalized Long Word Vocabulary Size: 0.05
Text Difficulty Score: 0.09


#### Kidnapped by Robert Louis Stevenson (Reading Level 7)

In [7]:
url = "https://www.gutenberg.org/files/421/421-0.txt"
reading_level='7'
start_text="CHAPTER I"
end_text="End of the Project Gutenberg EBook"
print_book_statistics(url,reading_level,start_text,end_text)

Reading Level: 7
Original Token Count: 103147
Cleaned Token Count: 97506 

Vocabulary Size: 7314
Long Word Vocabulary Size 94 

Lexical Diversity: 0.08
Normalized Vocabulary Size: 0.17
Normalized Long Word Vocabulary Size: 0.07
Text Difficulty Score: 0.11


#### Pamela, or Virtue Rewarded by Samuel Richardson (Reading Level 10-12)

In [8]:
url = "https://www.gutenberg.org/files/6124/6124-0.txt"
reading_level= '10-12'
start_text="LETTER I"
end_text="End of Project Gutenberg's Pamela, or Virtue Rewarded"
print_book_statistics(url,reading_level,start_text,end_text)

Reading Level: 10-12
Original Token Count: 272922
Cleaned Token Count: 269161 

Vocabulary Size: 9086
Long Word Vocabulary Size 429 

Lexical Diversity: 0.03
Normalized Vocabulary Size: 0.24
Normalized Long Word Vocabulary Size: 0.51
Text Difficulty Score: 0.26


The text difficulty score rises with reading level across my examples (0.09, 0.11, and 0.26), unlike lexical diversity alone in the last HW (0.08, 0.08, and 0.03). There is only a single sample at each reading level, so we should be cautious about assuming this methodology is effective. With a larger sample for each reading level, we could refine the min and max values used by the vocabulary size and long-word vocabulary size normalizers. We could also refine the weights given to lexical diversity, normalized vocabulary size, and normalized long-word vocabulary size. Overall, the findings suggest this method is a more balanced approach to evaluating text difficulty than lexical diversity alone, because it balances text length, vocabulary, and frequent use of more complex vocabulary.
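
As a sketch of what refining the weights could look like, the function below generalizes the equal-weight average to arbitrary weights. The name `text_difficulty` and the alternative weights are hypothetical and chosen only for illustration; the inputs are the rounded component scores reported above for Pamela.

```python
def text_difficulty(lex_div, norm_vocab, norm_lw_vocab, weights=(1, 1, 1)):
    """Weighted average of the three component scores (equal weights by default)."""
    components = (lex_div, norm_vocab, norm_lw_vocab)
    return sum(w * c for w, c in zip(weights, components)) / sum(weights)

# Rounded component scores reported above for Pamela
pamela = (0.03, 0.24, 0.51)
print(round(text_difficulty(*pamela), 2))                     # 0.26
# Illustrative only: give lexical diversity half the weight of the other two
print(round(text_difficulty(*pamela, weights=(1, 2, 2)), 2))  # 0.31
```

With equal weights this reproduces the score printed by the notebook; any other weighting would need to be justified against a larger set of graded texts.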