## Homework 2
*Author: Puri Rudick*

In [176]:
import nltk
from urllib import request
import numpy as np
from sklearn.preprocessing import minmax_scale
# from nltk.book import *

Read in three books (same as HW1).

In [177]:
# Book titles mapped to their Project Gutenberg plain-text URLs
books = {
    '''The Child's World: Third Reader by Browne, Tate, and Withers''' : 'https://www.gutenberg.org/cache/epub/15170/pg15170.txt',
    'Fourth Reader: The Alexandra Readers by Dearness, McIntyre, and Saul' : 'https://www.gutenberg.org/files/51975/51975-0.txt',
    'School Reading By Grades: Fifth Year by James Baldwin' : 'https://www.gutenberg.org/files/51000/51000-0.txt',
}

In [178]:
def read_in_book(url):
    response = request.urlopen(url)
    text = response.read().decode('utf8')
    
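    # Find the beginning and the end of the book to remove the Gutenberg metadata
    start2 = text.find('Produced by')
    end = text.find('END OF THIS PROJECT GUTENBERG EBOOK')
    # str.find returns -1 when a marker is missing; fall back to the full text
    start2 = max(start2 - 11, 0)
    end = end if end != -1 else len(text)
    text = text[start2:end]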
    tokens = nltk.word_tokenize(text)
    return nltk.Text(tokens)

In [179]:
# Dicts preserve insertion order, so b1, b2, b3 follow the order of `books`
b1, b2, b3 = (read_in_book(url) for url in books.values())

### 1. Vocabulary Size: Score and Normalize Score

In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. 
Some relevant resources that you can leverage:
- https://docs.tibco.com/pub/spotfire/6.5.0/doc/html/norm/norm_scale_between_0_and_1.htm
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html


Normalization can be done with the simple min-max formula: subtract the smallest vocabulary size in the list from each text's vocabulary size, then divide by the difference between the largest and the smallest vocabulary size. The texts are the three books read in above, as `nltk.Text` objects. The same normalization can also be done with `minmax_scale`, the function form of Min Max Scaler in the sklearn module.
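As a minimal sketch of the formula described above, the snippet below applies min-max normalization in plain Python to the three vocabulary sizes reported in this notebook. The helper name `min_max_normalize` is hypothetical, and mapping every value to 0.0 when all sizes are equal is an assumption made here only to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread between values, map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Vocabulary sizes of the 3rd, 4th, and 5th grade readers from this notebook
sizes = [4625, 11294, 8372]
normalized = min_max_normalize(sizes)
print(normalized)
```

The smallest size maps to 0, the largest to 1, and the 5th grade reader lands at (8372 - 4625) / (11294 - 4625), a little above the midpoint.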

In [180]:
#### Method to get the Vocab Size and normalize the score

def n_vocab_size(*arg):
    #### Getting the Vocab Size (number of distinct tokens) of each text
    vocab_size = np.array([len(set(text)) for text in arg], dtype=float)
    
    #### Normalizing using the formula: (x - min) / (max - min)
    vocab_size_norm = (vocab_size - vocab_size.min()) / (vocab_size.max() - vocab_size.min())
    
    #### Normalizing using sklearn preprocessing 
    vocab_size_norm_sklearn = minmax_scale(vocab_size, feature_range=(0,1), axis=0)
    
    return vocab_size, vocab_size_norm, vocab_size_norm_sklearn

In [181]:
vocab_size = n_vocab_size(b1,b2,b3)

print("Vocabulary Size:", *vocab_size[0],sep='\n\t- ')
print("Normalization using the simple formula:", *vocab_size[1],sep='\n\t- ')
print("Normalization using the sklearn module:", *vocab_size[2],sep='\n\t- ')

Vocabulary Size:
	- 4625.0
	- 11294.0
	- 8372.0
Normalization using the simple formula:
	- 0.0
	- 1.0
	- 0.5618533513270355
Normalization using the sklearn module:
	- 0.0
	- 0.9999999999999999
	- 0.5618533513270355


### 2. Long-word Vocabulary: Score and Normalize Score
After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

Here we use the same normalization methods as for the overall vocabulary size. The long-word vocabulary size is the number of distinct tokens longer than `longWordLength` characters (10 by default), computed on the same three texts. The function below returns the raw and normalized long-word vocabulary sizes for multiple texts.
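To make the long-word count concrete, here is a small sketch on a made-up token list. The helper name `long_word_vocab` and the toy tokens are hypothetical and only illustrate the filter; the threshold of 10 characters matches the default used in this notebook.

```python
def long_word_vocab(tokens, min_length=10):
    """Return the distinct tokens with more than min_length characters."""
    return {w for w in tokens if len(w) > min_length}


# Toy token list, made up for illustration only
tokens = ["the", "schoolmaster", "read", "an", "extraordinary", "story",
          "to", "the", "schoolmaster", "children"]

long_words = long_word_vocab(tokens)
print(sorted(long_words))
print(len(long_words))
```

Repeated tokens count once because the result is a set, and "children" (8 characters) is excluded because the comparison is strictly greater than the threshold.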

In [182]:
def n_long_vocab_size(*arg, longWordLength = 10):
    #### Getting the long-word Vocab Size: distinct tokens longer than longWordLength
    v_size = np.array([len({w for w in text if len(w) > longWordLength})
                       for text in arg], dtype=float)
    
    #### Normalizing using the formula: (x - min) / (max - min)
    v_size_norm = (v_size - v_size.min()) / (v_size.max() - v_size.min())
    
    #### Normalizing using sklearn module
    v_size_norm_sklearn = minmax_scale(v_size, feature_range=(0,1), axis=0)
    
    return v_size, v_size_norm, v_size_norm_sklearn

In [183]:
long_vocab_size = n_long_vocab_size(b1,b2,b3)

print("Long Vocabulary Size:", *long_vocab_size[0],sep='\n\t- ')
print("Normalization using the simple formula", *long_vocab_size[1],sep='\n\t- ')
print("Normalization using the sklearn module", *long_vocab_size[2],sep='\n\t- ')

Long Vocabulary Size:
	- 106.0
	- 633.0
	- 399.0
Normalization using the simple formula
	- 0.0
	- 1.0
	- 0.5559772296015181
Normalization using the sklearn module
	- 0.0
	- 1.0
	- 0.555977229601518


### 3. Text Difficulty Score

Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to same graded texts you used in homework 1.
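The equal-weight combination asked for here can be sketched as a plain average of the three component scores for each text. The function name `combine_equal_weights` and all of the numbers below are made up for illustration; they are not the scores of the three readers.

```python
def combine_equal_weights(lexical_diversity, vocab_norm, long_vocab_norm):
    """Average three per-text score lists, giving each component equal weight."""
    return [(ld + v + lv) / 3
            for ld, v, lv in zip(lexical_diversity, vocab_norm, long_vocab_norm)]


# Made-up component scores for three hypothetical texts
lexical_diversity = [0.12, 0.09, 0.15]   # raw ratio: distinct tokens / all tokens
vocab_norm = [0.0, 1.0, 0.5]             # min-max normalized vocabulary size
long_vocab_norm = [0.0, 1.0, 0.6]        # min-max normalized long-word vocabulary size

scores = combine_equal_weights(lexical_diversity, vocab_norm, long_vocab_norm)
for i, score in enumerate(scores, start=1):
    print(f"text {i}: {score:.3f}")
```

In this toy data the two normalized sizes span the full 0 to 1 range while lexical diversity is a small raw ratio, so the ranking is driven mostly by the two size scores.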


In [186]:
# Lexical_diversity function from nltk book
def lexical_diversity(text):
    return len(set(text)) / len(text)

def txt_difficulty_score(*arg, lex_div_weight=1, long_vocab_size_weight=1, vocab_size_weight=1):

    #### Getting the lexical diversity of each text
    lex_div_score = np.array([lexical_diversity(text) for text in arg])
    long_vocab_size = n_long_vocab_size(*arg)
    vocab_size = n_vocab_size(*arg)

    #### Applying the weights (index 1 holds the formula-normalized scores)
    lex_div_score = lex_div_score * lex_div_weight
    vocab_size = vocab_size[1] * vocab_size_weight
    long_vocab_size = long_vocab_size[1] * long_vocab_size_weight
    
    #### Weighted average; with the default weights of 1 this divides by 3
    total_weight = lex_div_weight + long_vocab_size_weight + vocab_size_weight
    txt_diff_score = (lex_div_score + long_vocab_size + vocab_size) / total_weight
    
    return txt_diff_score

In [200]:
txt_difficulty = txt_difficulty_score(b1,b2,b3)

for i,j in zip(txt_difficulty,books):
    print("Text Difficulty Score for", j, "is ", i)

Text Difficulty Score for The Child's World: Third Reader by Browne, Tate, and Withers is  0.035379613692866706
Text Difficulty Score for Fourth Reader: The Alexandra Readers by Dearness, McIntyre, and Saul is  0.702631612467678
Text Difficulty Score for School Reading By Grades: Fifth Year by James Baldwin is  0.42643003987186195


### Discussion

The three books I used in this experiment are readers for the 3rd, 4th, and 5th grades. They come from three different authors and different genres. The results above show that normalization can be done with either the simple normalization formula or Min Max Scaler from the sklearn module. Both methods give the same results, apart from floating-point rounding (0.9999999999999999 versus 1.0 for the largest vocabulary size).

For both vocabulary size and long-word vocabulary size, the 3rd grade book has the lowest value, followed by the 5th grade book, and the 4th grade book has the highest. The text difficulty score shows the same trend. The 3rd grade book has the lowest score, which makes sense because the book for the youngest children (among these three books) should be the easiest. However, the 4th grade book has a higher text difficulty score than the 5th grade book. These results correspond with the trend I saw in Homework 1.

The reason may be that I chose three different authors and genres, and each author might have a different view of how difficult a book for a given grade should be. If these books came from the same author, I would expect the text difficulty to increase with the grade level of the readers.