# Homework 1
Jose Torres, 1/10/2022

## Book List


| Author     | Title |
| :----------- | :----------- |
| Barrie      | Peter Pan       |
| Baum   | Wizard of Oz        |
| Carroll      | Alice in Wonderland      |
| Doyle   | Sherlock Holmes       |
| Hawthorne      | Scarlet Letter      |
| Homer   | The Iliad        |
| Machiavelli      | The Prince     |
| Verne   | Around the World in 80 Days        |

## Commentary on Vocabulary Size and Lexical Diversity

+ Homer's Iliad stands out:
    + It is roughly 2 to 7 times as long as the other books in the corpus, with almost 250,000 words.
        + The book with the second-highest word count, Sherlock Holmes, has about 44% fewer words (~138,000).
    + It also has the lowest lexical diversity, at .065.
+ Lexical diversity has a narrower range than I expected (.065 - .101).
+ Peter Pan has a higher lexical diversity than Wizard of Oz, Sherlock Holmes, and The Iliad.  Sherlock Holmes and The Iliad are longer books, so that is not too surprising, but The Wizard of Oz is actually a shorter book.  I would have expected Peter Pan to have the lowest lexical diversity of the set of books.
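
The length comparisons above can be checked with a few lines of arithmetic. The sketch below hard-codes the word counts reported in the results tables later in this notebook; the `WORD_COUNTS` name is introduced here only for illustration.

```python
# Word counts copied from the results table in this notebook.
WORD_COUNTS = {
    "barrie-peter-pan": 66248,
    "baum-wizard-of-oz": 53699,
    "carroll-alice": 34110,
    "doyle-sherlock": 138480,
    "hawthorne-scarlet-letter": 108954,
    "homer-iliad": 248389,
    "machiavelli-prince": 61012,
    "verne-around-the-world": 85394,
}

iliad = WORD_COUNTS["homer-iliad"]

# How many times as long the Iliad is compared with each other book.
length_ratios = {
    name: iliad / count
    for name, count in WORD_COUNTS.items()
    if name != "homer-iliad"
}

# Fraction by which Sherlock Holmes is shorter than the Iliad.
sherlock_shortfall = 1 - WORD_COUNTS["doyle-sherlock"] / iliad

# Print the ratios from smallest to largest.
for name, ratio in sorted(length_ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {ratio:.1f}x")
print(f"Sherlock Holmes is {sherlock_shortfall:.0%} shorter than the Iliad")
```
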

## Commentary on Best Measures of Text Difficulty

+ Text difficulty is a function of both vocabulary size and lexical diversity.  

+ I like to think of lexical diversity as the number of new vocabulary words I can expect to see in the next sentence or two.  For example, a score of .25 tells me that I can expect, on average, 1 word I have not yet seen for every 4 words in the text.  The higher the score, the more new words a reader meets per sentence, which raises the average difficulty of the entire text.

+ Vocabulary size, measured here as the total word count, also influences text difficulty because a longer book contains more information.  Recalling all of the events in a plot that unfolds over 100 pages is surely more difficult than recalling the events of a plot that unfolds over 10 pages.

+ I would prefer some combination of vocabulary size and lexical diversity for a single measure of text difficulty.
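
To make the interpretation of a lexical diversity score concrete, here is a minimal sketch on a toy token list. The tokens and the `type_token_ratio` helper are made up for illustration and are not part of the homework code; the helper computes the same ratio as `lexical_diversity` below (unique tokens divided by total tokens), without the rounding.

```python
from typing import Sequence


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Unique tokens divided by total tokens (illustrative helper)."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


# Toy example: 8 tokens, only 2 distinct ones.
tokens = ["the", "cat", "the", "cat", "the", "cat", "the", "cat"]
score = type_token_ratio(tokens)
print(score)  # 0.25, i.e. one distinct word per four tokens on average
```

Because common words repeat more often as a text grows, this ratio tends to fall with length, so scores for books of very different lengths should be compared with some care.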

In [4]:
# books of varying length and difficulty

BOOKS = [
    'barrie-peter-pan.txt',
    'baum-wizard-of-oz.txt',
    'carroll-alice.txt',
    'doyle-sherlock.txt',
    'hawthorne-scarlet-letter.txt',
    'homer-iliad.txt',
    'machiavelli-prince.txt',
    'verne-around-the-world.txt'
]

# folder containing the text files
ROOT_PATH = "/Users/galois/nltk_data/corpora/gutenberg"

# required encoding to properly read the texts
CHAR_ENCODING = 'latin-1'

In [5]:
import nltk
from typing import List, Union
from nltk.corpus.reader.util import StreamBackedCorpusView as CorpusView

def lexical_diversity(text: Union[CorpusView, nltk.Text]) -> float:
    """
    Return the lexical diversity of a text, rounded to 3 decimal places.
    lexical diversity = Num Unique Words / Num Total Words
    """

    assert isinstance(text, (CorpusView, nltk.Text))

    # a corpus view is already a sequence of tokens;
    # nltk.Text keeps its tokens in the `tokens` attribute
    if isinstance(text, CorpusView):
        words = text
    else:
        words = text.tokens

    return round(len(set(words)) / len(words), 3)

def num_words(text: Union[CorpusView, nltk.Text]) -> int:
    """
    Return the number of words (tokens) in a text.
    """

    assert isinstance(text, (CorpusView, nltk.Text))

    # a corpus view is already a sequence of tokens;
    # nltk.Text keeps its tokens in the `tokens` attribute
    if isinstance(text, CorpusView):
        words = text
    else:
        words = text.tokens

    return len(words)


In [6]:
from nltk.corpus import PlaintextCorpusReader

# load books from Gutenberg corpus folder
texts = PlaintextCorpusReader(
    root=ROOT_PATH,
    fileids=BOOKS,
    encoding=CHAR_ENCODING,
)

In [7]:
import pandas as pd

# store book name, word count, and lexical diversity for each book
text_info = [
    (
        name.removesuffix(".txt"),  # drop the file extension (Python 3.9+)
        num_words(texts.words(name)),
        lexical_diversity(texts.words(name)),
    )
    for name in texts.fileids()
]

# convert data to Pandas dataframe for sorting and pretty printing
diversity_scores = pd.DataFrame.from_records(
    text_info, columns=["book name", "num_words", "lexical_diversity"]
)


In [8]:
# sorted by author

diversity_scores

|  | book name | num_words | lexical_diversity |
| :----------- | :----------- | :----------- | :----------- |
| 0 | barrie-peter-pan | 66248 | 0.089 |
| 1 | baum-wizard-of-oz | 53699 | 0.072 |
| 2 | carroll-alice | 34110 | 0.088 |
| 3 | doyle-sherlock | 138480 | 0.067 |
| 4 | hawthorne-scarlet-letter | 108954 | 0.092 |
| 5 | homer-iliad | 248389 | 0.065 |
| 6 | machiavelli-prince | 61012 | 0.101 |
| 7 | verne-around-the-world | 85394 | 0.096 |


In [9]:
# sorted by word count

(
    diversity_scores
    .sort_values("num_words", ascending=False)
    .assign(
        num_words_rescaled=diversity_scores.num_words / diversity_scores.num_words.max()
    )
)


|  | book name | num_words | lexical_diversity | num_words_rescaled |
| :----------- | :----------- | :----------- | :----------- | :----------- |
| 5 | homer-iliad | 248389 | 0.065 | 1.0 |
| 3 | doyle-sherlock | 138480 | 0.067 | 0.557513 |
| 4 | hawthorne-scarlet-letter | 108954 | 0.092 | 0.438643 |
| 7 | verne-around-the-world | 85394 | 0.096 | 0.343791 |
| 0 | barrie-peter-pan | 66248 | 0.089 | 0.266711 |
| 6 | machiavelli-prince | 61012 | 0.101 | 0.245631 |
| 1 | baum-wizard-of-oz | 53699 | 0.072 | 0.216189 |
| 2 | carroll-alice | 34110 | 0.088 | 0.137325 |


In [10]:
# sorted by lexical diversity

(
    diversity_scores
    .sort_values("lexical_diversity", ascending=False)
    .assign(
        lexical_diversity_rescaled=diversity_scores.lexical_diversity
        / diversity_scores.lexical_diversity.max()
    )
)


|  | book name | num_words | lexical_diversity | lexical_diversity_rescaled |
| :----------- | :----------- | :----------- | :----------- | :----------- |
| 6 | machiavelli-prince | 61012 | 0.101 | 1.0 |
| 7 | verne-around-the-world | 85394 | 0.096 | 0.950495 |
| 4 | hawthorne-scarlet-letter | 108954 | 0.092 | 0.910891 |
| 0 | barrie-peter-pan | 66248 | 0.089 | 0.881188 |
| 2 | carroll-alice | 34110 | 0.088 | 0.871287 |
| 1 | baum-wizard-of-oz | 53699 | 0.072 | 0.712871 |
| 3 | doyle-sherlock | 138480 | 0.067 | 0.663366 |
| 5 | homer-iliad | 248389 | 0.065 | 0.643564 |
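
The two rescaled columns computed above suggest one possible way to build the single combined measure proposed in the commentary. The sketch below is only an illustration, not a validated difficulty metric: the `combined_score` function, the `BOOK_STATS` name, and the equal weighting are all assumptions introduced here. It uses plain Python with the figures copied from the tables above so that it runs without the corpus files.

```python
# (num_words, lexical_diversity) copied from the tables above.
BOOK_STATS = {
    "barrie-peter-pan": (66248, 0.089),
    "baum-wizard-of-oz": (53699, 0.072),
    "carroll-alice": (34110, 0.088),
    "doyle-sherlock": (138480, 0.067),
    "hawthorne-scarlet-letter": (108954, 0.092),
    "homer-iliad": (248389, 0.065),
    "machiavelli-prince": (61012, 0.101),
    "verne-around-the-world": (85394, 0.096),
}

# Rescale each measure by its maximum, as in the cells above.
max_words = max(n for n, _ in BOOK_STATS.values())
max_diversity = max(d for _, d in BOOK_STATS.values())


def combined_score(num_words: int, diversity: float, weight: float = 0.5) -> float:
    """Weighted mean of rescaled word count and rescaled lexical diversity.

    ASSUMPTION: the 50/50 default weight is arbitrary and hypothetical.
    """
    return (
        weight * (num_words / max_words)
        + (1 - weight) * (diversity / max_diversity)
    )


scores = {name: combined_score(n, d) for name, (n, d) in BOOK_STATS.items()}

# Print the books from highest to lowest combined score.
for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {value:.3f}")
```

Changing `weight` shifts the emphasis between length and lexical diversity, so the resulting ranking depends on that choice.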
