# Summary Statistics

I have shared with you data sets for the books and poems submitted by the students in the course. Add those two data sets to this notebook.

In [None]:
import numpy as np 
import pandas as pd 
from typing import *
from nltk import tokenize
import re
from IPython.display import display, Markdown
import os
from matplotlib import pyplot as plt


## Description

Our first attempt to summarize our documents is to calculate and
visualize statistical information. One useful technique for visualizing information is formatting it in a *table*. Markdown blocks provide an easy way to create tables:

```
| Book              |   Author       |
|-------------------|----------------|
| Le Morte D'Arthur | Thomas Malory  |
| Moby Dick         | Herman Melville|
```
 
 is rendered as:
 
| Book              |   Author       |
|-------------------|----------------|
| Le Morte D'Arthur | Thomas Malory  |
| Moby Dick         | Herman Melville|

This format is a convenient way to summarize information. Using the provided function below, you can create and visualize Markdown tables directly from Python lists:
 

In [None]:
def show_markdown_table(headers: List[str], data: List) -> None:
    # Build the header row, a right-aligned separator row, and one row per record.
    header_row = f"| {' | '.join(headers)} |\n"
    separator_row = f"| {' | '.join([max(1, len(header) - 1) * '-' + ':' for header in headers])} |\n"
    body = ''.join(f"| {' | '.join([str(item) for item in row])} |\n" for row in data)
    # The table is displayed, not returned.
    display(Markdown(header_row + separator_row + body))
    
show_markdown_table(['Book', 'Author'], [["Le Morte D'Arthur", "Thomas Malory"], ['Moby Dick', 'Herman Melville']])

## Accessing Corpus Files

Our books and poems are in a data repository that is included with this notebook. Let's generate a Markdown table to show the book files in the repository:


In [None]:

show_markdown_table(['Book File Name'], [[file] for file in os.listdir('/kaggle/input/csci-270-books-2022')])

And here are the poems:

In [None]:
show_markdown_table(['Poem File Name'], [[file] for file in os.listdir('/kaggle/input/csci-270-poems-2022')])

Write a function to open every file in a given directory, and return a dictionary where the keys are the filenames and the values are the contents of the files.

In [None]:
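def file_dictionary(file_path: str) -> Dict[str,str]:
    # Avoid naming the result `dict`, which would shadow the built-in type.
    contents_by_name = {}
    for file in os.listdir(file_path):
        # A context manager closes each file as soon as it has been read.
        with open(os.path.join(file_path, file)) as f:
            contents_by_name[file] = f.read()
    return contents_by_name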

Test `file_dictionary()` below. It should display a table of each poem filename along with the number of characters in the file.

In [None]:
poems = file_dictionary('/kaggle/input/csci-270-poems-2022')
poem_lengths = [[filename, len(contents)] for filename, contents in poems.items()]
show_markdown_table(['File', '# Chars'], poem_lengths)

Write a function to return a list of all **letters** in a document. For this and all of the functions we will write, make sure all letters are shifted to lower case.

In [None]:
def all_letters_from(text: str) -> List[str]:
    # Keep alphabetic characters only; digits, punctuation, and all whitespace are dropped.
    return [ch for ch in text.lower() if ch.isalpha()]

In [None]:
def all_vowels_from(text: str) -> List[str]:
    vowels = set('aeiou')
    # Filter in a single pass; removing items from a list one by one is quadratic on long books.
    return [ch for ch in text.lower() if ch in vowels]

In [None]:
letter_test = all_letters_from('This is a test.')
print(letter_test)
letter_test == list('thisisatest')

In [None]:
vowel_test = all_vowels_from('This is a test.')
print(vowel_test)
vowel_test == list('iiae')

Write a function to return a list of all **tokens** in a document. A **token** is a contiguous sequence of text that consists only of alphanumeric characters.
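
As a quick illustration of this definition, the sketch below pulls out runs of alphanumeric characters with a regular expression. The helper name `alnum_runs` is hypothetical, and treating only ASCII letters and digits as alphanumeric is an assumption. Note how an apostrophe splits a contraction into two tokens, which may or may not be what you want.

```python
import re

def alnum_runs(text: str) -> list:
    # Hypothetical helper: lower-case the text, then collect every maximal
    # run of ASCII letters and digits. Anything else acts as a separator.
    return re.findall(r"[a-z0-9]+", text.lower())

print(alnum_runs("Don't panic: it's 42!"))
# → ['don', 't', 'panic', 'it', 's', '42']
```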

In [None]:
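def all_tokens_from(text: str) -> List[str]:
    # A token is a maximal run of alphanumeric characters; punctuation,
    # underscores, and whitespace all act as separators.
    return re.findall(r'[^\W_]+', text.lower())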

In [None]:
token_test = all_tokens_from("This is a test.")
print(token_test)
token_test == ['this', 'is', 'a', 'test']

Write a function to return a list of all **unique** tokens in a document. You are encouraged to call `all_tokens_from()` as part of your solution.

In [None]:
def all_unique_tokens_from(text: str) -> List[str]:
    # dict.fromkeys() drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(all_tokens_from(text)))

In [None]:
unique_test = all_unique_tokens_from("This is a test. This is only a test.")
print(unique_test)
len(unique_test) == 5 and all(word in unique_test for word in ['this', 'is', 'a', 'test', 'only'])

Write a function to return a list of all **sentences** in a document. It is up to you to define a "sentence" for this purpose. Strive to come up with a definition that matches our intuitions as closely as possible. Each sentence should be represented as a list of the words it contains.
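
One possible starting point, sketched under a deliberately naive assumption: a sentence ends at `.`, `!`, or `?` followed by whitespace. The helper name `naive_sentences` is hypothetical. The example also shows where this definition departs from intuition, since the abbreviation "Mr." is treated as the end of a sentence.

```python
import re

def naive_sentences(text: str) -> list:
    # Hypothetical definition: split after '.', '!' or '?' when whitespace follows.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    # Represent each sentence as its lower-cased alphanumeric words.
    sentences = [re.findall(r"[a-z0-9]+", piece.lower()) for piece in pieces]
    # Drop pieces that contain no words at all.
    return [words for words in sentences if words]

print(naive_sentences("Is it a test? Yes! Mr. Smith agrees."))
# → [['is', 'it', 'a', 'test'], ['yes'], ['mr'], ['smith', 'agrees']]
```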

In [None]:
def all_sentences_from(text: str) -> List[List[str]]:
    # Text with periods goes to NLTK's sentence tokenizer; text without any
    # periods (for example, some poems) is treated as one sentence per line.
    if "." in text:
        sent_list = tokenize.sent_tokenize(text)
    else:
        sent_list = text.splitlines()
    # Represent each sentence as its lower-cased tokens, skipping empty ones.
    sentences = [all_tokens_from(sent) for sent in sent_list]
    return [sentence for sentence in sentences if sentence]
    

In [None]:
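def all_sentences_poems(text: str) -> List[List[str]]:
    # Treat each non-blank line of a poem as a sentence, represented by its tokens.
    return [all_tokens_from(line) for line in text.splitlines() if line.strip()]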

Define a "sentence" for our purposes:

**Your answer here**

Add a code box in which you test `all_sentences_from()`. Your tests should show how your implementation matches your definition.

In [None]:
sentence_test = all_sentences_from('This is a test. One test. Not many tests. But it is a test? RIGHT!')
print(sentence_test)

## General Statistics

Write a function that displays a table of the number of characters, letters, sentences, tokens, and unique tokens in each file.

In [None]:
def general_statistics(file2text: Dict[str,str], sentences_from=all_sentences_from):
    # Use the file2text argument rather than the global poems/books dictionaries.
    # sentences_from picks the sentence splitter, since poems and books are split differently.
    rows = [[filename, len(contents), len(all_letters_from(contents)),
             len(sentences_from(contents)), len(all_tokens_from(contents)),
             len(all_unique_tokens_from(contents))]
            for filename, contents in file2text.items()]
    show_markdown_table(['File', 'Characters', 'Letters', 'Sentences', 'Tokens', 'Unique Tokens'], rows)

In [None]:
books = file_dictionary('/kaggle/input/csci-270-books-2022')
general_statistics(poems, all_sentences_poems)
general_statistics(books)

Answer the following questions:
1. How do you feel about your definition of a sentence in light of the above numbers? If it is not adequate, go back and modify it, and regenerate the above tables. If applicable, how did your modifications better match our intuitions about sentences?


    I feel as though my definition of a sentence is adequate for our purposes. I could not find a database that lists the number of sentences in any book, so I am not sure how accurate it is.

2. What might the numbers of unique tokens tell us about how these different works compare with each other?


    Each book/poem seems to have at least 10% of its total tokens as unique tokens. If we compared all texts together instead of individually, I feel like this number would decrease significantly.

3. What other initial insights can you glean from the above tables?


    It seems that the fewer words a text has, the higher its percentage of unique tokens.

## Frequency Counts

Frequency counts can yield a lot of useful information about a document. We'll begin by writing several functions to create and visualize frequency counts. Then we will examine the frequency counts of letters, token lengths, and tokens per sentence in your documents.

In [None]:
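def count(histogram: Dict[Hashable,int], item: Hashable) -> None:
    # Increment the tally for item, starting from zero if it is new.
    histogram[item] = histogram.get(item, 0) + 1
    
def count_all(items: Iterable[Hashable]) -> Dict[Hashable,int]:
    # One pass over items. Calling items.count() for every item is quadratic
    # and only works on lists, not on arbitrary iterables.
    histogram = {}
    for item in items:
        count(histogram, item)
    return histogram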

In [None]:
letter_example = ['d', 'b', 'a', 'c', 'b', 'a', 'a', 'a']
count_test = count_all(letter_example)
print(count_test)
count_test == {'d': 1, 'b': 2, 'a': 4, 'c': 1}

When we visualize frequency counts, we would like to have the highest count on the left of the graph, with the remaining counts in descending order. To help do this, we begin by writing a function that takes a dictionary of frequency counts and returns a list of pairs of keys and values, in descending sorted order. 

The `min_count` parameter is the lowest count for inclusion in the output list. This parameter enables us to filter out values that are not well-represented.
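
As a cross-check on this kind of ranking, the standard library's `collections.Counter` can produce the same sort of output. This is only a sketch: the helper name `ranked_with_counter` is hypothetical, and it is not meant to replace the function you are asked to write. It shows how `min_count` filters out poorly represented items.

```python
from collections import Counter

def ranked_with_counter(items, min_count=0):
    # Hypothetical cross-check: Counter.most_common() returns (item, count)
    # pairs sorted from the highest count to the lowest.
    return [(item, n) for item, n in Counter(items).most_common() if n >= min_count]

letters = ['d', 'b', 'a', 'c', 'b', 'a', 'a', 'a']
# Items that occur fewer than 2 times are left out.
print(ranked_with_counter(letters, min_count=2))
# → [('a', 4), ('b', 2)]
```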

In [None]:
def find_ranking(histogram: Dict[Hashable,int], min_count=0) -> List[Tuple[Hashable,int]]:
    # Highest count first; ties are broken by ascending key so the order is deterministic.
    kept = [(key, n) for (key, n) in histogram.items() if n >= min_count]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

In [None]:
ranking_test = find_ranking(count_test)
print(ranking_test)
ranking_test == [('a', 4), ('b', 2), ('c', 1), ('d', 1)]

The `unzip()` function is a useful utility function. Given a list of tuples, it returns a tuple of sequences, one per tuple position. This is helpful when transforming the result of `find_ranking()` into a form we can graph.
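
A brief illustration of the idea, using made-up pairs: unpacking the list with `*` hands each pair to `zip()` as a separate argument, and `zip()` then regroups the values by position.

```python
pairs = [('a', 4), ('b', 2), ('c', 1)]

# zip(*pairs) is the same as zip(('a', 4), ('b', 2), ('c', 1)).
keys, values = tuple(zip(*pairs))

print(keys)    # → ('a', 'b', 'c')
print(values)  # → (4, 2, 1)
```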

In [None]:
def unzip(tuple_values):
    # From https://appdividend.com/2020/10/19/how-to-unzip-list-of-tuples-in-python/#:~:text=%20How%20to%20Unzip%20List%20of%20Tuples%20in,zip...%204%202%3A%20Using%20List%20Comprehension%20More%20
    return tuple(zip(*tuple_values))

## Creating a Bar Plot

Using `matplotlib` ("`plt`") to create a bar plot involves the following steps:
* Use `plt.figure` to create an object representing the plot.
* Add axes to the figure.
* Set the `x` and `y` labels of the axes.
* Plot the bars themselves. 
  * This requires two lists: one for the x values, and one for the y values.
  * Make sure the x values are strings before plotting.

In [None]:
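def bar_graph_from(x_label: str, keys2counts: List[Tuple[Hashable,int]]):
    figure = plt.figure()
    axes = figure.add_axes([0,0,1,1])
    # Label both axes; x_label describes what is being counted.
    axes.set_xlabel(x_label)
    axes.set_ylabel("Count")
    keys, values = unzip(keys2counts)
    # Convert keys to strings so numeric keys keep their ranked order
    # instead of being positioned along a numeric axis.
    axes.bar([str(key) for key in keys], values)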

In [None]:
def frequency_count_graph(x_label: str, items: List[str], min_count=0):
    histogram = count_all(items)
    ranked = find_ranking(histogram, min_count)
    bar_graph_from(x_label, ranked)

In [None]:
frequency_count_graph("Letters", letter_example)

Add code boxes to generate bar graphs using `frequency_count_graph()` for the following frequency counts, for both your book and your poem:
* Find the frequency counts for the letters in each document. 
* Find the frequency counts for the lengths of tokens in each document. 
* Find the frequency counts for the numbers of tokens in the sentences in each document. 

In [None]:
for filename, contents in poems.items():
    frequency_count_graph("Letters in {}".format(filename), all_letters_from(contents))

In [None]:
for filename, contents in poems.items():
    token_lengths = [len(token) for token in all_tokens_from(contents)]
    frequency_count_graph("Length of tokens in {}".format(filename), token_lengths)

In [None]:
for filename, contents in poems.items():
    # The graph shows the distribution of sentence lengths, not an average.
    sentence_lengths = [len(sentence) for sentence in all_sentences_poems(contents)]
    frequency_count_graph("Sentence lengths in {}".format(filename), sentence_lengths)

In [None]:
for filename, contents in books.items():
    frequency_count_graph("Letters in {}".format(filename), all_letters_from(contents))

In [None]:
for filename, contents in books.items():
    token_lengths = [len(token) for token in all_tokens_from(contents)]
    frequency_count_graph("Length of tokens in {}".format(filename), token_lengths)

In [None]:
for filename, contents in books.items():
    # The graph shows the distribution of sentence lengths, not an average.
    sentence_lengths = [len(sentence) for sentence in all_sentences_from(contents)]
    frequency_count_graph("Sentence lengths in {}".format(filename), sentence_lengths)

## Frequency Count Analysis

Compare and contrast the frequency count graphs. What insights about these documents can you obtain from them?



## Reading Level

Write a function to calculate the Flesch-Kincaid Grade Level Formula:

[![Flesch-Kincaid Grade Level formula](https://readable.io/images/content/4_fkgl.png)](https://readable.io/content/the-flesch-reading-ease-and-flesch-kincaid-grade-level/)

Use the **number of vowels** in your document as a substitute for the total number of syllables.
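
For reference, the Flesch-Kincaid Grade Level formula can be written out as follows. The second form applies the vowel substitution described above; the symbols $W$, $S$, and $V$ (total words, sentences, and vowels) are introduced here only for illustration.

$$
\text{grade level} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59 \;\approx\; 0.39\,\frac{W}{S} + 11.8\,\frac{V}{W} - 15.59
$$

As a worked example with made-up numbers, a text with $W = 100$, $S = 10$, and $V = 150$ scores $0.39 \cdot 10 + 11.8 \cdot 1.5 - 15.59 = 3.9 + 17.7 - 15.59 = 6.01$, roughly a sixth-grade level.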

In [None]:
def grade_level_formula(text: str) -> float:
    wrds = len(all_tokens_from(text))
    sent = len(all_sentences_from(text))
    vowel = len(all_vowels_from(text))
    return (0.39 * (wrds / sent ) + 11.8 * ( vowel/ wrds) - 15.59)

Assess the reading level of your document using your function. 

Does this match your assumptions? 

**Your answer here**

How does the formula interact with the intuitive basis of your assumptions?

**Your answer here**

In [None]:
# Code for grade level of your poem
grade_level_formula(poems['The road not taken.txt'])

In [None]:
# Code for grade level of your book
grade_level_formula(books['huck-finn_fixed.txt'])

## Zipf's Law

Write a function to create a plot of Zipf's Law. In the function:
* Find the frequency counts for the tokens in the document. 
* Rank the tokens in descending order based on their counts.
* Create a log-log plot where the x-axis is the rank of the token, and the
y axis is the frequency count for that token. 
  * Make sure that both the `x` and `y` values are integers.
  * Set the axes to log scale after doing everything else to set up the plot.
* Add to your plot a line graph of Zipf's law, where the y value is the
frequency count of the top ranked token divided by the x value.
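
The steps above can be sketched roughly as follows. This is one possible approach, not the required solution: the names `zipf_series` and `zipf_plot_sketch` are hypothetical, tokens are assumed to be runs of ASCII letters and digits, and the numeric part is kept separate from the plotting so it can be checked on its own.

```python
import re
from collections import Counter

def zipf_series(text: str):
    # Hypothetical helper. Tokens are runs of ASCII letters/digits (an assumption).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Frequency counts, ranked from most to least common.
    counts = sorted(Counter(tokens).values(), reverse=True)
    ranks = list(range(1, len(counts) + 1))
    # Zipf's law predicts: count of the top-ranked token divided by the rank.
    predicted = [counts[0] / rank for rank in ranks]
    return ranks, counts, predicted

def zipf_plot_sketch(text: str):
    # Imported here so zipf_series() stays usable without matplotlib installed.
    from matplotlib import pyplot as plt
    ranks, counts, predicted = zipf_series(text)
    figure, axes = plt.subplots()
    axes.plot(ranks, counts, marker=".", linestyle="none", label="Observed")
    axes.plot(ranks, predicted, label="Zipf's law")
    axes.set_xlabel("Rank")
    axes.set_ylabel("Frequency count")
    axes.legend()
    # Switch to log scale last, after everything else is set up.
    axes.set_xscale("log")
    axes.set_yscale("log")
    return axes

print(zipf_series("a a a b b c"))
# → ([1, 2, 3], [3, 2, 1], [3.0, 1.5, 1.0])
```

On a log-log plot, the predicted values fall on a straight line with slope -1, so the closer the observed points lie to that line, the more closely the document follows Zipf's law.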



In [None]:
def token_analysis(text: str):
    # Your code here, to create a plot of Zipf's Law
    pass

In [None]:
# Code for Zipf's Law plot of your poem

In [None]:
# Code for Zipf's Law plot of your book

How closely do each of your documents follow Zipf's Law?

**Your answer here**

Write a function to calculate the fraction of tokens found only once in a document. To do this:
* Find the frequency counts of the tokens.
* Find the frequency counts of the frequency counts themselves.
* Divide the number of frequency counts of `1` by the total number of tokens.

These tokens are known as *[hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon)*, from the Greek for "said once." When translating texts, these tokens are difficult to process because they lack repeated statistical context clues for their meaning.
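
A minimal sketch of these three steps, assuming tokens are runs of ASCII letters and digits and using `collections.Counter` for both levels of counting. The name `hapax_fraction_sketch` is hypothetical, and returning `0.0` for an empty document is an assumption made to avoid dividing by zero.

```python
import re
from collections import Counter

def hapax_fraction_sketch(text: str) -> float:
    # Hypothetical helper. Tokens are runs of ASCII letters/digits (an assumption).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    # Step 1: how often each token occurs.
    token_counts = Counter(tokens)
    # Step 2: how many tokens share each frequency count.
    count_of_counts = Counter(token_counts.values())
    # Step 3: tokens seen exactly once, divided by the total number of tokens.
    return count_of_counts[1] / len(tokens)

print(hapax_fraction_sketch("This is a test. This is only a test. This is really just a test."))
# → 0.2
```

In this example, "only", "really", and "just" each appear once among 15 tokens, giving 3/15.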

In [None]:
def hapax_legomena_fraction(text: str) -> float:
    # Your code here
    pass

In [None]:
hltest = hapax_legomena_fraction("This is a test. This is only a test. This is really just a test.")
print(hltest)
hltest == (3/15)

Write code to display a table of the *hapax legomena* values for each book and poem in our data set.

In [None]:
# Write code here to display the table for the poems.

In [None]:
# Write code here to display the table for the books.

What might you hypothesize or conclude about the works in the corpus from the *hapax legomena* values in the table?

**Your answer here**

Consolidate all of the text documents into a single string. What is the *hapax legomena* value for the corpus as a whole? 
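
One detail worth watching when consolidating: the separator used to join the documents. The toy dictionary below is made up for illustration. Joining with an empty string fuses the last word of one file with the first word of the next, which would create a spurious token.

```python
toy_corpus = {'a.txt': 'the end', 'b.txt': 'start here'}

# No separator: 'end' and 'start' run together into one bogus token.
joined_plain = ''.join(toy_corpus.values())
# A newline separator keeps the documents apart.
joined_newline = '\n'.join(toy_corpus.values())

print(joined_plain.split())    # → ['the', 'endstart', 'here']
print(joined_newline.split())  # → ['the', 'end', 'start', 'here']
```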

In [None]:
# Write code here to find the hapax legomena value for the corpus as a whole.

What might you infer about the documents from our corpus in light of this summative value?

**Your answer here**