Let's look at a couple of books to see what we can learn from some simple data mining.
First, we will load our packages.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

from urllib.request import urlopen 
import re
def read_url(url): 
    return re.sub('\\s+', ' ', urlopen(url).read().decode())
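
The `read_url` helper collapses every run of whitespace (spaces, tabs, newlines) into a single space. The short sketch below shows that step on its own, using a made-up sample string rather than a downloaded book, so it runs without a network connection.

```python
import re

# A made-up sample with newlines, a tab, and repeated spaces.
sample = "CHAPTER I.\n\nYou don't know about me\twithout you have read   a book"

# Same substitution that read_url applies to the downloaded text.
flattened = re.sub(r'\s+', ' ', sample)
print(flattened)
```

After this step the whole book is one long line, which is why the later cells can split it on a plain string such as `'CHAPTER '`.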

For the Huck Finn and Little Women books:
1. Assign each URL, as a text (string) value, to the appropriately named variable.
2. Read the text into the named variables (`*_text`). Both books are out of copyright and are in the public domain.
3. Split the text into chapters.

In [None]:
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
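
The slices `[44:]` and `[1:]` drop the pieces that come before the first real chapter. One plausible reading, which is an assumption here, is that the Huck Finn file has front matter (such as a table of contents) that also contains the string `'CHAPTER '`. The toy text below is made up to show how `split` and a slice work together in that situation.

```python
# A made-up "book": a title, a two-entry table of contents, then two chapters.
toy_text = ("A TOY BOOK CONTENTS CHAPTER I. Arrival CHAPTER II. Departure "
            "CHAPTER I. It was a dark night. CHAPTER II. The sun came up.")

pieces = toy_text.split('CHAPTER ')

# pieces[0] is the title, pieces[1:3] are the contents entries,
# so the real chapters start at index 3.
toy_chapters = pieces[3:]
print(len(pieces), toy_chapters)
```

Checking `len(huck_finn_chapters)` after the split is a quick way to confirm that the slice leaves the expected number of chapters.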

Run the cell below to display the Huck Finn chapters produced by the split.

In [None]:
huck_finn_chapters

Make a one-column table with the label "Chapters". Each row of the table holds the text of one chapter.

In [None]:
Table().with_column('Chapters', huck_finn_chapters)

Using the NumPy package, count how many times the string "Tom" appears in each chapter of the Huck Finn corpus. (The corpus is the body of text we are working with.)

In [None]:
# Count how many times Tom appears in each chapter 
# Note, starting a line in the code box with a # makes that line a comment that is not executed.
np.char.count(huck_finn_chapters, 'Tom')
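
To see what `np.char.count` does, it helps to try it on a few short strings. The sketch below uses three made-up "chapters" and compares the result with calling the ordinary string method `count` on each one.

```python
import numpy as np

# Three made-up chapters, standing in for huck_finn_chapters.
toy_chapters = ['Tom ran. Tom hid.', 'Jim waited for Tom.', 'Nobody came.']

# One count per chapter, returned as an array.
tom_counts = np.char.count(toy_chapters, 'Tom')

# The same counts, computed one string at a time.
same_counts = [chapter.count('Tom') for chapter in toy_chapters]

print(tom_counts, same_counts)
```

The array has one entry per chapter, in chapter order, which is what makes it usable as a table column.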

# *Your turn:*
## Copy and paste the code from above.
## Modify it to count the number of times Jim appears in each chapter.

In [None]:
# Insert your code in this cell


- [x] Make a table that has three columns (attributes).
- [x] The first column will be labeled "Tom".
- [x] The second column will be labeled "Jim".
- [ ] Add a third column and label it "Huck".
- [ ] Run the code and describe the results that are shown.

In [None]:
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
 #    <insert your code here, include indent and comma>
])
counts.show() # this displays every row of the counts table

?

In [None]:
# Plot the cumulative counts:
# how many times in Chapter 1, how many times in Chapters 1 and 2, and so on.

cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 44, 1))
cum_counts.plot(column_for_xticks='Chapter')
plots.title('Cumulative Number of Times Name Appears');
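
A cumulative count is a running total. The sketch below uses `np.cumsum` on made-up per-chapter counts; it is assumed here to mirror what the table's `cumsum` method does to each column.

```python
import numpy as np

# Made-up counts of a name in four chapters.
per_chapter = np.array([3, 0, 2, 5])

# Running total: chapter 1, chapters 1-2, chapters 1-3, chapters 1-4.
running_total = np.cumsum(per_chapter)
print(running_total)
```

A flat stretch in the running total means the name does not appear in those chapters, and a steep stretch means it appears often.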

Interpret the relationship between the characters based on the visualization.

?

This exercise also illustrates one of the shortcomings of text mining. Starting in chapter 33, Huck impersonates Tom as part of the storyline. A reader can tell which character is meant each time, but that would be very difficult for an algorithm. Our simple program counts occurrences of a name, not references to the person the name stands for.
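
A related limitation is that plain counting matches the letters of a name wherever they occur, even inside a longer word. The sketch below uses a made-up sentence to compare a plain count with a whole-word count built on the standard `re` module. The whole-word version is only an illustration and is not part of the notebook's method.

```python
import re

# A made-up sentence in which "Tom" also appears inside longer words.
sentence = "Tom Sawyer said Tomorrow; Tom's aunt called Tommy."

# Plain substring count: matches Tom, Tomorrow, Tom's, and Tommy.
plain_count = sentence.count('Tom')

# Whole-word count: \b marks a word boundary, so only Tom and Tom's match.
whole_word_count = len(re.findall(r'\bTom\b', sentence))

print(plain_count, whole_word_count)
```

Neither version can tell who is meant when one character uses another's name, so the point about impersonation still stands.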

# Little Women

Follow along as we carry out the same exercise for Little Women as we performed for Huck Finn. Remember that we imported the corpus for Little Women previously.

In [None]:
# The chapters of Little Women

Table().with_column('Chapters', little_women_chapters)

In [None]:
# Counts of names in the chapters of Little Women

people = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
people_counts = {pp: np.char.count(little_women_chapters, pp) for pp in people}

counts = Table().with_columns([
        'Amy', people_counts['Amy'],
        'Beth', people_counts['Beth'],
        'Jo', people_counts['Jo'],
        'Laurie', people_counts['Laurie'],
        'Meg', people_counts['Meg']
    ])

The information is stored in the named variables. There is no output, but something did happen!
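
To see what the dictionary holds, here is the same pattern on three made-up chapters, using the plain string method `count` in place of `np.char.count` so that it runs with the standard library alone.

```python
# Three made-up chapters, standing in for little_women_chapters.
toy_chapters = ['Jo wrote. Amy drew.', 'Jo and Beth sang.', 'Amy sailed away.']
toy_people = ['Amy', 'Beth', 'Jo']

# One list of per-chapter counts for each name.
toy_counts = {pp: [chapter.count(pp) for chapter in toy_chapters]
              for pp in toy_people}

print(toy_counts['Jo'])
```

Typing a variable name on the last line of a cell, or passing it to `print`, is how to inspect a stored value.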

In [None]:
# Plot the cumulative counts

cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 48, 1))
cum_counts.plot(column_for_xticks=5)
plots.title('Cumulative Number of Times Name Appears');

Who is the main, or dominant, character?

?

Laurie is a man who marries one of the sisters.

From looking at the data, who do you believe he marries?

?

Why do you believe that is the case?

?

If you guessed that Laurie married Amy, congratulations! Looking at the chart, that is the most likely outcome, given how the numbers of mentions of the two names move together, kind of like a couple.

Let's take a simple look at the style of each book. One simple measure is the number of characters in each chapter divided by the number of periods in that chapter, which approximates the average sentence length in characters.

In [None]:
# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.

chars_periods_hf = Table().with_columns([
        'HF Chapter Length', [len(s) for s in huck_finn_chapters],
        'Number of Periods', np.char.count(huck_finn_chapters, '.')
    ])
chars_periods_lw = Table().with_columns([
        'LW Chapter Length', [len(s) for s in little_women_chapters],
        'Number of Periods', np.char.count(little_women_chapters, '.')
    ])
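
The two tables hold the ingredients of the measure: chapter length and number of periods. The ratio itself could be computed as in the sketch below, which uses two made-up chapters and plain Python lists. It assumes every chapter contains at least one period, since a chapter with none would cause a division by zero.

```python
# Two made-up chapters: one with short sentences, one with a long sentence.
toy_chapters = ["I went. He came. We ate.",
                "It was a long and winding road that led nowhere."]

lengths = [len(s) for s in toy_chapters]        # characters per chapter
periods = [s.count('.') for s in toy_chapters]  # periods per chapter

# Characters per period, a rough stand-in for average sentence length.
avg_sentence_length = [n / p for n, p in zip(lengths, periods)]
print(avg_sentence_length)
```

Periods also appear in abbreviations such as "Mr." and sentences can end with other punctuation, so this is a rough measure rather than an exact one.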

In [None]:
# The counts for Huckleberry Finn

chars_periods_hf

In [None]:
# The counts for Little Women

chars_periods_lw

And now let's plot the number of characters against the number of periods for each chapter to visualize how the two books compare.

In [None]:
plots.figure(figsize=(10,10))
plots.scatter(chars_periods_hf[1], chars_periods_hf[0], color='darkblue')
plots.scatter(chars_periods_lw[1], chars_periods_lw[0], color='gold')
plots.xlabel('Number of periods in chapter')
plots.ylabel('Number of characters in chapter');

Huck Finn has 43 chapters.

Little Women has 47 chapters.

Which book has more long chapters?

?

Which book is likely written with a more complex style?

?

What about the data led you to select the book you did as the more complex?

?

### Both books are in the public domain and available through Project Gutenberg and other places:
1. The Adventures of Huckleberry Finn by Mark Twain: https://www.gutenberg.org/ebooks/76
2. Little Women by Louisa May Alcott: https://www.gutenberg.org/ebooks/514