# More Text Analysis: Finding Words that Uniquely Characterize a Book

Relevant resources:
 * [Python & Pandas Field Guide: Chapter 4](https://snakebear.science/04-Functions/toctree.html)

We're going to write some code to do a text analysis that finds words that uniquely characterize a book.

In previous exercises we counted letters or words. We'll do more counting here, but if we only look for words that are used frequently in a single book, we will get a lot of common but unimportant words, like "the," "a," "he," and "she," that would occur in nearly any English-language book.  We can find more *interesting* words by excluding words that also occur in another book and then keeping only the remaining words that occur frequently.

The notebook below will work through this process.  We've provided some functions that handle low-level parts of the analysis.  You'll use those functions and write some of your own to perform the analysis.

Goals:
  * Practice *using* functions that take arguments and return values.
  * Practice *writing* new functions.
  * Work on a realistic data science task.

First, we have a cell for import statements, which is almost always the first code cell in a notebook.  For the code we've provided, we're importing the `collections` module.  You can add more import statements here if needed.

In [1]:
import collections

## Provided Functions

Each of the functions here is already written and tested; they do not need to be changed.  Once you execute these cells to define the functions, you can use them below.

A comment at the top of each function describes how to use it.  Read those comments.

**Important:** You do not need to understand the code in any of these functions.  Remember *abstraction*: we just want to *use* the functions, and we'll *ignore* the complexity inside them.  (Feel free to look through them later if you want to, though.)

In [None]:
def get_words(book_filename):
    '''
    get_words() reads a file and returns a list of the words in that file.
    
    Call it with one argument: a string containing the name of the file you want to read.
    Returns: the words in that file.
    '''
    with open(book_filename) as f:
        file_contents = f.read()
    words = file_contents.split()
    words = [w.strip(",.-!?'\";:()") for w in words]
    return words

In [None]:
def calc_frequencies(words):
    '''
    calc_frequencies() counts the frequency of every word in a list of words.
    
    Call it with one argument: a list of strings.
    Returns: a collection of frequencies that you can use with get_frequency() (below).
    '''
    return collections.Counter(words)

In [None]:
def get_frequency(word, freqs):
    '''
    get_frequency() gets the frequency of a single word.
    
    Call it with two arguments:
      1) The word whose frequency you want
      2) A collection of frequencies produced by calc_frequencies() (above).
    Returns: The number of times that word occurs
    '''
    return freqs[word]

In [None]:
def unique_words(words1, words2):
    '''
    unique_words() finds words that are unique to one book compared to another.
    
    Call it with two arguments: each a list of words.
    Returns: a list of words that *are* in the first list but are *not* in the second list.
    '''
    set1 = set(words1)
    set2 = set(words2)
    return list(set1 - set2)

## 1) Try out the functions

We need to try out the functions to learn how each works before we can use them together.  Read the comment inside each function definition above for guidance on how to use it.

### `get_words()`

First, get lists of the words in two different books using `get_words()`.  The books available are:
  * *A Tale of Two Cities* in file `atotc.txt`
  * *Pride and Prejudice* in file `pandp.txt`
  * *Alice's Adventures in Wonderland* in file `alice.txt`
  * *Dracula* in file `dracula.txt`

Steps:
  * Choose whichever books you like.
  * Store one book's word list in a variable `words1` and the other book's word list in `words2`.
  * Then, inspect the lists: look at the first 10 words in each by printing `words1[:10]` and `words2[:10]`.
     * *Adding `[:10]` after the list variable is performing an operation called "slicing."  We'll learn more about it soon.*
  
You should see the words of the title plus a bit more of each book you've chosen.
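
If you want to see what `get_words()` does before opening a full book, here is a minimal sketch that runs it on a tiny temporary file. The sample sentence, the temporary file, and the names `sample_filename` and `sample_words` are made up for illustration; `get_words()` is repeated from above only so the snippet runs on its own.

```python
import os
import tempfile

def get_words(book_filename):
    # Same logic as the provided get_words() above.
    with open(book_filename) as f:
        file_contents = f.read()
    words = file_contents.split()
    words = [w.strip(",.-!?'\";:()") for w in words]
    return words

# Write a tiny stand-in "book" to a temporary file (made-up text).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("It was the best of times, it was the worst of times!")
    sample_filename = tmp.name

sample_words = get_words(sample_filename)
os.remove(sample_filename)  # clean up the temporary file

print(sample_words[:5])   # slicing: only the first five words
print(len(sample_words))  # how many words were read in total
```

Notice that punctuation attached to a word (like the comma after "times") is stripped, but capitalization is left alone, so "It" and "it" are different words.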

In [None]:
# Try get_words() here.

### `calc_frequencies()` and `get_frequency()`

Now try out `calc_frequencies()` and `get_frequency()`:
  * Call `calc_frequencies()` once for each of the word lists you generated above, passing the list as the argument.
  * Store the two return values in `freqs1` and `freqs2`.
  * Then, call `get_frequency()` to get the frequency of "the" in both of the books.
  * Print out the frequency of "the" for both books.

**Remember:** The variables you created in the cell above can be used in all later cells; you don't need to copy that code here.

You should get the following frequencies for "the" in each book:
  * `atotc.txt`: 7379
  * `pandp.txt`: 4048
  * `alice.txt`: 1515
  * `dracula.txt`: 7257
  
Once you have it working, try looking at the frequencies of some other words.
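
As a rough illustration of how these two functions behave, the sketch below uses a short made-up word list instead of a book. The names `toy_words` and `toy_freqs` are placeholders, and the two functions are repeated from above only so the snippet runs on its own.

```python
import collections

def calc_frequencies(words):
    # Same logic as the provided calc_frequencies() above.
    return collections.Counter(words)

def get_frequency(word, freqs):
    # Same logic as the provided get_frequency() above.
    return freqs[word]

# A made-up word list standing in for a book.
toy_words = ["the", "cat", "saw", "the", "dog", "The"]
toy_freqs = calc_frequencies(toy_words)

print(get_frequency("the", toy_freqs))    # 2
print(get_frequency("The", toy_freqs))    # 1, because counting is case-sensitive
print(get_frequency("zebra", toy_freqs))  # 0, because the word never occurs
```

Asking for a word that never appears gives 0 rather than an error, so it is safe to look up any word.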

In [None]:
# Try calc_frequencies() and get_frequency() here.

### `unique_words()`

Finally, try out `unique_words()`:
  * Call it with each of the two word lists you made above as its two arguments (whichever is put first is the one from which you will get unique words).
  * Store its return value in a variable named `uniq`.
  * Look at the first 20 words in the list by printing `uniq[:20]`.
  
You should see a list of 20 words that occur in the first book you chose but not the second.
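
Here is a small sketch of the same idea on made-up lists, in case it helps to see the behavior at a size you can check by eye. The names `toy1`, `toy2`, and `toy_uniq` are placeholders, and `unique_words()` is repeated from above only so the snippet runs on its own.

```python
def unique_words(words1, words2):
    # Same logic as the provided unique_words() above.
    set1 = set(words1)
    set2 = set(words2)
    return list(set1 - set2)

# Made-up word lists standing in for two books.
toy1 = ["Alice", "the", "Queen", "the", "Alice"]
toy2 = ["the", "Count", "castle"]

toy_uniq = unique_words(toy1, toy2)

# Sorting is optional; it just makes the printed order predictable.
print(sorted(toy_uniq))                    # ['Alice', 'Queen']
print(sorted(unique_words(toy2, toy1)))    # ['Count', 'castle']
```

Two things to notice: each unique word appears only once in the result, even if it was repeated in the book, and swapping the arguments changes the answer. Because the function works with sets internally, the order of the returned list is not guaranteed, so your first 20 words may differ from a classmate's.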

In [None]:
# Try unique_words() here.

## 2) Do some analysis

Words that are found in one book and not in another are interesting, but they might not be *important* to the book.  For example, one of the words found in *A Tale of Two Cities* and not in *Pride and Prejudice* is "exterior."  Okay, sure.

So let's try to find words that are in one book and not in another *and occur frequently* in the first book.

Here's the idea:

1. Choose two books.
2. Get the words and word-frequencies for each book chosen.
3. Find the words that are unique to the first book and not in the second.

   *[You've already practiced everything up to this point above.]*
4. For every word in the set of unique words:
   1. Find its frequency within the first book.
   2. If its frequency is at least some lower limit: print out the word and its frequency.
   
A good lower limit to choose at first is 50.  That is, with a lower limit of 50, you will find all words that are only in the first book *and* occur at least 50 times in that book.

**Caution:** if you set the lower limit *too* low, it will find and print out a *lot* of words.  If you lower it below 50, try 40 or 30 before you jump to 10 or something.
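
To make the loop-and-filter step concrete, here is one possible sketch on made-up data. It is not the only way to write it. The toy lists, the limit of 3, and the dictionary `frequent_unique` are assumptions for illustration; with real books you would use the provided functions and a limit near 50.

```python
import collections

# Made-up word lists standing in for two books.
toy_book1 = ["Alice"] * 5 + ["Queen"] * 3 + ["exterior"] + ["the"] * 10
toy_book2 = ["the"] * 8 + ["Count"] * 4

toy_limit = 3  # far lower than 50 because the toy lists are tiny

toy_freqs = collections.Counter(toy_book1)    # what calc_frequencies() does
toy_uniq = set(toy_book1) - set(toy_book2)    # what unique_words() does

frequent_unique = {}  # kept only so the result is easy to inspect afterwards
for word in sorted(toy_uniq):
    frequency = toy_freqs[word]
    if frequency >= toy_limit:
        frequent_unique[word] = frequency
        print("Word:", word, " frequency:", frequency)
```

Here "exterior" is unique to the first list but occurs only once, so the limit filters it out, while "the" is frequent but not unique, so it never reaches the loop.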

Write and test code for those steps in the cell below.  You should end up with a list of frequent, unique words for the book chosen.  You might notice a common pattern in what type of words show up if you run your code with different books.

In [None]:
# Write and test your code here.

As an example, your output might look like this (you can probably guess which book this is analyzing):
```
Word: Hatter  frequency: 54
Word: Queen  frequency: 67
Word: Gryphon  frequency: 55
Word: Alice  frequency: 385
Word: Mock  frequency: 56
Word: Turtle  frequency: 56
```

## 3) Make it reusable

In the code above, you have to **choose** three values: two book filenames and one lower limit on word frequency.  Everything else is calculated using those three values.

That means we can turn the above code into a **function** that takes just those three values as arguments.  The result is a reusable function that we can easily call with different arguments, whether to analyze other books or to try out different lower limits.
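
As a reminder of the general shape, here is a small, unrelated example of a function that takes its inputs as parameters and prints its results. The function `print_long_words()` and its data are hypothetical and exist only for this illustration.

```python
def print_long_words(words, min_length):
    '''
    Hypothetical example: prints every word with at least min_length letters.
    It prints its results and has no return statement, so it returns None.
    '''
    for word in words:
        if len(word) >= min_length:
            print(word)

toy_words = ["a", "tale", "of", "two", "cities"]

# The same function, reused with different arguments.
print_long_words(toy_words, 4)           # prints: tale, cities
result = print_long_words(toy_words, 6)  # prints: cities
print(result)                            # None
```

Everything the function needs comes in through its parameters, which is what makes it easy to call again with different values.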

Define a function in the cell below called `print_frequent_words()`:
  * It should have three **parameters**: two filenames and one lower limit on the frequency
  * It should use those three parameters to do the same thing the earlier analysis did
  * It should print out the analysis results, just like the above code did.  It will therefore be a **void** function.

In [None]:
# Define print_frequent_words() here.

Test the function below by calling it multiple times with different filenames and different lower limits on word frequency.

In [None]:
# Test your print_frequent_words() function here.

## Additional consideration: More efficient code

When dealing with large sets of data, it's important to think about how *efficient* the code is.  Every time we call the function above, it has to read both files and compute the frequencies of their words again.  If we call it multiple times, that is a lot of redundant work!

A better approach is to separate reading the data and calculating the frequencies from the work of finding frequent, unique words.  The function would then have parameters for two word lists, two sets of frequencies, and the lower frequency limit.  The idea is to calculate the word lists and frequencies once, store them in variables, and pass those variables into the function without having to recalculate them.
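
One possible shape for that variant is sketched below on made-up data. The function name `print_frequent_words_precomputed()` and the toy lists are assumptions for illustration, and the function also returns what it printed purely so the result is easy to check.

```python
import collections

def print_frequent_words_precomputed(words1, words2, freqs1, freqs2, lower_limit):
    '''
    Hypothetical variant: takes word lists and frequencies that were
    computed ahead of time, so nothing is re-read or re-counted here.
    freqs2 is accepted to match the parameter list described above,
    although this sketch does not need it.
    Returns the (word, frequency) pairs it printed.
    '''
    found = []
    for word in sorted(set(words1) - set(words2)):
        if freqs1[word] >= lower_limit:
            print("Word:", word, " frequency:", freqs1[word])
            found.append((word, freqs1[word]))
    return found

# Made-up data, computed once...
toy1 = ["Alice"] * 5 + ["Queen"] * 3 + ["the"] * 10
toy2 = ["the"] * 8 + ["Count"] * 4
toy_freqs1 = collections.Counter(toy1)
toy_freqs2 = collections.Counter(toy2)

# ...and reused across several calls with different limits.
high = print_frequent_words_precomputed(toy1, toy2, toy_freqs1, toy_freqs2, 4)
low = print_frequent_words_precomputed(toy1, toy2, toy_freqs1, toy_freqs2, 2)
```

With real books, the expensive steps (reading the files and counting words) would happen once in an earlier cell, and only the cheap filtering step would repeat on each call.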

If you'd like more practice, try writing and testing that variant of the function.  You might even notice that it runs faster than the one above.

In [None]:
# Optional: Write and test your more efficient code here.