# a5 - Data Analysis
Fill in the code cells below as specified. Note that cells may utilize variables and functions defined in previous cells; we should be able to use the `Kernel > Restart & Clear All` menu item followed by `Cell > Run All` to execute your entire notebook and see the correct output.

## Part 1. Numbers
For this part of the assignment, you will analyze some numeric data (counts of library holdings) to investigate how the distribution of numbers in natural data sets obeys the counter-intuitive [Benford's Law](https://plus.maths.org/content/os/issue9/features/benford/index).

<small>(This exercise was adapted from Steve Wolfman).</small>

Create a variable **`holdings_data`** which is a **list** of the contents of the **`data/libraryholdings.txt`** file included in the repository (each line in the file should be a single element in the list). You will need to open up the file and read its contents into a list. You should specify a _local path_ to the file (from this notebook's location).
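As a rough sketch of the file-reading pattern described above: the snippet writes a tiny stand-in file to a temporary directory so that it runs anywhere. The file contents and the names `sample_path` and `raw_lines` are hypothetical; in the notebook you would open `data/libraryholdings.txt` instead.

```python
import os
import tempfile

# Hypothetical stand-in for the real data file, so this sketch is self-contained.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sample_holdings.txt")
with open(sample_path, "w") as sample_file:
    sample_file.write("Library holdings\n\n1234\n567\n")

# readlines() returns one list element per line, newline characters included.
with open(sample_path) as sample_file:
    raw_lines = sample_file.readlines()

# Print the first ten items (fewer here, since the sample is short).
for line in raw_lines[:10]:
    print(line)
```

Using `with` closes the file automatically once the block ends.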

Print out the first **ten** items from the `holdings_data` list, each on its own line. (Note that there may be extra line breaks that are included in the data items themselves).

Use the **slice operator (`:`)** to remove the "heading" and blank elements from the beginning of the data list, leaving only the list of numbers. The remaining values should continue to be stored (re-stored) in the `holdings_data` variable. Output the new first element in `holdings_data` to demonstrate that it is the first number in the data set.
- The values in the list _should_ be strings rather than integers
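A small illustration of slicing off leading elements, using a made-up list. The name `sliced_lines` and the start index `2` are assumptions for this sample only; the right index depends on how many non-numeric elements precede the data in the real file.

```python
# Hypothetical list shaped like raw file contents: a heading, a blank line, then numbers.
sliced_lines = ["Library holdings\n", "\n", "1234\n", "567\n"]

# Keep everything from index 2 onward and store it back in the same variable.
sliced_lines = sliced_lines[2:]

# The first element is now the first number, still stored as a string.
print(sliced_lines[0])
```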

Create a variable **`lead_digit_counts`** that is a dictionary whose keys are _strings_ of each digit (`"0"`, `"1"`, `"2"`, etc.), and whose values are all the number `0`. You can do this directly or with a loop. Print out the variable after you create it.

Calculate the number of times each digit appears as the _first digit_ in a value of the `holdings_data` list, storing those counts in the `lead_digit_counts` dictionary.
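One possible way to build and fill such a dictionary, shown on a handful of made-up values. The names `sample_values` and `sample_counts` are hypothetical, and this is only a sketch of the counting pattern.

```python
# Hypothetical sample values; the real list holds strings read from the file.
sample_values = ["1234\n", "567\n", "19\n", "2\n", "104\n"]

# Start every digit at zero so digits that never lead a value still appear.
sample_counts = {str(digit): 0 for digit in range(10)}

for value in sample_values:
    first_char = value[0]
    if first_char in sample_counts:  # skip anything that does not start with a digit
        sample_counts[first_char] += 1

print(sample_counts)
```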

Use a loop to print out each count in `lead_digit_counts` with the format:
```
X values have a leading digit of digit Y
```

Print the _percentage_ of values in the library holdings data set that have a leading digit **`1`** (round to 2 decimal places). Is this value as predicted by Benford's law?
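For comparison, Benford's law predicts that a leading digit `d` occurs with probability `log10(1 + 1/d)`. The sketch below tabulates those predicted percentages; the name `benford_percent` is hypothetical.

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9.
benford_percent = {
    str(digit): round(100 * math.log10(1 + 1 / digit), 2)
    for digit in range(1, 10)
}

for digit, percent in benford_percent.items():
    print(digit, percent)
```

The same `round(100 * part / total, 2)` pattern can be used to turn your own counts into percentages.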

***Extra credit challenge:*** Create a single variable `digit_position_counts` that contains the number of times that each digit 0 through 9 appears in _each_ position in the data set. E.g., a `1` appears in the 1st position 3056 times and in the 2nd position 1005 times; a `2` appears in the 1st position 1606 times and in the 2nd position 1044 times.

Use this variable to print a "table" of the percentage of the time each position contains each digit (e.g., the 1st digit is a `1` 33.44% of the time, a `2` 17.57% of the time, etc).

Note that for this extra challenge it is up to you to determine an appropriate data structure (e.g., how to combine dictionaries and lists and tuples) for representing this table. Be sure to include comments explaining your work.

Only attempt this problem once you have completed everything else!
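As one possible shape for such a structure (not the only reasonable one), the sketch below uses a list of dictionaries, where the list index is the digit position and each dictionary maps a digit string to its count. The sample values and the name `position_counts` are hypothetical.

```python
# Hypothetical sample values, already stripped of newlines.
sample_numbers = ["1234", "567", "19", "2"]

# position_counts[i][d] = number of times digit d appears at position i.
position_counts = []
for value in sample_numbers:
    for position, char in enumerate(value):
        if not char.isdigit():
            continue  # ignore stray non-digit characters
        # Grow the list the first time a new position is seen.
        while len(position_counts) <= position:
            position_counts.append({str(d): 0 for d in range(10)})
        position_counts[position][char] += 1

print(position_counts[0]["1"])  # count of values whose first digit is 1
```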

## Part 2. Life Expectancy
For this part of the assignment, you'll work with data about the life expectancy (in years) for each country in the world in the years 1960 and 2013. Note that this can be really [fun](http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html) data!

The data is found in a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file: a plain-text data format where each line represents a record (row) of data and where each feature (column) is separated by a comma.

Read in the contents of the **`data/life_expectancy.csv`** data file, and use it to construct a **list** called **`life_expectancy_list`**. Each element in this list should be a **dictionary** (one for each row in the `csv` file) with the following keys and values:

- a key `'country'` whose value is the name of the country (as a string)
- a key `'le_1960'` whose value is the life expectancy in 1960 (as a float)
- a key `'le_2013'` whose value is the life expectancy in 2013 (as a float)

Thus the first record should look like:
```
{'country': 'Aruba', 'le_1960': 65.56936585, 'le_2013': 75.33217073}
```

You should use the **`csv`** module to read this file and break up each row into different values. See [the documentation](https://docs.python.org/3/library/csv.html) for examples of how to do this (scroll down). Print out the _first row_ of your list as a demonstration that you've processed the data correctly.
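A minimal sketch of turning CSV rows into dictionaries with the `csv` module. It reads from an in-memory string so that it is self-contained; the header row, the column order, and the second record are assumptions, so check the real file before relying on them.

```python
import csv
import io

# Hypothetical stand-in for the file; the real column layout may differ.
sample_csv = io.StringIO(
    "country,le_1960,le_2013\n"
    "Aruba,65.56936585,75.33217073\n"
    "Country B,70.5,82.1\n"
)

sample_rows = []
reader = csv.reader(sample_csv)
next(reader)  # skip the header row (assumed to be present)
for name, le_1960, le_2013 in reader:
    sample_rows.append({
        "country": name,
        "le_1960": float(le_1960),  # csv yields strings, so convert explicitly
        "le_2013": float(le_2013),
    })

print(sample_rows[0])
```

When reading a real file, the `csv` documentation recommends opening it with `newline=''`.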

Add another item to each dictionary in the `life_expectancy_list` whose **key** is `change` and whose **value** is the change in life expectancy from 1960 to 2013.

Create a variable **`num_small_gain`** that stores the **number of countries** whose life expectancy did not improve by 5 years or more between 1960 and 2013. This will include countries whose life expectancy has worsened. Print out this variable.

Create a variable **`most_improved`** that is the **name of the country** with the largest gain in life expectancy (between 1960 and 2013). Print out this variable.
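The three steps above (adding a derived key, counting records that meet a condition, and finding the record with the largest value) can be sketched on made-up records as follows. All names and numbers here are hypothetical.

```python
# Hypothetical records; the values are made up for illustration.
demo_rows = [
    {"country": "A", "le_1960": 60.0, "le_2013": 62.5},
    {"country": "B", "le_1960": 45.0, "le_2013": 70.0},
    {"country": "C", "le_1960": 68.0, "le_2013": 66.0},
]

# Add a derived value to every dictionary in the list.
for row in demo_rows:
    row["change"] = row["le_2013"] - row["le_1960"]

# Count records whose gain is below 5 years (negative changes included).
small_gain_count = sum(1 for row in demo_rows if row["change"] < 5)

# max() with a key function returns the whole record with the largest change.
best_country = max(demo_rows, key=lambda row: row["change"])["country"]

print(small_gain_count, best_country)
```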

Define a function **`compare_country_le()`** that takes in the names of _two_ countries, and returns a **tuple** containing the following information:
- the name of the country with the greater life expectancy,
- the life expectancy in 2013 of that country
- the difference between the life expectancies in 2013

Use your function to print the comparison between the life expectancies of the _United States_ and _Cuba_.  
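A sketch of a function that returns several results at once as a tuple. Unlike the function the assignment asks for, this hypothetical `compare_le()` takes the list of records as an explicit first argument so that the example is self-contained; the records themselves are made up.

```python
def compare_le(rows, name_a, name_b):
    """Return (country with the higher 2013 value, that value, the difference)."""
    # Build a lookup from country name to 2013 life expectancy.
    by_name = {row["country"]: row["le_2013"] for row in rows}
    le_a, le_b = by_name[name_a], by_name[name_b]
    if le_a >= le_b:
        return (name_a, le_a, le_a - le_b)
    return (name_b, le_b, le_b - le_a)

# Hypothetical records for illustration.
pair_rows = [
    {"country": "A", "le_2013": 62.5},
    {"country": "B", "le_2013": 70.0},
]
print(compare_le(pair_rows, "A", "B"))
```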

## Part 3. Readability
For this part of the assignment, you will calculate the [readability](https://en.wikipedia.org/wiki/Readability) of a text document using the [Dale-Chall Readability Formula](http://www.readabilityformulas.com/new-dale-chall-readability-formula.php). This method determines how "easy" it is to read a particular (English) document by considering the length of sentences and how many of the words used are "easy" to understand (based on a pre-defined list of "easy" words).
- Note that this part of the assignment involves researching and using an additional set of modules. If you have any questions or get stuck, ask for help!

You will first need to load the list of "easy" words into memory. This list can be found in the **`data/dale-chall.txt`** file. Open this file and read its entire contents into a **list** variable (e.g., `easy_words_list`), where each element in the list is a single line (word) in the file.

Print out the _length_ of this list variable to check your work. It should have 2942 entries (words).

In order to "look up" easy words, convert the easy words list into a **dictionary** (e.g., `easy_words_dict`), where each **key** is a word, and each **value** is `True` (that the word is in the list).
- Make sure you do not include newline characters in your keys!

Use your `easy_words_dict` to check if the word "information" is in the set of easy words. Use the `get()` method to return a value of `False` if the word is not there (instead of producing a `KeyError`). _You don't need to use a loop!_
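A small sketch of the list-to-dictionary conversion and the `get()` lookup, using three made-up lines. The names `sample_word_lines` and `sample_word_dict` are hypothetical, and the result for "information" here says nothing about the real word list.

```python
# Hypothetical lines as they might come back from readlines().
sample_word_lines = ["a\n", "able\n", "about\n"]

# strip() removes the trailing newline so each key is a clean word.
sample_word_dict = {line.strip(): True for line in sample_word_lines}

# get() returns the second argument when the key is missing, so no KeyError.
print(sample_word_dict.get("able", False))
print(sample_word_dict.get("information", False))
```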

Additionally, define a dictionary **`readability_grade_dict`** to use for looking up the "grade level" associated with the readability score you eventually compute (see [this table](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula)). This dictionary should have **keys** that are ___tuples___ giving the range of scores for a particular grade (e.g., `(5.0, 5.9)`), and **values** that are ___strings___ representing the grade level of the text (e.g., `"5th or 6th grade"`).

Define a function **`print_grade()`** that takes in a readability score (a number greater than or equal to 0), and **prints** a string representing the grade associated with that score (from your `readability_grade_dict` dictionary). _Hint:_ you will need to loop through the items in the dictionary and determine which "tuple" key has elements that the score falls between. Be sure to round the score to one decimal place before comparing, since a score such as `5.95` would otherwise fall in the gap between two ranges. Test your function by printing out the "grade level" for a score of 6.4.
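The range lookup can be sketched as below. The dictionary is deliberately partial (two rows from the linked table), and the helper `grade_for_score()` is hypothetical: it returns the label rather than printing it, and takes the dictionary as an argument.

```python
# Partial, illustrative table; the full table has more ranges.
sample_grades = {
    (5.0, 5.9): "5th or 6th grade",
    (6.0, 6.9): "7th or 8th grade",
}

def grade_for_score(score, grades):
    """Return the label whose (low, high) range contains the rounded score."""
    rounded = round(score, 1)  # match the one-decimal precision of the keys
    for (low, high), label in grades.items():
        if low <= rounded <= high:
            return label
    return None  # no range matched

print(grade_for_score(6.4, sample_grades))
```

Unpacking the tuple key directly in the `for` statement keeps the comparison readable.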

Calculating the readability score of a document involves considering the individual words and sentences of that document. However, splitting real-world text documents into words and sentences is non-trivial (English is _hard_!)--you need much more than the `split()` method. In order to split up real-world text documents, in this section you will be using the [Natural Language Toolkit (nltk)](http://www.nltk.org/index.html) module. This module is installed along with Anaconda, but it requires some additional data files to be installed on your computer to work properly. You should be able to do this by running the cell below (you only need to run it once):

```
from nltk import download
download('punkt')
download('wordnet')
```

Now to calculate the readability scores! Define a function **`count_sentences()`** that takes as an argument a _string_ of text (which may contain multiple sentences), and counts the number of sentences in that string. The function should **return** that count (a number). Use the [sent_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize) function from the `nltk.tokenize` module to break up a string into sentences (this is like the string `split()` function, but it splits into sentences rather than dividing by spaces).
- For help and an example of the `sent_tokenize()` function, see [this guide](http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize).
- You should *not* do any extra processing beyond that provided by the `sent_tokenize()` function at this point!
- Test your function on a simple pair or trio of sentences to make sure it's working correctly.

The next thing you'll need to do is to count the number of easy words. Define a function **`count_easy_words()`** that takes a _list_ of words as an argument and **returns** the number of words that are "easy".

- Your function should go through each word in the list, and look it up in the `easy_words_dict` you defined earlier (use the `.get()` method!). _Do not look up words in the original easy words list_ (the dictionary is much faster!). Be careful to look up lowercase versions of the word (hint: convert the word to lower case).

- Your function should also handle different inflected forms of a word (e.g., plurals, different verb conjugations, etc.). You can do this by using the **`WordNetLemmatizer()`** class from the `nltk.stem.wordnet` module, which produces a "lemmatizer" object. You can call the **`lemmatize()`** method on this object to reduce a word to its "base" form. See [this example](https://pythonprogramming.net/lemmatizing-nltk-tutorial/) for details. Note that you should reduce words to both their basic noun AND verb forms (meaning you will need to call the method twice: once with `'n'` (noun) and once with `'v'` (verb) as the second argument--and then check if _either_ form is an "easy word").

- You can test your function on the word list: `['My','words','spoken','have','consequences']`, which should have 4 of the 5 words considered easy (not "consequences").

As with sentences, splitting up natural language into a list of words is tricky because of complex punctuation, contractions, etc. Thus you should use the below `extract_words()` function to "split" the text into a list of words to consider. This will handle punctuation/etc. in a consistent (if not overly robust) way.
- This function uses the [word_tokenize()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) function from the `nltk.tokenize` module to break up the string into words. It includes each punctuation character (e.g., commas, periods) as individual "words"; the `extract_words()` function removes these from consideration so you don't need to worry about them.

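```
from nltk.tokenize import word_tokenize

def extract_words(text):
    """Return the tokens in `text` that start with a letter."""
    raw_words = word_tokenize(text)  # tokens include punctuation marks
    words = []
    for word in raw_words:
        # Keep only tokens that begin with a letter (drops punctuation and numbers)
        if word[0].isalpha():
            words.append(word)
    return words
```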

Finally, define a function **`calculate_readability_score()`** that takes in a string of text and returns a readability "score" (a number) for the text based on the [Dale-Chall readability formula](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula#Formula). Call your previous functions to calculate the number of sentences, total words, and number of difficult (not easy) words:
- Start by counting the number of sentences, then extract the words and count the number of easy ones. Follow the formula to weight these values together.
- Don't forget to adjust the score if more than 5% of the words in the text are difficult!
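The arithmetic of the formula can be sketched on its own, separately from the text processing. The helper `dale_chall_score()` below is hypothetical and takes the three counts directly; the example call uses the counts quoted for `alice.txt` later in this assignment.

```python
def dale_chall_score(num_sentences, num_words, num_difficult):
    """Combine the three counts using the Dale-Chall weights."""
    percent_difficult = 100 * num_difficult / num_words
    average_sentence_length = num_words / num_sentences
    score = 0.1579 * percent_difficult + 0.0496 * average_sentence_length
    # Adjustment applied only when more than 5% of the words are difficult.
    if percent_difficult > 5:
        score += 3.6365
    return score

print(round(dale_chall_score(977, 27198, 3610), 3))  # → 7.113
```

Keeping the arithmetic in its own function makes it easy to check against known counts before wiring in the tokenizers.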

Read in the text of the `data/alice.txt` file (the full text of Alice in Wonderland) _as a single string_. 

Calculate the readability score for the `alice.txt` file and print it out. Then use your previous function to print out the reading grade associated with that score. Use your previously-defined functions!
- For testing, my calculations show `alice.txt` has 977 sentences and 27198 words, of which 3610 are difficult. This leads to a readability score of ~7.113.

_Note that this result may not be an especially accurate model of a text's readability&mdash;after all, it's just based on a simple estimation!_