# Module 4 Self-Assessment: Word Counts
This assignment walks you through the steps required to perform an advanced frequency analysis on words in a given text source. It includes the following steps:

1. Convert a text file into a string.
1. Split a string into words, excluding punctuation marks.
1. Remove stop words from the list of words.
1. Lemmatize the words in the list so that all words are stem words.
1. Count the frequency of each stem word and store the results in a dictionary.
1. Convert the dictionary to a JSON file.

You may use any text file you wish, including files used in lessons and exercises in this course, files downloaded from a website like Project Gutenberg, or a file you create specifically for this assignment. After completing the activity, you should test it using at least one other file.

You may create this as a single script that includes all steps, or you can split the steps into individual scripts.

In [23]:
import string
import re
import nltk

nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mackachoo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mackachoo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Step 1: Convert a Text File to a String
Create a function that takes as input the path to a text file and returns the contents of the file as a string.

In [24]:
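def read_text_file(file_path):
    # Use a context manager so the file is closed once it has been read.
    with open(file_path, encoding="utf-8") as text_file:
        return text_file.read()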


text = read_text_file("../data/frankenstein.txt")
print(text)

The Project Gutenberg eBook of Frankenstein; Or, The Modern Prometheus
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Frankenstein; Or, The Modern Prometheus

Author: Mary Wollstonecraft Shelley

Release date: October 1, 1993 [eBook #84]
                Most recently updated: December 2, 2022

Language: English

Credits: Judith Boss, Christy Phillips, Lynn Hanninen and David Meltzer. HTML version by Al Haines.
        Further corrections by Menno de Leeuw.


*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***




Frankenstein;

or, the Modern Prometheu
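
Project Gutenberg files often begin with a byte-order mark (BOM), which is why the word list printed in Step 2 starts with `'\ufeffThe'`. As a minimal sketch, assuming the file is UTF-8 encoded, the `utf-8-sig` codec removes the BOM while reading. The function name `read_text_file_no_bom` and the sample file are hypothetical and used only for illustration:

```python
import os
import tempfile


def read_text_file_no_bom(file_path):
    """Return the file contents as a string, dropping a leading UTF-8 BOM."""
    # "utf-8-sig" decodes UTF-8 and silently removes a BOM if one is present.
    with open(file_path, encoding="utf-8-sig") as text_file:
        return text_file.read()


# Demonstration with a small temporary file that starts with a BOM.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as sample_file:
        sample_file.write("\ufeffThe Modern Prometheus")
    sample_text = read_text_file_no_bom(sample_path)

print(repr(sample_text))  # 'The Modern Prometheus'
```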

### Step 2: Split the String into Words
Create a function that takes as input a string and returns a list of strings representing the words in the text file.

The function should divide the string into words based on any type of punctuation.

The function should convert all words into lowercase.

In [25]:
def split_text(text):
    return re.split(f"[{string.punctuation} ]+", text)


words = split_text(text)
print(words)

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'Frankenstein', 'Or', 'The', 'Modern', 'Prometheus\n', '\nThis', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and\nmost', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions\nwhatsoever', 'You', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms\nof', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'ebook', 'or', 'online\nat', 'www', 'gutenberg', 'org', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', '\nyou', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located\nbefore', 'using', 'this', 'eBook', '\n\nTitle', 'Frankenstein', 'Or', 'The', 'Modern', 'Prometheus\n\nAuthor', 'Mary', 'Wollstonecraft', 'Shelley\n\nRelease', 'date', 'October', '1', '1993', 'eBook', '84', '\n', 'Most', 'recently', 'updated
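
The output above still contains capital letters and embedded newlines (for example `'and\nmost'`), because the pattern splits only on punctuation and the space character, and nothing lowercases the text. One possible fix, offered as a sketch rather than the required solution, lowercases the text first and splits on any run of whitespace or punctuation. The name `split_text_normalized` is hypothetical:

```python
import re
import string

# Escape the punctuation so characters like "]" and "^" are treated literally
# inside the character class; \s adds spaces, tabs, and newlines.
SEPARATORS = re.compile(rf"[\s{re.escape(string.punctuation)}]+")


def split_text_normalized(text):
    """Lowercase the text and split it on whitespace and punctuation."""
    tokens = SEPARATORS.split(text.lower())
    # re.split yields empty strings when the text starts or ends with a separator.
    return [token for token in tokens if token]


print(split_text_normalized("Frankenstein;\nor, the Modern Prometheus."))
# ['frankenstein', 'or', 'the', 'modern', 'prometheus']
```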

### Step 3: Exclude Stop Words
When searching or indexing text content (such as web pages or large documents), we typically want to exclude frequently used words like "the," "a," or "and" so that the search or analysis includes only the words that are more likely to produce meaningful results. We use the term "stop words" to refer to this collection of words.

Because this is a common task when working with text, the third-party NLTK library (imported in Python as `nltk`) includes stop word lists for a variety of languages. We can use this library to remove stop words from text we want to search or analyze.

You may need to download extra data for this library. To do this, run the following snippet in a cell by itself.
 
```Py
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
```

Create a function that takes as input a list of words and removes all stop words. The basic steps of importing the stopwords module are provided for you, but you may find it useful to do more research on stop words before completing this step.


In [30]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))


def remove_stop_words(words, stop_words):
    # return list(filter(lambda word: word not in stop_words, words))
    return [word for word in words if word not in stop_words]


words_clean = remove_stop_words(words, stop_words)
print(words_clean)

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'Frankenstein', 'Or', 'The', 'Modern', 'Prometheus\n', '\nThis', 'ebook', 'use', 'anyone', 'anywhere', 'United', 'States', 'and\nmost', 'parts', 'world', 'cost', 'almost', 'restrictions\nwhatsoever', 'You', 'may', 'copy', 'give', 'away', 'use', 'terms\nof', 'Project', 'Gutenberg', 'License', 'included', 'ebook', 'online\nat', 'www', 'gutenberg', 'org', 'If', 'located', 'United', 'States', '\nyou', 'check', 'laws', 'country', 'located\nbefore', 'using', 'eBook', '\n\nTitle', 'Frankenstein', 'Or', 'The', 'Modern', 'Prometheus\n\nAuthor', 'Mary', 'Wollstonecraft', 'Shelley\n\nRelease', 'date', 'October', '1', '1993', 'eBook', '84', '\n', 'Most', 'recently', 'updated', 'December', '2', '2022\n\nLanguage', 'English\n\nCredits', 'Judith', 'Boss', 'Christy', 'Phillips', 'Lynn', 'Hanninen', 'David', 'Meltzer', 'HTML', 'version', 'Al', 'Haines', '\n', 'Further', 'corrections', 'Menno', 'de', 'Leeuw', '\n\n\n', 'START', 'OF', 'THE', 'PROJECT', 'GUTE
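
Note that `'The'`, `'You'`, and `'If'` survive in the output above: NLTK's English stop word list is lowercase, and the membership test is case-sensitive. If the words were not lowercased in Step 2, one option is to compare a lowercased copy of each word. The sketch below uses a tiny hand-written stop word set (an assumption, standing in for `stopwords.words("english")`) so that it runs without any downloads, and `remove_stop_words_ignore_case` is a hypothetical name:

```python
# Tiny illustrative stop word set; in the assignment this would be
# set(stopwords.words("english")) from nltk.corpus.
SAMPLE_STOP_WORDS = {"the", "of", "or", "you", "if", "and"}


def remove_stop_words_ignore_case(words, stop_words):
    """Drop whitespace-only tokens and words whose lowercase form is a stop word."""
    return [
        word
        for word in words
        if word.strip() and word.strip().lower() not in stop_words
    ]


sample_words = ["The", "Project", "Gutenberg", "eBook", "of", "Frankenstein", "Or", "\n"]
print(remove_stop_words_ignore_case(sample_words, SAMPLE_STOP_WORDS))
# ['Project', 'Gutenberg', 'eBook', 'Frankenstein']
```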

### Step 4: Lemmatize the Words
We can also use the nltk module to lemmatize words in a text file. The term lemmatize refers to the process of identifying words that are inflected versions of the same stem word, so that only the stem word is included in the analysis.

For example, each of the following phrases includes an inflected form of the stem word "walk":

- I walked to the coffee shop last night.
- Helen regularly walks her dog in the evening.
- They saw the boys walking toward the house.

A strict textual analysis would count each of these as a separate word, but they are all actually different forms of the same stem word, "walk." Lemmatizing the words reduces the number of words that a process must analyze, making the process more efficient and the results more meaningful.

The following code imports WordNetLemmatizer from the nltk.stem module and creates a lemmatizer. We can then use the lemmatizer to identify the lemma (or root form) of an inflected word, as shown in the example.

```Py
# example code
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

word = "priorities"
word_lemmatized = lemmatizer.lemmatize(word)
print(word)  # original word
print(word_lemmatized)  # lemmatized word
```

Using this code as a starting point, create a function to lemmatize each word in the list of words produced in the previous step of this activity.

```Py
# use this cell to complete the activity
def lemmatize_words(words_clean):
    pass

words_lemmatized = lemmatize_words(words_clean)
print(words_lemmatized)
```
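
One way the skeleton could be completed is a list comprehension that calls the lemmatizer on each word. Note that `WordNetLemmatizer.lemmatize` treats words as nouns by default (`pos="n"`), so verb forms such as "walked" are reduced to "walk" only when `pos="v"` is passed. The sketch below is an illustration under stated assumptions, not the required solution: the `lemmatizer` and `pos` parameters are additions to the skeleton's signature, and `SuffixStub` is a made-up stand-in used only so the demonstration runs without the WordNet download.

```python
def lemmatize_words(words, lemmatizer=None, pos="n"):
    """Return the lemma of each word using the given (or a default NLTK) lemmatizer."""
    if lemmatizer is None:
        # Imported lazily; needs nltk installed and nltk.download("wordnet").
        from nltk.stem import WordNetLemmatizer

        lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]


class SuffixStub:
    """Hypothetical stand-in that only strips a plural "s"; NOT a real lemmatizer."""

    def lemmatize(self, word, pos="n"):
        return word[:-1] if word.endswith("s") and len(word) > 3 else word


print(lemmatize_words(["monsters", "creature", "letters"], lemmatizer=SuffixStub()))
# ['monster', 'creature', 'letter']
```

With NLTK and the WordNet data installed, calling `lemmatize_words(words_clean)` without a `lemmatizer` argument uses the real `WordNetLemmatizer`.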


### Step 5: Count the Words
Create a function that takes as input a list of lemmatized words and returns a dictionary that has the frequency of occurrence of each lemma.

```Py
def compute_frequency_words(words_lemmatized):
    pass

words_frequency = compute_frequency_words(words_lemmatized)
print(type(words_frequency))  # should print dict
print(words_frequency)
```
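
A minimal sketch of one way to fill in this skeleton, assuming `collections.Counter` from the standard library is acceptable for the assignment. The counter is converted to a plain `dict` so that `type(words_frequency)` reports `dict`, as the skeleton expects. The sample words are made up:

```python
from collections import Counter


def compute_frequency_words(words_lemmatized):
    """Return a dict mapping each word to the number of times it occurs."""
    # Counter tallies hashable items; dict() turns it into a plain dictionary.
    return dict(Counter(words_lemmatized))


sample_frequency = compute_frequency_words(["monster", "creature", "monster"])
print(type(sample_frequency))  # <class 'dict'>
print(sample_frequency)  # {'monster': 2, 'creature': 1}
```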


### Step 6: Export the Results to JSON
Create a function that takes as input a dictionary where the key is a word and the value is the frequency of occurrence of that word in an input text.

The function should store the dictionary in a JSON file named words_frequency.json.

```Py
def save_words_frequency(words_frequency, file_path="data/words_frequency.json"):
    pass

save_words_frequency(words_frequency, file_path="data/words_frequency.json")
```
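
One possible implementation, sketched with the standard `json` module. Creating the parent directory and the `indent` setting are assumptions added for convenience, not requirements of the assignment. The demonstration writes to a temporary directory rather than `data/`:

```python
import json
import os
import tempfile


def save_words_frequency(words_frequency, file_path="data/words_frequency.json"):
    """Write the word-frequency dictionary to file_path as JSON."""
    # Create the parent directory if needed (assumption: this is desired).
    parent_dir = os.path.dirname(file_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as json_file:
        json.dump(words_frequency, json_file, indent=2, ensure_ascii=False)


# Demonstration: save a small dictionary, then load it back to check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "out", "words_frequency.json")
    save_words_frequency({"monster": 2, "creature": 1}, file_path=demo_path)
    with open(demo_path, encoding="utf-8") as json_file:
        reloaded = json.load(json_file)

print(reloaded)  # {'monster': 2, 'creature': 1}
```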

### Step 7: Combine All Steps in a Single Program
Using the skeleton below, combine all of the previous steps into a single script that will perform the following steps:

1. Convert a text file to a string.
1. Split the string into words, excluding punctuation marks.
1. Remove stop words from the list of strings.
1. Lemmatize the words in the list so that all words are stem words.
1. Count the frequency of each stem word and store the results in a dictionary.
1. Convert the dictionary to a JSON file.

## Requirements
After completing all steps in this assignment, verify that your code meets the following requirements:

- Your name and a current date appear as a comment in the first line of code.
- The final version of the file successfully completes each of the following tasks:
    1. Convert a text file to a string.
    1. Split the string into words, excluding punctuation marks.
    1. Remove stop words from the list of strings.
    1. Lemmatize the words in the list so that all words are stem words.
    1. Count the frequency of each stem word and store the results in a dictionary.
    1. Convert the dictionary to a JSON file.
- Include appropriate exception handling for predictable errors such as missing files.
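
As a sketch of the exception handling the last requirement asks for, the reader below catches a missing file and reports it instead of crashing. Returning `None` on failure and the name `read_text_file_safe` are assumptions. Re-raising the error or exiting the script are equally reasonable choices:

```python
import os
import tempfile


def read_text_file_safe(file_path):
    """Return the file contents, or None if the file cannot be read."""
    try:
        with open(file_path, encoding="utf-8") as text_file:
            return text_file.read()
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except (PermissionError, UnicodeDecodeError) as error:
        print(f"Could not read {file_path}: {error}")
    return None


# A path inside a fresh temporary directory is guaranteed not to exist.
with tempfile.TemporaryDirectory() as tmp_dir:
    missing_result = read_text_file_safe(os.path.join(tmp_dir, "missing.txt"))

print(missing_result)  # None
```

A caller can then check for `None` before running the remaining steps.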


In [None]:
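import json
import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# stop_words is passed to remove_stop_words() below, so define it here.
stop_words = set(stopwords.words("english"))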


def read_text_file(file_path):
    pass


def split_text(text):
    pass


def remove_stop_words(words, stop_words):
    pass


def lemmatize_words(words_clean):
    pass


def compute_frequency_words(words_lemmatized):
    pass


def save_words_frequency(words_frequency, file_path="data/words_frequency.json"):
    pass


text = read_text_file("data/text.txt")
words = split_text(text)
words_clean = remove_stop_words(words, stop_words)
words_lemmatized = lemmatize_words(words_clean)
words_frequency = compute_frequency_words(words_lemmatized)
save_words_frequency(words_frequency, file_path="data/words_frequency.json")