# Lab 1

We will analyze a free, publicly available text to demonstrate basic text analysis techniques.

Project Gutenberg is a great resource for these texts.

We will use [The Land That Time Forgot](https://www.gutenberg.org/ebooks/551) and download the [TXT version](https://www.gutenberg.org/ebooks/551.txt.utf-8), which we have included in the support files for this training.

Take a look at the file and understand the spacing and formatting of the text. 

**Facilitator: give the learners between 5 and 10 minutes to open the file and review the text itself. Encourage them to consider the contents: how would they separate the narrative from the other material that is included in the file? Does the author spell out numbers, or will they need to figure out how to manage integers inside the text?**

The file begins with boilerplate that describes the document and ends with the license for using the text.



## Count the words of the text

We want to count the number of words in the text of the book. 

To do this we'll split the text on key phrases present in the text.

To separate the header from the text, we'll use the phrase
`*** START OF THE PROJECT GUTENBERG EBOOK THE LAND THAT TIME FORGOT ***`

To separate the text from the license, we'll use the phrase
`End of Project Gutenberg's The Land That Time Forgot, by Edgar Rice Burroughs`
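As a minimal sketch of this splitting idea, the example below works on a made-up miniature document. The sample string, its marker phrases, and the `split_once` helper are illustrative assumptions, not part of the lab files. It uses `str.partition`, which splits on the first occurrence only and reports whether the phrase was found, so a missing or misspelled marker is easy to detect.

```python
# Hypothetical miniature document; the real file is much longer.
sample = (
    "Title: Example\n"
    "*** START OF THE EXAMPLE ***\n"
    "Chapter 1\nIt was a dark night.\n"
    "End of the example\n"
    "License text here.\n"
)

def split_once(text, phrase):
    """Split text around the first occurrence of phrase, or raise if it is absent."""
    before, found, after = text.partition(phrase)
    if not found:
        raise ValueError(f"marker phrase not found: {phrase!r}")
    return before, after

header, remainder = split_once(sample, "*** START OF THE EXAMPLE ***")
body, license_text = split_once(remainder, "End of the example")

print(repr(header))        # 'Title: Example\n'
print(repr(body))          # '\nChapter 1\nIt was a dark night.\n'
print(repr(license_text))  # '\nLicense text here.\n'
```

The lab itself uses `str.split`, which works just as well when each phrase appears exactly once.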

Since we only want to count words, we'll need to use *regular expressions* to remove punctuation and quotation marks.
The only characters we want to keep are alphabetic characters a-z and A-Z.

Here is an overview of the steps we're going to take:

1. Import the `re` module from the Python standard library
2. Read in the file from the `./support_files/datasets/land_time_forgot.txt` path
3. Split the file into the header and the remainder of the text
4. Split the remainder into the text and the license at the end of the novel
5. Parse the lines from the text one by one and keep only alphabetic characters
6. Split each line into words and store the words in a list
7. Print the length of the list to find the number of words

Start with steps 1 and 2.

In [1]:
# load the regular expression module from the Python standard library
import re

# open the file and read it into a variable
with open('./support_files/datasets/land_time_forgot.txt') as f:
    full_lines = f.readlines()


What questions can you ask about this data?

- How many lines are there in the total file?
- How many characters?


In [18]:

# How many lines in total are there?
print(len(full_lines))

# How many characters are there?

total_count = 0
for line in full_lines:
    total_count += len(line)
    
print(total_count)        

3970
220904
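The character count can also be written more compactly with `sum` and a generator expression. This is a small sketch on a made-up three-line list (the `lines` sample is an assumption for illustration), not a replacement for the loop above.

```python
# Hypothetical stand-in for the list returned by readlines()
lines = ["Chapter 1\n", "\n", "It must have been\n"]

# Add up the length of every line, newline characters included
total_chars = sum(len(line) for line in lines)

print(len(lines))    # 3
print(total_chars)   # 29
```

Joining the lines and taking the length of the result, `len("".join(lines))`, gives the same number.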


**Facilitator: do a quick check here to ensure that everyone has the same character count. Occasionally typos can introduce errors. Since we are using a text file that is included in the course there should not be any issues with trying to fetch the text from a remote source**

## Split the file into sections

Split the file into the header, the text, and the license.

If you used *readlines* or split the text into lines for the previous question, make sure to **[join](https://www.w3schools.com/python/ref_string_join.asp)** the lines back together first.

Once the text is separated from the header and the license, we want to remove the extra blank lines from it.

We'll use **sub** for this purpose with the pattern `\n+`. The `+` matches one or more consecutive newline characters, and each run is replaced with a single `\n`.

```
re.sub(r'\n+', '\n', TEXT)
```
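To see what this substitution does, here is a short sketch on a made-up sample string (the `sample` text is an assumption for illustration). Note that a newline at the very start or end of the text still produces an empty string when you split afterwards.

```python
import re

# Hypothetical sample with runs of blank lines
sample = "Chapter 1\n\n\nIt must have been\n\na little after three\n"

# Collapse every run of one or more newlines into a single newline
collapsed = re.sub(r'\n+', '\n', sample)

print(repr(collapsed))        # 'Chapter 1\nIt must have been\na little after three\n'
print(collapsed.split("\n"))  # ['Chapter 1', 'It must have been', 'a little after three', '']
```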

In [3]:
# header key phrase
header_phrase = "*** START OF THE PROJECT GUTENBERG EBOOK THE LAND THAT TIME FORGOT ***"

# license key phrase
license_phrase = "End of Project Gutenberg's The Land That Time Forgot, by Edgar Rice Burroughs"

# join the lines
full_text = "".join(full_lines)

# separate the header from the rest of the text
header, remainder = full_text.split(header_phrase)

# separate the text from the license
text, license = remainder.split(license_phrase)

# remove the extra newlines
sub_text = re.sub(r'\n+', '\n', text)

# how many lines is the text ?
text_lines = sub_text.split("\n")
len(text_lines)


3102

**Facilitator: as before this is a good place to stop and make sure everyone has the same number of lines. The regular expression will be the most likely cause of differences.**

## Verify your work

Look at the first 9 lines of the text.  What do you see?

In [4]:
text_lines[0:9]

['',
 'Produced by Judith Boss.  HTML version by Al Haines.',
 'The Land that Time Forgot',
 'By',
 'Edgar Rice Burroughs',
 'Chapter 1',
 "It must have been a little after three o'clock in the afternoon that it",
 'happened--the afternoon of June 3rd, 1916.  It seems incredible that',
 'all that I have passed through--all those weird and terrifying']

**Eek!**

What we see is that there are some header lines still present in the text.

We want to remove the empty line, the `Produced by` line, and the title and author information.

We will keep the chapter information.

How many lines do you have to remove from the front of the list?



In [5]:
# remove the rest of the header information
text_lines = text_lines[5:]

## Clean up the text

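We want to get the words from the text. This means keeping only the alphabetic characters, so use the **sub** function again to replace every other character with a space. Replacing with a space rather than an empty string keeps neighboring words from being glued together.

```
re.sub('[^a-zA-Z]', ' ', TEXT)
```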
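The replacement string matters. The sketch below compares an empty replacement with a single space on one line taken from the book (the variable names are illustrative).

```python
import re

line = "happened--the afternoon of June 3rd, 1916."

# Empty replacement: spaces are removed too, so every word runs together
glued = re.sub(r'[^a-zA-Z]', '', line)

# Space replacement: word boundaries survive and split() can find them
spaced = re.sub(r'[^a-zA-Z]', ' ', line)

print(glued)           # happenedtheafternoonofJunerd
print(spaced.split())  # ['happened', 'the', 'afternoon', 'of', 'June', 'rd']
```

Notice that `3rd` becomes `rd` and `1916` disappears entirely, which is one reason to look at the cleaned output before trusting a word count.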

In [6]:
# define your array of words
list_of_words = []

# iterate through all of the lines
for line in text_lines:
    # keep only the alphabet characters
    fixed_line = re.sub('[^a-zA-Z]', ' ', line)
    # split the line on spaces and add the words onto our array
    list_of_words.extend(fixed_line.split())

print(len(list_of_words))
# look at the first 15 words to verify
list_of_words[0:15]

37826


['Chapter',
 'It',
 'must',
 'have',
 'been',
 'a',
 'little',
 'after',
 'three',
 'o',
 'clock',
 'in',
 'the',
 'afternoon',
 'that']

**Facilitator: stop and check. The regex in this step may cause problems. The previous step in which learners needed to remove the first five lines is sometimes skipped.**

## Additional cleaning

The first word is now `Chapter`, but its number is gone because the regular expression removed the digits.  We know that there are several of these extraneous words throughout the text.

What other words would you remove from the text?

Let's create a list of words that we know, from looking at the text, shouldn't be counted.

Then print the number of words in the book.

In [7]:
# create a list of words to skip
skip_words = ['Chapter']

# create a list that holds the updated text
updated_words = []

# go through each word in the list
for word in list_of_words:
    # if it's not in the list of skippable words
    if word not in skip_words:
        # add the word to the new list that houses the text we want to work with
        updated_words.append(word)

# print the number of words
print(f'Total number of words is {len(updated_words)}.')

Total number of words is 37816.


## Data cleanliness

When you look at the full text of the book and then the `updated_words` list, you will find discrepancies or places where you might disagree with the way the words have been split.  For example, in the first 15 words there are the three words `three` `o` `clock`.

Are these really 3 different words or would you combine `o` and `clock` into one word?

When you analyze real data, you'll need to develop hypotheses and rules for cleaning it without altering the data so much that the analysis loses its rigor.
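As one possible rule, and only a sketch, a pattern that allows an apostrophe inside a word keeps `o'clock` together. The pattern below is an assumption chosen for illustration, and it would change the word counts reported elsewhere in this lab.

```python
import re

line = "It must have been a little after three o'clock in the afternoon"

# Rule used in this lab: anything that is not a letter becomes a space
letters_only = re.sub(r"[^a-zA-Z]", " ", line).split()

# Alternative rule: letters, optionally joined by apostrophes inside the word
with_apostrophes = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", line)

print(len(letters_only))      # 13
print(len(with_apostrophes))  # 12
print(with_apostrophes[8])    # o'clock
```

Whichever rule you pick, write it down and apply it consistently so that your counts are reproducible.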



**Facilitator: stop and check. Ensure that all students have the correct number of words.**

# Word count

What is the word that appears most frequently in the text?

1. Create a dictionary to store the word as the key and the count as the value
2. Go through all of the words
3. If the word exists in the dictionary, increment the count
4. If the word does not exist, set the count to 1

We'll then look up the count for a specific word.  See how many times "It" appears in the text.

Use `get` to find the value.

5. Find the most frequent word by calling `max` on the dictionary with `get` as the key function, so that words are compared by their counts
6. Get the count from the dictionary for the max word
7. Print the word and count

**Facilitator: pay special attention to this section. Counting things is an unexpectedly common operation and the learners will derive a lot of value from knowing how to use a dictionary in this manner.**

In [8]:
# create a dictionary
count_of_words = {}

# go through all the words
for word in updated_words:
    # check if the word is NOT in the dictionary
    if count_of_words.get(word) is None:
        # set the word in the dictionary to be 1
        count_of_words[word] = 1
    # else
    else:
        # increment the count of the word
        count_of_words[word] += 1

# get the value for "It"
print(f'The word It appears {count_of_words.get("It")} times')

# use max with a key function that gets the dictionary value
max_key = max(count_of_words, key=count_of_words.get)

max_count = count_of_words[max_key]
print('The word that appears most frequently is "{0:s}" ({1:5,d} times)'.format(
    max_key,max_count))

The word It appears 120 times
The word that appears most frequently is "the" (2,301 times)
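Once the dictionary pattern is familiar, the standard library's `collections.Counter` offers the same counting in fewer lines. The sketch below uses a made-up word list for illustration.

```python
from collections import Counter

# Hypothetical stand-in for updated_words
words = ["the", "sea", "the", "It", "ship", "the", "It"]

# Counter builds the word -> count mapping in one step
counts = Counter(words)

print(counts["It"])           # 2
print(counts["submarine"])    # 0 (missing words count as zero)
print(counts.most_common(1))  # [('the', 3)]
```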


## Which words appear the least frequently?

Use the built-in `min` function to find the lowest number of times a word appears.

1. Get the minimum number
2. Go through the dictionary
3. Check if the count of the key matches the minimum number
4. Add the word to the list if it does
5. Print the results


In [9]:
# Get the minimum frequency (hint: use values)
min_freq = min(count_of_words.values())

# create a list to store the words
min_freq_words = []

# Go through all of the word counts (hint: use 'items')
for word, count in count_of_words.items():
    # check if the value matches the frequency
    if count == min_freq:
        # add it to the list if it does
        min_freq_words.append(word)

# Print a sentence of the minimum frequency and the number of words with that frequency
print('The minimum number of times a word appears in the book is {}. There are {} words with that frequency.'
      .format(min_freq, len(min_freq_words)))

The minimum number of times a word appears in the book is 1. There are 2494 words with that frequency.
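The loop that collects the least frequent words can also be written as a list comprehension. This is a sketch on a small made-up dictionary, with illustrative names.

```python
# Hypothetical stand-in for count_of_words
counts = {"the": 3, "sea": 1, "ship": 1, "It": 2}

# Lowest count in the dictionary
lowest = min(counts.values())

# Keep every word whose count equals the lowest count
rare_words = [word for word, count in counts.items() if count == lowest]

print(lowest)      # 1
print(rare_words)  # ['sea', 'ship']
```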


**Facilitator: stop and check. Does everyone have the same max and min values? The good news is these are straightforward operations, so the most common issue will be indentation or typos, but the error handler should help the learner in those instances.**

## Stopwords

The word that appears most often is 'the', which is considered a stopword in text analysis.  Stopwords say little about what a text is about, so we usually remove them before analysis.

1. Go through the list of stopwords
2. Remove them from the dictionary of word counts
3. Find the new word that occurs most frequently



In [10]:
# stopwords
stopwords = ["a", "A", "an", "and", "are", "as", "at", "be", "but", "for", "from", 
             "had", "have", "he", "her", "his", "if", "in", "it", "I", "into", "me", 
             "my", "of", "on", "not", "she", "that", "the", "to", "was", "which", 
             "we", "were", "with", "upon", "us"]

# go through the list
for word in stopwords:
    # if the stopword is in the dictionary of word counts
    if word in count_of_words:
        # remove the word from the dictionary
        del count_of_words[word]

# get the max key of the word counts again
max_key = max(count_of_words, key=count_of_words.get)

# get the count of the key
max_count = count_of_words[max_key]
# print the key and value
print('The word that appears most frequently is "{0:s}" ({1:5,d} times)'.format(max_key,max_count))

The word that appears most frequently is "all" (  158 times)
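The stopword list above is compared case-sensitively, so capitalized forms such as `The` or `It` stay in the counts. One option, sketched here on a made-up word list, is to lowercase every word before filtering. This is an assumption about how you might want to treat case, and it would change the counts shown in this lab.

```python
from collections import Counter

# Hypothetical shortened stopword set and word list
stopwords = {"a", "an", "and", "it", "the", "of"}
words = ["The", "sea", "and", "the", "ship", "It", "sea"]

# Lowercase each word, then drop it if it is a stopword
kept = [w.lower() for w in words if w.lower() not in stopwords]

counts = Counter(kept)
print(kept)                   # ['sea', 'ship', 'sea']
print(counts.most_common(1))  # [('sea', 2)]
```

Using a `set` for the stopwords also makes each membership check fast, which matters on longer texts.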


## Number of unique words

How many unique words are there in the book?

Use a `set` to figure this out.

In [11]:
# create a set from the words
words_set = set(updated_words)

# print the number of words
print('This book has {} total words and {} unique words.'.format(len(updated_words), len(words_set)))

# Print a sentence of the percentage of words that are unique in the book (hint: use :.1f in your format)
num_unique_words = len(words_set)
num_words = len(updated_words)
percentage = (num_unique_words / num_words) * 100
print('The percentage of unique words in the book: {:.1f}'.format(percentage))

This book has 37816 total words and 4962 unique words.
The percentage of unique words in the book: 13.1


**Facilitator: This is another good place to stop. The stopwords list will ideally be copy/pasted but that's no guarantee there won't be a mismatch with the results. Are there other words that should be removed from consideration?**

# Header information

Let's examine the header data and identify information using regular expressions.

There is a set of lines in the header that have the `Title`, `Author`, `Release date`, and `Language`.

Use [search()](https://docs.python.org/3/library/re.html#re.search) to find the data for these fields.

Use non-greedy matching so that each match stops at the end of its own line. See [greedy and non-greedy matching](https://www.pythoncheatsheet.org/cheatsheet/regular-expressions#greedy-and-non-greedy-matching).

Make sure the date is just the month, day, and year.
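To make the difference between greedy and non-greedy matching visible, the sketch below runs both on a shortened copy of the header. It passes `re.DOTALL` so that `.` also matches newlines, which is an assumption made for the demonstration. Without that flag `.` stops at the end of the line anyway, and the two patterns return the same result here.

```python
import re

# Shortened copy of the header fields
sample_header = (
    "Title: The Land That Time Forgot\n\n"
    "Author: Edgar Rice Burroughs\n\n"
    "Language: English\n"
)

# Greedy: .* grabs as much as possible, up to the LAST newline
greedy = re.search(r"Title: (.*)\n", sample_header, re.DOTALL)

# Non-greedy: .*? grabs as little as possible, up to the FIRST newline
lazy = re.search(r"Title: (.*?)\n", sample_header, re.DOTALL)

print(repr(greedy.group(1)))
# 'The Land That Time Forgot\n\nAuthor: Edgar Rice Burroughs\n\nLanguage: English'
print(repr(lazy.group(1)))
# 'The Land That Time Forgot'
```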

In [12]:
header

'\ufeffThe Project Gutenberg eBook of The Land That Time Forgot\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Land That Time Forgot\n\nAuthor: Edgar Rice Burroughs\n\nRelease date: June 1, 1996 [eBook #551]\n                Most recently updated: January 1, 2021\n\nLanguage: English\n\nCredits: Produced by Judith Boss.  HTML version by Al Haines.\n\n\n'

**Facilitator: if the class has been managing this lab with relative ease, it would be a good challenge to see if the learners can come up with the non-greedy matching script on their own.**

In [13]:
import re

# compile a pattern to match Title (hint: non greedy matching)
title_pattern = re.compile("Title: (.*?)\\n")
# use the pattern to search
title_matches = re.search(title_pattern, header)
# get the first group
title_matches.group(1)

'The Land That Time Forgot'

In [14]:
# compile a pattern to match Author (hint: non greedy matching)
author_pattern = re.compile("Author: (.*?)\\n")
# use the pattern to search
author_matches = re.search(author_pattern, header)
# get the first group
author_matches.group(1)

'Edgar Rice Burroughs'

In [15]:
# compile a pattern to match Release date (hint: non greedy matching)
rd_pattern = re.compile(r"Release date: (.*?) \[e")
# use the pattern to search
rd_matches = re.search(rd_pattern, header)
# get the first group
rd_matches.group(1)

'June 1, 1996'

In [16]:
# compile a pattern to match Language (hint: non greedy matching)
language_pattern = re.compile("Language: (.*?)\\n")
# use the pattern to search
language_matches = re.search(language_pattern, header)
# get the first group
language_matches.group(1)

'English'
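The four searches above follow the same shape, so they could be folded into one helper. The `get_field` function below is a hypothetical sketch, not part of the lab's solution. It anchors the pattern to a single line with `re.MULTILINE` and uses `re.escape` so that the field name is matched literally.

```python
import re

def get_field(header_text, field):
    """Return the text after 'field: ' on its line, or None if the field is absent."""
    pattern = re.compile(rf"^{re.escape(field)}: (.*)$", re.MULTILINE)
    match = pattern.search(header_text)
    return match.group(1).strip() if match else None

# Shortened copy of the header fields
sample_header = (
    "Title: The Land That Time Forgot\n\n"
    "Author: Edgar Rice Burroughs\n\n"
    "Release date: June 1, 1996 [eBook #551]\n\n"
    "Language: English\n"
)

print(get_field(sample_header, "Title"))         # The Land That Time Forgot
print(get_field(sample_header, "Release date"))  # June 1, 1996 [eBook #551]
print(get_field(sample_header, "Publisher"))     # None
```

The release date still carries the `[eBook #551]` suffix here, so trimming it to month, day, and year needs the more specific pattern used above.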

# Reusable code

Now that we've finished our book analysis, let's restructure our code into functions.

This will help us reuse the same code in the future if we do a similar analysis.

The functions we'll create are:
1. Read the file and return the lines
2. Take the lines and return the text
3. Take the text and return a clean list of words
4. Take the clean list of words and calculate a count
5. Take the count and return the max word
6. Take the count and return the min words
   

In [17]:
# header key phrase
header_phrase = "*** START OF THE PROJECT GUTENBERG EBOOK THE LAND THAT TIME FORGOT ***"

# license key phrase
license_phrase = "End of Project Gutenberg's The Land That Time Forgot, by Edgar Rice Burroughs"

# read a file and return the lines
def reader(fname):
    with open(fname) as file:
        lines = file.readlines()
        return lines

# take the lines and return the text
def lines_to_text(list_of_lines):
    full_text = "".join(list_of_lines)
    header, remainder = full_text.split(header_phrase)
    text, license = remainder.split(license_phrase)
    sub_text = re.sub(r'\n+', '\n', text)
    text_lines = sub_text.split("\n")
    return text_lines

# take the lines and return words
def lines_to_words(list_of_lines):
    word_list = []
    for line in list_of_lines:
        word_list.extend(line.split(" "))
    # remove the rest of the header information
    text_lines = word_list[5:]
    return word_list

# take the word list and keep only alphabetic characters
def clean_words(word_list):
    list_of_words = []
    for line in word_list:
        fixed_line = re.sub('[^a-zA-Z]', ' ', line)
        list_of_words.extend(fixed_line.split())
    return list_of_words

# take the word list and return counts
def list_to_count(word_list):
    word_counts = {}
    for word in word_list:
        if word in word_counts:
            word_counts[word] = word_counts[word] + 1
        else:
            word_counts[word] = 1
    return word_counts

# take the word counts and return the minimum frequency and the words that have it
def get_min_word(word_counts):
    min_word_freq = min(word_counts.values())
    min_words = []
    for word, fr in word_counts.items():
        if fr == min_word_freq:
            min_words.append(word)

    return min_word_freq, min_words

# run the functions
if __name__ == "__main__":
    # call the function to read the file and return lines
    my_lines = reader("./support_files/datasets/land_time_forgot.txt")
    # call the function that converts lines to text
    text_list = lines_to_text(my_lines)
    # call the function that converts lines to words
    words = lines_to_words(text_list)
    # call the function that cleans the words
    my_words = clean_words(words)

    # create a set of the words and output the values
    word_set = set(my_words)
    print("There are {} words in the book and {} of them are unique".format(
        len(my_words), len(word_set)))

    # call the function that calculates a count from words
    my_freq = list_to_count(my_words)

    # use max with a key to get the word with the maximum value
    max_word = max(my_freq, key=my_freq.get)
    # print the most frequent word and value
    print("Most frequent word is '{}' with frequency {}".format(
        max_word, my_freq[max_word]))

    # call the function to get the minimum words
    mw_freq, mw_list = get_min_word(my_freq)
    # print the minimum words
    print("The lowest word frequency is {} and there are {} words in the book with that frequency".format(
        mw_freq, len(mw_list)))


There are 37844 words in the book and 4975 of them are unique
Most frequent word is 'the' with frequency 2301
The lowest word frequency is 1 and there are 2506 words in the book with that frequency


**Facilitator: This is the section where learners are most likely to end up with different results. It is a rather long block of code. Give them time to work through their errors, and be prepared to do a fair amount of proofreading.**

# Conclusion

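You'll notice that the numbers output by the script version and the function version are different: 37816 words versus 37844.

This happens often in analysis.  In this particular case there are two causes. The function version never removes `Chapter` from the word list (10 words), and `lines_to_words` assigns the slice `word_list[5:]` to a variable that it never returns, so the leftover header lines (18 words) are still counted.

This is why you'll want to write functions that can be tested, so that you can confirm your code behaves the way you expect.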
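As a minimal sketch of what such a test could look like, the example below checks a small counting function against an input whose answer is known in advance. The `count_words` and `test_count_words` names are hypothetical and are not the functions defined earlier in this lab.

```python
def count_words(words):
    """Return a dict mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def test_count_words():
    # A tiny input where the right answer is obvious by inspection
    counts = count_words(["the", "sea", "the"])
    assert counts == {"the": 2, "sea": 1}
    # An empty list should give an empty dictionary
    assert count_words([]) == {}

test_count_words()
print("all checks passed")
```

A similar check on a few hand-counted lines could reveal that header lines or `Chapter` headings are still being counted.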