# Module 2 Self-Assessment: Text Count Analysis

This exercise asks you to combine the Python skills you have learned into a single application. We will start with a paragraph of text and end up with a frequency count of the distinct words in that text.

This assessment includes the following steps:

- Convert a string to lowercase characters.
- Split the lowercase string into individual words.
- Count the number of words in the lowercase string.
- Determine the number of distinct words in the lowercase string.
- Calculate the number of times each word appears in the lowercase string.
- Remove the punctuation from the lowercase string.
- Perform a count analysis on the text without punctuation characters.

Complete each step individually, moving to the next step only after the current step works as expected. Because later steps depend on variables created in earlier steps, you may need to run all steps up to the current step before running the code in the current step. Jupyter Notebook includes a Run All Above command in the Cell menu that will allow you to run all previous steps, making the earlier variables available to the current cell.

### Step 1: Normalize the Letter Casing
Because string comparisons in Python are case-sensitive, start by normalizing the text so that all characters are in the same case.
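
As a brief illustration of why this matters, the sketch below compares two spellings of the same word. The variable names are ours and are not part of the assessment.

```python
# String comparison is case-sensitive, so these two spellings are different values.
title_case = "Lines"
lower_case = "lines"
print(title_case == lower_case)  # False

# Lowercasing both sides first makes the comparison succeed.
print(title_case.lower() == lower_case.lower())  # True
```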

Here, we start with a string of text taken from E. Abbott's 1884 novel, Flatland. Convert the existing text to lowercase characters and store the result in a variable named s_lower.

The output should look like this:

```text
imagine a vast sheet of paper on which straight lines, triangles, squares, pentagons, hexagons, and other figures, instead of remaining fixed in their places, move freely about, on or in the surface, but without the power of rising above or sinking below it, very much like shadows - only hard and with luminous edges - and you will then have a pretty correct notion of my country and countrymen. alas, a few years ago, i should have said "my universe": but now my mind has been opened to higher views of things.
```

Here's the code to start with:

In [4]:
s = """Imagine a vast sheet of paper on which straight Lines, Triangles, Squares, Pentagons, Hexagons, and other figures, instead of remaining fixed in their places, move freely about, on or in the surface, but without the power of rising above or sinking below it, very much like shadows - only hard and with luminous edges - and you will then have a pretty correct notion of my country and countrymen. Alas, a few years ago, I should have said "my universe": but now my mind has been opened to higher views of things."""

s_lower = s.lower()
print(f"Flatland in lowercase: {s_lower}")

Flatland in lowercase: imagine a vast sheet of paper on which straight lines, triangles, squares, pentagons, hexagons, and other figures, instead of remaining fixed in their places, move freely about, on or in the surface, but without the power of rising above or sinking below it, very much like shadows - only hard and with luminous edges - and you will then have a pretty correct notion of my country and countrymen. alas, a few years ago, i should have said "my universe": but now my mind has been opened to higher views of things.


### Step 2: Split the String into Words
Using the str.split() method, divide the s_lower string into separate words. The output should be a list of strings.

Store the results in a list called words.
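
To show how the separator argument changes the result, here is a small sketch on a sample string of our own (not the assessment text). It assumes only the standard behavior of str.split().

```python
sample = "move  freely about,\non or in the surface"

# With no argument, split() treats any run of whitespace (spaces, tabs,
# newlines) as a single separator and never returns empty strings.
tokens = sample.split()
print(tokens)
# ['move', 'freely', 'about,', 'on', 'or', 'in', 'the', 'surface']

# With an explicit " " separator, consecutive spaces produce empty strings
# and the newline is not treated as a separator.
tokens_space = sample.split(" ")
print(tokens_space)
# ['move', '', 'freely', 'about,\non', 'or', 'in', 'the', 'surface']
```

For text that may contain line breaks or doubled spaces, calling split() with no argument is usually the safer choice.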

Printing the list should look like this:

```Python
['imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines,', 'triangles,', 'squares,', 'pentagons,', 'hexagons,', 'and', 'other', 'figures,', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places,', 'move', 'freely', 'about,', 'on', 'or', 'in', 'the', 'surface,', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it,', 'very', 'much', 'like', 'shadows', '-', 'only', 'hard', 'and', 'with', 'luminous', 'edges', '-', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen.', 'alas,', 'a', 'few', 'years', 'ago,', 'i', 'should', 'have', 'said', '"my', 'universe":', 'but', 'now', 'my', 'mind', 'has', 'been', 'opened', 'to', 'higher', 'views', 'of', 'things.']
```

Here's the code to start with:

In [5]:
words = list()
words = s_lower.split()

print(f"Split words: {words}")

Split words: ['imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines,', 'triangles,', 'squares,', 'pentagons,', 'hexagons,', 'and', 'other', 'figures,', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places,', 'move', 'freely', 'about,', 'on', 'or', 'in', 'the', 'surface,', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it,', 'very', 'much', 'like', 'shadows', '-', 'only', 'hard', 'and', 'with', 'luminous', 'edges', '-', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen.', 'alas,', 'a', 'few', 'years', 'ago,', 'i', 'should', 'have', 'said', '"my', 'universe":', 'but', 'now', 'my', 'mind', 'has', 'been', 'opened', 'to', 'higher', 'views', 'of', 'things.']


### Step 3: Count the Words
Use the built-in len function to return the number of words in the list.

At this point, for the initial string, the count should be 92; this includes duplicate instances of words.
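
One thing to watch for: len counts whatever items its argument contains, so it must be applied to the list of words rather than to the string. The short sketch below uses a sample phrase of our own to show the difference.

```python
phrase = "a few years ago"

# Applied to a string, len counts characters (including spaces).
char_count = len(phrase)
print(char_count)  # 15

# Applied to the list returned by split(), len counts words.
word_count = len(phrase.split())
print(word_count)  # 4
```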

In [6]:
print(f"Word count: {len(words)}")

Word count: 92


### Step 4: Count the Distinct Words
Use a set to compute the number of distinct words in the string, meaning that only one instance of each word is counted and duplicates are ignored.

At this point, for the initial string, the count of distinct words should be 75.
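
As a small illustration of how a set discards duplicates, consider the sketch below. The sample list is ours and is not part of the assessment data.

```python
sample_words = ["of", "my", "country", "and", "of", "my"]

# Converting the list to a set keeps one copy of each distinct word.
distinct = set(sample_words)

print(len(sample_words))  # 6 (duplicates included)
print(len(distinct))      # 4 (duplicates ignored)

# Sets are unordered, so sort the result when a stable display order is wanted.
print(sorted(distinct))   # ['and', 'country', 'my', 'of']
```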

In [7]:
wordset = set(words)

print(f"Distinct word count: {len(wordset)}")

Distinct word count: 75


### Step 5: Compute the Word Frequencies
The next step is to compute the frequency of the distinct words in our string. In this step, we look at the list of words (including duplicates) and count each instance of each word. The result will be a dictionary where the keys are the distinct words in the list and the values are the number of times each word appears in the list.

In [13]:
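def calcWordFreq(words):
    """Return a dictionary mapping each word in words to its number of occurrences."""
    result = {}
    for word in words:
        if word in result:
            # Seen before: increment the existing count.
            result[word] += 1
        else:
            # First occurrence: add the word as a new key.
            result[word] = 1
    return result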

For practice, start with the list saved as `w` in the example code below. Some of the words in this list appear more than once, and we need a way to identify distinct words and count their frequency.

To this end, the script should perform the following steps:

1. Create an empty dictionary.
    - We use a dictionary here so that we can create key-value pairs where the key is the word and the value represents the number of times that word appears in the original list.
2. Create a loop that does the following:
    - Look at each word in the original list and compare that word to existing keys in the dictionary.
    - If the word does not match a key in the dictionary, add it to the dictionary as a new key with the value 1.
    - If the word already exists as a key in the dictionary, the value for that key increments by 1 to count the number of occurrences of the word.
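
The loop described in these steps can also be written more compactly. The sketch below is one possible variant, not the required solution: the function name count_words and the sample list are hypothetical, and collections.Counter from the standard library is shown only as a way to cross-check the result.

```python
from collections import Counter


def count_words(tokens):
    """Return a dict mapping each distinct token to its number of occurrences."""
    counts = {}
    for token in tokens:
        # get() returns 0 for a token not seen yet, so one line covers
        # both the "new key" and the "existing key" cases.
        counts[token] = counts.get(token, 0) + 1
    return counts


sample = ["or", "in", "the", "surface", "or", "the", "the"]
manual = count_words(sample)
print(manual)  # {'or': 2, 'in': 1, 'the': 3, 'surface': 1}

# Counter is a dict subclass that performs the same tally.
print(dict(Counter(sample)) == manual)  # True
```

Writing the loop by hand first is still worthwhile, since it makes the key-value logic explicit.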

The output for the practice list `w` should look like this:

```Python
{'haythem': 2, 'is': 1, 'eating': 1, 'tacos.': 1, 'loves': 1, 'tacos': 1, '': 1, ':': 1}
```

In the results, you will notice that "tacos." is distinct from "tacos" because of the period. Also, both the colon item and the empty item appear in the results. We will ignore punctuation for now and fix those problems in the next step.

Here's the code to start with:

In [14]:
w = ["haythem", "is", "eating", "tacos.", "haythem", "loves", "tacos", "", ":"]

freq_occur = dict()
freq_occur = calcWordFreq(w)

print(freq_occur)

{'haythem': 2, 'is': 1, 'eating': 1, 'tacos.': 1, 'loves': 1, 'tacos': 1, '': 1, ':': 1}


Once you get the example above working, use the pattern to compute the frequency count for each word in the original string s_lower.

The dictionary should look like this when finished:

```Python
{'imagine': 1, 'a': 3, 'vast': 1, 'sheet': 1, 'of': 5, 'paper': 1, 'on': 2, 'which': 1, 'straight': 1, 'lines,': 1, 'triangles,': 1, 'squares,': 1, 'pentagons,': 1, 'hexagons,': 1, 'and': 4, 'other': 1, 'figures,': 1, 'instead': 1, 'remaining': 1, 'fixed': 1, 'in': 2, 'their': 1, 'places,': 1, 'move': 1, 'freely': 1, 'about,': 1, 'or': 2, 'the': 2, 'surface,': 1, 'but': 2, 'without': 1, 'power': 1, 'rising': 1, 'above': 1, 'sinking': 1, 'below': 1, 'it,': 1, 'very': 1, 'much': 1, 'like': 1, 'shadows': 1, '-': 2, 'only': 1, 'hard': 1, 'with': 1, 'luminous': 1, 'edges': 1, 'you': 1, 'will': 1, 'then': 1, 'have': 2, 'pretty': 1, 'correct': 1, 'notion': 1, 'my': 2, 'country': 1, 'countrymen.': 1, 'alas,': 1, 'few': 1, 'years': 1, 'ago,': 1, 'i': 1, 'should': 1, 'said': 1, '"my': 1, 'universe":': 1, 'now': 1, 'mind': 1, 'has': 1, 'been': 1, 'opened': 1, 'to': 1, 'higher': 1, 'views': 1, 'things.': 1}
```

Here's the starting code for this step:

In [17]:
word_freq = dict()
word_freq = calcWordFreq(words)

print(word_freq)

{'imagine': 1, 'a': 3, 'vast': 1, 'sheet': 1, 'of': 5, 'paper': 1, 'on': 2, 'which': 1, 'straight': 1, 'lines,': 1, 'triangles,': 1, 'squares,': 1, 'pentagons,': 1, 'hexagons,': 1, 'and': 4, 'other': 1, 'figures,': 1, 'instead': 1, 'remaining': 1, 'fixed': 1, 'in': 2, 'their': 1, 'places,': 1, 'move': 1, 'freely': 1, 'about,': 1, 'or': 2, 'the': 2, 'surface,': 1, 'but': 2, 'without': 1, 'power': 1, 'rising': 1, 'above': 1, 'sinking': 1, 'below': 1, 'it,': 1, 'very': 1, 'much': 1, 'like': 1, 'shadows': 1, '-': 2, 'only': 1, 'hard': 1, 'with': 1, 'luminous': 1, 'edges': 1, 'you': 1, 'will': 1, 'then': 1, 'have': 2, 'pretty': 1, 'correct': 1, 'notion': 1, 'my': 2, 'country': 1, 'countrymen.': 1, 'alas,': 1, 'few': 1, 'years': 1, 'ago,': 1, 'i': 1, 'should': 1, 'said': 1, '"my': 1, 'universe":': 1, 'now': 1, 'mind': 1, 'has': 1, 'been': 1, 'opened': 1, 'to': 1, 'higher': 1, 'views': 1, 'things.': 1}


### Step 6: Remove Punctuation Marks
To fix the problems caused by punctuation, we need to remove the punctuation marks from the items in the list.

In [31]:
import string

print(list(string.punctuation))


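def removePunc(words):
    """Strip leading and trailing punctuation from each word and drop empty results."""
    # strip() touches only the ends of a word, so can't and well-known stay intact.
    stripped = [word.strip(string.punctuation) for word in words]
    # Items that were empty or contained only punctuation are now "", so filter them out.
    return [word for word in stripped if word != ""]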

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


Again, as a practice activity, we will start with a slightly modified version of the list `w` from the previous exercise, with extra punctuation added to some items. The existing code performs the following steps:

- Import the Python string module, which defines string.punctuation, a string containing the ASCII punctuation characters.
- Generate a list of punctuation characters from string.punctuation and print the list to verify that it includes the characters we want to exclude from our results.
- Define the list we want to start with.
- Create a new, empty w_clean list that will hold the words after we have removed punctuation from each item in w.

Complete this script so that it looks at each item in `w` and performs the following tasks:

- If the item is empty, ignore it.
- If the item includes only a punctuation mark, ignore it.
- If the item includes a punctuation mark as the first character or as the last character, remove that character from the word and add the cleaned word to w_clean.
- If the item does not include punctuation, add it to the list w_clean as a separate token.
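
One way to handle these cases is str.strip() with a set of characters to remove. The sketch below shows how it treats each kind of item; the sample tokens are ours, chosen to cover the cases in the list above.

```python
import string

samples = ["tacos.", ".haythem", "can't", "well-known", '"quoted":', ":", ""]

# strip(chars) removes every leading and trailing character found in chars
# and stops at the first character that is not in chars, so punctuation
# inside a word is left alone.
stripped = [token.strip(string.punctuation) for token in samples]
print(stripped)
# ['tacos', 'haythem', "can't", 'well-known', 'quoted', '', '']

# Items that were empty or only punctuation end up as empty strings,
# which are falsy and can be filtered out.
cleaned = [token for token in stripped if token]
print(cleaned)
# ['tacos', 'haythem', "can't", 'well-known', 'quoted']
```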

The final version of w_clean should include seven items:

```Python
['haythem', 'is', 'eating', 'tacos', 'haythem', 'loves', 'tacos']
```

Here's the starting code for this short example:

In [32]:
w = ["haythem!", "is", "eating", "tacos.", ".haythem", "loves", "tacos", "", ":"]

w_clean = list()
w_clean = removePunc(w)

print(w_clean)
print(len(w_clean))

['haythem', 'is', 'eating', 'tacos', 'haythem', 'loves', 'tacos']
7


Use the same pattern to remove the punctuation from the initial s string and count the number of times each word appears in that string.

The output should include each distinct word, regardless of case or punctuation, along with the number of times each word appears. Also display the number of distinct words in the string.

The cleaned word list for the original string, followed by the number of words it contains, should look like this:
```Python
['imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines', 'triangles', 'squares', 'pentagons', 'hexagons', 'and', 'other', 'figures', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places', 'move', 'freely', 'about', 'on', 'or', 'in', 'the', 'surface', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it', 'very', 'much', 'like', 'shadows', 'only', 'hard', 'and', 'with', 'luminous', 'edges', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen', 'alas', 'a', 'few', 'years', 'ago', 'i', 'should', 'have', 'said', 'my', 'universe"', 'but', 'now', 'my', 'mind', 'has', 'been', 'opened', 'to', 'higher', 'views', 'of', 'things']
90
```

Here's your starting code for this step:

In [34]:
print(words)
w_clean = list()
w_clean = removePunc(words)

print(w_clean)
print(len(w_clean))

['imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines,', 'triangles,', 'squares,', 'pentagons,', 'hexagons,', 'and', 'other', 'figures,', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places,', 'move', 'freely', 'about,', 'on', 'or', 'in', 'the', 'surface,', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it,', 'very', 'much', 'like', 'shadows', '-', 'only', 'hard', 'and', 'with', 'luminous', 'edges', '-', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen.', 'alas,', 'a', 'few', 'years', 'ago,', 'i', 'should', 'have', 'said', '"my', 'universe":', 'but', 'now', 'my', 'mind', 'has', 'been', 'opened', 'to', 'higher', 'views', 'of', 'things.']
['imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines', 'triangles', 'squares', 'pentagons', 'hexagons', 'and', 'other', 'figures', 'instead', 'of', 'remaining', 'fixed', 'in', 'the

>Challenge
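>Note that in the expected output above, universe" still includes a quotation mark. (A solution that strips every leading and trailing punctuation character, such as one built on str.strip, returns universe instead.) Why does the quotation mark remain in the expected output? How can you fix it? The solution you come up with for this challenge should not remove punctuation marks inside words, like can't or well-known.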


### Step 7: Put It All Together
Finally, create a single script that performs all of the following operations on the original s string.

1. Convert the string to lowercase characters.
2. Split the lowercase string into individual words.
3. Remove the punctuation from the lowercase words. Assume that all punctuation is either the first character or the last character of each item in the list.
4. Perform a count analysis on the words without punctuation characters.
5. Display the dictionary with the word counts and the number of distinct words in the original string.
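
The five operations above could be sketched as a single function that returns its results instead of printing them, which makes the script easier to test. This is only one possible arrangement: the name analyze_text and the sample sentence are hypothetical, and collections.Counter stands in for a hand-written counting loop.

```python
import string
from collections import Counter


def analyze_text(text):
    """Return (word_counts, distinct_count) for text, following the steps above."""
    tokens = text.lower().split()                             # steps 1 and 2
    stripped = [t.strip(string.punctuation) for t in tokens]  # step 3
    cleaned = [t for t in stripped if t]                      # drop punctuation-only items
    counts = dict(Counter(cleaned))                           # step 4
    return counts, len(counts)


# Step 5: display the dictionary and the number of distinct words.
counts, distinct = analyze_text("Alas, a few years ago - a FEW!")
print(counts)    # {'alas': 1, 'a': 2, 'few': 2, 'years': 1, 'ago': 1}
print(distinct)  # 5
```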


In [36]:
def allTogether(s):
    new_s = removePunc(s.lower().split())
    print(f"There are {len(new_s)} words")
    print(calcWordFreq(new_s))
    print(f"with {len(set(new_s))} distinct words.")


allTogether(s)

There are 90 words
{'imagine': 1, 'a': 3, 'vast': 1, 'sheet': 1, 'of': 5, 'paper': 1, 'on': 2, 'which': 1, 'straight': 1, 'lines': 1, 'triangles': 1, 'squares': 1, 'pentagons': 1, 'hexagons': 1, 'and': 4, 'other': 1, 'figures': 1, 'instead': 1, 'remaining': 1, 'fixed': 1, 'in': 2, 'their': 1, 'places': 1, 'move': 1, 'freely': 1, 'about': 1, 'or': 2, 'the': 2, 'surface': 1, 'but': 2, 'without': 1, 'power': 1, 'rising': 1, 'above': 1, 'sinking': 1, 'below': 1, 'it': 1, 'very': 1, 'much': 1, 'like': 1, 'shadows': 1, 'only': 1, 'hard': 1, 'with': 1, 'luminous': 1, 'edges': 1, 'you': 1, 'will': 1, 'then': 1, 'have': 2, 'pretty': 1, 'correct': 1, 'notion': 1, 'my': 3, 'country': 1, 'countrymen': 1, 'alas': 1, 'few': 1, 'years': 1, 'ago': 1, 'i': 1, 'should': 1, 'said': 1, 'universe': 1, 'now': 1, 'mind': 1, 'has': 1, 'been': 1, 'opened': 1, 'to': 1, 'higher': 1, 'views': 1, 'things': 1}
with 73 distinct words.


### Check Yourself
After you have completed the last step, verify that your code works correctly without any errors. Use the following checklist to verify the results:

- Use print to display the list that contains the individual words from the original string.
  - Verify that all letters are lowercase.
  - Verify that none of the words include a punctuation mark as the first or last character.
- Convert the list to a set and use len to count the words in the set.
- Use len to count the number of items in the dictionary used to calculate frequency. The result should be the same as the number of words in the set.
- Choose a few words in the original string and manually count the number of times each of those words appears in the string.
- Use print to display the dictionary.
  - Verify that your count of the selected words matches the results shown in the dictionary.
  - Verify that the keys in the dictionary are all lowercase and do not include punctuation as the first or last character.
- You should be able to run the same script using any string of words. Substitute your own string for the original string and run the script again.
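
Several items on this checklist can also be automated. The sketch below is a hypothetical helper (check_results is our name, not part of the assessment) that takes the cleaned word list and the frequency dictionary and raises an AssertionError if any check fails.

```python
import string


def check_results(word_list, freq):
    """Return True if the cleaned list and frequency dict pass the checklist."""
    for word in word_list:
        assert word, "empty item found in the word list"
        assert word == word.lower(), f"not lowercase: {word!r}"
        assert word[0] not in string.punctuation, f"leading punctuation: {word!r}"
        assert word[-1] not in string.punctuation, f"trailing punctuation: {word!r}"
    # The dictionary should have exactly one key per distinct word.
    assert len(set(word_list)) == len(freq), "set size and dictionary size differ"
    # Each count should match the number of times the word appears in the list.
    for word, count in freq.items():
        assert word_list.count(word) == count, f"wrong count for {word!r}"
    return True


sample_list = ["my", "country", "and", "my", "universe"]
sample_freq = {"my": 2, "country": 1, "and": 1, "universe": 1}
print(check_results(sample_list, sample_freq))  # True
```

Automated checks complement, but do not replace, manually counting a few words as the checklist suggests.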