<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST2312/blob/main/CST2312_HD32_Spr2023_Class_12_DictionariesDecorators.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CST2312 Class12 - Dictionaries and Decorators**

  - Word count with dictionaries    
  - Anmol Tomar's Next Level Python on Decorators with a primer on foundational concepts used.     

Updated content as of 07-Mar-2023   

---

## Dictionary Word Count Example     

### Housekeeping    


In [21]:
import string    # for string processing, including punctuation    
import itertools # to take a slice of an iterable, including a dictionary slice
import pprint    # for pretty printing of a dictionary   



---



### File Load and Handle    

1. Take a copy of the UTF-8 text file of the book The Square Pegs by Ray Bradbury, 1948 from Professor Patrick's `data` repo on **GitHub** and load it to the current working directory with the file name `theSquarePegs.txt`.   

2. Open a file handle `sqrpegs_handle` as *read-only* for the text file of the book.       

In [None]:
!curl 'https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/main/theSquarePegs_RayBradbury.txt' -o 'theSquarePegs.txt'



---



### Iterate through the text    
    
* <u>Open the File</u>    
1. With `open()`, then skip the first 25 lines    
    
* <u>Wrangle</u>
2. Remove punctuation    
3. Transform to lower case    
4. Split wrangled lines into words    
    
* <u>Process words</u>    
5. Drop words that appear in the list of fluff words
6. Use a dictionary and `.get()` to count the instances of each word    
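
As a rough sketch of wrangling steps 2 to 4, the chain below is applied to a single made-up line. The sample text and the names `sample_line`, `remove_punct`, and `sample_words` are assumptions for illustration only and do not come from the book.

```python
import string

# Hypothetical sample line, not taken from the book
sample_line = "Hello, World! It's a test."

# Build a translation table that deletes every punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

# Step 2: strip punctuation, step 3: lower-case, step 4: split on whitespace
sample_words = sample_line.translate(remove_punct).lower().split()

print(sample_words)  # ['hello', 'world', 'its', 'a', 'test']
```

Note that the apostrophe counts as punctuation, so "It's" becomes `its`. That is likely why forms such as `dont` and `its` appear in the fluff list without apostrophes.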

In [35]:
                                                           #5 build a list of fluff words to drop
fluff_word_lst = ['the','to', 'of', 'and', 'a', 'in', 'you',
                  'or', 'her', 'she', 'with', 'this', 'on',
                  'it', 'for', 'was', 'not', 'i', 'is',
                  'he', 'be', 'that', 'any', 'we', 'there',
                  'are', 'at', 'by', 'from', 'do',
                  'project', 'gutenberg',
                  'gutenbergtm', 'they', 'as', 'them',
                  'your', 'if', 'no', 'were',
                  'an', 'into', 'us', 'where', 'what',
                  'who', 'can', 'then', 'other', 'dont',
                  'his', 'him', 'its', 'which', 'did']

sqrpegs_words_dict = dict()                                 #6  initialize word count dictionary

with open('/content/theSquarePegs.txt') as sqrpegs_handle:  #1  open the file
    for skipped in range(25):                               #1  counting skipped lines
        next(sqrpegs_handle)                                #1  skipping lines counted
    for line_str in sqrpegs_handle:                         #   iterate lines to wrangle
        # the following statement does the #2, #3, and #4 wrangling and split into a list of words 
        wrangled_words = line_str.translate(str.maketrans('', '', string.punctuation)).lower().split()
        for word in wrangled_words:                         #   process lists of wrangled words
            if word in fluff_word_lst:                      #5  drop fluff word
                continue
            sqrpegs_words_dict[word] = sqrpegs_words_dict.get(word, 0) +1  #6 keep a count in dictionary
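
To see the `.get()` counting pattern in isolation, here is a minimal sketch on a tiny hand-made list. The names `demo_words` and `demo_counts` and the words themselves are hypothetical.

```python
# Hypothetical list of already-wrangled words
demo_words = ['red', 'blue', 'red', 'green', 'red', 'blue']

demo_counts = {}
for word in demo_words:
    # .get(word, 0) returns the current count, or 0 the first time a word is seen
    demo_counts[word] = demo_counts.get(word, 0) + 1

print(demo_counts)  # {'red': 3, 'blue': 2, 'green': 1}
```

Without the default of `0`, the first lookup of each new word would need a separate `if word in demo_counts` test to avoid a `KeyError`.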

### Word Count Results    

#### Words counted    

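How many distinct words did we count?    
*`len()` of the dictionary gives the number of unique words, not the total number of word occurrences. Remember that we skipped fluff words whenever we read them.*   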

In [None]:
len(sqrpegs_words_dict)

#### Descending order list of words by frequency    

Let's create a list of `(word, count)` tuples sorted in descending order by word count.    
    
We will pass a `lambda` function as the `key=` keyword argument of `sorted()` so that each `key:value` pair is sorted by its `value`.    

In the following snippet of code:
 - The `sorted()` call receives three arguments:     
   1. an iterable to sort (here, the `(key, value)` tuples returned by the dictionary's `.items()` method)    
   2. `key=`, a function that returns the value to sort by    
   3. `reverse=True`, which requests descending order    
    
 - `lambda kv: kv[1]` takes each `(key, value)` tuple and returns the element in position 1 (the word count), which `sorted()` then uses as the sort key  
     
It can seem confusing that the term <b>*key*</b> here is used to refer to two different things:
 1. The sort-by value for the `sorted()` function    
 2. A `key` in the `key:value` pair of a dictionary     
    
Remember that the `.items()` method of a **dictionary** returns the `key:value` pairs of that dictionary as tuples.  Each tuple has the `key` as the element in position 0 and the `value` as the element in position 1.     

`sorted()` consumes the tuples returned by `.items()` and returns a new list of those tuples in sorted order, which we assign to `sorted_sqrpegs_words_lst`.   
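
A small worked example may make the two meanings of *key* easier to follow. The dictionary `demo_counts` and its counts are made up for illustration.

```python
# Hypothetical word counts, not real results from the book
demo_counts = {'pegs': 4, 'square': 9, 'mars': 2}

# .items() yields (key, value) tuples; the lambda picks position 1 (the count)
demo_sorted = sorted(demo_counts.items(), key=lambda kv: kv[1], reverse=True)

print(demo_sorted)  # [('square', 9), ('pegs', 4), ('mars', 2)]
```

The result is a list of tuples, not a dictionary. Changing the lambda to `kv[0]` would sort by the word itself instead of by its count.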

In [60]:
sorted_sqrpegs_words_lst = sorted(sqrpegs_words_dict.items(), key=lambda kv:kv[1], reverse=True)

We can use the **pretty-printer** (`pprint.pprint`) to print our list of words in descending order of use.    

In [None]:
pprint.pprint(sorted_sqrpegs_words_lst)

#### Words used more than ten times    


We can use a conditional test to keep only the words used more than ten times in the book (ignoring the fluff words) and store them in a new dictionary. The new dictionary keeps the original insertion order; it is not sorted yet.    
    
Let's use a **dictionary comprehension** to do this:        


In [63]:
sqrpegs_10x_words_dict = {k: v for k, v in sqrpegs_words_dict.items() if v > 10}
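
The same filtering pattern can be sketched on a tiny dictionary. The name `demo_counts` and its values are hypothetical.

```python
# Hypothetical word counts, not real results from the book
demo_counts = {'pegs': 14, 'square': 9, 'mars': 22, 'ship': 11}

# Keep only the pairs whose value passes the test; insertion order is preserved
demo_frequent = {k: v for k, v in demo_counts.items() if v > 10}

print(demo_frequent)  # {'pegs': 14, 'mars': 22, 'ship': 11}
```

Because the test is `v > 10`, a word used exactly ten times would be left out.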

Then sort the resulting dictionary's items using `sorted()` again, which gives a list of tuples in descending order.    

In [64]:
sorted_sqrpegs_10x_words_lst = sorted(sqrpegs_10x_words_dict.items(), key=lambda kv:kv[1], reverse=True)

In [None]:
pprint.pprint(sorted_sqrpegs_10x_words_lst)

#### Top-20 words used    

If we go back to the list `sorted_sqrpegs_words_lst` that contains tuples of `key:value` pair data sorted in descending order by `value` then we can easily identify the top-20 words used in the book (ignoring fluff words) by taking a slice of the first twenty elements in the list:    

In [None]:
pprint.pprint(sorted_sqrpegs_words_lst[:20])
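
Lists support slicing directly, but a dictionary's `.items()` view does not. The Housekeeping cell imports `itertools` for that case, so here is a minimal sketch of taking the first entries of a dictionary with `itertools.islice`. The dictionary `demo_counts` is hypothetical.

```python
import itertools

# Hypothetical counts, already in descending order of insertion
demo_counts = {'square': 9, 'pegs': 4, 'mars': 2, 'ship': 1}

# islice takes the first 2 (key, value) pairs in insertion order
first_two = dict(itertools.islice(demo_counts.items(), 2))

print(first_two)  # {'square': 9, 'pegs': 4}
```

This only gives the "top" entries if the dictionary was built in sorted order; otherwise it simply returns the first entries inserted.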

#### Top-20 words usage as a percentage of total words    

What if we wanted to know what percentage of the total (non-fluff) words each of the top-20 words accounts for?    

In [None]:
total_word_count = sum(sqrpegs_words_dict.values())  # total non-fluff word occurrences (len() would only count distinct words)
for word, count in sorted_sqrpegs_words_lst[:20]:
    word_pct = f'{count / total_word_count * 100:.2f}'  # calculate the percentage to 2 decimal places
    print(f'"{word}" is used {count} times, which is {word_pct} percent of the total words.')  # print the 3 variables
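
A small worked example, with made-up counts, may help separate the number of distinct words (`len()` of the dictionary) from the total number of word occurrences (the sum of its values). A share of the total is measured against the latter. All names below are hypothetical.

```python
# Hypothetical word counts, not real results from the book
demo_counts = {'square': 6, 'pegs': 3, 'mars': 1}

distinct_words = len(demo_counts)             # 3 different words
total_occurrences = sum(demo_counts.values())  # 6 + 3 + 1 = 10 words read

# Percentage of all occurrences for each word, rounded to 2 decimal places
demo_pct = {word: round(count * 100 / total_occurrences, 2)
            for word, count in demo_counts.items()}

print(demo_pct)  # {'square': 60.0, 'pegs': 30.0, 'mars': 10.0}
```

Dividing by `distinct_words` instead would report 200 percent for `square`, which is a quick sign that the wrong denominator is in use.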



---



## Next Level Python (continued): Decorators     




---

