
# CST2312 - Class #08
**Dictionaries (from Py4E.com, lesson 10 - former lesson 9)**   
**Tuples (from Py4E.com, lesson 11 - former lesson 10)**    

by Professor Patrick, 28-Sep-2022    


This notebook on Colab at `https://bit.ly/cst2312cl08`    

This notebook works with shared links to the following files on Google Drive:    

- [words.txt](https://bit.ly/py4e-words)
- [romeo.txt](https://bit.ly/romeo-short)
- [romeo-full.txt](https://bit.ly/romeo-full)




---



# **Dictionaries**
**(from Py4E.com, lesson 10 - former lesson 9)**  

A dictionary is like a list, but more general. In a list, the index positions have to be integers; in a dictionary, the indices can be (almost) any type.

You can think of a dictionary as a mapping between a set of indices (which are called keys) and a set of values. Each key maps to a value. The association of a key and a value is called a *key-value pair* or sometimes an item.

As an example, we’ll build a dictionary that maps from English to Spanish words, so the keys and the values are all strings.

The function dict creates a new dictionary with no items. Because dict is the name of a built-in function, you should avoid using it as a variable name.

In [None]:
eng2sp = dict()
print(eng2sp)


The curly brackets, {}, represent an empty dictionary. To add items to the dictionary, you can use square brackets:

In [None]:
# create a string and refer to an index position in the string

our_str = 'hello there'

print(our_str[6])

In [None]:
eng2sp['one'] = 'uno'

The statement above creates a dictionary entry that maps from the key `'one'` to the value `'uno'`. If we print the dictionary again, we see a *key-value pair* with a colon between the key and value:

In [None]:
print(eng2sp)

In [None]:
eng2sp['two'] = 'segundo'

In [None]:
print(eng2sp)

In [None]:
eng2fr = {'one': 'un', 'two': 'deux'}

In [None]:
print(eng2fr)

This output format is also an input format. For example, you can create a new dictionary with three items. If you were to print `eng2sp` in versions of Python before Python 3.7, you might be surprised to see the output in what appeared to be a random order. That was because of the way keys were hashed and stored, and the language did not guarantee any particular order.

In [None]:
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print(eng2sp)

Since Python 3.7, dictionary entries are guaranteed to be printed (returned) in the order in which the *key-value pairs* were entered into the dictionary.   

Either way, the order is rarely a concern because the elements of a dictionary are never indexed with integer indices. Instead, you use the keys to look up the corresponding values:

In [None]:
print(eng2sp['two'])

The key 'two' always maps to the value “dos” so the order of the items doesn’t matter.

If the key isn’t in the dictionary, you get an exception:

In [None]:
print(eng2sp['four'])

The `len` function works on dictionaries; it returns the number of *key-value pairs*:

In [None]:
len(eng2sp)

The `in` operator works on dictionaries; it tells you whether something appears as a *key* in the dictionary. Appearing as a *value* is not enough: `in` returns `False` unless the search argument matches one of the *keys*.

In [None]:
'one' in eng2sp

In [None]:
'uno' in eng2sp

In [None]:
'uno' not in eng2sp

To see whether something appears as a value in a dictionary, you can use the method `values`, which returns the values as a type that can be converted to a list, and then use the `in` operator:

In [None]:
vals = list(eng2sp.values())
print(len(vals))
print(vals)

In [None]:
'uno' in vals

The `in` operator uses different algorithms for lists and dictionaries. For lists, it uses a linear search algorithm. As the list gets longer, the search time gets longer in direct proportion to the length of the list. For dictionaries, Python uses a data structure called a hash table that has a remarkable property: the `in` operator takes about the same amount of time no matter how many items there are in a dictionary. I won’t explain why hash functions are so magical, but you can read more about it at [**Wikipedia.org on Hash tables**](https://en.wikipedia.org/wiki/Hash_table).
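A rough way to see the difference is to time the same membership test against a list and a dictionary that hold the same items. The sketch below assumes the standard `timeit` module; the size `n` and the variable names are arbitrary choices for illustration, and the actual timings will vary from machine to machine.

```python
import timeit

n = 100_000
num_lst = list(range(n))
num_dict = dict.fromkeys(num_lst)   # same items as keys; every value is None

target = n - 1   # worst case for the list: the last element

# both containers agree on the answer
in_list = target in num_lst
in_dict = target in num_dict
print(in_list, in_dict)

# time 100 lookups in each container
list_time = timeit.timeit(lambda: target in num_lst, number=100)
dict_time = timeit.timeit(lambda: target in num_dict, number=100)
print('list:', list_time, 'seconds')
print('dict:', dict_time, 'seconds')
```

Try increasing `n` and rerunning: the list time should grow roughly in proportion, while the dictionary time should stay about the same.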

**Exercise 1: Download a copy of the file www.py4e.com/code3/words.txt**    
    
**You may also access the file here [words.txt](https://bit.ly/py4e-words)**    

**Write a program that reads the words in words.txt and stores them as keys in a dictionary. It doesn’t matter what the values are. Then you can use the in operator as a fast way to check whether a string is in the dictionary.**

## **Dictionary as a set of counters**

Suppose you are given a string and you want to count how many times each letter appears. There are several ways you could do it:

1. You could create 26 variables, one for each letter of the alphabet. Then you could traverse the string and, for each character, increment the corresponding counter, probably using a chained conditional.

2. You could create a list with 26 elements. Then you could convert each character to a number (using the built-in function ord), use the number as an index into the list, and increment the appropriate counter.

3. You could create a dictionary with characters as keys and counters as the corresponding values. The first time you see a character, you would add an item to the dictionary. After that you would increment the value of an existing item.


Each of these options performs the same computation, but each of them implements that computation in a different way.

An implementation is a way of performing a computation; some implementations are better than others. For example, an advantage of the dictionary implementation is that we don’t have to know ahead of time which letters appear in the string and we only have to make room for the letters that do appear.

Here is what the code might look like:

In [None]:
# example code to count characters in a string  
# USING A FOR LOOP AND EMBEDDED IF/ELSE
# word_str   -- a string with a word as its value
# histo_dict -- a histogram dictionary of counts of letters
# char_str   -- an individual character as a string

# store a word in a string variable
word_str = 'brontosaurus'
# try with a different string assignment later
# word_str = 'tyrannosaurus'
# word_str = 'tyrannosaurus are not the content of bronto burgers'

# initialize an empty histogram dictionary
histo_dict = dict()

# loop through each character in the word and
# add a count of 1 to each letter found
for char_str in word_str:
    if char_str not in histo_dict:
        histo_dict[char_str] = 1
    else:
        histo_dict[char_str] = histo_dict[char_str] + 1

# print the histogram dictionary of counts of letters
print(histo_dict)

We are effectively computing a histogram, which is a statistical term for a set of counters (or frequencies).

The for loop traverses the string. Each time through the loop, if the character `char_str` is not in the dictionary, we create a new item with key of `char_str` and the initial value `1` (since we have seen this letter once). If `char_str` is already in the dictionary we increment `histo_dict[char_str]`.  The histogram indicates that the letters “a” and “b” appear once; “o” appears twice, and so on.
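For comparison, option 2 from the list of approaches (a 26-element list indexed with `ord`) might be sketched as follows. This is only an illustration and it assumes the string contains nothing but lowercase letters `a` to `z`; any other character would produce a wrong index. The name `count_lst` is chosen here for the example.

```python
word_str = 'brontosaurus'

# one counter per lowercase letter, all starting at zero
count_lst = [0] * 26

for char_str in word_str:
    # ord('a') is 97, so 'a' maps to index 0, 'b' to index 1, and so on
    count_lst[ord(char_str) - ord('a')] += 1

# report only the letters that were seen
for index in range(26):
    if count_lst[index] > 0:
        print(chr(ord('a') + index), count_lst[index])
```

Notice that the list always holds 26 counters, while the dictionary version only holds entries for the letters that actually appear.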

Dictionaries have a method called `get` that takes a `key` and a default value. If the `key` appears in the dictionary, then `get` returns the corresponding value; otherwise it returns the default value.     

For example:    

In [None]:
counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}

In [None]:
print(type(counts))
print(counts)

In [None]:
print(counts['jan'])

In [None]:
print(counts.get('jan', 0))

What if the `get` method does not find the `key` given as an argument?   

In [None]:
print(counts['tim'])

In [None]:
print(counts.get('tim', 0))

We can use `get` to write our histogram loop more concisely. Because the `get` method automatically handles the case where a `key` is not in a dictionary, we can reduce four lines down to one and eliminate the if statement.

In [None]:
# example code to count characters in a string 
# USING THE .GET METHOD  
# word_str   -- a string with a word as its value
# histo_dict -- a histogram dictionary of counts of letters
# char_str   -- an individual character as a string

# store a word in a string variable
word_str = 'brontosaurus'

# initialize an empty histogram dictionary
histo_dict = dict()
# or, initialize a dictionary with some k-v pairs
# histo_dict = {'x' : 1, 'y' : 1,'z' : 1, 'b' : 90, 'u' : 80}

# use a for loop
for char_str in word_str:
    histo_dict[char_str] = histo_dict.get(char_str, 0) + 1
# try using a different value for the second argument to .get() 
#    histo_dict[char_str] = histo_dict.get(char_str, 100) + 1
# the following attempt would raise a KeyError on 'b'
#    histo_dict[char_str] = histo_dict[char_str] + 1

# print the histogram dictionary of counts of letters
print(histo_dict)

The use of the `get` method to simplify this counting loop ends up being a very commonly used *idiom* in Python and we will use it many times in the rest of the class. So you should take a moment and compare the loop using the `if` statement and `in` operator with the loop using the `get` method. They do exactly the same thing, but one is more succinct.

## **Dictionaries and files**

One of the common uses of a dictionary is to count the occurrence of words in a file with some written text.     

Let’s start with a very simple file of words taken from the text of **Romeo and Juliet**.

For the first set of examples, we will use a shortened and simplified version of the text with no punctuation. Later we will work with the text of the scene with punctuation included.

~~~
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
~~~

We will write a Python program to read through the lines of the file, break each line into a list of words, and then loop through each of the words in the line and count each word using a dictionary.

You will see that we have two `for` loops. The outer loop is reading the lines of the file and the inner loop is iterating through each of the words on that particular line. This is an example of a pattern called nested loops because one of the loops is the outer loop and the other loop is the inner loop.

Because the inner loop executes all of its iterations each time the outer loop makes a single iteration, we think of the inner loop as iterating “more quickly” and the outer loop as iterating more slowly.

The combination of the two nested loops ensures that we will count every word on every line of the input file.

**Upload the Py4E exercise file "words.txt" to your Colab content area then enter the file name "words.txt" when prompted by the following code.  You may download the file here [words.txt](https://bit.ly/py4e-words), then upload it to your current working directory.**   


**Or, download [romeo.txt](https://bit.ly/romeo-short), upload it to your current working directory, and enter the file name "romeo.txt" when prompted by the following code.**  

In [None]:
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

counts = dict()
for line in fhand:
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

print(counts)

# Code: http://www.py4e.com/code3/count1.py

In our else statement, we use the more compact alternative for incrementing a variable.     

`counts[word] += 1` *is equivalent to*         
`counts[word] = counts[word] + 1`.      

Either method can be used to change the value of a variable by any desired amount. Similar alternatives exist for `-=`, `*=`, and `/=`.
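A short sketch of the other augmented assignment operators, using a made-up variable `total`:

```python
total = 10
total += 5    # same as total = total + 5
total -= 3    # same as total = total - 3
total *= 2    # same as total = total * 2
total /= 4    # same as total = total / 4; the result is a float
print(total)
```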

When we run the program, we see a raw dump of all of the counts, listed in the order in which each word was first encountered rather than sorted by word or by count. (The **romeo.txt** file is available at **www.py4e.com/code3/romeo.txt** or using the link **[romeo.txt](https://bit.ly/romeo-short)**.)

**Upload the Py4E exercise file "romeo-full.txt" to your Colab content area then enter the file name "romeo-full.txt" when prompted by rerunning the code above.**     

**You can find the file here: [romeo-full.txt](https://bit.ly/romeo-full)**

It is a bit inconvenient to look through the dictionary to find the most common words and their counts, so we need to add some more Python code to get us the output that will be more helpful.

## **Looping and dictionaries**

If you use a dictionary as the sequence in a for statement, it traverses the keys of the dictionary. This loop prints each key and the corresponding value:

In [None]:
counts_dict = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
for key in counts_dict:
    print(key, counts_dict[key])

Again, the keys are in the order in which `key-value pairs` were entered into the dictionary.

We can use this pattern to implement the various loop idioms that we have described earlier. For example if we wanted to find all the entries in a dictionary with a value above ten, we could write the following code:

In [None]:
counts_dict = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
for key in counts_dict:
    if counts_dict[key] > 10 :
        print(key, counts_dict[key])

In [None]:
# The same logic applied to the counts dict from word counting 
for key in counts:
    if counts[key] > 20 :
        print(key, counts[key])

The for loop iterates through the keys of the dictionary, so we must use the index operator to retrieve the corresponding value for each key. We see only the entries with a value above 10.

If you want to print the keys in alphabetical order, you first make a list of the keys in the dictionary using the keys method available in dictionary objects, and then sort that list and loop through the sorted list, looking up each key and printing out key-value pairs in sorted order as follows:

In [None]:
counts_dict = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
key_lst = list(counts_dict.keys())
print(key_lst)
key_lst.sort()
for key in key_lst:
    print(key, counts_dict[key])

In [None]:
# using the word count dictionary from earlier
key_lst = list(counts.keys())
# print(key_lst)
key_lst.sort()
for key in key_lst:
    if counts[key] > 20 :
        print(key, counts[key])

First you see the list of `keys` in unsorted order that we get from the `.keys()` method. Then we see the `key-value pairs` in order from the for loop.
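As a possible shortcut, the built-in `sorted()` function accepts a dictionary directly and returns a new, sorted list of its keys, so the separate `list()`, `.keys()`, and `.sort()` steps can be combined. The name `sorted_keys` is chosen here for illustration.

```python
counts_dict = {'chuck': 1, 'annie': 42, 'jan': 100}

# sorted() iterates over the keys and returns them as a new sorted list
sorted_keys = sorted(counts_dict)

for key in sorted_keys:
    print(key, counts_dict[key])
```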



---



# **Tuples**     
**(from All you need to know about Tuples in Python. 13-Sep-2021.  By Andreas Soularidis on Medium.com)**    


## Introduction     

First of all, let’s talk about Python tuples in general. A tuple stores data of multiple types, such as str, int, float, and bool, in a single variable. Lists and tuples have a lot in common and some important differences as well. In both data types the elements are ordered, so the items have a defined order that will not change. In both, we don’t need to declare the size in advance, as Python takes care of that. Moreover, both data types allow duplicate values. The big difference between lists and tuples is that a tuple is an immutable data type, so we cannot add, update, or delete the elements of a tuple after it has been created.


## When to use tuples    

The question is when to use a list and when to use a tuple in my code. A general guide is to use tuples only if you are absolutely sure that the data will not change during the execution of the program. For example, if you want to store the days of the week you can store them in a tuple.    
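The days-of-the-week case might look like the sketch below. The name `days_tuple` and the replacement value `'Funday'` are made up for the example; the `try`/`except` is only there to show that an attempted change is rejected.

```python
days_tuple = ('Monday', 'Tuesday', 'Wednesday', 'Thursday',
              'Friday', 'Saturday', 'Sunday')

print(len(days_tuple))
print(days_tuple[0])

# an attempt to change an element raises a TypeError
try:
    days_tuple[0] = 'Funday'
    changed = True
except TypeError as err:
    changed = False
    print('Cannot modify a tuple:', err)
```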


## Create a tuple    

To create a tuple we use round brackets (parentheses), or we can use the constructor tuple() like below:    

In [None]:
first_tuple = ('dog', 4.5, True, 7, 'apple')
print(first_tuple)

In [None]:
second_tuple = tuple(['dog', 4.5, True, 7, 'apple'])
print(second_tuple)


Notice that in the second example the constructor `tuple()` takes a list as its argument and turns it into a tuple. This syntax will be extremely useful later.    

To create a single-element tuple we put a comma after the element, so that Python recognizes it as a tuple and not as a simple expression in parentheses. Look at the example below:    

In [None]:
solo_tuple = (1,)
print(solo_tuple)
print(type(solo_tuple))

In [None]:
not_tuple = (1)
print(not_tuple)
print(type(not_tuple))

## Accessing tuple items    

Tuples, like lists, are zero-indexed, so the first element of a tuple is at index `0`, the second is at index `1`, and so on. Negative indices count from the end: index `-1` is the last element of a tuple, index `-2` is the second to last element, etc. Using an index in square brackets, we can access specific tuple values as follows:    

In [None]:
print(first_tuple[1])

In [None]:
print(first_tuple[-5])

## Slicing    

Like lists, we can have access to a specific range of a tuple using the following syntax

```
tuple[start_index : end_index : pace]
```
This expression returns a new tuple with the elements from start_index up to, but not including, end_index, taking every pace-th element. By default, the pace (often called the step) has a value of 1. If we omit the start_index, the slice starts at index 0. Similarly, if we omit the end_index, the slice runs through the last element of the tuple. Last, if we omit both indexes we get a tuple with all of the same elements.     

Let’s see some examples below:

In [None]:
print(first_tuple[1:3])

In [None]:
print(first_tuple[::2])

In [None]:
print(first_tuple[:4])

In [None]:
print(first_tuple[:])
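Two more slicing patterns that may be useful are a negative pace, which walks the tuple backwards, and a negative start index, which counts from the end. The sketch below uses its own `sample_tuple` so that it can run on its own.

```python
sample_tuple = ('dog', 4.5, True, 7, 'apple')

# a pace of -1 returns the elements in reverse order
reversed_tuple = sample_tuple[::-1]
print(reversed_tuple)

# a start index of -2 returns the last two elements
last_two = sample_tuple[-2:]
print(last_two)
```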

## Modifying tuple content     

### Add elements to tuples    

As we mentioned earlier, tuples are immutable data types, so we can not add, update or delete elements after the tuple has been created. In lists, we have the method `append()` to add new elements in an existing list, but in tuples, we don’t have this method. So how can we add elements to an existing tuple?     

To do this, we make a three-step trick.     

1. First, we convert our tuple to a list. We can do this using the constructor `list()`. At this point, we have a list with the same elements as our tuple, so now we can use the `append()` method.     

2. The second step is to use the `append()` method to add the element to our list. At this point, we have all the elements we want on a list.     

3. The third and last step is to convert our list into a tuple, using the constructor `tuple()`. Now, we have all elements we want in a tuple.    

*We can use the `id()` function to check whether the tuple we begin with and the final tuple are distinct objects.*   

Let’s see an example below:    

In [None]:
# check the tuple contents and object id
print(first_tuple)
print(id(first_tuple))

In [None]:
# step 1 - create a list from the tuple
first_list = list(first_tuple)
print(first_list)

In [None]:
# step 2 - append to the new list
first_list.append('banana')
print(first_list)

In [17]:
# step 3 - replace the tuple with a tuple of the new list
first_tuple = tuple(first_list)

In [None]:
# check the tuple contents and object id
print(first_tuple)
print(id(first_tuple))

Be careful: our initial tuple is not the same object as the final tuple, despite the fact that both have the same name `first_tuple`. We can easily check this by noticing the difference in `id` values.
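An alternative to the list round trip, sketched here as one option rather than the method from the article, is to build a new tuple with the `+` operator. The names `original_tuple` and `longer_tuple` are chosen for this example.

```python
original_tuple = ('dog', 4.5, True, 7, 'apple')

# ('banana',) is a single-element tuple; note the trailing comma
longer_tuple = original_tuple + ('banana',)

print(original_tuple)   # the original tuple is unchanged
print(longer_tuple)
print(longer_tuple is original_tuple)
```

As with the three-step trick, the result is a new object; the original tuple is never modified.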

### Change tuple elements    

If we try to update the value of an element in tuples we will get an error, because tuples are immutable data types, like below:

In [None]:
print(second_tuple)

In [None]:
second_tuple[0] = 'cat'

*Before we go further, let's create a `third_tuple` with the same element values.*      

*We will use this later:*    

In [26]:
third_tuple = second_tuple[:] 

*Why don't we just use the following statement to create the third tuple?*     

```
third_tuple = second_tuple 
```


To update the value of an element in a tuple we will use the same trick as above.     

1. First, convert our tuple into a list.     

2. Then, we update the elements we want using the index.     

3. Then, we convert our list into a tuple. Now our tuple has the updated values.     

Let’s see the code below:

In [None]:
# check the tuple contents and object id
print(second_tuple)
print(id(second_tuple))

In [None]:
# step 1 - create a list from the tuple
second_list = list(second_tuple)
print(second_list)

In [None]:
# step 2 - replace a list member using index
second_list[0] = 'cat'
print(second_list)

In [24]:
# step 3 - replace the tuple with a tuple of the new list
second_tuple = tuple(second_list)

In [None]:
# check the tuple contents and object id
print(second_tuple)
print(id(second_tuple))

### Delete tuple elements    

As you can guess, we cannot remove elements in a tuple. So we use the same trick as above and take advantage of the `.remove()` method of the list data type to remove element(s).     

Let’s see the code below:

In [None]:
# check the tuple contents and object id
print(third_tuple)
print(id(third_tuple))

In [None]:
# step 1 - create a list from the tuple
third_list = list(third_tuple)
print(third_list)

In [None]:
# step 2 - remove from the new list
third_list.remove(True)
print(third_list)

In [30]:
# step 3 - replace the tuple with a tuple of the new list
third_tuple = tuple(third_list)

In [None]:
# check the tuple contents and object id
print(third_tuple)
print(id(third_tuple))

## Loops in tuples    

One of the most common actions on iterables like lists and tuples is iteration. We have a couple of ways to loop over the elements of a tuple. The most common is using a `for` loop.     

The syntax of the loop is the following:
```
for element in my_tuple:
   do something with the element
```

In [None]:
for element in first_tuple:
    print(element)

In [None]:
for element in second_tuple:
    print(element)

In [None]:
for element in third_tuple:
    print(element)

As in lists, we can use the index number to iterate the elements of a tuple. To do this, we use the for loop with `len()` and `range()` functions.
```
for index in range(len(my_tuple)):
   do something with my_tuple[index]
```

In [None]:
for index in range(len(first_tuple)):
    print(first_tuple[index])

## Tuple into a tuple    

As mentioned earlier, in a tuple we can store data of different types. We can even store a tuple inside a tuple.      

We can have access to inner tuple elements using the index of the inner tuple and the index of the specific element.      

Let’s see the code below:

In [None]:
fourth_tuple = ('dog', 4.5, True, 7, 'apple', (1, 2, 3))
print(fourth_tuple)

In [None]:
print(fourth_tuple[-1][0])

*We should consider the position of nested iterables when using more than one index.*   

In [None]:
print(fourth_tuple[1][0])

## Unpacking Tuples    
Whenever we create a tuple we “pack” elements into it. Python also allows us to extract the elements of a tuple and store them in variables.     

Let’s see an example below:    

In [None]:
fifth_tuple = (12, 34, 2, 6, 24)   

first, second, third, fourth, fifth = fifth_tuple   

print(fifth_tuple)

print(first, second, third, fourth, fifth)

In the example above, we have a tuple of numbers and we extract those numbers into variables. Be careful: the number of elements in the tuple must be exactly the same as the number of variables, otherwise we will get an error. In case we are interested in only some of the elements of the tuple, we can use the asterisk `*` syntax as below:

In [None]:
primary, secondary, *rest = fifth_tuple  

print(primary, secondary, rest)

print('Type of `primary` is', type(primary))
print('Type of `secondary` is', type(secondary))
print('Type of `rest` is', type(rest))

*Bear in mind that a variable which uses an asterisk to take values from a tuple will itself be a list.*    

As we can notice, in the case above, Python returns a list that includes the rest of the elements of the tuple.     

The following is another interesting, valid syntax:     


In [None]:
alpha, *medio, omega = fifth_tuple

print(alpha, medio, omega)

print('Type of `alpha` is', type(alpha))
print('Type of `medio` is', type(medio))
print('Type of `omega` is', type(omega))

## Tuple Methods    


Let's create one more example tuple to demonstrate the two tuple methods:    
1. `.count()`    
2. `.index()`    

In [46]:
sixth_tuple = (12, 34, 2, 6, 24, 34)

## `tuple.count()`     

Returns the number of times the given value appears in a tuple.    

In [47]:
print(sixth_tuple.count(34))

2


## `tuple.index()`     

Returns the index of the first appearance of the given element. Be careful that if the given number doesn’t exist, it raises a *ValueError* exception.   

In [48]:
print(sixth_tuple.index(34))

1


In [None]:
print(sixth_tuple.index(7))
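One way to avoid the *ValueError* is to test membership with `in` before calling `.index()`. In the sketch below, the `-1` used as a "not found" marker is an assumption made for this example, not something `.index()` itself returns.

```python
numbers_tuple = (12, 34, 2, 6, 24, 34)
target = 7

# only call .index() when the value is known to be present
if target in numbers_tuple:
    position = numbers_tuple.index(target)
else:
    position = -1   # hypothetical "not found" marker for this sketch

print(position)
```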



---



# **Tuples**
**(from Py4E.com, lesson 11 - former lesson 10)**  

## 11(10).1 Tuples are immutable

In [None]:
# Syntactically, a tuple is a comma-separated list of values:
t = 'a', 'b', 'c', 'd', 'e'

In [None]:
# enclose tuples in parentheses to identify easily
t = ('a', 'b', 'c', 'd', 'e')

In [None]:
# create a tuple with a single element
t1 = ('a',)  # end a single element with ','
type(t1)

In [None]:
t2 = ('a')
type(t2)

In [None]:
# built-in function tuple
t = tuple()
print(t)

In [None]:
# built-in function tuple with string as argument
t = tuple('lupins')
print(t)

In [None]:
t = ('a', 'b', 'c', 'd', 'e')
print(t[0]) # bracket operator indexes an element

In [None]:
print(t[1:3]) # slice operator selects a range of elements

In [None]:
t[0] = 'A'

In [None]:
# can’t modify the elements of a tuple, but can replace one tuple with another
t = ('A',) + t[1:]
print(t)

## 11(10).2 Comparing tuples

In [None]:
(0, 1, 2) < (0, 3, 4) # comparing with respective sequence elements

In [None]:
(0, 1, 2000000) < (0, 3, 4)

In [None]:
# you have a list of words and you want to sort them from longest to shortest:
txt = 'but soft what light in yonder window breaks'
words = txt.split()
t = list()
for word in words:
    t.append((len(word), word))

t.sort(reverse=True)

res = list()
for length, word in t:
    res.append(word)

print(res)

## 11(10).3 Tuple assignment

In [None]:
# 1st Way

# a unique syntactic feature of the Python language is the ability to have a tuple on the left side of an assignment statement.
m = [ 'have', 'fun' ]
x, y = m

In [None]:
x

In [None]:
y

In [None]:
# 2nd Way

m = [ 'have', 'fun' ]
x = m[0]
y = m[1]

In [None]:
x

In [None]:
y

In [None]:
# 3rd Way

m = [ 'have', 'fun' ]
(x, y) = m

In [None]:
x

In [None]:
y

In [None]:
# application of tuple assignment allows us to swap the values of two variables in a single statement
x, y = y, x

In [None]:
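# the number of variables on the left must match the number of values on the right;
# this statement raises a ValueError
a, b = 1, 2, 3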

In [None]:
addr = 'monty@python.org'
uname, domain = addr.split('@')

In [None]:
print(uname)

In [None]:
print(domain)

## 11(10).4 Dictionaries and tuples

In [None]:
d = {'a':10, 'b':1, 'c':22}
t = list(d.items())
print(t)
# Here each tuple is a key-value pair:

[('a', 10), ('b', 1), ('c', 22)]


In [None]:
t

[('a', 10), ('b', 1), ('c', 22)]

In [None]:
t.sort()

In [None]:
t

[('a', 10), ('b', 1), ('c', 22)]

## 11(10).5 Multiple assignment with dictionaries

In [None]:
for key, val in list(d.items()):
    print(val, key)

In [None]:
d = {'a':10, 'b':1, 'c':22}
l = list()
for key, val in d.items() :
    l.append( (val, key) )

In [None]:
l

In [None]:
l.sort(reverse=True)
l

## 11(10).6 The most common words

**Upload the Py4E exercise file "romeo-full.txt" to your Colab content area.**     

**You can find the file here: [romeo-full.txt](https://bit.ly/romeo-full)**

In [None]:
import string
fhand = open('romeo-full.txt')
counts = dict()
for line in fhand:
    line = line.translate(str.maketrans('', '', string.punctuation))
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

# Sort the dictionary by value
lst = list()
for key, val in list(counts.items()):
    lst.append((val, key))

lst.sort(reverse=True)

# each tuple in lst is (count, word), so unpack in that order
for count, word in lst[:10]:
    print(count, word)
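As an alternative sketch, the built-in `sorted()` function can order the *key-value pairs* by value directly, without first building a list of reversed tuples. The small `counts` dictionary below is made-up sample data so that the example runs without the file.

```python
counts = {'the': 3, 'and': 2, 'sun': 2, 'moon': 1}   # made-up sample data

# sort (word, count) pairs by the count, largest first
top_items = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for word, count in top_items[:3]:
    print(word, count)
```

The `key` argument tells `sorted()` which part of each pair to compare; here `item[1]` is the count.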



---



# **Advanced text parsing**

In the above example using the file **romeo.txt**, we made the file as simple as possible by removing all punctuation by hand. The actual text has lots of punctuation, as shown below.

```
But, soft! what light through yonder window breaks?

It is the east, and Juliet is the sun.

Arise, fair sun, and kill the envious moon,

Who is already sick and pale with grief,
```

Since the Python `.split()` method looks for spaces and treats words as tokens separated by spaces, we would treat the words “soft!” and “soft” as different words and create a separate dictionary entry for each word.

Also since the file has capitalization, we would treat “who” and “Who” as different words with different counts.

We can solve both these problems by using the string methods `.lower()` and `.translate()` together with the constant `string.punctuation` from the `string` module. The `.translate()` method is the most subtle of these. 

In [None]:
# line.translate(str.maketrans(from_str, to_str, delete_str))

Replace the characters in `from_str` with the character in the same position in `to_str` and delete all characters that are in `delete_str`. The `from_str` and `to_str` can be empty strings and the `delete_str` parameter can be omitted.
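A small sketch of all three arguments at work, using a made-up sample string:

```python
sample = 'hello, world!'

# replace 'l' with '0' and 'o' with '1', and delete ',' and '!'
table = str.maketrans('lo', '01', ',!')
result = sample.translate(table)

print(result)
```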

We will pass empty strings for `from_str` and `to_str`, and we will use the `delete_str` parameter to delete all of the punctuation. We will even let Python tell us the list of characters that it considers “punctuation”:

In [None]:
import string
string.punctuation

The parameters used by `translate` were different in Python 2.

We make the following modifications to our program:

In [None]:
import string

fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()

counts = dict()
for line in fhand:
    line = line.rstrip()
    line = line.translate(str.maketrans('', '', string.punctuation))
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

print(counts)

# Code: http://www.py4e.com/code3/count2.py

Part of learning the “Art of Python” or “Thinking Pythonically” is realizing that Python often has built-in capabilities for many common data analysis problems. Over time, you will see enough example code and read enough of the documentation to know where to look to see if someone has already written something that makes your job much easier.
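One example of such a built-in capability is the `Counter` class in the standard `collections` module, which does the word counting and the "most common" ranking for us. The sketch below is not part of the Py4E lesson; the sample text is a made-up, punctuation-free string.

```python
from collections import Counter

text = 'but soft what light through yonder window breaks it is the east and juliet is the sun'

# Counter builds the word-count dictionary in one step
counts = Counter(text.split())

# most_common(n) returns the n (word, count) pairs with the highest counts
top_two = counts.most_common(2)
print(top_two)
```

A `Counter` is a kind of dictionary, so everything covered above, such as `in`, `get`, and looping over keys, still works on it.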



---



# **Debugging**

As you work with bigger datasets it can become unwieldy to debug by printing and checking data by hand. Here are some suggestions for debugging large datasets:

**Scale down the input**
If possible, reduce the size of the dataset. For example if the program reads a text file, start with just the first 10 lines, or with the smallest example you can find. You can either edit the files themselves, or (better) modify the program so it reads only the first n lines.

If there is an error, you can reduce n to the smallest value that manifests the error, and then increase it gradually as you find and correct errors.
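Reading only the first n lines might be sketched with `itertools.islice`, as below. Here `io.StringIO` stands in for an open text file so that the example runs without any uploads; with a real file you would use the handle returned by `open()` instead.

```python
import io
from itertools import islice

# io.StringIO stands in for an open text file in this sketch
fhand = io.StringIO('line one\nline two\nline three\nline four\n')

n = 2   # number of lines to read while debugging

first_lines = []
for line in islice(fhand, n):
    first_lines.append(line.rstrip())

print(first_lines)
```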

**Check summaries and types**
Instead of printing and checking the entire dataset, consider printing summaries of the data: for example, the number of items in a dictionary or the total of a list of numbers.

A common cause of runtime errors is a value that is not the right type. For debugging this kind of error, it is often enough to print the type of a value.

**Write self-checks**
Sometimes you can write code to check for errors automatically. For example, if you are computing the average of a list of numbers, you could check that the result is not greater than the largest element in the list or less than the smallest. This is called a “sanity check” because it detects results that are “completely illogical”.

Another kind of check compares the results of two different computations to see if they are consistent. This is called a “consistency check”.
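Both kinds of check can be written with `assert` statements, which stop the program with an error if the condition is false. The numbers below are made-up sample data, and the tolerance `1e-9` is an arbitrary choice to allow for floating-point rounding.

```python
numbers = [12, 34, 2, 6, 24]
average = sum(numbers) / len(numbers)

# sanity check: an average can never fall outside the range of the data
assert min(numbers) <= average <= max(numbers), 'average out of range'

# consistency check: the average times the count should reproduce the total
assert abs(average * len(numbers) - sum(numbers)) < 1e-9, 'inconsistent average'

print(average)
```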

**Pretty print the output**
Formatting debugging output can make it easier to spot an error.
Again, time you spend building scaffolding can reduce the time you spend debugging.
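For dictionaries and nested data, the standard `pprint` module is one option for pretty printing. In the sketch below the data is made up, and `width=30` is an arbitrary value chosen to force one entry per line; `pformat` returns the same text that `pprint.pprint` would print.

```python
from pprint import pformat

# made-up data for illustration
counts = {'romeo': 40, 'juliet': 32, 'the': 42,
          'nested': {'a': 1, 'b': 2}}

# a narrow width forces each top-level entry onto its own line
formatted = pformat(counts, width=30)
print(formatted)
```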


---
