
# Python dictionaries: Mapping keys to values

In our initial example of "how do I count how often each word appears in a newspaper article", the central data structure was a "sheet of counts": For each word, we kept a sequence of dashes indicating how often we had seen that word.

In this notebook you will learn how to use Python dictionaries for multiple purposes.
* You'll learn how to store "sheets of counts", for example of frequent words in a text.
* More generally, you can view a dictionary as a flexible-sized collection of variables (containers). We'll discuss how that is useful.
* We'll also discuss how to use Python dictionaries to store an attribute-value matrix or feature structure (something like this: https://en.wikipedia.org/wiki/Feature_structure)

Here's how it works technically. Imagine a dictionary, say, a translation dictionary from English to German. To look up how to say "cat" in German, you would look under "cat", and you would find "Katze". To retrieve a piece of information, namely "Katze", you look it up under a particular *key*, here its word in English. All you need to know to find it is the key: You can find it under 'cat'.

In the "sheet of counts", we *connected* each word to a count (a sequence of dashes). We *mapped* each word to a count. This is what a Python dictionary does: It stores *values* under particular *keys*. It connects each key to a value, so that you can use the key to look up the value.

Here is what a Python dictionary for an English-to-German dictionary could look like:
```
{
"dog" : "Hund",
"cat" : "Katze",
"rhino" : "Nashorn"
}
```

It contains pairs of data, connected by a ":". In each pair, the left entry (e.g., "dog") is the key, and the right entry (e.g., "Hund") is the value.

You can use the same structure, of key:value pairs, to store a "sheet of counts". Here is what that could look like as a Python dictionary:


```
{
"the" : 3,
"recent" : 1,
"development" : 1,
"that" : 2
}
```

This says that we have three counts of "the", one count of "recent", and so on.

In the first case, the keys were strings, and the values were also strings. In the second case, the keys were again strings, and the values were numbers. Dictionaries are flexible about this. (And the keys could be numbers, or other data types.)

Here is how to construct a Python dictionary from scratch:

In [None]:
latindict = { "dog": "canis", "cat":"felis", "rhino": "rhinoceros",
            "mouse": "mus"}

print("result of dictionary making:", latindict)

result of dictionary making: {'dog': 'canis', 'cat': 'felis', 'rhino': 'rhinoceros', 'mouse': 'mus'}


Or you can start with an empty dictionary and fill it one step at a time:

In [None]:
germandict = {}
print("Step 1 germandict is", germandict)

germandict["dog"] = "Hund"
print("Step 2 germandict is", germandict)

germandict["cat"] = "Katze"
print("Step 3 germandict is", germandict)

germandict["rhino"] = "Nashorn"
print("Step 4 germandict is", germandict)

Step 1 germandict is {}
Step 2 germandict is {'dog': 'Hund'}
Step 3 germandict is {'dog': 'Hund', 'cat': 'Katze'}
Step 4 germandict is {'dog': 'Hund', 'cat': 'Katze', 'rhino': 'Nashorn'}


Here is how you can access a dictionary. Give the key to get the corresponding value:

In [None]:
germandict["dog"]

'Hund'

In [None]:
latindict["mouse"]

'mus'

Note that when you make a new dictionary, you need to use curly brackets, as in:

```germandict = { }```

But when you access the dictionary, or when you add a single entry to a dictionary, you use straight brackets, as in:

```latindict["mouse"]```

## Comparing Python dictionaries and Python lists

Let's compare notation across dictionaries and lists.

Initializing to an empty data structure:
```
mylist = []       # empty list: straight brackets
mydict = {}    # empty dictionary: curly brackets
```

Initializing to a nonempty data structure:
```
# initializing a list: straight brackets
mylist = ["dog", "cat", "rhinoceros"]
# initializing a dictionary: curly brackets, key-colon-value
mydict = {"dog":"Hund", "cat":"Katze",  "rhinoceros":"Nashorn"}
```

Accessing items on a list: index in straight brackets. A list “maps” indices to items.

```
mylist[1]  ### will yield 'cat'
```

Accessing items on a dictionary: key in straight brackets. A dictionary maps keys to values.

```
mydict['cat'] ### will yield 'Katze'
```

The standard way to add an item to a list is via ```append()```:

In [None]:
mylist = ["dog", "cat", "rhinoceros"]
mylist.append("armadillo")
print("changed list", mylist)

changed list ['dog', 'cat', 'rhinoceros', 'armadillo']


The standard way to modify a dictionary is to store a value under a key. If you had a key/value pair before and change the value, the previous value is gone.

In [None]:
mydict = {"dog":"Hund", "cat":"Katze"}
print("original dictionary", mydict)

# adding an item
# (the German word for 'armadillo' means literally belt-animal)
mydict["armadillo"] = "Guerteltier"
print("dictionary after adding an item", mydict)

# changing a value.
# the previous value is gone.
mydict["cat"] = "felis"
print("dictionary after changing the value", mydict)

original dictionary {'dog': 'Hund', 'cat': 'Katze'}
dictionary after adding an item {'dog': 'Hund', 'cat': 'Katze', 'armadillo': 'Guerteltier'}
dictionary after changing the value {'dog': 'Hund', 'cat': 'felis', 'armadillo': 'Guerteltier'}


In a list, you get an error when you try to access a list index that isn't there. In the same way, in a dictionary, you get an error when you try to access a key that isn't there.

In [None]:
mylist = ["dog", "cat", "rhinoceros"]
# remove the hash in the beginning of the 'print' line
# that is, "uncomment" the 'print' line,
# to get an IndexError.
# print("this will get you an error", mylist[10])

In [None]:
mydict = {'dog':'Hund', 'rhinoceros':'Nashorn'}
# remove the hash in the beginning of the 'print' line
# that is, "uncomment" the 'print' line,
# to get a KeyError.
# print('this will get you an error', mydict['cat'])
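If you would rather see both errors without your notebook cell stopping, here is a small sketch that catches them. It assumes you are comfortable with `try`/`except`, which this notebook has not introduced; the variable names `list_error` and `dict_error` are made up for this illustration.

```python
mylist = ["dog", "cat", "rhinoceros"]
mydict = {"dog": "Hund", "rhinoceros": "Nashorn"}

# accessing a list index that isn't there
try:
    print(mylist[10])
except IndexError as err:
    list_error = type(err).__name__
    print("list lookup failed with", list_error)

# accessing a dictionary key that isn't there
try:
    print(mydict["cat"])
except KeyError as err:
    dict_error = type(err).__name__
    print("dictionary lookup failed with", dict_error)
```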

## Dictionary keys and dictionary values

**What can be a dictionary key?**

Strings can be dictionary keys:
```mydict = {"dog":"Hund", "rhinoceros":"Nashorn"}```

Integers can be dictionary keys, for example:

```prime_nums = {2:1, 3:2, 5:3, 7:4, 11:5}```


Not everything can be a dictionary key. Lists, for example, cannot. That is because lists are *mutable*, that is, you can change a list in place. That's not good for a key: It's like a key made of playdough that you can reshape -- but after you have reshaped it, it's not going to fit into the lock anymore. So Python disallows it: trying to use a list as a key raises a `TypeError`.
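Here is a minimal sketch of what happens when you try a list as a key. The dictionary `lookup` and the flag `list_key_worked` are hypothetical names used only for this illustration, and the `try`/`except` is there just so the cell keeps running after the error.

```python
animals = ["dog", "cat"]
lookup = {}

try:
    lookup[animals] = "pets"      # a list as a key: Python refuses
    list_key_worked = True
except TypeError as err:
    list_key_worked = False
    print("Python refused the list key:", err)

# strings are immutable, so a string built from the list is fine as a key
lookup[" ".join(animals)] = "pets"
print(lookup)
```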

**What can be a dictionary value?**

Any data type can be a dictionary value. Even a dictionary can be a dictionary value.
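As an illustration of a dictionary as a dictionary value, here is a sketch of a made-up mini-lexicon in which each English word maps to a small dictionary of translations. The name `lexicon` and its layout are assumptions for this example, not something the rest of the notebook relies on.

```python
# hypothetical mini-lexicon: the values are themselves dictionaries
lexicon = {
    "dog": {"german": "Hund", "latin": "canis"},
    "cat": {"german": "Katze", "latin": "felis"},
}

# the value stored under "cat" is a whole dictionary
cat_entry = lexicon["cat"]
print(cat_entry)

# to reach an inner value, give the outer key, then the inner key
print(lexicon["dog"]["latin"])
```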

## Checking whether a key is present

You can use ```in``` to check whether a key is in a dictionary:

In [None]:
mydict = {"dog":"Hund", "cat":"Katze", "armadillo":"Guerteltier"}
print("is 'mouse' present?", 'mouse' in mydict)
print("is 'armadillo' present?", "armadillo" in mydict)

is 'mouse' present? False
is 'armadillo' present? True


The Boolean expression with ```in``` checks keys; it does not check values:

In [None]:
mydict = {"dog":"Hund", "cat":"Katze", "armadillo":"Guerteltier"}
'Katze' in mydict

False

If you try to access a key that isn't there, you get a `KeyError`. Uncomment the following piece of code to see one.

In [None]:
# mydict = {"dog":"Hund", "cat":"Katze", "armadillo":"Guerteltier"}
# mydict["dormouse"]

Having to constantly check whether a key is in the dictionary can be annoying, especially when you populate a large dictionary in a loop. There is a Python data structure that fixes the problem, a `defaultdict` from the package `collections` (https://docs.python.org/3/library/collections.html). We'll see it in use below.
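For a single lookup, an ordinary dictionary also has a built-in method `get()` that returns a fallback value instead of raising an error. The sketch below shows it on the translation dictionary from above; the fallback string `"???"` is just a placeholder chosen for this example.

```python
mydict = {"dog": "Hund", "cat": "Katze", "armadillo": "Guerteltier"}

known = mydict.get("cat")                # key is present: its value
unknown = mydict.get("dormouse")         # key is missing: None, no error
fallback = mydict.get("dormouse", "???") # key is missing: the fallback we chose

print(known, unknown, fallback)
```

Note that `get()` only reads from the dictionary. It does not add the missing key.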

**Try it for yourself**

* Say we have the following dictionary:

In [None]:
dict1 = { "dog" : "Hund", "armadillo" : "Guerteltier"}

Please add to this dictionary the translation of platypus, which is Schnabeltier (literally, beak animal).

* Say we have the following dictionary:

In [None]:
dict2 = { "the" : 1, "a" : 2}

Please change the entry for "the" to be two instead of one.

Here is a piece of code that gives you an example of how to do the next "Try it for yourself" below. It iterates over a list of function words, using each of them as a key to retrieve the matching value from the dictionary.

In [None]:
mylist = [ "the", "and", "a"]
dictionary_of_counts = { "house" : 2, "armadillo" : 1,
                         "the" : 21, "recent" : 2,
                         "said": 3, "a" : 15,
                         "went": 2, "and": 3,
                         "yellow": 1}
print("function word counts")
for word in mylist:
    print(word, dictionary_of_counts[word])

function word counts
the 21
and 3
a 15


**Try it for yourself**

Here is a mini German/English dictionary:

In [None]:
mydict = {"befreit":"liberated", "baeche":"brooks",
          "eise":"ice", "sind":"are", "strom":"river",
          "und":"and", "vom":"from"}

Can you use this dictionary to do a bad translation of the following German sentence? (Hint: this should look a lot like the use case above
where we iterated over a list and printed out dictionary values
for items on the list.)

In [None]:
mysent = "vom eise befreit sind strom und baeche"




*Warning:* Solution below, don't read on if you want to solve the "bad translation" problem for yourself.

...

...

...

...

...

...

...

...









Here is a solution. (Note that this is not how you want your machine translation to work! The translations that you get this way are terrible.)

In [None]:
mydict = {"befreit":"liberated", "baeche":"brooks", "eise":"ice", "sind":"are", "strom":"river", "und":"and", "vom":"from"}
mysent = "vom eise befreit sind strom und baeche"
for german_word in mysent.split():
    print( mydict[ german_word], end = " ")
print()

from ice liberated are river and brooks 


Adding the parameter ```end = " "``` puts a space instead of a linebreak at the end of what is printed. That way, multiple "print" outputs land on the same line.
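If you want to keep the translated sentence around instead of only printing it, one possible variant collects the translated words in a list and then glues them together with the string method `join()`. This is a sketch of an alternative, not a replacement for the solution above; the names `translated_words` and `translation` are made up here.

```python
mydict = {"befreit": "liberated", "baeche": "brooks", "eise": "ice",
          "sind": "are", "strom": "river", "und": "and", "vom": "from"}
mysent = "vom eise befreit sind strom und baeche"

# accumulate the translated words in a list
translated_words = []
for german_word in mysent.split():
    translated_words.append(mydict[german_word])

# join the words into one string, separated by single spaces
translation = " ".join(translated_words)
print(translation)
```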

# A dictionary as a collection of variables/containers

In a way, you can view a dictionary as a collection of containers, each of which you address by the key.

First, let's see what happens when we have individual variables. Here's a variable storing the translation of "platypus".

In [None]:
platypus_translation = "Schnabeltier"
platypus_translation

'Schnabeltier'

And of "cat":

In [None]:
cat_translation = "Katze"
cat_translation

'Katze'

We can make more of these, but we need to know, ahead of time, how many translations we want to store. In a dictionary, we can always store as many as we need:

In [None]:
mydict = {"cat":"Katze", "platypus":"Schnabeltier"}
mydict["platypus"]

'Schnabeltier'

Adding an entry:

In [None]:
mydict["armadillo"] = "Guerteltier"
mydict

{'cat': 'Katze', 'platypus': 'Schnabeltier', 'armadillo': 'Guerteltier'}

Like with individual variables, you can update the value that goes with a key. Here is an example. Say you want to count occurrences of words, and you've seen one more "the". Then you record it like this:

In [None]:
mydict = {"the":1, "and": 1, "of": 1}
mydict["the"] = mydict["the"] + 1
mydict

{'the': 2, 'and': 1, 'of': 1}

Compare this to how you would change the contents of an individual variable/container:

In [None]:
counter = 0
mylist = ["a", "b", 'a']
for item in mylist:
    if item == "a":
        counter = counter + 1

counter

2

## Counting words in a text

We can use this idea of a dictionary as a collection of variables/containers to count occurrences of words in a text.

First, as a reminder, here is how you can count occurrences of just one word (here: "to") in a text:

In [None]:
# paragraph from the Onion, March 04
paragraph = """While dieters are accustomed to exercises of will,
a new English translation of Germany's most popular diet book
takes the concept to a new philosophical level.
The Nietzschean diet, which commands its adherents to eat
superhuman amounts of whatever they most fear,
is developing a strong following in America."""

count_to = 0
for word in paragraph.split():
    if word == "to":
        count_to = count_to + 1

print( count_to )

3


Now suppose we want to count occurrences of all words at the same time.
Then we can use a Python dictionary as a collection of containers, one for each word. The words are the keys, and their counts are the values. Every time we encounter a word, we add one to its value in the dictionary.

In [None]:
# paragraph from the Onion, March 04
paragraph = """While dieters are accustomed to exercises of
will, a new English translation of Germany's most popular
diet book takes the concept to a new philosophical level.
The Nietzschean diet, which commands its adherents to eat
superhuman amounts of whatever they most fear,
is developing a strong following in America."""

counts = { }

for word in paragraph.split():
    if  word not in counts:
        counts[word] = 1
    else:
        counts[ word ] = counts[ word ] + 1

print( counts )

{'While': 1, 'dieters': 1, 'are': 1, 'accustomed': 1, 'to': 3, 'exercises': 1, 'of': 3, 'will,': 1, 'a': 3, 'new': 2, 'English': 1, 'translation': 1, "Germany's": 1, 'most': 2, 'popular': 1, 'diet': 1, 'book': 1, 'takes': 1, 'the': 1, 'concept': 1, 'philosophical': 1, 'level.': 1, 'The': 1, 'Nietzschean': 1, 'diet,': 1, 'which': 1, 'commands': 1, 'its': 1, 'adherents': 1, 'eat': 1, 'superhuman': 1, 'amounts': 1, 'whatever': 1, 'they': 1, 'fear,': 1, 'is': 1, 'developing': 1, 'strong': 1, 'following': 1, 'in': 1, 'America.': 1}


The condition ```if word not in counts``` is true if the content of the variable word is not a key in the dictionary counts.

Note that this is a variant of the "accumulation" code pattern that you have seen before. We initialize counts to an empty dictionary. Then we iterate over the words in the paragraph, adding numbers to the dictionary as we go along. The first time we encounter a word, we initialize its count to one. We know we encounter it for the first time because there is no dictionary key for it yet.

### Using defaultdict

Above I've mentioned that when you iteratively populate a dictionary in a loop, it can be annoying to have to check all the time if a key is in the dictionary. You can see this in the code above:

```
if word not in counts:
    counts[word] = 1
else:
    counts[word] = counts[word] + 1
```

That is verbose, and hard to read. Here's where the `defaultdict` from the package `collections` comes in. Looking up a key in a defaultdict with straight brackets never gives you a key error. Instead, when you try to access a key that doesn't exist yet, it adds that key to the dictionary then and there, with a default value. When you have an integer defaultdict, the value that gets added for a new key is zero. (When you have a list defaultdict, the value that gets added for a new key is an empty list.)

Here's how that makes our life easier:

In [None]:
from collections import defaultdict

# paragraph from the Onion, March 04
paragraph = """While dieters are accustomed to exercises of
will, a new English translation of Germany's most popular
diet book takes the concept to a new philosophical level.
The Nietzschean diet, which commands its adherents to eat
superhuman amounts of whatever they most fear,
is developing a strong following in America."""

counts = defaultdict(int)

for word in paragraph.split():
    counts[ word ] = counts[ word ] + 1

print( counts )

defaultdict(<class 'int'>, {'While': 1, 'dieters': 1, 'are': 1, 'accustomed': 1, 'to': 3, 'exercises': 1, 'of': 3, 'will,': 1, 'a': 3, 'new': 2, 'English': 1, 'translation': 1, "Germany's": 1, 'most': 2, 'popular': 1, 'diet': 1, 'book': 1, 'takes': 1, 'the': 1, 'concept': 1, 'philosophical': 1, 'level.': 1, 'The': 1, 'Nietzschean': 1, 'diet,': 1, 'which': 1, 'commands': 1, 'its': 1, 'adherents': 1, 'eat': 1, 'superhuman': 1, 'amounts': 1, 'whatever': 1, 'they': 1, 'fear,': 1, 'is': 1, 'developing': 1, 'strong': 1, 'following': 1, 'in': 1, 'America.': 1})


Why does this work? Well, say you encounter a new word that's not in the dictionary yet, "dieters". Then `counts["dieters"]` accesses the key "dieters" that doesn't exist yet. So that key gets added to the dictionary, with a value of zero. Then when you add one to that, you get a count of one -- exactly what you want when you've encountered the word for the first time.
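You can watch this happen in a small sketch. The first part shows that merely looking up a missing key creates it; the second part uses a list defaultdict to group words by their first letter. The word list and the name `words_by_letter` are made up for this illustration.

```python
from collections import defaultdict

counts = defaultdict(int)
print("dieters" in counts)          # the key does not exist yet
value = counts["dieters"]           # looking it up creates it, with value zero
print(value, "dieters" in counts)

# a list defaultdict: each new key starts out with an empty list
words_by_letter = defaultdict(list)
for word in ["dog", "cat", "dormouse", "cow"]:
    words_by_letter[word[0]].append(word)

print(dict(words_by_letter))
```

One consequence to keep in mind: because lookups add keys, checking `counts["someword"]` just to peek at a count will grow the dictionary. Use `in` if you only want to test for presence.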


**Try it for yourself**:
* In the code above, each word is counted "as is", which means that "the" and "The" are counted separately. Modify the code such that it lowercases each word before counting.

# Counting words using NLTK

Word counting is a task that we often need to do when we analyze texts. It is surprising how many different analyses start with this step! And because this is such a frequent task, the Natural Language Toolkit has a specialized type of dictionary just for counting (of words, or of other items).

This is a trick you will see often with Python packages: They define specialized data types that come with their own methods.

In [None]:
import nltk
# making sure we can split words
# (newer NLTK versions may additionally need nltk.download("punkt_tab"))
nltk.download("punkt")

# a poem from Alice in Wonderland
data = """"You are old, Father William," the young man said,
    "And your hair has become very white;
And yet you incessantly stand on your head—
    Do you think, at your age, it is right?"

"In my youth," Father William replied to his son,
    "I feared it might injure the brain;
But now that I'm perfectly sure I have none,
    Why, I do it again and again."

"You are old," said the youth, "as I mentioned before,
    And have grown most uncommonly fat;
Yet you turned a back-somersault in at the door—
    Pray, what is the reason of that?"

"In my youth," said the sage, as he shook his grey locks,
    "I kept all my limbs very supple
By the use of this ointment—one shilling the box—
    Allow me to sell you a couple."

"You are old," said the youth, "and your jaws are too weak
    For anything tougher than suet;
Yet you finished the goose, with the bones and the beak—
    Pray, how did you manage to do it?"

"In my youth," said his father, "I took to the law,
    And argued each case with my wife;
And the muscular strength, which it gave to my jaw,
    Has lasted the rest of my life."

"You are old," said the youth, "one would hardly suppose
    That your eye was as steady as ever;
Yet you balanced an eel on the end of your nose—
    What made you so awfully clever?"

"I have answered three questions, and that is enough,"
    Said his father; "don't give yourself airs!
Do you think I can listen all day to such stuff?
    Be off, or I'll kick you down stairs!"""

# we use the Natural Language Toolkit to split this poem into words
# in a way that also splits off punctuation
words = nltk.word_tokenize(data)
print("The words are", words)

The words are ['``', 'You', 'are', 'old', ',', 'Father', 'William', ',', "''", 'the', 'young', 'man', 'said', ',', '``', 'And', 'your', 'hair', 'has', 'become', 'very', 'white', ';', 'And', 'yet', 'you', 'incessantly', 'stand', 'on', 'your', 'head—', 'Do', 'you', 'think', ',', 'at', 'your', 'age', ',', 'it', 'is', 'right', '?', "''", '``', 'In', 'my', 'youth', ',', "''", 'Father', 'William', 'replied', 'to', 'his', 'son', ',', '``', 'I', 'feared', 'it', 'might', 'injure', 'the', 'brain', ';', 'But', 'now', 'that', 'I', "'m", 'perfectly', 'sure', 'I', 'have', 'none', ',', 'Why', ',', 'I', 'do', 'it', 'again', 'and', 'again', '.', "''", '``', 'You', 'are', 'old', ',', "''", 'said', 'the', 'youth', ',', '``', 'as', 'I', 'mentioned', 'before', ',', 'And', 'have', 'grown', 'most', 'uncommonly', 'fat', ';', 'Yet', 'you', 'turned', 'a', 'back-somersault', 'in', 'at', 'the', 'door—', 'Pray', ',', 'what', 'is', 'the', 'reason', 'of', 'that', '?', "''", '``', 'In', 'my', 'youth', ',', "''", 'sai

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Here is the item counting dictionary
# You initialize it with the list of items to be counted
fd = nltk.FreqDist(words)
# When you inspect this, you see
# the words with the highest counts first
fd

FreqDist({',': 30, 'the': 17, '``': 16, "''": 15, 'you': 10, 'I': 10, ';': 7, 'my': 7, 'said': 6, 'your': 6, ...})

In [None]:
# the 10 most common words and their counts
fd.most_common(10)

[(',', 30),
 ('the', 17),
 ('``', 16),
 ("''", 15),
 ('you', 10),
 ('I', 10),
 (';', 7),
 ('my', 7),
 ('said', 6),
 ('your', 6)]

We can also get the most frequent words in a tabular format:

In [None]:
fd.tabulate(10)

   ,  the   ``   ''  you    I    ;   my said your 
  30   17   16   15   10   10    7    7    6    6 


In [None]:
# the counts for a particular word:
# ask this like you would a dictionary
fd["youth"]

6
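If you don't have NLTK at hand, Python's standard library has a counting dictionary with a similar feel, `collections.Counter`. The sketch below uses a made-up toy sentence rather than the poem, and simple `split()` rather than NLTK's tokenizer, so it is only meant to show the interface.

```python
from collections import Counter

# a toy sentence, split on whitespace
words = "the cat saw the dog and the dog saw the cat".split()

# like FreqDist, a Counter is initialized with the items to be counted
word_counter = Counter(words)

# the single most common word and its count
top_one = word_counter.most_common(1)
print(top_one)

# counts for particular words: ask like you would a dictionary
print(word_counter["dog"])
# a word that never occurred has count zero, rather than raising a KeyError
print(word_counter["platypus"])
```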

**Try it for yourself**:

Get a short text passage from some webpage, and store it as a Python string. Split it into words using either ```split()``` or ```nltk.word_tokenize()```. Then make a new ```nltk.FreqDist``` and use it to count words in the passage.

* What are the 5 most frequent words in the passage, and what are their counts?

* Is the word "and" in the passage? If so, what is its count?

# All keys, all values, all pairs


You can retrieve all the keys of a dictionary:

In [None]:
mydict = {"dog":"Hund", "cat":"Katze", "armadillo":"Guerteltier"}
mydict.keys()

dict_keys(['dog', 'cat', 'armadillo'])

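What `mydict.keys()` gives you is a view of the keys. It is an *iterable*, that is, something that you can iterate over with a for-loop. The same works for any dictionary-like object, such as the `FreqDist` object `fd` from above: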

In [None]:
# printing counts for all words that
# are actual words,
# not punctuation
for word in fd.keys():
    if word.isalpha():
        print(word, fd[word], end= ", ")

You 4, are 5, old 4, Father 2, William 2, the 17, young 1, man 1, said 6, And 5, your 6, hair 1, has 1, become 1, very 2, white 1, yet 1, you 10, incessantly 1, stand 1, on 2, Do 2, think 2, at 2, age 1, it 5, is 3, right 1, In 3, my 7, youth 6, replied 1, to 6, his 4, son 1, I 10, feared 1, might 1, injure 1, brain 1, But 1, now 1, that 3, perfectly 1, sure 1, have 3, none 1, Why 1, do 3, again 2, and 4, as 4, mentioned 1, before 1, grown 1, most 1, uncommonly 1, fat 1, Yet 3, turned 1, a 2, in 1, Pray 2, what 1, reason 1, of 4, sage 1, he 1, shook 1, grey 1, locks 1, kept 1, all 2, limbs 1, supple 1, By 1, use 1, this 1, shilling 1, Allow 1, me 1, sell 1, couple 1, jaws 1, too 1, weak 1, For 1, anything 1, tougher 1, than 1, suet 1, finished 1, goose 1, with 2, bones 1, how 1, did 1, manage 1, father 2, took 1, law 1, argued 1, each 1, case 1, wife 1, muscular 1, strength 1, which 1, gave 1, jaw 1, Has 1, lasted 1, rest 1, life 1, one 1, would 1, hardly 1, suppose 1, That 1, eye 1, w
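Two details about `keys()` are worth a quick sketch. First, the object it returns stays connected to its dictionary, so keys added later show up in it. Second, iterating over the dictionary itself also walks through the keys. The names `keys_view`, `keys_now` and `keys_direct` are made up for this example.

```python
mydict = {"dog": "Hund", "cat": "Katze"}

keys_view = mydict.keys()
mydict["armadillo"] = "Guerteltier"   # added after we asked for the keys

# turning the keys into a list shows that the new key is included
keys_now = list(keys_view)
print(keys_now)

# a for-loop over the dictionary itself iterates over its keys
keys_direct = []
for key in mydict:
    keys_direct.append(key)
print(keys_direct)
```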

You can also get all the values in a dictionary:

In [None]:
mydict.values()

dict_values(['Hund', 'Katze', 'Guerteltier'])

In [None]:
for v in mydict.values():
    print(v)

Hund
Katze
Guerteltier


In [None]:
# summing up all the values in the
# FreqDist object gives the number of
# tokens in the original poem
print("summed counts in the FreqDist:", sum(fd.values()))
print("number of words in the poem:", len(nltk.word_tokenize(data)))
print("summed counts in the FreqDist, version 2:", fd.N())

summed counts in the FreqDist: 354
number of words in the poem: 354
summed counts in the FreqDist, version 2: 354


**Try it for yourself.**

* Above, you made a ```nltk.FreqDist``` object counting words in a passage of your choosing. Now iterate through the keys in that dictionary in order to print counts *only for the uppercase words* in that passage.

* Here is a small English/German dictionary as a Python dictionary. Iterate through the values in that dictionary and print only the German words with a length greater than or equal to 7 characters.

In [None]:
translationdict = {"dog":"Hund", "cat": "Katze",
                   "dormouse":"Siebenschlaefer",
                   "praying mantis":"Gottesanbeterin",
                  "gopher" : "Taschenratte"}
# put code here...




You can also get access to all key/value pairs in a dictionary, using the method ```items()```:

In [None]:
mydict.items()

dict_items([('dog', 'Hund'), ('cat', 'Katze'), ('armadillo', 'Guerteltier')])

The items (key/value pairs) have a shape like this:

```('dog', 'Hund')```

This looks almost like a list, but with round brackets rather than straight ones, and in many ways you can treat it like a list. In particular, you can access the first part of this pair (the key) with index 0, and the second part of the pair (the value) with index 1.

This data structure is called a *tuple*. It behaves like a list, except that it is immutable, like a string: you cannot ```append()``` to it, and you cannot replace individual items in a tuple. So, unlike a list, a tuple can be used as a dictionary key (as long as the items in it are themselves immutable)!


In [None]:
firstpair = ("dog", "Hund")
firstkey = firstpair[0]
firstvalue = firstpair[1]
print("the first key is", firstkey, "and the first value is", firstvalue)

the first key is dog and the first value is Hund


Tuples don't have to be length 2. Here is a longer one:

In [None]:
longtuple = ("a", "b", "c", "d")
longtuple[2]

'c'
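Here is a short sketch of both properties of tuples: that you cannot replace an item, and that a tuple can serve as a dictionary key. The dictionary `bigram_counts` and the numbers in it are invented for this illustration; the `try`/`except` is only there so the cell keeps running after the error.

```python
pair = ("dog", "Hund")

# trying to replace an item in a tuple raises a TypeError
try:
    pair[1] = "Katze"
    item_replaced = True
except TypeError as err:
    item_replaced = False
    print("TypeError:", err)

# tuples as dictionary keys: made-up counts of word pairs
bigram_counts = {("the", "youth"): 3, ("my", "youth"): 3}
bigram_counts[("the", "youth")] = bigram_counts[("the", "youth")] + 1
print(bigram_counts[("the", "youth")])
```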

You can iterate over the keys of a dictionary, the values of a dictionary, and the key/value pairs (items). Here is how to do the latter:

In [None]:
for keyvalue in mydict.items():
    english = keyvalue[0]
    german = keyvalue[1]
    print('English', english, "translates to German", german)

English dog translates to German Hund
English cat translates to German Katze
English armadillo translates to German Guerteltier


You can take a tuple or a list apart by assigning multiple variables to it at once:

In [None]:
firstpair = ("dog", "Hund")
englishword, germanword = firstpair
print("We have assigned", englishword, "to 'englishword' and",
     germanword, "to 'germanword'")

We have assigned dog to 'englishword' and Hund to 'germanword'


So you can fill two containers (variables) at the same time by putting them on the left-hand side of the assignment =. That only works if on the right-hand side you have a list or tuple of length exactly two.

(You can also assign three/four/... variables at the same time if on the right-hand side you have a list or tuple of length exactly three/four/...)

In [None]:
var1, var2, var3 = (1,2,3)
print(var2)

2


Don't miscount, or you get an error message, in particular a ValueError.

In [None]:
# Uncomment (remove the hash from) the 'var1, var2' line
# to get a ValueError with a message like
# "too many values to unpack (expected 2)"
# var1, var2= (1,2,3)
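Miscounting can go wrong in either direction. The sketch below triggers both cases and catches the errors so that the cell keeps running; it assumes familiarity with `try`/`except`, and the flag names are made up. The exact wording of the error messages can differ between Python versions, so the code just prints whatever message it gets.

```python
# too few variables for the values on the right-hand side
try:
    var1, var2 = (1, 2, 3)
    too_many_failed = False
except ValueError as err:
    too_many_failed = True
    print("ValueError:", err)

# too many variables for the values on the right-hand side
try:
    var1, var2, var3 = (1, 2)
    too_few_failed = False
except ValueError as err:
    too_few_failed = True
    print("ValueError:", err)
```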

Usually, an assignment has only a single variable on the left-hand side of the "=". But if we know that the right-hand side of the "=" has exactly two components, we can put two variables on the left-hand side. The command `englishword, germanword = firstpair` above takes the tuple ('dog', 'Hund') apart into two items and assigns the first to the variable englishword and the second to the variable germanword.

We can combine this with a for-loop when we iterate over key/value pairs:

In [None]:
for english, german in mydict.items():
    print(german, "is German for", english)

Hund is German for dog
Katze is German for cat
Guerteltier is German for armadillo


The central line here is:
```for english, german in mydict.items():```

This is the same idea as above -- we know that any member of mydict.items() consists of two parts (a key and a value), so we can assign it to two variables at once.
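Iterating over key/value pairs is handy whenever you need both parts at once. As one possible use, here is a sketch that turns the English-to-German dictionary around into a German-to-English one. The name `german_to_english` is made up, and the approach assumes that no two English words share the same German translation (otherwise later pairs would overwrite earlier ones).

```python
mydict = {"dog": "Hund", "cat": "Katze", "armadillo": "Guerteltier"}

# build a new dictionary with keys and values swapped
german_to_english = {}
for english, german in mydict.items():
    german_to_english[german] = english

print(german_to_english)
print(german_to_english["Katze"])
```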



**Try it for yourself**:

* Iterate through the key/value pairs in the ```nltk.FreqDist``` dictionary you made above from a passage you chose. For all words that consist solely of punctuation symbols, print out the words and counts.  

* You can also iterate through the key/value pairs that you get from ``fd.most_common(20)``. For all words that don't consist solely of punctuation symbols, print the word and its count.


A simple way to check for punctuation is to say  `not word.isalpha()` to check if `word` contains non-letter characters. But this will also get you words like "say..." since that contains non-letter characters. Here is a trick to check whether a word consists entirely of punctuation symbols:

In [None]:
import string
print("Here is a string of all punctuation symbols that Python is aware of:", string.punctuation)

mystring = "??!??"
if mystring.strip(string.punctuation) == "":
    print("this string consisted entirely of punctuation symbols.")


Here is a string of all punctuation symbols that Python is aware of: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
this string consisted entirely of punctuation symbols.
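To compare the two checks side by side, here is a sketch that wraps the `strip()` trick in a small helper function. The function name `is_all_punctuation` is made up for this example, and the extra test for the empty string is an added assumption (an empty string would otherwise also count as "entirely punctuation").

```python
import string

def is_all_punctuation(word):
    # hypothetical helper: True if word is non-empty and nothing is left
    # after stripping punctuation symbols from both ends
    return word != "" and word.strip(string.punctuation) == ""

for word in ["??!??", "say...", "``", "youth"]:
    print(word,
          "| all punctuation:", is_all_punctuation(word),
          "| not isalpha:", not word.isalpha())
```

Keep in mind that `string.punctuation` only covers ASCII symbols, so typographic characters such as curly quotes or long dashes are not recognized by this check.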


* Using again the translation dictionary from above with animal names in English and German, iterate through key/value pairs, and for pairs where the German word is at least 7 letters long, print both the German word and its English translation.

In [None]:
translationdict = {"dog":"Hund", "cat": "Katze",
                   "dormouse":"Siebenschlaefer",
                   "praying mantis":"Gottesanbeterin",
                  "gopher" : "Taschenratte"}
# put your code here.


# Dictionaries as attribute-value matrices

The Universal Dependencies data represents each token (word) as an attribute-value matrix, stored as a dictionary. Here is the first word of the 10th sentence of the UD_English-GUM corpus:

In [None]:
firstword = {'id': 1,
  'form': 'Thus',
  'lemma': 'thus',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 16,
  'deprel': 'advmod',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}}

This is the following attribute-value matrix (AVM):

$$
\left[\begin{array}{ll}
\text{id:} & 1\\
\text{form:} & \text{'Thus'}\\
\text{lemma:} & \text{'thus'}\\
\text{upos:} & \text{'ADV'}\\
\text{xpos:} & \text{'RB'}\\
\text{feats:} & \text{None}\\
\text{head:} & 16\\
\text{deprel:} & \text{'advmod'}\\
\text{deps:} & \text{None}\\
\text{misc:} & \left[\begin{array}{ll}
\text{SpaceAfter:} & \text{'No'}
\end{array}\right]
\end{array}\right]
$$

You can access an entry in this attribute-value matrix through its dictionary key:

In [None]:
firstword["lemma"]

'thus'

One of the values in the AVM is itself an AVM. To access the value that tells you whether there is a space after the word, you need to specify the whole path of keys. `firstword["misc"]` accesses a dictionary, namely `{'SpaceAfter': 'No'}`, which again has keys, in particular `SpaceAfter`:

In [None]:
firstword["misc"]["SpaceAfter"]

'No'
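A path of keys only works if every step along the way is a dictionary. In this data, `misc` is not always a dictionary: for many tokens it is `None`, and then `token["misc"]["SpaceAfter"]` would fail with a `TypeError`. Here is a sketch of a more careful lookup, using two trimmed-down tokens (only the `form` and `misc` entries are kept); the fallback text `"not specified"` and the list `results` are made up for this example.

```python
# two trimmed-down tokens: one with a misc dictionary, one with misc set to None
tokens = [
    {"form": "Thus", "misc": {"SpaceAfter": "No"}},
    {"form": ",", "misc": None},
]

results = []
for token in tokens:
    misc = token["misc"]
    # only follow the path further if misc is a dictionary with the key we want
    if misc is not None and "SpaceAfter" in misc:
        space_after = misc["SpaceAfter"]
    else:
        space_after = "not specified"
    results.append((token["form"], space_after))
    print(token["form"], space_after)
```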

The Universal Dependencies representation of a whole sentence is a list of tokens, that is, a list of dictionaries (=AVMs):

In [None]:
sentence10 = [{'id': 1,
  'form': 'Thus',
  'lemma': 'thus',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 16,
  'deprel': 'advmod',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 2,
  'form': ',',
  'lemma': ',',
  'upos': 'PUNCT',
  'xpos': ',',
  'feats': None,
  'head': 1,
  'deprel': 'punct',
  'deps': None,
  'misc': None},
 {'id': 3,
  'form': 'the',
  'lemma': 'the',
  'upos': 'DET',
  'xpos': 'DT',
  'feats': {'Definite': 'Def', 'PronType': 'Art'},
  'head': 4,
  'deprel': 'det',
  'deps': None,
  'misc': None},
 {'id': 4,
  'form': 'time',
  'lemma': 'time',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 16,
  'deprel': 'nsubj',
  'deps': None,
  'misc': None},
 {'id': 5,
  'form': 'it',
  'lemma': 'it',
  'upos': 'PRON',
  'xpos': 'PRP',
  'feats': {'Case': 'Nom',
   'Gender': 'Neut',
   'Number': 'Sing',
   'Person': '3',
   'PronType': 'Prs'},
  'head': 6,
  'deprel': 'nsubj',
  'deps': None,
  'misc': None},
 {'id': 6,
  'form': 'takes',
  'lemma': 'take',
  'upos': 'VERB',
  'xpos': 'VBZ',
  'feats': {'Mood': 'Ind',
   'Number': 'Sing',
   'Person': '3',
   'Tense': 'Pres',
   'VerbForm': 'Fin'},
  'head': 4,
  'deprel': 'acl:relcl',
  'deps': None,
  'misc': None},
 {'id': 7,
  'form': 'and',
  'lemma': 'and',
  'upos': 'CCONJ',
  'xpos': 'CC',
  'feats': None,
  'head': 9,
  'deprel': 'cc',
  'deps': None,
  'misc': None},
 {'id': 8,
  'form': 'the',
  'lemma': 'the',
  'upos': 'DET',
  'xpos': 'DT',
  'feats': {'Definite': 'Def', 'PronType': 'Art'},
  'head': 9,
  'deprel': 'det',
  'deps': None,
  'misc': None},
 {'id': 9,
  'form': 'ways',
  'lemma': 'way',
  'upos': 'NOUN',
  'xpos': 'NNS',
  'feats': {'Number': 'Plur'},
  'head': 4,
  'deprel': 'conj',
  'deps': None,
  'misc': None},
 {'id': 10,
  'form': 'of',
  'lemma': 'of',
  'upos': 'SCONJ',
  'xpos': 'IN',
  'feats': None,
  'head': 12,
  'deprel': 'mark',
  'deps': None,
  'misc': None},
 {'id': 11,
  'form': 'visually',
  'lemma': 'visually',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 12,
  'deprel': 'advmod',
  'deps': None,
  'misc': None},
 {'id': 12,
  'form': 'exploring',
  'lemma': 'explore',
  'upos': 'VERB',
  'xpos': 'VBG',
  'feats': {'VerbForm': 'Ger'},
  'head': 9,
  'deprel': 'acl',
  'deps': None,
  'misc': None},
 {'id': 13,
  'form': 'an',
  'lemma': 'a',
  'upos': 'DET',
  'xpos': 'DT',
  'feats': {'Definite': 'Ind', 'PronType': 'Art'},
  'head': 14,
  'deprel': 'det',
  'deps': None,
  'misc': None},
 {'id': 14,
  'form': 'artwork',
  'lemma': 'artwork',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 12,
  'deprel': 'obj',
  'deps': None,
  'misc': None},
 {'id': 15,
  'form': 'can',
  'lemma': 'can',
  'upos': 'AUX',
  'xpos': 'MD',
  'feats': {'VerbForm': 'Fin'},
  'head': 16,
  'deprel': 'aux',
  'deps': None,
  'misc': None},
 {'id': 16,
  'form': 'inform',
  'lemma': 'inform',
  'upos': 'VERB',
  'xpos': 'VB',
  'feats': {'VerbForm': 'Inf'},
  'head': 0,
  'deprel': 'root',
  'deps': None,
  'misc': None},
 {'id': 17,
  'form': 'about',
  'lemma': 'about',
  'upos': 'ADP',
  'xpos': 'IN',
  'feats': None,
  'head': 19,
  'deprel': 'case',
  'deps': None,
  'misc': None},
 {'id': 18,
  'form': 'its',
  'lemma': 'its',
  'upos': 'PRON',
  'xpos': 'PRP$',
  'feats': {'Gender': 'Neut',
   'Number': 'Sing',
   'Person': '3',
   'Poss': 'Yes',
   'PronType': 'Prs'},
  'head': 19,
  'deprel': 'nmod:poss',
  'deps': None,
  'misc': None},
 {'id': 19,
  'form': 'relevance',
  'lemma': 'relevance',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 16,
  'deprel': 'obl',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 20,
  'form': ',',
  'lemma': ',',
  'upos': 'PUNCT',
  'xpos': ',',
  'feats': None,
  'head': 21,
  'deprel': 'punct',
  'deps': None,
  'misc': None},
 {'id': 21,
  'form': 'interestingness',
  'lemma': 'interestingness',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 19,
  'deprel': 'conj',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 22,
  'form': ',',
  'lemma': ',',
  'upos': 'PUNCT',
  'xpos': ',',
  'feats': None,
  'head': 27,
  'deprel': 'punct',
  'deps': None,
  'misc': None},
 {'id': 23,
  'form': 'and',
  'lemma': 'and',
  'upos': 'CCONJ',
  'xpos': 'CC',
  'feats': None,
  'head': 27,
  'deprel': 'cc',
  'deps': None,
  'misc': None},
 {'id': 24,
  'form': 'even',
  'lemma': 'even',
  'upos': 'ADV',
  'xpos': 'RB',
  'feats': None,
  'head': 27,
  'deprel': 'advmod',
  'deps': None,
  'misc': None},
 {'id': 25,
  'form': 'its',
  'lemma': 'its',
  'upos': 'PRON',
  'xpos': 'PRP$',
  'feats': {'Gender': 'Neut',
   'Number': 'Sing',
   'Person': '3',
   'Poss': 'Yes',
   'PronType': 'Prs'},
  'head': 27,
  'deprel': 'nmod:poss',
  'deps': None,
  'misc': None},
 {'id': 26,
  'form': 'aesthetic',
  'lemma': 'aesthetic',
  'upos': 'ADJ',
  'xpos': 'JJ',
  'feats': {'Degree': 'Pos'},
  'head': 27,
  'deprel': 'amod',
  'deps': None,
  'misc': None},
 {'id': 27,
  'form': 'appeal',
  'lemma': 'appeal',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': {'Number': 'Sing'},
  'head': 19,
  'deprel': 'conj',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 28,
  'form': '.',
  'lemma': '.',
  'upos': 'PUNCT',
  'xpos': '.',
  'feats': None,
  'head': 16,
  'deprel': 'punct',
  'deps': None,
  'misc': None}]

In [None]:
# now we can iterate through the AVMs for this sentence, and
# print information for each one
for token in sentence10:
    print(token["id"], token["form"], token["upos"],
          token["head"], token["deprel"])

1 Thus ADV 16 advmod
2 , PUNCT 1 punct
3 the DET 4 det
4 time NOUN 16 nsubj
5 it PRON 6 nsubj
6 takes VERB 4 acl:relcl
7 and CCONJ 9 cc
8 the DET 9 det
9 ways NOUN 4 conj
10 of SCONJ 12 mark
11 visually ADV 12 advmod
12 exploring VERB 9 acl
13 an DET 14 det
14 artwork NOUN 12 obj
15 can AUX 16 aux
16 inform VERB 0 root
17 about ADP 19 case
18 its PRON 19 nmod:poss
19 relevance NOUN 16 obl
20 , PUNCT 21 punct
21 interestingness NOUN 19 conj
22 , PUNCT 27 punct
23 and CCONJ 27 cc
24 even ADV 27 advmod
25 its PRON 27 nmod:poss
26 aesthetic ADJ 27 amod
27 appeal NOUN 19 conj
28 . PUNCT 16 punct
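Because a sentence is a list of dictionaries, you can combine it with a "sheet of counts". The sketch below counts how often each part-of-speech tag occurs. To keep it short, it uses `toy_sentence`, a shortened stand-in for `sentence10` that holds only two of the keys; with the full `sentence10` the loop would look the same.

```python
# a shortened stand-in for sentence10, with only the keys we need here
toy_sentence = [
    {"form": "the", "upos": "DET"},
    {"form": "time", "upos": "NOUN"},
    {"form": "it", "upos": "PRON"},
    {"form": "takes", "upos": "VERB"},
    {"form": "the", "upos": "DET"},
    {"form": "ways", "upos": "NOUN"},
]

# the sheet of counts: maps each tag to the number of times we saw it
upos_counts = {}
for token in toy_sentence:
    tag = token["upos"]
    # .get(tag, 0) gives 0 for a tag we have not seen yet
    upos_counts[tag] = upos_counts.get(tag, 0) + 1

print(upos_counts)
```

This prints `{'DET': 2, 'NOUN': 2, 'PRON': 1, 'VERB': 1}`.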


Now say we want to determine how often we have subject-verb-object (SVO) versus SOV versus VSO etc. in a Universal Dependencies corpus. To do that, we would like to have an AVM for a word that includes all its dependents. For the verb "inform" in the sentence above, we would like the AVM to list that "time" (word 4) is the nsubj of "inform", and "relevance" (word 19) is its obl:

$$
\left[\begin{array}{ll}
\text{form:} & inform\\
\text{id:} & 16\\
\text{upos:} & VERB\\
\text{dep:} & \left[ \left[\begin{array}{ll}
\text{id:} & 4\\
\text{deprel:} & nsubj\end{array}\right],
\left[\begin{array}{ll}
\text{id:} & 19\\
\text{deprel:} & obl\end{array}\right] \right]
\end{array}\right]
$$

As a Python data structure, this AVM is rather complex: It is a dictionary, but under the key "dep" the value is a list of dictionaries.


In [None]:
inform_avm_with_deps = { "form" : "inform",
                        "id" : 16,
                        "upos" : "VERB",
                        "dep" : [ {"id" : 4, "deprel" : "nsubj"},
                                  {"id" : 19, "deprel" : "obl"}]
                       }
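To read from this structure, you chain the access steps: a key to get the list, then either a list index or a loop to get at one dependent, then a key again. A small sketch, repeating the AVM so that the cell runs on its own (the variable names `first_deprel` and `subject_ids` are just for illustration):

```python
inform_avm_with_deps = { "form" : "inform",
                         "id" : 16,
                         "upos" : "VERB",
                         "dep" : [ {"id" : 4, "deprel" : "nsubj"},
                                   {"id" : 19, "deprel" : "obl"}]
                       }

# key, then list index, then key again
first_deprel = inform_avm_with_deps["dep"][0]["deprel"]
print(first_deprel)

# looping over the list of dependents: each one is a small dictionary
subject_ids = [ ]
for dependent in inform_avm_with_deps["dep"]:
    print(dependent["id"], dependent["deprel"])
    if dependent["deprel"] == "nsubj":
        subject_ids.append(dependent["id"])

print(subject_ids)
```

The last line prints `[4]`, the ID of the subject "time".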

Here is how we make a version of sentence 10 that has such an AVM for each word.

In [None]:
# first we initialize each AVM to have an empty dependencies list
sentence10_reformat = [ ]
for token in sentence10:
    sentence10_reformat.append( { "form" : token["form"],
                                  "id" : token["id"],
                                  "upos" : token["upos"],
                                  "dep" : [ ]
                                } )

# now we add dependencies
for token in sentence10:
    # the root of the sentence has head 0, so it is nobody's dependent.
    # we skip it: index 0 - 1 = -1 would wrongly select the last token.
    if token["head"] == 0:
        continue
    # looking up the head of this token. index is that head minus one.
    myhead_ix = token["head"] - 1
    # print(token["form"], token["id"], token["head"], sentence10_reformat[myhead_ix]["form"])
    # adding this token to the head's dependencies
    sentence10_reformat[ myhead_ix ]["dep"].append({ "id" : token["id"],
                                                     "deprel" : token["deprel"]})

sentence10_reformat

[{'form': 'Thus',
  'id': 1,
  'upos': 'ADV',
  'dep': [{'id': 2, 'deprel': 'punct'}]},
 {'form': ',', 'id': 2, 'upos': 'PUNCT', 'dep': []},
 {'form': 'the', 'id': 3, 'upos': 'DET', 'dep': []},
 {'form': 'time',
  'id': 4,
  'upos': 'NOUN',
  'dep': [{'id': 3, 'deprel': 'det'},
   {'id': 6, 'deprel': 'acl:relcl'},
   {'id': 9, 'deprel': 'conj'}]},
 {'form': 'it', 'id': 5, 'upos': 'PRON', 'dep': []},
 {'form': 'takes',
  'id': 6,
  'upos': 'VERB',
  'dep': [{'id': 5, 'deprel': 'nsubj'}]},
 {'form': 'and', 'id': 7, 'upos': 'CCONJ', 'dep': []},
 {'form': 'the', 'id': 8, 'upos': 'DET', 'dep': []},
 {'form': 'ways',
  'id': 9,
  'upos': 'NOUN',
  'dep': [{'id': 7, 'deprel': 'cc'},
   {'id': 8, 'deprel': 'det'},
   {'id': 12, 'deprel': 'acl'}]},
 {'form': 'of', 'id': 10, 'upos': 'SCONJ', 'dep': []},
 {'form': 'visually', 'id': 11, 'upos': 'ADV', 'dep': []},
 {'form': 'exploring',
  'id': 12,
  'upos': 'VERB',
  'dep': [{'id': 10, 'deprel': 'mark'},
   {'id': 11, 'deprel': 'advmod'},
   {'id'

Based on this data structure, we can determine whether the subject is before the verb: If so, its ID is lower than that of the verb. We can also determine whether the subject is before the object: If so, its ID is lower than that of the object.
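Here is one way this comparison of IDs could look in code. It is a sketch under some assumptions: `verb_avm` is a made-up AVM in the reformatted style, `word_order` is a hypothetical helper name, and only the labels `nsubj` and `obj` are treated as subject and object (other labels, for example `nsubj:pass`, are ignored).

```python
# a made-up verb AVM in the reformatted style;
# the ids are the positions of the words in the sentence
verb_avm = { "form" : "explores",
             "id" : 2,
             "upos" : "VERB",
             "dep" : [ {"id" : 1, "deprel" : "nsubj"},
                       {"id" : 4, "deprel" : "obj"}]
           }

def word_order(avm):
    """Return a string such as 'SVO', or None if subject or object is missing."""
    # maps each of the letters S, V, O to a position in the sentence
    positions = { "V" : avm["id"] }
    for dependent in avm["dep"]:
        if dependent["deprel"] == "nsubj":
            positions["S"] = dependent["id"]
        elif dependent["deprel"] == "obj":
            positions["O"] = dependent["id"]
    # we can only determine the order if we found both S and O
    if len(positions) < 3:
        return None
    # sort the letters by their position: lower ID means earlier in the sentence
    return "".join(sorted(positions, key = positions.get))

print(word_order(verb_avm))
```

This prints `SVO`. To compare word orders across a corpus, you could use the returned strings as keys in a sheet of counts.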

# Counting words to get a sense of topic

Here is a more complex counting task: In a text, we count how often words appear, to get a sense of the topic of the text.

**Try it for yourself:** All the problems below are for you to solve.

## Words across all State of the Union addresses

We'll first do word counts across *all* state of the union addresses. Like in the previous notebook, we'll use NLTK's interface to the speeches.

In [None]:
# making sure we have all we need
import nltk
nltk.download('state_union')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package state_union to /root/nltk_data...
[nltk_data]   Unzipping corpora/state_union.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

We'll need an aggregator variable to collect word counts across all state of the union addresses. This time, our aggregator variable will be a sheet of counts, specifically an NLTK FreqDist. Since we count words across all speeches, we use one single FreqDist to collect counts.

Here is a neat fact about NLTK's FreqDist objects: You can use the method `update()` to add a whole new list of words to the counts, like this:

In [None]:
fd = nltk.FreqDist(["here", "are", "some", "words"])
print(fd.items(), "\n")
# here comes the trick
fd.update(["here", "are", "more", "words"])
print(fd.items())

dict_items([('here', 1), ('are', 1), ('some', 1), ('words', 1)]) 

dict_items([('here', 2), ('are', 2), ('some', 1), ('words', 2), ('more', 1)])


Here is an example of a FreqDist as an aggregator variable. It counts words in an Edward Lear poem, one line at a time.

In [None]:
# first stanza of a poem by Edward Lear,
# one string per line
data = [
"They went to sea in a Sieve, they did",
"In a Sieve they went to sea:",
"In spite of all their friends could say,",
"On a winter's morn, on a stormy day,",
"In a Sieve they went to sea!",
"And when the Sieve turned round and round,",
"And every one cried, `You'll all be drowned!'",
"They called aloud, `Our Sieve ain't big,",
"But we don't care a button! we don't care a fig!",
"In a Sieve we'll go to sea!'",
"Far and few, far and few,",
"Are the lands where the Jumblies live;",
"Their heads are green, and their hands are blue,",
"And they went to sea in a Sieve."]

print("I got this many lines:", len(data))

# aggregator variable
fd = nltk.FreqDist()
# loop
for line in data:
    words = line.split()
    # adding to the aggregator variable
    fd.update(words)

print(fd.most_common(10))


I got this many lines: 14
[('a', 9), ('to', 5), ('Sieve', 5), ('went', 4), ('they', 4), ('In', 4), ('and', 4), ('And', 3), ('the', 3), ('They', 2)]


Now do the following:
* Make a FreqDist object as your aggregator variable
* iterate over fileIDs of state of the union addresses, as in the previous notebook
* for each fileID:
  * pull up the speech, as a list of words
  * add it to the aggregator variable
  
What are the most frequent words?

You'll see a lot of "uninteresting" words come out on top. To make them disappear, let's remove "stopwords" from each speech. You can get NLTK's English stopwords like this -- I'm only showing the first 10:

In [None]:
nltk.corpus.stopwords.words("english")[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In a single state of the union address, you can remove stopwords like this:

* for each word of the speech:
  * if it is not in the list of stopwords:
    * add it to the sheet of counts
    
Let's do this for the first state of the union address:

In [None]:
sotu_fileids = nltk.corpus.state_union.fileids()
first_fileid = sotu_fileids[0]
first_sotu = nltk.corpus.state_union.words(fileids = first_fileid)

mystopwords = nltk.corpus.stopwords.words("english")

fd = nltk.FreqDist()
for word in first_sotu:
    if word not in mystopwords:
        fd[word] += 1

fd.most_common(20)

[('.', 105),
 (',', 92),
 ('peace', 23),
 ('world', 20),
 ('must', 20),
 ('I', 17),
 ('We', 17),
 ('-', 16),
 ('!', 12),
 ('America', 11),
 ('The', 10),
 ('people', 10),
 ('nations', 10),
 ('In', 8),
 ('hope', 8),
 ('freedom', 7),
 ('never', 7),
 ('great', 6),
 ('upon', 6),
 ('shall', 6)]

To do this for all state of the union addresses,
* you make an outer loop that iterates over all speeches, via their file IDs.
* then for each speech, loop over all the words in that speech.
* if the current word is not a stopword, add it to our sheet of counts.
  
Let's do this, and then see if we get better words coming out on top: