# Introduction to Dictionaries

Dictionaries are similar to lists, except that we look up items with a key rather than with an index. A key can be one of several kinds of object: a string, a number, or even a tuple (which we will talk about in a moment).

Dictionaries are written within "curly braces" -- `{}` -- and each key is separated from its value by a colon.

The following creates a new dictionary, and then shows how to add or edit entries.

In [None]:
basketball_wins = {'Purdue': 5,
                   'IU': 2,
                   'Northwestern': 0}

# To add a new entry
basketball_wins['Michigan'] = 5

# The same syntax updates an existing entry
basketball_wins['Purdue'] = 6

print(basketball_wins)

You access the data in a dictionary by the key. Unlike a list, you can't access an item in a dictionary by an index number (because the index number could also be a key!)

In [None]:
basketball_wins['Purdue']

In [None]:
# But you get a KeyError if the key doesn't exist

basketball_wins['Wisconsin']
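
To make the point about index numbers concrete, here is a small sketch. The dictionary `positions` and its contents are made up for illustration and are not part of the basketball data above.

```python
# Hypothetical example data: the keys are integers, not list positions
positions = {10: 'guard', 0: 'forward', 33: 'center'}

# 0 is looked up as a key, not as "the first item"
first_lookup = positions[0]
print(first_lookup)

# There is no key 1, so this raises a KeyError even though the dictionary has three items
try:
    positions[1]
except KeyError as err:
    missing_key = err.args[0]
    print('No such key:', missing_key)
```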

Dictionary keys can be any "immutable" object in Python, but they are most often strings or numbers.
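
As a rough illustration of what "immutable" means for keys, the sketch below tries a few key types. The dictionary names and values are invented for this example.

```python
# Strings, numbers, and tuples are immutable, so they all work as keys
valid_keys = {'Purdue': 5, 42: 'a number key', ('Illinois', 'Chicago'): 'a tuple key'}
print(valid_keys[('Illinois', 'Chicago')])

# A list is mutable, so Python refuses to use it as a key
try:
    invalid = {['Illinois', 'Chicago']: 'a list key'}
except TypeError as err:
    list_key_error = str(err)
    print('TypeError:', list_key_error)
```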

While the keys must be unique and don't change, the values can change. The following code takes in a string and counts the letters in it.

In [None]:
string = """
I have been one acquainted with the night.
I have walked out in rain—and back in rain.
I have outwalked the furthest city light.

I have looked down the saddest city lane.
I have passed by the watchman on his beat
And dropped my eyes, unwilling to explain.

I have stood still and stopped the sound of feet
When far away an interrupted cry
Came over houses from another street,

But not to call me back or say good-bye;
And further still at an unearthly height,
One luminary clock against the sky

Proclaimed the time was neither wrong nor right. 
I have been one acquainted with the night.
"""
string = string.lower()
letter_dict = {}
for letter in string:
    # Don't count new lines or spaces
    if letter in ['\n',' ']:
        continue
    if letter in letter_dict: # Check if the key exists
        letter_dict[letter] = letter_dict[letter] + 1
    else: # If it doesn't exist, then create it with the value 1
        letter_dict[letter] = 1
        
print(letter_dict)


### Exercise 1

See if you can modify the code above to count how often each word appears instead.

In [None]:
text = """
I have been one acquainted with the night.
I have walked out in rain—and back in rain.
I have outwalked the furthest city light.

I have looked down the saddest city lane.
I have passed by the watchman on his beat
And dropped my eyes, unwilling to explain.

I have stood still and stopped the sound of feet
When far away an interrupted cry
Came over houses from another street,

But not to call me back or say good-bye;
And further still at an unearthly height,
One luminary clock against the sky

Proclaimed the time was neither wrong nor right. 
I have been one acquainted with the night.
"""

import string

text = text.lower()
# Either run the next line or uncomment the loop below
text = text.translate(str.maketrans('', '', string.punctuation))
#punc = '''!()-[]{};:'"\, <>./?@#$%—^&*_~'''
#for not_letter in text:
#    if not_letter in punc:
#        text = text.replace(not_letter, " ")
word_dict = {}
every_word = text.split()
for word in every_word:
    word_dict[word] = word_dict.get(word, 0) + 1
print(word_dict)

## Approaches for dealing with missing dictionary keys

This pattern of either modifying an existing dictionary entry or creating a new key is very common, and there are a few approaches for handling it.

1. The first is what I did above - using an if statement to check if the entry exists, e.g.:

```python
if letter in letter_dict:
    letter_dict[letter] = letter_dict[letter] + 1
else:
    letter_dict[letter] = 1
```

2. A very similar approach is used below: instead of `if`, we use a `try...except` clause, e.g.

```python
try:
    letter_dict[letter] = letter_dict[letter] + 1
except KeyError:
    letter_dict[letter] = 1
```

3. A shorter but slightly less readable approach is to use the `get` method of a dictionary. In the code below, `letter_dict.get(letter, 0)` returns the value for the key `letter` if it exists, or `0` if it doesn't.

```python
letter_dict[letter] = letter_dict.get(letter, 0) + 1
```

4. Finally, the `collections` module has a [defaultdict](https://docs.python.org/3.7/library/collections.html#collections.defaultdict) which lets you create a dictionary with a built-in default.

```python
import collections
letter_dict = collections.defaultdict(int)
...
letter_dict[letter] = letter_dict[letter] + 1
```

For most things in Python, the language tries to have one right way to do things. In this case, I think that any of these options is just fine and that they are basically equivalent. Use whichever makes the most sense to you.
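
To check that the four approaches really do agree, here is a minimal sketch that runs each one over the same input. The string `sample` and the `counts_*` names are placeholders chosen for this example.

```python
import collections

sample = 'banana'  # hypothetical input, short enough to check by hand

# 1. if/else
counts_if = {}
for letter in sample:
    if letter in counts_if:
        counts_if[letter] = counts_if[letter] + 1
    else:
        counts_if[letter] = 1

# 2. try/except
counts_try = {}
for letter in sample:
    try:
        counts_try[letter] = counts_try[letter] + 1
    except KeyError:
        counts_try[letter] = 1

# 3. get with a default of 0
counts_get = {}
for letter in sample:
    counts_get[letter] = counts_get.get(letter, 0) + 1

# 4. defaultdict(int) supplies 0 for any missing key
counts_default = collections.defaultdict(int)
for letter in sample:
    counts_default[letter] = counts_default[letter] + 1

print(counts_if)
print(counts_if == counts_try == counts_get == dict(counts_default))
```

One difference worth knowing: simply reading a missing key from a `defaultdict` adds that key with the default value, while the other three approaches leave the dictionary untouched.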

## Tuples

Tuples are very similar to lists. They are created with parentheses -- `()` -- rather than with square brackets. 

In [None]:
my_tuple = (4,13,'hello')

Like lists, items in a tuple can be accessed by indexing.

In [None]:
my_tuple[1]

However, tuples are "immutable", meaning that they can't be changed after they are created. So, things like "append" and "pop" won't work.
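
A short sketch of what that looks like in practice, reusing the `my_tuple` value from above (the name `longer_tuple` is introduced only for this example):

```python
my_tuple = (4, 13, 'hello')

# Assigning to an index fails because tuples can't be changed in place
try:
    my_tuple[0] = 5
except TypeError as err:
    print('TypeError:', err)

# Tuples have no append method at all
try:
    my_tuple.append('world')
except AttributeError as err:
    print('AttributeError:', err)

# Building a new tuple from the old one is allowed; the original is untouched
longer_tuple = my_tuple + ('world',)
print(longer_tuple)
```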

This immutability is an important attribute of dictionary keys: Python finds an entry by computing a hash of its key, so a key must not be able to change after it is stored. Because tuples are immutable, they can be used as dictionary keys. For example, let's say you wanted to store the population of cities in the US. You might create a dictionary like this:

In [None]:
population_dict = {('Georgia', 'Atlanta'): 498000,
              ('Illinois', 'Atlanta'): 1692,
              ('Illinois', 'Chicago'): 2750000
             }
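
Here is a brief sketch of how a dictionary like this can be used, repeating the three entries so that it runs on its own. The list name `illinois_cities` is introduced only for this example.

```python
population_dict = {('Georgia', 'Atlanta'): 498000,
                   ('Illinois', 'Atlanta'): 1692,
                   ('Illinois', 'Chicago'): 2750000}

# Look up a single city by giving the whole (state, city) tuple as the key
print(population_dict[('Illinois', 'Atlanta')])

# Unpack each key while looping to pick out the cities in one state
illinois_cities = []
for (state, city), population in population_dict.items():
    if state == 'Illinois':
        illinois_cities.append(city)
print(illinois_cities)
```

Because the state is part of the key, the two Atlantas stay separate entries instead of overwriting each other.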

## Example

The following code takes a CSV table of city populations that I grabbed from the US Census Bureau API and saved [here](https://raw.githubusercontent.com/jdfoote/Intro-to-Programming-and-Data-Science/master/resources/data/uscities.csv). The first few lines below download the file. The next bit of code converts the file into a dictionary that looks like the one above.

In [15]:
import csv
import requests
import codecs

# This downloads the file and then opens it. You could also save it to your computer, and open it in the normal way
f = requests.get('https://raw.githubusercontent.com/jdfoote/Intro-to-Programming-and-Data-Science/master/resources/data/uscities.csv')
f_csv = csv.reader(codecs.iterdecode(f.iter_lines(), 'utf-8'))
next(f_csv) # This just skips the header row, so it isn't in our data

population_dict = {}
for row in f_csv:
    # To get these column numbers, I just opened the CSV file and looked at which columns had this data
    city = row[1]
    state = row[2]
    population = int(row[0])
    if (state, city) in population_dict: # Check for the same city twice in the same state
        print(f"{(state, city)} appears twice in the data")
    else:
        population_dict[(state, city)] = population
        
# This code prints the first few items in the dictionary, to make sure it looks like it's right
print(list(population_dict.items())[:5])

[(('Alabama', 'Abbeville city'), 2560), (('Alabama', 'Adamsville city'), 4281), (('Alabama', 'Addison town'), 718), (('Alabama', 'Akron town'), 328), (('Alabama', 'Alabaster city'), 33487)]


In [25]:
list(population_dict.items())[2][0][1]

'Addison town'
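
The chained indexing in that last cell can be hard to read, so here is the same lookup taken one step at a time. This sketch uses a two-entry stand-in for the real data (values copied from the output above), which is why the position is `1` rather than `2`.

```python
# A two-entry stand-in for the downloaded data
population_dict = {('Alabama', 'Abbeville city'): 2560,
                   ('Alabama', 'Addison town'): 718}

items = list(population_dict.items())  # a list of (key, value) pairs
second_item = items[1]                 # (('Alabama', 'Addison town'), 718)
key = second_item[0]                   # ('Alabama', 'Addison town')
city = key[1]                          # 'Addison town'
print(city)
```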

It looks right, so let's press on.

By using tuples as keys, you can do things like summarize by either entry in the tuple.

In [10]:
state_populations = {}
for state_city in population_dict:
    state = state_city[0] # Extract the state from the key
    city_pop = population_dict[state_city] # Extract the population from the value
    try: # If the key exists, then add the population
        state_populations[state] = state_populations[state] + city_pop
    except KeyError: # Otherwise set the value to the population
        state_populations[state] = city_pop
    
print(state_populations)

{'Alabama': 2998987, 'Alaska': 497834, 'Arizona': 5791407, 'Arkansas': 2001152, 'California': 32965607, 'Colorado': 4279051, 'Connecticut': 1379443, 'Delaware': 276116, 'District of Columbia': 705749, 'Florida': 10766975, 'Village of Islands village; Florida': 6317, 'Georgia': 4685332, 'Hawaii': 345064, 'Idaho': 1257628, 'Illinois': 11040504, 'Indiana': 4505674, 'Iowa': 2524555, 'Kansas': 2418759, 'Kentucky': 2473075, 'Louisiana': 2179262, 'Maine': 376400, 'Maryland': 1528860, 'Massachusetts': 3636758, 'Michigan': 5051706, 'Minnesota': 4665488, 'Mississippi': 1505545, 'Missouri': 4067590, 'Montana': 580973, 'Nebraska': 1500339, 'Nevada': 1752205, 'New Hampshire': 428025, 'New Jersey': 4233773, 'New Mexico': 1402249, 'New York': 12403348, 'North Carolina': 5962201, 'North Dakota': 592706, 'Ohio': 7604347, 'Oklahoma': 3044153, 'Oregon': 2963745, 'Pennsylvania': 5586259, 'Rhode Island': 547731, 'South Carolina': 1865562, 'South Dakota': 626159, 'Tennessee': 4090552, 'Moore County metropol

## Exercise 2

Reuse and modify the code above so that it prints a dictionary of the total population of cities that start with each letter of the alphabet.

In [32]:
letter_city_pop = {}
for cities in population_dict:
    city = cities[1] # Extract the city from the key
    city_pop = population_dict[cities] # Extract the population from the value
    for letter in "abcdefghijklmnopqrstuvwxyz":
        if city.startswith(letter.upper()):
            try: # If the key exists, then add the population
                letter_city_pop[letter] = letter_city_pop[letter] + city_pop
            except KeyError: # Otherwise set the value to the population
                letter_city_pop[letter] = city_pop
    
print(letter_city_pop)

{'a': 10205308, 'b': 12556367, 'c': 19334831, 'd': 8004522, 'e': 5776253, 'f': 7887257, 'g': 6978634, 'h': 9657570, 'i': 2537573, 'j': 3115112, 'k': 3357584, 'l': 15896771, 'm': 15038400, 'n': 15665179, 'o': 5766337, 'p': 13478110, 'r': 7750171, 's': 23085235, 't': 6159462, 'u': 1158915, 'v': 2185529, 'w': 9914670, 'y': 952731, 'q': 244256, 'z': 150757, 'x': 27315}


# Day 5 - Chapter 9
## Exercise 2
### Write a program that categorizes each mail message by which day of the week the commit was done. To do this look for lines that start with “From”, then look for the third word and keep a running count of each of the days of the week. At the end of the program print out the contents of your dictionary (order does not matter).

In [None]:
fname = input('Enter the file name: ')
try:
    f = open(fname, 'r')
except OSError:
    print('File cannot be opened:', fname)
    exit()
days = {}
for line in f:
    if line.startswith("From"):
        try:
            day_week = line.split()[2]
        except IndexError:
            continue
        days[day_week] = days.get(day_week, 0) + 1
f.close()
print(days)
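
If you don't have a mail log handy, the same counting logic can be tried on a few lines held in memory. The sketch below uses `io.StringIO` as a stand-in for the open file; the addresses and dates are made up.

```python
import io

# Hypothetical stand-in for a mail log file; the addresses are invented
sample_log = io.StringIO(
    "From alice@example.com Sat Jan  5 09:14:16 2008\n"
    "From: alice@example.com\n"
    "From bob@example.com Fri Jan  4 18:10:48 2008\n"
    "From alice@example.com Fri Jan  4 16:10:39 2008\n"
)

days = {}
for line in sample_log:
    words = line.split()
    # "From:" header lines have only two words, so they are skipped here
    if line.startswith("From") and len(words) > 2:
        days[words[2]] = days.get(words[2], 0) + 1
print(days)
```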

## Exercise 3
### Write a program to read through a mail log, build a histogram using a dictionary to count how many messages have come from each email address, and print the dictionary.

In [None]:
fname = input('Enter the file name: ')
try:
    f = open(fname, 'r')
except OSError:
    print('File cannot be opened:', fname)
    exit()
emails = {}
for line in f:
    if line.startswith("From") and len(line.split()) > 2:
        email = line.split()[1]
        emails[email] = emails.get(email, 0) + 1
f.close()
print(emails)

## Exercise 4
### Add code to the above program to figure out who has the most messages in the file. After all the data has been read and the dictionary has been created, look through the dictionary using a maximum loop (see Chapter 5: Maximum and minimum loops) to find who has the most messages and print how many messages the person has.

In [None]:
fname = input('Enter the file name: ')
try:
    f = open(fname, 'r')
except OSError:
    print('File cannot be opened:', fname)
    exit()
emails = {}
for line in f:
    if line.startswith("From") and len(line.split()) > 2:
        email = line.split()[1]
        emails[email] = emails.get(email, 0) + 1
f.close()
largest = None
for key in emails:
    if largest is None or emails[key] > largest:
        largest = emails[key]
        sender = key
print(sender, largest)
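
The exercise asks for an explicit maximum loop, which is what the code above does. As an aside, Python's built-in `max` can do the same search in one line. The sketch below assumes a small, made-up `emails` dictionary in place of one built from a real mail log.

```python
# Hypothetical counts, standing in for the emails dictionary built from a mail log
emails = {'alice@example.com': 3, 'bob@example.com': 5, 'carol@example.com': 1}

# max() walks over the keys and compares them by their counts
sender = max(emails, key=emails.get)
print(sender, emails[sender])
```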

# Day 5 - Chapter 10
## Exercise 1
### Revise a previous program as follows: Read and parse the “From” lines and pull out the addresses from the line. Count the number of messages from each person using a dictionary. After all the data has been read, print the person with the most commits by creating a list of (count, email) tuples from the dictionary. Then sort the list in reverse order and print out the person who has the most commits.

In [3]:
fname = input('Enter the file name: ')
try:
    f = open(fname, 'r')
except OSError:
    print('File cannot be opened:', fname)
    exit()
emails = {}
for line in f:
    if line.startswith("From") and len(line.split()) > 2:
        email = line.split()[1]
        emails[email] = emails.get(email, 0) + 1
f.close()
senders = list()
for key, val in emails.items():
    senders.append((val, key))
senders.sort(reverse=True)
print(senders[0])

Enter the file name: mbox-short.txt
(5, 'cwen@iupui.edu')
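
Putting the count first in each tuple matters because Python compares tuples element by element: the counts decide the order, and the email address is only used to break ties. A small sketch with made-up counts that include a tie:

```python
# Hypothetical counts, including a tie, to show how tuples are compared
emails = {'alice@example.com': 3, 'bob@example.com': 5, 'carol@example.com': 5}

senders = [(count, email) for email, count in emails.items()]
senders.sort(reverse=True)
print(senders)
```

With `reverse=True`, the tie between the two counts of 5 is settled by the later address alphabetically, so `carol@example.com` comes first here.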


## Exercise 2
### This program counts the distribution of the hour of the day for each of the messages. You can pull the hour from the “From” line by finding the time string and then splitting that string into parts using the colon character. Once you have accumulated the counts for each hour, print out the counts, one per line, sorted by hour as shown below.

In [8]:
fname = input('Enter the file name: ')
try:
    f = open(fname, 'r')
except OSError:
    print('File cannot be opened:', fname)
    exit()
hours = {}
for line in f:
    if line.startswith("From") and len(line.split()) > 2:
        colon = line.find(':')
        hour = line[colon - 2:colon]
        hours[hour] = hours.get(hour, 0) + 1
f.close()
sort_hours = list()
for key, val in hours.items():
    sort_hours.append((key, val))
sort_hours.sort()
print(sort_hours)

Enter the file name: mbox-short.txt
[('04', 3), ('06', 1), ('07', 1), ('09', 2), ('10', 3), ('11', 6), ('14', 1), ('15', 2), ('16', 4), ('17', 2), ('18', 1), ('19', 1)]
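
The exercise text describes pulling out the time string and splitting it on the colon, then printing one count per line. Here is one way that could look, as a sketch that assumes every line has the same layout as the made-up sample lines below (with the time as the sixth word).

```python
# Hypothetical "From" lines in the same layout as the mail log
lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From bob@example.com Fri Jan  4 18:10:48 2008",
    "From carol@example.com Fri Jan  4 09:05:31 2008",
]

hours = {}
for line in lines:
    time_string = line.split()[5]      # e.g. '09:14:16'
    hour = time_string.split(':')[0]   # e.g. '09'
    hours[hour] = hours.get(hour, 0) + 1

# sorted() orders the (hour, count) pairs by hour; print one pair per line
for hour, count in sorted(hours.items()):
    print(hour, count)
```

Because the hours are kept as two-character strings such as `'09'`, sorting them as text gives the same order as sorting them as numbers.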
