# Data structures: dictionaries

Next to lists, **dictionaries** are another common Python data type. 

At first these objects may seem abstract, and it takes some time to see why they are useful, but rest assured: dictionaries are everywhere in Python programs.


Why is it called a dictionary? Like a traditional dictionary, it consists of a collection of entries (the **keys**), each with a **value** (in a traditional dictionary, that's the definition).

Example from a traditional dictionary:
> _programming (n)_: The art of writing a program.

A dictionary in Python is a mapping between keys and values:
* **Key**: the index of the dictionary. It can be a string, a number, or a tuple (more generally, any hashable object, so not a list). **Keys are unique**.
* **Value**: the value associated with a given key. It can be of any type! Values don't need to be unique.

For example:

```python
dictionary = {"programming" : "The art of writing a program.",
              "dictionary" : "Collection of words with their definitions."
             }
```

In Python:
* A dictionary is represented in curly brackets `{}`.
* A colon (```:```) is used to separate the key from the value. 
* A key-value pair is called an **item**.
* Different **items** in a dictionary are separated by commas (```,```).
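The rules above can be tried out in a short sketch. The names `mixed_keys` and `definitions` and their contents are made up for illustration only.

```python
# Keys can be strings, numbers, or tuples; values can be of any type.
mixed_keys = {
    "programming": "The art of writing a program.",
    1492: ["Santa Maria"],
    (19.00, -71.00): "Santa Maria",
}
print(len(mixed_keys))  # number of items in the dictionary

# Keys are unique: if a key is repeated, only the last value is kept.
definitions = {"python": "A snake.", "python": "A programming language."}
print(definitions)

# A list cannot be used as a key, because it can be changed in place.
try:
    bad = {["a", "list"]: "not allowed"}
except TypeError as error:
    print("TypeError:", error)
```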

A dictionary can hold as many items as necessary:

In [None]:
shipwrecks_by_year = {
    "Santa Maria": 1492,
    "USS Indianapolis" : 1945,
    "HMS Endeavour": 1778,
    "Endurance": 1915
    }

Values of dictionaries can also be tuples (like here), or lists, or sets...

In [None]:
shipwrecks_by_location = {
    "Santa Maria": (19.00, -71.00),
    "USS Indianapolis" : (12.03, 134.80),
    "HMS Endeavour": (41.60, -71.35),
    "Endurance": (-69.08, -51.50)
    }

... and dictionary values can also be nested dictionaries!

In [None]:
# Here the outer keys are strings, values are dictionaries!
# The keys of the inner dictionary ('Coordinates', 'Year')
# are strings, their values are (1) a tuple of floats for
# key "Coordinates", and (2) an integer for key "Year".
shipwrecks = {
    "Santa Maria": {
        "Coordinates": (19.00, -71.00),
        "Year": 1492
    },
    "USS Indianapolis": {
        "Coordinates": (12.03, 134.80),
        "Year": 1945
    },
    "HMS Endeavour": {
        "Coordinates": (41.60, -71.35),
        "Year": 1778
    },
    "Endurance": {
        "Coordinates": (-69.08, -51.50),
        "Year": 1915
    }
}

## Counting words

In the context of text mining, we often use dictionaries to keep track of **word counts**. For example, the **bag-of-words** representation of the first sentence of Alice in Wonderland
> Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”

could be represented with a dictionary as (after stripping away the punctuation):




In [None]:
wordcounts = {'Alice': 2,
 'was': 2,
 'beginning': 1,
 'to': 2,
 'get': 1,
 'very': 1,
 'tired': 1,
 'of': 3,
 'sitting': 1,
 'by': 1,
 'her': 2,
 'sister': 2,
 'on': 1,
 'the': 3,
 'bank': 1,
 'and': 2,
 'having': 1,
 'nothing': 1,
 'do': 1,
 'once': 1,
 'or': 3,
 'twice': 1,
 'she': 1,
 'had': 2,
 'peeped': 1,
 'into': 1,
 'book': 2,
 'reading': 1,
 'but': 1,
 'it': 2,
 'no': 1,
 'pictures': 2,
 'conversations': 2,
 'in': 1,
 'what': 1,
 'is': 1,
 'use': 1,
 'a': 1,
 'thought': 1,
 'without': 1}


## Creating a dictionary

Creating an **empty** dictionary (with no items):

In [None]:
new_dict = dict()

Creating a dictionary with some items in it:

In [None]:
new_dict = {
    "item1": 1,
    "item2": 2,
    "item3": 3
    }
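As a hedged aside, there are a few other common ways to build the same dictionaries. The variable names below (`empty_dict`, `from_pairs`, `from_zip`) are only illustrative.

```python
# An empty dictionary can also be written with curly brackets.
empty_dict = {}

# dict() accepts a list of (key, value) tuples...
from_pairs = dict([("item1", 1), ("item2", 2), ("item3", 3)])

# ...and zip() pairs up two lists of equal length.
keys = ["item1", "item2", "item3"]
values = [1, 2, 3]
from_zip = dict(zip(keys, values))

print(empty_dict == dict())
print(from_pairs == from_zip)
```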

✏️ **1. Exercise:**

In [None]:
# Create a dictionary that maps each place you've lived in to the country it belongs to, and print it.
# You can include some holiday destinations if you have lived in only one country.
# Type your code here:



## Adding and updating items in a dictionary

You add a key-value pair to a dictionary by assigning a value to a new key. In the cells below, "London" is the new key and "United Kingdom" is its value:

In [None]:
# Instantiate a dictionary of lived places:
dictionary_of_lived_places = {"Toronto": "Canada",
                              "Berlin": "Germany",
                              "Amsterdam": "Netherlands",
                              "Boras": "Sweden",
                              "Saint Petersburg": "Russia",
                              "Antwerp": "Belgium"
                             }
print(dictionary_of_lived_places)

In [None]:
# Below, we're adding a new key-value pair to our dictionary_of_lived_places:
dictionary_of_lived_places["London"] = "United Kingdom"

# Print the dictionary:
print(dictionary_of_lived_places)

You can update the value of a given key in the same way. For example, let's suppose for a moment that "Amsterdam" is not in "Netherlands", but in "The Netherlands":

In [None]:
dictionary_of_lived_places["Amsterdam"] = "The Netherlands"
print(dictionary_of_lived_places)
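To add or change several items at once, the `.update()` method can be used, and `del` removes an item. The sketch below assumes a small made-up dictionary called `places`.

```python
places = {"Toronto": "Canada", "Amsterdam": "Netherlands"}

# Assigning to a new key adds an item;
# assigning to an existing key overwrites its value.
places["London"] = "United Kingdom"
places["Amsterdam"] = "The Netherlands"

# .update() adds or overwrites several items in one call.
places.update({"Berlin": "Germany", "Toronto": "Canada (Ontario)"})

# del removes an item by its key.
del places["London"]

print(places)
print(len(places))
```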

## Accessing a value from a dictionary

Dictionaries provide a very straightforward way to access the value stored under a given key.

Suppose we want to get the coordinates of the 'Endurance', given the following dictionary:

In [None]:
shipwrecks_by_location = {
    "Santa Maria": (19.00, -71.00),
    "USS Indianapolis" : (12.03, 134.80),
    "HMS Endeavour": (41.60, -71.35),
    "Endurance": (-69.08, -51.50)
    }

We just need to write the name of the dictionary followed by the key in square brackets:

In [None]:
print(shipwrecks_by_location["Endurance"])
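The same square-bracket syntax can be chained to reach into nested dictionaries. A minimal sketch, using a shortened copy of the `shipwrecks` dictionary from earlier:

```python
shipwrecks = {
    "Endurance": {"Coordinates": (-69.08, -51.50), "Year": 1915},
    "Santa Maria": {"Coordinates": (19.00, -71.00), "Year": 1492},
}

# The first key selects the inner dictionary, the second key its value.
year = shipwrecks["Endurance"]["Year"]
print(year)

# The value of "Coordinates" is a tuple, so an index selects one element.
latitude = shipwrecks["Endurance"]["Coordinates"][0]
print(latitude)
```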

What happens if you try to access a key that does not exist in the dictionary?

In [None]:
print(shipwrecks_by_location["Titanic"])

To avoid a `KeyError`, you can always check beforehand whether the key exists:

In [None]:
query = "Titanic"
if query in shipwrecks_by_location:
    print(shipwrecks_by_location[query])
else:
    print("Warning: " + query + " is not in the dictionary.")
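Another option, sketched here under the assumption that a missing key should simply produce a warning, is to try the lookup and catch the `KeyError`. The variable `result` is only illustrative.

```python
shipwrecks_by_location = {
    "Santa Maria": (19.00, -71.00),
    "Endurance": (-69.08, -51.50),
}

query = "Titanic"
try:
    # This line raises a KeyError if the key is missing.
    result = shipwrecks_by_location[query]
except KeyError:
    # We end up here only when the key was not found.
    result = None
    print("Warning: " + query + " is not in the dictionary.")

print(result)
```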

Alternatively, you can use the ```.get()``` method: its first argument is the key you want to look up in the dictionary, and its second argument is the default value that is returned if the key is not found:

In [None]:
query = "Endurance"
print(shipwrecks_by_location.get(query, "Key not in dictionary!"))

In [None]:
query = "Titanic"
print(shipwrecks_by_location.get(query, "Key not in dictionary!"))

## Other methods to retrieve data from a dictionary

Get all the **keys** in the dictionary:

In [None]:
print(dictionary_of_lived_places.keys())

Get all the **values** in the dictionary (as a view object, which can be converted to a list with `list()`):

In [None]:
print(dictionary_of_lived_places.values())

Get the **key-value pairs** in the dictionary (as a view of tuples, which can be converted to a list with `list()`):

In [None]:
print(dictionary_of_lived_places.items())
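The objects returned by these three methods are views on the dictionary rather than independent lists. A small sketch, with a made-up `places` dictionary, of what that means and how to get real lists:

```python
places = {"Toronto": "Canada", "Berlin": "Germany"}

# A view stays connected to the dictionary it came from.
keys_view = places.keys()
places["Antwerp"] = "Belgium"
print(keys_view)  # the new key shows up in the view

# list() makes an independent list that supports indexing.
key_list = list(places.keys())
value_list = list(places.values())
item_list = list(places.items())
print(key_list[0])
print(item_list[0])
```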

## Sorting a dictionary

Dictionaries keep their items in insertion order (guaranteed since Python 3.7), but they have no method for sorting in place. Instead, we sort a representation of the dictionary, such as its key-value pairs.

We often use sorting operations when working with word counts (or any other type of counts, really). Let's revisit the (reduced) word counts taken from the first sentence of Alice in Wonderland.

Sort **by key** in ascending order:

In [None]:
wordcounts = {'Alice': 2,
 'was': 2,
 'beginning': 1,
 'to': 2,
 'get': 1,
 'very': 1,
 'tired': 1,
 'of': 3,
 'the': 3,
 'bank': 1,
 'and': 2,
 'having': 1,
 'nothing': 1,
 'do': 1,
 'once': 1,
 'or': 3}


In [None]:
sorted(wordcounts.items())

Sort **by key** in descending order:

In [None]:
sorted(wordcounts.items(), reverse=True)

Sort **by value** in ascending order:

In [None]:
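sorted(wordcounts.items(), key=lambda x: x[1])

In [None]:
# The lambda above is shorthand for a small function that
# returns the second element of a (key, value) tuple:
def get_second_item(x):
    return x[1]
get_second_item((2, 3))

In [None]:
sorted(wordcounts.items(), key=get_second_item)

In [None]:
# The same function written as a lambda and given a name:
get_second_item = lambda x: x[1]
get_second_item((2, 3))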

In [None]:
sorted(wordcounts.items(),key=get_second_item)
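Two hedged variations on sorting by value, using a shortened `wordcounts` dictionary: `operator.itemgetter` from the standard library can replace the lambda, and a key that returns a tuple sorts by count first and alphabetically second.

```python
from operator import itemgetter

wordcounts = {"Alice": 2, "was": 2, "beginning": 1, "of": 3, "the": 3}

# itemgetter(1) does the same job as lambda x: x[1].
by_value = sorted(wordcounts.items(), key=itemgetter(1))
print(by_value)

# Returning a tuple sorts by count first; ties are broken by the word.
# Note that uppercase letters sort before lowercase ones.
by_value_then_word = sorted(wordcounts.items(), key=lambda x: (x[1], x[0]))
print(by_value_then_word)
```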

✏️ **2. Exercise:** 

Sort word frequencies from high to low (i.e. from most common to least common).

In [None]:
# Can you guess how to sort by value in descending order?
#
# Type your code here:



## Iterate over a dictionary

We can iterate over a dictionary in a similar way to how we iterate over a list. By default, iteration is done over the keys of the dictionary. See the example:

In [None]:
dictionary_of_lived_places = {
    "Moose Jaw": "Canada",
    "Saskatoon": "Canada",
    "Ithaca": "Greece"
}

for k in dictionary_of_lived_places:
    print(k)

We can also iterate over the dictionary key-value pairs (`k, v`) with the `.items()` method.

In [None]:
dictionary_of_lived_places = {
    "Moose Jaw": "Canada",
    "Saskatoon": "Canada",
    "Ithaca": "Greece"
}

for k, v in dictionary_of_lived_places.items():
    print(k, v)

Remember that we can access the value of a key like this: ```dictionary[key]```. The cell below achieves the same as the cell above:

In [None]:
for k in dictionary_of_lived_places:
    print(k, dictionary_of_lived_places[k])
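The same patterns carry over to nested dictionaries and to iterating in sorted key order. A minimal sketch, assuming a shortened copy of the `shipwrecks` dictionary from earlier; the list `years` is only illustrative.

```python
shipwrecks = {
    "Santa Maria": {"Coordinates": (19.00, -71.00), "Year": 1492},
    "Endurance": {"Coordinates": (-69.08, -51.50), "Year": 1915},
}

# With .items(), each value is itself a dictionary we can index.
for name, details in shipwrecks.items():
    print(name, details["Year"])

# sorted() on a dictionary returns its keys in sorted order.
for name in sorted(shipwrecks):
    print(name, shipwrecks[name]["Coordinates"])

# .values() iterates over the values only.
years = []
for details in shipwrecks.values():
    years.append(details["Year"])
print(years)
```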

## Use case: counting words

Dictionaries are very helpful to count and keep track of how often words appear in a document or a collection. Look at the example below:

In [None]:
text = "This is the second day of the summer school, we're covering data structures in Python."

word_list = text.split(" ")
dict_wordcounts = dict()
for word in word_list:
    if word in dict_wordcounts:
        dict_wordcounts[word] += 1
    else:
        dict_wordcounts[word] = 1

print(dict_wordcounts)

Do you understand what is happening at each line of code?

Now with comments:

In [None]:
text = "This is the second day of the summer school, we're covering data structures in Python."
word_list = text.split(" ") # First of all, split the text by white space.
dict_wordcounts = dict() # We start an empty dictionary, which we will be filling with data as we read it.
for word in word_list: # Read and iterate over the data, in this case a list of words. 
    if word in dict_wordcounts: # Check if the word exists as a key in the dictionary
        dict_wordcounts[word] += 1 # Update the value (i.e. count) of the key (i.e. word) by adding 1 to it.
    else: # If the word does not exist as a key in the dictionary: 
        dict_wordcounts[word] = 1 # Add the key to the dictionary, and give it the value 1.

print(dict_wordcounts)
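The `if`/`else` in the loop can be shortened with the `.get()` method seen earlier. The sketch below also strips commas and full stops and lowercases each word; this is an extra assumption about how words should be normalised, so that "school," and "school" count as the same word.

```python
text = "This is the second day of the summer school, we're covering data structures in Python."

dict_wordcounts = {}
for word in text.split():
    # Remove surrounding punctuation and ignore capitalisation.
    word = word.strip(".,").lower()
    # .get(word, 0) returns the current count, or 0 for a new word.
    dict_wordcounts[word] = dict_wordcounts.get(word, 0) + 1

print(dict_wordcounts)
```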

## Using Counter

In practice, Python provides a more convenient tool for computing word counts: the `Counter`.

The `Counter` takes a list of tokens (or other items) as input and computes how often each value appears.

It is not loaded automatically, so we need to add an `import` statement first.

In [None]:
from collections import Counter
sentence = "Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do"
word_list = sentence.split()
print(word_list)

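Convert the list of tokens to a `Counter` object with word counts.

In [None]:
# Pass the list of words, not the raw string:
# Counter(sentence) would count single characters instead.
word_counts = Counter(word_list)
word_counts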

The `Counter` is a subclass of the dictionary and behaves similarly in many contexts. For example, we can access values by key.

In [None]:
word_counts['of']

Or reassign values by key.

In [None]:
word_counts['of'] = 3
word_counts['of']

Sorting is much more convenient with the `.most_common()` method. You can print the three most common words using the statement below.

In [None]:
word_counts.most_common(3)
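A `Counter` differs from a plain dictionary in a few useful ways. The sketch below uses two made-up counters, `first` and `second`, built from fragments of the sentence above.

```python
from collections import Counter

first = Counter("Alice was beginning to get very tired of sitting".split())
second = Counter("of having nothing to do".split())

# A missing key returns 0 instead of raising a KeyError.
print(first["rabbit"])

# Two counters can be added; counts for shared words are summed.
combined = first + second
print(combined["of"], combined["to"])

# .update() counts additional tokens on top of the existing counts.
first.update(["Alice", "Alice"])
print(first["Alice"])
```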

# APIs and JSON.

## Application Programming Interfaces


- API access means programmatic access to (online) content.
- Another common source of information besides web scraping.

- Many institutions/data providers allow API access to their content. 
- Details are different, but they generally work in similar ways.

We will look at the Chronicling America [API](https://chroniclingamerica.loc.gov/about/api/).

## Chronicling America API

We define a call to the API by formulating a URL.

```python
url="https://chroniclingamerica.loc.gov/search/pages/results/?andtext=suffrage&format=json"
```
- First part defines protocol and server: `https://chroniclingamerica.loc.gov/`
- `search`: type of request
- `pages`: level of search, can also be '`title`'
- `results/?` is followed by search parameters
  - andtext: the search query
  - format: 'html' (default), or 'json', or 'atom' (optional)
  - page: for paging results (optional)
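Since the search parameters are key-value pairs, they fit naturally in a dictionary. As a sketch, the standard library function `urllib.parse.urlencode` can turn such a dictionary into the query part of the URL. The names `base_url`, `parameters`, and `page_urls` are only illustrative.

```python
from urllib.parse import urlencode

base_url = "https://chroniclingamerica.loc.gov/search/pages/results/"
parameters = {"andtext": "suffrage", "format": "json", "page": 1}

# urlencode joins the items as key=value pairs separated by "&".
api_query = base_url + "?" + urlencode(parameters)
print(api_query)

# Updating the "page" value gives the URLs for the following pages.
page_urls = []
for page in range(1, 4):
    parameters["page"] = page
    page_urls.append(base_url + "?" + urlencode(parameters))
print(page_urls)
```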

In [None]:
import requests

In [None]:
api_query = "https://chroniclingamerica.loc.gov/search/pages/results/?andtext=suffrage&format=json&page=1" 

In [None]:
content = requests.get(api_query).json()

## Navigating JSON

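- JSON stands for JavaScript Object Notation.
- The `.json()` call above parses the response into nested Python dictionaries and lists.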

In [None]:
type(content)

In [None]:
content

```python
{'totalItems': 666702,
 'endIndex': 220,
 'startIndex': 201,
 'itemsPerPage': 20,
 ...}
```

In [None]:
content.keys()

In [None]:
type(content['items'])

In [None]:
content['items'][0]
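The navigation steps above can be practised offline. The sketch below parses a small, made-up JSON string with the standard `json` module; `sample_response` only imitates the shape of a response (a dictionary with an 'items' list of dictionaries) and is not real API output.

```python
import json

# Made-up sample that only imitates the shape of an API response.
sample_response = """
{"totalItems": 2,
 "items": [
   {"title": "Example Gazette", "ocr_eng": "votes for women"},
   {"title": "Sample Herald", "ocr_eng": "suffrage meeting tonight"}
 ]}
"""

# json.loads turns JSON objects into dictionaries and arrays into lists.
sample = json.loads(sample_response)
print(type(sample))
print(type(sample["items"]))

# Each element of the list is again a dictionary.
titles = []
for item in sample["items"]:
    titles.append(item["title"])
print(titles)
```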

✏️ **3. Exercise:** 


- Change the search term in `api_query` from 'suffrage' to something else
- Collect the JSON response
- Save all the 'ocr_eng' texts in a list called `text`
- Join the text into one string with `' '.join(text)`
- Compute the word counts with `Counter` (use split)
- Print the twenty most frequent words (use Counter)

In [None]:
# write your answer here

# Fin.

## Solutions


In [None]:
# 1. Exercise
# Instantiate a dictionary of lived places:
dictionary_of_lived_places = {"Toronto": "Canada",
                              "Berlin": "Germany",
                              "Amsterdam": "Netherlands",
                              "Boras": "Sweden",
                              "Saint Petersburg": "Russia",
                              "Antwerp": "Belgium"
                             }
print(dictionary_of_lived_places)

In [None]:
# 2. Exercise:
wordcounts = {'Alice': 2,
 'was': 2,
 'beginning': 1,
 'to': 2,
 'get': 1,
 'very': 1,
 'tired': 1,
 'of': 3,
 'the': 3,
 'bank': 1,
 'and': 2,
 'having': 1,
 'nothing': 1,
 'do': 1,
 'once': 1,
 'or': 3}
# The key must return the count, x[1], for each (word, count) pair.
sorted(wordcounts.items(), key=lambda x: x[1], reverse=True)

In [None]:
# 3. Exercise:

# Change the search term in `api_query` from 'suffrage' to something else
# Collect the JSON response
# Save all the 'ocr_eng' texts in a list called `text`
# Join the text into one string with `' '.join(text)`
# Compute the word counts with `Counter` (use split)
# Print the twenty most frequent words (use Counter)

import requests
from collections import Counter
api_query = "https://chroniclingamerica.loc.gov/search/pages/results/?andtext=suffragette&format=json&page=1"
content = requests.get(api_query).json()
text = []
for item in content['items']:
    text.append(item['ocr_eng'])
    
text = ' '.join(text)
text_words = text.split()
word_counts = Counter(text_words)
word_counts.most_common(20)