# Dictionaries

## Instantiating a dictionary

Examples

In [None]:
catalog = {'1008':'widget', '2149':'flange', '19x5':'smoke shifter', '992':'poiuyt'}
profile = {'name':'Mickey Mouse', 'company':'Disney', 'animated':True, 'fingers':8}

print(catalog['2149'])
print(profile['name'])

In [None]:
# The key can be a string variable rather than a literal
characteristic = 'animated'
print(profile[characteristic])

trait = input('What do you want to know about the character? ')
print("The character's", trait, 'is', profile[trait])
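If the text typed by the user is not one of the dictionary's keys, `profile[trait]` raises a `KeyError`. Here is a minimal sketch of two common ways to guard against that. The `trait` value is hard-coded as a stand-in for user input, and the fallback string `'unknown'` is only an illustration.

```python
profile = {'name': 'Mickey Mouse', 'company': 'Disney', 'animated': True, 'fingers': 8}

# Stand-in for a value typed by the user; 'height' is not a key in profile
trait = 'height'

# Option 1: test for the key before looking it up
if trait in profile:
    print("The character's", trait, 'is', profile[trait])
else:
    print('Sorry, I do not know the', trait)

# Option 2: .get() returns a default value instead of raising KeyError
answer = profile.get(trait, 'unknown')
print(answer)
```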

## Editing a dictionary

In [None]:
my_dict = {}
my_dict['name'] = input('What is the character name? ')
print(my_dict)
my_dict['company'] = input('Who does the character work for? ')
print(my_dict)
my_dict['fingers'] = int(input('How many fingers does the character have? '))
print(my_dict)

In [None]:
print(catalog)

catalog['2149'] = 'thingamajig'
print(catalog)

del catalog['1008']
print(catalog)

## Practice

The starter code has some data retrieved from the Twitter API for a tweet. Print the text of the tweet, then change the value of the `lang` key to Spanish (language code `es`) and print the dictionary.

In [None]:
tweet = {'created_at':'Wed Sep 18 14:08:54 +0000 2019', 'text':'RT @wnprwheelhouse: @wnprharriet кричать @wnpr !','lang':'ru'}


# Complex data structures

## Lists of lists


In [None]:
first_row = [3, 5, 7, 9]
second_row = [4, 11, -1, 5]
third_row = [-99, 0, 45, 0]
data = [first_row, second_row, third_row]
print(data)

In [None]:
print(len(data))

In [None]:
print(data[1])
print(len(data[1]))

In [None]:
data = [[3, 5, 7, 9], [4, 11, -1, 5], [-99, 0, 45, 0]]
print(data[2][0])

## Lists of dictionaries

In [None]:
characters = [
    {'name': 'Mickey Mouse', 'company': 'Disney', 'gender': 'male'},
    {'name': 'Daisy Duck', 'company': 'Disney', 'gender': 'female'},
    {'name': 'Daffy Duck', 'company': 'Warner Brothers', 'gender': 'male'},
    {'name': 'Fred Flintstone', 'company': 'Hanna Barbera', 'gender': 'male'},
    {'name': 'WALL-E', 'company': 'Pixar', 'gender': 'neutral'},
    {'name': 'Fiona', 'company': 'DreamWorks', 'gender': 'female'}
]
print(characters[1]['company'])
print(characters[0]['name'])
print(characters[4]['gender'])

# Pandas

Standard import statement for pandas

In [None]:
import pandas as pd

# Create a dictionary to use for experiments
states_dict = {'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'}
print(states_dict)

## pandas Series

Series are one-dimensional pandas data structures that are sort of a hybrid of dictionaries and lists. They are ordered, but they also are labeled.

We can create an instance of a Series by passing a dictionary as an argument into `pd.Series()`:

In [None]:
states_series = pd.Series(states_dict)
print(states_series)

When a Series is displayed, the label index is shown on the left and the Series items are shown on the right.
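The two sides of that display can also be pulled out separately. This short sketch uses a three-item Series (a shortened version of the states data) and shows the labels through the `.index` attribute and the items through the `.to_list()` method.

```python
import pandas as pd

states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona'})

# The labels shown on the left of the printout
labels = list(states_series.index)
print(labels)

# The items shown on the right of the printout
items = states_series.to_list()
print(items)
```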

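We can refer to items in a Series either by position (using an integer index) or by name (using the item's label index). Integer indexing is zero-based, as it is elsewhere in Python. Note that recent versions of pandas warn about passing an integer position in plain square brackets when the labels are not integers; the `.iloc[]` indexer is the dependable way to select by position.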

In [None]:
print(states_series[2])  # Positional lookup with plain brackets; deprecated in recent pandas, where .iloc[2] is preferred
print(states_series['TN'])

Series item 2 is the third item in the Series since we start counting with zero.

There is an alternate way of referring to Series items that makes the indexing system explicit. `.loc[]` locates items by label index, and `.iloc[]` locates items by integer index. WARNING: the "i" in "iloc" stands for "integer", NOT "index". In pandas, when the term "index" is used by itself, it refers to the label index, not the integer index.

Specifying a single index in `.loc[]` or `.iloc[]` returns a single value from the Series. In this case the values are strings, so the type of the returned value is `str`.

In [None]:
print(states_series.iloc[2])
print(states_series.loc['TN'])
print(type(states_series.loc['TN']))
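The difference between the two indexers matters most when the labels are themselves integers. The following sketch uses a hypothetical Series of room numbers (not part of the lesson data) whose labels do not match their positions.

```python
import pandas as pd

# Hypothetical example: the labels are integers that do not match the positions
room_series = pd.Series({101: 'lobby', 205: 'lab', 310: 'office'})

by_label = room_series.loc[205]     # the item whose label is 205
by_position = room_series.iloc[0]   # the first item, whatever its label is
print(by_label)
print(by_position)
```

Here `.loc[0]` would fail because no item has the label `0`, while `.iloc[0]` always means the first item.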

## pandas DataFrames

DataFrames are two-dimensional data structures composed of Series with shared indices.

DataFrames can be created from a dictionary of Series.

In [None]:
text_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
capital_series = pd.Series({'OH': 'Columbus', 'TN': 'Nashville', 'AZ': 'Phoenix', 'PA': 'Harrisburg', 'AK': 'Juneau'})
population_series = pd.Series({'OH': 11799448, 'TN': 6910840, 'AZ': 7151502, 'PA': 13002700, 'AK': 733391})
print(text_series)
print()
print(capital_series)
print()
print(population_series)

states_dict = {'text': text_series, 'capital': capital_series, 'population': population_series}
states_df = pd.DataFrame(states_dict)

When created in this way, the dictionary keys are used as the column headers (column label indices) and each series becomes a column. The label indices of the series are shared by all of the rows as the row label indices.
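One way to check how the labels were assigned is to inspect the DataFrame's `.columns`, `.index`, and `.shape` attributes. This is a small sketch using a two-state version of the data; the name `small_df` is made up for the illustration.

```python
import pandas as pd

text_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee'})
capital_series = pd.Series({'OH': 'Columbus', 'TN': 'Nashville'})
small_df = pd.DataFrame({'text': text_series, 'capital': capital_series})

print(list(small_df.columns))  # column label indices, from the dictionary keys
print(list(small_df.index))    # row label indices, from the Series labels
print(small_df.shape)          # (number of rows, number of columns)
```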

When you print a pandas DataFrame, you get a text representation. If the name is given as the last line of the notebook cell, it's displayed in a "prettier" form.

In [None]:
print(states_df)
states_df

## Specifying a column

We can specify a column by using its column header as the label index in square brackets. The resulting column is a pandas Series.

In [None]:
print(states_df['capital'])
print()
print(type(states_df['capital']))
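Passing a list of column headers inside the square brackets selects several columns at once, and the result is then a DataFrame rather than a Series. A minimal sketch with a two-state version of the data:

```python
import pandas as pd

states_df = pd.DataFrame({
    'text': pd.Series({'OH': 'Ohio', 'TN': 'Tennessee'}),
    'capital': pd.Series({'OH': 'Columbus', 'TN': 'Nashville'}),
    'population': pd.Series({'OH': 11799448, 'TN': 6910840})
})

one_column = states_df['capital']                   # a single header gives a Series
two_columns = states_df[['capital', 'population']]  # a list of headers gives a DataFrame
print(type(one_column))
print(type(two_columns))
```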

## Specifying a row

Select a row using `.loc` with the label index or `.iloc` with the integer position. The resulting output is a Series, with the column labels as its label indices.

In [None]:
print(states_df.loc['AZ'])
print()
print(states_df.iloc[1])

## Specifying a cell

Select a cell using `.loc` with the row label index and the column label. The resulting output has the type of the data contained in the cell.

In [None]:
print(states_df.loc['PA', 'population'])
print(type(states_df.loc['PA', 'population']))
print(states_df.loc['AK', 'capital'])
print(type(states_df.loc['AK', 'capital']))
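A cell can also be selected by position, by giving `.iloc` the integer row position followed by the integer column position. This sketch rebuilds a two-state version of the DataFrame so that it runs on its own.

```python
import pandas as pd

states_df = pd.DataFrame({
    'text': pd.Series({'OH': 'Ohio', 'TN': 'Tennessee'}),
    'capital': pd.Series({'OH': 'Columbus', 'TN': 'Nashville'})
})

cell_by_label = states_df.loc['TN', 'capital']  # row label, column label
cell_by_position = states_df.iloc[1, 1]         # second row, second column
print(cell_by_label)
print(cell_by_position)
```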

## Practice

Write expressions for each of the following and print the results:
- The population column
- The row for Tennessee, using the label index
- The row for Alaska, using the integer index
- The capital of Pennsylvania

# Loading a DataFrame from a file

Although there are a number of ways to build a pandas DataFrame from simpler Python objects, most of the time we will create them from data that are already in tabular form in a file.

You can load a CSV file by passing its URL as the argument of the `pd.read_csv()` function. Since the `School ID` column is a unique identifier for each row, we can use it as the index column.

In [None]:
schools_df = pd.read_csv('https://raw.githubusercontent.com/HeardLibrary/digital-scholarship/master/data/gis/wg/Metro_Nashville_Schools.csv')
# Set the row label index to be the School ID column
schools_df = schools_df.set_index('School ID')
schools_df
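The row label index can also be set while the file is being read, using the `index_col` parameter of `pd.read_csv()`. The sketch below avoids the network by reading from a string; the CSV text and school names are made up for illustration and are not taken from the Nashville file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file or URL
csv_text = """School ID,School Name,Zip Code
1,Example Elementary,37201
2,Sample Middle,37203
"""

# index_col sets the row label index as part of loading
example_df = pd.read_csv(io.StringIO(csv_text), index_col='School ID')
print(example_df)
```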

If a DataFrame is large, it will be difficult to examine the whole thing at once. We can use several methods to view characteristics of the DataFrame.

The `.head()` method will display the first 5 rows of the DataFrame. You can pass in a different number of rows to display as an argument. 

In [None]:
schools_df.head()
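The `.tail()` method and the `.shape` attribute are two other quick ways to inspect a large DataFrame. A small sketch using a made-up ten-row DataFrame (`numbers_df` is a hypothetical name):

```python
import pandas as pd

numbers_df = pd.DataFrame({
    'n': list(range(10)),
    'square': [i ** 2 for i in range(10)]
})

print(numbers_df.head(2))  # first 2 rows instead of the default 5
print(numbers_df.tail(2))  # last 2 rows
print(numbers_df.shape)    # (number of rows, number of columns)
```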

## Practice

Print the `School Name` column. Then print the first 3 rows of the DataFrame.

In [None]:
print(schools_df['School Name'])

## Vectorized operations

Pandas Series and DataFrames support vectorized operations, which means that operations are applied to every item in the Series or DataFrame at once. 

We can calculate the total number of students in each school by adding the `Male` and `Female` columns together.

In [None]:
total_students = schools_df['Male'] + schools_df['Female']
print(total_students)

# Create a DataFrame to make the answers easier to read.
summary_df = pd.DataFrame({'male': schools_df['Male'], 'female': schools_df['Female'], 'total': total_students})
summary_df
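Vectorized operations are not limited to adding two columns. The sketch below, which uses made-up enrollment numbers rather than the Nashville data, also subtracts columns, multiplies a Series by a single number, and makes a comparison that produces a Series of `True`/`False` values.

```python
import pandas as pd

# Hypothetical enrollment data for three schools
enrollment_df = pd.DataFrame(
    {'Male': [120, 200, 95], 'Female': [130, 180, 105]},
    index=['School A', 'School B', 'School C']
)

total = enrollment_df['Male'] + enrollment_df['Female']       # column plus column
difference = enrollment_df['Female'] - enrollment_df['Male']  # column minus column
doubled = total * 2                                           # Series times a single number
large = total > 300                                           # comparison applied to every item

print(total)
print(difference)
print(doubled)
print(large)
```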

## Practice

Calculate the percentage of male students in each school and print the series.

# Loops

## Iterating using `for`

Example

In [None]:
basket = ['apple', 'orange', 'banana', 'lemon', 'lime']
for fruit in basket:
    print('I ate one ' + fruit)
print("I'm full now!")

In [None]:
word = 'supercalifragilisticexpialidocious'
print('Spell it out!')
for letter in word:
    print(letter)
print('That wore me out.')

## Building a sequence with a for loop

The pattern of creating an empty object and then adding a sequence of items to it in a loop is a common one. Within that pattern, the statement

```
sequence = sequence + item
```

can be replaced with 

```
sequence += item
```

Code with explicit concatenation:

In [None]:
list_of_words = ['The ', 'quick ', 'brown ', 'fox ', 'jumps ', 'over ', 'the ', 'lazy ', 'dog ']
sentence = ''
for word in list_of_words:
    sentence = sentence + word # Concatenate the word to the sentence
print(sentence + '!')

Code with shorthand:


In [None]:
sentence = ''
for word in list_of_words:
    sentence += word
print(sentence + '!')

The same strategy, but creating a total by summing a list of numbers:


In [None]:
total = 0
for number in [3, 5, 7, 9]:
    total += number
print('The total is', total)

Using a `range()` object to repeat a prompt a fixed number of times and build a list from user input:


In [None]:
bird_list = []
for i in range(4):
    bird = input('Enter a bird name: ')
    bird_list.append(bird)
print('Your bird list is:', bird_list)
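In the loop above the value of `i` is never used; `range(4)` only controls how many times the loop runs. The loop variable can also be used directly, since `range(4)` produces the integers 0 through 3. A minimal sketch that builds a list of squares (the name `squares` is just for illustration):

```python
squares = []
for i in range(4):
    # i takes the values 0, 1, 2, 3 in turn
    squares.append(i ** 2)
print('The squares are:', squares)
```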

# Iterating through rows in a DataFrame

One of the main purposes of pandas is to make it possible to perform operations on entire columns using vectorized operations. However, there are some situations where it makes sense to iterate through each row in the DataFrame and deal with values one row at a time. These situations would include complex operations that require multiple lines of code to describe, or actions that must happen sequentially, such as retrieving data from a URL.

Our example will use information about websites.

In [None]:
websites = {
    'name': {'alphabet': 'Google', 'vu': 'Vanderbilt', 'fake': 'Obsolete Website'}, 
    'url': {'alphabet': 'https://www.google.com/', 'vu': 'https://www.vanderbilt.edu/', 'fake': 'https://example.org/fake_url'},
    'status': {'alphabet': 'unknown', 'vu': 'unknown', 'fake': 'unknown'}
           }
websites_df = pd.DataFrame(websites)
websites_df

To generate an iterable object from the DataFrame we use the `.iterrows()` method. Iterating using a `for` loop generates a tuple consisting of the label index and the data from the row, in the form of a Series.

In [None]:
for label_index, website_series in websites_df.iterrows():
    print(label_index)
    print()
    print(website_series)
    print()
    print()
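To see the structure of what `.iterrows()` produces without running a full loop, the items can be collected into a list and examined one at a time. This sketch uses a tiny made-up DataFrame (`pets_df` is a hypothetical name).

```python
import pandas as pd

# Hypothetical two-row DataFrame
pets_df = pd.DataFrame(
    {'species': ['cat', 'dog'], 'legs': [4, 4]},
    index=['tom', 'rex']
)

rows = list(pets_df.iterrows())    # one tuple per row
first_label, first_row = rows[0]   # unpack the tuple, as the for statement does

print(type(rows[0]))
print(first_label)
print(type(first_row))
print(first_row['species'])
```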

To access a value from the row Series, we can use direct indexing by providing the column label index.

In [None]:
for label_index, website_series in websites_df.iterrows():
    print(website_series['url'])
    print()

Iterating will allow us to check the status of each website one at a time.

In [None]:
import requests
for label_index, website_series in websites_df.iterrows():
    response = requests.get(website_series['url'])
    # HTTP status code 200 means the request succeeded; 404 means the page was not found.
    print(label_index, website_series['url'], response.status_code)
    # Assign the status to the status column in the DataFrame
    websites_df.loc[label_index, 'status'] = response.status_code

# Print the updated DataFrame
websites_df
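The numeric status codes are defined by the HTTP standard, and Python's built-in `http` module can translate them into short descriptions. The following sketch needs no network access; the list of codes is only an illustrative selection.

```python
from http import HTTPStatus

# Look up the standard phrase for a few common status codes
descriptions = {}
for code in [200, 301, 404, 500]:
    descriptions[code] = HTTPStatus(code).phrase
    print(code, descriptions[code])
```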

## Practice

Use the `.head()` method to assign the first 10 rows of the schools DataFrame to a new DataFrame called `schools_subset`. Then iterate through the rows of `schools_subset` and print the `School Name` and `Zip Code` for each row.