# Dictionaries and Pandas

## Dictionaries

### Create a dictionary

The **Python dictionary** (`dict`) is a **data structure** that **stores `key:value` pairs**. It is useful for storing labeled data and then looking up values by their label.

In [1]:
# With the strings in countries and capitals, create a dictionary called europe with 4 key:value pairs
# Beware of capitalization! Make sure you use lowercase characters everywhere
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Print out europe to see if the result is what you expected
europe

{'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}
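The cell above types the pairs out by hand. As a small sketch of an alternative, the same dictionary can be built directly from the two lists with `zip()`; the name `europe_from_lists` is only illustrative and not part of the original exercise.

```python
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# zip() pairs each country with the capital at the same position,
# and dict() turns those pairs into key:value entries
europe_from_lists = dict(zip(countries, capitals))

print(europe_from_lists)
```

This only works as intended when both lists have the same length and matching order, since `zip()` stops at the shorter list.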

### Access dictionary

In [2]:
# Check out which keys are in europe
europe.keys()

dict_keys(['spain', 'france', 'germany', 'norway'])

In [3]:
# Print out the value that belongs to the key 'norway'
europe['norway']

'oslo'

### Dictionary keys

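Dictionary **keys must be unique**. That is, a dictionary cannot hold two values under the same key; assigning to an existing key overwrites its previous value.

Also, dictionary keys must be **hashable**, which in practice means objects of an **immutable data type**, such as `str`, `int`, `float`, or a `tuple` of immutable items.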
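Both rules can be checked in a few lines. The following is a minimal sketch; the dictionaries and values in it are made up for illustration.

```python
capitals = {'germany': 'bonn'}

# Assigning to an existing key replaces the old value; no second entry is created
capitals['germany'] = 'berlin'
print(len(capitals), capitals['germany'])

# Strings, numbers and tuples of immutable items are all accepted as keys
valid_keys = {'text': 1, 42: 2, 3.14: 3, ('a', 'b'): 4}

# A list is mutable, so Python refuses to use it as a key
try:
    invalid = {['a', 'b']: 4}
except TypeError as err:
    list_key_error = str(err)
    print(list_key_error)
```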

### Dictionary manipulation

In [4]:
# Add the key 'italy' with the value 'rome' to europe
europe['italy'] = 'rome'

# Check if 'italy' is now a key in europe
'italy' in europe

True

In [5]:
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna' }

# The capital of Germany is not 'bonn'; it's 'berlin'
# Update its value
europe['germany'] = 'berlin'

# Australia is not in Europe
# Remove the key 'australia' from europe
del europe['australia']

# Print out europe
europe

{'spain': 'madrid',
 'france': 'paris',
 'germany': 'berlin',
 'norway': 'oslo',
 'italy': 'rome',
 'poland': 'warsaw'}

### Dictionaries as values

Dictionaries can contain `key:value` pairs where the values are again dictionaries.

In [6]:
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }

# Use chained square brackets to select and print out the capital of France
europe['france']['capital']

'paris'

In [8]:
# Create a dictionary, named data, with the keys 'capital' and 'population' 
# Set them to 'rome' and 59.83, respectively
data = {'capital':'rome', 'population':59.83}

# Add a new key-value pair to europe; the key is 'italy' and the value is data, the dictionary you just built
europe['italy'] = data

europe

{'spain': {'capital': 'madrid', 'population': 46.77},
 'france': {'capital': 'paris', 'population': 66.03},
 'germany': {'capital': 'berlin', 'population': 80.62},
 'norway': {'capital': 'oslo', 'population': 5.084},
 'italy': {'capital': 'rome', 'population': 59.83}}
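Once the values are dictionaries themselves, it is often handy to loop over the outer dictionary or to look up a key that may be missing. A short sketch using the same data follows; the variable names `largest` and `sweden` are illustrative.

```python
europe = {'spain': {'capital': 'madrid', 'population': 46.77},
          'france': {'capital': 'paris', 'population': 66.03},
          'germany': {'capital': 'berlin', 'population': 80.62},
          'norway': {'capital': 'oslo', 'population': 5.084},
          'italy': {'capital': 'rome', 'population': 59.83}}

# .items() yields each outer key together with its inner dictionary
for country, info in europe.items():
    print(country, info['capital'], info['population'])

# Find the key whose inner dictionary has the largest 'population' value
largest = max(europe, key=lambda country: europe[country]['population'])

# .get() returns None instead of raising KeyError when the key is absent
sweden = europe.get('sweden')
```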

## Pandas

### Pandas

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python.

In [9]:
import pandas as pd

### Pandas DataFrame

The **DataFrame** is one of Pandas' most important data structures. It stores **tabular data** in which both the rows and the columns can be labeled.

### Dictionary to DataFrame

In [13]:
# Use the pre-defined lists to create a dictionary called my_dict
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

my_dict = {'country':names, 'drives_right':dr, 'cars_per_cap':cpc}
cars = pd.DataFrame(my_dict)

# Specify the row labels by setting cars.index equal to row_labels
cars.index = row_labels

cars

| | country | drives_right | cars_per_cap |
|---|---|---|---|
| US | United States | True | 809 |
| AUS | Australia | False | 731 |
| JPN | Japan | False | 588 |
| IN | India | False | 18 |
| RU | Russia | True | 200 |
| MOR | Morocco | True | 70 |
| EG | Egypt | True | 45 |


### CSV to DataFrame

If you're dealing with millions of observations, the data is typically available as files with a regular structure. One such file type is the CSV file, short for "comma-separated values".

In [16]:
# Import cars.csv data as a DataFrame 
# Store this DataFrame as cars
# Specify the index_col argument inside pd.read_csv(): set it to 0, so that the first column is used as row labels
cars = pd.read_csv('./data/cars.csv', index_col=0)

# Print out cars
cars

| | country | drives_right | cars_per_cap |
|---|---|---|---|
| US | United States | True | 809 |
| AUS | Australia | False | 731 |
| JPN | Japan | False | 588 |
| IN | India | False | 18 |
| RU | Russia | True | 200 |
| MOR | Morocco | True | 70 |
| EG | Egypt | True | 45 |
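To try `index_col=0` without the file on disk, the CSV text can be read from a string instead. This is a sketch under an assumption: the layout of `cars.csv` is guessed to have an empty first header cell followed by the column names, and only three rows are included; `csv_text` and `cars_small` are illustrative names.

```python
import io

import pandas as pd

# Assumed file layout: the first column holds the row labels and has no header
csv_text = """,country,drives_right,cars_per_cap
US,United States,True,809
AUS,Australia,False,731
JPN,Japan,False,588
"""

# io.StringIO lets pd.read_csv() read from a string as if it were a file
cars_small = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(cars_small.index.tolist())
print(cars_small.columns.tolist())
```

Without `index_col=0`, the labels would be read as an ordinary column and the rows would get a default integer index.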


### Subsetting DataFrames with square brackets

In [17]:
# Use single square brackets to print out the country column of cars as a Pandas Series
cars['country']

US     United States
AUS        Australia
JPN            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object

In [18]:
# Use double square brackets to print out the country column of cars as a Pandas DataFrame
cars[['country']]

| | country |
|---|---|
| US | United States |
| AUS | Australia |
| JPN | Japan |
| IN | India |
| RU | Russia |
| MOR | Morocco |
| EG | Egypt |
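The difference between single and double square brackets is easiest to see by checking the type and shape of each result. The sketch below rebuilds a three-row version of `cars` so that it runs on its own; `as_series` and `as_frame` are illustrative names.

```python
import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan'],
     'cars_per_cap': [809, 731, 588]},
    index=['US', 'AUS', 'JPN'])

as_series = cars['country']     # single brackets: a one-dimensional Series
as_frame = cars[['country']]    # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```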


In [20]:
# Select the first 3 observations from cars and print them out
cars[:3]

| | country | drives_right | cars_per_cap |
|---|---|---|---|
| US | United States | True | 809 |
| AUS | Australia | False | 731 |
| JPN | Japan | False | 588 |


### Subsetting rows and columns with `loc` and `iloc`

With `loc` and `iloc` you can do practically any data selection operation on DataFrames.

* `loc` is label-based, which means that you specify rows and columns by their row and column labels.
* `iloc` is integer-position based, so you specify rows and columns by their integer position, counting from 0.
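One practical consequence of the two bullets above shows up when slicing: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the stop position, like ordinary Python slicing. The sketch below uses a four-row version of `cars` so that it is self-contained; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India'],
     'cars_per_cap': [809, 731, 588, 18]},
    index=['US', 'AUS', 'JPN', 'IN'])

# Label slice: both 'AUS' and 'JPN' are included
by_label = cars.loc['AUS':'JPN', 'country']

# Position slice: row 1 and row 2 are included, row 3 is not
by_position = cars.iloc[1:3, 0]

print(by_label.tolist())
print(by_position.tolist())
```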

In [24]:
# Use loc and iloc to select the observations for Australia and Egypt as a DataFrame
display(cars.loc[['AUS', 'EG'], :])
display(cars.iloc[[1,6], :])

| | country | drives_right | cars_per_cap |
|---|---|---|---|
| AUS | Australia | False | 731 |
| EG | Egypt | True | 45 |


| | country | drives_right | cars_per_cap |
|---|---|---|---|
| AUS | Australia | False | 731 |
| EG | Egypt | True | 45 |


In [25]:
# Print out a sub-DataFrame, containing the observations for Russia and Morocco and the columns country and drives_right
cars.loc[['RU', 'MOR'], ['country', 'drives_right']]

| | country | drives_right |
|---|---|---|
| RU | Russia | True |
| MOR | Morocco | True |


In [27]:
# Print out the cars_per_cap and drives_right columns as a DataFrame using iloc
cars.iloc[:, [2,1]]

| | cars_per_cap | drives_right |
|---|---|---|
| US | 809 | True |
| AUS | 731 | False |
| JPN | 588 | False |
| IN | 18 | False |
| RU | 200 | True |
| MOR | 70 | True |
| EG | 45 | True |
