# CH-2: Dictionaries and Data Structures

### Motivation for dictionaries

To see why dictionaries are useful, have a look at the two lists below. 

    countries contains the names of some European countries. 
    capitals lists the names of the corresponding capitals.

In [4]:
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Get index of 'germany': ind_ger
ind_ger = countries.index('germany')

# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])

berlin


Using a dictionary, however, the same goal can be achieved more directly. The general syntax is:

    my_dict = {
       "key1":"value1",
       "key2":"value2",
    }

In [6]:
europe = {
    'spain':'madrid',
    'france':'paris',
    'germany':'berlin',
    'norway':'oslo'
}

print(europe['germany'])
print(europe['norway'])

berlin
oslo
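
If the data already exists as two parallel lists, one way to get from the list version to the dictionary version is to pair the lists up. The following is a minimal sketch; the name `europe_from_lists` is only illustrative.

```python
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# zip() pairs each country with the capital at the same position,
# and dict() turns those pairs into key:value entries
europe_from_lists = dict(zip(countries, capitals))

print(europe_from_lists['germany'])
```

The lookup by key replaces the two-step `index()` and list access used with the separate lists.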


#### Access dictionary

If the keys of a dictionary are chosen wisely, accessing the values in a dictionary is easy and intuitive. 

In [7]:
# Print out the keys in europe

print(europe.keys())

dict_keys(['spain', 'france', 'germany', 'norway'])
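
Besides `keys()`, a few other access patterns are worth knowing. The sketch below assumes the same `europe` dictionary and shows what happens with a missing key; the variable names are illustrative.

```python
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}

# Square brackets raise KeyError when the key does not exist
missing = False
try:
    europe['italy']
except KeyError:
    missing = True

# get() returns a fallback value instead of raising
capital_or_default = europe.get('italy', 'unknown')

# items() yields (key, value) pairs in insertion order
pairs = list(europe.items())

print(missing, capital_or_default, pairs[0])
```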


#### Dictionary Manipulation

If you know how to access a dictionary, you can also **assign a new value** to it. To add a new key-value pair to europe you can use something like this:

    europe['iceland'] = 'reykjavik'

In [8]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Add italy to europe
europe['italy'] = 'rome'

# Print out italy in europe
print('italy' in europe)

True


In [9]:
# Add poland to europe
europe['poland'] = 'warsaw'

# Print europe
print(europe)

{'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}


We can also update values of the dictionary. 

In [10]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna' }

# Update capital of germany

europe['germany']='berlin'

# Remove australia
del europe['australia']

# Print europe
print(europe)

{'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}
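
The same update and removal can also be written with the `update()` and `pop()` methods. This is a small sketch on a shortened dictionary; the `'austria'` entry is added here purely for illustration and is not part of the original exercise.

```python
europe = {'spain': 'madrid', 'germany': 'bonn', 'australia': 'vienna'}

# update() changes existing keys and adds new ones in a single call
europe.update({'germany': 'berlin', 'austria': 'vienna'})

# pop() removes a key and returns its value; the second argument is a
# fallback that avoids a KeyError when the key is already gone
removed = europe.pop('australia', None)
removed_again = europe.pop('australia', None)

print(europe)
```

Unlike `del`, `pop()` with a fallback can be called safely even when the key might not be present.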


### Nested Dictionaries

Dictionaries can contain **key:value** pairs where the values are again **dictionaries**.

It's perfectly possible to **chain square brackets** to select elements.

    Example: To fetch the population for Spain from europe, you need:

        europe['spain']['population']

In [13]:
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }


# Print out the capital and population of France

print(europe['france']['capital'])
print(europe['france']['population'])

paris
66.03


In [15]:
# Create sub-dictionary data
data = {'capital':'rome','population':59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

# Print europe
print(europe)

{'spain': {'capital': 'madrid', 'population': 46.77}, 'france': {'capital': 'paris', 'population': 66.03}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'norway': {'capital': 'oslo', 'population': 5.084}, 'italy': {'capital': 'rome', 'population': 59.83}}
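
Printing the whole nested dictionary quickly becomes hard to read. A loop over `items()` is one way to walk through it; the sketch below uses a shortened copy of `europe`, and `summary` and `italy_capital` are illustrative names.

```python
europe = {
    'spain': {'capital': 'madrid', 'population': 46.77},
    'france': {'capital': 'paris', 'population': 66.03},
}

# Each value of the outer dictionary is itself a dictionary
summary = []
for country, info in europe.items():
    summary.append(f"{country}: {info['capital']}, {info['population']}")

# Chained get() calls avoid a KeyError when a country is missing
italy_capital = europe.get('italy', {}).get('capital')

print(summary)
print(italy_capital)
```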


### Dictionary to DataFrame

`Pandas` is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising!

The **DataFrame** is one of `Pandas'` most important data structures. It's basically **a way to store tabular data** where you can label the `rows` and the `columns`. 

**One way to build a DataFrame is from a dictionary.** This works well when the data set is small enough to type in; larger data sets are usually loaded from a **CSV** file instead.

In the exercises that follow you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

Three lists are defined in the script:

`names`, containing the country names for which data is available.

`dr`, a list of booleans that tells whether people drive on the right (`True`) or on the left (`False`) in the corresponding country.

`cpc`, the number of motor vehicles per 1000 people in the corresponding country.

Each dictionary key is a column label and each value is a list which contains the column elements.

In [135]:
# Pre-defined lists

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd
import pandas as pd

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {
    'country': names,
    'drives_right': dr,
    'cars_per_cap': cpc
}

# Build a DataFrame cars from my_dict: cars

cars = pd.DataFrame(my_dict)

# Print cars
cars

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45


Have you noticed that the `row labels` (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6?

To replace them with more meaningful labels, a list `row_labels` is defined below. You can use it to specify the row labels of the `cars` DataFrame by setting the DataFrame's index attribute, which you can access as `cars.index`.

In [90]:
print(cars.index)

RangeIndex(start=0, stop=7, step=1)


In [136]:
# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
cars

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JPN          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
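
As an alternative to assigning `cars.index` afterwards, the row labels can be passed when the DataFrame is built. The sketch below uses a three-row subset of the lists; `cars_small` is an illustrative name.

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan']
dr = [True, False, False]
cpc = [809, 731, 588]
row_labels = ['US', 'AUS', 'JPN']

# index= sets the row labels at construction time
cars_small = pd.DataFrame(
    {'country': names, 'drives_right': dr, 'cars_per_cap': cpc},
    index=row_labels,
)

print(cars_small)
```

Both approaches should give the same result; setting the index in the constructor simply saves a separate assignment step.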


### CSV to DataFrame

Putting data in a dictionary and then building a DataFrame works, **but it's not very efficient.**

**What if you're dealing with millions of observations?** 

In those cases, _the data is typically available as files with a **regular structure**_. 

One such file type is the CSV file, which is short for `"comma-separated values"`.




To import CSV data into Python as a Pandas DataFrame you can use `read_csv()`.

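Let's explore this function with a small data set on the BRICS countries, stored in a CSV file named `brics.csv`.

The cars data from the previous exercises is available the same way, in a CSV file named `cars.csv`, and is used in the exercises at the end.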

In [92]:
# Import pandas as pd
import pandas as pd

In [137]:
# Import the brics.csv data: brics
brics = pd.read_csv('brics.csv')

# Print out brics
brics

  Unnamed: 0       country    capital    area  population
0         BR        Brazil   Brasilia   8.516      200.40
1         RU        Russia     Moscow  17.100      143.50
2         IN         India  New Delhi   3.286     1252.00
3         CH         China    Beijing   9.597     1357.00
4         SA  South Africa   Pretoria   1.221       52.98


In [60]:
pwd

'C:\\Users\\bky\\Desktop\\Data Scientist'

In [138]:
# Import the brics.csv data again, using the first column as row labels: brics
brics = pd.read_csv('brics.csv', index_col=0)

# Print out brics
brics

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98
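
The effect of `index_col=0` can be reproduced without any file on disk. The sketch below reads from an in-memory string that is only a hypothetical stand-in for the first lines of `brics.csv`; the real file's exact layout is an assumption here.

```python
import io
import pandas as pd

# Hypothetical stand-in for brics.csv: the first header field is empty,
# so pandas names that column "Unnamed: 0" unless told otherwise
csv_text = """,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
"""

default_index = pd.read_csv(io.StringIO(csv_text))
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_index.columns))
print(list(labelled.index))
```

With `index_col=0` the first column is used as the row labels instead of appearing as a regular data column.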


### Index and Select Data


There are different ways to index and select data from the data frames.

    1. Square brackets
    2. Advanced methods
        I. loc
        II. iloc
        
#### 1. Square brackets

We index and select Pandas DataFrames in **many different ways**. 

The simplest, but not the most powerful way, is to use square brackets.

##### Pandas Series Vs DataFrame

    Pandas Series - single bracket  e.g. cars['cars_per_cap']
    
    Pandas DataFrame - double bracket e.g. cars[['cars_per_cap']]
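
The difference between the two bracket styles shows up in the type and shape of the result. This is a minimal sketch on a two-row DataFrame built by hand; `brics_small` is an illustrative name.

```python
import pandas as pd

brics_small = pd.DataFrame(
    {'country': ['Brazil', 'Russia'], 'capital': ['Brasilia', 'Moscow']},
    index=['BR', 'RU'],
)

as_series = brics_small['country']     # single brackets: 1-D Series
as_frame = brics_small[['country']]    # double brackets: 2-D DataFrame

print(as_series.shape, as_frame.shape)
```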

In [141]:
# Import brics data
import pandas as pd
brics = pd.read_csv('brics.csv', index_col = 0)

# Print out country column as Pandas Series
print(brics["country"], "\n")

print(type(brics["country"]))

BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object 

<class 'pandas.core.series.Series'>


In [142]:
# Print out country column as Pandas DataFrame
print(brics[["country"]], "\n")

print(type(brics[["country"]]))

         country
BR        Brazil
RU        Russia
IN         India
CH         China
SA  South Africa 

<class 'pandas.core.frame.DataFrame'>


In [143]:
# Print out DataFrame with country and capital columns
print(brics[["country", "capital"]], "\n\n")

print(type(brics[["country", "capital"]]))

         country    capital
BR        Brazil   Brasilia
RU        Russia     Moscow
IN         India  New Delhi
CH         China    Beijing
SA  South Africa   Pretoria 


<class 'pandas.core.frame.DataFrame'>


**Square brackets** can do more than just selecting columns. 

You can also use them `to get rows, or observations, from a DataFrame`. 

The following call selects the first five rows from the brics DataFrame:

        brics[0:5]

The `result` is another DataFrame containing `only the rows you specified`.

**PAY ATTENTION!**: 
    
    You can only select rows using square brackets if you specify a slice, like 0:4. 
    
    Also, you're using the integer indexes of the rows here, not the row labels!

In [144]:
# Import brics data

import pandas as pd
brics = pd.read_csv('brics.csv', index_col = 0)

#print all observations / rows
print(brics, "\n")
# Print out first 3 observations
print(brics[:3] , "\n\n")

# Print out third, fourth and fifth observation
print(brics[2:])

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98 

   country    capital    area  population
BR  Brazil   Brasilia   8.516       200.4
RU  Russia     Moscow  17.100       143.5
IN   India  New Delhi   3.286      1252.0 


         country    capital   area  population
IN         India  New Delhi  3.286     1252.00
CH         China    Beijing  9.597     1357.00
SA  South Africa   Pretoria  1.221       52.98
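
To see why a slice is needed for row selection, compare it with passing a single row label. The sketch below uses a small hand-built DataFrame; `brics_small` and `row_lookup_failed` are illustrative names.

```python
import pandas as pd

brics_small = pd.DataFrame(
    {'country': ['Brazil', 'Russia', 'India']},
    index=['BR', 'RU', 'IN'],
)

# A slice selects rows by position; the end position is excluded
first_two = brics_small[0:2]

# A bare label is looked up among the columns, so a row label fails
try:
    brics_small['RU']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

print(list(first_two.index), row_lookup_failed)
```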


### 2. Advanced methods  { loc &  iloc }
        
       
        
With `loc` and `iloc` you can do practically any data selection operation on DataFrames you can think of. 

`loc` is **label-based**, which means that you have to `specify rows and columns` based on their `row and column labels`. 

`iloc` is **integer-index based**, so you have to `specify rows and columns` by their `integer index`, as you did with the row slices in the square brackets section.
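
One practical difference between the two is how slices behave: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. A minimal sketch, using a hand-built DataFrame with the illustrative name `brics_small`:

```python
import pandas as pd

brics_small = pd.DataFrame(
    {'country': ['Brazil', 'Russia', 'India', 'China'],
     'capital': ['Brasilia', 'Moscow', 'New Delhi', 'Beijing']},
    index=['BR', 'RU', 'IN', 'CH'],
)

by_label = brics_small.loc['RU':'IN']   # label slice: 'IN' is included
by_position = brics_small.iloc[1:3]     # position slice: position 3 is excluded

print(by_label.equals(by_position))
```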


### Row Access loc

In [149]:
import pandas as pd
brics = pd.read_csv('brics.csv', index_col = 0)

#print Russia row as Pandas Series
print(brics.loc["RU"], "\n\n")

#print Russia row as DataFrame
print(brics.loc[["RU"]])

country       Russia
capital       Moscow
area            17.1
population     143.5
Name: RU, dtype: object 


   country capital  area  population
RU  Russia  Moscow  17.1       143.5


### Multiple-Row Access loc

In [163]:
import pandas as pd
brics = pd.read_csv('brics.csv', index_col = 0)

brics.loc[["RU", "IN", "CH"]]

   country    capital    area  population
RU  Russia     Moscow  17.100       143.5
IN   India  New Delhi   3.286      1252.0
CH   China    Beijing   9.597      1357.0


### Row & Column loc

In [165]:
# Row & Column loc

brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

   country    capital
RU  Russia     Moscow
IN   India  New Delhi
CH   China    Beijing


### Row Access iloc

In [169]:
brics.iloc[[1]] # is same as brics.loc[["RU"]]

   country capital  area  population
RU  Russia  Moscow  17.1       143.5


In [156]:
brics.loc[["RU"]]

   country capital  area  population
RU  Russia  Moscow  17.1       143.5


### Multiple Row Access iloc

In [187]:
brics.iloc[[1,2,3]]

# same as:

#brics.loc[["RU", "IN", "CH"]]

   country    capital    area  population
RU  Russia     Moscow  17.100       143.5
IN   India  New Delhi   3.286      1252.0
CH   China    Beijing   9.597      1357.0


### Row & Column iloc

In [160]:
brics.iloc[[1,2,3], [0, 1]] 

# same as:

#brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

   country    capital
RU  Russia     Moscow
IN   India  New Delhi
CH   China    Beijing


In [167]:
brics.iloc[:, [0,1]] 

# same as:
#brics.loc[:, ["country", "capital"]]

         country    capital
BR        Brazil   Brasilia
RU        Russia     Moscow
IN         India  New Delhi
CH         China    Beijing
SA  South Africa   Pretoria


### Exercises

In [191]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

print(cars)

     cars_per_cap        country  drives_right
US            809  United States          True
AUS            31      Australia         False
JPN           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


In [200]:
# Print out observation for Japan as a Series

print(cars.loc["JPN"])
# OR
# print(cars.iloc[2])

cars_per_cap      588
country         Japan
drives_right    False
Name: JPN, dtype: object


In [201]:
# observations for Australia and Egypt as a DataFrame

print(cars.loc[["AUS", "EG"]])

# OR
#print(cars.iloc[[1,6]])

     cars_per_cap    country  drives_right
AUS            31  Australia         False
EG             45      Egypt          True


In [202]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

print(cars)

     cars_per_cap        country  drives_right
US            809  United States          True
AUS            31      Australia         False
JPN           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


In [203]:
# Print out drives_right value of Morocco
print(cars.loc[["MOR"],["drives_right"]])

     drives_right
MOR          True
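
The double brackets above return a one-by-one DataFrame. If only the value itself is needed, plain labels without the inner lists can be used instead. A minimal sketch on a two-row DataFrame with the illustrative name `cars_small`:

```python
import pandas as pd

cars_small = pd.DataFrame(
    {'country': ['Russia', 'Morocco'], 'drives_right': [True, True]},
    index=['RU', 'MOR'],
)

as_frame = cars_small.loc[['MOR'], ['drives_right']]   # 1x1 DataFrame
as_scalar = cars_small.loc['MOR', 'drives_right']      # single value

print(as_frame.shape, bool(as_scalar))
```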


In [208]:
# Print sub-DataFrame, 
# containing the observations for Russia and Morocco 
# and the columns country and drives_right

print(cars.loc[["RU","MOR"]  ,  ["country","drives_right"]])

#OR
#print(cars.iloc[[4,5],[1,2]])

     country  drives_right
RU    Russia          True
MOR  Morocco          True


In [211]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
print(cars)

     cars_per_cap        country  drives_right
US            809  United States          True
AUS            31      Australia         False
JPN           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


In [215]:
# Print out drives_right column as Series
print(cars.loc[:, "drives_right"])

#OR
#print(cars.iloc[:, 2])

US      True
AUS    False
JPN    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool

In [219]:
# Print out drives_right column as DataFrame
print(cars.loc[:, ["drives_right"]])

#OR
#print(cars.iloc[:,[2]])

     drives_right
US           True
AUS         False
JPN         False
IN          False
RU           True
MOR          True
EG           True


In [222]:
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ["cars_per_cap", "drives_right"]])

#OR
#print(cars.iloc[:, [0, 2]])

     cars_per_cap  drives_right
US            809          True
AUS            31         False
JPN           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True
