# 🌎 GPGN268 - Geophysical Data Analysis
- **Instructor:** Bia Villas Boas  
- **TA:** Seunghoo Kim

## Lecture 14: More on Pandas, dictionaries, and file I/O

#### 🎯 Learning Objectives from this Lecture:
- Select individual values from a Pandas dataframe.
- Select entire rows or entire columns from a dataframe.
- Select a subset of both rows and columns from a dataframe in a single operation.
- Select a subset of a dataframe by a single Boolean criterion.

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Dictionaries
So far we've worked with NumPy arrays and Python lists, but there is one important structure in Python that we haven't discussed: dictionaries. A dictionary is an extremely useful data structure that maps keys to values.

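Dictionaries are accessed by key, not by position! (Since Python 3.7 they remember insertion order, but you still cannot index them with an integer position.)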

In [7]:
# Different ways to create dictionaries

# Using curly brackets-> key: value
band_facts = {'name': 'Coldplay', 'studio albums': 9}

# Using the function dict -> key=value
major_facts = dict(name='Geophysics', tracks=6)

print(type(band_facts))
print(band_facts)
print(type(major_facts))
print(major_facts)

<class 'dict'>
{'name': 'Coldplay', 'studio albums': 9}
<class 'dict'>
{'name': 'Geophysics', 'tracks': 6}


To access values in a dictionary, you can't use a positional index the way you would with a list. You access values using their keys:

In [8]:
band_facts['name']

'Coldplay'

Square brackets [...] are Python for “get item” in many different contexts.

You can check for the presence of a key with the `in` operator:

In [10]:
# check if the dictionary major_facts has the key "tracks"
'tracks' in major_facts

True

In [11]:
# check if the dictionary major_facts has the key "students"
'students' in major_facts

False

If you try to access a key that doesn't exist in the dictionary, you will get an error:

In [13]:
# try to access a nonexistent key
band_facts['grammys']

KeyError: 'grammys'
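If you would rather get a fallback value than an error, the built-in `dict.get` method is one option. The sketch below rebuilds a small copy of `band_facts` so it runs on its own; the fallback value `0` is just an illustrative choice.

```python
# Stand-alone copy of band_facts as it looks before 'grammys' is added
band_facts = {'name': 'Coldplay', 'studio albums': 9}

# .get returns None for a missing key instead of raising KeyError
grammys = band_facts.get('grammys')
print(grammys)

# A second argument sets the value returned when the key is missing
# (0 is an illustrative fallback, not a real Grammy count)
grammys_or_zero = band_facts.get('grammys', 0)
print(grammys_or_zero)
```

Note that `.get` does not add the key to the dictionary; it only reads.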

You can add new keys to a dictionary:

In [15]:
band_facts['grammys'] = 7
band_facts

{'name': 'Coldplay', 'studio albums': 9, 'grammys': 7}

A very useful trick is to iterate over keys:

In [16]:
# iterate over keys
for k in band_facts:
    print(k, band_facts[k])

name Coldplay
studio albums 9
grammys 7
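The same loop can also be written with the `items` method, which hands back each key together with its value. This is a minimal sketch that rebuilds `band_facts` so it is self-contained; the list `pairs` is a hypothetical helper used only to collect what the loop visits.

```python
# Stand-alone copy of band_facts after 'grammys' was added
band_facts = {'name': 'Coldplay', 'studio albums': 9, 'grammys': 7}

pairs = []  # hypothetical helper: collects the (key, value) pairs we visit
for key, value in band_facts.items():
    print(key, value)
    pairs.append((key, value))
```

The pairs come out in the order in which the keys were inserted.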


We were talking about `pandas`, so why have we suddenly changed to dictionaries? Well, it turns out dictionaries are fundamental to DataFrames. Let's see how.

### Creating DataFrames from Dictionaries

In [18]:
# first we create a dictionary
data = {'mass': [0.3e24, 4.87e24, 5.97e24],       # kg
        'diameter': [4879e3, 12_104e3, 12_756e3], # m
        'rotation_period': [1407.6, np.nan, 23.9] # h
       }
df = pd.DataFrame(data, index=['Mercury', 'Venus', 'Earth'])
df

| | mass | diameter | rotation_period |
|---|---|---|---|
| Mercury | 3e+23 | 4879000.0 | 1407.6 |
| Venus | 4.87e+24 | 12104000.0 | NaN |
| Earth | 5.97e+24 | 12756000.0 | 23.9 |


Pandas handles missing data gracefully: missing entries are stored as NaN and are tracked through calculations.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Mercury to Earth
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   mass             3 non-null      float64
 1   diameter         3 non-null      float64
 2   rotation_period  2 non-null      float64
dtypes: float64(3)
memory usage: 96.0+ bytes
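To see what "keeping track" of missing data can look like in practice, here is a small sketch using only the `rotation_period` column from the planets example. It assumes the default behavior of `Series.mean`, which skips NaN values unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

# Only the rotation_period column (hours) from the planets example
df = pd.DataFrame(
    {'rotation_period': [1407.6, np.nan, 23.9]},
    index=['Mercury', 'Venus', 'Earth'],
)

# isna flags the missing entries with True
missing = df['rotation_period'].isna()
print(missing)

# By default, mean ignores NaN and averages the two valid values
mean_skip = df['rotation_period'].mean()

# With skipna=False the NaN propagates into the result
mean_nan = df['rotation_period'].mean(skipna=False)
print(mean_skip, mean_nan)
```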


As a recap, we can look at summary statistics using the `describe` method:

In [21]:
df.describe()

| | mass | diameter | rotation_period |
|---|---|---|---|
| count | 3.0 | 3.0 | 2.0 |
| mean | 3.713333e+24 | 9913000.0 | 715.75 |
| std | 3.006765e+24 | 4371744.0 | 978.423653 |
| min | 3e+23 | 4879000.0 | 23.9 |
| 25% | 2.585e+24 | 8491500.0 | 369.825 |
| 50% | 4.87e+24 | 12104000.0 | 715.75 |
| 75% | 5.42e+24 | 12430000.0 | 1061.675 |
| max | 5.97e+24 | 12756000.0 | 1407.6 |


We can get a single column, which will be returned as a Pandas Series, using Python's getitem syntax on the DataFrame object.

In [22]:
df['mass']

Mercury    3.000000e+23
Venus      4.870000e+24
Earth      5.970000e+24
Name: mass, dtype: float64

In [23]:
# A Series is a one-dimensional pandas object
type(df['mass'])

pandas.core.series.Series

We can also select a specific column using attribute ("dot") syntax.

In [24]:
df.mass

Mercury    3.000000e+23
Venus      4.870000e+24
Earth      5.970000e+24
Name: mass, dtype: float64
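One caveat with dot syntax: it only works when the column name is a valid Python identifier. The sketch below uses a hypothetical `planets` frame whose second column is named `'rotation period'` (with a space, unlike the `rotation_period` column above) to show that bracket syntax still works in that case.

```python
import pandas as pd

# Hypothetical frame: same Mercury/Earth values as above,
# but one column name contains a space
planets = pd.DataFrame(
    {'mass': [0.3e24, 5.97e24],           # kg
     'rotation period': [1407.6, 23.9]},  # h
    index=['Mercury', 'Earth'],
)

# Dot syntax works for a simple name like 'mass'
mass = planets.mass

# planets.rotation period would be a SyntaxError,
# so a name with a space needs bracket syntax
rotation = planets['rotation period']
print(rotation)
```

Bracket syntax works for every column name, so it is the safer default when in doubt.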

We've already seen that we can index using the `loc` (label-based) and `iloc` (position-based) indexers.

In [25]:
df

| | mass | diameter | rotation_period |
|---|---|---|---|
| Mercury | 3e+23 | 4879000.0 | 1407.6 |
| Venus | 4.87e+24 | 12104000.0 | NaN |
| Earth | 5.97e+24 | 12756000.0 | 23.9 |


In [27]:
# Select the row "Earth" and column "mass"
df.loc['Earth', 'mass']

5.97e+24
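The other selections listed in the learning objectives follow the same pattern. The sketch below rebuilds the planets DataFrame so it runs on its own; the `1e24` kg threshold is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'mass': [0.3e24, 4.87e24, 5.97e24],        # kg
     'diameter': [4879e3, 12_104e3, 12_756e3],  # m
     'rotation_period': [1407.6, np.nan, 23.9]  # h
     },
    index=['Mercury', 'Venus', 'Earth'],
)

# Same value as df.loc['Earth', 'mass'], but by integer position:
# row 2 (Earth), column 0 (mass)
earth_mass = df.iloc[2, 0]

# A subset of rows and columns in a single operation, by label
subset = df.loc[['Venus', 'Earth'], ['mass', 'diameter']]

# Boolean criterion: keep only the planets heavier than 1e24 kg
# (the threshold is an arbitrary choice for this example)
heavy = df[df['mass'] > 1e24]
print(heavy)
```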

### Merging, combining, and joining data

Pandas allows us to easily merge, combine, and manipulate DataFrames. Consider the DataFrame below:

In [33]:
data = {'lead singer': ['Chris Martin', 'Freddie Mercury', 'Julian Casablancas'], 
        'albums': [9, 15, 6]
       }
df1 = pd.DataFrame(data, index=['Coldplay', 'Queen', 'The Strokes'])
df1

| | lead singer | albums |
|---|---|---|
| Coldplay | Chris Martin | 9 |
| Queen | Freddie Mercury | 15 |
| The Strokes | Julian Casablancas | 6 |


In [34]:
other_data = {'year created': [1997, 1970, 1998],      
        'grammys': [7, 0, 1]
       }
df2 = pd.DataFrame(other_data, index=['Coldplay', 'Queen', 'The Strokes'])
df2

| | year created | grammys |
|---|---|---|
| Coldplay | 1997 | 7 |
| Queen | 1970 | 0 |
| The Strokes | 1998 | 1 |


Now, we would like to merge the two DataFrames, keeping the index (rows) and merging along the columns (axis=1). First we create a list with the DataFrames we want to "stick together":

In [35]:
dfs = [df1, df2]

Now we use the function `concat`, which stands for concatenate.

In [37]:
# We want to concatenate the DataFrames in the list "dfs" along the column axis (axis=1)
df_combined = pd.concat(dfs, axis=1)
df_combined

| | lead singer | albums | year created | grammys |
|---|---|---|---|---|
| Coldplay | Chris Martin | 9 | 1997 | 7 |
| Queen | Freddie Mercury | 15 | 1970 | 0 |
| The Strokes | Julian Casablancas | 6 | 1998 | 1 |
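For comparison, `concat` can also stack DataFrames along the rows (axis=0, which is the default). The sketch below is an illustration only: `top` and `bottom` are hypothetical names for two pieces of the `albums` column, split up so they can be stacked back together.

```python
import pandas as pd

# Hypothetical pieces: the albums column split into two row blocks
top = pd.DataFrame({'albums': [9, 15]}, index=['Coldplay', 'Queen'])
bottom = pd.DataFrame({'albums': [6]}, index=['The Strokes'])

# axis=0 is the default, so the rows of bottom are appended below top
stacked = pd.concat([top, bottom])
print(stacked)
```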


Now, let's suppose we have some more Coldplay facts.

In [38]:
more_facts = {'number of members': 4,      
        'hometown': 'London'  
       }
df3 = pd.DataFrame(more_facts, index=['Coldplay'])
df3

| | number of members | hometown |
|---|---|---|
| Coldplay | 4 | London |


Now we still want to add that to our main DataFrame, but the row index of `df3` only has "Coldplay". No problem! `pandas` will add the new fields to "Coldplay" and fill the rest with NaNs. Note that `number of members` becomes a float column (4.0), because NaN is a floating-point value.

In [39]:
df_all = pd.concat([df_combined, df3], axis=1)
df_all

| | lead singer | albums | year created | grammys | number of members | hometown |
|---|---|---|---|---|---|---|
| Coldplay | Chris Martin | 9 | 1997 | 7 | 4.0 | London |
| Queen | Freddie Mercury | 15 | 1970 | 0 | NaN | NaN |
| The Strokes | Julian Casablancas | 6 | 1998 | 1 | NaN | NaN |
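If you would rather keep only the rows that appear in every DataFrame instead of filling with NaNs, `concat` accepts a `join` argument. The sketch below uses a trimmed-down copy of `df_combined` (two columns only) so it is self-contained; `outer` and `inner` are hypothetical variable names.

```python
import pandas as pd

# Trimmed-down stand-in for df_combined (only two of its columns)
df_combined = pd.DataFrame(
    {'albums': [9, 15, 6], 'grammys': [7, 0, 1]},
    index=['Coldplay', 'Queen', 'The Strokes'],
)
df3 = pd.DataFrame({'number of members': 4, 'hometown': 'London'},
                   index=['Coldplay'])

# Default join='outer': keep every row, fill the gaps with NaN
outer = pd.concat([df_combined, df3], axis=1)

# join='inner': keep only the rows present in both DataFrames
inner = pd.concat([df_combined, df3], axis=1, join='inner')
print(inner)
```

With `join='inner'` only the "Coldplay" row survives, because it is the only index label shared by both DataFrames.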


### Saving your data

If you did quite a bit of manipulation on your raw data, it might be smart to save a processed version of it so you don't have to repeat all the steps in your code every time. It is straightforward to save a `pandas` `DataFrame` to a text file, such as a `csv`. The syntax is:

```python
df.to_csv('path/to/output/file_name.csv')
```

For example, the line below saves our DataFrame on the Desktop with the name "band_facts.csv".

In [41]:
df_all.to_csv('/Users/bia/Desktop/band_facts.csv')
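To load the saved file back later, `pd.read_csv` can be used with `index_col=0` so the first column is restored as the row index. The sketch below is a minimal round trip under a few assumptions: it writes to a temporary directory rather than the Desktop, and `df_small` is a hypothetical two-column stand-in for `df_all`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_all with two of its columns
df_small = pd.DataFrame(
    {'albums': [9, 15, 6], 'grammys': [7, 0, 1]},
    index=['Coldplay', 'Queen', 'The Strokes'],
)

# Write to a temporary directory (an assumption made so the example
# runs anywhere), then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'band_facts.csv')
    df_small.to_csv(path)
    # index_col=0 turns the first column back into the row index
    df_back = pd.read_csv(path, index_col=0)

print(df_back)
```

Without `index_col=0`, the band names would come back as an ordinary column labeled `Unnamed: 0`.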