
## Python Data Science Tools - Data Models

**What are data models and why are they useful?**



In [9]:
# Import the libraries we are going to use
import pandas as pd
import xarray as xr
import iris

### Tabular Data - Pandas

**References:<br>Pandas documentation -> https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html**

All of us will be familiar with tabular data - values arranged into rows and columns to convey relations between data.

|Name|Age|Height|
|---|---|---|
|Alice|28|1.86|
|Bob|46|1.75|
|Charlie|99|1.93|

We may be familiar with tools that allow us to work with tabular data, such as Excel, MATLAB, databases (MongoDB, SQL), or pen & paper.

A Python data structure we could use to represent a table is a dictionary, where the `keys` are the column labels and each of the `values` is a list of the entries in that column:

In [11]:
table = {'Name':  ['Alice', 'Bob', 'Charlie'],
         'Age':   [28, 46, 99],
         'Height':[1.86, 1.75, 1.93]}
table

{'Name': ['Alice', 'Bob', 'Charlie'],
 'Age': [28, 46, 99],
 'Height': [1.86, 1.75, 1.93]}

This is a perfectly valid way to store the data contained in a table, but it is not particularly convenient for interacting with the data in the ways we expect of tabular data.

For instance, selecting a row from the table represented with the dictionary is not simple:

In [7]:
row1 = [table['Name'][1],
        table['Age'][1],
        table['Height'][1]]

row1

['Bob', 46, 1.75]

This is a bit cumbersome, and for a table with 20 columns it would get frustrating and messy.

So we could implement a function to get a row for us:

In [8]:
def get_row(i, table):
    row = []
    for key in table.keys():
        row.append(table[key][i])
    return row

get_row(1, table)

['Bob', 46, 1.75]

This works and will scale to a table with 20 columns just fine, but it involved engineering, and I'm lazy: I don't want to engineer every feature of a table each time I need it.

Thankfully [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) is a Python library which allows us to work with tabular data using a data model called a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame):

In [10]:
df = pd.DataFrame(table)
df

| |Name|Age|Height|
|---|---|---|---|
|0|Alice|28|1.86|
|1|Bob|46|1.75|
|2|Charlie|99|1.93|


Already we can see that this is a much more familiar model for interacting with tabular data than a dictionary. On top of looking good, Pandas implements a wide range of features and functionality that make working with tabular data simpler and more powerful*.

_*Pandas is built on NumPy and compiled C/Cython code, so vectorised operations on huge tables of data run far faster than equivalent pure-Python loops._

**What can we do with a Pandas DataFrame?**

#### Indexing rows and columns:

This is easy with a DataFrame and fundamental to interacting with it. We can select columns or rows either by index or values, in a number of ways:

In [35]:
df

| |Name|Age|Height|
|---|---|---|---|
|0|Alice|28|1.86|
|1|Bob|46|1.75|
|2|Charlie|99|1.93|


In [14]:
# Columns can be indexed like a Python dictionary
df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [16]:
# But columns are also attributes of the DataFrame object
df.Name

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [24]:
# Rows can be retrieved by integer position using iloc[]
df.iloc[1]

Name       Bob
Age         46
Height    1.75
Name: 1, dtype: object

In [38]:
# Or retrieved by index label using loc[] (here this returns the same as iloc[], because the default index labels are 0, 1, 2)
df.loc[1]

Name       Bob
Age         46
Height    1.75
Name: 1, dtype: object

In [39]:
# Rows can also be selected based on their values
df.loc[df.Name == 'Bob']

| |Name|Age|Height|
|---|---|---|---|
|1|Bob|46|1.75|
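
Selecting rows by value is not limited to equality checks. As a minimal sketch (rebuilding the same table so the cell runs on its own; the variable names `over_40` and `tall_and_over_40` are illustrative only), comparisons on numeric columns can be used and combined in the same way:

```python
import pandas as pd

# Same table as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 46, 99],
                   'Height': [1.86, 1.75, 1.93]})

# Rows where a numeric condition holds
over_40 = df.loc[df['Age'] > 40]

# Conditions combine with & (and) and | (or);
# each condition needs its own parentheses
tall_and_over_40 = df.loc[(df['Age'] > 40) & (df['Height'] > 1.8)]
```

Each comparison produces a boolean Series with one entry per row, and `loc[]` keeps the rows where that entry is `True`.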


The index of the table is currently an integer, but we can set the index of the table to whatever we want:

In [40]:
df_name = df.set_index('Name')
df_name

|Name|Age|Height|
|---|---|---|
|Alice|28|1.86|
|Bob|46|1.75|
|Charlie|99|1.93|


In [41]:
# Now we have changed the index, we can use loc[] to get a row based on a Name value
df_name.loc['Bob']

Age       46.00
Height     1.75
Name: Bob, dtype: float64
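
Note that `Age` is displayed as `46.00` in the row above: a row is returned as a single Series, so the integer age and the float height are converted to one common type (`float64`). A small sketch of how to get an individual value with its original type, and how to undo `set_index` (the variable names `bob_age` and `restored` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 46, 99],
                   'Height': [1.86, 1.75, 1.93]})
df_name = df.set_index('Name')

# Selecting a whole row mixes int and float values, so the result is float64
row_dtype = df_name.loc['Bob'].dtype

# Selecting a single cell with loc[row_label, column_label] keeps the integer
bob_age = df_name.loc['Bob', 'Age']

# reset_index() turns the index back into an ordinary column
restored = df_name.reset_index()
```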

#### Sort


- name
- height
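
As a minimal sketch of sorting by name and by height, `sort_values()` returns a new, sorted DataFrame and leaves the original untouched (the variable names `by_name` and `by_height` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 46, 99],
                   'Height': [1.86, 1.75, 1.93]})

# Sort by name in reverse alphabetical order
by_name = df.sort_values('Name', ascending=False)

# Sort by height, smallest first (the default is ascending=True)
by_height = df.sort_values('Height')
```

The original index labels travel with their rows, so the sorted tables still show which row each person came from.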

#### Stats
- average age
- average height
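
A short sketch of the averages listed above: each column is a Series with its own statistical methods, and `describe()` summarises all numeric columns at once (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 46, 99],
                   'Height': [1.86, 1.75, 1.93]})

# Mean of a single column
mean_age = df['Age'].mean()
mean_height = df['Height'].mean()

# Count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
```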

#### Extend
- join two tables (more people)
- more columns
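
One possible way to extend the table, sketched under some assumptions: the extra people (`Dana`, `Eve`) and the derived column `Over_40` are made up purely for illustration. `pd.concat()` stacks two tables with the same columns, and assigning to a new column label adds a column:

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 46, 99],
                   'Height': [1.86, 1.75, 1.93]})

# A second table with the same columns (hypothetical people)
more_people = pd.DataFrame({'Name':   ['Dana', 'Eve'],
                            'Age':    [35, 52],
                            'Height': [1.68, 1.80]})

# Stack the two tables; ignore_index=True renumbers the rows 0..4
everyone = pd.concat([df, more_people], ignore_index=True)

# Add a new column computed from an existing one
everyone['Over_40'] = everyone['Age'] > 40
```

For combining tables that share a key column rather than the same set of columns, `pd.merge()` is the usual tool.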

#### Plot
- histogram
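
A minimal histogram sketch, assuming Matplotlib is installed (Pandas delegates its plotting to it). The choice of column, the number of bins and the title are arbitrary illustrations:

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 46, 99],
                   'Height': [1.86, 1.75, 1.93]})

# Plot the distribution of one column; returns a Matplotlib Axes object
ax = df['Age'].plot.hist(bins=5, title='Distribution of ages')
ax.set_xlabel('Age')
```

In a notebook the figure is displayed below the cell; the returned `ax` can be used for further Matplotlib customisation.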

And much, much more. Please see this [list of example notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#pandas-for-data-analysis) and the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html).

### Gridded Data - Iris / Xarray
