<img src=images/data_models.png width=1000/>

## Python Data Science Tools - Data Models

**What are data models and why are they useful?**

"*A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.*" - Wikipedia

Basically, a data model holds data in an object that is easier and more convenient to use than raw 1s and 0s. This can include storing metadata with the data, presenting the data through interfaces that make it easier to understand, and automatically performing functions for us that would otherwise require a lot of engineering.

### Data - Python

Python has a number of basic data structures that allow us to store data in the memory of the computer:

- [Lists] = Mutable series of data
- (Tuples) = Immutable series of data
- {Dict:onaries} = Key:Value pairs

For example:

In [None]:
lst = [279.3, 284.0, 290.1, 284.1]
lst

In [None]:
tup = ('Wales', 'Scotland', 'England', 'Northern Ireland')
tup

In [None]:
dictionary = {'foo': 7, 'bar': 'Closed'}
dictionary

All these data structures contain data, but without context do we know enough about what these values are? <br>Is `lst` a list of temperature values or latitude coordinates? Is `tup` an immutable list of nations or the names of the children in an unfortunate family?

**Without metadata and some mapping of how the values relate to one another, the data is not very useful to us.**
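To make this concrete, here is a minimal sketch of what bundling values with metadata might look like in plain Python. The `LabelledSeries` class, its field names, and the choice of kelvin as the unit are all hypothetical and exist only for illustration; they are not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class LabelledSeries:
    """Hypothetical minimal data model: values plus descriptive metadata."""
    values: list
    name: str
    units: str
    attributes: dict = field(default_factory=dict)

    def describe(self):
        # Present the values together with the context needed to read them
        return f"{self.name} [{self.units}]: {self.values}"


# Assumption for this sketch: the numbers are air temperatures in kelvin
temps = LabelledSeries(values=[279.3, 284.0, 290.1, 284.1],
                       name='air_temperature',
                       units='K',
                       attributes={'source': 'made-up example values'})
print(temps.describe())
```

Even this tiny wrapper answers the "temperature or latitude?" question, which is the job the libraries below do far more thoroughly.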

### Tabular Data - Pandas

**References:<br>Pandas documentation -> https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html**

All of us will be familiar with tabular data - values arranged into rows and columns to convey relations between data.

|Name|Age|Height|
|---|---|---|
|Alice|28|1.86|
|Bob|26|1.75|
|Charlie|99|1.83|

We may be familiar with tools that allow us to work with tabular data, such as Excel, MATLAB, databases (MongoDB, SQL), or pen & paper.

A Python data structure we could use to represent a table is a dictionary, where the `keys` are the column labels and the `values` are a list of the values in each column:

In [None]:
table = {'Name':  ['Alice', 'Bob', 'Charlie'],
         'Age':   [28, 26, 99],
         'Height':[1.86, 1.75, 1.83]}
table

This is a perfectly correct way to store the data contained in a table, but it is not particularly useful for interacting with the data in the ways we expect for tabular data.

For instance, selecting a row from the table represented with the dictionary is not simple:

In [None]:
row1 = [table['Name'][1],
        table['Age'][1],
        table['Height'][1]]

row1

This is a bit cumbersome and for a table with 20 columns would get frustrating and messy.

So we could implement a function to get a row for us:

In [None]:
def get_row(i, table):
    row = []
    for key in table.keys():
        row.append(table[key][i])
    return row

get_row(1, table)

This works and will scale to a table with 20 columns just fine, but it involved engineering, and I'm lazy, so I don't want to engineer every feature of a table whenever I need it.
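To give a sense of how quickly that engineering adds up, here is a sketch of one more helper we would have to write by hand: selecting the rows that match a condition. The `filter_rows` function is hypothetical and written only for this illustration.

```python
table = {'Name':   ['Alice', 'Bob', 'Charlie'],
         'Age':    [28, 26, 99],
         'Height': [1.86, 1.75, 1.83]}


def filter_rows(table, column, predicate):
    """Return a new dict-of-lists table keeping rows where predicate(value) is True."""
    # Find the positions of the rows that satisfy the condition
    keep = [i for i, value in enumerate(table[column]) if predicate(value)]
    # Rebuild every column using only those positions
    return {key: [values[i] for i in keep] for key, values in table.items()}


under_30 = filter_rows(table, 'Age', lambda age: age < 30)
print(under_30)
```

Sorting, joining and summarising would each need another helper like this one.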

Thankfully [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) is a Python library which allows us to work with tabular data using a data model called a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame):

In [None]:
import pandas as pd

df = pd.DataFrame(table)
df

Already we can see that this is a much more familiar model for interacting with tabular data than a dictionary. On top of being pretty, Pandas has implemented a whole spectrum of features and functionality that make working with tabular data simpler and more powerful*.

_*Pandas is underpinned by some very fast compiled code, which helps operations on huge tables of data run quickly._

**What can we do with a Pandas DataFrame?**

#### Indexing rows and columns

This is easy with a DataFrame and fundamental to interacting with it. We can select columns or rows either by index or values, in a number of ways:

In [None]:
df

In [None]:
# Columns can be indexed like a Python dictionary
df['Name']

In [None]:
# But columns are also attributes of the DataFrame object
df.Name

In [None]:
# Rows can be retrieved by index using iloc[]
df.iloc[1]

In [None]:
# Or retrieved by index value using loc[] (which in this case returns the same as iloc[])
df.loc[1]

In [None]:
# Rows can also be selected based on their values
df.loc[df.Name == 'Bob']

The index of the table is currently an integer, but we can set the index of the table to whatever we want:

In [None]:
df_name = df.set_index('Name')
df_name

In [None]:
# Now we have changed the index, we can use loc[] to get a row based on a Name value
df_name.loc['Bob']
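The selection methods above can also be combined. The sketch below rebuilds the same example table so that it runs on its own, then shows selecting a single value, several columns at once, and rows that match more than one condition.

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 26, 99],
                   'Height': [1.86, 1.75, 1.83]})

# loc[] accepts a row label and a column label together
bob_height = df.loc[1, 'Height']

# Passing a list of column labels returns a smaller DataFrame
names_and_ages = df[['Name', 'Age']]

# Conditions are combined with & (and) or | (or);
# each condition needs its own parentheses
tall_and_young = df.loc[(df['Height'] > 1.8) & (df['Age'] < 30)]
print(tall_and_young)
```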

#### Sort rows

Pandas also gives us the power to sort a DataFrame into orders that are easier to work with. The table is already sorted in alphabetical order for names, but we can also order it according to index, heights or age:

In [None]:
# We can sort according to index, including in reverse order
df.sort_index(ascending=False)

In [None]:
# We can also sort by the values of any column
df.sort_values(by="Age")

In [None]:
# And of course reverse the order
df.sort_values(by="Height", ascending=False)
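Sorting by more than one column is also possible. The sketch below uses a slightly different, made-up table with repeated ages so that the second sort key has something to do.

```python
import pandas as pd

# Hypothetical table with tied ages, invented for this example
people = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie', 'Dana'],
                       'Age':    [28, 26, 28, 26],
                       'Height': [1.86, 1.75, 1.83, 1.60]})

# Sort by Age (youngest first), then by Height (tallest first) within each age
sorted_people = people.sort_values(by=['Age', 'Height'],
                                   ascending=[True, False])
print(sorted_people)
```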

#### Stats

Pandas makes it easy to perform statistical operations on DataFrames:

In [None]:
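# Calculate the mean of all the numeric columns
# (numeric_only=True skips the text column 'Name', which recent Pandas versions would otherwise reject)
df.mean(numeric_only=True)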
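A few other common statistical calls, sketched on the same example table (rebuilt here so the cell runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 26, 99],
                   'Height': [1.86, 1.75, 1.83]})

# describe() summarises the numeric columns (count, mean, std, min, quartiles, max)
summary = df.describe()

# Statistics can be computed for a single column
oldest = df['Age'].max()
mean_height = df['Height'].mean()

# agg() applies several statistics at once
min_max = df[['Age', 'Height']].agg(['min', 'max'])
print(summary)
print(min_max)
```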

#### Extend

Pandas makes it easy to extend DataFrames by adding columns, concatenating or merging them:

In [None]:
table2 = {'Name': ['Daniel', 'Evan', 'Fred'],
          'Age': [43, 15, 31], 
          'Height': [1.53, 1.89, 1.56]}

df2 = pd.DataFrame(table2)
df2

In [None]:
# It is easy to concatenate DataFrames
pd.concat([df, df2])

In [None]:
# It is possible to reset the indices when concatenating
pd.concat([df, df2], ignore_index=True)

In [None]:
# It is also easy to add a new column to a DataFrame with insert()
df2.insert(loc=2, column='Weight', value=[89, 71, 74])
df2
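Besides `insert()`, columns can be added by plain assignment or with `assign()`. This sketch assumes, purely for illustration, that the `Height` values are in metres; the column names `Height_cm` and `Over30` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name':   ['Alice', 'Bob', 'Charlie'],
                   'Age':    [28, 26, 99],
                   'Height': [1.86, 1.75, 1.83]})

# Assigning to a new column label adds the column in place
# (assumption: Height is in metres, so this gives centimetres)
df['Height_cm'] = df['Height'] * 100

# assign() returns a new DataFrame and leaves the original untouched
df_with_flag = df.assign(Over30=df['Age'] > 30)
print(df_with_flag)
```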

In [None]:
# Pandas also supports database style merge() operations
df.merge(df2, how='outer')
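The `how` argument controls which rows survive a merge. The sketch below uses two small made-up tables (the `cities` table and its values are invented for this example) joined on a shared `Name` column.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age':  [28, 26, 99]})

# Hypothetical second table sharing only some of the names
cities = pd.DataFrame({'Name': ['Bob', 'Charlie', 'Daniel'],
                       'City': ['Exeter', 'Cardiff', 'Leeds']})

# inner: keep only names present in both tables
inner = people.merge(cities, on='Name', how='inner')

# outer: keep names from either table, filling gaps with NaN
outer = people.merge(cities, on='Name', how='outer')

# left: keep every row of the left table, matched where possible
left = people.merge(cities, on='Name', how='left')
print(outer)
```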

#### Plot

Finally, Pandas also includes some quick plotting functionality to make it easy to visualise your data.

In [None]:
# The plot() method supports many kinds of plots
df.plot(x='Name', y='Height', kind='bar')

There is much, much more to Pandas. Please see this [list of example notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#pandas-for-data-analysis) and the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html).

### Gridded Data - Numpy
**References:<br>Numpy documentation -> https://numpy.org/doc/stable/user/quickstart.html**

All of us will also be familiar with gridded data - an array of values arranged on an n-dimensional grid.

![](further_ds_examples/iris_course/images/multi_array.png)

We can technically use a table to represent an array, or a list of lists, but these solutions are clunky and difficult to use.

Instead we use Numpy to create n-dimensional arrays that we can interact with in a similar way to mathematical matrices.

In [None]:
import numpy as np

arr = np.random.random(size=(4, 3))
arr

In [None]:
# Indexing values from an array uses matrix notation
arr[0,1]

In [None]:
# Which you can use to set values
arr[2,1] = 2
arr

In [None]:
# We can also perform maths with arrays
arr + 1

In [None]:
# And statistics
arr.mean()
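Indexing and statistics can also work along a single axis. This sketch uses a small array of known values instead of random ones so the results are easy to check by eye.

```python
import numpy as np

# A 4x3 array holding the values 0..11, chosen only for readability
grid = np.arange(12, dtype=float).reshape(4, 3)

# A colon selects everything along that axis
first_row = grid[0, :]
second_column = grid[:, 1]

# axis=0 collapses the rows, leaving one mean per column
column_means = grid.mean(axis=0)

# axis=1 collapses the columns, leaving one mean per row
row_means = grid.mean(axis=1)
print(column_means, row_means)
```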

However, we have no idea what these data values represent - **A Numpy array contains no metadata.**

There is much, much more to Numpy. Please see the [Numpy notebook](further_ds_examples/numpy_intro.ipynb) in `further_ds_examples` and the [docs](https://numpy.org/doc/stable/user/quickstart.html).

### Gridded Data - Iris

Many of us at the Met Office will have used Iris at some point. Iris gives us arrays with metadata using the CF (Climate and Forecast) data model:

![](further_ds_examples/iris_course/images/multi_array_to_cube.png)

Iris wraps a Numpy array with metadata into an object called a `Cube`. This gives a clearer interface for understanding what each axis of the n-dimensional array represents, what the values represent, and other arbitrary metadata such as how the data was produced.

Multiple `Cube`s can also be collated together in a `CubeList`.

In [None]:
# Create a Cube from the Numpy array we made before
import iris

cube = iris.cube.Cube(data=arr)
cube

In [None]:
# We can still access the array
cube.data

In [None]:
# Let's name the cube and give it units
cube.standard_name = 'air_temperature'
cube.units = '°C'
cube

In [None]:
# We can add dimension coordinates to describe the axes of the array
lon = iris.coords.DimCoord([-180, -90, 0, 90], standard_name='longitude', units='degrees')
lat = iris.coords.DimCoord([-45, 0, 45], standard_name='latitude', units='degrees')

cube.add_dim_coord(lon, 0)
cube.add_dim_coord(lat, 1)

cube

In [None]:
# We can store multiple Cubes in a CubeList
# (copy() gives an independent second Cube instead of a second name for the same one)
cube2 = cube.copy()
cubes = iris.cube.CubeList([cube, cube2])
cubes

In [None]:
# Like the array, we can perform maths on a cube
# (adding a number shifts the values only; the units metadata is left unchanged)
cube_K = cube + 273
display(cube_K)
display(cube_K.data)

In [None]:
# And statistics
cube_mean = cube.collapsed(coords=['longitude', 'latitude'], aggregator=iris.analysis.MEAN)
display(cube_mean)
display(cube_mean.data)

In [None]:
# Iris also has a quickplot function to easily plot Cubes
import iris.quickplot as qplt
%matplotlib inline

qplt.contourf(cube)

In [None]:
# With some help from matplotlib we can add coastlines
import matplotlib.pyplot as plt

qplt.contourf(cube)
plt.gca().coastlines()

There is much, much more to Iris. Please see the [Iris course notebooks](further_ds_examples/iris_course/0.Iris_Course_Intro.ipynb) in `further_ds_examples` and the [docs](https://scitools.org.uk/iris/docs/latest/index.html).

### Gridded Data - Xarray

- Based on the netCDF data model
- `DataArray` == `Cube`
- `Dataset` ~= a dictionary of `Cube`s (a dict-like collection of named `DataArray`s)
- Less strict with metadata than Iris
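As a rough sketch of the points above, the following builds an Xarray `DataArray` that mirrors the Iris `Cube` example. The data values, the coordinate values and the `units` attribute are made up for illustration, and Xarray itself does not check that the metadata is meaningful.

```python
import numpy as np
import xarray as xr

# Made-up values on a 4x3 grid, chosen so the results are easy to check
data = np.arange(12, dtype=float).reshape(4, 3)

# A DataArray wraps the array with named dimensions, coordinates and attributes
da = xr.DataArray(
    data,
    dims=('longitude', 'latitude'),
    coords={'longitude': [-180, -90, 0, 90], 'latitude': [-45, 0, 45]},
    name='air_temperature',
    attrs={'units': 'K'},  # free-form metadata; assumed unit for this sketch
)

# Select by coordinate value instead of by position
at_equator = da.sel(latitude=0)

# Collapse named dimensions with a statistic
mean_value = da.mean(dim=('longitude', 'latitude'))

# A Dataset is a dict-like container of named DataArrays
ds = xr.Dataset({'air_temperature': da})
print(ds)
```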