<div>
    <img src="https://dev.pandas.io/static/img/pandas.svg"><br>
</div>

So far we have looked at some fairly simple datasets.  NumPy is great for multi-dimensional arrays, but
book-keeping can be tricky.  Pandas is our friend here.  Pandas adds metadata to our data, and allows
us to interact with data using names and words, rather than integer indexes. This means that we can
write much clearer code (yay).  It's also really good at working with data that you would previously have
handled in spreadsheets.  Spreadsheets are the source of **many** errors: keeping data and
results in the same file is almost criminal! Your data are sacred and should **never be in the same
file that you process them in!**
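
As a minimal sketch of what "names rather than indexes" means in practice, the cell below builds the same small table twice: once as a bare NumPy array and once as a Pandas `DataFrame`. The values and the column names (`latitude`, `longitude`, `magnitude`) are made up for illustration and are not taken from any real catalogue.

```python
import numpy as np
import pandas as pd

# Made-up values purely for illustration.
# Columns are: latitude, longitude, magnitude.
raw = np.array([
    [-45.19, 166.83, 7.1],
    [-41.76, 174.30, 5.3],
    [-38.62, 175.95, 4.8],
])

# With NumPy we have to remember that magnitude lives in column 2.
numpy_magnitudes = raw[:, 2]

# With Pandas we name the columns once and then ask for them by name.
events = pd.DataFrame(raw, columns=["latitude", "longitude", "magnitude"])
pandas_magnitudes = events["magnitude"]

# Names also make selections readable: events of magnitude 5 or larger.
large_events = events[events["magnitude"] >= 5.0]
print(large_events)
```

Both approaches hold the same numbers, but only the second one tells a reader what each column is.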

The Pandas [GitHub README](https://github.com/pandas-dev/pandas/blob/master/README.md) outlines why you should
care about Pandas:

> **pandas** is a Python package providing fast, flexible, and expressive data structures designed to 
make working with "relational" or "labeled" data both easy and intuitive. It aims to be the 
fundamental high-level building block for doing practical, **real world** data analysis in Python.
Additionally, it has the broader goal of becoming the **most powerful and flexible open source 
data analysis / manipulation tool available in any language**. It is already well on its way towards 
this goal.

When Pandas says **real world**, think messy data. Measurements of properties of the Earth are *almost always*
messy: data points are missed when power supplies go down, or when it is too wet to get into the field.
Almost all Earth science datasets are noisy, and almost all Earth science data are multi-dimensional and
relational (e.g. multiple variables at one particular place and/or time).  Pandas is really good at coping
with this mess, and **will make your life easier!**
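
To give a flavour of what coping with missing data looks like, here is a small sketch using a hypothetical daily temperature log in which two readings were lost. The series name and the values are invented for illustration. The methods used (`isna`, `mean`, `dropna`, `interpolate`) are standard Pandas `Series` methods.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature log with two readings lost to a power cut.
temperature = pd.Series(
    [12.1, 12.4, np.nan, np.nan, 13.0],
    index=pd.date_range("2020-01-01", periods=5),  # daily by default
    name="temperature_degC",
)

n_missing = int(temperature.isna().sum())  # count the gaps
mean_temperature = temperature.mean()      # NaNs are skipped by default
observed_only = temperature.dropna()       # keep only the real measurements
gap_filled = temperature.interpolate()     # linear fill between neighbours

print(n_missing, mean_temperature)
print(gap_filled)
```

Whether you drop gaps or fill them depends on the analysis. The point is that the gaps are explicit and each choice is a single line.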

**!yay pandas!**

To show some of the functionality of Pandas we are going to play around with the New Zealand Centroid
Moment Tensor database, maintained by John Ristau of GNS.  This dataset is publicly available
on the [GeoNet GitHub page](https://github.com/GeoNet/data). Centroid Moment Tensors are a little
like focal mechanisms: they are a way of modeling the faulting style of an earthquake.  They are
a little more complex than focal mechanisms because they allow for *non-double-couple* forces, and
so can also describe explosions, implosions, and any combination thereof.

To start off, we will write a little function to download the data from the website and write it into
a CSV file in the `data` directory.

In [1]:
%matplotlib widget

In [2]:
import requests

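def get_geonet_cmt():
    """ Download GeoNet CMT catalogue and save to the data directory. """
    response = requests.get(
        "https://raw.githubusercontent.com/GeoNet/data/master/"
        "moment-tensor/GeoNet_CMT_solutions.csv")
    # Stop on an HTTP error rather than saving an error page as data.
    response.raise_for_status()
    # Write the raw bytes; the data directory must already exist.
    with open("data/GeoNet_CMT_solutions.csv", "wb") as f:
        f.write(response.content)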

Let's quickly run this function to get the data.  There won't be any output.  Note that I didn't
provide these data in the repository because a) I don't have permission to redistribute the data
and b) this dataset gets updated frequently!

In [3]:
get_geonet_cmt()

Let's have a quick look at the first five lines of the file that we just downloaded:

In [4]:
with open("data/GeoNet_CMT_solutions.csv", "r") as f:
    # Print the header line followed by the first four events.
    for _ in range(5):
        print(f.readline())

PublicID,Date,Latitude,Longitude,strike1,dip1,rake1,strike2,dip2,rake2,ML,Mw,Mo,CD,NS,DC,Mxx,Mxy,Mxz,Myy,Myz,Mzz,VR,Tva,Tpl,Taz,Nva,Npl,Naz,Pva,Ppl,Paz

2103645,20030821121200,-45.1929,166.8300,213,56,98,20,35,79,7.0,7.1,5.61e+26,22,5,87,-735165.31,2369692.25,-1425430.75,-4250704.50,1486940.25,4985869.50,83,5416627.50,78,149,388026.19,6,28,-5804654.00,11,298

2169849,20030821141200,-45.3592,166.8152,212,68,98,12,23,72,6.1,6.1,1.34e+25,14,3,70,-24379.98,20586.80,-59955.30,-77293.30,76089.99,101673.27,67,144527.23,66,135,-21321.51,7,29,-123205.72,23,296

2206498,20030821195600,-45.2900,166.8020,252,53,108,44,41,67,5.1,5.3,9.16e+23,9,1,79,-6419.43,4901.66,-2648.50,-1525.10,-573.02,7944.53,71,8649.84,74,217,1021.07,14,60,-9670.91,6,329

2218435,20030822000200,-45.0656,166.9658,232,68,102,23,25,63,5.1,5.2,7.19e+23,14,1,66,-1515.79,3653.11,-4129.25,-2732.08,2961.54,4247.87,89,6523.77,65,162,1337.04,11,47,-7860.81,22,313



It's a pretty big database, running from 2003 to now. We could read it into a load of NumPy arrays, but
then we would have to keep track of which column lives at which index, or in which array. Ideally we would
also have some columns as `float`s, others as `int`s, and the date column as `datetime.datetime`s...

**There is a better way.**

You guessed it: **Pandas**

In [5]:
import pandas as pd

cmt_solutions = pd.read_csv("data/GeoNet_CMT_solutions.csv")
print(cmt_solutions.head())

  PublicID            Date  Latitude  Longitude  strike1  dip1  rake1  \
0  2103645  20030821121200  -45.1929   166.8300      213    56     98   
1  2169849  20030821141200  -45.3592   166.8152      212    68     98   
2  2206498  20030821195600  -45.2900   166.8020      252    53    108   
3  2218435  20030822000200  -45.0656   166.9658      232    68    102   
4  2254800  20030822152900  -45.1861   166.9908      247    48    100   

   strike2  dip2  rake2  ...  VR         Tva  Tpl  Taz        Nva  Npl  Naz  \
0       20    35     79  ...  83  5416627.50   78  149  388026.19    6   28   
1       12    23     72  ...  67   144527.23   66  135  -21321.51    7   29   
2       44    41     67  ...  71     8649.84   74  217    1021.07   14   60   
3       23    25     63  ...  89     6523.77   65  162    1337.04   11   47   
4       52    43     79  ...  77     3391.86   82  222    -179.29    8   60   

          Pva  Ppl  Paz  
0 -5804654.00   11  298  
1  -123205.72   23  296  
2    -96

We now have a *DataFrame* object, which has columns labeled correctly and rows indexed by row-number.
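
As a short sketch of what those labels buy us, the cell below inlines a few columns from the first four rows printed above, so that it runs without the download, and then pulls out columns and rows by name. The names `sample_csv` and `sample` are only used for this illustration. With the full catalogue you would use `cmt_solutions` in the same way.

```python
import io
import pandas as pd

# A few columns of the first four events shown above, inlined for illustration.
sample_csv = io.StringIO(
    "PublicID,Date,Latitude,Longitude,Mw\n"
    "2103645,20030821121200,-45.1929,166.8300,7.1\n"
    "2169849,20030821141200,-45.3592,166.8152,6.1\n"
    "2206498,20030821195600,-45.2900,166.8020,5.3\n"
    "2218435,20030822000200,-45.0656,166.9658,5.2\n"
)
sample = pd.read_csv(sample_csv)

# Pandas infers a type for every column: integers and floats here.
print(sample.dtypes)

moment_magnitudes = sample["Mw"]               # one column, by name
locations = sample[["Latitude", "Longitude"]]  # several columns at once
first_event = sample.loc[0]                    # one row, by its index label

# Combine the two: the row belonging to the largest moment magnitude.
largest = sample.loc[sample["Mw"].idxmax()]
print(largest)
```

Note that the *Date* column has been read as a plain integer at this point, which is why we still need to convert it.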

I promised that we would be able to get the *Date* column into a proper datetime type (Pandas stores
these as `datetime64` values, its equivalent of `datetime.datetime`).  We can do that on-the-fly while reading:

In [6]:
cmt_solutions = pd.read_csv("data/GeoNet_CMT_solutions.csv", parse_dates=["Date"])
print(cmt_solutions.head())

  PublicID                Date  Latitude  Longitude  strike1  dip1  rake1  \
0  2103645 2003-08-21 12:12:00  -45.1929   166.8300      213    56     98   
1  2169849 2003-08-21 14:12:00  -45.3592   166.8152      212    68     98   
2  2206498 2003-08-21 19:56:00  -45.2900   166.8020      252    53    108   
3  2218435 2003-08-22 00:02:00  -45.0656   166.9658      232    68    102   
4  2254800 2003-08-22 15:29:00  -45.1861   166.9908      247    48    100   

   strike2  dip2  rake2  ...  VR         Tva  Tpl  Taz        Nva  Npl  Naz  \
0       20    35     79  ...  83  5416627.50   78  149  388026.19    6   28   
1       12    23     72  ...  67   144527.23   66  135  -21321.51    7   29   
2       44    41     67  ...  71     8649.84   74  217    1021.07   14   60   
3       23    25     63  ...  89     6523.77   65  162    1337.04   11   47   
4       52    43     79  ...  77     3391.86   82  222    -179.29    8   60   

          Pva  Ppl  Paz  
0 -5804654.00   11  298  
1  -123205
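
If you would rather state the date format explicitly than rely on automatic parsing, one alternative is to read the column as text and convert it with `pd.to_datetime`. The sketch below assumes that the *Date* values are laid out as `YYYYMMDDHHMMSS`, which is how they appear in the rows printed earlier. It again uses a small inlined sample (`sample_csv` and `sample` are illustrative names) so that it runs without the download.

```python
import io
import pandas as pd

# Two columns of the first four events shown earlier, inlined for illustration.
sample_csv = io.StringIO(
    "PublicID,Date\n"
    "2103645,20030821121200\n"
    "2169849,20030821141200\n"
    "2206498,20030821195600\n"
    "2218435,20030822000200\n"
)

# Read the dates as text, then convert using an explicit format string.
# Assumption: the column is always YYYYMMDDHHMMSS.
sample = pd.read_csv(sample_csv, dtype={"Date": str})
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y%m%d%H%M%S")

# Once the column is a datetime type, the .dt accessor exposes its parts...
years = sample["Date"].dt.year

# ...and we can select events by time with ordinary comparisons.
on_or_after_22nd = sample[sample["Date"] >= pd.Timestamp("2003-08-22")]
print(on_or_after_22nd)
```

An explicit format fails loudly if a row does not match, which can be a useful check on a dataset that is updated frequently.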

To do:

1. Sorting
2. Plot magnitude vs time
3. Circular plot of strike and dip
4. Region selection and replot
5. Stats (mean etc)
6. Apply function to sum all moments