# Data handling with Pandas

Pandas is a library optimized for handling one- or two-dimensional data sources [1]. One-dimensional data is stored in a `Series` object, and two-dimensional data is stored in a `DataFrame` object.
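
As a minimal sketch of the two containers, the snippet below builds one of each by hand. The names `population` and `table` and all of the values are made up for illustration.

```python
import pandas as pd

# A Series is one-dimensional: a column of values plus an index of row labels.
population = pd.Series([100, 120, 150], index=[2000, 2001, 2002], name="population")

# A DataFrame is two-dimensional: several columns that share one index.
table = pd.DataFrame(
    {"population": [100, 120, 150], "births": [10, 12, 15]},
    index=[2000, 2001, 2002],
)

print(population.shape)  # (3,)
print(table.shape)       # (3, 2)
```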

### Loading the library
It is customary to give the library a short handle '`pd`' at import time:

In [46]:
import pandas as pd

### Loading data from CSV files



Pandas gives us a comprehensive set of tools for loading data from [a variety of sources](http://pandas.pydata.org/pandas-docs/version/0.18.1/io.html), including CSV, Excel, SQL, JSON, and Stata, amongst others. In this demonstration, we'll read a comma-separated values (CSV) file of global emissions data covering the years 1751 to 2011.

The `pd.read_csv` function gives us options for how we want to format the data as we read it in. In reading in our data file, we want to skip the second row (indexed as `1`, since row numbering starts at zero) and use the column `Year` as the index of our resulting `DataFrame`.

In [47]:
emissions = pd.read_csv('../../data/Climate/global_emissions.csv', 
                        skiprows=[1], index_col='Year')
emissions  # Display the resulting DataFrame in the notebook

| Year | Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) | Carbon emissions from gas fuel consumption | Carbon emissions from liquid fuel consumption | Carbon emissions from solid fuel consumption | Carbon emissions from cement production | Carbon emissions from gas flaring | Per capita carbon emissions (metric tons of carbon; after 1949 only) |
|---|---|---|---|---|---|---|---|
| 1751 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1752 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1753 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1754 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1755 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1756 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1757 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1758 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1759 | 3 | 0 | 0 | 3 | 0 | 0 | |
| 1760 | 3 | 0 | 0 | 3 | 0 | 0 | |
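
To see what `skiprows` and `index_col` do without needing the data file, here is a small sketch that reads CSV text from memory. The contents of `csv_text` are hypothetical and only mimic the layout described above: a header row, a second row to skip, then data.

```python
import io
import pandas as pd

# Hypothetical CSV text: header row, a units row we want to skip, then data rows.
csv_text = """Year,Total,Gas
,million tons,million tons
1751,3,0
1752,3,0
1753,4,1
"""

# skiprows=[1] drops the second line of the file (numbering starts at 0);
# index_col="Year" turns the Year column into the row index.
sample = pd.read_csv(io.StringIO(csv_text), skiprows=[1], index_col="Year")

print(sample.index.name)     # Year
print(list(sample.columns))  # ['Total', 'Gas']
```

Skipping the units row lets Pandas infer numeric column types; left in place, that row would force the columns to be read as text.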


### Selecting rows of data by name
Both `DataFrame` and `Series` objects have an `index` attribute which is used to identify their rows. We can access rows of data according to this index, using the `.loc[...]` syntax.

Between the brackets, we can select individual rows:
```
emissions.loc[1875]
```
or ranges of dates:
```
emissions.loc[1908:1920]
```
or ranges beginning or ending at a specific point:
```
emissions.loc[1967:]
emissions.loc[:1805]
```
Give these a try and become comfortable selecting index ranges.

In [51]:
emissions.loc[1985:1987]

| Year | Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) | Carbon emissions from gas fuel consumption | Carbon emissions from liquid fuel consumption | Carbon emissions from solid fuel consumption | Carbon emissions from cement production | Carbon emissions from gas flaring | Per capita carbon emissions (metric tons of carbon; after 1949 only) |
|---|---|---|---|---|---|---|---|
| 1985 | 5438 | 835 | 2186 | 2237 | 131 | 49 | 1.12 |
| 1986 | 5606 | 830 | 2293 | 2300 | 137 | 46 | 1.13 |
| 1987 | 5750 | 892 | 2306 | 2364 | 143 | 44 | 1.14 |


### Selecting rows of data by position

In addition to selecting by row names, we can select by the row position using the `.iloc` syntax.

This syntax lets us select the first n rows:
```
emissions.iloc[:5]
```
or, if we wish, the last n, using a minus sign to indicate counting from the end of the `DataFrame`:
```
emissions.iloc[-5:]
```
or rows in the middle:
```
emissions.iloc[10:20]
```

In [52]:
emissions.iloc[-3:]

| Year | Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) | Carbon emissions from gas fuel consumption | Carbon emissions from liquid fuel consumption | Carbon emissions from solid fuel consumption | Carbon emissions from cement production | Carbon emissions from gas flaring | Per capita carbon emissions (metric tons of carbon; after 1949 only) |
|---|---|---|---|---|---|---|---|
| 2009 | 8700 | 1582 | 3039 | 3604 | 412 | 63 | 1.27 |
| 2010 | 9140 | 1698 | 3100 | 3832 | 445 | 65 | 1.32 |
| 2011 | 9449 | 1760 | 3137 | 3997 | 491 | 63 | 1.35 |
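
Positional slices follow ordinary Python rules, so the end position is excluded, unlike label slices with `.loc`. The sketch below shows this on a made-up frame (`toy` and its values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame(
    {"total": [10, 20, 30, 40, 50]},
    index=pd.Index([2000, 2001, 2002, 2003, 2004], name="Year"),
)

middle = toy.iloc[1:3]  # positions 1 and 2 only; position 3 is excluded
tail = toy.iloc[-2:]    # the last two rows

print(list(middle.index))  # [2001, 2002]
print(list(tail.index))    # [2003, 2004]
```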


### Renaming columns
The column names given in the CSV file are too long to use conveniently. We can replace them by assigning a list of strings to the `columns` attribute; the names are applied in order, so the list must have one entry per column:

In [53]:
emissions.columns = ['Total Emissions', 'Gas Emissions', 'Liquid Emissions', 
                     'Solid Emissions', 'Cement Emissions', 'Flare Emissions',
                     'Per Capita Emissions']
emissions.iloc[-3:]

| Year | Total Emissions | Gas Emissions | Liquid Emissions | Solid Emissions | Cement Emissions | Flare Emissions | Per Capita Emissions |
|---|---|---|---|---|---|---|---|
| 2009 | 8700 | 1582 | 3039 | 3604 | 412 | 63 | 1.27 |
| 2010 | 9140 | 1698 | 3100 | 3832 | 445 | 65 | 1.32 |
| 2011 | 9449 | 1760 | 3137 | 3997 | 491 | 63 | 1.35 |
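
Assigning by position depends on getting the order exactly right. An alternative, not used in this notebook, is `DataFrame.rename` with an explicit old-to-new mapping. The sketch below applies it to a made-up frame (`toy`) that borrows two of the long column names:

```python
import pandas as pd

# Toy frame with two of the long column names from the CSV file (values are made up).
toy = pd.DataFrame(
    {
        "Carbon emissions from gas fuel consumption": [1, 2],
        "Carbon emissions from liquid fuel consumption": [3, 4],
    },
    index=pd.Index([2010, 2011], name="Year"),
)

# Map each old name to its new name; columns not listed keep their names.
renamed = toy.rename(columns={
    "Carbon emissions from gas fuel consumption": "Gas Emissions",
    "Carbon emissions from liquid fuel consumption": "Liquid Emissions",
})

print(list(renamed.columns))  # ['Gas Emissions', 'Liquid Emissions']
```

`rename` returns a new `DataFrame` and leaves `toy` unchanged.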


### Accessing specific columns

Each of the columns in the `DataFrame` can be accessed as its own `Series` object, using the same syntax we would use to access members of a Python dictionary:

In [58]:
emissions['Total Emissions']

Year
1751       3
1752       3
1753       3
1754       3
1755       3
1756       3
1757       3
1758       3
1759       3
1760       3
1761       3
1762       3
1763       3
1764       3
1765       3
1766       3
1767       3
1768       3
1769       3
1770       3
1771       4
1772       4
1773       4
1774       4
1775       4
1776       4
1777       4
1778       4
1779       4
1780       4
        ... 
1982    5111
1983    5093
1984    5278
1985    5438
1986    5606
1987    5750
1988    5963
1989    6094
1990    6121
1991    6198
1992    6136
1993    6133
1994    6241
1995    6374
1996    6524
1997    6624
1998    6610
1999    6597
2000    6763
2001    6929
2002    6992
2003    7405
2004    7784
2005    8076
2006    8363
2007    8532
2008    8740
2009    8700
2010    9140
2011    9449
Name: Total Emissions, dtype: int64

Passing a list of column names into this syntax returns a subset of the `DataFrame`:

In [59]:
emissions[['Gas Emissions', 'Liquid Emissions']]

| Year | Gas Emissions | Liquid Emissions |
|---|---|---|
| 1751 | 0 | 0 |
| 1752 | 0 | 0 |
| 1753 | 0 | 0 |
| 1754 | 0 | 0 |
| 1755 | 0 | 0 |
| 1756 | 0 | 0 |
| 1757 | 0 | 0 |
| 1758 | 0 | 0 |
| 1759 | 0 | 0 |
| 1760 | 0 | 0 |
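
The difference between the two forms is the type of object that comes back: a single name gives a `Series`, while a list gives a `DataFrame`, even if the list holds only one name. A short sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gas Emissions": [1, 2], "Liquid Emissions": [3, 4]},
    index=pd.Index([2010, 2011], name="Year"),
)

one_column = toy["Gas Emissions"]    # a single name returns a Series
sub_frame = toy[["Gas Emissions"]]   # a list of names returns a DataFrame

print(type(one_column).__name__)  # Series
print(type(sub_frame).__name__)   # DataFrame
```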


### Arithmetic
We can perform [element-wise arithmetic](http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#dataframe-interoperability-with-numpy-functions) on `DataFrame` columns using natural syntax.

In [60]:
emissions['Gas Emissions'] + emissions['Liquid Emissions']

Year
1751       0
1752       0
1753       0
1754       0
1755       0
1756       0
1757       0
1758       0
1759       0
1760       0
1761       0
1762       0
1763       0
1764       0
1765       0
1766       0
1767       0
1768       0
1769       0
1770       0
1771       0
1772       0
1773       0
1774       0
1775       0
1776       0
1777       0
1778       0
1779       0
1780       0
        ... 
1982    2934
1983    2915
1984    3006
1985    3021
1986    3123
1987    3198
1988    3347
1989    3441
1990    3515
1991    3653
1992    3584
1993    3630
1994    3668
1995    3711
1996    3826
1997    3894
1998    3987
1999    3998
2000    4128
2001    4157
2002    4176
2003    4347
2004    4485
2005    4552
2006    4628
2007    4643
2008    4732
2009    4621
2010    4798
2011    4897
dtype: int64
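
Arithmetic between two `Series` matches rows by index label, not by position. The sketch below uses made-up values to show what happens when the labels only partly overlap:

```python
import pandas as pd

gas = pd.Series([1, 2, 3], index=[2009, 2010, 2011])
liquid = pd.Series([10, 20], index=[2010, 2011])  # no entry for 2009

# Rows are paired by label; a label missing from either side yields NaN.
combined = gas + liquid

print(combined.loc[2010])           # 12.0
print(pd.isna(combined.loc[2009]))  # True
```

Columns of the same `DataFrame` share one index, so this situation does not arise in the cell above.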

### Simple operations

A number of simple operations are built into Pandas to facilitate working with the data. For example, we can show [descriptive statistics](http://pandas.pydata.org/pandas-docs/version/0.18.1/basics.html#descriptive-statistics) such as the maximum value of each column:

In [61]:
emissions.max()

Total Emissions         9449.00
Gas Emissions           1760.00
Liquid Emissions        3137.00
Solid Emissions         3997.00
Cement Emissions         491.00
Flare Emissions          110.00
Per Capita Emissions       1.35
dtype: float64

We can also find the year [in which each maximum value occurred](http://pandas.pydata.org/pandas-docs/version/0.18.1/basics.html#index-of-min-max-values):

In [57]:
emissions.idxmax()

Total Emissions         2011
Gas Emissions           2011
Liquid Emissions        2011
Solid Emissions         2011
Cement Emissions        2011
Flare Emissions         1973
Per Capita Emissions    2011
dtype: int64

Or the sum of each column:

In [62]:
emissions.sum()

Total Emissions         373729.0
Gas Emissions            49774.0
Liquid Emissions        131976.0
Solid Emissions         179160.0
Cement Emissions          9366.0
Flare Emissions           3456.0
Per Capita Emissions        65.5
dtype: float64
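
By default these reductions skip missing values, which matters for a column such as the per capita figures that has no data for the early years. A sketch on a made-up frame (the column names and values in `toy` are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"flare": [63, 110, 65], "per_capita": [np.nan, 1.32, 1.35]},
    index=pd.Index([1972, 1973, 1974], name="Year"),
)

col_max = toy.max()       # largest value in each column, ignoring NaN
col_idxmax = toy.idxmax() # index label at which each maximum occurs
col_count = toy.count()   # number of non-missing values per column

print(col_idxmax["flare"])      # 1973
print(col_count["per_capita"])  # 2
```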

## Pandas and PySD

By default, PySD will return the results of a model simulation as a Pandas `DataFrame`, with the column names representing elements of the model, and the index (row names) as timestamps in the model.

In [5]:
import pysd
model = pysd.read_vensim('../../models/Predator_Prey/Predator_Prey.mdl')
sim_result_df = model.run()
sim_result_df

| | Predator Population | Prey Population |
|---|---|---|
| 0.000000 | 100.000000 | 2.500000e+02 |
| 0.015625 | 100.375000 | 2.577734e+02 |
| 0.031250 | 100.763598 | 2.657884e+02 |
| 0.046875 | 101.166319 | 2.740525e+02 |
| 0.062500 | 101.583713 | 2.825733e+02 |
| 0.078125 | 102.016354 | 2.913589e+02 |
| 0.093750 | 102.464841 | 3.004174e+02 |
| 0.109375 | 102.929803 | 3.097573e+02 |
| 0.125000 | 103.411897 | 3.193874e+02 |
| 0.140625 | 103.911808 | 3.293167e+02 |


In this case, we may want to downsample the returned data to make it more manageable:

In [6]:
sim_result_df.loc[range(50)]

| | Predator Population | Prey Population |
|---|---|---|
| 0 | 100.0 | 250.0 |
| 1 | 211.552891 | 1768.835 |
| 2 | 26037.148846 | 9285.526 |
| 3 | 155329.734184 | 0.03478882 |
| 4 | 153784.453003 | 9.557026e-09 |
| 5 | 152254.153168 | 3.189036e-15 |
| 6 | 150739.081255 | 1.289315e-21 |
| 7 | 149239.08573 | 6.300084e-28 |
| 8 | 147754.016571 | 3.711618e-34 |
| 9 | 146283.725246 | 2.630076e-40 |
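
Another way to downsample, sketched here on a stand-in frame because the PySD model is not needed for it, is to keep every n-th row by position. The frame `fake_result` and its `Stock` column are made up; only the time step of 0.015625 is taken from the output above.

```python
import numpy as np
import pandas as pd

time_step = 0.015625  # 1/64, the spacing of the timestamps shown above

# Stand-in for a simulation result: timestamps 0.0, 0.015625, ..., 3.0
times = np.arange(0, 3 * 64 + 1) * time_step
fake_result = pd.DataFrame(
    {"Stock": np.linspace(100.0, 200.0, len(times))},
    index=times,
)

# Keep one row per whole time unit by striding over row positions.
rows_per_unit = int(round(1 / time_step))  # 64
downsampled = fake_result.iloc[::rows_per_unit]

print(list(downsampled.index))  # [0.0, 1.0, 2.0, 3.0]
```

Striding by position avoids comparing floating-point timestamps, but it assumes the rows are evenly spaced in time.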


### Notes

[1]: While Pandas can handle data with more than two dimensions, doing so is clunky. [Xarray](http://xarray.pydata.org/en/stable/) is a package for handling multidimensional data that interfaces well with Pandas.


### Resources
- [Basic introduction](http://pandas.pydata.org/pandas-docs/stable/10min.html) to Pandas constructs
- [More advanced](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook) usage of Pandas syntax
- [Cookbook of Pandas Applications](https://github.com/jvns/pandas-cookbook)