# Data handling with Pandas

Pandas is a library optimized for handling one or two dimensional data sources [1]. One dimensional data is stored in a `Series` object, and two dimensional data is stored in a `DataFrame` object.
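
As a minimal sketch of the two container types (the names `heights` and `table` and their values are made up for illustration), we can build one of each by hand:

```python
import pandas as pd

# One-dimensional data: a Series of values labelled by an index
heights = pd.Series([1.2, 3.4, 5.6], index=['a', 'b', 'c'], name='height')

# Two-dimensional data: a DataFrame whose columns share a single index
table = pd.DataFrame({'height': [1.2, 3.4, 5.6], 'width': [10, 20, 30]},
                     index=['a', 'b', 'c'])

print(heights.ndim, table.ndim)  # number of dimensions of each object
print(table.shape)               # (rows, columns)
```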

### Loading the library
It is customary to give the library a short handle '`pd`' at import time:

In [2]:
import pandas as pd
pd.options.display.max_rows = 10  # show at most 10 rows so displayed tables stay concise




### Loading data from CSV files



Pandas gives us a comprehensive set of tools for loading data from [a variety of sources](http://pandas.pydata.org/pandas-docs/version/0.18.1/io.html), including CSV, Excel, SQL, JSON, and Stata, amongst others. In this demonstration, we'll read a comma separated value file of global emissions data from the year 1751 until 2011.

The `pd.read_csv` function gives us options for how we want to format the data as we read it in. When reading our data file, we want to skip the second row (indexed as `1`!) and use the column `Year` as the index of our resulting `DataFrame`.

In [3]:
emissions = pd.read_csv('../../data/Climate/global_emissions.csv', 
                        skiprows=[1], index_col='Year')
emissions  # Display the resulting DataFrame in the notebook

Year,Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C),Carbon emissions from gas fuel consumption,Carbon emissions from liquid fuel consumption,Carbon emissions from solid fuel consumption,Carbon emissions from cement production,Carbon emissions from gas flaring,Per capita carbon emissions (metric tons of carbon; after 1949 only)
1751,3,0,0,3,0,0,
1752,3,0,0,3,0,0,
1753,3,0,0,3,0,0,
1754,3,0,0,3,0,0,
1755,3,0,0,3,0,0,
...,...,...,...,...,...,...,...
2007,8532,1563,3080,3442,382,65,1.28
2008,8740,1625,3107,3552,387,68,1.29
2009,8700,1582,3039,3604,412,63,1.27
2010,9140,1698,3100,3832,445,65,1.32
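
To see what `skiprows` and `index_col` do without needing the data file, here is a small sketch that reads a made-up CSV from an in-memory string. The contents of `csv_text` are hypothetical and only mimic a file with a units row under the header:

```python
import io
import pandas as pd

# Hypothetical miniature file: a header row, a units row, then the data
csv_text = """Year,Total,Gas
,million tons,million tons
1751,3,0
1752,3,0
1753,4,1
"""

# skiprows=[1] drops the units row (line index 1 of the file);
# index_col='Year' uses the Year column as the row labels
small = pd.read_csv(io.StringIO(csv_text), skiprows=[1], index_col='Year')
print(small)
```

Because the units row is skipped, the remaining columns are parsed as numbers rather than as text.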


### Selecting rows of data by name
Both `DataFrame` and `Series` objects have an `index` attribute which is used to identify their rows. We can access rows of data according to this index, using the `.loc[...]` syntax.

Between the brackets, we can select individual rows:
```
emissions.loc[1875]
```
or ranges of dates:
```
emissions.loc[1908:1920]
```
or ranges beginning or ending at a specific point:
```
emissions.loc[1967:]
emissions.loc[:1805]
```
Give these a try and become comfortable selecting index ranges.

In [4]:
emissions.loc[1985:1987]

Year,Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C),Carbon emissions from gas fuel consumption,Carbon emissions from liquid fuel consumption,Carbon emissions from solid fuel consumption,Carbon emissions from cement production,Carbon emissions from gas flaring,Per capita carbon emissions (metric tons of carbon; after 1949 only)
1985,5438,835,2186,2237,131,49,1.12
1986,5606,830,2293,2300,137,46,1.13
1987,5750,892,2306,2364,143,44,1.14
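
Note that label-based slices include both endpoints, as the three rows returned for `1985:1987` show. A self-contained sketch on a made-up table (`toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical table indexed by year
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50]},
                   index=pd.Index([1900, 1901, 1902, 1903, 1904], name='Year'))

single_row = toy.loc[1902]          # one row, returned as a Series
closed_range = toy.loc[1901:1903]   # label slices include BOTH endpoints
open_ended = toy.loc[1903:]         # from label 1903 to the end

print(closed_range)
```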


### Selecting rows of data by position

In addition to selecting by row names, we can select by the row position using the `.iloc` syntax.

This syntax lets us select the first n rows:
```
emissions.iloc[:5]
```

or, if we wish, the last n, using a minus sign to indicate counting from the end of the `DataFrame`:

```
emissions.iloc[-5:]
```

or rows in the middle:
```
emissions.iloc[10:20]
```

In [8]:
emissions.iloc[1:30:2]  # every second row, from position 1 up to (not including) position 30

Year,Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C),Carbon emissions from gas fuel consumption,Carbon emissions from liquid fuel consumption,Carbon emissions from solid fuel consumption,Carbon emissions from cement production,Carbon emissions from gas flaring,Per capita carbon emissions (metric tons of carbon; after 1949 only)
1752,3,0,0,3,0,0,
1754,3,0,0,3,0,0,
1756,3,0,0,3,0,0,
1758,3,0,0,3,0,0,
1760,3,0,0,3,0,0,
...,...,...,...,...,...,...,...
1772,4,0,0,4,0,0,
1774,4,0,0,4,0,0,
1776,4,0,0,4,0,0,
1778,4,0,0,4,0,0,
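
Positional slices follow ordinary Python slicing rules: the end position is excluded, and an optional third value sets a step. A short sketch on a made-up table (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical table indexed by year
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50]},
                   index=pd.Index([1900, 1901, 1902, 1903, 1904], name='Year'))

first_two = toy.iloc[:2]      # positions 0 and 1
last_two = toy.iloc[-2:]      # the final two rows
middle = toy.iloc[1:3]        # positions 1 and 2; position 3 is excluded
every_other = toy.iloc[::2]   # positions 0, 2 and 4

print(every_other)
```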


### Renaming columns
The column names given in the CSV file are too long to use conveniently when working with the data. We can replace them by assigning a list of strings, which are applied to the columns in order:

In [11]:
emissions.columns = ['Total Emissions', 'Gas Emissions', 'Liquid Emissions', 
                     'Solid Emissions', 'Cement Emissions', 'Flare Emissions',
                     'Per Capita Emissions']
emissions.iloc[-3:]

Year,Total Emissions,Gas Emissions,Liquid Emissions,Solid Emissions,Cement Emissions,Flare Emissions,Per Capita Emissions
2009,8700,1582,3039,3604,412,63,1.27
2010,9140,1698,3100,3832,445,65,1.32
2011,9449,1760,3137,3997,491,63,1.35
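
Assigning a whole list relies on getting the order right. As an alternative sketch, `DataFrame.rename` maps old names to new ones explicitly and returns a new `DataFrame`. The two-column `toy` table below is made up for illustration:

```python
import pandas as pd

# Hypothetical table with two of the long column names
toy = pd.DataFrame({'Carbon emissions from gas fuel consumption': [0, 1],
                    'Carbon emissions from liquid fuel consumption': [0, 2]},
                   index=pd.Index([1751, 1752], name='Year'))

# Map each old name to its replacement; columns not listed keep their names
renamed = toy.rename(columns={
    'Carbon emissions from gas fuel consumption': 'Gas Emissions',
    'Carbon emissions from liquid fuel consumption': 'Liquid Emissions',
})

print(renamed.columns.tolist())
```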


### Accessing specific columns

Each of the columns in the `DataFrame` can be accessed as its own `Series` object, using the same syntax we would use to access members of a Python dictionary:

In [None]:
emissions['Total Emissions']


Passing a list of column names into this syntax returns a subset of the `DataFrame`:

In [50]:
emissions[['Gas Emissions', 'Liquid Emissions']]

Year,Gas Emissions,Liquid Emissions
1751,0,0
1752,0,0
1753,0,0
1754,0,0
1755,0,0
...,...,...
2007,1563,3080
2008,1625,3107
2009,1582,3039
2010,1698,3100
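
The difference between a single name and a list of names shows up in the type of the result. A small sketch on a made-up table (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical two-column table
toy = pd.DataFrame({'Gas Emissions': [0, 1, 2],
                    'Liquid Emissions': [0, 2, 4]},
                   index=pd.Index([1751, 1752, 1753], name='Year'))

single = toy['Gas Emissions']      # a string key returns a Series
subset = toy[['Gas Emissions']]    # a list key returns a DataFrame, even with one name

print(type(single).__name__, type(subset).__name__)
```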


### Element-wise Arithmetic
We can perform [element-wise arithmetic](http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#dataframe-interoperability-with-numpy-functions) on `DataFrame` columns using natural syntax.

In [12]:
emissions['Gas Emissions'] + emissions['Liquid Emissions']

Year
1751       0
1752       0
1753       0
1754       0
1755       0
        ... 
2007    4643
2008    4732
2009    4621
2010    4798
2011    4897
dtype: int64
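
Arithmetic between two `Series` is matched up by index label rather than by position. In the sketch below (the series `a` and `b` are made up), labels present in only one operand produce `NaN`:

```python
import pandas as pd

# Two hypothetical series whose year labels only partly overlap
a = pd.Series([1, 2, 3], index=[2000, 2001, 2002])
b = pd.Series([10, 20, 30], index=[2001, 2002, 2003])

# Values are added label by label; 2000 and 2003 have no partner
total = a + b

print(total)
```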

### Array Operations

A number of simple operations are built into Pandas to facilitate working with the data. For example, we can show [descriptive statistics](http://pandas.pydata.org/pandas-docs/version/0.18.1/basics.html#descriptive-statistics) such as the maximum value of each column:

In [20]:
emissions.max()

Total Emissions         9449.00
Gas Emissions           1760.00
Liquid Emissions        3137.00
Solid Emissions         3997.00
Cement Emissions         491.00
Flare Emissions          110.00
Per Capita Emissions       1.35
dtype: float64


The year [in which each maximum value occurred](http://pandas.pydata.org/pandas-docs/version/0.18.1/basics.html#index-of-min-max-values):

In [53]:
emissions.idxmax()

Total Emissions         2011
Gas Emissions           2011
Liquid Emissions        2011
Solid Emissions         2011
Cement Emissions        2011
Flare Emissions         1973
Per Capita Emissions    2011
dtype: int64

Or the sum of each column:

In [54]:
emissions.sum()

Total Emissions         373729.0
Gas Emissions            49774.0
Liquid Emissions        131976.0
Solid Emissions         179160.0
Cement Emissions          9366.0
Flare Emissions           3456.0
Per Capita Emissions        65.5
dtype: float64
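
These reductions work column by column and, by default, skip missing values. A sketch on a made-up table (`toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical table; column C has a missing first value
toy = pd.DataFrame({'A': [1, 5, 3],
                    'B': [9, 2, 4],
                    'C': [float('nan'), 1.5, 2.5]},
                   index=pd.Index([2000, 2001, 2002], name='Year'))

column_max = toy.max()       # largest value in each column
year_of_max = toy.idxmax()   # index label at which each maximum occurs
column_sum = toy.sum()       # NaN entries are skipped by default

print(year_of_max)
```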

In [55]:
emissions['Per Capita Emissions'].loc[1930:]

Year
1930     NaN
1931     NaN
1932     NaN
1933     NaN
1934     NaN
        ... 
2007    1.28
2008    1.29
2009    1.27
2010    1.32
2011    1.35
Name: Per Capita Emissions, dtype: float64
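
The output above starts with a run of `NaN` values. One way to find where the real values begin, sketched here on a made-up series `s`, is `first_valid_index`:

```python
import pandas as pd

# Hypothetical series with missing values in the early years
s = pd.Series([float('nan'), float('nan'), 1.0, 1.1],
              index=pd.Index([1948, 1949, 1950, 1951], name='Year'))

first_year = s.first_valid_index()   # label of the first non-missing value
n_missing = int(s.isna().sum())      # how many entries are missing
present = s.dropna()                 # only the rows that hold a value

print(first_year, n_missing)
```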

### Merging Datasets
The dataset we currently have is missing per capita emissions data before 1950. We have another dataset, giving estimates of the world population, which we can use to try to fill in some of the missing data. It too, however, has some missing values: before 1900, the data comes at 50-year intervals.

In [57]:
population = pd.read_csv('../../data/Climate/world_population.csv', index_col='Year')

First we need to merge the two datasets. Pandas gives us a `merge` function which allows us to align the datasets on their index values. Using `how='outer'` keeps every year that appears in either dataset.

In [62]:
merged = pd.merge(emissions, population, how='outer', left_index=True, right_index=True)
merged.loc[1750:2011]

Year,Total Emissions,Gas Emissions,Liquid Emissions,Solid Emissions,Cement Emissions,Flare Emissions,Per Capita Emissions,World Population
1750,,,,,,,,8.115621e+08
1751,3.0,0.0,0.0,3.0,0.0,0.0,,
1752,3.0,0.0,0.0,3.0,0.0,0.0,,
1753,3.0,0.0,0.0,3.0,0.0,0.0,,
1754,3.0,0.0,0.0,3.0,0.0,0.0,,
...,...,...,...,...,...,...,...,...
2007,8532.0,1563.0,3080.0,3442.0,382.0,65.0,1.28,6.681607e+09
2008,8740.0,1625.0,3107.0,3552.0,387.0,68.0,1.29,6.763733e+09
2009,8700.0,1582.0,3039.0,3604.0,412.0,63.0,1.27,6.846480e+09
2010,9140.0,1698.0,3100.0,3832.0,445.0,65.0,1.32,6.929725e+09


### Interpolating missing values
The merge operation creates `NaN` values in the rows where data is missing from the world population column. We can fill these using a cubic spline interpolation from the surrounding points (this method requires SciPy to be installed):

In [64]:
interpolated = merged.interpolate(method='cubic')
interpolated.loc[1750:2011]

Year,Total Emissions,Gas Emissions,Liquid Emissions,Solid Emissions,Cement Emissions,Flare Emissions,Per Capita Emissions,World Population
1750,,,,,,,,8.115621e+08
1751,3.0,0.0,0.0,3.0,0.0,0.0,,8.155185e+08
1752,3.0,0.0,0.0,3.0,0.0,0.0,,8.194193e+08
1753,3.0,0.0,0.0,3.0,0.0,0.0,,8.232672e+08
1754,3.0,0.0,0.0,3.0,0.0,0.0,,8.270645e+08
...,...,...,...,...,...,...,...,...
2007,8532.0,1563.0,3080.0,3442.0,382.0,65.0,1.28,6.681607e+09
2008,8740.0,1625.0,3107.0,3552.0,387.0,68.0,1.29,6.763733e+09
2009,8700.0,1582.0,3039.0,3604.0,412.0,63.0,1.27,6.846480e+09
2010,9140.0,1698.0,3100.0,3832.0,445.0,65.0,1.32,6.929725e+09
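
To make the effect of interpolation easier to check by hand, the sketch below uses a made-up three-point series and the simpler `'index'` and `'linear'` methods instead of the cubic one, so it runs without SciPy. With unevenly spaced years the two methods give different answers:

```python
import pandas as pd

# Hypothetical series with a gap; the year labels are unevenly spaced
s = pd.Series([100.0, float('nan'), 150.0], index=[1800, 1810, 1850])

# 'index' weights the estimate by the distance between index labels
by_index = s.interpolate(method='index')

# 'linear' ignores the labels and treats the rows as equally spaced
by_position = s.interpolate(method='linear')

print(by_index.loc[1810], by_position.loc[1810])
```

Here 1810 lies one fifth of the way from 1800 to 1850, so the index-aware estimate is 110, while the position-based estimate is the midpoint, 125.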


#### Calculating per capita emissions
Now we can calculate a new value for per capita emissions. We multiply by `1,000,000` to convert from the 'million metric tons' in which Total Emissions are expressed to plain 'metric tons', the unit of the existing, incomplete estimate of per capita emissions.

In [69]:
interpolated['Per Capita Emissions 2'] = interpolated['Total Emissions'] / interpolated['World Population'] * 1000000
interpolated.loc[1751:2011]

Year,Total Emissions,Gas Emissions,Liquid Emissions,Solid Emissions,Cement Emissions,Flare Emissions,Per Capita Emissions,World Population,Per Capita Emissions 2
1751,3.0,0.0,0.0,3.0,0.0,0.0,,8.155185e+08,0.003679
1752,3.0,0.0,0.0,3.0,0.0,0.0,,8.194193e+08,0.003661
1753,3.0,0.0,0.0,3.0,0.0,0.0,,8.232672e+08,0.003644
1754,3.0,0.0,0.0,3.0,0.0,0.0,,8.270645e+08,0.003627
1755,3.0,0.0,0.0,3.0,0.0,0.0,,8.308138e+08,0.003611
...,...,...,...,...,...,...,...,...,...
2007,8532.0,1563.0,3080.0,3442.0,382.0,65.0,1.28,6.681607e+09,1.276938
2008,8740.0,1625.0,3107.0,3552.0,387.0,68.0,1.29,6.763733e+09,1.292186
2009,8700.0,1582.0,3039.0,3604.0,412.0,63.0,1.27,6.846480e+09,1.270726
2010,9140.0,1698.0,3100.0,3832.0,445.0,65.0,1.32,6.929725e+09,1.318956
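
The unit conversion can be checked on round numbers. The figures in `toy` below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical round figures: emissions in million metric tons, population in people
toy = pd.DataFrame({'Total Emissions': [3.0, 9000.0],
                    'World Population': [8.0e8, 7.0e9]},
                   index=pd.Index([1751, 2011], name='Year'))

# Million metric tons -> metric tons, then divide by the number of people
toy['Per Capita'] = toy['Total Emissions'] * 1_000_000 / toy['World Population']

print(toy)
```

For the first row this is 3,000,000 tons shared among 800,000,000 people, or 0.00375 tons per person.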


## Pandas and PySD

By default, PySD will return the results of model simulation as a Pandas `DataFrame`, with the column names representing elements of the model, and the index (row names) as timestamps in the model.

In [None]:
import pysd
model = pysd.read_vensim('../../models/Predator_Prey/Predator_Prey.mdl')
sim_result_df = model.run()
sim_result_df

In this case, we may want to downsample the returned data to make it more manageable, for example by keeping only the rows at timestamps 0 through 49:

In [None]:
sim_result_df.loc[range(50)]
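
If the rows are evenly spaced, another way to thin them is a positional step. The sketch below does not run PySD; `result` is a hypothetical stand-in for simulation output with a row every 0.25 time units:

```python
import pandas as pd

# Hypothetical stand-in for simulation output: times 0.0, 0.25, ..., 10.0
times = [i * 0.25 for i in range(41)]
result = pd.DataFrame({'stock': [t * 2 for t in times]}, index=times)

# Keep every fourth row, i.e. one row per whole time unit
every_fourth = result.iloc[::4]

print(every_fourth)
```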

### Notes

[1]: While Pandas can handle data with more than two dimensions, doing so is clunky. [Xarray](http://xarray.pydata.org/en/stable/) is a package for handling multidimensional data that interfaces well with Pandas.


### Resources
- [Basic introduction](http://pandas.pydata.org/pandas-docs/stable/10min.html) to Pandas constructs
- [More advanced](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook) usage of Pandas syntax
- [Cookbook of Pandas Applications](https://github.com/jvns/pandas-cookbook)