# Working with Economic Data in Python

This notebook will introduce you to working with data in ``Python``. You can use packages like ``Numpy`` to work with arrays and matrices and to manipulate data (see my [Introduction to Python](./IntroPython.ipynb)). But given the needs of economists, it is better to use [Pandas](http://pandas.pydata.org), which allows you to import and process data in many useful ways. It also works well with [other packages](http://pydata.org/downloads.html) that complement it, making it a very powerful tool for data analysis.

In [None]:
# Let's import pandas, numpy, and matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

With [Pandas](http://pandas.pydata.org) you can 

1. Import many types of data, including
    * CSV files
    * Tab or other types of delimited files 
    * Excel (xls, xlsx) files
    * Stata files
2. Open files directly from a website
3. Merge, select, join data
4. Perform statistical analyses
5. Create plots of your data

and much more.
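As a minimal sketch of items 1 and 3 in the list above, the cell below reads a comma-delimited and a tab-delimited table and merges them. The data, the column names, and the variables `gdp`, `pop`, `merged`, and `subset` are invented for illustration; in-memory strings stand in for real files or URLs.

```python
import io

import pandas as pd

# Hypothetical data: in-memory text stands in for files on disk or on a website
csv_text = """country,year,gdp
A,2013,100.0
A,2014,110.0
B,2013,50.0
B,2014,55.0
"""
tsv_text = "country\tyear\tpop\nA\t2013\t10.0\nA\t2014\t10.5\nB\t2013\t5.0\nB\t2014\t5.1\n"

# read_csv handles comma-delimited data by default; sep selects another delimiter
gdp = pd.read_csv(io.StringIO(csv_text))
pop = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Merge the two tables on their shared keys
merged = gdp.merge(pop, on=["country", "year"], how="inner")

# Select some rows and columns of the merged table
subset = merged.loc[merged["year"] == 2014, ["country", "gdp", "pop"]]
print(subset)
```

The same `read_csv` call also accepts a file path or a URL in place of the `StringIO` object.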

# Examples

Let's start by importing the data from the Penn World Table:

In [None]:
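# The second sheet (index 1) of the Excel file holds the variable definitions
df = pd.read_excel('http://www.rug.nl/ggdc/docs/pwt90.xlsx', sheet_name=1)
# The Stata file holds the data itself
dfpwt = pd.read_stata('http://www.rug.nl/ggdc/docs/pwt90.dta')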

The main object in [Pandas](http://pandas.pydata.org) is a dataframe. 
Here we have created two dataframes based on an Excel file and a Stata file.

``df`` is a dataframe that contains the variable names and their definitions, while ``dfpwt`` contains the data. Let's explore them.
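To make the idea of a dataframe concrete, here is a small sketch built by hand. The dataframe `toy` and its values are invented for illustration and are unrelated to the Penn World Table files; the attributes and methods shown (`shape`, `dtypes`, `head`, column selection) work the same way on `df` and `dfpwt`.

```python
import pandas as pd

# A hypothetical toy dataframe: each key is a column, each list holds its values
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2013, 2014, 2013, 2014],
    "pop": [10.0, 10.5, 5.0, 5.1],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of each column
print(toy.head(2))  # first two rows
print(toy["pop"])   # a single column, returned as a Series
```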

In [None]:
df

In [None]:
dfpwt

Now we can create new variables, transform the data, and plot it.

## Computing $\log$ Income per Capita

To compute the $\log$ of income per capita (GDPpc), we first need to know the name of the column that contains the GDPpc data in the dataframe. To find it, let's look for the variables whose description contains the word *capita*.

In [None]:
df.columns

In [None]:
df.loc[df['Variable definition'].apply(lambda x: str(x).lower().find('capita')!=-1)]

So, it seems the data does not contain that variable. But do not panic...we know how to compute it based on *GDP* and *Population*. Let's do it!

### Identify the name of the variable for GDP

In [None]:
df.loc[df['Variable definition'].apply(lambda x: str(x).upper().find('GDP')!=-1)]

### Identify the name of the variable for population

In [None]:
df.loc[df['Variable definition'].apply(lambda x: str(x).lower().find('population')!=-1)]
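The three searches above follow the same pattern, so they could be wrapped in a small helper. The sketch below uses the vectorized `Series.str.contains` method instead of `apply`; the function `find_variables` and the stand-in dataframe `defs` are hypothetical names introduced here, and the definitions in `defs` are made up to resemble the definitions sheet.

```python
import pandas as pd

# Hypothetical stand-in for the definitions sheet, including an empty row
defs = pd.DataFrame({
    "Variable name": ["rgdpe", "pop", "emp", None],
    "Variable definition": [
        "Expenditure-side real GDP at chained PPPs",
        "Population (in millions)",
        "Number of persons engaged",
        None,
    ],
})


def find_variables(definitions, keyword, column="Variable definition"):
    """Return the rows whose text in `column` contains `keyword`, ignoring case."""
    mask = definitions[column].str.contains(
        keyword, case=False, na=False, regex=False
    )
    return definitions.loc[mask]


print(find_variables(defs, "gdp"))
print(find_variables(defs, "population"))
print(find_variables(defs, "capita"))  # empty: no definition mentions it
```

Setting `na=False` treats missing definitions as non-matches, which plays the role of the `str(x)` conversion in the cells above.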

### Create a new variable/column with Expenditure-side real GDPpc at chained PPPs

In [None]:
dfpwt['rgdppce'] = dfpwt['rgdpe'] / dfpwt['pop']
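With the per capita column in place, taking logs is one more step. The sketch below uses invented numbers in a toy dataframe, and the column name `lrgdppce` is a hypothetical choice, not one from the dataset. It masks non-positive values before calling `np.log`, so they become missing values instead of `-inf`. Before interpreting the ratio, it is worth checking in `df` that GDP and population are measured in compatible units.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; includes a zero and a missing value on purpose
toy = pd.DataFrame({
    "rgdpe": [1000.0, 2000.0, 0.0, np.nan],
    "pop": [10.0, 10.0, 5.0, 2.0],
})

# Same construction as above: GDP divided by population
toy["rgdppce"] = toy["rgdpe"] / toy["pop"]

# Keep only strictly positive values, then take the natural log;
# everything else becomes NaN
toy["lrgdppce"] = np.log(toy["rgdppce"].where(toy["rgdppce"] > 0))

print(toy)
```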

### Create a plot of this measure of GDP vs Population for all countries in the last year of the dataset

In [None]:
# Let's figure out the last period
lastperiod = dfpwt.iloc[-1].year
print(lastperiod)

In [None]:
# Select data for last period
dflast = dfpwt.loc[dfpwt.year==lastperiod]
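Reading the year from the last row works when the rows are sorted so that the final row belongs to the latest year. If that ordering is not guaranteed, taking the maximum of the `year` column is a safer alternative. The sketch below shows the difference on an invented, deliberately unsorted toy dataframe.

```python
import pandas as pd

# Hypothetical toy panel whose last row is NOT the latest year
toy = pd.DataFrame({
    "country": ["B", "B", "A", "A"],
    "year": [2013, 2014, 2014, 2013],
    "pop": [5.0, 5.1, 10.5, 10.0],
})

# The last row would give 2013 here, so use the column maximum instead
lastperiod = toy["year"].max()

# Boolean mask: keep the rows whose year equals the last period
dflast = toy.loc[toy["year"] == lastperiod]
print(lastperiod)
print(dflast)
```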

In [None]:
dflast

In [None]:
ax = dflast.plot.scatter(x='pop', y='rgdpe', c='rgdppce', cmap='Reds')
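Population and GDP usually span several orders of magnitude, so a few large countries can dominate a scatter plot in levels. One possible refinement, sketched below with invented numbers, is to switch both axes to a log scale using the matplotlib `Axes` object that `plot.scatter` returns. The axis labels are illustrative wording only.

```python
import pandas as pd

# Hypothetical toy cross-section with very different country sizes
toy = pd.DataFrame({
    "pop": [1.0, 10.0, 100.0, 1000.0],
    "rgdpe": [50.0, 200.0, 5000.0, 20000.0],
})
toy["rgdppce"] = toy["rgdpe"] / toy["pop"]

# Same scatter plot as above, colored by GDP per capita
ax = toy.plot.scatter(x="pop", y="rgdpe", c="rgdppce", cmap="Reds")

# Log scales spread out the small countries instead of bunching them near zero
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Population (log scale)")
ax.set_ylabel("Real GDP (log scale)")
```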

### Use statistical and mathematical functions to analyze the data

In [None]:
# Describe the data
dfpwt.describe()

In [None]:
dflast.describe()

In [None]:
dflast[['rgdpe', 'rgdppce', 'pop']].corr()
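Correlations in levels can be driven by a handful of very large observations, so it can be informative to compare them with correlations in logs. The sketch below uses invented data in which GDP is exactly the square of population: the relationship is perfectly linear in logs but not in levels. The names `toy`, `corr_levels`, and `corr_logs` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rgdpe equals pop squared
toy = pd.DataFrame({"pop": [1.0, 10.0, 100.0, 1000.0]})
toy["rgdpe"] = toy["pop"] ** 2

# Pearson correlation in levels
corr_levels = toy[["rgdpe", "pop"]].corr()

# Pearson correlation after taking natural logs of both columns
corr_logs = np.log(toy[["rgdpe", "pop"]]).corr()

print(corr_levels)
print(corr_logs)
```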

### Exercise:
1. Create GDPpc measures based on all other measures of GDP
2. Compare these measures using plots, correlations, etc.
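As a hint for the exercise, a loop over column names avoids repeating the division by hand. The sketch below uses a toy dataframe with made-up GDP columns `gdp_a` and `gdp_b`; the actual column names have to be taken from the definitions in `df`, and the `_pc` suffix is just one possible naming convention.

```python
import pandas as pd

# Hypothetical toy data with two made-up GDP measures
toy = pd.DataFrame({
    "gdp_a": [100.0, 220.0, 50.0],
    "gdp_b": [110.0, 200.0, 60.0],
    "pop": [10.0, 20.0, 5.0],
})

# Placeholder list: replace with the GDP column names found in df
gdp_columns = ["gdp_a", "gdp_b"]

# Create one per capita column for each GDP measure
for col in gdp_columns:
    toy[col + "_pc"] = toy[col] / toy["pop"]

# Compare the resulting measures
pc_columns = [col + "_pc" for col in gdp_columns]
print(toy[pc_columns].describe())
print(toy[pc_columns].corr())
```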
