
This notebook is derived from a larger tutorial called <b>'Python Programming for Economics and Finance'</b><br>
by Thomas J. Sargent and John Stachurski

The full tutorial is available at https://python-programming.quantecon.org/pandas.html

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/index.html">Pandas API reference</a>

In [2]:
import numpy as np
import pandas as pd
from datetime import date

Two important data types defined by pandas are Series and DataFrame.

You can think of a Series as a “column” of data, such as a collection of observations on a single variable of the same type.

A DataFrame is a two-dimensional object for storing related columns of data (a collection of series objects).
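
As a rough sketch of how the two types relate, the snippet below builds a DataFrame from two Series and then pulls one column back out. The names `returns`, `volume` and `frame`, and all the values, are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series that share the default integer index 0, 1, 2
returns = pd.Series([0.5, -1.25, 0.25], name='returns')
volume = pd.Series([100, 250, 175], name='volume')

# Each Series becomes one column of the DataFrame
frame = pd.DataFrame({'returns': returns, 'volume': volume})

# Selecting a single column gives back a Series
col = frame['returns']
```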

<h1>Series</h1>

We begin by creating a series of four random observations

In [3]:
s = pd.Series(np.random.randn(4), name='daily stock returns')
s

0    0.201657
1    0.449609
2    0.065047
3   -1.054896
Name: daily stock returns, dtype: float64

Here you can imagine the indices <b>0, 1, 2, 3</b> as indexing four listed companies, and the values being daily returns on their shares.

Pandas Series are built on top of NumPy arrays and support many similar operations

In [6]:
s = s * 100

In [7]:
np.abs(s)

0     20.165730
1     44.960942
2      6.504696
3    105.489623
Name: daily stock returns, dtype: float64
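
Because the values above are random, they change on every run. The following sketch repeats the same two operations on fixed, made-up numbers (the name `r` is hypothetical) so that the result is reproducible. It also shows that the index and the name carry through element-wise operations.

```python
import numpy as np
import pandas as pd

# Fixed illustrative values instead of random draws
r = pd.Series([0.5, -1.25, 0.25], name='daily stock returns')

scaled = r * 100           # element-wise multiplication, as with a NumPy array
absolute = np.abs(scaled)  # NumPy functions accept a Series and return a Series
```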

In [9]:
s.describe().round(2)

count      4.00
mean      -8.46
std       66.61
min     -105.49
25%      -21.49
50%       13.34
75%       26.36
max       44.96
Name: daily stock returns, dtype: float64

In [10]:
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s

AMZN     20.165730
AAPL     44.960942
MSFT      6.504696
GOOG   -105.489623
Name: daily stock returns, dtype: float64

In [11]:
s['AMZN']

20.16572990443681

In [12]:
s['AMZN'] = 0
s

AMZN      0.000000
AAPL     44.960942
MSFT      6.504696
GOOG   -105.489623
Name: daily stock returns, dtype: float64

In [15]:
'AAPX' in s

False
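
The label-based access and the `in` test above make a Series behave much like a Python dictionary keyed by its index. A small sketch with made-up values (the name `s_demo` is hypothetical):

```python
import pandas as pd

s_demo = pd.Series([20.2, 45.0, 6.5, -105.5],
                   index=['AMZN', 'AAPL', 'MSFT', 'GOOG'],
                   name='daily stock returns')

# `in` tests membership among the index labels, not the values
has_aapl = 'AAPL' in s_demo
has_aapx = 'AAPX' in s_demo

# Like dict.get, Series.get returns a default instead of raising KeyError
fallback = s_demo.get('AAPX', default=0.0)
```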

<h1>DataFrames</h1>
While a Series is a single column of data, a DataFrame is several columns, one for each variable.

In essence, a DataFrame in pandas is analogous to a (highly optimized) Excel spreadsheet.

Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.

Let’s look at an example that reads data from the CSV file CountryData.csv.

The dataset contains the following indicators

<table style="border: 1px solid; text-align: center;">
    <tr><th style="border: 1px solid"><b>Variable Name</b></th><th style="border: 1px solid; text-align: center;"><b>Description</b></th></tr>
    <tr><td style="border: 1px solid">POP</td><td style="border: 1px solid">Population (in thousands)</td></tr>
    <tr><td style="border: 1px solid">XRAT</td><td style="border: 1px solid">Exchange Rate to US Dollar</td></tr>
    <tr><td style="border: 1px solid">tcgdp</td><td style="border: 1px solid">Total PPP Converted GDP (in millions of international dollars)</td></tr>
    <tr><td style="border: 1px solid">cc</td><td style="border: 1px solid">Consumption Share of PPP Converted GDP Per Capita (%)</td></tr>
    <tr><td style="border: 1px solid">cg</td><td style="border: 1px solid">Government Consumption Share of PPP Converted GDP Per Capita (%)</td></tr>
</table>

We’ll read this in from a local file using the pandas function read_csv.

In [16]:
df = pd.read_csv('../datafiles/CountryData.csv')
type(df)

pandas.core.frame.DataFrame

In [17]:
df

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
1,Australia,AUS,2000,19053.186,1.72483,541804.7,67.759026,6.720098
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454
7,Uruguay,URY,2000,3219.793,12.099592,25255.96,78.97874,5.108068


<h4>Select Data by Position</h4>
In practice, we constantly need to find, select and work with the subset of the data that interests us.

We can select particular rows using standard Python array slicing notation

In [18]:
df[2:5]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954


To select columns, we can pass a list containing the names of the desired columns represented as strings

In [20]:
df[['country', 'tcgdp']].round(2)

Unnamed: 0,country,tcgdp
0,Argentina,295072.22
1,Australia,541804.65
2,India,1728144.37
3,Israel,129253.89
4,Malawi,5026.22
5,South Africa,227242.37
6,United States,9898700.0
7,Uruguay,25255.96


To select both rows and columns using integers, the iloc attribute should be used with the format .iloc[rows, columns].

In [21]:
df.iloc[2:5, 0:4]

Unnamed: 0,country,country isocode,year,POP
2,India,IND,2000,1006300.297
3,Israel,ISR,2000,6114.57
4,Malawi,MWI,2000,11801.505


To select rows and columns using a mixture of integers and labels, the loc attribute can be used in a similar way

In [23]:
df.loc[df.index[2:5], ['country', 'tcgdp']].round(2)

Unnamed: 0,country,tcgdp
2,India,1728144.37
3,Israel,129253.89
4,Malawi,5026.22
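
One difference between the two indexers is worth a sketch: slices passed to `iloc` exclude the end position, as in ordinary Python slicing, while label slices passed to `loc` include the end label. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'B', 'C', 'D'],
                    'POP': [10.0, 20.0, 30.0, 40.0],
                    'tcgdp': [1.0, 2.0, 3.0, 4.0]})

# Positions: rows 1 and 2, columns 0 and 1 (end positions excluded)
by_position = toy.iloc[1:3, 0:2]

# Labels: rows labeled 1, 2 and 3 (end label included)
by_label = toy.loc[1:3, ['country', 'tcgdp']]
```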


<h4>Select Data by Conditions</h4>
Instead of indexing rows and columns using integers and names, we can also extract the sub-dataframe whose rows satisfy certain (potentially complicated) conditions.

This section demonstrates various ways to do that.

The most straightforward way is with the [] operator.

In [24]:
df[df.POP >= 20000]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454


To understand what is going on here, notice that df.POP >= 20000 returns a series of boolean values.

In [25]:
df.POP >= 20000

0     True
1    False
2     True
3    False
4    False
5     True
6     True
7    False
Name: POP, dtype: bool

In this case, df[___] takes a series of boolean values and only returns rows with the True values.
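
The two steps can be separated explicitly. This is a minimal sketch on a small hypothetical frame (`toy` and its values are illustrative, not the dataset above):

```python
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'POP': [37000.0, 19000.0, 45000.0]})

# Step 1: the comparison produces a boolean Series aligned with the rows
mask = toy['POP'] >= 20000

# Step 2: indexing with the mask keeps only the rows where it is True;
# the surviving rows keep their original index labels
selected = toy[mask]
```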

Here is one more example, which combines two conditions with &:

In [26]:
df[(df.country.isin(['Argentina', 'India', 'South Africa'])) & (df.POP > 40000)]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546


The query method does the same thing with a more natural syntax, and it can be slightly faster for large dataframes.

In [27]:
df.query("POP >= 20000")

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454


In [28]:
df.query("country in ['Argentina', 'India', 'South Africa'] and POP > 40000")

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
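
To check that the two styles agree, the sketch below runs both on a four-row frame whose populations are copied from the table above. The names `toy`, `names` and `threshold` are hypothetical; inside the query string, the `@` prefix refers to a Python variable from the surrounding scope.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Argentina', 'India', 'South Africa', 'Uruguay'],
    'POP': [37335.653, 1006300.297, 45064.098, 3219.793],
})

names = ['Argentina', 'India', 'South Africa']
threshold = 40000

# Boolean-mask style: each condition needs its own parentheses around comparisons
via_mask = toy[toy['country'].isin(names) & (toy['POP'] > threshold)]

# query style: the same conditions written as a string expression
via_query = toy.query("country in @names and POP > @threshold")
```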


Conditions can also involve arithmetic operations between different columns.

In [29]:
df[(df.cc + df.cg >= 80) & (df.POP <= 20000)]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
4,Malawi,MWI,2000,11801.505,59.543808,5026.221784,74.707624,11.658954
7,Uruguay,URY,2000,3219.793,12.099592,25255.961693,78.97874,5.108068


In [30]:
# the above is equivalent to 
df.query("cc + cg >= 80 & POP <= 20000")

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
4,Malawi,MWI,2000,11801.505,59.543808,5026.221784,74.707624,11.658954
7,Uruguay,URY,2000,3219.793,12.099592,25255.961693,78.97874,5.108068


Conditions can also compare a column with a statistic computed from it. For example, we can select the country with the largest household consumption share of GDP, cc.

In [31]:
df.loc[df.cc == df.cc.max()]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
7,Uruguay,URY,2000,3219.793,12.099592,25255.961693,78.97874,5.108068
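
An alternative, sketched here on a hypothetical three-row frame with cc values copied from the table above, is `idxmax`, which returns the index label of the first maximum. Passing that label to `loc` returns a single row as a Series, whereas the comparison approach returns a DataFrame containing every row tied for the maximum.

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Argentina', 'Malawi', 'Uruguay'],
                    'cc': [75.716805, 74.707624, 78.97874]})

# Comparison with the maximum: a DataFrame with all rows that attain it
top_rows = toy.loc[toy['cc'] == toy['cc'].max()]

# idxmax: the label of the first maximum, giving one row as a Series
top_label = toy['cc'].idxmax()
top_row = toy.loc[top_label]
```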


When we only want to look at certain columns of a selected sub-dataframe, we can combine the above conditions with the .loc[__ , __] indexer.

The first argument takes the condition, while the second argument takes a list of columns we want to return.

In [32]:
df.loc[(df.cc + df.cg >= 80) & (df.POP <= 20000), ['country', 'year', 'POP']]

Unnamed: 0,country,year,POP
4,Malawi,2000,11801.505
7,Uruguay,2000,3219.793


<h4>Subsetting Dataframe</h4>

Real-world datasets can be enormous.

It is sometimes desirable to work with a subset of the data to improve computational efficiency and reduce redundancy.

Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp) of each country.

One way to strip the data frame df down to only these variables is to select the relevant columns, using the method described above, and store the result in a new dataframe

In [33]:
df_subset = df[['country', 'POP', 'tcgdp']]
df_subset

Unnamed: 0,country,POP,tcgdp
0,Argentina,37335.653,295072.2
1,Australia,19053.186,541804.7
2,India,1006300.297,1728144.0
3,Israel,6114.57,129253.9
4,Malawi,11801.505,5026.222
5,South Africa,45064.098,227242.4
6,United States,282171.957,9898700.0
7,Uruguay,3219.793,25255.96


We can then save the smaller dataset for further analysis.

In [34]:
df_subset.to_csv('../datafiles/Countrysubset.csv', index=False)
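
The `index=False` argument matters when the file is read back. The sketch below writes a hypothetical two-row frame to an in-memory buffer rather than to disk (an assumption made only to keep the example self-contained). Without `index=False`, the row index is written as an extra unnamed column, which `read_csv` then labels `Unnamed: 0`.

```python
import io
import pandas as pd

toy = pd.DataFrame({'country': ['Israel', 'Malawi'],
                    'POP': [6114.57, 11801.505]})

# Write without the index, then read the text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Write with the index (the default) for comparison
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)
```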