
This notebook is derived from a larger tutorial called <b>'Python Programming for Economics and Finance'</b><br>
by Thomas J. Sargent and John Stachurski

The full tutorial is available at https://python-programming.quantecon.org/pandas.html

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/index.html">Pandas API reference</a>

In [None]:
import numpy as np
import pandas as pd
from datetime import date

Two important data types defined by pandas are Series and DataFrame.

You can think of a Series as a “column” of data, such as a collection of observations on a single variable of the same type.

A DataFrame is a two-dimensional object for storing related columns of data (a collection of series objects).
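
The relationship between the two types can be sketched with a small example. The values and the variable names (`pop`, `gdp`, `toy`) below are made up for illustration and are not taken from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
pop = pd.Series([100, 200, 300], name='POP')
gdp = pd.Series([1.5, 2.5, 3.5], name='tcgdp')

# Combine two Series side by side; each Series becomes one column
toy = pd.concat([pop, gdp], axis=1)

# Selecting a single column of the DataFrame gives back a Series
col = toy['POP']
print(type(toy).__name__, type(col).__name__)
```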

<h1>Series</h1>

We begin by creating a series of four random observations

In [None]:
s = pd.Series(np.random.randn(4), name='daily stock returns')
s

Here you can imagine the indices <b>0, 1, 2, 3</b> as indexing four listed companies, and the values as daily returns on their shares.

Pandas Series are built on top of NumPy arrays and support many similar operations

In [None]:
s * 100

In [None]:
np.abs(s)

In [None]:
s.describe()
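
Because the series above holds random numbers, its output changes on every run. A reproducible sketch of the same operations, using fixed made-up values (the name `r` is hypothetical), might look like this:

```python
import numpy as np
import pandas as pd

# Fixed, invented values so the results are reproducible
r = pd.Series([0.5, -1.0, 2.0, -0.25], name='returns')

scaled = r * 100        # element-wise arithmetic, as with a NumPy array
magnitude = np.abs(r)   # NumPy functions accept a Series and return a Series
arr = r.to_numpy()      # the values as a plain NumPy array
```

Both `scaled` and `magnitude` keep the index and name of `r`, which a bare NumPy array would not carry.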

In [None]:
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s

In [None]:
s['AMZN']

In [None]:
s['AMZN'] = 0
s

In [None]:
'AAPL' in s

<h1>DataFrames</h1>
While a Series is a single column of data, a DataFrame is several columns, one for each variable.

In essence, a DataFrame in pandas is analogous to a (highly optimized) Excel spreadsheet.

Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.

Let’s look at an example that reads data from the CSV file CountryData.csv.

The dataset contains the following indicators

<table style="border: 1px solid; text-align: center;">
    <tr><th style="border: 1px solid"><b>Variable Name</b></th><th style="border: 1px solid; text-align: center;"><b>Description</b></th></tr>
    <tr><td style="border: 1px solid">POP</td><td style="border: 1px solid">Population (in thousands)</td></tr>
    <tr><td style="border: 1px solid">XRAT</td><td style="border: 1px solid">Exchange Rate to US Dollar</td></tr>
    <tr><td style="border: 1px solid">tcgdp</td><td style="border: 1px solid">Total PPP Converted GDP (in millions of international dollars)</td></tr>
    <tr><td style="border: 1px solid">cc</td><td style="border: 1px solid">Consumption Share of PPP Converted GDP Per Capita (%)</td></tr>
    <tr><td style="border: 1px solid">cg</td><td style="border: 1px solid">Government Consumption Share of PPP Converted GDP Per Capita (%)</td></tr>
</table>

We’ll read this in from a local file using the pandas function read_csv.

In [None]:
df = pd.read_csv('../datafiles/CountryData.csv')
type(df)

In [None]:
df

<h4>Select Data by Position</h4>
In practice, one thing we do all the time is find, select and work with the subset of the data that interests us.

We can select particular rows using standard Python array slicing notation

In [None]:
df[2:5]

To select columns, we can pass a list containing the names of the desired columns represented as strings

In [None]:
df[['country', 'tcgdp']]

To select both rows and columns using integers, the iloc attribute should be used with the format .iloc[rows, columns].

In [None]:
df.iloc[2:5, 0:4]

To select rows and columns using a mixture of integers and labels, the loc attribute can be used in a similar way

In [None]:
df.loc[df.index[2:5], ['country', 'tcgdp']]
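
One difference between the two indexers is worth seeing on a small example: iloc slices exclude the end position, while loc slices include the end label. The sketch below uses a hypothetical toy dataframe with string row labels; its names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {'country': ['A', 'B', 'C', 'D'],
     'POP': [10, 20, 30, 40],
     'tcgdp': [1.0, 2.0, 3.0, 4.0]},
    index=['w', 'x', 'y', 'z'],
)

# Positions 1 and 2 for rows, 0 and 1 for columns (end position excluded)
by_position = toy.iloc[1:3, 0:2]

# Label slice 'x':'y' includes the end label 'y'
by_label = toy.loc['x':'y', ['country', 'POP']]

# Integer positions translated to labels through the index
mixed = toy.loc[toy.index[1:3], ['country', 'POP']]
```

All three expressions select the same two rows and two columns.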

<h4>Select Data by Conditions</h4>
Instead of indexing rows and columns using integers and names, we can also obtain a sub-dataframe whose rows satisfy certain (potentially complicated) conditions.

This section demonstrates various ways to do that.

The most straightforward way is with the [] operator.

In [None]:
df[df.POP >= 20000]

To understand what is going on here, notice that df.POP >= 20000 returns a series of boolean values.

In [None]:
df.POP >= 20000

In this case, df[___] takes a series of boolean values and only returns rows with the True values.
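
The two steps can be separated on a small made-up dataframe (the name `toy` and its values are hypothetical) to make the mechanism visible:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'POP': [15000, 25000, 40000]})

# Step 1: the comparison produces a boolean Series aligned with the rows
mask = toy['POP'] >= 20000

# Step 2: indexing with the mask keeps only the rows marked True
selected = toy[mask]
```

Note that the selected rows keep their original index labels rather than being renumbered from zero.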

As one more example, we can combine two conditions with the & operator

In [None]:
df[(df.country.isin(['Argentina', 'India', 'South Africa'])) & (df.POP > 40000)]

There is another way of doing the same thing, the query method, which can be slightly faster for large dataframes and has a more natural syntax.

In [None]:
df.query("POP >= 20000")

In [None]:
df.query("country in ['Argentina', 'India', 'South Africa'] and POP > 40000")

We can also allow arithmetic operations between different columns.

In [None]:
df[(df.cc + df.cg >= 80) & (df.POP <= 20000)]

In [None]:
# the above is equivalent to
df.query("cc + cg >= 80 and POP <= 20000")
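
The equivalence between the two styles can be checked on a small made-up dataframe. In the sketch below the data and the variable `pop_cap` are hypothetical; the `@` prefix is how a query string refers to a Python variable.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the dataset
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'POP': [15000, 25000, 18000, 50000],
    'cc': [70.0, 75.0, 60.0, 65.0],
    'cg': [12.0, 10.0, 15.0, 20.0],
})

# Boolean-mask version: each condition needs its own parentheses
with_mask = toy[(toy.cc + toy.cg >= 80) & (toy.POP <= 20000)]

# Query version: column names are written directly in the string
with_query = toy.query("cc + cg >= 80 and POP <= 20000")

# A Python variable can be referenced inside the string with @
pop_cap = 20000
with_variable = toy.query("cc + cg >= 80 and POP <= @pop_cap")
```

Here only country A satisfies both conditions, and all three expressions return that single row.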

For example, we can use a condition to select the country with the largest household consumption share of GDP (cc).

In [None]:
df.loc[df.cc == df.cc.max()]
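
Comparing against the maximum returns every row that attains it, which matters when there are ties. As a sketch on made-up data, the idxmax method is an alternative that returns only the first such row:

```python
import pandas as pd

# Hypothetical toy data with a tie for the largest cc
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'cc': [60.0, 75.0, 75.0]})

# Comparing against the maximum keeps all tied rows (a DataFrame)
top_rows = toy.loc[toy.cc == toy.cc.max()]

# idxmax gives the index label of the first maximum (a single row, as a Series)
first_top = toy.loc[toy.cc.idxmax()]
```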

When we only want to look at certain columns of a selected sub-dataframe, we can use the above conditions with the .loc[__ , __] indexer.

The first argument takes the condition, while the second argument takes a list of columns we want to return.

In [None]:
df.loc[(df.cc + df.cg >= 80) & (df.POP <= 20000), ['country', 'year', 'POP']]
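
The form of the second argument affects the type of the result. In this sketch on made-up data (the names `toy` and `condition` are hypothetical), a list of column names gives a DataFrame, while a single name gives a Series:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'POP': [15000, 25000, 18000],
    'cc': [70.0, 75.0, 68.0],
    'cg': [12.0, 10.0, 14.0],
})

# Store the condition in a variable to keep the selection readable
condition = (toy.cc + toy.cg >= 80) & (toy.POP <= 20000)

as_frame = toy.loc[condition, ['country', 'POP']]   # list of names -> DataFrame
as_series = toy.loc[condition, 'country']           # single name -> Series
```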

<h4>Subsetting a DataFrame</h4>

Real-world datasets can be enormous.

It is sometimes desirable to work with a subset of data to enhance computational efficiency and reduce redundancy.

Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp).

One way to strip the dataframe df down to only these variables is to create a new dataframe using the selection method described above

In [None]:
df_subset = df[['country', 'POP', 'tcgdp']]
df_subset

We can then save the smaller dataset for further analysis.

In [None]:
df_subset.to_csv('../datafiles/Countrysubset.csv', index=False)
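
To check that a saved file reads back as expected, a round trip can be sketched as follows. The data and the file name `toy_subset.csv` are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'country': ['A', 'B'],
                    'POP': [100, 200],
                    'tcgdp': [1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_subset.csv')
    # index=False leaves the row labels out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
```

Without index=False, the row labels would be written as an extra unnamed column and would reappear as such when the file is read back.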