# Pandas fundamentals

#### Written for the QuantEcon Indian Summer Workshop (August 2022)
#### Author: [Shu Hu](https://shu-hu.com/intro.html)

In [1]:
import pandas as pd

Use the dataframe below to complete the following exercises.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/QuantEcon/lecture-python-programming/master/source/_static/lecture_specific/pandas/data/test_pwt.csv')

In [3]:
df

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
1,Australia,AUS,2000,19053.186,1.72483,541804.7,67.759026,6.720098
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454
7,Uruguay,URY,2000,3219.793,12.099592,25255.96,78.97874,5.108068


### Exercise 1 (View data)

### Exercise 1.1

Show the top 5 rows of the dataframe ``df``.

### Solution

In [4]:
df.head(5)

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
1,Australia,AUS,2000,19053.186,1.72483,541804.7,67.759026,6.720098
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954


### Exercise 1.2

Show the bottom 5 rows of the dataframe ``df``.

### Solution

In [5]:
df.tail(5)

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454
7,Uruguay,URY,2000,3219.793,12.099592,25255.96,78.97874,5.108068


### Exercise 1.3

Show a quick statistical summary of the dataframe ``df``.

### Solution

In [6]:
df.describe()

Unnamed: 0,year,POP,XRAT,tcgdp,cc,cg
count,8.0,8.0,8.0,8.0,8.0,8.0
mean,2000.0,176382.6,16.415811,1606312.0,71.404995,8.145477
std,0.0,347922.3,22.758175,3397025.0,5.318015,3.383397
min,2000.0,3219.793,0.9995,5026.222,64.436451,5.108068
25%,2000.0,10379.77,1.543623,103254.4,66.963157,5.689611
50%,2000.0,28194.42,5.50858,261157.3,72.532882,6.376276
75%,2000.0,104341.1,20.310094,838389.6,74.959919,10.614755
max,2000.0,1006300.0,59.543808,9898700.0,78.97874,14.072206


### Exercise 1.4

Transpose the dataframe ``df``.

### Solution

In [7]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7
country,Argentina,Australia,India,Israel,Malawi,South Africa,United States,Uruguay
country isocode,ARG,AUS,IND,ISR,MWI,ZAF,USA,URY
year,2000,2000,2000,2000,2000,2000,2000,2000
POP,37335.653,19053.186,1006300.297,6114.57,11801.505,45064.098,282171.957,3219.793
XRAT,0.9995,1.72483,44.9416,4.07733,59.543808,6.93983,1.0,12.099592
tcgdp,295072.21869,541804.6521,1728144.3748,129253.89423,5026.221784,227242.36949,9898700.0,25255.961693
cc,75.716805,67.759026,64.575551,64.436451,74.707624,72.71871,72.347054,78.97874
cg,5.578804,6.720098,14.072206,10.266688,11.658954,5.726546,6.032454,5.108068


### Exercise 1.5

Sort the dataframe ``df`` by the values in column ``tcgdp`` in ascending order.

### Solution

In [8]:
df.sort_values(by="tcgdp")

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954
7,Uruguay,URY,2000,3219.793,12.099592,25255.96,78.97874,5.108068
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
1,Australia,AUS,2000,19053.186,1.72483,541804.7,67.759026,6.720098
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454
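
``sort_values`` sorts in ascending order by default, and passing ``ascending=False`` reverses the order. The following is a minimal sketch on a small hand-built frame. The name ``small`` is hypothetical, and its three rows are copied from the table above so that the example runs without downloading the CSV file.

```python
import pandas as pd

# Hypothetical three-row frame; values copied from the table above
small = pd.DataFrame({
    "country": ["Argentina", "Malawi", "United States"],
    "tcgdp": [295072.2, 5026.222, 9898700.0],
})

# ascending=True is the default; ascending=False puts the largest value first
desc = small.sort_values(by="tcgdp", ascending=False)
print(desc)
```

Note that ``sort_values`` returns a new dataframe and leaves ``small`` unchanged.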


### Exercise 2 (Select data)

### Exercise 2.1

Select the column ``tcgdp`` from the dataframe ``df``, yielding a series.

### Solution

In [9]:
df["tcgdp"]

0    2.950722e+05
1    5.418047e+05
2    1.728144e+06
3    1.292539e+05
4    5.026222e+03
5    2.272424e+05
6    9.898700e+06
7    2.525596e+04
Name: tcgdp, dtype: float64

In [10]:
df.tcgdp

0    2.950722e+05
1    5.418047e+05
2    1.728144e+06
3    1.292539e+05
4    5.026222e+03
5    2.272424e+05
6    9.898700e+06
7    2.525596e+04
Name: tcgdp, dtype: float64

### Exercise 2.2

Select the top 3 rows from the dataframe ``df``.

### Solution

In [11]:
df[0:3]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
1,Australia,AUS,2000,19053.186,1.72483,541804.7,67.759026,6.720098
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206


### Exercise 2.3


Get the first row of data across all the columns.

### Solution

In [12]:
df.loc[0]

country               Argentina
country isocode             ARG
year                       2000
POP                   37335.653
XRAT                     0.9995
tcgdp              295072.21869
cc                    75.716805
cg                     5.578804
Name: 0, dtype: object
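
Here ``df.loc[0]`` returns the first row only because the row labeled ``0`` happens to sit in the first position. Once the rows are reordered, label-based ``loc`` and position-based ``iloc`` can give different answers. The sketch below illustrates this on a hypothetical frame ``small`` with values copied from the table above.

```python
import pandas as pd

# Hypothetical three-row frame; values copied from the table above
small = pd.DataFrame({
    "country": ["Argentina", "Malawi", "United States"],
    "tcgdp": [295072.2, 5026.222, 9898700.0],
})

# After sorting by tcgdp the index labels appear in the order 1, 0, 2
by_gdp = small.sort_values(by="tcgdp")

row_label_0 = by_gdp.loc[0]      # row whose index label is 0
row_position_0 = by_gdp.iloc[0]  # row in the first position

print(row_label_0["country"])     # Argentina
print(row_position_0["country"])  # Malawi
```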

### Exercise 2.4

Select all values in the two columns ``country`` and ``POP``.

### Solution

In [13]:
df[["country", "POP"]]

Unnamed: 0,country,POP
0,Argentina,37335.653
1,Australia,19053.186
2,India,1006300.297
3,Israel,6114.57
4,Malawi,11801.505
5,South Africa,45064.098
6,United States,282171.957
7,Uruguay,3219.793


You can also use the ``loc`` indexer, which selects rows and columns by label in a single expression.

In [14]:
df.loc[:, ["country", "POP"]]

Unnamed: 0,country,POP
0,Argentina,37335.653
1,Australia,19053.186
2,India,1006300.297
3,Israel,6114.57
4,Malawi,11801.505
5,South Africa,45064.098
6,United States,282171.957
7,Uruguay,3219.793


### Exercise 2.5

Select the values in rows 2 to 4 (index labels, inclusive) of the two columns ``country`` and ``POP``.

### Solution

In [15]:
df.loc[2:4, ["country", "POP"]]

Unnamed: 0,country,POP
2,India,1006300.297
3,Israel,6114.57
4,Malawi,11801.505
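
The output contains three rows because a ``loc`` slice works on labels and includes both endpoints, unlike ordinary Python slicing. A position-based slice with ``iloc`` excludes the stop position. The sketch below compares the two on a hypothetical frame ``small`` built from the first five rows shown above.

```python
import pandas as pd

# Hypothetical five-row frame; values copied from the table above
small = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel", "Malawi"],
    "POP": [37335.653, 19053.186, 1006300.297, 6114.57, 11801.505],
})

by_label = small.loc[2:4, ["country", "POP"]]  # labels 2, 3 and 4
by_position = small.iloc[2:4]                  # positions 2 and 3 only

print(len(by_label), len(by_position))  # 3 2
```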


### Exercise 2.6

Get the scalar value in the row labeled ``2`` (the third row) of the column ``POP``, which should be ``1006300.297``.

### Solution

In [16]:
df.loc[2, "POP"]

1006300.297

In [17]:
# .at accesses a single value given one row label and one column label,
# while .loc also accepts slices, lists of labels and boolean masks
df.at[2, "POP"]

1006300.297
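
As a rough illustration of the difference between the two accessors, the sketch below reads the same cell with ``.at`` and ``.loc`` and then uses ``.loc`` with a list of row labels, which ``.at`` is not designed for. The frame ``small`` is hypothetical, with values copied from the table above.

```python
import pandas as pd

# Hypothetical three-row frame; values copied from the table above
small = pd.DataFrame({
    "country": ["Argentina", "Australia", "India"],
    "POP": [37335.653, 19053.186, 1006300.297],
})

value_at = small.at[2, "POP"]    # scalar lookup by one row and one column label
value_loc = small.loc[2, "POP"]  # same cell through the more general .loc

# .loc also accepts a list of row labels and then returns a Series
subset = small.loc[[0, 2], "POP"]

print(value_at == value_loc)  # True
print(subset)
```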

### Exercise 3 (Boolean indexing)

### Exercise 3.1 

Select the rows whose ``POP`` value is greater than ``30_000``.

### Solution

In [18]:
df[df['POP'] > 30_000]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454
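
Several conditions can be combined with ``&`` (and) and ``|`` (or), provided each comparison is wrapped in parentheses. The sketch below is one possible illustration. The frame ``small`` and the threshold on ``cg`` are assumptions chosen for the example, with values copied from the table above.

```python
import pandas as pd

# Hypothetical four-row frame; values copied from the table above
small = pd.DataFrame({
    "country": ["Argentina", "India", "Israel", "Uruguay"],
    "POP": [37335.653, 1006300.297, 6114.57, 3219.793],
    "cg": [5.578804, 14.072206, 10.266688, 5.108068],
})

# Parentheses are required because & and | bind more tightly than > and <
both = small[(small["POP"] > 30_000) & (small["cg"] > 10)]
either = small[(small["POP"] > 30_000) | (small["cg"] > 10)]

print(both["country"].tolist())    # ['India']
print(either["country"].tolist())  # ['Argentina', 'India', 'Israel']
```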


### Exercise 3.2

Start from a copy of the dataframe ``df`` and add a new column ``X``:

In [19]:
df1 = df.copy()
df1

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804
1,Australia,AUS,2000,19053.186,1.72483,541804.7,67.759026,6.720098
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454
7,Uruguay,URY,2000,3219.793,12.099592,25255.96,78.97874,5.108068


In [20]:
df1["X"] = ["A", "B", "C", "D", "E", "F", "G","H"]

In [21]:
df1

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg,X
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804,A
1,Australia,AUS,2000,19053.186,1.72483,541804.7,67.759026,6.720098,B
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206,C
3,Israel,ISR,2000,6114.57,4.07733,129253.9,64.436451,10.266688,D
4,Malawi,MWI,2000,11801.505,59.543808,5026.222,74.707624,11.658954,E
5,South Africa,ZAF,2000,45064.098,6.93983,227242.4,72.71871,5.726546,F
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454,G
7,Uruguay,URY,2000,3219.793,12.099592,25255.96,78.97874,5.108068,H


Select the rows whose value in column ``X`` is ``A``, ``C`` or ``G`` using the method [isin()](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html#pandas.Series.isin).

### Solution

In [22]:
df1[df1["X"].isin(["A", "C", "G"])]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg,X
0,Argentina,ARG,2000,37335.653,0.9995,295072.2,75.716805,5.578804,A
2,India,IND,2000,1006300.297,44.9416,1728144.0,64.575551,14.072206,C
6,United States,USA,2000,282171.957,1.0,9898700.0,72.347054,6.032454,G
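
``isin`` returns a boolean series, so it can be stored as a mask and inverted with ``~`` to select the complementary rows. The sketch below shows this on a hypothetical frame ``small`` whose first four rows mirror ``df1``.

```python
import pandas as pd

# Hypothetical four-row frame mirroring the first rows of df1
small = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel"],
    "X": ["A", "B", "C", "D"],
})

# True where X is one of the listed values; "G" is absent here and simply never matches
mask = small["X"].isin(["A", "C", "G"])

kept = small[mask]      # rows with X in the list
dropped = small[~mask]  # all remaining rows

print(kept["country"].tolist())     # ['Argentina', 'India']
print(dropped["country"].tolist())  # ['Australia', 'Israel']
```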
