# Assignment: OECD Producer Price Index

The [producer price index](https://en.wikipedia.org/wiki/Producer_price_index) (PPI) measures the rate of change in the prices of products as they leave the producer. The [OECD](http://oecd.org/), an intergovernmental organization, maintains a dataset of PPI values for countries around the world. In this assignment, you will visualize the PPI of various countries from Jan 2011 to Jan 2023 as high-dimensional data.

## Data

* [PPI dataset](https://data.oecd.org/price/producer-price-indices-ppi.htm#indicator-chart)

The important columns of this dataset are `LOCATION`, `TIME` and `Value`. We will treat the per-country PPI values over time as a single data point. That is, each high-dimensional data point consists of all the values from Jan 2011 to Jan 2023 for a given country. You may want to use `pandas.pivot` to switch the data frame from long form to wide form. For this assignment, we will replace all `NaN` values with 0.
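As a minimal sketch of the reshaping described above, the snippet below pivots a small long-form table into wide form and fills the gaps with 0. The location codes and values are made up for illustration and are not taken from the OECD file.

```python
import pandas as pd

# Toy long-form data (made-up values); "BBB" has no entry for 2011-02.
long_df = pd.DataFrame({
    "LOCATION": ["AAA", "AAA", "BBB"],
    "TIME": ["2011-01", "2011-02", "2011-01"],
    "Value": [98.5, 99.0, 101.2],
})

# One row per location, one column per month.
wide_df = long_df.pivot(index="LOCATION", columns="TIME", values="Value")

# Months missing for a location become NaN after the pivot;
# the assignment replaces them with 0.
wide_df = wide_df.fillna(0)

print(wide_df)
```

Each row of `wide_df` is then one data point whose dimensions are the months.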

## Task

Your task for this assignment is to find a two-dimensional embedding of this high-dimensional dataset that clusters countries with similar PPI histories together. The final visualization should be a 2D scatter plot. The x and y axes should map to the components computed by the dimension reduction algorithm. The location information should be encoded as color.

Please use this notebook for this assignment.

In [42]:
import altair as alt
import pandas as pd
import sklearn

url = "https://github.com/qnzhou/practical_data_visualization_in_python/files/14559866/oecd_ppi.csv"
data = pd.read_csv(url)

In [43]:
# Let's look at the shape of the data
print(data.shape)

# Check out a few rows...
data.head()

(5751, 8)


,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUT,PPI,DOMESTIC,IDX2015,M,2011-01,98.65053,
1,AUT,PPI,DOMESTIC,IDX2015,M,2011-02,99.12756,
2,AUT,PPI,DOMESTIC,IDX2015,M,2011-03,99.98622,
3,AUT,PPI,DOMESTIC,IDX2015,M,2011-04,100.3678,
4,AUT,PPI,DOMESTIC,IDX2015,M,2011-05,100.3678,


In [44]:
# Now let's pivot the data from long form to wide form while isolating the important columns
important_data = data.pivot(index='LOCATION', columns='TIME', values='Value')

# Let's look at the shape of the pivot data
print(important_data.shape)

# Make sure we pivoted correctly...
important_data.tail()

(40, 145)


LOCATION,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,...,2022-04,2022-05,2022-06,2022-07,2022-08,2022-09,2022-10,2022-11,2022-12,2023-01
SVK,102.3,103.3,103.9,104.6,105.1,104.7,104.9,104.8,104.7,104.7,...,124.92,127.67,129.67,131.59,129.64,130.21,130.06,131.37,128.6,
SVN,97.43,98.35,98.89,99.29,99.17,99.61,99.35,99.58,99.55,99.44,...,127.15,129.93,131.49,132.11,133.11,133.97,134.65,135.23,135.77,
SWE,98.07587,98.18581,98.07587,99.06542,98.73557,99.17537,99.94502,100.1649,99.94502,99.61517,...,138.5377,139.967,143.2655,143.1556,142.3859,143.9252,146.2342,145.7944,144.2551,
TUR,71.81463,73.47893,74.56522,74.6636,75.52855,76.04095,77.05756,78.59068,79.82455,80.50912,...,569.6923,600.8178,638.8711,657.6825,677.035,694.961,717.101,731.5427,748.2839,
ZAF,76.57256,77.66993,78.64538,79.17374,79.49889,79.62083,79.98662,80.31177,80.4337,80.47434,...,151.2106,153.9223,157.1505,160.637,159.8623,161.0244,161.6701,162.4449,162.4449,


In [48]:
# Clean the data by replacing NaN values with 0
important_data = important_data.fillna(0)

# Make sure we replaced correctly...
# (tail() is not the last expression in the cell, so only the index is printed)
important_data.tail()
print(important_data.index)

Index(['AUT', 'BEL', 'CHE', 'COL', 'CRI', 'CZE', 'DEU', 'DNK', 'EA19', 'ESP',
       'EST', 'EU27_2020', 'FIN', 'FRA', 'G-7', 'GBR', 'GRC', 'HUN', 'IRL',
       'ISL', 'ISR', 'ITA', 'JPN', 'KOR', 'LTU', 'LUX', 'LVA', 'MEX', 'NLD',
       'NOR', 'OECD', 'OECDE', 'POL', 'PRT', 'RUS', 'SVK', 'SVN', 'SWE', 'TUR',
       'ZAF'],
      dtype='object', name='LOCATION')
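
The index above includes codes such as `EA19`, `EU27_2020`, `G-7`, `OECD` and `OECDE`, which appear to be regional or group aggregates rather than individual countries. If the plot should contain countries only, one option is to drop those rows before the dimension reduction. The sketch below assumes that list of aggregate codes and uses a tiny stand-in table with made-up values.

```python
import pandas as pd

# Assumed aggregate codes, read off the index printed above.
AGGREGATES = ["EA19", "EU27_2020", "G-7", "OECD", "OECDE"]

# Tiny stand-in for the wide-form table (made-up values).
wide_df = pd.DataFrame(
    {"2011-01": [98.7, 100.1, 99.5], "2011-02": [99.1, 100.3, 99.8]},
    index=pd.Index(["AUT", "EA19", "OECD"], name="LOCATION"),
)

# Keep only the rows whose index label is not in the aggregate list.
countries_only = wide_df[~wide_df.index.isin(AGGREGATES)]

print(list(countries_only.index))
```

Whether to keep the aggregates is a judgment call; they can also serve as useful reference points in the scatter plot.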


In [46]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize each month (column) across locations, then project onto the first two principal components
data_scaled = StandardScaler().fit_transform(important_data)
pca = PCA(n_components=2)
r = pca.fit_transform(data_scaled)
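
One way to sanity-check a PCA result like the one above is to look at the shape of the embedding and at `explained_variance_ratio_`, which reports the fraction of total variance captured by each component. The sketch below runs the same two steps on synthetic random-walk data; the array `X` is a made-up stand-in with the same shape as the pivoted table, not the OECD data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 40 "locations" x 145 "months" of random-walk series.
rng = np.random.default_rng(0)
X = 100 + rng.normal(size=(40, 145)).cumsum(axis=1)

# Same pipeline as in the notebook: standardize columns, then reduce to 2D.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
embedding = pca.fit_transform(X_scaled)

# There should be exactly one 2D point per input row.
print(embedding.shape)

# Fraction of variance captured by each component, and by both together.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the two components together capture only a small share of the variance, the 2D scatter plot may hide much of the structure in the data.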

In [55]:
# Define PCA dataframe
df_pca = pd.DataFrame({
    "x":r[:,0], 
    "y":r[:,1], 
    "label":list(important_data.index),
})

# Plot the PCA data
alt.Chart(df_pca).mark_point().encode(x="x:Q", y="y:Q", color="label:N").properties(width=800, height=600)

# TODO: Fix Legend (Title should be 'Country' and Legend should fit)
# TODO: x-axis should be 'Principal Component 1' and y-axis should be 'Principal Component 2'
# TODO: Title should be '2D PCA of OECD PPI by Country'
# TODO: Confirm data representation is correct & the PCA output is correct