## Data acquisition notebook for historical, country-level GDP (`Y`), population (`P`), and capital (`K`) information

In the SLIIDERS workflow, historical information is used only to generate the initial (i.e., year 2010) capital stock values and to create capital and population ratios with respect to a reference year (in our case, 2019). To support this, we gather and organize historical information for the 1950-2019 period. In this notebook, we acquire the data from the various sources used in this workflow.

## Importing necessary modules and functions

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import glob
import os
import shutil
import subprocess
import tarfile
import warnings
from itertools import product as lstprod
from pathlib import Path
from urllib import request as urequest
from zipfile import ZipFile

import dask.distributed as dd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rhg_compute_tools.kubernetes as rhgk
import statsmodels.api as sm
import xarray as xr
from dask_gateway import Gateway
from pandas_datareader import wb as dr_wb
from scipy.optimize import minimize as opt_min
from tqdm.auto import tqdm

from sliiders import settings as sset

# dask gateway setup
gateway = Gateway()
image_name = sset.DASK_IMAGE

In [None]:
## create the raw-data directories if they do not already exist
os.makedirs(sset.DIR_YPK_RAW, exist_ok=True)
os.makedirs(sset.DIR_LITPOP_RAW, exist_ok=True)
os.makedirs(sset.DIR_GEG15_RAW, exist_ok=True)

## Fetching all raw data from various sources

### Penn World Tables 10.0 (PWT 10.0)

In [None]:
## PWT10.0
pwt100_data = pd.read_excel("https://www.rug.nl/ggdc/docs/pwt100.xlsx", sheet_name=2)

## PWT10.0 capital details
pwt100_data_K = pd.read_excel(
    "https://www.rug.nl/ggdc/docs/pwt100-capital-detail.xlsx", sheet_name=2
)

pwt_filenames = ["pwt_100.xlsx", "pwt_K_detail_100.xlsx"]
for i, data in enumerate([pwt100_data, pwt100_data_K]):
    data.to_excel(
        excel_writer=(sset.DIR_YPK_RAW / pwt_filenames[i]),
        sheet_name="Sheet1",
        index=False,
    )

### Maddison Project Dataset

In [None]:
madd = pd.read_excel(
    "https://www.rug.nl/ggdc/historicaldevelopment/maddison/data/mpd2020.xlsx",
    sheet_name=2,
)
madd.to_excel(
    excel_writer=(sset.DIR_YPK_RAW / "maddison_project.xlsx"),
    index=False,
    sheet_name="Sheet1",
)

### World Bank WDI: Investment-to-GDP ratio, GDP and GDPpc (nominal and PPP), and Population

In [None]:
## country name and iso3 country code information
country_info = dr_wb.get_countries()[["name", "iso3c"]].rename(
    columns={"name": "country", "iso3c": "ccode"}
)

## relevant indicator information for the `dr_wb` module to fetch the variables
wbwdi_indicators = [
    "SP.POP.TOTL",  ## population
    "NE.GDI.FTOT.ZS",  ## investment-to-GDP ratio
    "NY.GDP.MKTP.PP.KD",  ## GDP PPP
    "NY.GDP.PCAP.PP.KD",  ## GDP per capita PPP
    "NY.GDP.MKTP.KD",  ## GDP nominal
    "NY.GDP.PCAP.KD",  ## GDP per capita nominal
]

for j, indi in enumerate(wbwdi_indicators):
    indi_info = (
        dr_wb.download(indicator=indi, country="all", start=1950, end=2020)
        .reset_index()
        .astype({"year": "int64"})
        .merge(country_info, on=["country"], how="left")
        .set_index(["ccode", "year"])
    )

    ## the first indicator initializes the table; later ones are outer-merged onto it
    if j == 0:
        wbwdi_info = indi_info.copy()
    else:
        wbwdi_info = wbwdi_info.merge(
            indi_info.drop(["country"], axis=1),
            left_index=True,
            right_index=True,
            how="outer",
        )

## excluding those that have no information and saving the data
wb_info_vars = [x for x in wbwdi_info.columns if x != "country"]
wbwdi_info = wbwdi_info.loc[~pd.isnull(wbwdi_info[wb_info_vars]).all(axis=1), :]
wbwdi_info.to_parquet(sset.DIR_YPK_RAW / "wdi_pop_iy_gdp.parquet")

### WB WDI: exchange rate

In [None]:
## country name and iso3 country code information
country_info = dr_wb.get_countries()[["name", "iso3c"]].rename(
    columns={"name": "country", "iso3c": "ccode"}
)

xr_code = "PA.NUS.FCRF"
xr_wb = dr_wb.download(indicator=xr_code, country="all", start=1950, end=2019)
xr_wb = (
    xr_wb.reset_index()
    .astype({"year": "int64"})
    .merge(country_info, on=["country"], how="left")
)
(
    xr_wb.set_index(["ccode", "year"])
    .rename(columns={xr_code: "xrate"})
    .to_parquet(sset.DIR_YPK_RAW / "wdi_xr.parquet")
)

### UN WPP populations (overall and by-population-group data)

In [None]:
## overall information
un_df = pd.read_csv(
    "https://population.un.org/wpp/Download/Files/"
    "1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv"
)

## by_age_group
by_age = pd.read_csv(
    "https://population.un.org/wpp/Download/Files/1_Indicators"
    "%20(Standard)/CSV_FILES/WPP2019_PopulationByAgeSex_Medium.csv"
)

## exporting
un_df.to_csv(sset.DIR_YPK_RAW / "UN_WPP2019_TotalPopulation.csv", index=False)
by_age.to_csv(sset.DIR_YPK_RAW / "UN_WPP2019_Population_by_Age.csv", index=False)

### Åland Island GDP and population

We will keep the format and data unaltered, but change the file name to be more human-readable.

In [None]:
## GDP information
ala_gdp = pd.read_excel(
    "https://www.asub.ax/sites/www.asub.ax/files/attachments/page/nr005en.xls",
    header=3,
)

## population
ala_pop_link = (
    "https://www.asub.ax/sites/www.asub.ax/files/attachments/page/alv01_aland_faroe"
    "_islands_and_greenland_-_an_overview_with_comparable_data.xlsx"
)
ala_pop = pd.read_excel(
    ala_pop_link,
    header=2,
    sheet_name="Population development",
)

## exporting
ala_gdp.to_excel(sset.DIR_YPK_RAW / "aland_gdp.xlsx", index=False)
ala_pop.to_excel(sset.DIR_YPK_RAW / "aland_pop.xlsx", index=False)

### LitPop (Eberenz et al. 2020, Earth Syst. Sci. Data)

#### Download Data from the Internet

In [None]:
## directory for the litpop dataset to be stored in
direc = sset.DIR_LITPOP_RAW


def litpop_download(link, direc=direc):
    """Given a URL link, downloads (LitPop-related) data from the web and saves it in
    the specified local directory. The file name is parsed so that anything after the
    string `?sequence` is dropped (e.g., `file.txt?sequence=..` to `file.txt`).

    Parameters
    ----------
    link : str
        URL link for the file online
    direc : pathlib.Path
        directory in which to save the downloaded file

    Returns
    -------
    None, but saves the file downloaded from online to `direc`.

    """
    stop = link.find("?sequence")
    start = link.rfind("/", 0, stop) + 1
    urequest.urlretrieve(link, direc / link[start:stop])

    return None
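
To illustrate the file-name parsing in `litpop_download`, here is a minimal sketch of the same slicing logic pulled out into a standalone helper. The name `parse_litpop_filename` is hypothetical (it is not part of the notebook or of `sliiders`), and the sketch assumes that every URL contains the substring `?sequence`; it raises an error otherwise.

```python
def parse_litpop_filename(link):
    """Return the file name embedded in a LitPop download URL.

    Hypothetical helper mirroring the slicing used in `litpop_download`.
    """
    stop = link.find("?sequence")
    if stop == -1:
        # assumption: treat a URL without `?sequence` as an error rather than
        # slicing with -1, which would silently drop the last character
        raise ValueError(f"'?sequence' not found in {link!r}")
    # the file name starts right after the last "/" that precedes `?sequence`
    start = link.rfind("/", 0, stop) + 1
    return link[start:stop]


example_link = (
    "https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/331316"
    "/_readme_v1_2.txt?sequence=18&isAllowed=y"
)
print(parse_litpop_filename(example_link))  # prints "_readme_v1_2.txt"
```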

In [None]:
link_base = (
    "https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/331316"
)

## readme, data, normalized data, and metadata
links = [
    link_base + "/_readme_v1_2.txt?sequence=18&isAllowed=y",
    link_base + "/LitPop_v1_2.tar?sequence=16&isAllowed=y",
    link_base + "/Lit_Pop_norm_v1.tar?sequence=4&isAllowed=y",
    link_base + "/_metadata_countries_v1_2.csv?sequence=12&isAllowed=y",
]

In [None]:
## cluster setup
N_CLUSTER = len(links)
cluster = gateway.new_cluster(worker_image=image_name, profile="micro")
client = cluster.get_client()
cluster.scale(N_CLUSTER)
cluster

In [None]:
## Takes approximately 20 minutes
futures = client.map(litpop_download, links)
dd.progress(futures)

In [None]:
cluster.scale(0)
client.close()
cluster.close()
cluster.shutdown()
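
If a Dask Gateway cluster is not available, the same fan-out of one download per link could be approximated on a single machine with a thread pool. This is only a sketch of an alternative, not what the notebook does: `run_in_threads` is a hypothetical helper, and the placeholder `fake_download` stands in for `litpop_download` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(func, items, max_workers=4):
    """Apply `func` to every item concurrently and return results in input order.

    Hypothetical helper; downloads are I/O-bound, so threads are usually adequate.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # `pool.map` preserves input order and re-raises any worker exception
        return list(pool.map(func, items))


def fake_download(link):
    # placeholder for `litpop_download`; returns a label instead of fetching
    return f"done: {link}"


results = run_in_threads(fake_download, ["link_a", "link_b", "link_c"])
print(results)
```

In the notebook's context, the call would look like `run_in_threads(litpop_download, links, max_workers=len(links))`.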

#### Un-tar and clear storage

We only un-tar the regular (not normalized) LitPop data here.

In [None]:
# un-tar
regular_litpop = sset.DIR_LITPOP_RAW / "LitPop_v1_2.tar"
with tarfile.open(regular_litpop) as file:
    file.extractall(sset.DIR_LITPOP_RAW)

# clear storage for the existing tar file
os.remove(regular_litpop)
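
Because `extractall` writes whatever paths the archive contains, it can be worth checking member paths before extracting an archive downloaded from the web. The sketch below is one possible check, under the assumption that rejecting any member that resolves outside the destination directory is sufficient; it does not inspect symbolic links. The helper name `safe_extract_tar` is hypothetical, and the demonstration uses a throwaway archive rather than the LitPop data.

```python
import os
import tarfile
import tempfile
from pathlib import Path


def safe_extract_tar(tar_path, dest):
    """Extract `tar_path` into `dest`, refusing members that would land outside it.

    Hypothetical helper; returns the names of the extracted members.
    """
    dest = Path(dest).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        members = tar.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # reject absolute paths or `..` components that escape `dest`
            if os.path.commonpath([dest, target]) != str(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest, members=members)
    return [m.name for m in members]


# self-contained demonstration on a temporary archive
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.txt").write_text("hello")
    with tarfile.open(tmp / "demo.tar", "w") as tar:
        tar.add(tmp / "a.txt", arcname="data/a.txt")
    extracted = safe_extract_tar(tmp / "demo.tar", tmp / "out")
    demo_ok = (tmp / "out" / "data" / "a.txt").read_text() == "hello"
```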

### GEG-15

We download the 2'30" (2.5 arc-minute) GEG-15 data and unzip it.

In [None]:
# downloading
zip_url = (
    "https://data.humdata.org/dataset/1c9cf1eb-c20a-4a06-8309-9416464af746/"
    "resource/e321d56d-022e-4070-80ac-f7860646408d/download/gar-exp.zip"
)
zip_path = sset.DIR_GEG15_RAW / "gar-exp.zip"
urequest.urlretrieve(zip_url, zip_path)

# unzipping
outpath = sset.DIR_GEG15_RAW / zip_path.stem
os.makedirs(outpath, exist_ok=True)
# wait for `unzip` to finish and raise an error if it fails
subprocess.run(["unzip", f"{zip_path}", "-d", f"{outpath}"], check=True)

In [None]:
# remove zip file (use after unzipping)
os.remove(zip_path)
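
On systems without the `unzip` command-line tool, the extraction step could instead use `ZipFile` from the standard library, which this notebook already imports. The following is a minimal sketch rather than the notebook's method; `unzip_to` is a hypothetical helper, and the demonstration builds a small temporary archive instead of using `gar-exp.zip`.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def unzip_to(zip_path, outpath):
    """Extract every member of `zip_path` into `outpath` and return member names.

    Hypothetical helper; it returns only after extraction has finished, so
    deleting the archive afterwards is safe.
    """
    outpath = Path(outpath)
    outpath.mkdir(parents=True, exist_ok=True)
    with ZipFile(zip_path) as zf:
        zf.extractall(outpath)
        return zf.namelist()


# self-contained demonstration on a temporary archive
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    with ZipFile(tmp / "demo.zip", "w") as zf:
        zf.writestr("gar-exp/readme.txt", "demo contents")
    names = unzip_to(tmp / "demo.zip", tmp / "demo")
    unzipped_text = (tmp / "demo" / "gar-exp" / "readme.txt").read_text()
```

With the variables defined in the cell above, the call would look like `unzip_to(zip_path, outpath)`.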

### Global Wealth Databook (from Credit Suisse)

We download the 2021 vintage (latest as of 2022-03-21).

In [None]:
## gathering the files; 2020 databook is missing
BASE_URL_GWDB = (
    "https://www.credit-suisse.com/media/assets/corporate/docs/about-us/research"
    "/publications/{}"
)

GWDB2021_PDFNAME = "global-wealth-databook-2021.pdf"
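os.makedirs(sset.DIR_GLOBAL_WEALTH_RAW, exist_ok=True)
## download the PDF; the context managers close both handles even on error
with urequest.urlopen(BASE_URL_GWDB.format(GWDB2021_PDFNAME)) as gwr_raw:
    with open(sset.DIR_GLOBAL_WEALTH_RAW / GWDB2021_PDFNAME, "wb") as file:
        file.write(gwr_raw.read())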

## Further data requiring separate manual instructions

Unless noted otherwise, we download these into the directory specified in `sset.DIR_YPK_RAW`.

### CIA World Factbook (compiled by Coleman [2020]; the version that is utilized in this workflow)

1. Go to this [link](https://github.com/iancoleman/cia_world_factbook_api) (credit to Coleman [2020]) and scroll down to the `readme.md`.
2. In the **Data** section of the `readme.md` file, there should be a link labeled "Historical"; click it to go to a `mega.nz` page hosting the file `weekly_json.7z`.
3. After checking that the file name is `weekly_json.7z`, click the "Download" button.
4. Once the download completes, move `weekly_json.7z` to the preferred directory (`sset.DIR_YPK_RAW` in this implementation).

### IMF investment-to-GDP ratio, population, and GDP

1. Go to this [link](https://www.imf.org/en/Publications/SPROLLs/world-economic-outlook-databases#sort=%40imfdate%20descending) to get to the World Economic Outlook Databases page.
2. Click on the latest "World Economic Outlook Database" link on the page; for our purposes, we have used the latest available one, which was "World Economic Outlook Database, October 2021" (may be updated in the future).
3. Click "By Countries", then click "ALL COUNTRIES", then click "CONTINUE" on the page that says "Select Countries."
4. Under the "NATIONAL ACCOUNTS" tab, check the following categories:
   - Gross domestic product, current prices (U.S. DOLLARS)
   - Gross domestic product per capita, current prices (U.S. DOLLARS)
   - Gross domestic product per capita, constant prices (PURCHASING POWER PARITY; 2017 INTERNATIONAL DOLLARS)
   - Total investment (PERCENT OF GDP)
5. Under the "PEOPLE" tab, check the category "Population," then click on "CONTINUE."
6. Under the tab "DATE RANGE," use the earliest year for "Start Year" (1980, in our case), and the latest non-future year for "End Year" (2020, in our case).
7. Under the tab "ADVANCED SETTINGS", select "ISO Alpha-3 Code" to include country codes.
8. Click on "PREPARE REPORT." Then, click on "DOWNLOAD REPORT." The saved data should be in Excel format and be named `WEO_Data.xls`.
9. Open the file in Excel and re-save it in a format of your choice (we chose `.xlsx`); this is necessary because pandas cannot read the original file as downloaded and raises the error `ValueError: Excel file format cannot be determined, you must specify an engine manually.`
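
The `ValueError` in step 9 is raised when pandas cannot recognize the file's contents as an Excel workbook. One way to check this before loading is to inspect the file's leading bytes: `.xlsx` files are ZIP archives and start with `PK\x03\x04`, while legacy `.xls` files start with the OLE2 signature `D0 CF 11 E0 A1 B1 1A E1`. The sketch below makes no claim about the actual contents of `WEO_Data.xls`; `sniff_excel_format` is a hypothetical helper, and the demonstration uses temporary files.

```python
import tempfile
from pathlib import Path

XLSX_MAGIC = b"PK\x03\x04"  # ZIP container, used by .xlsx (and other ZIP files)
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 container, used by .xls


def sniff_excel_format(path):
    """Return "xlsx", "xls", or None based on the file's first bytes.

    Hypothetical helper; a matching signature is necessary but not sufficient
    for the file to be a valid workbook.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(XLSX_MAGIC):
        return "xlsx"
    if head.startswith(XLS_MAGIC):
        return "xls"
    return None


with tempfile.TemporaryDirectory() as tmp:
    # a plain-text file with an .xls extension is not recognized
    text_file = Path(tmp) / "not_really_excel.xls"
    text_file.write_text("col_a\tcol_b\n1\t2\n")
    sniffed_text = sniff_excel_format(text_file)
    # a file starting with the ZIP signature is reported as "xlsx"
    zip_like = Path(tmp) / "zip_like.xlsx"
    zip_like.write_bytes(XLSX_MAGIC + b"rest")
    sniffed_zip = sniff_excel_format(zip_like)
```

If the helper returns `None` for the downloaded file, re-saving it from Excel as described in step 9 is the workaround used in this workflow.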

### UN Statistics National Accounts (Analysis of Main Aggregates, GDP per capita information)

1. Go to this [link](https://unstats.un.org/unsd/snaama/Basic) to get to the UN Statistics National Accounts search page.
2. Select all countries and all years available, and select "GDP, Per Capita GDP - US Dollars."
3. Select "Export to CSV" to download the file `Results.csv`. Rename this file to `un_snaama_nom_gdppc.csv`.

### OECD: region-level population information

1. Go to the [OECD.Stat website](https://stats.oecd.org/).
2. On the left, find the header "Regions and Cities" and click the "+" button.
3. From the drop-down menu, click on "Regional Statistics."
4. Again from the drop-down menu, click on "Regional Demography."
5. Finally, select "Population by 5-year age groups, small regions TL3."
6. Download the file by selecting "Export," then "Text File (CSV)."
7. When a pop-up appears, select "Default format," then "Download."
8. Upload the file to a folder of your choice in your JupyterLab environment.
9. Finally, move the file to the desired location; in our case, we also renamed it `REGION_DEMOGR.csv`, because the downloaded file name contains seemingly random numeric parts.
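
The renaming in step 9 can also be scripted. The sketch below assumes that the downloaded file name starts with `REGION_DEMOGR`, continues with a variable numeric part, and ends in `.csv`; that pattern, the made-up file name in the demonstration, and the helper name `rename_single_match` are assumptions for illustration rather than anything guaranteed by the OECD website.

```python
import tempfile
from pathlib import Path


def rename_single_match(directory, pattern, new_name):
    """Rename the one file in `directory` matching `pattern` to `new_name`.

    Hypothetical helper; raises if zero or several files match, to avoid
    renaming the wrong download.
    """
    directory = Path(directory)
    # ignore a file that already carries the target name
    matches = sorted(p for p in directory.glob(pattern) if p.name != new_name)
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one match for {pattern!r}, found {len(matches)}"
        )
    target = directory / new_name
    matches[0].rename(target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # made-up file name standing in for the real download
    (Path(tmp) / "REGION_DEMOGR_0123456789.csv").write_text("demo")
    renamed = rename_single_match(tmp, "REGION_DEMOGR*.csv", "REGION_DEMOGR.csv")
    renamed_name, renamed_exists = renamed.name, renamed.exists()
```

The same helper would apply to the regional GDP file by swapping in a `REGION_ECONOM*.csv` pattern, again assuming the download follows that naming.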

### OECD: region-level GDP (PPP 2015, in millions) information

1. Go to the [OECD.Stat website](https://stats.oecd.org/).
2. On the left, find the header "Regions and Cities" and click the "+" button.
3. From the drop-down menu, click on "Regional Statistics."
4. Again from the drop-down menu, click on "Regional Economy."
5. Finally, select "Gross Domestic Product, Small regions TL3."
6. Download the file by selecting "Export," then "Text File (CSV)."
7. When a pop-up appears, select "Default format," then "Download."
8. Upload the file to a folder of your choice in your JupyterLab environment.
9. Finally, move the file to the desired location; in our case, we also renamed it `REGION_ECONOM.csv`, because the downloaded file name contains seemingly random numeric parts.

### IIASA and OECD models' GDP and population projections (2010-2100, every 5 years)

1. Go to the [IIASA SSP Database website](https://tntcat.iiasa.ac.at/SspDb); you may need to register and create a login.
2. Among the tabs at the top of the page, click on "Download."
3. Under "SSP Database Version 2 Downloads (2018)" and the sub-header "Basic Elements", there is a download link for `SspDb_country_data_2013-06-12.csv.zip`. Click it to download the `.zip` file.
4. Extract and save `SspDb_country_data_2013-06-12.csv`. Again, for our purposes, we save this in `sset.DIR_YPK_RAW`.

### LandScan 2019

1. To download this dataset, you first need to apply for an Oak Ridge National Laboratory account (link [here](https://landscan.ornl.gov/user/apply)).
2. After gaining access, go to the same website and click on "DOWNLOAD" -> "LandScan Datasets" -> "Continue to download" next to LandScan 2019.
3. On the following webpage, click on "By downloading LandScan 2019 I agree to the above terms"; this will download the file `LandScan Global 2019.zip`. We save this in `sset.DIR_LANDSCAN_RAW`.