## Notebook for downloading inputs to create SLIIDERS-ECON

This notebook contains directions for downloading various input datasets to create the final product for this directory, the **SLIIDERS-ECON** dataset.

In general, we keep each dataset's format, file name, and data unaltered, but we make changes when
- the file name is not human-readable, is too long, or is not informative about the dataset (we assign a more appropriate file name)
- the file format causes errors (we save the data in a similar, less error-prone file format)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import ssl
import subprocess
import tarfile
from io import BytesIO
from pathlib import Path
from urllib import request as urequest
from zipfile import ZipFile

import dask.distributed as dd
import pandas as pd
import requests
from dask_gateway import Gateway
from pandas_datareader import wb as dr_wb
from sliiders import settings as sset
from tqdm.auto import tqdm

# dask gateway setup
gateway = Gateway()
image_name = sset.DASK_IMAGE

In [None]:
# creating select directories
PWT_DIRECTORY = Path(os.path.dirname(sset.PATH_PWT_RAW))
IMF_WEO_DIRECTORY = Path(os.path.dirname(sset.PATH_IMF_WEO_RAW))
MPD_DIRECTORY = Path(os.path.dirname(sset.PATH_MPD_RAW))
GWDB_DIRECTORY = Path(os.path.dirname(sset.PATH_GWDB2021_RAW))
SRTM15PLUS_DIRECTORY = Path(os.path.dirname(sset.PATH_SRTM15_PLUS))

directories_to_create = [
    sset.DIR_YPK_RAW,
    PWT_DIRECTORY,
    IMF_WEO_DIRECTORY,
    MPD_DIRECTORY,
    GWDB_DIRECTORY,
    SRTM15PLUS_DIRECTORY,
    sset.DIR_WB_WDI_RAW,
    sset.DIR_LITPOP_RAW,
    sset.DIR_GEG15_RAW,
    sset.DIR_CIA_RAW,
    sset.DIR_CCI_RAW,
    sset.DIR_UN_WPP_RAW,
    sset.DIR_UN_AMA_RAW,
    sset.DIR_ALAND_STATISTICS_RAW,
    sset.DIR_OECD_REGIONS_RAW,
    sset.DIR_LANDSCAN_RAW,
    sset.DIR_IIASA_PROJECTIONS,
]
for direc in directories_to_create:
    os.makedirs(direc, exist_ok=True)

## Fetching raw data from various sources

### Penn World Tables 10.0 (PWT 10.0)

In [None]:
# PWT10.0
pwt100_data = pd.read_excel("https://www.rug.nl/ggdc/docs/pwt100.xlsx", sheet_name=2)

# PWT10.0 capital details
pwt100_data_K = pd.read_excel(
    "https://www.rug.nl/ggdc/docs/pwt100-capital-detail.xlsx", sheet_name=2
)

pwt_filenames = ["pwt_100.xlsx", "pwt_K_detail_100.xlsx"]
for i, data in enumerate([pwt100_data, pwt100_data_K]):
    data.to_excel(
        excel_writer=(PWT_DIRECTORY / pwt_filenames[i]),
        sheet_name="Sheet1",
        index=False,
    )

### Maddison Project Database 2020 (MPD)

In [None]:
madd = pd.read_excel(
    "https://www.rug.nl/ggdc/historicaldevelopment/maddison/data/mpd2020.xlsx",
    sheet_name=2,
)
madd.to_excel(
    excel_writer=(sset.PATH_MPD_RAW),
    index=False,
    sheet_name="Sheet1",
)

### World Bank WDI (WB WDI)

#### Investment-to-GDP ratio, GDP and GDPpc (nominal and PPP), and Population

In [None]:
# country name and iso3 country code information
country_info = dr_wb.get_countries()[["name", "iso3c"]].rename(
    columns={"name": "country", "iso3c": "ccode"}
)

# relevant indicator information for the `dr_wb` module to fetch the variables
wbwdi_indicators = [
    "SP.POP.TOTL",  # population
    "NE.GDI.FTOT.ZS",  # investment-to-GDP ratio
    "NY.GDP.MKTP.PP.KD",  # GDP PPP
    "NY.GDP.PCAP.PP.KD",  # GDP per capita PPP
    "NY.GDP.MKTP.KD",  # GDP nominal
    "NY.GDP.PCAP.KD",  # GDP per capita nominal
]

wbwdi_info = None
for indi in wbwdi_indicators:
    indi_info = (
        dr_wb.download(indicator=indi, country="all", start=1950, end=2020)
        .reset_index()
        .astype({"year": "int64"})
        .merge(country_info, on=["country"], how="left")
        .set_index(["ccode", "year"])
    )

    if wbwdi_info is None:
        # first indicator: start the combined table (keeps the `country` column)
        wbwdi_info = indi_info.copy()
    else:
        # later indicators: outer-join on (ccode, year) without duplicating `country`
        wbwdi_info = wbwdi_info.merge(
            indi_info.drop(["country"], axis=1),
            left_index=True,
            right_index=True,
            how="outer",
        )

# excluding those that have no information and saving the data
wb_info_vars = [x for x in wbwdi_info.columns if x != "country"]
wbwdi_info = wbwdi_info.loc[~pd.isnull(wbwdi_info[wb_info_vars]).all(axis=1), :]
wbwdi_info.to_parquet(sset.DIR_WB_WDI_RAW / "wdi_pop_iy_gdp.parquet")

#### WB WDI: exchange rate

In [None]:
# country name and iso3 country code information
country_info = dr_wb.get_countries()[["name", "iso3c"]].rename(
    columns={"name": "country", "iso3c": "ccode"}
)

xr_code = "PA.NUS.FCRF"
xr_wb = dr_wb.download(indicator=xr_code, country="all", start=1950, end=2019)
xr_wb = (
    xr_wb.reset_index()
    .astype({"year": "int64"})
    .merge(country_info, on=["country"], how="left")
)
(
    xr_wb.set_index(["ccode", "year"])
    .rename(columns={xr_code: "xrate"})
    .to_parquet(sset.DIR_WB_WDI_RAW / "wdi_xr.parquet")
)

### UN WPP populations (overall and by-population-group data)

In [None]:
# overall information
un_df = pd.read_csv(
    "https://population.un.org/wpp/Download/Files/"
    "1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv"
)

# by_age_group
by_age = pd.read_csv(
    "https://population.un.org/wpp/Download/Files/1_Indicators"
    "%20(Standard)/CSV_FILES/WPP2019_PopulationByAgeSex_Medium.csv"
)

# exporting
un_df.to_csv(sset.DIR_UN_WPP_RAW / "UN_WPP2019_TotalPopulation.csv", index=False)
by_age.to_csv(sset.DIR_UN_WPP_RAW / "UN_WPP2019_Population_by_Age.csv", index=False)

### Åland Island GDP and population (from Statistics and Research Åland or ÅSUB)

Note that when newer versions are released, older ÅSUB links may stop working; the links in `ALA_GDP_LINK` and `ALA_POP_LINK` below are valid as of 2022-03-29.

In [None]:
# links
ALA_GDP_LINK = (
    "https://www.asub.ax/sites/www.asub.ax/files/attachments/page/nr005en.xls"
)
ALA_POP_LINK = (
    "https://www.asub.ax/sites/www.asub.ax/files/attachments/page/alv01_aland_faroe"
    "_islands_and_greenland_-_an_overview_with_comparable_data.xlsx"
)

# datasets read-in
ala_gdp = pd.read_excel(ALA_GDP_LINK, header=3)
ala_pop = pd.read_excel(ALA_POP_LINK, header=2, sheet_name="Population development")

# exporting
ala_gdp.to_excel(sset.DIR_ALAND_STATISTICS_RAW / "aland_gdp.xlsx", index=False)
ala_pop.to_excel(sset.DIR_ALAND_STATISTICS_RAW / "aland_pop.xlsx", index=False)

### Global Wealth Databook (from Credit Suisse)

We download the 2021 vintage (latest as of 2022-03-21).

In [None]:
URL_GWDB = (
    "https://www.credit-suisse.com/media/assets/corporate/docs/about-us/research"
    "/publications/global-wealth-databook-2021.pdf"
)

# download the PDF and write it to disk; both handles are closed on exit
with urequest.urlopen(URL_GWDB) as gwr_raw:
    with open(sset.PATH_GWDB2021_RAW, "wb") as file:
        file.write(gwr_raw.read())

### CIA World Factbook, versions 2000 to 2020

In [None]:
cia_download_url = "https://www.cia.gov/the-world-factbook/about/archives/download"
# years excluded from the download
skip_years = {2000, 2001, 2019, 2020}
cia_files = [f"factbook-{x}.zip" for x in range(2000, 2021) if x not in skip_years]

for fname in tqdm(cia_files):
    # fetch each archive into memory and extract it into the raw CIA directory
    cia_req = requests.get("/".join([cia_download_url, fname]))
    cia_zip = ZipFile(BytesIO(cia_req.content))
    cia_zip.extractall(str(sset.DIR_CIA_RAW))

### LitPop (Eberenz et al. 2020, Earth Syst. Sci. Data)

#### Download Data from the Internet
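
The download helper defined below derives each local file name by dropping everything from `?sequence` onward. As a minimal sketch of the same idea (not the code this notebook uses), the name can also be taken from the URL path with the standard library, which does not depend on the link containing `?sequence`. The function name `filename_from_link` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def filename_from_link(link):
    """Return the last path component of `link`, ignoring any query string."""
    # urlsplit separates the path from the query, so "?sequence=..." never
    # reaches the file name
    return PurePosixPath(urlsplit(link).path).name
```

For example, a link ending in `/LitPop_v1_2.tar?sequence=16&isAllowed=y` yields `LitPop_v1_2.tar`.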

In [None]:
# link for downloading the LitPop files
link_base = (
    "https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/331316"
)

# readme, data, normalized data, and metadata
links = [
    link_base + "/_readme_v1_2.txt?sequence=18&isAllowed=y",
    link_base + "/LitPop_v1_2.tar?sequence=16&isAllowed=y",
    link_base + "/Lit_Pop_norm_v1.tar?sequence=4&isAllowed=y",
    link_base + "/_metadata_countries_v1_2.csv?sequence=12&isAllowed=y",
]

In [None]:
def litpop_download(link, direc=sset.DIR_LITPOP_RAW):
    """Given a URL link, downloads (LitPop-related) data from the web and saves it in
    the specified local directory. The file name is parsed so that anything after the
    string `?sequence` is dropped (e.g., `file.txt?sequence=..` to `file.txt`).

    Parameters
    ----------
    link : str
        URL link for the file online
    direc : str or pathlib.Path
        directory to store the LitPop datasets

    Returns
    -------
    None, but saves the file downloaded from online to `direc`.

    """
    # accept either str or pathlib.Path
    direc = Path(direc)

    stop = link.find("?sequence")
    start = link.rfind("/", 0, stop) + 1
    urequest.urlretrieve(link, direc / link[start:stop])

    return None

In [None]:
# cluster setup
N_CLUSTER = len(links)
cluster = gateway.new_cluster(worker_image=image_name, profile="micro")
client = cluster.get_client()
cluster.scale(N_CLUSTER)
cluster

In [None]:
# takes approximately 20 minutes
futures = client.map(litpop_download, links)
dd.progress(futures)

In [None]:
cluster.scale(0)
client.close()
cluster.close()
cluster.shutdown()

#### Un-tar and clear storage

We only un-tar the regular (not normalized) LitPop data here.

In [None]:
# un-tar
regular_litpop = sset.DIR_LITPOP_RAW / "LitPop_v1_2.tar"
with tarfile.open(regular_litpop) as file:
    file.extractall(sset.DIR_LITPOP_RAW)

# clear storage for the existing tar file
os.remove(regular_litpop)

### GEG-15

We download the 2'30" (2.5 arc-minute) GEG-15 dataset and unzip it.

In [None]:
# downloading
zip_url = (
    "https://data.humdata.org/dataset/1c9cf1eb-c20a-4a06-8309-9416464af746/"
    "resource/e321d56d-022e-4070-80ac-f7860646408d/download/gar-exp.zip"
)
zip_path = sset.DIR_GEG15_RAW / "gar-exp.zip"
urequest.urlretrieve(zip_url, zip_path)

# unzipping
outpath = sset.DIR_GEG15_RAW / zip_path.stem
os.makedirs(outpath, exist_ok=True)
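# `subprocess.run` blocks until `unzip` finishes (unlike `Popen`), so the zip file
# is not removed while extraction is still in progress
subprocess.run(["unzip", f"{zip_path}", "-d", f"{outpath}"], check=True)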

In [None]:
# remove zip file (use after unzipping)
os.remove(zip_path)

### Country-level Construction Cost Index from [Lincke and Hinkel (2021, *Earth's Future*)](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2020EF001965?campaign=woletoc)

The accompanying GitHub repository to Lincke and Hinkel (2021) is at [this link](https://github.com/daniellincke/DIVA_paper_migration).

In [None]:
# raw data file from the GitHub repo
lincke_hinkel_cci_url = (
    "https://raw.githubusercontent.com/daniellincke/"
    "DIVA_paper_migration/master/data/csv/country_input.csv"
)

# data read-in
lincke_hinkel_df = pd.read_csv(lincke_hinkel_cci_url)

# saving at PATH_EXPOSURE_LINCKE
lincke_hinkel_df.to_parquet(sset.PATH_EXPOSURE_LINCKE)

### SRTM 15+

We download the latest version, which is version 2.4 (as of 2022-03-29).
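
The cell below works around a certificate error by replacing the default HTTPS context, which turns off certificate verification for every later `urllib` request in the session. A narrower alternative, sketched here as an assumption rather than the method this notebook uses, is to pass an unverified context to a single request only. The helper name `fetch_without_verification` is hypothetical.

```python
import shutil
import ssl
from urllib import request as urequest


def fetch_without_verification(url, out_path):
    """Download `url` to `out_path`, skipping certificate checks for this call only."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    with urequest.urlopen(url, context=ctx) as resp, open(out_path, "wb") as f:
        # stream to disk instead of holding the whole file in memory
        shutil.copyfileobj(resp, f)
    return out_path
```

With this approach, other downloads in the same session keep the default certificate verification.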

In [None]:
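# Workaround for a urllib request error. Note that this disables HTTPS certificate
# verification for all later urllib requests in this session.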
ssl._create_default_https_context = ssl._create_unverified_context
URL_SRTM15 = "https://topex.ucsd.edu/pub/srtm15_plus/SRTM15_V2.4.nc"

urequest.urlretrieve(URL_SRTM15, SRTM15PLUS_DIRECTORY / URL_SRTM15.split("/")[-1])

### National Levee Database

Initialize directory structure to fill manually in a later part of this notebook.

In [None]:
sset.DIR_NLDB.mkdir(parents=True, exist_ok=True)

for state in sset.NLDB_STATES:
    (sset.DIR_NLDB / state).mkdir(exist_ok=True)

## Further data requiring separate manual instructions

In all cases below, `sset` is defined by `from sliiders import settings as sset` as above.

### UN Statistics National Accounts (Analysis of Main Aggregates; abbreviated as UN AMA)

#### UN AMA nominal (current prices) GDP per capita information

1. Go to this [link](https://unstats.un.org/unsd/snaama/Basic) to reach the UN Statistics National Accounts search page.
2. Select all countries and all years available, and select "GDP, Per Capita GDP - US Dollars".
3. Select "Export to CSV", and you will download the file `Results.csv`. Rename this file as `un_snaama_nom_gdppc.csv`. We save this in `sset.DIR_UN_AMA_RAW`.

#### UN AMA nominal (current prices) GDP information

1. As with the nominal GDP per capita information, go to this [link](https://unstats.un.org/unsd/snaama/Basic) to reach the UN Statistics National Accounts search page.
2. Select all countries and all years available, and select "GDP, at current prices - US Dollars".
3. Select "Export to CSV", and you will download the file `Results.csv`. Rename this file as `un_snaama_nom_gdp.csv`. We save this in `sset.DIR_UN_AMA_RAW`.
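
The rename-and-move step in both UN AMA procedures above can be scripted. The following is a minimal sketch, not part of the original workflow; the helper name `stash_download` is hypothetical, and it assumes the browser saved `Results.csv` to a known location.

```python
import shutil
from pathlib import Path


def stash_download(downloaded, dest_dir, new_name):
    """Move a manually downloaded file into `dest_dir` under `new_name`.

    Hypothetical helper for illustration; refuses to overwrite an existing file.
    """
    downloaded, dest_dir = Path(downloaded), Path(dest_dir)
    target = dest_dir / new_name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(downloaded), str(target))
    return target
```

A call such as `stash_download(Path.home() / "Downloads" / "Results.csv", sset.DIR_UN_AMA_RAW, "un_snaama_nom_gdp.csv")` would then cover step 3, where the `Downloads` location is an assumption about the local setup. Because both exports are named `Results.csv`, moving each file right after its download avoids mixing them up.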

### OECD region-level information

#### OECD: population (region-level)
1. Go to the [OECD.Stat website](https://stats.oecd.org/).
2. On the left, find the header "Regions and Cities" and click the "+" button.
3. From the drop-down menu, click on "Regional Statistics".
4. Again from the drop-down menu, click on "Regional Demography."
5. Finally, select "Population by 5-year age groups, small regions TL3." Make sure that "Indicator" is set to "Population, All ages".
6. Download the file by selecting "Export," then "Text File (CSV)."
7. When a pop-up appears, select "Default format" then "Download." Rename the downloaded file to `REGION_DEMOGR.csv`, since its original name contains seemingly arbitrary numeric parts. Note that this step may take longer than the others.
8. Finally, move the file to `sset.DIR_OECD_REGIONS_RAW`.

#### OECD: GDP (region-level, in millions of constant 2015 PPP USD)
1. As with the population information, go to the [OECD.Stat website](https://stats.oecd.org/).
2. On the left, find the header "Regions and Cities" and click the "+" button.
3. From the drop-down menu, click on "Regional Statistics".
4. Again from the drop-down menu, click on "Regional Economy."
5. Finally, select "Gross Domestic Product, Small regions TL3." Make sure that "Measure" is set to "Millions USD, constant prices, constant PPP, base year 2015".
6. Download the file by selecting "Export," then "Text File (CSV)."
7. When a pop-up appears, select "Default format" then "Download." Rename the downloaded file to `REGION_ECONOM.csv`, since its original name contains seemingly arbitrary numeric parts. Note that this step may take longer than the others.
8. Finally, move the file to `sset.DIR_OECD_REGIONS_RAW`.

### IMF investment-to-GDP ratio, population, and GDP

1. Go to this [link](https://www.imf.org/en/Publications/SPROLLs/world-economic-outlook-databases#sort=%40imfdate%20descending) to reach the World Economic Outlook Databases page.
2. Click on the latest "World Economic Outlook Database" link on the page; for our purposes, we have used the latest available one, which was "World Economic Outlook Database, October 2021" (may be updated in the future).
3. Click "By Countries", then click "ALL COUNTRIES", then click "CONTINUE" on the page that says "Select Countries."
4. Under the "NATIONAL ACCOUNTS" tab, check the following categories:
   - Gross domestic product, current prices (U.S. DOLLARS)
   - Gross domestic product per capita, current prices (U.S. DOLLARS)
   - Gross domestic product per capita, constant prices (PURCHASING POWER PARITY; 2017 INTERNATIONAL DOLLARS)
   - Total investment (PERCENT OF GDP)
5. Under the "PEOPLE" tab, check the category "Population," then click on "CONTINUE."
6. Under the tab "DATE RANGE," use the earliest year for "Start Year" (1980, in our case), and the latest non-future year for "End Year" (2020, in our case).
7. Under the tab "ADVANCED SETTINGS", select "ISO Alpha-3 Code" to include country codes.
8. Click on "PREPARE REPORT." Then, click on "DOWNLOAD REPORT." Saved data should be in Excel format and be named `WEO_Data.xls`.
9. Open the file in Excel and re-save it in a format of your choice (we chose `.xlsx`); the format of the original file cannot be determined by pandas, which raises the error `ValueError: Excel file format cannot be determined, you must specify an engine manually.`
10. In our implementation, we save this file as `sset.PATH_IMF_WEO_RAW`.
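
The `ValueError` in step 9 arises when pandas cannot recognize a file's contents as an Excel format. One way to check what a downloaded file actually contains, before re-saving it, is to inspect its leading bytes. The sketch below is an illustration rather than part of the original workflow; the function name `sniff_spreadsheet_format` is hypothetical, and it only distinguishes the two standard Excel container signatures from everything else.

```python
# Leading bytes ("magic numbers") of the two Excel container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx files are ZIP archives


def sniff_spreadsheet_format(path):
    """Return "xls", "xlsx-or-zip", or "unknown" based on the first bytes of `path`."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-or-zip"
    return "unknown"
```

If this returns `"unknown"` for `WEO_Data.xls`, the file is not a true Excel workbook despite its extension, which would be consistent with the error above.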

### World Bank International Comparison Program 2017 (WB ICP 2017): Construction Cost Index

While most World Bank data can be downloaded using `pandas_datareader.wb`, variables in WB ICP 2017, including `1501200:CONSTRUCTION`, which is necessary for SLIIDERS-ECON, do not appear to be downloadable with that module (despite being searchable with `pandas_datareader.wb.search`). We therefore follow the manual process below to download the WB ICP 2017 dataset.
1. Use [this link](https://databank.worldbank.org/embed/ICP-2017-Cycle/id/4add74e?inf=n) to access WB ICP 2017 in table format.
2. On the webpage, click the download icon in the upper-right corner (a downward arrow with a line beneath it). This should start the download.
3. When the download finishes, there should be a `.zip` file called `ICP 2017 Cycle.zip`. Access the `.csv` file whose name ends in `_Data.csv` (there should be two files in the `.zip` file, the other being a file whose name ends in `_Series - Metadata.csv`).
4. Save the said `.csv` file as `sset.PATH_EXPOSURE_WB_ICP`.
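
Steps 3 and 4 can also be done programmatically. The following is a minimal sketch under the assumption that the archive contains exactly one member whose name ends in `_Data.csv`, as described above; the function name `extract_icp_data` is hypothetical.

```python
import shutil
from pathlib import Path
from zipfile import ZipFile


def extract_icp_data(zip_path, out_path, suffix="_Data.csv"):
    """Copy the single archive member ending in `suffix` to `out_path`."""
    out_path = Path(out_path)
    with ZipFile(zip_path) as zf:
        # the metadata file ends in "Metadata.csv", so it does not match "_Data.csv"
        matches = [name for name in zf.namelist() if name.endswith(suffix)]
        if len(matches) != 1:
            raise ValueError(f"expected one '*{suffix}' member, found {matches}")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with zf.open(matches[0]) as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path
```

It could be called as `extract_icp_data("ICP 2017 Cycle.zip", sset.PATH_EXPOSURE_WB_ICP)`, assuming the zip file sits in the working directory.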

### IIASA and OECD models' GDP and population projections (2010-2100, every 5 years)

1. Go to the [IIASA SSP Database website](https://tntcat.iiasa.ac.at/SspDb); you may need to register and create a login.
2. Click on the "Download" tab at the top of the page.
3. Under "SSP Database Version 2 Downloads (2018)" and under the sub-header "Basic Elements", there is a download link for `SspDb_country_data_2013-06-12.csv.zip`. Click and download the said `.zip` file.
4. Extract `SspDb_country_data_2013-06-12.csv` from the archive. We save it in `sset.DIR_IIASA_PROJECTIONS`.

### LandScan 2019

1. To download this dataset, you need to first apply for an Oak Ridge National Laboratory account (link [here](https://landscan.ornl.gov/user/apply)).
2. After having gained access, go to the said website, click on "DOWNLOAD" -> "LandScan Datasets" -> "Continue to download" next to LandScan 2019.
3. Click on "By downloading LandScan 2019 I agree to the above terms" in the following webpage; this will download the file `LandScan Global 2019.zip`. We save this in `sset.DIR_LANDSCAN_RAW`.

### CIA World Factbook (compiled by Coleman [2020])

1. Go to this [link](https://github.com/iancoleman/cia_world_factbook_api) (credit to Coleman [2020]), and scroll down to the `readme.md`.
2. In the **Data** section of the `readme.md` file, click the "Historical" link; it leads to a `mega.nz` page hosting the `weekly_json.7z` file.
3. After checking that the file to download is named `weekly_json.7z`, click the "Download" button.
4. Once the download succeeds, move `weekly_json.7z` to the preferred directory (`sset.DIR_YPK_RAW` in this implementation).

### HydroSHEDS
1. Go to https://hydrosheds.org/downloads
2. Download the "standard" level-0 HydroBASINS files for each continent (use the Dropbox link if available; as of 2021-08-16, this appears as "NOTE: you may also download data from here."). Download the shapefiles into the directory defined in `settings.py` by `DIR_HYDROBASINS_RAW`.