## Notebook for downloading inputs to create SLIIDERS

This notebook contains directions for downloading various input datasets to create the final product for this directory, the **SLIIDERS** dataset.

In general, we keep each dataset's format, file name, and contents unaltered, and make changes only when
- the file name is not human-readable, is too long, or says little about the dataset (we assign a more descriptive name)
- the file format causes errors (we save the data in a similar, less error-prone format)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import shutil
import subprocess
import tarfile
import tempfile
from io import BytesIO
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile

import cartopy.io.shapereader as shpreader
import geopandas as gpd
import numpy as np
import pandas as pd
import requests
from pandas_datareader import wb as dr_wb
from pint import UnitRegistry
from sliiders import settings as sset
from sliiders.io import save, save_geoparquet
from tqdm import tqdm

ureg = UnitRegistry()

## Fetching raw data from various sources
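
Most cells in this section write `requests.get(...).content` straight to disk, so a failed request (for example a 404 page) would be saved as if it were data. The helper below is a minimal sketch of a guard against that, not part of the `sliiders` package: `write_response` is a hypothetical name, and it assumes a local `pathlib.Path` destination rather than the cloud-style paths that `sset` may provide.

```python
from pathlib import Path


def write_response(resp, dest: Path) -> Path:
    """Write an HTTP response body to `dest`, failing loudly on HTTP errors.

    `resp` is expected to behave like `requests.Response`
    (it needs `raise_for_status()` and `.content`).
    """
    # raise on 4xx/5xx so that an error page is never saved as a dataset
    resp.raise_for_status()
    # assumption: `dest` is a local path whose parent may not exist yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    with dest.open("wb") as f:
        f.write(resp.content)
    return dest
```

A call would look like `write_response(requests.get(url, timeout=60), Path("raw/pwt_100.xlsx"))`, where the destination path is only an illustration.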

### NLDB

In [205]:
USACE_URL = "https://levees.sec.usace.army.mil/api-local/"

In [244]:
# get the IDs and names of all systems matching the query
ID_PARAMS = {
    "in": "@nation:USA",
    "md": "false",
    "syarray": "@PURPOSE_IDS:(7)",
    "return": "id,name",
}
system_ids = pd.read_json(
    requests.get(USACE_URL + "systems/query", params=ID_PARAMS).url
)

In [234]:
SHP_PARAMS = {"type": "leveed-area", "format": "geo", "props": "false", "coll": "false"}


def return_shp(sid):
    out = requests.get(
        USACE_URL + "geometries/query", params={**SHP_PARAMS, "systemId": sid}
    ).json()[0]
    # drop the 3d z-coord
    out["coordinates"] = [
        [np.array(i)[..., :2].tolist() for i in out["coordinates"][0]]
    ]
    out = str(out).replace(" ", "").replace("'", '"')
    return gpd.read_file(out, driver="GeoJSON").iloc[0, 0]

In [249]:
system_ids["geometry"] = gpd.GeoSeries(
    system_ids.id.map(return_shp), index=system_ids.index
)
system_ids = gpd.GeoDataFrame(system_ids, geometry="geometry")

In [252]:
def get_height(sid):
    out = requests.get(USACE_URL + "segments", params={"system_id": sid}).json()
    min_heights = [i["minHeight"] for i in out if "minHeight" in i.keys()]
    max_heights = [i["maxHeight"] for i in out if "maxHeight" in i.keys()]
    heights = min_heights + max_heights
    if not len(heights):
        return np.nan
    return min(heights)

In [253]:
system_ids["min_height"] = system_ids.id.map(get_height)
system_ids = system_ids.set_index("id")
system_ids["min_height"] = (system_ids.min_height.values * ureg.feet).to(ureg.meter)

In [271]:
save_geoparquet(system_ids, sset.PATH_NLDB)



### Natural Earth

In [16]:
# first, clear cache to download newest version
shpfilename = shpreader.natural_earth("10m", "physical", "land")
shutil.rmtree(Path(shpfilename).parent)

# now download
for kind in ["land", "ocean"]:
    shpfilename = shpreader.natural_earth("10m", "physical", kind)

sset.DIR_NATEARTH.upload_from(Path(shpfilename).parent)



### Penn World Tables 10.0 (PWT 10.0)

In [3]:
# PWT10.0
pwt100_data = pd.read_excel("https://www.rug.nl/ggdc/docs/pwt100.xlsx", sheet_name=2)

# PWT10.0 capital details
pwt100_data_K = pd.read_excel(
    "https://www.rug.nl/ggdc/docs/pwt100-capital-detail.xlsx", sheet_name=2
)

In [4]:
pwt_filenames = ["pwt_100.xlsx", "pwt_K_detail_100.xlsx"]
for i, data in enumerate([pwt100_data, pwt100_data_K]):
    data.to_excel(
        (sset.PATH_PWT_RAW.parent / pwt_filenames[i]).open("wb"),
        sheet_name="Sheet1",
        index=False,
    )

### Maddison Project Database 2020 (MPD)

In [5]:
madd = pd.read_excel(
    "https://www.rug.nl/ggdc/historicaldevelopment/maddison/data/mpd2020.xlsx",
    sheet_name=2,
)

In [6]:
madd.to_excel(
    excel_writer=sset.PATH_MPD_RAW.open("wb"),
    index=False,
    sheet_name="Sheet1",
)

### World Bank WDI (WB WDI)

#### Investment-to-GDP ratio, GDP and GDPpc (nominal and PPP), and Population

In [7]:
# country name and iso3 country code information
country_info = dr_wb.get_countries()[["name", "iso3c"]].rename(
    columns={"name": "country", "iso3c": "ccode"}
)

# relevant indicator information for the `dr_wb` module to fetch the variables
wbwdi_indicators = [
    "SP.POP.TOTL",  # population
    "NE.GDI.FTOT.ZS",  # investment-to-GDP ratio
    "NY.GDP.MKTP.PP.KD",  # GDP PPP
    "NY.GDP.PCAP.PP.KD",  # GDP per capita PPP
    "NY.GDP.MKTP.KD",  # GDP nominal
    "NY.GDP.PCAP.KD",  # GDP per capita nominal
]

j = 0
for indi in wbwdi_indicators:
    indi_info = (
        dr_wb.download(indicator=indi, country="all", start=1950, end=2020)
        .reset_index()
        .astype({"year": "int64"})
        .merge(country_info, on=["country"], how="left")
        .set_index(["ccode", "year"])
    )

    if j == 0:
        j += 1
        wbwdi_info = indi_info.copy()
    else:
        wbwdi_info = wbwdi_info.merge(
            indi_info.drop(["country"], axis=1),
            left_index=True,
            right_index=True,
            how="outer",
        )

# excluding those that have no information and saving the data
wb_info_vars = [x for x in wbwdi_info.columns if x != "country"]
wbwdi_info = wbwdi_info.loc[~pd.isnull(wbwdi_info[wb_info_vars]).all(axis=1), :]

In [8]:
save(wbwdi_info, sset.DIR_WB_WDI_RAW / "wdi_pop_iy_gdp.parquet")

#### WB WDI: exchange rate

In [9]:
# country name and iso3 country code information
country_info = dr_wb.get_countries()[["name", "iso3c"]].rename(
    columns={"name": "country", "iso3c": "ccode"}
)

xr_code = "PA.NUS.FCRF"
xr_wb = (
    dr_wb.download(indicator=xr_code, country="all", start=1950, end=2019)
    .reset_index()
    .astype({"year": "int64"})
    .merge(country_info, on=["country"], how="left")
    .set_index(["ccode", "year"])
    .rename(columns={xr_code: "xrate"})
)

In [10]:
save(xr_wb, sset.DIR_WB_WDI_RAW / "wdi_xr.parquet")

### UN WPP populations (overall and by-population-group data)

In [11]:
for ix, sex in enumerate(["MALE", "FEMALE"]):
    fname = f"WPP2022_POP_F02_{ix+2}_POPULATION_5-YEAR_AGE_GROUPS_{sex}"
    r = requests.get(
        "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)"
        f"/EXCEL_FILES/2_Population/{fname}.xlsx"
    )
    with (sset.DIR_UN_WPP_RAW / f"{fname}.xlsx").open("wb") as f:
        f.write(r.content)

### Åland Island GDP and population (from Statistics and Research Åland or ÅSUB)

Note that when newer versions are released, old ÅSUB links may stop working; the links used in the cell below were valid as of 2022-09-19.

In [12]:
for name in [
    "nr005en.xls",
    "alv01_aland_faroe_islands_and_greenland_-_an_overview_with_comparable_data.xlsx",
]:
    with (sset.DIR_ALAND_STATISTICS_RAW / name).open("wb") as f:
        f.write(
            requests.get(
                f"https://www.asub.ax/sites/default/files/attachments/page/{name}"
            ).content
        )

### Global Wealth Databook (from Credit Suisse)

We download the 2021 vintage (latest as of 2022-09-19).

In [13]:
with sset.PATH_GWDB_RAW.open("wb") as f:
    f.write(
        requests.get(
            "https://www.credit-suisse.com/media/assets/corporate/docs/about-us"
            f"/research/publications/global-wealth-databook-{sset.GWDB_YEAR}.pdf"
        ).content
    )

### LitPop (Eberenz et al. 2020, Earth Syst. Sci. Data)

#### Download Data from the Internet

In [14]:
# link for downloading the LitPop files
link_base = (
    "https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/331316"
)

# readme, data, normalized data, and metadata
links = [
    link_base + "/_readme_v1_2.txt?sequence=18&isAllowed=y",
    link_base + "/LitPop_v1_2.tar?sequence=16&isAllowed=y",
    link_base + "/Lit_Pop_norm_v1.tar?sequence=4&isAllowed=y",
    link_base + "/_metadata_countries_v1_2.csv?sequence=12&isAllowed=y",
]

In [15]:
for link in links:
    name = link.split("/")[-1].split("?")[0]
    with (sset.DIR_LITPOP_RAW / name).open("wb") as f:
        f.write(requests.get(link).content)

#### Un-tar and clear storage

We only un-tar the regular (not normalized) LitPop data here.

In [None]:
# un-tar into a temporary directory, then upload each extracted item
regular_litpop = sset.DIR_LITPOP_RAW / "LitPop_v1_2.tar"
with tarfile.open(regular_litpop) as tar:
    with tempfile.TemporaryDirectory() as d:
        tar.extractall(d)
        for extracted in Path(d).glob("*"):
            (sset.DIR_LITPOP_RAW / extracted.name).upload_from(extracted)

# clear storage for the existing tar file
regular_litpop.unlink()

### GEG-15

We download 2'30" GEG15 and unzip.

In [None]:
# downloading
zip_url = (
    "https://data.humdata.org/dataset/1c9cf1eb-c20a-4a06-8309-9416464af746/"
    "resource/e321d56d-022e-4070-80ac-f7860646408d/download/gar-exp.zip"
)

tmppath = Path("/tmp/gar-exp.zip")
outpath = Path("/tmp/gar-exp")
with tmppath.open("wb") as f:
    f.write(requests.get(zip_url).content)
subprocess.run(["unzip", str(tmppath), "-d", outpath])

In [None]:
# upload
for f in outpath.glob("*"):
    (sset.DIR_GEG15_RAW / f.name).upload_from(f, force_overwrite_to_cloud=True)

# remove local
shutil.rmtree(outpath)
tmppath.unlink()
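
The download cell above shells out to the `unzip` command-line tool, which has to be installed on the machine running the notebook. As a hedged alternative, the sketch below extracts an archive with the standard library only, in the same way the CIA World Factbook cell uses `ZipFile` with `BytesIO`. `extract_zip_bytes` is a hypothetical helper name and it assumes a local output directory.

```python
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile


def extract_zip_bytes(payload: bytes, out_dir: Path) -> list:
    """Extract an in-memory zip archive into `out_dir`.

    Returns the paths of all archive members, in archive order.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    with ZipFile(BytesIO(payload)) as zf:
        zf.extractall(out_dir)
        names = zf.namelist()
    return [out_dir / name for name in names]
```

With this helper, the download step could pass `requests.get(zip_url).content` directly, without writing the zip file to `/tmp` first.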

### Country-level Construction Cost Index from [Lincke and Hinkel (2021, *Earth's Future*)](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2020EF001965?campaign=woletoc)

The accompanying GitHub repository to Lincke and Hinkel (2021) is at [this link](https://github.com/daniellincke/DIVA_paper_migration).

In [None]:
with sset.PATH_EXPOSURE_LINCKE.open("wb") as f:
    f.write(
        requests.get(
            "https://raw.githubusercontent.com/daniellincke/"
            "DIVA_paper_migration/master/data/csv/country_input.csv"
        ).content
    )

### SRTM 15+

We use version 2.5 (latest as of Aug 2022).

In [4]:
URL_SRTM15 = f"https://topex.ucsd.edu/pub/srtm15_plus/SRTM15_{sset.SRTM15_PLUS_VERS}.nc"
with sset.PATH_SRTM15_PLUS.open("wb") as f:
    f.write(requests.get(URL_SRTM15).content)

### Asian Development Bank (ADB) Key Indicators 2022 Economy Tables

Link is available [here](https://kidb.adb.org/economies).

In [None]:
adb_link_excel = (
    "https://www.adb.org/sites/default/files/publication/812946/"
    f"{sset.PATH_ADB_RAW.name}"
)
with sset.PATH_ADB_RAW.open("wb") as f:
    f.write(requests.get(adb_link_excel).content)

### GADM v4.1

In [10]:
GADM_NAME = sset.PATH_GADM.stem
URL_GADM = f"https://geodata.ucdavis.edu/gadm/gadm{sset.GADM_VERS}/{GADM_NAME}.zip"


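with tempfile.NamedTemporaryFile(suffix=".zip") as tf:
    tf.write(requests.get(URL_GADM).content)
    tf.flush()  # make sure all bytes are on disk before unpacking
    with tempfile.TemporaryDirectory() as d:
        shutil.unpack_archive(tf.name, d, format="zip")
        # `upload_from` matches the other upload cells in this notebook
        sset.PATH_GADM.upload_from(Path(d) / f"{GADM_NAME}.gpkg")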

### National Levee Database

Initialize directory structure to fill manually in a later part of this notebook.

In [None]:
sset.DIR_NLDB.mkdir(parents=True, exist_ok=True)

for state in sset.NLDB_STATES:
    (sset.DIR_NLDB / state).mkdir(exist_ok=True)

### Gleditsch and Ward table of independent states

In [None]:
URL_GW_TABLE = (
    f"https://github.com/andybega/states/raw/master/data/{sset.PATH_GW_TABLE.name}"
)
with sset.PATH_GW_TABLE.open("wb") as f:
    f.write(requests.get(URL_GW_TABLE).content)

### CIA World Factbook

In [None]:
# CIA World Factbook, versions 2000 to 2020
print(f"Downloading CIA World Factbooks to {sset.DIR_CIA_RAW.as_uri()}...")
with TemporaryDirectory() as tmp:
    for f in tqdm([f"factbook-{x}.zip" for x in range(2000, 2021)]):
        this_dir = sset.DIR_CIA_RAW / f[:-4]
        if sset.FS.isdir(this_dir.as_uri()):
            continue
        print(f"...downloading {f}...")
        with ZipFile(
            BytesIO(
                requests.get(
                    "https://www.cia.gov/the-world-factbook/about/archives/"
                    f"download/{f}"
                ).content
            )
        ) as zipfile:
            zipfile.extractall(tmp)
        sset.FS.upload(
            str(Path(tmp) / this_dir.name), this_dir.as_uri(), recursive=True
        )

## Various manual input disaggregated sources

### Åland (`ALA`) information from Åland Statistics

#### GDPpc (current PPP in euro terms)

In [16]:
# Aland Islands current PPP GDPpc; we also extract Finnish GDPpc for ratio comparisons
aland_gdp = (
    pd.read_excel(sset.DIR_ALAND_STATISTICS_RAW / "nr005en.xls", skiprows=3)
    .rename(columns={"Unnamed: 0": "country"})
    .set_index(["country"])
)
aland_years = aland_gdp.columns.values
aland_cgdpo = aland_gdp.loc["Åland", :]
fin_cgdpo = aland_gdp.loc["Finland", :]

# exchange rate; the EMU series only goes back to 1999, so for convenience
# we use the 1999 rate for 1995-1998
wdi_xrate = (
    pd.read_parquet(sset.DIR_WB_WDI_RAW / "wdi_xr.parquet")
    .loc[("EMU", list(range(1999, aland_years.max() + 1))), "xrate"]
    .values
)
aland_xrate = np.hstack([[wdi_xrate[0]] * (1999 - aland_years.min()), wdi_xrate])
aland_cgdpo, fin_cgdpo = aland_cgdpo * aland_xrate, fin_cgdpo * aland_xrate
aland_df = pd.DataFrame(data={"aland_cgdpo_pc": aland_cgdpo, "year": aland_years})
aland_df["ccode"] = "ALA"
fin_df = pd.DataFrame(data={"aland_cgdpo_pc": fin_cgdpo, "year": aland_years})
fin_df["ccode"] = "FIN"
aland_df = pd.concat([aland_df, fin_df]).set_index(["ccode", "year"])

#### Population

In [17]:
aland_pop_df_all = pd.read_excel(
    sset.DIR_ALAND_STATISTICS_RAW
    / "alv01_aland_faroe_islands_and_greenland_-_an_overview_with_comparable_data.xlsx",
    "Population development",
    skiprows=2,
    index_col=0,
)

aland_pop_df = aland_pop_df_all.loc["Population 1.1."].copy()
aland_pop_df.index = aland_pop_df_all[
    aland_pop_df_all.index.isin(["Åland", "Faroe Islands", "Greenland"])
].index

aland_pop_df["ccode"] = aland_pop_df.index.map(
    {"Åland": "ALA", "Faroe Islands": "FRO", "Greenland": "GRL"}
)
aland_pop_df = aland_pop_df.set_index("ccode")
aland_pop_df.columns = aland_pop_df.columns.astype(int)
aland_pop_df = aland_pop_df.rename_axis(columns="year").stack().rename("aland_pop")

In [18]:
aland_df = aland_df.join(aland_pop_df, how="outer")

### `BES`: Bonaire, St Eustatius, and Saba current and non-PPP GDP from CBS (Statistics Netherlands)

**Trends in the Caribbean Netherlands 2021** (link [here](https://longreads.cbs.nl/ticn2021/)) reports 2012-2018 current, non-PPP GDP for Bonaire, St Eustatius, and Saba separately. We sum the three to represent `BES`.

In [19]:
# millions of USD (assumed to be current, non-PPP)
bonaire = np.array([417, 434, 452, 466, 487, 480, 505]) * 1e6
eustatius = np.array([133, 137, 131, 134, 131, 142, 128]) * 1e6
saba = np.array([43, 46, 47, 47, 48, 48, 48]) * 1e6
bes = pd.DataFrame(
    data={"year": range(2012, 2019), "bes_gov_gdp_nom": bonaire + eustatius + saba}
)
bes["ccode"] = "BES"
various_df = aland_df.join(
    bes.set_index(["ccode", "year"]).bes_gov_gdp_nom, how="outer"
)

### Saint Barthélemy (`BLM`), from CEROM

Link to the document is provided [here](https://www.cerom-outremer.fr/IMG/pdf/note_cerom_pib_saint-barthelemy_-_octobre_2014.pdf). Nominal GDPpc of Saint Barthelemy is 26000 euros (1999) and 35700 euros (2010).

In [20]:
# exchange rate information from WB WDI
xr_rate = pd.read_parquet(sset.DIR_WB_WDI_RAW / "wdi_xr.parquet").sort_index()

In [21]:
# exchange rates (LCU per USD) for 1999 and 2010 are retrieved; euro values are divided by them
cerom_years = [1999, 2010]
wdi_xrate = xr_rate.loc[("EMU", cerom_years), "xrate"].values
cerom_gdppc_nom = np.array([26000, 35700]) / wdi_xrate
blm = pd.DataFrame(data={"year": cerom_years, "cerom_gdppc_nom": cerom_gdppc_nom})
blm["ccode"] = "BLM"
various_df = various_df.join(
    blm.set_index(["ccode", "year"]).cerom_gdppc_nom, how="outer"
)
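
The conversions in this part of the notebook all follow the same arithmetic: the World Bank series `PA.NUS.FCRF` is expressed in local currency units (LCU) per US dollar, so a value in local currency is divided by the rate to obtain US dollars. The sketch below illustrates this with a hypothetical helper `lcu_to_usd`; the exchange rate in the example is made up for illustration and is not the actual WDI value.

```python
def lcu_to_usd(value_lcu: float, lcu_per_usd: float) -> float:
    """Convert a local-currency value to USD using a rate in LCU per USD."""
    if lcu_per_usd <= 0:
        raise ValueError("exchange rate must be positive")
    return value_lcu / lcu_per_usd


# hypothetical rate of 0.9 euros per USD, for illustration only
gdppc_usd = lcu_to_usd(26_000, 0.9)
```

Multiplying instead of dividing would be correct only for a rate quoted in USD per LCU.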

### 2010 nominal GDP of the Cocos (Keeling) Islands (`CCK`) and Christmas Island (`CXR`) from the House of Representatives Committees of the Parliament of Australia

Link to the document is provided [here](https://www.aph.gov.au/parliamentary_business/committees/House_of_Representatives_Committees?url=ncet/economicenvironment/report/index.htm); please refer to Chapter 3. Only the 2010 nominal GDP (in Australian dollars) is provided for these regions (`CXR`: 71 million Australian dollars, `CCK`: 15 million Australian dollars).

In [22]:
# dividing by the 2010 AUS exchange rate (LCU per USD) to convert to USD
# (GDP values are given in millions of AUS dollars)
wdi_xrate = xr_rate.loc[("AUS", 2010), "xrate"].values
cxr_cck = np.array([71, 15]) * 1e6 / wdi_xrate
cxr_cck = pd.DataFrame(data={"aus_parl_gdp_nom": cxr_cck, "ccode": ["CXR", "CCK"]})
cxr_cck["year"] = 2010
various_df = various_df.join(
    cxr_cck.set_index(["ccode", "year"]).aus_parl_gdp_nom, how="outer"
)


### Australian Government population information for the Cocos (Keeling) Islands (`CCK`), Christmas Island (`CXR`), and Norfolk Island (`NFK`)

These links are from the Australian Bureau of Statistics (ABS):
- `CCK`: years [2001](https://www.abs.gov.au/census/find-census-data/quickstats/2001/910053009), [2006](https://www.abs.gov.au/census/find-census-data/quickstats/2006/910053009), [2011](https://www.abs.gov.au/census/find-census-data/quickstats/2011/90102), and [2016](https://quickstats.censusdata.abs.gov.au/census_services/getproduct/census/2016/quickstat/90102)
- `CXR`: years [2001](https://www.abs.gov.au/census/find-census-data/quickstats/2001/910052009), [2006](https://www.abs.gov.au/census/find-census-data/quickstats/2006/910052009), [2011](https://www.abs.gov.au/census/find-census-data/quickstats/2011/910052009), and [2016](https://www.abs.gov.au/census/find-census-data/quickstats/2016/90101)

The following links are for `NFK`, partially from the ABS but also from a separate report from the Government of Norfolk Island:
- 2011 and 2016 population values are provided [here](https://www.infrastructure.gov.au/territories-regions-cities/territories/norfolk-island)
- 1986, 1991, 1996, 2001, and 2006 population values are provided [here](http://www.norfolkisland.gov.nf/sites/default/files/public/documents/ANIReports/Census/Census_2006.pdf) (see page 8, Table A2).

In [23]:
# CCK and CXR, for the years 2001, 2006, 2011, 2016
cck = [621, 572, 550, 544]
cxr = [1446, 1349, 2072, 1843]
aus_gov_pop = pd.DataFrame(
    data={
        "aus_gov_pop": cck + cxr,
        "year": [2001, 2006, 2011, 2016] * 2,
        "ccode": ["CCK"] * 4 + ["CXR"] * 4,
    }
)

# NFK, for the years 1986, 1991, 1996, 2001, 2006, 2011, 2016
nfk = [2367, 2285, 2181, 2601, 2523, 1796, 1748]
nfk = pd.DataFrame(
    data={
        "aus_gov_pop": nfk,
        "year": list(range(1986, 2017))[0::5],
        "ccode": ["NFK"] * len(nfk),
    }
)
aus_gov_pop = pd.concat([aus_gov_pop, nfk], axis=0).set_index(["ccode", "year"])

various_df = various_df.join(aus_gov_pop.aus_gov_pop, how="outer")

### Falkland Government GDP (current GBP) for Falkland (`FLK`)

The report is provided [here](http://www.falklands.gov.fk/policy/jdownloads/Reports%20&%20Publications/Economy%20and%20Economic%20Development/State%20of%20the%20Economy%20Reports/State%20of%20the%20Falkland%20Islands%20Economy%202020.pdf) (see page 18); we convert the values from current British pounds to current USD.

In [24]:
# GBR xrate
gbr_xrate = xr_rate.loc["GBR", "xrate"]

# Falkland
flk = (
    np.array(
        [
            106.0,
            120.1,
            97.7,
            167.5,
            184.7,
            204.3,
            160.3,
            175.6,
            209.0,
            282.3,
            220.1,
            254.7,
        ]
    )
    * 1e6
)
flk = pd.DataFrame(data={"year": range(2007, 2019), "flk_gov_gdp_curr": flk})
flk["ccode"] = "FLK"
flk = flk.set_index(["ccode", "year"]).join(gbr_xrate, how="left")
flk["flk_gov_gdp_curr"] = flk["flk_gov_gdp_curr"].div(flk["xrate"])

various_df = various_df.join(flk.flk_gov_gdp_curr, how="outer")


### Gibraltar Government GDP per capita (2006-2020) for Gibraltar (`GIB`)

We assume that the original data (link [here](https://www.gibraltar.gov.gi/uploads/statistics/2021/National%20Income/GDP%20Estimates.pdf)) are in current British pounds and convert them into current USD.

In [25]:
gib = [
    24859,
    26714,
    29357,
    32570,
    34247,
    37369,
    40381,
    45032,
    48522,
    53433,
    59403,
    66691,
    72228,
    75467,
    71787,
]
gib = pd.DataFrame(data={"year": range(2006, 2021), "gib_gov_gdppc_curr": gib})
gib["ccode"] = "GIB"
gib.set_index(["ccode", "year"], inplace=True)
gib = gib.join(gbr_xrate, how="left")
gib.loc[("GIB", 2020), "xrate"] = gib.loc[("GIB", 2019), "xrate"]
gib["gib_gov_gdppc_curr"] = gib["gib_gov_gdppc_curr"].div(gib["xrate"])

various_df = various_df.join(gib.gib_gov_gdppc_curr, how="outer")

### Guernsey Government information of Guernsey (`GGY`)

- Population (1986, 1991, 1996, 2001, and 2009-2021): see [March 2021 Report](https://www.gov.gg/CHttpHandler.ashx?id=149564&p=0) for 2011-2021 on p.3; [March 2019 Report](https://www.gov.gg/CHttpHandler.ashx?id=123171&p=0) for 2009-2019 on p.3; [traditional census report](https://www.gov.gg/census) for the years 1986, 1991, 1996, and 2001 (click on the download file "Historic Population and Employment Data" to see the `.xls` file)
- GDP (2004-2020): reports [here for 2010-2020](https://www.gov.gg/CHttpHandler.ashx?id=147608&p=0), [here for 2009](https://gov.gg/CHttpHandler.ashx?id=111088&p=0), and [here for 2004-2008](https://gov.gg/CHttpHandler.ashx?id=90671&p=0); note that we will later use `ypk_fn.smooth_fill` for 2004-2008 values due to there being discrepancies between pre-2015 reports and the later ones.

In [26]:
# 2011-2021 information from March 2021 Census
ggy_2021_census = [
    62915,
    63085,
    62732,
    62341,
    62234,
    62208,
    62106,
    62290,
    62681,
    63083,
    63448,
]

# 2009-2010 information from March 2019 Census
ggy_2019_census = [62274, 62431]

# 1986, 1991, 1996, 2001 information from 'Historic Population and Employment Data'
ggy_traditional = [55482, 58867, 58681, 59807]

# merging them into a dataset
ggy = pd.DataFrame(
    data={
        "ggy_gov_pop": np.hstack([ggy_traditional, ggy_2019_census, ggy_2021_census]),
        "year": np.hstack([[1986, 1991, 1996, 2001], range(2009, 2022)]),
    }
)
ggy["ccode"] = "GGY"
ggy.set_index(["ccode", "year"], inplace=True)
ggy["ggy_gov_gdp_curr"] = np.nan
ggy.loc[("GGY", list(range(2009, 2021))), "ggy_gov_gdp_curr"] = (
    np.array(
        [
            2458,
            2423,
            2629,
            2615,
            2715,
            2779,
            2816,
            2934,
            3101,
            3170,
            3244,
            3178,
        ]
    )
    * 1e6
)
ggy_alt = np.array([1453, 1465, 1584, 1774, 1841, 1832, 1909, 2033, 2117, 2186]) * 1e6
ggy = ggy.join(
    pd.DataFrame(
        data={
            "year": range(2004, 2014),
            "ggy_gov_gdp_alt": ggy_alt,
            "ccode": ["GGY"] * len(ggy_alt),
        }
    ).set_index(["ccode", "year"]),
    how="outer",
).join(gbr_xrate, how="left")
ggy.loc[("GGY", 2020), "xrate"] = ggy.loc[("GGY", 2019), "xrate"]
for i in ["ggy_gov_gdp_curr", "ggy_gov_gdp_alt"]:
    ggy[i] = ggy[i].div(ggy["xrate"])
various_df = various_df.join(ggy.drop(["xrate"], axis=1), how="outer")

### Jersey Government information for `JEY`

- Population (2000-2019): link to the Jersey Resident Population from Statistics Jersey is [here](https://www.gov.je/SiteCollectionDocuments/Government%20and%20administration/R%20Population%20Estimate%20Current%2020180620%20SU.pdf); see page 2.
- GDP (2012-2020): report [here](https://opendata.gov.je/dataset/national-accounts/resource/ae620bf3-41be-4461-adb8-2220ab7cb000?inner_span=True); these values are in constant 2020, non-PPP GBP. We convert them into constant 2017, non-PPP USD by applying the official 2019 exchange rate (since the 2020 value is unavailable) and then deflating 2019 USD to 2017 USD.

In [27]:
# USD deflator
defla = (
    pd.read_excel(sset.PATH_PWT_RAW)
    .set_index(["countrycode", "year"])
    .loc["USA", "pl_gdpo"]
)

# population data 2000-2019, in millions
jey = (
    np.array(
        [
            0.088400,
            0.088900,
            0.089300,
            0.089600,
            0.090100,
            0.091000,
            0.092300,
            0.094000,
            0.095400,
            0.096200,
            0.097100,
            0.098100,
            0.099000,
            0.100000,
            0.101000,
            0.102700,
            0.104200,
            0.105600,
            0.106700,
            0.107800,
            np.nan,
        ]
    )
    * 1e6
)
jey = pd.DataFrame(data={"jey_gov_pop": jey, "year": list(range(2000, 2021))})
jey["ccode"], jey["jey_gov_gdp_const"] = "JEY", np.nan
jey.set_index(["ccode", "year"], inplace=True)
jey.loc[("JEY", list(range(2012, 2021))), "jey_gov_gdp_const"] = (
    np.array(
        [
            4495,
            4504,
            4678,
            4737,
            4745,
            4787,
            4881,
            4988,
            4528,
        ]
    )
    * 1e6
)
jey["jey_gov_gdp_const"] /= gbr_xrate.loc[2019] * defla.loc[2019]
various_df = various_df.join(jey, how="outer")
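
The last step of the cell above combines two adjustments in one division: the GBP-per-USD exchange rate converts pounds to 2019 dollars, and the US price level converts 2019 dollars to 2017 dollars. The sketch below spells this out under the assumption that the price level series is normalized so that 2017 equals 1; `gbp_to_2017_usd` is a hypothetical name and the rate and price level in the example are illustrative values, not the actual WDI or PWT figures.

```python
def gbp_to_2017_usd(
    gdp_gbp: float, gbp_per_usd_2019: float, us_price_level_2019: float
) -> float:
    """Convert GBP to 2017 USD via the 2019 exchange rate and US price level.

    Assumes `us_price_level_2019` is relative to a 2017 base of 1.
    """
    usd_2019 = gdp_gbp / gbp_per_usd_2019  # pounds -> 2019 dollars
    return usd_2019 / us_price_level_2019  # 2019 dollars -> 2017 dollars


# illustrative inputs only: 0.8 GBP per USD and a 2019 price level of 1.04
jey_2019_usd2017 = gbp_to_2017_usd(4988e6, 0.8, 1.04)
```

Dividing by the product of the two factors, as the cell does, is equivalent to the two successive divisions shown here.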

### Norfolk Island (`NFK`) GDP per capita information (as a percentage of Australia) from Treadgold (1999) and Treadgold (1998)

GDPpc as a percentage of the Australian level for 1951-52 is shown in [Treadgold (Asia Pacific Viewpoint, 1999)](https://doi.org/10.1111/1467-8373.00095), and a similar percentage for 1995-96 is shown in [Treadgold (Pacific Economic Bulletin, 1998)](https://openresearch-repository.anu.edu.au/handle/1885/157535). We multiply these ratios by the Australian GDP per capita from Penn World Tables.

In [28]:
nfk_yrs = [1951, 1952, 1995, 1996]
nfk_ratios = np.array([0.39, 0.39, 1.12, 1.12])
aus_info = (
    pd.read_excel(sset.PATH_PWT_RAW)
    .set_index(["countrycode", "year"])
    .loc[("AUS", nfk_yrs), ["cgdpo", "rgdpna", "pop"]]
)
aus_info["treadgold_cgdpo_pc"] = aus_info["cgdpo"].div(aus_info["pop"]) * nfk_ratios
aus_info["treadgold_rgdpna_pc"] = aus_info["rgdpna"].div(aus_info["pop"]) * nfk_ratios
nfk_info = aus_info.reset_index().rename(columns={"countrycode": "ccode"})[
    ["ccode", "year", "treadgold_rgdpna_pc", "treadgold_cgdpo_pc"]
]
nfk_info["ccode"] = "NFK"
nfk_info.set_index(["ccode", "year"], inplace=True)
various_df = various_df.join(
    [nfk_info.treadgold_cgdpo_pc, nfk_info.treadgold_rgdpna_pc], how="outer"
)

### Pitcairn Island (`PCN`) information from the Government of Pitcairn

#### Nominal GDP of 2006

`PCN` has an estimate of approximately 217,000 New Zealand dollars in 2006 (from [this link](https://web.archive.org/web/20150705134639/http://www.government.pn/policies/Pitcairn%20Island%20SDP%202012-2016.pdf#page=4) for a WayBackMachine Archive of the Government of Pitcairn's "Pitcairn Islands Strategic Development Plan").

#### Population in 1937

1937 population is 237, and this information is from the Pitcairn Island [government website](http://www.immigration.gov.pn/community/the_people/index.html). Population information for 2000-2016 and 2020-2021 is provided in CIA WFB.

In [29]:
# GDP in USD (217,000 NZD converted at the 2006 rate) and population in persons
pcn_gdp_nom_2006 = 0.217 * 1e6 / xr_rate.loc[("NZL", 2006), "xrate"].values[0]
pcn = pd.DataFrame(
    [["PCN", 1937, np.nan, 237], ["PCN", 2006, pcn_gdp_nom_2006, np.nan]],
    columns=["ccode", "year", "pcn_gov_gdp_nom", "pcn_gov_pop"],
).set_index(["ccode", "year"])
various_df = various_df.join([pcn.pcn_gov_gdp_nom, pcn.pcn_gov_pop], how="outer")


### North Korea (`PRK`) GDP growth estimates from the Bank of Korea

The link to the information is [here](https://www.bok.or.kr/portal/main/contents.do?menuNo=200091); we use the real GDP (in South Korean won) to calculate the real GDP growth for 2018-2020, and use these rates later with MPD GDP values (which only have GDPpc values up to 2018).

In [30]:
# 2018-2020 real GDP in billions of South Korean won (approx. millions of USD)
prk_2018_2020_real_gdp = np.array([328030, 329189, 314269])
prk_2019_2020_gr = prk_2018_2020_real_gdp[1:] / prk_2018_2020_real_gdp[0:-1]
prk = pd.DataFrame(
    data={
        "ccode": ["PRK"] * 2,
        "year": [2019, 2020],
        "bok_prk_real_gdp_gr": prk_2019_2020_gr,
    }
).set_index(["ccode", "year"])
various_df = various_df.join(prk.bok_prk_real_gdp_gr, how="outer")
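
The growth factors stored above are ratios of consecutive real GDP levels, so they can later be chained onto the last available value of another series. The sketch below shows that arithmetic with two hypothetical helpers, `growth_factors` and `extend_with_growth`; how the factors are actually combined with the MPD values is decided in a later notebook and is not shown here.

```python
from itertools import accumulate
from operator import mul


def growth_factors(levels):
    """Return year-over-year ratios (next level divided by current level)."""
    return [nxt / cur for cur, nxt in zip(levels, levels[1:])]


def extend_with_growth(last_value, factors):
    """Chain growth factors onto `last_value`, returning only the new values."""
    # accumulate yields last_value first; drop it to keep only extended years
    return list(accumulate(factors, mul, initial=last_value))[1:]


# real GDP levels for 2018-2020 as entered in the cell above
bok_levels = [328030, 329189, 314269]
factors_2019_2020 = growth_factors(bok_levels)
```

Applying `extend_with_growth` to a 2018 value from another source yields estimates for 2019 and 2020 that follow the Bank of Korea growth path.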

### 2018-2019 nominal GDP per capita from St. Helena Government

Link to **St Helena's Sustainable Economic Development Plan** is [here](https://www.sainthelena.gov.sh/wp-content/uploads/2020/07/SEDP-EOY-Progress-Report-Final-160720.pdf); see page 5 for the nominal GDP per capita values. Note that this is **not** the GDP per capita value for St. Helena, Ascension, and Tristan da Cunha (represented by `SHN`) but **just St. Helena**. We will use this GDP per capita with the entire `SHN` population to get the GDP values. Note also that for convenience, these values are recorded under the country code `SHN`.

In [31]:
# values are in British pounds
st_helena = pd.DataFrame(
    [["SHN", 2018, 8490], ["SHN", 2019, 8320]],
    columns=["ccode", "year", "st_helena_gov_gdppc_nom"],
).set_index(["ccode", "year"])
st_helena["st_helena_gov_gdppc_nom"] = (
    st_helena["st_helena_gov_gdppc_nom"]
    / xr_rate.loc[("GBR", [2018, 2019]), "xrate"].values
)
various_df = various_df.join(st_helena.st_helena_gov_gdppc_nom, how="outer")

### Svalbard and Jan Mayen (`SJM`) 2009-2021 population from Statistics Norway

Link to the relevant page from **Statistics Norway** is [here](https://www.ssb.no/en/statbank/table/07429); the values below are half-year populations (`h1` having the first half, `h2` having the second half), and we will average them to get the yearly populations.

In [32]:
# data from 2009-2021
h1 = [2085, 2052, 2017, 2115, 2158, 2100, 2185, 2152, 2145, 2214, 2258, 2428, 2459]
h2 = [2140, 2071, 2140, 2195, 2195, 2118, 2189, 2162, 2210, 2310, 2379, 2417, 2552]
sjm = np.round(0.5 * (np.array(h1) + np.array(h2)), 0)
sjm = pd.DataFrame(data={"stat_nor_pop": sjm, "year": list(range(2009, 2022))})
sjm["ccode"] = "SJM"
various_df = various_df.join(sjm.set_index(["ccode", "year"]).stat_nor_pop, how="outer")

### United States Minor Outlying Islands (`UMI`) 1980, 1990, 2000 population from the U.S. Census

Link to the Census report is [here](https://www.census.gov/history/pdf/2000-minoroutlyingislands.pdf). 

As the report says, `UMI` is composed of Baker, Howland, and Jarvis Islands, Johnston Atoll, Kingman Reef, Midway Islands, Navassa Island, Palmyra Atoll, and Wake Island. According to the 2022 CIA World Factbook (links [here for Wake Island](https://www.cia.gov/the-world-factbook/countries/wake-island/#people-and-society) and [here for others](https://www.cia.gov/the-world-factbook/countries/united-states-pacific-island-wildlife-refuges/#people-and-society)), all of these locations are either closed to the public or used as wildlife refuges; therefore, in our projections (2010-2100), we will assume that population, GDP, and capital stock are 0.

In [33]:
umi = np.array([1082, 193, 316])
umi = pd.DataFrame(data={"us_census_pop": umi, "year": [1980, 1990, 2000]})
umi["ccode"] = "UMI"
various_df = various_df.join(
    umi.set_index(["ccode", "year"]).us_census_pop, how="outer"
)

### Exporting cleaned, various-source dataset

In [34]:
save(various_df, sset.PATH_INC_POP_AUX)

## CoDEC GTSM Historical Surge Values (Muis et al. 2020)

In [None]:
r = requests.get(
    f"https://zenodo.org/record/3660927/files/{sset.PATH_GEOG_GTSM_SURGE.name}"
)
with sset.PATH_GEOG_GTSM_SURGE.open("wb") as f:
    f.write(r.content)

## Further data requiring separate manual instructions

### UN Statistics National Accounts (Analysis of Main Aggregates; abbreviated as UN AMA)

#### UN AMA nominal (current prices) GDP per capita information

1. Travel to this [link](https://unstats.un.org/unsd/snaama/Basic) to get to the UN Statistics National Accounts search page.
2. Select all countries and all years available, and select "GDP, Per Capita GDP - US Dollars".
3. Select "Export to CSV", and you will download the file `Results.csv`. Rename this file as `un_snaama_nom_gdppc.csv`. We save this in `sset.DIR_UN_AMA_RAW`.

#### UN AMA nominal (current prices) GDP information

1. Similar to the nominal GDP per capita information, travel to this [link](https://unstats.un.org/unsd/snaama/Basic) to get to the UN Statistics National Accounts search page.
2. Select all countries and all years available, and select "GDP, at current prices - US Dollars".
3. Select "Export to CSV", and you will download the file `Results.csv`. Rename this file as `un_snaama_nom_gdp.csv`. We save this in `sset.DIR_UN_AMA_RAW`.
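
Both UN AMA exports arrive under the same default name, `Results.csv`, so moving and renaming each one right after it downloads avoids overwriting the first export with the second. The following is a minimal sketch, assuming the browser saved the file to a local downloads folder; the helper name `stash_download` is hypothetical and not part of `sliiders`.

```python
import shutil
from pathlib import Path


def stash_download(downloaded: Path, target_dir: Path, new_name: str) -> Path:
    """Move a freshly downloaded file into target_dir under new_name.

    Hypothetical helper: it refuses to overwrite an existing file, so a
    second `Results.csv` export cannot silently replace the first one.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(downloaded), str(target))
    return target
```

For example, `stash_download(Path.home() / "Downloads" / "Results.csv", sset.DIR_UN_AMA_RAW, "un_snaama_nom_gdp.csv")` would place the nominal GDP export where the instructions above expect it (the downloads location is an assumption about your browser settings).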

#### Mapping to region and subregion

1. Go to [https://unstats.un.org/unsd/methodology/m49/overview/](https://unstats.un.org/unsd/methodology/m49/overview/) and click on the `CSV` download button.
2. Save this to `sset.PATH_UN_REGION_DATA`.

### OECD region-level information

#### OECD: population (region-level)
1. Go to the following OECD Stat website: link [here](https://stats.oecd.org/)
2. On the left, find the header "Regions and Cities" and click the "+" button.
3. From the drop down menu, click on "Regional Statistics".
4. Again from the drop down menu, click on "Regional Demography."
5. Finally, select "Population by 5-year age groups, small regions TL3." Make sure that "Indicator" is selected as "Population, All ages".
6. Download the file by selecting "Export," then "Text File (CSV)."
7. When a pop-up appears, select "Default format" then "Download." Rename the file as `REGION_DEMOGR.csv`.
8. Place this file in `sset.DIR_OECD_REGIONS_RAW`.

#### OECD: GDP (region-level, in millions of constant 2015 PPP USD)
1. Similar to the population information, go to the following OECD Stat website: link [here](https://stats.oecd.org/)
2. On the left, find the header "Regions and Cities" and click the "+" button.
3. From the drop down menu, click on "Regional Statistics".
4. Again from the drop down menu, click on "Regional Economy."
5. Finally, select "Gross Domestic Product, Small regions TL3." Make sure that "Measure" is selected as "Millions USD, constant prices, constant PPP, base year 2015".
6. Download the file by selecting "Export," then "Text File (CSV)."
7. When a pop-up appears, select "Default format" then "Download." Rename the file as `REGION_ECONOM.csv`.
8. Place this file in `sset.DIR_OECD_REGIONS_RAW`.

### IMF investment-to-GDP ratio, population, and GDP

1. Travel to this [link](https://www.imf.org/en/Publications/SPROLLs/world-economic-outlook-databases#sort=%40imfdate%20descending) to get to the World Economic Outlook Databases page.
2. Click on the latest "World Economic Outlook Database" link on the page; for our purposes, we have used the latest available one, which was "World Economic Outlook Database, April 2022" (may be updated in the future).
3. Click "By Countries", then click "ALL COUNTRIES", then click "CONTINUE" on the page that says "Select Countries."
4. Under the "NATIONAL ACCOUNTS" tab, check the following categories:
   - Gross domestic product, current prices (U.S. DOLLARS)
   - Gross domestic product per capita, current prices (U.S. DOLLARS)
   - Gross domestic product per capita, constant prices (PURCHASING POWER PARITY; 2017 INTERNATIONAL DOLLARS)
   - Implied PPP conversion rate (NATIONAL CURRENCY PER CURRENT INTERNATIONAL DOLLAR)
   - Total investment (PERCENT OF GDP)
5. Under the "PEOPLE" tab, check the category "Population," then click on "CONTINUE."
6. Under the tab "DATE RANGE," use the earliest year for "Start Year" (1980, in our case), and the latest non-future year for "End Year" (2020, in our case).
7. Under the tab "ADVANCED SETTINGS", click on "ISO Alpha-3 Code" to include country codes.
8. Click on "PREPARE REPORT." Then, click on "DOWNLOAD REPORT." The saved data should be in Excel format.
9. Save this file as `sset.PATH_IMF_WEO_RAW`.

### World Bank International Comparison Program 2017 (WB ICP 2017): Construction Cost Index

While most World Bank data can be downloaded using `pandas_datareader.wb`, variables in WB ICP 2017 (including `1501200:CONSTRUCTION`, which is necessary for SLIIDERS) do not seem to be downloadable with that module, even though they can be found with `pandas_datareader.wb.search`. Therefore, we follow the manual process below to download the WB ICP 2017 dataset.
1. Use [this link](https://databank.worldbank.org/embed/ICP-2017-Cycle/id/4add74e?inf=n) to access WB ICP 2017 in table format.
2. Upon entering the webpage, look to the upper right corner and click on the download icon (a downward arrow above a horizontal line). This should prompt the download.
3. When the download finishes, there should be a `.zip` file called `ICP 2017 Cycle.zip`. Access the `.csv` file whose name ends in `_Data.csv` (there should be two files in the `.zip` file, the other being a file whose name ends in `_Series - Metadata.csv`).
4. Save that `.csv` file as `sset.PATH_EXPOSURE_WB_ICP`.
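
Steps 3 and 4 can also be done programmatically once the `.zip` file is on disk. The sketch below picks out the one archive member whose name ends in `_Data.csv` and writes it to a chosen path; the function name `extract_data_csv` is hypothetical, and the exact member names inside the archive are not assumed beyond that suffix.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_data_csv(zip_path: Path, out_path: Path, suffix: str = "_Data.csv") -> Path:
    """Copy the single archive member whose name ends in `suffix` to out_path."""
    with ZipFile(zip_path) as zf:
        matches = [name for name in zf.namelist() if name.endswith(suffix)]
        # Fail rather than guess if the archive layout is not what we expect
        if len(matches) != 1:
            raise ValueError(f"expected one '*{suffix}' member, found {matches}")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_bytes(zf.read(matches[0]))
    return out_path
```

A call such as `extract_data_csv(Path("ICP 2017 Cycle.zip"), sset.PATH_EXPOSURE_WB_ICP)` would then cover both steps. The suffix match is case-sensitive, so the `_Series - Metadata.csv` file is not picked up.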

### IIASA and OECD models' GDP and population projections (2010-2100, every 5 years)

1. Go to the following IIASA SSP Database website: link [here](https://tntcat.iiasa.ac.at/SspDb); you may need to register and create your log-in.
2. In the above tabs, there is a tab called "Download"; click on it.
3. Under "SSP Database Version 2 Downloads (2018)" and under the sub-header "Basic Elements", there is a download link for `SspDb_country_data_2013-06-12.csv.zip`. Click and download the said `.zip` file.
4. Extract and save the `SspDb_country_data_2013-06-12.csv`. Again, for our purposes, we save this as `sset.PATH_IIASA_PROJECTIONS_RAW`.

### LandScan 2021

1. Go to https://landscan.ornl.gov/
2. Click Download in the upper right. Provide your information and select the LandScan Global product for 2021.
3. Extract and save all extracted files to `sset.PATH_LANDSCAN_RAW`.

### Global geoids, based on select Earth Gravitational Models (EGMs)
1. Go to the following International Centre for Global Earth Models (ICGEM) website (link [here](http://icgem.gfz-potsdam.de/calcgrid)) to reach the page "Calculation of Gravity Field Functionals on Ellipsoidal Grids".
2. Under **Model Selection**, select `XGM2019e_2159`.
3. Under **Functional Selection**, select `geoid`.
4. Under **Grid selection**, there's a **Grid Step [°]** option. Change the value to **0.05**. Also, make sure that the **Reference System** is `WGS84`.
5. Due to download size constraints, we need to download this data in 4 chunks. Do the following:
   - Split the full range of latitudes and longitudes in half, which yields the following 4 combinations of longitude-latitude ranges: $([-180, 0], [-90, 0]), ([-180, 0], [0, 90]), ([0, 180], [-90, 0])$, and $([0, 180], [0, 90])$.
   - Under **Grid selection** again, one can select the range of longitudes and latitudes. Select one of the above combinations and press `start computation`.
   - This will open up a new tab for calculations, which may take some time to complete. Once this is done, press **Download Grid**.
   - Once the download is complete, go back to the previous page with **Model selection**, **Functional selection**, and more. Make sure the selections you made are intact, select another longitude-latitude combination, and repeat the process until there are no combinations left.
6. Once the above steps are done, go back to Step 2 above; but instead of selecting `XGM2019e_2159` for **Model selection**, select `EGM96`. Repeat Steps 3 to 5 with this new selection.
7. Once the downloads for `XGM2019e_2159` and `EGM96` are complete, you should have 4 files for each model (8 in total, in `.gdf` format). Save the `XGM2019e_2159` files in `sset.DIR_GEOG_DATUMS_XGM2019e_WGS84` and `EGM96` files in `sset.DIR_GEOG_DATUMS_EGM96_WGS84`.
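
Because the eight downloads are submitted by hand, it is easy to lose track of which model and quadrant combinations are done. The short sketch below enumerates them as a checklist, using the two models and the four longitude-latitude ranges listed above; the variable names are illustrative only, and nothing here contacts the ICGEM service.

```python
from itertools import product

MODELS = ["XGM2019e_2159", "EGM96"]
LON_HALVES = [(-180, 0), (0, 180)]
LAT_HALVES = [(-90, 0), (0, 90)]

# One entry per manual request: 2 models x 2 longitude halves x 2 latitude halves
grid_requests = [
    {"model": model, "lon": lon, "lat": lat}
    for model, lon, lat in product(MODELS, LON_HALVES, LAT_HALVES)
]

for req in grid_requests:
    print(req)
```

The list has 8 entries, one for each `.gdf` file expected in Step 7.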

### Global Mean Dynamic Ocean Topography (MDT) from AVISO+
**Note**: While this dataset has a relatively open license, you will first need to obtain a MY AVISO+ account, which requires verification from the AVISO+ team and may take several days or weeks.
1. Go to the following AVISO+ website for **MDT CNES-CLS18**: link [here](https://www.aviso.altimetry.fr/en/data/products/auxiliary-products/mdt/mdt-global-cnes-cls18.html).
2. Once on the page, download the dataset through your MY AVISO+ account (click on `access via MY AVISO+` link and follow the instructions).
3. After following the instructions, you will acquire the file `mdt_cnes_cls18_global.nc.gz`. Extract the file `mdt_cnes_cls18_global.nc` from the `.gz` file and save it as `sset.PATH_GEOG_MDT_RAW`.
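
The extraction in Step 3 can be done with the standard library instead of a separate archive tool. The following is a minimal sketch; the function name `gunzip` is hypothetical, and the location of the downloaded `.gz` file is up to you.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: Path, dst: Path) -> Path:
    """Decompress the gzip file `src` to `dst`, streaming in chunks."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rb") as f_in, dst.open("wb") as f_out:
        # copyfileobj streams the data, so the file never sits fully in memory
        shutil.copyfileobj(f_in, f_out)
    return dst
```

For example, `gunzip(Path("mdt_cnes_cls18_global.nc.gz"), sset.PATH_GEOG_MDT_RAW)` writes the extracted NetCDF file to the path used in Step 3.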

### HydroSHEDS
1. Go to https://www.hydrosheds.org/products/hydrobasins#downloads
2. Download the "standard" level-0 HydroBASINS files for each continent (use the Dropbox link if available; as of 8/16/21, this appears as "NOTE: you may also download data from here."). Download the shapefiles into the directory defined in `sset.DIR_HYDROBASINS_RAW`.

### Aggregate Capital Stock Estimations from 122 Countries from Berlemann and Wesselhöft (2017, Review of Economics)

Link to the paper is [here](https://www.degruyter.com/document/doi/10.1515/roe-2017-0004/html?lang=en#j_roe-2017-0004_tab_001_w2aab2b8c20b1b7b1ab1b1c10Aa). This dataset can be acquired by contacting Michael Berlemann (one of the authors). The original dataset file was named `Capital Stock Data Update 2017.xlsx`; we save it as `Berlemann_Wesselhoft_2017.xlsx` and place it in `sset.DIR_YPK_RAW`.

### Fariss, Anders, Markowitz, and Barnum (2022, Journal of Conflict Resolution) data for GDP, GDP per capita, and population

1. Go to the following [Dataverse link](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FALCGS).
2. Click on **Access Dataset**, and a drop-down menu will appear. Click on **Download ZIP**.
3. Once the download is complete, you will have a `.zip` file named `dataverse_files.zip`. We save this as `Fariss_JCR_2022.zip` and place it in `sset.DIR_YPK_RAW`.

### Other SLIIDERS input datasets

There are some datasets that were manually constructed for use in `SLIIDERS`. They are available for download on Zenodo. Please download each file from the Zenodo deposit [here](https://doi.org/10.5281/zenodo.6449231) and copy to the paths designated for each dataset.

#### 1. `gtsm_stations_eur_tothin.parquet`
Path: `sset.PATH_GTSM_STATIONS_TOTHIN`

These 5,637 station points are a subset of the full CoDEC dataset (n=14,110). They represent sites along European coastlines, which are spaced roughly five times more densely than stations in the rest of the globe, as described in Muis et al. 2020. The points in this subset are the ones that will be thinned by 5x to approximately match the density of CoDEC coastal stations globally. Some manual inclusion criteria were applied in GIS because simply selecting dense European stations based on the “station_name” field, which contains the substring “eur” for all European locations, over-selects the desired points (n=6,132): many North African coastal points that are not densely spaced also contain this substring in their “station_name”. European points were therefore identified manually. Small islands, such as those in the Mediterranean, were included if their land mass contained 5 or more station points, which guarantees that they will be represented by at least one station point after the 5x thinning process. The resulting subset of points is used as a data input for the coastal segment construction in the preprocessing of the SLIIDERS dataset.

#### 2. `gtsm_stations_ciam_ne_coastline_snapped.parquet`
Path: `sset.PATH_GEOG_GTSM_SNAPPED`

This contains the locations of all of the CoDEC (Muis et al. 2020) nodes, snapped to the Natural Earth coastlines dataset.

#### 3. `ciam_segment_pts_manual_adds`
Path: `sset.PATH_SEG_PTS_MANUAL`

This contains a list of additional segment centroids to create in order to ensure that each coastal admin1 has at least one segment assigned to it.

### CoastalDEM v2.1
1. Acquire the global 1 arc-second CoastalDEM dataset from Climate Central (https://go.climatecentral.org/coastaldem/).
2. Save all 1-degree GeoTIFF files in `sset.DIR_COASTALDEM`.
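
Since every dataset in this section is placed by hand, a quick check that each expected path exists (and that directories are not empty) can catch a missed step early. The sketch below is one possible way to do that; `missing_inputs` is a hypothetical helper, not part of `sliiders`, and the labels passed to it are free-form.

```python
from pathlib import Path


def missing_inputs(expected: dict) -> list:
    """Return the labels whose path is absent or is an empty directory."""
    missing = []
    for label, path in expected.items():
        path = Path(path)
        if not path.exists():
            missing.append(label)
        elif path.is_dir() and not any(path.iterdir()):
            # A directory that exists but holds no files counts as missing
            missing.append(label)
    return missing
```

For example, `missing_inputs({"IMF WEO": sset.PATH_IMF_WEO_RAW, "CoastalDEM": sset.DIR_COASTALDEM})` returns an empty list when both inputs are in place. It checks presence only and does not validate file contents.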