## Code for output of historical (1950-2020) country-level capital stock projections (and actual values whenever possible)

Using the I-Y (investment-to-GDP) ratio projections and GDP projections in the previous notebooks, we project the capital stock values (at the country-level).

## Setting

### Importing necessary modules

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.cluster import KMeans
from tqdm.auto import tqdm

from sliiders import country_level_ypk as ypk_fn
from sliiders import settings as sset

## variables header
v_ = ["v_" + str(x) for x in range(1950, 2020)]

## Getting the investment values

The investment values can be found by multiplying the GDP values ($Y_{c, t}$) by the investment-to-GDP ratios (I-Y ratios, denoted $\left(\frac{I}{Y}\right)_{c, t}$). We need them to project the missing capital stock values in the years 1950-2020.

We use GDP (the `cgdpo` series) together with the actual and predicted I-Y ratios, because PWT's method of finding the "initial capital stock value" (at 1950, or at the earliest available later year) works with current-PPP values.
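As a minimal sketch of this step, the snippet below computes investment as GDP times the I-Y ratio for a few made-up observations. The country codes and numbers are hypothetical; a missing ratio is treated as 0, mirroring the fill applied to `iy_ratio_fit` in the cell that follows.

```python
# Hypothetical current-PPP GDP (millions of 2017 USD) and fitted I-Y ratios,
# keyed by (country code, year). None marks a missing ratio.
gdp = {("AAA", 1950): 1000.0, ("AAA", 1951): 1100.0, ("BBB", 1950): 500.0}
iy_ratio = {("AAA", 1950): 0.20, ("AAA", 1951): 0.25, ("BBB", 1950): None}

# Investment = GDP * (I/Y); a missing ratio is treated as 0.
investment = {
    key: gdp[key] * (iy_ratio[key] if iy_ratio[key] is not None else 0.0)
    for key in gdp
}

for key, value in sorted(investment.items()):
    print(key, round(value, 2))
```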

### Preparations and creating current PPP, 2017 USD

In [None]:
## importing GDP (cgdpo, current PPP, 2017 USD) and I-Y ratio and create investment
histinfo = pd.read_parquet(
    sset.DIR_YPK_INT / "gdp_gdppc_pop_capital_1950_2020_post_ypk4.parquet"
)
histinfo.loc[pd.isnull(histinfo.iy_ratio_fit), "iy_ratio_fit"] = 0
histinfo["curr_ppp_invest"] = histinfo["cgdpo_17"] * histinfo["iy_ratio_fit"]

### Fetching the PPP conversion table for year-to-year conversion

For the PWT method of finding the initial capital, what we want is to add year-$t$ current PPP investment (generated above) to the year-$t$ current PPP capital stock, take care of capital depreciation, and get the year-$t+1$ capital stock. However, the said year-$t+1$ value will be in year-$t$ PPP, so we need to get year-$t$-to-year-$t+1$ PPP conversion rates.
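The update described above can be sketched for a single country as follows. The function name `pim_step` and all numbers are hypothetical; the order of operations (add investment, depreciate, then convert PPP) follows the `capital_perp_inven` function defined later in this notebook.

```python
def pim_step(capital, investment, delta, conv):
    """One perpetual-inventory step from year t to year t+1.

    capital, investment : year-t values in year-t current PPP
    delta               : depreciation rate
    conv                : year-t-to-year-(t+1) PPP conversion rate
    """
    return (capital + investment) * (1.0 - delta) * conv


# Hypothetical inputs: (investment, depreciation rate, PPP conversion) by year.
inputs = {2014: (200.0, 0.05, 1.02), 2015: (210.0, 0.05, 1.00)}

path = {2014: 1000.0}  # known 2014 capital stock
for year in (2014, 2015):
    inv, delta, conv = inputs[year]
    path[year + 1] = pim_step(path[year], inv, delta, conv)

print({year: round(value, 2) for year, value in path.items()})
```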

In [None]:
## ppp table for capital stock
ppp_to_2017_K = ypk_fn.ppp_conversion_specific_year(2017, True, True, pwtvar="pl_n")
ppp_to_2017_K.loc[pd.isnull(ppp_to_2017_K.conv), "conv"] = 1

ppp_K_yr_to_yr = ppp_to_2017_K[["conv"]].rename(columns={"conv": "conv_curr_yr"})
ppp_K_next_yr = ppp_to_2017_K[["conv"]].rename(columns={"conv": "conv_next_yr"})
ppp_K_next_yr.reset_index(inplace=True)
ppp_K_next_yr["year"] = ppp_K_next_yr["year"] - 1
ppp_K_yr_to_yr = ppp_K_yr_to_yr.merge(
    ppp_K_next_yr.set_index(["ccode", "year"]),
    how="left",
    left_index=True,
    right_index=True,
)
ppp_K_yr_to_yr["conv"] = ppp_K_yr_to_yr["conv_curr_yr"] / ppp_K_yr_to_yr["conv_next_yr"]

## we don't have 2019-to-2020 rates, so we will assume that there is no PPP rate change
ppp_K_yr_to_yr.loc[(slice(None), 2019), "conv"] = 1

## Projecting missing values of capital stock

The overall methodology for projection of missing capital stock values can be summarized as follows:
1) For countries whose information exists only in LitPop, organize 2014 data and use (estimated) investment and depreciation rate values to project 2015-2020 data.
2) For countries whose information exists only in GEG-15, organize 2005 data and use (estimated) investment and depreciation rate values to project 2006-2020 data.
3) After Steps 1 and 2, 2014-2020 capital stock values will be available for all countries. Turn those capital stock values into current PPP terms (for PWT10.0 countries, just use `cn`) and calculate the ratios of current-PPP capital stock to current-PPP GDP (`cgdpo`, current PPP, 2017 USD in particular).
4) Use $k$-means clustering to make unsupervised classifications of countries based on the above-calculated capital-stock-to-GDP ratios.

### Preparations (importing data, cleaning to current PPP, 2017 USD)

We import the `cn` (current PPP, 2017 USD) and `rnna` (constant 2017 PPP USD) capital stock series from PWT. 

Also, we import LitPop data, which are assumed to be in constant 2014 PPP USD (in units of USD). LitPop's original source for capital is the World Bank (link [here](https://datacatalog.worldbank.org/dataset/wealth-accounting)), which multiplies its values by 1.24 to also account for land values. We want only the capital values, so land values must be removed; LitPop has already taken care of this multiplier, so its values can be used as is (this can be confirmed by comparing numbers in the link above).

Finally, we import GEG-15 data, which are assumed to be in constant 2005 PPP USD (in millions of USD). These data also include land values, so we divide the values by 1.24 to obtain only the capital stock values.
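The unit harmonization described above can be sketched as follows. The US price levels in `pl_n_usa` are made-up placeholders (the notebook reads the actual `pl_n` series from PWT10.0), and both helper functions are hypothetical names used only for illustration.

```python
LAND_MULTIPLIER = 1.24  # factor applied in the source data to account for land

# Hypothetical US capital price levels by year (placeholders, not PWT values).
pl_n_usa = {2005: 0.80, 2014: 0.95, 2017: 1.00}


def geg_to_2017_usd(value_millions_2005):
    """GEG-15: strip the land multiplier, then move from 2005 USD to 2017 USD."""
    return value_millions_2005 / LAND_MULTIPLIER * pl_n_usa[2017] / pl_n_usa[2005]


def litpop_to_2017_usd(value_usd_2014):
    """LitPop: convert USD to millions of USD, then from 2014 USD to 2017 USD."""
    return value_usd_2014 / 1_000_000 * pl_n_usa[2017] / pl_n_usa[2014]


geg_value = geg_to_2017_usd(1240.0)  # millions of 2005 USD, land included
litpop_value = litpop_to_2017_usd(1.9e9)  # 2014 USD, land already excluded
print(round(geg_value, 2), round(litpop_value, 2))
```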

In [None]:
# skip this cell if GEG has already been cleaned up at the country level
geg_coord = pd.read_parquet(sset.PATH_GEG15_INT).rename(columns={"iso3": "ccode"})
geg = geg_coord.groupby("ccode")["tot_val"].sum()
geg = pd.DataFrame(data={"ccode": geg.index, "value": geg.values})

# country-level information
geg.to_parquet(sset.DIR_YPK_INT / "geg-15_ctry_lv.parquet")

In [None]:
# PWT10.0
pwt100 = (
    pd.read_excel(sset.PATH_PWT_RAW)
    .rename(columns={"countrycode": "ccode"})
    .set_index(["ccode", "year"])
)
capdata = histinfo[["cgdpo_17", "rgdpna_17", "curr_ppp_invest", "delta"]].merge(
    pwt100[["cn", "rnna"]],
    how="left",
    left_index=True,
    right_index=True,
)

# for litpop and geg-15, we retain current PPP but adjust from current USD to
# constant USD.

# litpop
litpop_meta = pd.read_csv(sset.DIR_LITPOP_RAW / "_metadata_countries_v1_2.csv").rename(
    columns={"iso3": "ccode", "total_value [USD]": "litpop_cn"}
)
litpop_meta = litpop_meta[~pd.isnull(litpop_meta.litpop_cn)]
litpop_meta["year"] = 2014
usd_14_to_17 = pwt100.loc[("USA", 2017), "pl_n"] / pwt100.loc[("USA", 2014), "pl_n"]
litpop_meta["litpop_cn"] = litpop_meta["litpop_cn"] / 1000000 * usd_14_to_17
litpop_meta.set_index(["ccode", "year"], inplace=True)

### geg-15
ctry_lv_geg = pd.read_parquet(sset.DIR_YPK_INT / "geg-15_ctry_lv.parquet").reset_index()
ctry_lv_geg["year"] = 2005
ctry_lv_geg = ctry_lv_geg.astype({"value": "float64"}).set_index(["ccode", "year"])
usd_05_to_17 = pwt100.loc[("USA", 2017), "pl_n"] / pwt100.loc[("USA", 2005), "pl_n"]
ctry_lv_geg["value"] = ctry_lv_geg["value"] / 1.24 * usd_05_to_17
ctry_lv_geg.rename(columns={"value": "geg_cn"}, inplace=True)

## merging all
capdata = capdata.merge(ctry_lv_geg, left_index=True, right_index=True, how="left")
capdata = capdata.merge(
    litpop_meta[["litpop_cn"]], left_index=True, right_index=True, how="left"
)

## we also merge the year-to-year PPP conversion rates
capdata = capdata.merge(
    ppp_K_yr_to_yr[["conv"]], left_index=True, right_index=True, how="left"
).drop(["index"], axis=1)
capdata.loc[pd.isnull(capdata.conv), "conv"] = 1

We set the capital stock values of uninhabited areas to 0.

In [None]:
## uninhabited areas
for i in sset.UNINHABITED_ISOS:
    capdata.loc[i, ["cn"]] = 0
    capdata.loc[i, ["rnna"]] = 0

### 2014-2020 projection for LitPop values, 2005-2020 projection for GEG-15 values, and 2020 projection for PWT10.0 (all in current PPP)

#### Log-linear interpolation for LitPop and GEG-15

In the case where 2014 value exists in LitPop and 2005 value exists in GEG-15, we will not try to extrapolate the 2005-2014 values via perpetual inventory method (PIM) but rather by log-linear interpolation. Note that this will only be done for countries *not* having PWT10.0 capital stock information.
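To illustrate what log-linear interpolation does, the sketch below connects a hypothetical 2005 value and a hypothetical 2014 value so that the logarithm of capital is linear in the year, which is equivalent to assuming a constant annual growth rate between the two endpoints. The helper name `log_linear_interp` is our own and is not part of the notebook's code.

```python
import math


def log_linear_interp(year0, value0, year1, value1):
    """Interpolate two positive values so that log(value) is linear in the year."""
    span = year1 - year0
    out = {}
    for year in range(year0, year1 + 1):
        w = (year - year0) / span  # weight on the later endpoint
        out[year] = math.exp((1 - w) * math.log(value0) + w * math.log(value1))
    return out


# Hypothetical endpoints: GEG-15 value in 2005, LitPop value in 2014.
path = log_linear_interp(2005, 100.0, 2014, 200.0)

# The implied year-on-year growth factor is the same in every year.
growth_factors = [path[y + 1] / path[y] for y in range(2005, 2014)]
print(round(growth_factors[0], 4), round(growth_factors[-1], 4))
```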

In [None]:
## getting the relevant ccodes
ccodes = capdata.index.get_level_values("ccode").unique()
pwt_cc = capdata.loc[~pd.isnull(capdata.cn), :].index.get_level_values("ccode").unique()
lp_cc = (
    capdata.loc[~pd.isnull(capdata.litpop_cn), :]
    .index.get_level_values("ccode")
    .unique()
)
geg_cc = (
    capdata.loc[~pd.isnull(capdata.geg_cn), :].index.get_level_values("ccode").unique()
)
lp_cc = np.setdiff1d(lp_cc, pwt_cc)
geg_cc = np.setdiff1d(geg_cc, pwt_cc)

We notice, however, that some additional *inhabited* countries or regions are missing all (1950-2020) capital stock information. In this case, we follow LitPop and assume that the 2014 value of capital stock for these countries is **1.247240** times the 2014 value of GDP (`cgdpo_17`, in this case). We store these values in the column `litpop_cn` and, for now, include these countries in the set of LitPop countries.

In [None]:
no_k_cc = np.setdiff1d(
    ccodes,
    np.union1d(np.union1d(np.union1d(lp_cc, geg_cc), pwt_cc), sset.UNINHABITED_ISOS),
)
print(no_k_cc)

litpop_ky_ratio = 1.247240
for i in no_k_cc:
    capdata.loc[(i, 2014), "litpop_cn"] = (
        litpop_ky_ratio * capdata.loc[(i, 2014), "cgdpo_17"]
    )

lp_cc = np.union1d(lp_cc, no_k_cc)
lp_geg_cc = np.intersect1d(lp_cc, geg_cc)

In [None]:
## interpolating
capdata["litpop_geg_cn"] = np.nan
for i in lp_geg_cc:
    val05 = capdata.loc[(i, 2005), "geg_cn"]
    val14 = capdata.loc[(i, 2014), "litpop_cn"]
    val05_14 = np.exp(
        np.interp(range(2005, 2015), [2005, 2014], np.log([val05, val14]))
    )
    capdata.loc[(i, list(range(2005, 2015))), "litpop_geg_cn"] = val05_14

#### PIM projection for LitPop-GEG (2014-2020), LitPop-only (2014-2020), GEG-15-only (2005-2020), and PWT10.0 (2019-2020)

In [None]:
def capital_perp_inven(
    currK_var="litpop_cn",
    currI_var="curr_ppp_invest",
    depre_var="delta",
    ppp_conv_var="conv",
    begin_end=[2014, 2020],
    df=capdata,
):
    """Using the investment values in `currI_var`, depreciation rate values in
    `depre_var`, and capital stock values in `currK_var`, conduct the perpetual
    inventory method to acquire capital stock values' estimates. In every step,
    the capital values are calculated as current PPP values.

    Parameters
    ----------
    currK_var : str
        variable name in `df` to contain known current-PPP capital stock values
    currI_var : str
        variable name in `df` to contain current-PPP investment values
    depre_var : str
        variable name in `df` to contain depreciation rate values
    ppp_conv_var : str
        variable name in `df` to contain the year-to-next-year conversion rates in PPP
    begin_end : array-like of int
        contains the year to begin the perpetual inventory method on and to end the said
        method on
    df : pandas DataFrame
        containing the necessary variables (`currK_var`, `currI_var`, `depre_var`, and
        `ppp_conv_var`) with indices `ccode` for country-code and `year`, in that order

    Returns
    -------
    df : pandas DataFrame
        containing information with the perpetual inventory method applied to produce
        estimates for (future) capital stock values

    """

    newvar = currK_var + "_proj"
    df[newvar] = df[currK_var].values
    for i in range(begin_end[0], begin_end[-1]):
        grossK = df.loc[
            (slice(None), i), [newvar, currI_var, depre_var, ppp_conv_var]
        ].copy()
        grossK["next_year_K"] = (
            (grossK[newvar] + grossK[currI_var])
            * (1 - grossK[depre_var])
            * (grossK[ppp_conv_var])
        )
        grossK.reset_index(inplace=True)
        grossK["year"] = grossK["year"] + 1
        grossK.set_index(["ccode", "year"], inplace=True)
        df = df.merge(
            grossK[["next_year_K"]], left_index=True, right_index=True, how="left"
        )
        df.loc[(slice(None), i + 1), newvar] = df.loc[
            (slice(None), i + 1), "next_year_K"
        ].values
        df.drop(["next_year_K"], axis=1, inplace=True)

    return df

In [None]:
## updating litpop
capdata = capital_perp_inven(df=capdata)

## updating geg-15
capdata = capital_perp_inven("geg_cn", begin_end=[2005, 2020], df=capdata)

## updating litpop-geg-15
capdata = capital_perp_inven("litpop_geg_cn", df=capdata)

## updating cn for PWT10.0
capdata = capital_perp_inven("cn", begin_end=[2019, 2020], df=capdata)

#### Creating a single current PPP, 2017 USD capital stock series (`cn_extrap`) for the data so far and tagging sources

Again, we prioritize PWT10.0, then LitPop-GEG-15, then LitPop, then finally GEG-15.

In [None]:
## filling in the values
capdata["cn_extrap"] = capdata["cn_proj"].values
capdata.loc[(lp_geg_cc, slice(None)), "cn_extrap"] = capdata.loc[
    (lp_geg_cc, slice(None)), "litpop_geg_cn_proj"
].values
lp_only = np.setdiff1d(lp_cc, lp_geg_cc)
capdata.loc[(lp_only, slice(None)), "cn_extrap"] = capdata.loc[
    (lp_only, slice(None)), "litpop_cn_proj"
].values
geg_only = np.setdiff1d(geg_cc, lp_geg_cc)
capdata.loc[(geg_only, slice(None)), "cn_extrap"] = capdata.loc[
    (geg_only, slice(None)), "geg_cn_proj"
].values

## filling in the source information
capdata["cs"] = "-"
capdata.loc[~pd.isnull(capdata.cn), "cs"] = "PWT"
capdata.loc[~pd.isnull(capdata.cn_proj) & (capdata.cs == "-"), "cs"] = "PWT_perp_inven"
capdata.loc[~pd.isnull(capdata.litpop_cn) & (capdata.cs == "-"), "cs"] = "LitPop"
capdata.loc[~pd.isnull(capdata.geg_cn) & (capdata.cs == "-"), "cs"] = "GEG-15"

capdata.loc[
    ~pd.isnull(capdata.litpop_geg_cn) & (capdata.cs == "-"), "cs"
] = "LitPop_GEG-15_interp"
capdata.loc[
    ~pd.isnull(capdata.litpop_geg_cn_proj) & (capdata.cs == "-"),
    "cs",
] = "LitPop_perp_inven"
capdata.loc[
    ~pd.isnull(capdata.litpop_cn_proj) & (capdata.cs == "-"),
    "cs",
] = "LitPop_perp_inven"
capdata.loc[
    ~pd.isnull(capdata.geg_cn_proj) & (capdata.cs == "-"), "cs"
] = "GEG-15_perp_inven"

capdata.loc[(no_k_cc, [2014]), "cs"] = "mult_LitPop_ratio"
capdata.loc[(no_k_cc, list(range(2015, 2021))), "cs"] = "mult_LitPop_perp_inven"

### Finding the initial capital stock (at the year 1950)

#### Grouping the countries (via $k$-means) to find the optimal rate of change of capital intensity (capital to GDP ratio) and the range of initial capital

The methodology implemented in PWT (as documented in [this PWT9.1 appendix](https://www.rug.nl/ggdc/docs/pwt91_capitalservices_ipmrevision.pdf)) to estimate the initial capital stock is as follows:
1. Set a lower bound and an upper bound of capital intensity at the initial available year ($t_0$), and multiply them by the year-$t_0$ value of current PPP GDP to get lower and upper bounds of year-$t_0$ capital stock estimates. In PWT9.1, the values of lower and upper bounds of year-$t_0$ capital intensities are 0.5 and 4.0.
2. Add on investment values and account for depreciation via the perpetual inventory method (PIM) and "grow" the upper and lower capital stocks.
3. Due to depreciation, there will be a year (call this $t^*$) at which the two (upper and lower) tracks of current-PPP capital become close; PWT9.1 sets the "closeness" as 10% (so upper-bound capital is less than 1.1 times lower-bound capital).
4. Calculate the upper and lower capital intensities at $t^*$, and calculate their simple mean (denoted by $\kappa_{t^*}$).
5. Decrease this year-$t^*$ capital intensity by the per-annum capital intensity growth rate (set as $g_\kappa=0.02$) until it reaches the initial year. So the new initial capital intensity at $t_0$ is $\kappa_{t_0}= \kappa_{t^*}- g_\kappa(t^* - t_0)$.
6. Multiply this value with the GDP at year-$t_0$ to acquire the year-$t_0$ capital stock value.
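
The six steps above can be sketched for a single hypothetical country as follows. This is a simplified illustration rather than PWT's implementation: it assumes a constant depreciation rate, ignores year-to-year PPP conversion, and uses made-up GDP and investment paths. The function name `initial_capital_intensity` is our own.

```python
def initial_capital_intensity(gdp, invest, delta, k_low=0.5, k_high=4.0,
                              growth=0.02, thresh=0.1):
    """Sketch of steps 1-5 for one country; lists are indexed from t0."""
    n = len(gdp)
    # Step 1: lower and upper capital stocks at t0.
    lows, highs = [k_low * gdp[0]], [k_high * gdp[0]]
    # Step 2: grow both tracks with the same investment and depreciation.
    for t in range(n - 1):
        lows.append((lows[t] + invest[t]) * (1 - delta))
        highs.append((highs[t] + invest[t]) * (1 - delta))
    # Step 3: first year in which the tracks are within `thresh` of each other
    # (falls back to the last year if they never get that close).
    tstar = next((t for t in range(n) if highs[t] / lows[t] - 1 < thresh), n - 1)
    # Steps 4-5: mean intensity at t*, walked back to t0 at `growth` per year.
    kappa_tstar = 0.5 * (lows[tstar] + highs[tstar]) / gdp[tstar]
    kappa_t0 = kappa_tstar - growth * tstar
    return {"tstar": tstar, "kappa_t0": kappa_t0, "lows": lows, "highs": highs}


# Hypothetical economy: 3% GDP growth, investment at 20% of GDP, 5% depreciation.
gdp = [100.0 * 1.03**t for t in range(71)]
invest = [0.2 * y for y in gdp]
res = initial_capital_intensity(gdp, invest, delta=0.05)

# Step 6: initial capital stock = initial capital intensity * initial GDP.
init_K = res["kappa_t0"] * gdp[0]
print(res["tstar"], round(res["kappa_t0"], 3), round(init_K, 1))
```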

While we follow this methodology, the capital intensity growth rate seems to vary a lot from country to country. Further, the initial lower and upper bounds of capital intensity of 0.5 and 4.0, respectively, do not seem to fit the PWT10.0 update. Therefore, we do the following:

1. Group the countries via $k$-means using their available capital intensities.
2. For each group, find the earliest-year (preferably 1950) upper and lower bounds of capital intensity.
3. Also for each group, find the per-annum capital intensity growth rates.
4. Apply the above-mentioned methodology for each group, using the updated lower / upper bounds of capital intensity at the initial year (1950) and capital intensity growth rates.

For grouping, we try regular $k$-means using only the years currently available for all countries (2014-2020), and $k$-means with the missing values filled in by the EM algorithm. The former yields a more balanced classification, so we go with regular $k$-means with $k=3$. The EM-augmented $k$-means algorithm is adapted from this [Stack Overflow post](https://stackoverflow.com/questions/35611465/python-scikit-learn-clustering-with-missing-data).
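To make the grouping idea concrete, here is a small pure-Python sketch of Lloyd's $k$-means iteration on hypothetical capital-intensity vectors (one row per country, one entry per year). It is only an illustration: the notebook itself uses scikit-learn's `KMeans`, whose initialization differs from the fixed starting centroids assumed here.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, max_iter=100):
    """Plain Lloyd's algorithm with user-supplied starting centroids."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # update step: each centroid becomes the mean of its members
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical capital intensities for six countries over three years.
points = [
    (1.0, 1.1, 1.2), (1.1, 1.2, 1.2),
    (3.0, 3.1, 3.2), (3.1, 3.0, 3.3),
    (6.0, 6.2, 6.1), (5.9, 6.0, 6.3),
]
labels, centroids = kmeans(points, [points[0], points[2], points[4]])
print(labels, Counter(labels))
```

Counting the labels, as in the last line, is the same balance check that the histograms and group counts in the cells below perform.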

In [None]:
def kmeans_missing(X, n_clusters, max_iter=10, rand_state=60607):
    """Perform K-Means clustering on data with missing values.

    Parameters
    ----------
    X : array-like
        wide-format array (with each row being different countries) to conduct the
        EM algorithm and k-means clustering on
    n_clusters : int
        number of clusters to form
    max_iter : int
        maximum number of EM iterations to perform
    rand_state : int
        random state, for replicability

    Returns
    -------
    labels : array-like of int
        containing integer labels, based on the EM-augmented k-means algorithm, for each
        row in the array-like `X`
    centroid : array-like
        containing the centroid for each of the k-means label
    X_hat : array-like
        copy of `X` with the missing values filled in using the EM algorithm

    """

    # Initialize missing values to their column means
    missing = ~np.isfinite(X)
    mu = np.nanmean(X, 0, keepdims=1)
    X_hat = np.where(missing, mu, X)

    for i in range(max_iter):
        if i > 0:
            # initialize KMeans with the previous set of centroids. this is much
            # faster and makes it easier to check convergence (since labels
            # won't be permuted on every iteration), but might be more prone to
            # getting stuck in local minima.
            clus = KMeans(n_clusters, init=prev_centroids, random_state=rand_state)
        else:
            # first iteration: no previous centroids yet, so use KMeans' default
            # initialization
            clus = KMeans(n_clusters, random_state=rand_state)

        # perform clustering on the filled-in data
        labels = clus.fit_predict(X_hat)
        centroids = clus.cluster_centers_

        # fill in the missing values based on their cluster centroids
        X_hat[missing] = centroids[labels][missing]

        # when the labels have stopped changing then we have converged
        if i > 0 and np.all(labels == prev_labels):
            break

        prev_labels = labels
        prev_centroids = clus.cluster_centers_

    return labels, centroids, X_hat

In [None]:
## set aside the uninhabited areas
uninh_capdata = capdata.loc[sset.UNINHABITED_ISOS, :].copy()
capdata = capdata.loc[
    ~capdata.index.get_level_values("ccode").isin(sset.UNINHABITED_ISOS), :
].sort_index()

## creating K-Y ratios dataset, horizontal form (for k-means)
capdata["cap_intensity"] = capdata["cn_extrap"] / capdata["cgdpo_17"]
cap_intensity = ypk_fn.organize_ver_to_hor(
    capdata,
    "cap_intensity",
    "year",
    "ccode",
    range(1950, 2021),
)
all_kys = ["v_" + str(X) for X in range(1950, 2021)]
cap_intensity[all_kys] = cap_intensity[all_kys].astype("float64")

## we can use only the filled information; initializing clustering algorithms
cluster_3 = KMeans(n_clusters=3, random_state=60607)
comp_ky_s_filled = ["v_" + str(X) for X in range(2014, 2021)]
cap_intensity["cl3"] = cluster_3.fit(cap_intensity[comp_ky_s_filled].values).labels_

## in terms of balanced classification, k=3 seems to work best;
## we run the EM-augmented version with k=3 as well for comparison
em_kmeans = kmeans_missing(cap_intensity[all_kys], 3)
cap_intensity["cl3_em"] = em_kmeans[0]

In [None]:
## trying to see the balancedness between regular k-means and EM-augmented version
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.hist(cap_intensity["cl3"].astype("int64"))
ax1.set_xticks([0, 1, 2])

ax2.hist(cap_intensity["cl3_em"].astype("int64"))
ax2.set_xticks([0, 1, 2])

ax1.set_ylim([0, 160]), ax2.set_ylim([0, 160])
ax1.set_yticks([0, 40, 80, 120, 160]), ax2.set_yticks([0, 40, 80, 120, 160])

fig.show()

In [None]:
# group sizes under regular k-means (for comparison, the EM grouping assigns
# only 1 country to the final group)
cap_intensity.reset_index().groupby(["cl3"]).count()[["ccode"]]

In [None]:
# replication of Table 1 in Inklaar et al. (Intl Productivity Monitor 2019)
rows = []
for i in [1950, 1960, 1970, 1980, 1990, 2000, 2011, 2017]:
    v = f"v_{i}"
    row = [i, cap_intensity.loc[~pd.isnull(cap_intensity[v]), :].shape[0]]
    row += [
        round(cap_intensity[v].mean(), 1),
        round(cap_intensity[v].std(), 1),
        round(cap_intensity[v].min(), 1),
        round(cap_intensity[v].max(), 1),
    ]
    rows.append(row)
np.set_printoptions(suppress=True)
print(np.array(rows))

In [None]:
## attaching the cluster types
capdata = capdata.merge(
    cap_intensity["cl3"], left_index=True, right_index=True, how="left"
)

In [None]:
def calculate_min_max_growthrate_by_group(
    df=capdata, group="cl3", ratio="cap_intensity"
):
    """By specified `group` designation, calculate the lower and upper bounds of the
    variable `ratio` contained in DataFrame `df`, as well as the said variable's average
    annual growth rate.

    Parameters
    ----------
    df : pandas DataFrame
        containing information about the `group` and `ratio`. Should also contain the
        variable `year` as growth rate values are calculated yearly.
    group : str
        column name in `df` that represents the grouping (by k-means clustering or
        other methods)
    ratio : str
        column name in `df` that represents the variable for calculating the lower,
        upper bounds and annual growth rates

    Returns
    -------
    growth_rate_df : pandas DataFrame
        containing, by group, the information about lower bound of `ratio` (`ky_lower`,
        set to be the 10th percentile), upper bound of `ratio` (`ky_upper`, set to be
        the 90th percentile), and growth rate per annum of `ratio` (`ky_growth`). Also
        stores the grouping information in the variable `cl`.

    """

    growth_rate_df = []
    for cl in np.sort(df[group].unique()):
        cl_df = df.loc[df[group] == cl, [ratio]].copy()
        nona_ratios = cl_df[ratio].values
        nona_ratios = nona_ratios[~pd.isnull(nona_ratios)]
        cl_lower, cl_upper = np.quantile(nona_ratios, [0.1, 0.9])

        cl_df = cl_df.loc[~pd.isnull(cl_df[ratio]), :].reset_index()
        cl_growth = sm.OLS(
            cl_df[ratio].astype("float64"),
            sm.add_constant(cl_df[["year"]]).astype("float64"),
        )
        cl_growth = cl_growth.fit().params["year"]
        growth_rate_df.append([cl, cl_lower, cl_upper, cl_growth])

    growth_rate_df = pd.DataFrame(
        np.vstack(growth_rate_df),
        columns=["cl", "ky_lower", "ky_upper", "ky_growth"],
    )

    return growth_rate_df

In [None]:
## growth rates, upper and lower bounds for capital intensity
cl_gr = calculate_min_max_growthrate_by_group(df=capdata).rename(columns={"cl": "cl3"})
cl_gr["cl3"] = cl_gr["cl3"].astype("int64")
capdata = (
    capdata.reset_index()
    .merge(cl_gr, on=["cl3"], how="left")
    .set_index(["ccode", "year"])
)

In [None]:
cl_gr

#### Applying PWT 9.1's method, cluster by cluster, and interpolating with the known values of capital

Applying PWT 9.1's method also requires the investment values created earlier; the function below uses them together with the cluster-level bounds and growth rates of capital intensity.

In [None]:
def find_init_k(
    df=capdata,
    begin_end=[1950, 2020],
    lb="ky_lower",
    ub="ky_upper",
    gr="ky_growth",
    currK_var="cn",
    currY_var="cgdpo_17",
    currI_var="curr_ppp_invest",
    depre_var="delta",
    ytoy_ppp="conv",
    cluster="cl3",
    ub_lb_thresh=0.1,
):
    """Finding the initial value of capital (at the year specified by `begin_end`)
    based on the methodology of PWT 9.1.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame to contain all necessary information (current PPP GDP, investment,
        depreciation rates, growth rate, lower bound and upper bound for the capital
        intensity)
    begin_end : array-like of ints
        array-like containing two elements - initial year and the final year to be
        considered by the process
    lb : str
        column name in `df` to indicate the lower bound of capital intensity
    ub : str
        column name in `df` to indicate the upper bound of capital intensity
    gr : str
        column name in `df` to indicate the average yearly growth of capital intensity
    currK_var : str
        column name in `df` for current-PPP capital
    currY_var : str
        column name in `df` for current-PPP GDP
    currI_var : str
        column name in `df` for current-PPP investment
    depre_var : str
        column name in `df` for depreciation rate
    ytoy_ppp : str
        column name in `df` for year-to-next-year PPP conversion rate
    cluster : str
        column name in `df` for cluster (based on capital intensity values)
    ub_lb_thresh : float
        difference between upper- and lower-bound capital stock values to halt and
        acquire year `tstar`

    Returns
    -------
    estimated : pandas.DataFrame
        DataFrame with `ccode` (country code) as the index containing initial-year
        capital stock estimations (based on the PWT 9.1 method); only contains
        information if a country was actually missing the initial-year capital stock

    """

    cl_df = df[[cluster, lb, ub, gr, currY_var, currI_var, ytoy_ppp, depre_var]].copy()
    cl_df["low_k"], cl_df["high_k"] = np.nan, np.nan
    for yr in range(begin_end[0], begin_end[-1]):
        ## setting the initial year's lower and upper bound capital
        if yr == begin_end[0]:
            cl_df.loc[(slice(None), yr), "low_k"] = (
                cl_df.loc[(slice(None), yr), [currY_var, lb]].product(axis=1).values
            )
            cl_df.loc[(slice(None), yr), "high_k"] = (
                cl_df.loc[(slice(None), yr), [currY_var, ub]].product(axis=1).values
            )
        nxt = yr + 1
        for i in ["low_k", "high_k"]:
            cl_df.loc[(slice(None), nxt), i] = (
                cl_df.loc[(slice(None), yr), [i, currI_var]].sum(axis=1).values
                * (1 - cl_df.loc[(slice(None), yr), depre_var].values)
                * cl_df.loc[(slice(None), yr), ytoy_ppp].values
            )
    cl_df["hi_lo_ratio"] = cl_df["high_k"] / cl_df["low_k"] - 1

    ## finding t-star, the first year in which the relative gap between the high
    ## and low trajectories falls below the threshold set by `ub_lb_thresh`
    tstar_df = (
        cl_df.loc[cl_df.hi_lo_ratio < ub_lb_thresh, :]
        .reset_index()
        .groupby(["ccode"])
        .min()[["year"]]
        .rename(columns={"year": "tstar"})
    )
    cl_df = cl_df.merge(tstar_df, how="left", left_index=True, right_index=True)

    ## if tstar is not acquired, get the latest year to be the tstar
    cl_df.loc[pd.isnull(cl_df["tstar"]), "tstar"] = begin_end[-1]

    ## country-by-country calculation of initial capital for those missing them
    init = df.loc[(slice(None), begin_end[0]), [currK_var]].copy()
    msng_ccodes = (
        init.loc[pd.isnull(init[currK_var]), :].index.get_level_values("ccode").unique()
    )
    estimated = []
    for cc in msng_ccodes:
        ## how many years from tstar to initial year
        tstar = cl_df.loc[(cc, begin_end[0]), "tstar"]
        tstar_t0 = tstar - begin_end[0]

        ## initial-year capital-to-GDP ratio
        init_ky = cl_df.loc[(cc, [tstar]), ["high_k", "low_k"]].mean(axis=1).values[
            0
        ] / cl_df.loc[(cc, tstar), currY_var] - (
            tstar_t0 * cl_df.loc[(cc, begin_end[0]), gr]
        )
        if init_ky < cl_df.loc[(cc, tstar), lb]:
            init_ky = cl_df.loc[(cc, tstar), lb]
        elif init_ky > cl_df.loc[(cc, tstar), ub]:
            init_ky = cl_df.loc[(cc, tstar), ub]

        ## initial-year capital value
        init_K = init_ky * cl_df.loc[(cc, begin_end[0]), currY_var]
        estimated.append([cc, init_K])
    estimated = pd.DataFrame(
        np.vstack(estimated), columns=["ccode", "cn_init_estim"]
    ).set_index(["ccode"])

    return estimated

In [None]:
## estimating the "cn_init_estim" (missing initial-year capital)
cn_init_estim = find_init_k(capdata)

## merging with the rest
capdata = capdata.merge(cn_init_estim, left_index=True, right_index=True, how="left")
capdata.loc[
    (capdata.cs == "-")
    & (~pd.isnull(capdata.cn_init_estim))
    & (capdata.index.get_level_values("year") == 1950),
    "cs",
] = "init_K_estim"

In [None]:
## interpolating the rest, and filling the said values to cn_extrap
msng_ccodes = (
    capdata.loc[(~pd.isnull(capdata.cn_init_estim)), :]
    .index.get_level_values("ccode")
    .unique()
)
capdata["cn_init_estim"] = capdata["cn_init_estim"].astype("float64")
capdata["cn_extrap"] = capdata["cn_extrap"].astype("float64")
for i in msng_ccodes:
    ## initial capital that was estimated
    init_K = capdata.loc[(i, 1950), "cn_init_estim"]

    filled_K = capdata.loc[
        (capdata.index.get_level_values("ccode") == i)
        & (~pd.isnull(capdata.cn_extrap)),
        ["cn_extrap"],
    ]
    filled_yr_min = filled_K.index.get_level_values("year").min()
    filled_yr_min_K = capdata.loc[(i, filled_yr_min), "cn_extrap"]

    interp_K = np.interp(
        range(1950, filled_yr_min + 1),
        [1950, filled_yr_min],
        np.log([init_K, filled_yr_min_K]),
    )
    interp_K = np.exp(interp_K)
    i_yrs = list(range(1950, filled_yr_min + 1))
    capdata.loc[(i, i_yrs), "cn_extrap"] = interp_K
    capdata.loc[(i, i_yrs[1:-1]), "cs"] = "init_K_estim_interp"

## filling in information for ratio-extrapolated
capdata.loc[(no_k_cc, 2014), "cs"] = "LitPop_ratio_extrap"

capdata = pd.concat([capdata, uninh_capdata], axis=0).sort_index()

We will merge the acquired result for current-PPP capital stock (and their sources) with the other historical data.

In [None]:
histinfo = histinfo.merge(
    capdata[["cn_extrap", "cs"]].rename(columns={"cs": "capital_source"}),
    how="left",
    left_index=True,
    right_index=True,
)

## Filling in the missing `rnna` values, generating current PPP, 2019 USD capital values (`cn_19`) and constant 2019 PPP USD capital values (`rnna_19`)

### For the missing `rnna` values (constant 2017 PPP USD)

For these, we need to make sure that $rnna_{c, 2017} = cn_{c, 2017}$ for any country $c$. For the countries whose `rnna` information is missing entirely, we use the (extrapolated) conversion rates to turn the `cn` values into `rnna` values. For the countries whose `rnna` information exists only partially, we first apply the conversion rates to get `rnna` equivalents, compute the growth rates of the `rnna` equivalents over the missing years, and apply them to the existing `rnna` values.
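The growth-rate splicing for partially observed countries can be sketched as below. The helper name `splice` and the numbers are hypothetical; the idea is that the missing years inherit the growth profile of the `rnna` equivalents while remaining anchored to the nearest observed `rnna` value.

```python
def splice(equiv, actual):
    """Extend `actual` over the years of `equiv` using the growth of `equiv`.

    equiv  : dict year -> rnna-equivalent value, available for all years
    actual : dict year -> observed rnna value, available for some years
    """
    first, last = min(actual), max(actual)
    out = dict(actual)
    for year, value in equiv.items():
        if year < first:
            # backward: scale by the equivalent series relative to the first known year
            out[year] = value / equiv[first] * actual[first]
        elif year > last:
            # forward: scale by the equivalent series relative to the last known year
            out[year] = value / equiv[last] * actual[last]
    return out


# Hypothetical series: observed rnna exists only for 1952-1953.
equiv = {1950: 80.0, 1951: 90.0, 1952: 100.0, 1953: 110.0, 1954: 121.0}
actual = {1952: 120.0, 1953: 130.0}
filled = splice(equiv, actual)
print({year: round(value, 2) for year, value in sorted(filled.items())})
```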

In [None]:
## conversion rates (PPP) attached (from current to 2017 PPP)
histinfo = histinfo.merge(
    ppp_to_2017_K[["conv"]].rename(columns={"conv": "curr_to_cnst"}),
    how="left",
    left_index=True,
    right_index=True,
)
histinfo.loc[(slice(None), 2020), "curr_to_cnst"] = histinfo.loc[
    (slice(None), 2019), "curr_to_cnst"
].values
histinfo.loc[pd.isnull(histinfo.curr_to_cnst), "curr_to_cnst"] = 1

## creating `rnna equivalents`
histinfo["rnna_equiv"] = histinfo["cn_extrap"] * histinfo["curr_to_cnst"]

In [None]:
## merging the actual rnna values from PWT10.0, and detecting which are missing
## rnna values completely
histinfo = histinfo.merge(
    pwt100[["rnna"]], left_index=True, right_index=True, how="left"
)

## detecting those that have some rnna information vs. don't
count_rnna = histinfo.reset_index().groupby("ccode").count()[["rnna"]]
no_rnna = count_rnna.loc[count_rnna.rnna == 0, :].index.values
some_rnna = count_rnna.loc[count_rnna.rnna > 0, :].index.values

## filling in the information for those that absolutely do not have rnna information
histinfo["rnna_extrap"] = np.nan
histinfo.loc[(no_rnna, slice(None)), "rnna_extrap"] = histinfo.loc[
    (no_rnna, slice(None)), "rnna_equiv"
].values

In [None]:
## for the partially-filled countries, fill in by using growth rates
for cc in tqdm(some_rnna):
    nona_yrs = histinfo.loc[
        (histinfo.index.get_level_values("ccode") == cc) & (~pd.isnull(histinfo.rnna)),
        :,
    ]
    nona_yrs = nona_yrs.index.get_level_values("year")
    nona_maxyr, nona_minyr = nona_yrs.max(), nona_yrs.min()

    ## copying information into the rnna_extrap column
    histinfo.loc[(cc, nona_yrs), "rnna_extrap"] = histinfo.loc[
        (cc, nona_yrs), "rnna"
    ].values

    ## using growth rates for extrapolation
    rnna_1950, rnna_2020 = histinfo.loc[(cc, [1950, 2020]), "rnna"].values
    if pd.isnull(rnna_1950):
        fill_yrs = list(range(1950, nona_minyr + 1))
        equiv = histinfo.loc[(cc, fill_yrs), "rnna_equiv"].values
        actual_extrap = (equiv / equiv[-1]) * histinfo.loc[(cc, nona_minyr), "rnna"]
        histinfo.loc[(cc, fill_yrs), "rnna_extrap"] = actual_extrap

    if pd.isnull(rnna_2020):
        fill_yrs = list(range(nona_maxyr, 2021))
        equiv = histinfo.loc[(cc, fill_yrs), "rnna_equiv"].values
        actual_extrap = (equiv / equiv[0]) * histinfo.loc[(cc, nona_maxyr), "rnna"]
        histinfo.loc[(cc, fill_yrs), "rnna_extrap"] = actual_extrap

### Creating `cn_19` and `rnna_19`

Again, for these, it must be that $cn\_19_{c, 2019} = rnna\_19_{c, 2019}$ for all countries.
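
A minimal sketch of this rebasing on toy data is given here. The country codes and values are hypothetical, and the level-aware `div`/`mul` calls are a compact stand-in for the merge-based steps used in the next cell, not a replacement for them.

```python
import pandas as pd

# Toy (ccode, year) panel with hypothetical values, mimicking `histinfo`
idx = pd.MultiIndex.from_product(
    [["AAA", "BBB"], [2017, 2018, 2019]], names=["ccode", "year"]
)
toy = pd.DataFrame(
    {
        "rnna_extrap": [90.0, 95.0, 100.0, 10.0, 20.0, 40.0],  # constant 2017 PPP
        "cn_19": [88.0, 97.0, 105.0, 12.0, 21.0, 42.0],  # current PPP, 2019 USD
    },
    index=idx,
)

# 2019 values per country, indexed by ccode only
rnna_2019 = toy["rnna_extrap"].xs(2019, level="year")
cn_19_2019 = toy["cn_19"].xs(2019, level="year")

# scale factor equal to 1 in 2019, then re-anchored on the 2019 `cn_19` level
scale = toy["rnna_extrap"].div(rnna_2019, level="ccode")
toy["rnna_19"] = scale.mul(cn_19_2019, level="ccode")
```

By construction, `rnna_19` keeps the growth path of the constant-price series and coincides with `cn_19` in 2019 for each country.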

In [None]:
## cn_19 is created simply by changing from USD of 2017 to USD of 2019
usd_17_19 = pwt100.loc[("USA", 2019), "pl_n"] / pwt100.loc[("USA", 2017), "pl_n"]
histinfo["cn_19"] = histinfo["cn_extrap"] * usd_17_19

## creating rnna_19; first creating scale factors with 2019 values being 1
rnna_17_2019_vals = (
    histinfo.loc[(slice(None), 2019), ["rnna_extrap"]]
    .reset_index()
    .drop(["year"], axis=1)
    .set_index(["ccode"])
    .rename(columns={"rnna_extrap": "rnna_2019_vals"})
)
histinfo = histinfo.merge(
    rnna_17_2019_vals, left_index=True, right_index=True, how="left"
)
histinfo["rnna_2019_scale"] = histinfo["rnna_extrap"] / histinfo["rnna_2019_vals"]

## multiplying the cn_19 values of 2019
cn_19_2019_vals = (
    histinfo.loc[(slice(None), 2019), ["cn_19"]]
    .reset_index()
    .drop(["year"], axis=1)
    .set_index(["ccode"])
    .rename(columns={"cn_19": "cn_19_2019"})
)
histinfo = histinfo.merge(
    cn_19_2019_vals, left_index=True, right_index=True, how="left"
)
histinfo["rnna_19"] = histinfo["rnna_2019_scale"] * histinfo["cn_19_2019"]

## Creating capital and population scales, organizing the variable names, and exporting

### Creating capital scale (with respect to `cn_19` of 2019) and population scale (with respect to `pop` of 2019)

In [None]:
## pop scale
pop2019 = (
    histinfo.loc[(slice(None), 2019), ["pop"]]
    .reset_index()
    .drop(["year"], axis=1)
    .set_index(["ccode"])
    .rename(columns={"pop": "pop_2019"})
)
histinfo = histinfo.merge(pop2019, left_index=True, right_index=True, how="left")
histinfo["pop_scale"] = histinfo["pop"] / histinfo["pop_2019"]

In [None]:
## capital scale
cn2019 = (
    histinfo.loc[(slice(None), 2019), ["cn_19"]]
    .reset_index()
    .drop(["year"], axis=1)
    .set_index(["ccode"])
    .rename(columns={"cn_19": "cn_2019"})
)
histinfo = histinfo.merge(cn2019, left_index=True, right_index=True, how="left")
histinfo["rnna_19_scale"] = histinfo["rnna_19"] / histinfo["cn_2019"]
histinfo["cn_19_scale"] = histinfo["cn_19"] / histinfo["cn_2019"]

### Variable name cleanup

In [None]:
histinfo_columns = [
    "pop_unit",
    "gdppc_unit",
    "gdp_capital_unit",
    "pop_source",
    "gdp_source",
    "iy_ratio_source",
    "k_ratio_source",
    "delta_source",
    "capital_source",
    "pop",
    "pop_scale",
    "rgdpna_pc_17",
    "rgdpna_17",
    "rgdpna_pc_19",
    "rgdpna_19",
    "cgdpo_pc_17",
    "cgdpo_17",
    "cgdpo_pc_19",
    "cgdpo_19",
    "iy_ratio",
    "iy_ratio_fit",
    "k_movable_ratio",
    "k_struc_ratio",
    "k_mach_ratio",
    "k_traeq_ratio",
    "k_other_ratio",
    "delta",
    "rnna_17",
    "rnna_19",
    "rnna_19_scale",
    "cn_17",
    "cn_19",
    "cn_19_scale",
]
histinfo_final = histinfo.copy()
histinfo_final.rename(
    columns={
        "rnna_extrap": "rnna_17",
        "cn_extrap": "cn_17",
        "gdp_unit": "gdp_capital_unit",
    },
    inplace=True,
)

## filling in the nan's with 0s
fill0 = [
    "rnna_17",
    "rnna_19",
    "rnna_19_scale",
    "cn_17",
    "cn_19",
    "cn_19_scale",
    "pop_scale",
]
histinfo_final[fill0] = histinfo_final[fill0].fillna(0)

histinfo_final = histinfo_final[histinfo_columns].copy()

### Exporting the data

In [43]:
import os

os.makedirs(sset.DIR_YPK_FINAL, exist_ok=True)
histinfo_final.to_parquet(
    sset.DIR_YPK_FINAL / "gdp_gdppc_pop_capital_1950_2020.parquet"
)