# Feature Engineering

**Feature categories:**
- Productivity and structural indicators
- Time-based features (lags, growth rates, smoothing)
- Transformations and interactions
- Investment and capital dynamics
- Labor market dynamics
- Price level comparisons
- Convergence analysis
- TFP and efficiency measures
- Volatility and crisis indicators

In [32]:
import pandas as pd
import numpy as np

## Load Data

In [33]:
df = pd.read_csv("dataset/pwt110_cleaned.csv")
print(f"Loaded {df.shape[0]:,} rows, {df.shape[1]} columns")

Loaded 11,201 rows, 51 columns


## Productivity and Structural Indicators

**GDP per capita** (`gdp_pc`): Standard of living measure; enables cross-country comparisons and tracks economic development over time.

**GDP per worker** (`gdp_pw`): Labor productivity without adjusting for skill levels; useful for understanding raw output efficiency.

**GDP per hour** (`gdp_ph`): Most precise productivity measure; accounts for differences in work hours across countries; limited by data availability.

**Employment rate** (`emp_rate`): Labor market participation indicator; reflects demographic structure and labor market institutions.

**Capital per worker** (`k_per_worker`): Capital intensity; key input in growth accounting and convergence regressions.

**Trade openness** (`trade_open`): Sum of exports and imports as share of GDP; measures economic integration with global markets.

**Net exports** (`net_exports`): Current account proxy; indicates whether country is net lender or borrower internationally.

**Effective labor** (`eff_labor`): Human capital-adjusted employment; recognizes that skilled workers contribute more than raw headcount.

**GDP per effective worker** (`gdp_per_eff_worker`): Productivity adjusted for education/skills; isolates technology and capital effects from human capital.

In [34]:
df["gdp_pc"] = df["rgdpo"] / df["pop"]
df["gdp_pw"] = df["rgdpo"] / df["emp"]
df["gdp_ph"] = df["rgdpo"] / (df["emp"] * df["avh"])

df["emp_rate"] = df["emp"] / df["pop"]

if "rkna" in df.columns:
    df["k_per_worker"] = df["rkna"] / df["emp"]

df["trade_open"] = df["csh_x"] + df["csh_m"]
df["net_exports"] = df["csh_x"] - df["csh_m"]

df["eff_labor"] = df["hc"] * df["emp"]
df["gdp_per_eff_worker"] = df["rgdpo"] / df["eff_labor"]

## Time-Based Features

**Lagged GDP per capita** (`gdp_pc_lag1`): Previous year's income level; essential for beta convergence tests (do poor countries grow faster?).

**Growth rates** (`*_growth`): Year-over-year percentage changes; the standard metric for economic dynamics, enabling comparison across countries of different sizes.

**5-year smoothed growth** (`gdp_pc_growth_5yr`): Reduces noise from business cycles and measurement error; reveals underlying trends more clearly than annual changes.

In [35]:
df = df.sort_values(["countrycode", "year"])

df["gdp_pc_lag1"] = df.groupby("countrycode")["gdp_pc"].shift(1)

for var in ["gdp_pc", "pop", "hc", "rtfpna", "csh_i"]:
    df[f"{var}_growth"] = df.groupby("countrycode")[var].pct_change() * 100

df["gdp_pc_growth_5yr"] = (
    df.groupby("countrycode")["gdp_pc_growth"]
    .rolling(window=5, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)



## Transformations and Interactions


**Log GDP per capita** (`log_gdp_pc`): Makes exponential growth easier to analyze; turns coefficients into percentage changes.

**Log lagged GDP** (`log_gdp_pc_lag1`): Used in Solow growth model regressions; coefficient tests conditional convergence hypothesis.

**GDP relative to world mean** (`gdp_pc_rel_world`): Normalizes for global trends; values >1 indicate above-average income; useful for sigma convergence analysis.

**Globalization era** (`era_globalization`): Flag equal to 1 from 1990 onward; accounts for major changes from trade opening, new technologies, and policy shifts.

**Human capital × TFP** (`hc_x_tfp`): Interaction term; tests whether education amplifies technology adoption.

**Human capital × Investment** (`hc_x_investment`): Tests whether skilled workers make capital more productive; relevant for endogenous growth models.

**Trade × Human capital** (`trade_x_hc`): Captures that open economies with skilled workers may benefit more from technology transfer and learning-by-exporting.

In [36]:
df["log_gdp_pc"] = np.log(df["gdp_pc"])
df["log_gdp_pc_lag1"] = np.log(df["gdp_pc_lag1"])

df["gdp_pc_rel_world"] = df["gdp_pc"] / df.groupby("year")["gdp_pc"].transform("mean")

df["era_globalization"] = (df["year"] >= 1990).astype(int)

df["hc_x_tfp"] = df["hc"] * df["rtfpna"]
df["hc_x_investment"] = df["hc"] * df["csh_i"]
df["trade_x_hc"] = df["trade_open"] * df["hc"]
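
To illustrate how `log_gdp_pc_lag1` is typically used, here is a minimal sketch of an unconditional beta-convergence regression. It runs on synthetic data generated with a known slope, not on the PWT panel, and uses plain least squares from NumPy; the sample size, coefficients, and noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cross-section: growth falls with initial log income (true slope = -2).
log_income_lag = rng.uniform(6.0, 11.0, size=200)
growth = 25.0 - 2.0 * log_income_lag + rng.normal(0.0, 0.5, size=200)

# OLS of growth on a constant and lagged log income.
X = np.column_stack([np.ones_like(log_income_lag), log_income_lag])
coef, *_ = np.linalg.lstsq(X, growth, rcond=None)
alpha_hat, beta_hat = coef

# A negative slope means units with lower initial income grow faster.
print(f"beta_hat = {beta_hat:.2f}")
```

On the real panel the same regression would use `gdp_pc_growth` as the outcome and `log_gdp_pc_lag1` as the regressor, with additional controls for the conditional version.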

## Investment and Capital Dynamics

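**Net investment** (`net_investment`): Investment share minus the depreciation rate; a rough proxy for net additions to the capital stock, since `csh_i` is a share of GDP while `delta` is a rate applied to the capital stock; negative when the depreciation rate exceeds the investment share.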

**Capital intensity** (`capital_intensity`): Capital-output ratio; high values suggest capital-abundant economies; key parameter in growth accounting.

**Capital deepening growth** (`k_per_worker_growth`): Rate of capital accumulation per worker; distinguishes capital deepening from capital widening.

**Investment efficiency** (`investment_efficiency`): TFP per unit of investment; measures how effectively investment translates into productivity gains; higher values indicate better allocation or technology adoption.

In [37]:
df["net_investment"] = df["csh_i"] - df["delta"]
df["capital_intensity"] = df["rkna"] / df["rgdpo"]

df["k_per_worker_growth"] = df.groupby("countrycode")["k_per_worker"].pct_change() * 100

df["investment_efficiency"] = df["rtfpna"] / df["csh_i"]
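
Because `csh_i` is a share of GDP and `delta` is a depreciation rate on the capital stock, one unit-consistent variant scales the depreciation rate by the capital-output ratio before subtracting. The sketch below shows this on made-up numbers; `k_over_y` is a hypothetical column (capital stock divided by GDP in matching units) that is not created anywhere in this notebook.

```python
import pandas as pd

# Toy values; k_over_y is a hypothetical capital-output ratio column.
toy = pd.DataFrame({
    "csh_i": [0.25, 0.10],
    "delta": [0.04, 0.05],
    "k_over_y": [3.0, 2.5],
})

# Depreciation as a share of GDP = depreciation rate * (K / Y).
toy["dep_share"] = toy["delta"] * toy["k_over_y"]

# Net investment as a share of GDP; negative when depreciation outpaces investment.
toy["net_investment_share"] = toy["csh_i"] - toy["dep_share"]

print(toy)
```

In the second toy row, depreciation (12.5% of GDP) exceeds gross investment (10% of GDP), so the net share is negative even though the raw difference `csh_i - delta` would be positive.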



## Labor Market Dynamics

**Labor share growth** (`labsh_growth`): Changes in income distribution between capital and labor; declining labor shares may indicate skill-biased technical change or bargaining power shifts.

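**Human capital returns** (`hc_returns`): Proportional gap between GDP per effective worker and GDP per worker; as computed here it reduces to `1/hc - 1`, so it is zero when `hc` equals 1 and grows more negative as human capital rises; a transform of `hc` rather than an independent estimate of the return to education.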

**Dependency ratio** (`dependency_ratio`): Population per worker; higher values indicate more dependents (children, elderly, unemployed) per employed person; affects savings rates and fiscal sustainability.

In [38]:
df["labsh_growth"] = df.groupby("countrycode")["labsh"].pct_change() * 100

df["hc_returns"] = (df["gdp_per_eff_worker"] - df["gdp_pw"]) / df["gdp_pw"]

df["dependency_ratio"] = df["pop"] / df["emp"]
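
The `hc_returns` formula can be checked algebraically: since `gdp_per_eff_worker = gdp_pw / hc`, the ratio `(gdp_per_eff_worker - gdp_pw) / gdp_pw` collapses to `1/hc - 1`. A minimal sketch on made-up values confirms this:

```python
import numpy as np
import pandas as pd

# Toy inputs; values are invented for illustration.
toy = pd.DataFrame({"rgdpo": [1000.0, 500.0], "emp": [10.0, 4.0], "hc": [2.0, 3.2]})

toy["gdp_pw"] = toy["rgdpo"] / toy["emp"]
toy["gdp_per_eff_worker"] = toy["rgdpo"] / (toy["hc"] * toy["emp"])
toy["hc_returns"] = (toy["gdp_per_eff_worker"] - toy["gdp_pw"]) / toy["gdp_pw"]

# The ratio depends only on hc: output and employment cancel out.
assert np.allclose(toy["hc_returns"], 1.0 / toy["hc"] - 1.0)

print(toy[["hc", "hc_returns"]])
```

Any model that already includes `hc` therefore gains no new information from this column, only a nonlinear transform of it.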



## Price Level Comparisons

**Relative price level** (`rel_price_level`): GDP price level relative to US; values >1 mean country is more expensive; tests Balassa-Samuelson effect (rich countries have higher prices).

**Terms of trade** (`terms_of_trade`): Ratio of export to import prices; improvement means country gets more imports per unit of exports; affects real income and welfare.

**Investment price** (`investment_price`): Relative price of capital goods; high values discourage investment; explains differences in capital accumulation across countries.

**Penn effect** (`penn_effect`): Interaction between income and price levels; tests whether richer countries systematically have higher price levels due to non-tradable sector productivity.

In [39]:
df["rel_price_level"] = df["pl_gdpo"]

df["terms_of_trade"] = df["pl_x"] / df["pl_m"]

df["investment_price"] = df["pl_i"]

df["penn_effect"] = df["log_gdp_pc"] * df["pl_gdpo"]

## Convergence Analysis

**Distance from frontier** (`dist_from_frontier`): Absolute income gap to the richest country in the same year; measures how far behind a country is.

**Catch-up speed** (`catchup_speed`): Share of the gap between last year's income and the current frontier that was closed this year; values >0 indicate convergence; tests whether distance affects growth speed.

**Relative to frontier** (`rel_to_frontier`): Income as fraction of frontier; values near 1 mean country is at technological frontier.

In [40]:
frontier_gdp = df.groupby("year")["gdp_pc"].transform("max")
df["dist_from_frontier"] = frontier_gdp - df["gdp_pc"]

df["catchup_speed"] = (df["gdp_pc"] - df["gdp_pc_lag1"]) / (frontier_gdp - df["gdp_pc_lag1"])

df["rel_to_frontier"] = df["gdp_pc"] / frontier_gdp
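
The `catchup_speed` denominator is zero or negative for a country whose lagged income is at or above the current frontier, which yields `inf`, `NaN`, or a sign-flipped ratio. One possible guard, sketched on a made-up single-year example, masks non-positive gaps before dividing; whether such rows should be missing or handled differently is a modeling choice, not something the notebook specifies.

```python
import pandas as pd

# Toy single-year cross-section; values are invented for illustration.
toy = pd.DataFrame({
    "countrycode": ["AAA", "BBB"],
    "gdp_pc": [105.0, 55.0],
    "gdp_pc_lag1": [105.0, 50.0],  # AAA was already at the frontier last year
})
frontier = toy["gdp_pc"].max()

gap = frontier - toy["gdp_pc_lag1"]

# Mask non-positive gaps so the ratio is NaN rather than inf or sign-flipped.
toy["catchup_speed"] = (toy["gdp_pc"] - toy["gdp_pc_lag1"]) / gap.where(gap > 0)

print(toy)
```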

## TFP and Efficiency Measures

**TFP relative to US** (`tfp_rel_us`): Technology gap measure; US typically serves as frontier; values <1 indicate technology lag; used in development accounting.

**Welfare-relevant TFP** (`welfare_tfp`): Adjusts for terms of trade changes; countries benefiting from favorable export prices show higher welfare TFP than standard TFP.

**TFP growth acceleration** (`tfp_growth_accel`): Change in TFP growth rate; positive values indicate accelerating innovation.

In [41]:
us_tfp = df[df["countrycode"] == "USA"].set_index("year")["rtfpna"]
# Look up the US TFP value for each row's year (NaN when the USA has no value for that year)
df["tfp_rel_us"] = df["rtfpna"] / df["year"].map(us_tfp)

df["welfare_tfp"] = df["rwtfpna"]

df["tfp_growth_accel"] = df.groupby("countrycode")["rtfpna_growth"].diff()

## Volatility and Crisis Indicators

Macroeconomic stability and shock measures:

**Growth volatility** (`growth_volatility`): 5-year rolling standard deviation of GDP growth; high values indicate instability; associated with lower investment and welfare.

**Recession dummy** (`recession_dummy`): Binary indicator for negative growth years; identifies economic contractions for event studies.

**Recession count** (`recession_count`): Cumulative number of recessions; measures chronic instability; countries with frequent recessions may have structural vulnerabilities.

**Crisis dummy** (`crisis_dummy`): Severe contraction indicator (growth < -3%); more restrictive than recession; captures major economic disruptions like financial crises or political shocks.

In [42]:
df["growth_volatility"] = (
    df.groupby("countrycode")["gdp_pc_growth"]
    .rolling(window=5, min_periods=1)
    .std()
    .reset_index(level=0, drop=True)
)

df["recession_dummy"] = (df["gdp_pc_growth"] < 0).astype(int)
df["recession_count"] = df.groupby("countrycode")["recession_dummy"].cumsum()

df["crisis_dummy"] = (df["gdp_pc_growth"] < -3).astype(int)
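
A comparison against `NaN` evaluates to `False`, so the dummies above code years with missing growth (including each country's first year) as 0. If "unknown" should stay distinct from "no recession", one option is to keep the indicator as a float and restore the missing values, as in this minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Toy growth series; the first value is missing, as in a country's first year.
growth = pd.Series([np.nan, 2.0, -1.0, -4.0, 3.0], name="gdp_pc_growth")

# Naive version: the missing year is silently coded as 0.
naive = (growth < 0).astype(int)

# NaN-preserving version: 1.0 / 0.0 where growth is observed, NaN otherwise.
recession = (growth < 0).astype(float).where(growth.notna())
crisis = (growth < -3).astype(float).where(growth.notna())

print(pd.DataFrame({"growth": growth, "naive": naive,
                    "recession": recession, "crisis": crisis}))
```

Note that a cumulative count built with `cumsum()` skips missing values by default, so `recession_count` would be unaffected apart from being `NaN` in the missing years themselves.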

## Export

In [43]:
df.to_csv("dataset/pwt110_features.csv", index=False)
df.to_excel("dataset/pwt110_features.xlsx", index=False)

print("Saved enriched dataset with engineered features.")

Saved enriched dataset with engineered features.


## Create Legend Sheet

Add a Legend sheet to the Excel output combining original PWT variable definitions with all engineered features.

In [44]:
original_legend = pd.read_excel('dataset/pwt110.xlsx', sheet_name='Legend')

new_features = [
    ['gdp_pc', 'GDP per capita', '2017 US$ (thousands)'],
    ['gdp_pw', 'GDP per worker', '2017 US$ (thousands)'],
    ['gdp_ph', 'GDP per hour worked', '2017 US$'],
    ['emp_rate', 'Employment rate (employment/population)', 'ratio'],
    ['k_per_worker', 'Capital per worker', '2017 US$ (thousands)'],
    ['trade_open', 'Trade openness (sum of export and import shares)', 'share of GDP'],
    ['net_exports', 'Net exports (export share minus import share)', 'share of GDP'],
    ['eff_labor', 'Effective labor (human capital adjusted employment)', 'millions'],
    ['gdp_per_eff_worker', 'GDP per effective worker', '2017 US$ (thousands)'],
    ['gdp_pc_lag1', 'GDP per capita, lagged one year', '2017 US$ (thousands)'],
    ['gdp_pc_growth', 'GDP per capita growth rate', 'percent'],
    ['pop_growth', 'Population growth rate', 'percent'],
    ['hc_growth', 'Human capital growth rate', 'percent'],
    ['rtfpna_growth', 'TFP growth rate', 'percent'],
    ['csh_i_growth', 'Investment share growth rate', 'percent'],
    ['gdp_pc_growth_5yr', 'GDP per capita growth rate, 5-year moving average', 'percent'],
    ['log_gdp_pc', 'Natural log of GDP per capita', 'log'],
    ['log_gdp_pc_lag1', 'Natural log of GDP per capita, lagged one year', 'log'],
    ['gdp_pc_rel_world', 'GDP per capita relative to world mean', 'ratio'],
    ['era_globalization', 'Globalization era indicator (1 if year >= 1990)', 'binary'],
    ['hc_x_tfp', 'Human capital × TFP interaction', 'index'],
    ['hc_x_investment', 'Human capital × Investment share interaction', 'index'],
    ['trade_x_hc', 'Trade openness × Human capital interaction', 'index'],
    ['net_investment', 'Net investment (investment share minus depreciation)', 'share of GDP'],
    ['capital_intensity', 'Capital intensity (capital stock/GDP)', 'ratio'],
    ['k_per_worker_growth', 'Capital per worker growth rate', 'percent'],
    ['investment_efficiency', 'Investment efficiency (TFP/investment share)', 'index'],
    ['labsh_growth', 'Labor share growth rate', 'percent'],
    ['hc_returns', 'Human capital returns (productivity gain from education)', 'ratio'],
    ['dependency_ratio', 'Dependency ratio (population/employment)', 'ratio'],
    ['rel_price_level', 'Relative GDP price level (USA=1)', 'index'],
    ['terms_of_trade', 'Terms of trade (export/import price ratio)', 'index'],
    ['investment_price', 'Relative price of investment goods', 'index'],
    ['penn_effect', 'Penn effect (log GDP per capita × price level)', 'index'],
    ['dist_from_frontier', 'Distance from frontier (max GDP per capita - GDP per capita)', '2017 US$ (thousands)'],
    ['catchup_speed', 'Catch-up speed (growth relative to frontier gap)', 'ratio'],
    ['rel_to_frontier', 'GDP per capita relative to frontier', 'ratio'],
    ['tfp_rel_us', 'TFP relative to USA', 'ratio'],
    ['welfare_tfp', 'Welfare-relevant TFP (terms of trade adjusted)', 'index'],
    ['tfp_growth_accel', 'TFP growth acceleration (change in growth rate)', 'percentage points'],
    ['growth_volatility', 'Growth volatility (5-year rolling standard deviation)', 'percent'],
    ['recession_dummy', 'Recession indicator (1 if GDP growth < 0)', 'binary'],
    ['recession_count', 'Cumulative recession count', 'count'],
    ['crisis_dummy', 'Crisis indicator (1 if GDP growth < -3%)', 'binary']
]

# The original Legend sheet has two columns (name, definition), but each new entry
# also carries a unit. Add a third column for it; "Units" is a name chosen here,
# and the original rows will have NaN in that column.
legend_columns = [*original_legend.columns[:2], "Units"]
new_legend_df = pd.DataFrame(new_features, columns=legend_columns)
combined_legend = pd.concat([original_legend, new_legend_df], ignore_index=True)

with pd.ExcelWriter('dataset/pwt110_features.xlsx', engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    combined_legend.to_excel(writer, sheet_name='Legend', index=False)

print(f"Legend sheet created with {len(combined_legend)} variables ({len(original_legend)} original + {len(new_features)} engineered)")