## 1. Purpose

This section addresses **RQ3: Is there a significant correlation between the level and growth of the Composite Leading Indicator (CLI)?**

Since GDP data are not included in this project scope, we approximate **economic growth** using the **percentage change in CLI values** for each country.
This approach allows us to assess whether short-term fluctuations in CLI (its “momentum”) align with the overall CLI level trends across countries.

We will:

1. Compute **monthly CLI growth rates** (as percentage change).
2. Merge CLI level and growth data per country.
3. Calculate **cross-country correlations** between CLI level and CLI growth.
4. Visualize the results using **heatmaps** and **scatterplots**.
5. Interpret which economies show strong or weak coherence between CLI level and movement.
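
The plan above can be sketched end to end on a tiny synthetic table. This is only an illustration under stated assumptions, not the notebook's actual pipeline: the column names `Country`, `Value`, and `CLI_Growth` mirror those used later, while the sample numbers and the helper `level_growth_corr` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two countries, four monthly CLI levels each.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "TIME": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01"] * 2
    ),
    "Value": [100.0, 101.0, 103.0, 106.0, 100.0, 99.0, 99.5, 98.0],
})


def level_growth_corr(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation between CLI level and its monthly % change, per country."""
    df = df.sort_values(["Country", "TIME"]).copy()
    # Month-over-month percentage change within each country.
    df["CLI_Growth"] = (
        df.groupby("Country")["Value"]
          .transform(lambda s: s.pct_change(fill_method=None) * 100)
    )
    # The first month of each country has no previous value to compare with.
    df = df.dropna(subset=["CLI_Growth"])
    return pd.Series(
        {name: grp["Value"].corr(grp["CLI_Growth"]) for name, grp in df.groupby("Country")},
        name="corr_level_growth",
    )


corr = level_growth_corr(toy)
print(corr)
```

A per-country series like `corr` is the kind of object that could later feed a heatmap or a ranked bar chart.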



## 2. Load CLI Data

This step loads the **<u>Composite Leading Indicator (CLI)</u>** dataset used for the correlation analysis. GDP data are outside the project scope (see Section 1), so no GDP dataset is loaded or merged here.

- **CLI Data:** Cleaned monthly dataset (`df_cli_ready`) from the previous notebook, loaded below from `../data/cleaned_oecd_cli.csv`.
- **Goal:** Produce a single table, ordered by **<u>Country</u>** and **<u>Date</u>**, from which CLI level and CLI growth can be compared.

We will inspect the dataset to confirm consistent variable names, time coverage, and country codes before computing growth rates.
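
The inspection described above could look like the following sketch. The helper `summarize_cli` and the sample rows are hypothetical. The expected column list is taken from the `usecols` selection in the loading cell, and the check that each `LOCATION` code maps to one `Country` name is an assumption about what "consistent country codes" means here.

```python
import pandas as pd

# Columns the loading cell selects via `usecols`.
REQUIRED_COLS = ["LOCATION", "Country", "SUBJECT", "MEASURE", "TIME", "Value"]


def summarize_cli(df: pd.DataFrame) -> dict:
    """Return basic consistency facts about a CLI-style table (hypothetical helper)."""
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")

    # Unparseable dates become NaT so they can be counted instead of raising.
    time = pd.to_datetime(df["TIME"], errors="coerce")

    # Each LOCATION code should map to exactly one Country name.
    names_per_code = df.groupby("LOCATION")["Country"].nunique()

    return {
        "n_rows": len(df),
        "n_countries": df["Country"].nunique(),
        "time_min": time.min(),
        "time_max": time.max(),
        "unparseable_times": int(time.isna().sum()),
        "ambiguous_codes": names_per_code[names_per_code > 1].index.tolist(),
    }


# Made-up rows, including one deliberately broken date.
sample = pd.DataFrame({
    "LOCATION": ["AUS", "AUS", "AUT"],
    "Country": ["Australia", "Australia", "Austria"],
    "SUBJECT": ["LOLITOAA", "LOLITOAA", "LOLITOAA"],
    "MEASURE": ["STSA", "STSA", "STSA"],
    "TIME": ["2016-01", "2016-02", "not-a-date"],
    "Value": ["100.1", "100.3", "99.8"],
})
report = summarize_cli(sample)
print(report)
```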


In [9]:
# Step 2 – Load CLI Data
import pandas as pd
import os

# If you ever hit FileNotFoundError, uncomment the next line to see where the notebook is running:
# print("CWD:", os.getcwd())

FILE_CLI = "../data/cleaned_oecd_cli.csv"   # path is from /notebooks/ to /data/

usecols = ['LOCATION', 'Country', 'SUBJECT', 'MEASURE', 'TIME', 'Value']

# Read as strings first to avoid mixed-type warnings, then coerce
cli = pd.read_csv(
    FILE_CLI,
    usecols=usecols,
    dtype={c: 'string' for c in usecols},
    low_memory=False
)

# --- Type fixes ---
# TIME -> datetime
cli['TIME_PARSED'] = pd.to_datetime(cli['TIME'], errors='coerce')

# Value -> numeric (remove NBSP and commas just in case)
cli['Value'] = (
    cli['Value']
      .astype(str)
      .str.replace('\u00a0', '', regex=False)
      .str.replace(',', '', regex=False)
)
cli['Value'] = pd.to_numeric(cli['Value'], errors='coerce')

# Drop rows that cannot be used
cli = cli.dropna(subset=['TIME_PARSED', 'Value'])

# Optional guard: keep only index/seasonally adjusted measures if present
keep_measures = {'STSA', 'ST', 'IXOB', 'IXOBSA', 'GYSA'}
if 'MEASURE' in cli.columns:
    before = len(cli)
    tmp = cli[cli['MEASURE'].isin(keep_measures)]
    # Only apply if it does not wipe the table (robustness)
    if len(tmp) > 0:
        cli = tmp
        print(f"Applied MEASURE filter {keep_measures}: {before:,} -> {len(cli):,} rows")
    else:
        print("Skipped MEASURE filter to avoid empty result.")

# Sort chronologically per country
cli = cli.sort_values(['Country', 'TIME_PARSED']).reset_index(drop=True)

# Quick checks
print(f"[OK] CLI loaded: {len(cli):,} rows | Countries: {cli['Country'].nunique()}")
print("Time range:", cli['TIME_PARSED'].min().date(), "→", cli['TIME_PARSED'].max().date())
display(cli.head(3))


Applied MEASURE filter {'IXOBSA', 'GYSA', 'ST', 'IXOB', 'STSA'}: 36,609 -> 36,609 rows
[OK] CLI loaded: 36,609 rows | Countries: 48
Time range: 2016-01-01 → 2020-02-01


| | LOCATION | Country | SUBJECT | MEASURE | TIME | Value | TIME_PARSED |
|---|---|---|---|---|---|---|---|
| 0 | AUS | Australia | LOCOBDNO | STSA | 2016-01 | 100.252409 | 2016-01-01 |
| 1 | AUS | Australia | LOCOBENO | STSA | 2016-01 | 100.357863 | 2016-01-01 |
| 2 | AUS | Australia | LOCOBPNO | STSA | 2016-01 | 99.914066 | 2016-01-01 |


### Step 3 – Compute CLI Growth

We calculate **<u>monthly percentage changes</u>** in CLI for each country to estimate short-term growth momentum.
This shows how each country's economic activity speeds up or slows down over time, and lets us analyze how the indicator's <u>level</u> and <u>growth rate</u> interact.

**Steps:**
1. Group data by **Country**.
2. Use `.pct_change()` on **Value** to get month-over-month growth.
3. Multiply by 100 to convert to percentage.
4. Drop the missing value that `.pct_change()` produces for the first entry of each country.

**Result:**
A new column, `CLI_Growth` (in percent), is created for later correlation analysis.
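
The steps above amount to the usual month-over-month formula, growth_t = 100 × (CLI_t − CLI_{t−1}) / CLI_{t−1}, applied within each country. Here is a minimal sketch on made-up numbers: the frame `demo` and its values are hypothetical, and only the column names `Country`, `Value`, and `CLI_Growth` come from the notebook.

```python
import pandas as pd

# Hypothetical data: country X has three months, country Y has two.
demo = pd.DataFrame({
    "Country": ["X", "X", "X", "Y", "Y"],
    "Value": [100.0, 102.0, 99.96, 50.0, 55.0],
})

# Steps 1-3: group by country, take the month-over-month change, scale to percent.
demo["CLI_Growth"] = (
    demo.groupby("Country")["Value"]
        .transform(lambda s: s.pct_change(fill_method=None) * 100)
)

# Step 4: the first row of each country has no predecessor, so its growth is NaN.
demo_clean = demo.dropna(subset=["CLI_Growth"])
print(demo_clean)
```

In this toy table, X rises by about 2% and then falls by about 2%, while Y rises by about 10%. Two of the five rows are dropped because they are the first observation of their country.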


In [17]:
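# Required imports
import numpy as np
import pandas as pd

# Ensure chronological order and a numeric Value column
cli = cli.sort_values(['Country', 'TIME_PARSED']).reset_index(drop=True)
cli['Value'] = pd.to_numeric(cli['Value'], errors='coerce')

# Optional: mask near-zero values so they cannot blow up as denominators
EPS = 1e-9
cli['Value'] = cli['Value'].mask(cli['Value'].abs() < EPS, np.nan)

# Month-over-month growth (%); fill_method=None disables padding and avoids a FutureWarning
# Note: grouping by Country alone takes changes between consecutive rows,
# which can belong to different SUBJECT/MEASURE series within the same month
cli['CLI_Growth'] = (
    cli.groupby('Country', group_keys=False)['Value']
       .apply(lambda s: s.pct_change(fill_method=None) * 100)
)

# Clean up invalid results
cli['CLI_Growth'] = cli['CLI_Growth'].replace([np.inf, -np.inf], np.nan)
cli = cli.dropna(subset=['CLI_Growth'])

# Quick check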
print(f"[OK] CLI growth computed: {len(cli)} rows | Countries: {cli['Country'].nunique()}")
display(cli.head(5))
display(cli['CLI_Growth'].describe().to_frame().T)


[OK] CLI growth computed: 36248 rows | Countries: 48


| | LOCATION | Country | SUBJECT | MEASURE | TIME | Value | TIME_PARSED | CLI_Growth |
|---|---|---|---|---|---|---|---|---|
| 1 | AUS | Australia | LOCOLTOR | ST | 2016-01 | 2.73 | 2016-01-01 | -97.253374 |
| 2 | AUS | Australia | LOCOSPNO | STSA | 2016-01 | 99.187418 | 2016-01-01 | 3533.238756 |
| 3 | AUS | Australia | LOCOSPOR | IXOB | 2016-01 | 90.901836 | 2016-01-01 | -8.353461 |
| 4 | AUS | Australia | LOCOTTNO | STSA | 2016-01 | 97.897749 | 2016-01-01 | 7.696119 |
| 5 | AUS | Australia | LOLITOAA | STSA | 2016-01 | 99.852341 | 2016-01-01 | 1.996565 |


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLI_Growth | 36248.0 | 44179.326775 | 1945004.0 | -244603000.0 | -84.458021 | -0.017058 | 8.159621 | 171085700.0 |


In [18]:
import numpy as np
import pandas as pd

# Keep only the core measure (STSA) to avoid mixing measures
cli_filtered = cli[cli['MEASURE'] == 'STSA'].copy()

# Ensure values are numeric and within a plausible index range
cli_filtered['Value'] = pd.to_numeric(cli_filtered['Value'], errors='coerce')
cli_filtered = cli_filtered[(cli_filtered['Value'] > 10) & (cli_filtered['Value'] < 200)]

# Sort by country, then date
cli_filtered = cli_filtered.sort_values(['Country', 'TIME_PARSED']).reset_index(drop=True)

# Month-over-month growth (%)
# Note: rows are still grouped by Country only, so several SUBJECT series share each group
cli_filtered['CLI_Growth'] = (
    cli_filtered.groupby('Country', group_keys=False)['Value']
        .apply(lambda s: s.pct_change(fill_method=None) * 100)
        .replace([np.inf, -np.inf], np.nan)
)

# Drop missing values
cli_filtered = cli_filtered.dropna(subset=['CLI_Growth'])

# Check the result
print(f"[OK] CLI growth computed: {len(cli_filtered)} rows | Countries: {cli_filtered['Country'].nunique()}")
display(cli_filtered.head(5))
display(cli_filtered['CLI_Growth'].describe().to_frame().T)


[OK] CLI growth computed: 25556 rows | Countries: 48


| | LOCATION | Country | SUBJECT | MEASURE | TIME | Value | TIME_PARSED | CLI_Growth |
|---|---|---|---|---|---|---|---|---|
| 1 | AUS | Australia | LOCOTTNO | STSA | 2016-01 | 97.897749 | 2016-01-01 | -1.300235 |
| 2 | AUS | Australia | LOLITOAA | STSA | 2016-01 | 99.852341 | 2016-01-01 | 1.996565 |
| 3 | AUS | Australia | LOLITONO | STSA | 2016-01 | 99.802972 | 2016-01-01 | -0.049442 |
| 4 | AUS | Australia | LOLITOTR | STSA | 2016-01 | 101.435976 | 2016-01-01 | 1.636228 |
| 5 | AUS | Australia | LORSGPNO | STSA | 2016-01 | 100.028818 | 2016-01-01 | -1.387237 |


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLI_Growth | 25556.0 | 16.658907 | 106.337679 | -90.64162 | -1.042312 | 0.004131 | 1.209905 | 908.391297 |
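
Even after the STSA filter, the preview above lists several `SUBJECT` codes for Australia in the same month (2016-01). Grouping by `Country` alone therefore compares one series' value with a different series' value in the preceding row. One possible refinement, offered as a sketch and not as the notebook's method, is to group by both `Country` and `SUBJECT` so that each series is compared with its own previous month. The function name `growth_by_series` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def growth_by_series(
    df: pd.DataFrame,
    keys=("Country", "SUBJECT"),
    time_col: str = "TIME_PARSED",
    value_col: str = "Value",
) -> pd.DataFrame:
    """Month-over-month % change computed separately for each (Country, SUBJECT) series."""
    # Sort so that consecutive rows within a group are consecutive months.
    out = df.sort_values([*keys, time_col]).reset_index(drop=True)
    out["CLI_Growth"] = (
        out.groupby(list(keys))[value_col]
           .transform(lambda s: s.pct_change(fill_method=None) * 100)
           .replace([np.inf, -np.inf], np.nan)
    )
    # Drop the first observation of every series, which has no predecessor.
    return out.dropna(subset=["CLI_Growth"])


# Made-up values: two series for one country over two months.
series_toy = pd.DataFrame({
    "Country": ["Australia"] * 4,
    "SUBJECT": ["LOLITOAA", "LOLITONO", "LOLITOAA", "LOLITONO"],
    "TIME_PARSED": pd.to_datetime(
        ["2016-01-01", "2016-01-01", "2016-02-01", "2016-02-01"]
    ),
    "Value": [100.0, 50.0, 101.0, 49.0],
})
result = growth_by_series(series_toy)
print(result[["Country", "SUBJECT", "TIME_PARSED", "CLI_Growth"]])
```

With this grouping, each growth figure reflects a single series moving from one month to the next. Whether the analysis should then keep one `SUBJECT` per country or aggregate several is a separate design decision.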
