## 1. Purpose

This section addresses **RQ3: Is there a significant correlation between the level and growth of the Composite Leading Indicator (CLI)?**

Because GDP data are outside the scope of this project, we approximate **economic growth** using the **month-over-month percentage change in CLI values** for each country.
This lets us assess whether short-term changes in the CLI (its “momentum”) move together with the overall CLI level across countries.

We will:

1. Compute **monthly CLI growth rates** (as percentage change).
2. Merge CLI level and growth data per country.
3. Calculate the **correlation** between CLI level and CLI growth for each country, so the results can be compared **across countries**.
4. Visualize the results using **heatmaps** and **scatterplots**.
5. Interpret which economies show strong or weak coherence between CLI level and movement.
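
Steps 1 to 3 can be sketched on a small synthetic table. This is a minimal illustration under stated assumptions, not the notebook's final pipeline: the frame `toy`, its two countries `A` and `B`, and the column name `cli_growth_pct` are made up for the example, and it assumes a single CLI series per country.

```python
import pandas as pd

# Hypothetical toy data: one monthly CLI series per country (values are made up)
toy = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "TIME_PARSED": list(pd.date_range("2016-01-01", periods=5, freq="MS")) * 2,
    "Value": [100.0, 101.0, 102.0, 103.0, 104.0,
              100.0, 99.0, 100.0, 99.0, 100.0],
})

# Growth must be computed in chronological order within each country
toy = toy.sort_values(["Country", "TIME_PARSED"])

# Step 1: month-over-month growth in percent, computed per country
toy["cli_growth_pct"] = toy.groupby("Country")["Value"].pct_change() * 100

# Steps 2-3: level and growth now sit on the same row, so correlate them per country
level_growth_corr = (
    toy.dropna(subset=["cli_growth_pct"])          # first month of each country has no growth
       .groupby("Country")[["Value", "cli_growth_pct"]]
       .apply(lambda g: g["Value"].corr(g["cli_growth_pct"]))
       .rename("corr_level_growth")
)
print(level_growth_corr)
```

The first observation of each country has no previous month, so its growth is missing and is dropped before the correlation is computed.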



## 2. Load CLI and GDP Data

This step loads the **<u>Composite Leading Indicator (CLI)</u>** dataset and the **<u>GDP growth</u>** dataset for correlation analysis.

- **CLI Data:** Cleaned monthly dataset (`df_cli_ready`) from the previous notebook.
- **GDP Growth Data:** Quarterly percentage change in real GDP from the OECD database.
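- **Goal:** Merge both datasets by **<u>Country</u>** and **<u>Date</u>** to create a single table for correlation computation. Because the CLI is monthly and GDP growth is quarterly, the dates must first be brought to a common frequency.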

We will inspect both datasets to confirm consistent variable names, time coverage, and country codes before merging.
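
One possible way to run these checks and the merge is sketched below. Everything in it is an assumption for illustration: the frames `cli_m` and `gdp_q`, the column names `quarter`, `gdp_growth_pct`, and `cli_level`, the values, and the choice to average the three months of each quarter (taking the last month of the quarter would be an equally reasonable alternative).

```python
import pandas as pd

# Hypothetical monthly CLI table (one value per country and month)
cli_m = pd.DataFrame({
    "LOCATION": ["AUS"] * 6,
    "TIME_PARSED": pd.date_range("2016-01-01", periods=6, freq="MS"),
    "Value": [100.0, 100.2, 100.4, 100.6, 100.8, 101.0],
})

# Hypothetical quarterly GDP growth table; column names are assumptions
gdp_q = pd.DataFrame({
    "LOCATION": ["AUS", "AUS"],
    "quarter": ["2016Q1", "2016Q2"],
    "gdp_growth_pct": [0.5, 0.7],
})

# Check before merging: country codes present in one table but not the other
only_cli = set(cli_m["LOCATION"]) - set(gdp_q["LOCATION"])
only_gdp = set(gdp_q["LOCATION"]) - set(cli_m["LOCATION"])
print("Only in CLI:", only_cli, "| Only in GDP:", only_gdp)

# Bring the CLI to quarterly frequency: label each month with its quarter,
# then average the months within each country-quarter
cli_m["quarter"] = cli_m["TIME_PARSED"].dt.to_period("Q").astype(str)
cli_q = (
    cli_m.groupby(["LOCATION", "quarter"], as_index=False)["Value"]
         .mean()
         .rename(columns={"Value": "cli_level"})
)

# Inner merge keeps only country-quarters present in both tables;
# validate raises if either side has duplicate keys
merged = cli_q.merge(
    gdp_q, on=["LOCATION", "quarter"], how="inner", validate="one_to_one"
)
print(merged)
```

Using `validate="one_to_one"` makes the merge fail loudly if a country-quarter appears more than once on either side, which is easier to debug than silently duplicated rows.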


In [9]:
# Step 2 – Load CLI Data
import pandas as pd
import os

# If you ever hit FileNotFoundError, uncomment the next line to see where the notebook is running:
# print("CWD:", os.getcwd())

FILE_CLI = "../data/cleaned_oecd_cli.csv"   # path is from /notebooks/ to /data/

usecols = ['LOCATION', 'Country', 'SUBJECT', 'MEASURE', 'TIME', 'Value']

# Read as strings first to avoid mixed-type warnings, then coerce
cli = pd.read_csv(
    FILE_CLI,
    usecols=usecols,
    dtype={c: 'string' for c in usecols},
    low_memory=False
)

# --- Type fixes ---
# TIME -> datetime
cli['TIME_PARSED'] = pd.to_datetime(cli['TIME'], errors='coerce')

# Value -> numeric (remove NBSP and commas just in case)
cli['Value'] = (
    cli['Value']
      .astype(str)
      .str.replace('\u00a0', '', regex=False)
      .str.replace(',', '', regex=False)
)
cli['Value'] = pd.to_numeric(cli['Value'], errors='coerce')

# Drop rows that cannot be used
cli = cli.dropna(subset=['TIME_PARSED', 'Value'])

# Optional guard: keep only index/seasonally adjusted measures if present
keep_measures = {'STSA', 'ST', 'IXOB', 'IXOBSA', 'GYSA'}
if 'MEASURE' in cli.columns:
    before = len(cli)
    tmp = cli[cli['MEASURE'].isin(keep_measures)]
    # Only apply if it does not wipe the table (robustness)
    if len(tmp) > 0:
        cli = tmp
        print(f"Applied MEASURE filter {keep_measures}: {before:,} -> {len(cli):,} rows")
    else:
        print("Skipped MEASURE filter to avoid empty result.")

# Sort chronologically per country
cli = cli.sort_values(['Country', 'TIME_PARSED']).reset_index(drop=True)

# Quick checks
print(f"[OK] CLI loaded: {len(cli):,} rows | Countries: {cli['Country'].nunique()}")
print("Time range:", cli['TIME_PARSED'].min().date(), "→", cli['TIME_PARSED'].max().date())
display(cli.head(3))


Applied MEASURE filter {'IXOBSA', 'GYSA', 'ST', 'IXOB', 'STSA'}: 36,609 -> 36,609 rows
[OK] CLI loaded: 36,609 rows | Countries: 48
Time range: 2016-01-01 → 2020-02-01


| | LOCATION | Country | SUBJECT | MEASURE | TIME | Value | TIME_PARSED |
|---|---|---|---|---|---|---|---|
| 0 | AUS | Australia | LOCOBDNO | STSA | 2016-01 | 100.252409 | 2016-01-01 |
| 1 | AUS | Australia | LOCOBENO | STSA | 2016-01 | 100.357863 | 2016-01-01 |
| 2 | AUS | Australia | LOCOBPNO | STSA | 2016-01 | 99.914066 | 2016-01-01 |
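
The preview shows three different `SUBJECT` codes for the same country and month, so the table stacks several series. Any growth rate should therefore be computed within each series rather than per country alone. The sketch below illustrates this on a toy frame: its first three rows copy the preview, while the 2016-02 values and the column name `growth_pct` are made up for the example.

```python
import pandas as pd

# Toy frame: 2016-01 rows mirror the preview; 2016-02 values are hypothetical
sample = pd.DataFrame({
    "Country": ["Australia"] * 6,
    "SUBJECT": ["LOCOBDNO", "LOCOBENO", "LOCOBPNO"] * 2,
    "TIME_PARSED": pd.to_datetime(["2016-01-01"] * 3 + ["2016-02-01"] * 3),
    "Value": [100.252409, 100.357863, 99.914066, 100.30, 100.40, 99.95],
})

# Rows per country-month: a count above 1 means several series are stacked
rows_per_month = sample.groupby(["Country", "TIME_PARSED"]).size()
print(rows_per_month)

# Compute growth within each (Country, SUBJECT) series so that one series
# is never differenced against another
sample = sample.sort_values(["Country", "SUBJECT", "TIME_PARSED"])
sample["growth_pct"] = (
    sample.groupby(["Country", "SUBJECT"])["Value"].pct_change() * 100
)
print(sample)
```

If only one series is needed for the analysis, filtering on `SUBJECT` before computing growth achieves the same separation.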
