# PIB Filtering and Trend-Cycle Decomposition
---

This notebook scans the `../data/raw/` directory for the **latest available GDP series** from IBGE (the series can be downloaded with the `IBGE.ipynb` notebook).

It automatically detects CSV files with filenames following the standard naming pattern, and for each unique series (identified by table and variable), it keeps only the most recent file.
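
The "keep only the most recent file per series" step can be sketched as follows. This is a minimal illustration, not the project's implementation: the filename pattern `ibge_t<table>_v<variable>_<timestamp>.csv` and the helper `latest_file_per_series` are assumptions made up for this example.

```python
import re
from pathlib import Path

# Hypothetical naming pattern (an assumption, not the project's actual one):
#   ibge_t<table>_v<variable>_<YYYYMMDD>_<HHMMSS>.csv
PATTERN = re.compile(
    r"^ibge_t(?P<table>\d+)_v(?P<variable>\d+)_(?P<stamp>\d{8}_\d{6})\.csv$"
)

def latest_file_per_series(filenames):
    """Return {(table, variable): filename}, keeping the newest timestamp."""
    latest = {}
    for name in filenames:
        match = PATTERN.match(Path(name).name)
        if match is None:
            continue  # ignore files that do not follow the pattern
        key = (match["table"], match["variable"])
        stamp = match["stamp"]
        # Fixed-width timestamps sort chronologically as plain strings
        if key not in latest or stamp > latest[key][0]:
            latest[key] = (stamp, name)
    return {key: name for key, (stamp, name) in latest.items()}

files = [
    "ibge_t1620_v583_20250401_120000.csv",
    "ibge_t1620_v583_20250409_175446.csv",
    "ibge_t1846_v585_20250301_090000.csv",
    "notes.txt",
]
selected = latest_file_per_series(files)
```

Grouping by `(table, variable)` before comparing timestamps is what keeps one file per series rather than one file overall.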

These files are matched with IBGE metadata to present the user with an intuitive interface to select:

- Real or Nominal GDP  
- Seasonally Adjusted or Non-Adjusted  
- Quarterly or Annual Frequency  

Once a series is selected, the notebook loads and processes the data. The following transformations and filters are applied:

- **Natural Log** — computed using **NumPy**
- **First Difference** — computed using **pandas**
- **Percentage Change** — computed using **pandas**
- **Hodrick-Prescott Filter** — using **statsmodels**
- **Baxter-King Filter** — using **statsmodels**
- **Christiano-Fitzgerald Filter** — using **statsmodels**

All of these operations rely on well-established Python libraries for time series and econometric analysis (NumPy, pandas, and statsmodels).

Finally, the notebook provides an interactive plotting interface so you can visually explore trends, cycles, and transformations of the GDP series with ease.

This environment is ideal for filtering, comparing smoothing methods, and preparing data for macroeconomic analysis and visualization.


## Notebook Setup and Dependencies Loading
---

Run the cell below to load the dependencies and start the logging session; the IBGE metadata is loaded in the next section.

In [1]:
# Importing external libraries and functions
import os
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import ipywidgets as widgets
import matplotlib.pyplot as plt

from datetime import datetime
from IPython.display import display, clear_output

# Add the 'src' folder to the Python path so project-specific modules can be imported
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "src")))

# Import project-specific functions
from logger import start_logger
from ibge import load_ibge_series_metadata
from utils import compute_file_hash
from ui import file_explorer, raw_cleanup_widget, plot_columns_selector

# Enable automatic reloading of modules when their source code changes
%reload_ext autoreload
%autoreload 2

# Define Session ID
session_type = "Filtering"
session_ID = datetime.now().strftime("%Y%m%d_%H%M%S")

# Setup Logging
log_file_name = f"../logs/{session_type}_{session_ID}.log"
logger_name = "root"
logger = start_logger(logger_name, log_file_name)

raw_cleanup_widget()

2025-04-09 17:54:46,184 - INFO - Logger started. File path: ../logs/Filtering_20250409_175446.log



## Select GDP data from available series
---

In [2]:
# Load metadata
df_ibge_series_metadata = load_ibge_series_metadata()
GDP_file_explorer_refs = file_explorer(df_ibge_series_metadata )

2025-04-09 17:54:49,225 - INFO - Loaded IBGE Metadata from file: ../data/metadata/ibge_series.json



## Filter data
---

In [46]:
# Function to get data from Widget selection
def get_data(selected_filename): 
    # Load the DataFrame
    df = pd.read_csv(selected_filename)

    # Convert columns if present
    if "data" in df.columns:
        df["data"] = pd.to_datetime(df["data"], errors="coerce")

    if "valor" in df.columns:
        df["valor"] = pd.to_numeric(df["valor"], errors="coerce")
    return df

# Get data according to the widget selection
df = get_data(GDP_file_explorer_refs["get_selected_file"]())
df.rename(columns={"valor": "gdp"}, inplace=True)

# Create log of GDP (non-positive values become NaN)
df["log_gdp"] = df["gdp"].apply(lambda x: np.log(x) if x > 0 else np.nan)

#--------------------------
# Detrending

# Independent variable (x): time
x = df['data'].apply(lambda d: d.toordinal())
x = sm.add_constant(x)  # Adds intercept term

# Dependent variable (y): value
y = df['log_gdp']

# Fit model
model = sm.OLS(y, x).fit()

# Get all coefficients
coefficients = model.params

# Add predicted values (trend) to the DataFrame
df['detrending_trend'] = model.predict(x)

# Calculate the cycle (residual)
df['detrending_cycle'] = df['log_gdp'] - df['detrending_trend']

# Print summary
#print(model.summary())
#Trend = (coefficients['const'] + coefficients['data']*x['data'])
#Cycle = df['log_gdp'] - (coefficients['const'] + coefficients['data']*x['data'])

#--------------------------
# Create first difference of log_gdp
df["fdiff_cycle"] = df["log_gdp"].diff() - df["log_gdp"].diff().mean()
df["fdiff_trend"] = df["log_gdp"] - df["fdiff_cycle"]

#--------------------------
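# HP Filter (lamb=1600 is the conventional value for quarterly data)
df["hp_cycle"], df["hp_trend"] = sm.tsa.filters.hpfilter(df["log_gdp"], lamb=1600)

#--------------------------
# BK Filter (band-pass 6-32 periods; K=12 drops 12 observations at each end)
df["bk_cycle"] = sm.tsa.filters.bkfilter(df["log_gdp"], low=6, high=32, K=12)
df["bk_trend"] = df["log_gdp"] - df["bk_cycle"]

#--------------------------
# CF Filter (band-pass 6-32 periods, no drift adjustment); columns keep the "ck_" prefix
df["ck_cycle"], df["ck_trend"] = sm.tsa.filters.cffilter(df["log_gdp"], low=6, high=32, drift=False)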



### Plot filtered GDP data
---

In [47]:
# Create the widget to select columns (except 'data')
column_selector = widgets.SelectMultiple(
    options=[col for col in df.columns if col != "data"],
    description="Y Columns:",
    layout=widgets.Layout(width="400px", height="200px")
)

# Output area for the plot
plot_output = widgets.Output()

# Function to update the plot
def update_plot(change):
    with plot_output:
        clear_output()
        selected = list(column_selector.value)

        if not selected:
            print("Select at least one column to plot.")
            return

        # Plot
        sns.set_theme()
        sns.set_context("notebook")
        plt.figure(figsize=(12, 6))

        for col in selected:
            plt.plot(df["data"], df[col], label=col)

        plt.xlabel("Date")
        plt.ylabel("Value")
        plt.title("Selected Columns Over Time")
        plt.legend()
        sns.despine()
        plt.tight_layout()
        plt.show()

# Connect widget to function
column_selector.observe(update_plot, names="value")

# Display UI
display(widgets.HTML("<b>Select columns to plot (X axis is always 'data'):</b> Use CTRL or CMD to select multiple rows"))
display(
    widgets.HBox([
        column_selector,
        plot_output
    ])
)

# Initial plot
update_plot({"new": column_selector.value})



## Select Inflation Data
---

In [48]:
# Load metadata
IPCA_file_explorer_refs = file_explorer(df_ibge_series_metadata )


In [49]:
# Load selected IPCA file from the file explorer widget
dfa = get_data(IPCA_file_explorer_refs["get_selected_file"]())

# Convert monthly percent change to a gross growth factor (e.g. 0.5% -> 1.005) for compounding
dfa["decimal"] = 1 + dfa["valor"] / 100

# Set date as index and resample to quarterly using compounded product
dfa.set_index("data", inplace=True)
dfa = dfa.resample("QE").prod()  # 'QE' = quarter end

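# Add one day so each quarter-end stamp becomes the first day of the following quarter
# (e.g. 2024-03-31 -> 2024-04-01); verify this matches the GDP file's date convention before merging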
dfa = dfa.reset_index()[["data", "decimal"]]
dfa["data"] = dfa["data"] + pd.Timedelta(days=1)

# Convert the compounded factor back to a quarterly rate (decimal fraction, not percent) and drop the intermediate column
dfa["pi"] = (dfa["decimal"] - 1)
dfa = dfa[["data", "pi"]]

# Compute required leads and lags
dfa["pi_lead1"] = dfa["pi"].shift(-1)
dfa["pi_lead2"] = dfa["pi"].shift(-2)
dfa["pi_lead3"] = dfa["pi"].shift(-3)
dfa["pi_lead4"] = dfa["pi"].shift(-4)
dfa["pi_t"]   = dfa["pi"]
dfa["pi_lag1"] = dfa["pi"].shift(1)
dfa["pi_lag2"] = dfa["pi"].shift(2)
dfa["pi_lag3"] = dfa["pi"].shift(3)
dfa["pi_lag4"] = dfa["pi"].shift(4)

# Parameters
a1l = 0.24
a1i = 0.38
a4 = 0.12

# Apply formula
dfa["GDP_gap_calc"] = (1/a4)*(
    dfa["pi_t"] 
    - a1l * dfa["pi_lag1"] 
    - (a1i / 4) * (dfa["pi_lag1"] + dfa["pi_lag2"] + dfa["pi_lag3"] + dfa["pi_lag4"])
    - ((1-a1l-a1i)/4)*(dfa["pi_lead1"]+dfa["pi_lead2"]+ dfa["pi_lead3"]+ dfa["pi_lead4"])
    )

# Inner join on the date column: only quarters present in both frames are kept
dfb = dfa.merge(df, on="data")

# Select columns ending with "_cycle"
cycle_columns = [col for col in dfb.columns if col.endswith("_cycle")]

# Drop all rows with any NaNs
dfc = dfb.dropna()

# Compute MSE for each cycle column
mse_results = {
    col: np.mean((dfc["GDP_gap_calc"] - dfc[col]) ** 2)
    for col in cycle_columns
}

# Convert to DataFrame and display
df_mse = pd.DataFrame.from_dict(mse_results, orient="index", columns=["MSE"])
df_mse = df_mse.sort_values("MSE")

display(df_mse)

| | MSE |
|---|---|
| hp_cycle | 0.005388 |
| ck_cycle | 0.005564 |
| bk_cycle | 0.005585 |
| fdiff_cycle | 0.005879 |
| detrending_cycle | 0.009767 |
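
The monthly-to-quarterly compounding used in the cell above can be checked by hand on a few made-up numbers. The sketch below is an alternative formulation, not the notebook's code: it resamples with the `"QS"` (quarter start) alias, so each quarter is labelled with its own first day and no date shift is needed. Whether that labelling or the notebook's quarter-end-plus-one-day convention lines up with the GDP dates depends on how the GDP file is dated, which is not shown here.

```python
import pandas as pd

# Illustrative monthly inflation rates in percent (made-up numbers, not IPCA data)
monthly = pd.DataFrame({
    "data": pd.date_range("2024-01-01", periods=6, freq="MS"),
    "valor": [0.5, 0.8, 0.2, 0.4, 0.5, 0.2],
})

# Gross monthly growth factor, e.g. 0.5 % -> 1.005
monthly["decimal"] = 1 + monthly["valor"] / 100

# Compound the factors within each quarter, then convert back to a rate
quarterly = monthly.set_index("data")["decimal"].resample("QS").prod() - 1

# Hand check for the first quarter: 1.005 * 1.008 * 1.002 - 1
first_quarter_by_hand = 1.005 * 1.008 * 1.002 - 1
```

Multiplying the factors, rather than summing the monthly percentages, is what makes the quarterly figure a compounded rate.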


In [None]:
curve_selection_refs = plot_columns_selector(dfb)


### Optimize Parameters for one reference column
---

In [8]:
from scipy.optimize import minimize
import numpy as np

# Reference column name
reference = "bk_cycle"
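print(f"Reference Column:{reference}")

def compute_gdp_gap(dfa, a1l, a1i, a4):
    # Simplified one-lag / one-lead variant of the four-quarter formula.
    # Both a1l and a1i multiply pi.shift(1), so only their sum is identified here.
    return (1 / a4) * (
        dfa["pi"]
        - a1l * dfa["pi"].shift(1)
        - a1i * dfa["pi"].shift(1)
        - (1 - a1l - a1i) * dfa["pi"].shift(-1)
    )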

def objective(params, dfa, reference_col):
    a1l, a1i, a4 = params
    gdp_calc = compute_gdp_gap(dfa, a1l, a1i, a4)
    diff = gdp_calc - dfa[reference_col]
    return np.nanmean(diff ** 2)

# Work on a copy and drop rows without inflation data
dfa_fit = dfb.copy()
dfa_fit = dfa_fit.dropna(subset=["pi"])

# Initial guess
initial_params = [0.24, 0.38, 0.12]

# Optional: bounds to keep coefficients reasonable
bounds = [(0, 1), (0, 1), (1e-3, 1)]

# Minimize
result = minimize(objective, initial_params, args=(dfa_fit, reference), bounds=bounds)

# Print best-fit parameters
a1l_opt, a1i_opt, a4_opt = result.x
print("Optimized parameters:")
print(f"a1l = {a1l_opt:.4f}")
print(f"a1i = {a1i_opt:.4f}")
print(f"a4  = {a4_opt:.4f}")


Reference Column:bk_cycle
Optimized parameters:
a1l = 0.2501
a1i = 0.3901
a4  = 1.0000
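
In the output above `a4` equals 1.0000, which is the upper limit set in `bounds`. A solution that sits on a bound may mean that the constraint, rather than the data, is determining the value. The sketch below illustrates the effect on a made-up one-parameter problem (the `signal`, `reference`, and `fit` names are hypothetical and unrelated to the notebook's data): the same objective is minimized with a tight and a relaxed upper bound.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
reference = signal / 4.0  # synthetic target: the "true" scale is 4

def objective(params):
    (scale,) = params
    return np.mean((signal / scale - reference) ** 2)

def fit(upper):
    """Minimize with scale restricted to [1e-3, upper]; report if the bound binds."""
    bounds = [(1e-3, upper)]
    result = minimize(objective, x0=[0.5], bounds=bounds)
    at_upper = bool(np.isclose(result.x[0], upper))
    return result.x[0], at_upper

tight_scale, tight_at_bound = fit(upper=1.0)    # optimum lies outside the box
loose_scale, loose_at_bound = fit(upper=10.0)   # optimum lies inside the box
```

Comparing each element of `result.x` with its bounds after every fit is a cheap way to spot estimates that would change if the box were widened.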


### Optimize Parameters for all cycles, with relaxed upper bounds, for a chosen interval
---

In [9]:
from scipy.optimize import minimize
import numpy as np
import pandas as pd

def compute_gdp_gap(dfa, a1l, a1i, a4):
    return (1 / a4) * (
        dfa["pi"] 
        - a1l * dfa["pi"].shift(1)
        - (a1i / 4) * (dfa["pi"].shift(1) + dfa["pi"].shift(2) + dfa["pi"].shift(3) + dfa["pi"].shift(4))
        - ((1 - a1l - a1i) / 4) * (dfa["pi"].shift(-1) + dfa["pi"].shift(-2) + dfa["pi"].shift(-3) + dfa["pi"].shift(-4))
    )

def objective(params, dfa, reference_col):
    a1l, a1i, a4 = params
    gdp_calc = compute_gdp_gap(dfa, a1l, a1i, a4)
    diff = gdp_calc - dfa[reference_col]
    return np.nanmean(diff ** 2)

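# Restrict the fit to a chosen interval (rows 10 to 33 of the merged data)
dfc = dfb.iloc[10:34]

# Collect every *_cycle column available in the selected interval
cycle_cols = [col for col in dfc.columns if col.endswith("_cycle")]
results = []

# Drop rows with any missing value (filter edges, leads and lags)
dfa_base = dfc.dropna().copy()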

for ref_col in cycle_cols:
    # Drop rows where the current reference is missing
    dfa_fit = dfa_base.dropna(subset=[ref_col])
    
    # Initial guess and bounds
    initial_params = [0.24, 0.38, 0.12]
    bounds = [(0, 10), (0, 10), (1e-3, 10)]
    
    # Optimize
    result = minimize(objective, initial_params, args=(dfa_fit, ref_col), bounds=bounds)
    
    # Collect results
    a1l_opt, a1i_opt, a4_opt = result.x
    mse = result.fun
    results.append({
        "reference": ref_col,
        "a1l": a1l_opt,
        "a1i": a1i_opt,
        "a4": a4_opt,
        "mse": mse
    })
    
df_results = pd.DataFrame(results)
df_results = df_results.sort_values("mse")
display(df_results)



| | reference | a1l | a1i | a4 | mse |
|---|---|---|---|---|---|
| 4 | ck_cycle | 0.998076 | 1.221425 | 4.007203 | 6.3e-05 |
| 2 | hp_cycle | 1.076586 | 1.211615 | 3.545839 | 7.5e-05 |
| 3 | bk_cycle | 1.033785 | 1.16585 | 3.550858 | 8.6e-05 |
| 0 | fdiff_cycle | 1.300999 | 0.0 | 4.123444 | 0.000174 |
| 1 | pct_change_cycle | 1.304248 | 0.0 | 4.125177 | 0.000177 |
| 5 | OLS_cycle | 1.606323 | 0.098148 | 3.749304 | 0.001346 |
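
Written out, the four-quarter specification that `compute_gdp_gap` evaluates in the cell above is the expression below. The symbol $\hat{y}_t$ is introduced here only as shorthand for the computed gap (`GDP_gap_calc`); the other symbols mirror the code's `pi`, `a1l`, `a1i`, and `a4`.

```latex
\hat{y}_t = \frac{1}{a_4}\left[
    \pi_t
    - a_{1l}\,\pi_{t-1}
    - \frac{a_{1i}}{4}\sum_{j=1}^{4}\pi_{t-j}
    - \frac{1 - a_{1l} - a_{1i}}{4}\sum_{j=1}^{4}\pi_{t+j}
\right]
```

In every row of the table above, $a_{1l} + a_{1i}$ exceeds 1 (for example 0.998 + 1.221 ≈ 2.22 for `ck_cycle`), so the weight $(1 - a_{1l} - a_{1i})/4$ on the four leads is negative for those fits.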


In [None]:
# Load selected IPCA file from the file explorer widget
dfa = get_data(IPCA_file_explorer_refs["get_selected_file"]())

# Convert monthly percent change to a gross growth factor (e.g. 0.5% -> 1.005) for compounding
dfa["decimal"] = 1 + dfa["valor"] / 100

# Set date as index and resample to quarterly using compounded product
dfa.set_index("data", inplace=True)
dfa = dfa.resample("QE").prod()  # 'QE' = quarter end

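# Add one day so each quarter-end stamp becomes the first day of the following quarter
# (e.g. 2024-03-31 -> 2024-04-01); verify this matches the GDP file's date convention before merging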
dfa = dfa.reset_index()[["data", "decimal"]]
dfa["data"] = dfa["data"] + pd.Timedelta(days=1)

# Convert the compounded factor back to a quarterly rate (decimal fraction, not percent) and drop the intermediate column
dfa["pi"] = (dfa["decimal"] - 1)
dfa = dfa[["data", "pi"]]

# Compute required leads and lags
dfa["pi_lead1"] = dfa["pi"].shift(-1)
dfa["pi_lead2"] = dfa["pi"].shift(-2)
dfa["pi_lead3"] = dfa["pi"].shift(-3)
dfa["pi_lead4"] = dfa["pi"].shift(-4)
dfa["pi_t"]   = dfa["pi"]
dfa["pi_lag1"] = dfa["pi"].shift(1)
dfa["pi_lag2"] = dfa["pi"].shift(2)
dfa["pi_lag3"] = dfa["pi"].shift(3)
dfa["pi_lag4"] = dfa["pi"].shift(4)

# Parameters
a1l = 0.998076
a1i = 1.221425
a4 = 4.007203
		
# Apply formula
dfa["GDP_gap_calc"] = (1/a4)*(
    dfa["pi_t"] 
    - a1l * dfa["pi_lag1"] 
    - (a1i / 4) * (dfa["pi_lag1"] + dfa["pi_lag2"] + dfa["pi_lag3"] + dfa["pi_lag4"])
    - ((1-a1l-a1i)/4)*(dfa["pi_lead1"]+dfa["pi_lead2"]+ dfa["pi_lead3"]+ dfa["pi_lead4"])
    )

# Inner join on the date column: only quarters present in both frames are kept
dfb = dfa.merge(df, on="data")

curve_selection_refs = plot_columns_selector(dfb)


## Compare Forecast Errors
---

In [None]:
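# Get data according to the widget selection and rebuild the log series
df = get_data(GDP_file_explorer_refs["get_selected_file"]())
df.rename(columns={"valor": "gdp"}, inplace=True)
df["log_gdp"] = np.log(df["gdp"].where(df["gdp"] > 0))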

# Set Window Size (i.e. 4*10 = 40 quarters = 10 years of quarterly data)
ws = 4*10

# Set Forecast Size (i.e. 4 = 4 quarters of forecast)
fs = 4

# Last valid window start index (windows start at 0..nw, so there are nw + 1 of them)
nw = len(df)-ws-fs

In [71]:
# Window Counter, from 0 to nw
i = 0

# Get Window Data
dfa = df[i:i+ws]

# Get Data to be forecasted
dfx = df[i+ws:i+ws+fs]

In [None]:
# Independent variable (x): time
x = dfa['data'].apply(lambda d: d.toordinal())
x = sm.add_constant(x)  # Adds intercept term

# Dependent variable (y): value
y = dfa['log_gdp']

# Fit model
model = sm.OLS(y, x).fit()

# Get all coefficients
coefficients = model.params

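# Evaluate the fitted trend over the full sample: values inside the window are fitted,
# values after it are extrapolated forecasts
x_all = sm.add_constant(df['data'].apply(lambda d: d.toordinal()))
df['OLS_trend'] = model.predict(x_all)

# Calculate the cycle (residual)
df['OLS_cycle'] = df['log_gdp'] - df['OLS_trend']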