An R pipeline for cleaning and processing the FRED-MD macroeconomic database. Implements McCracken & Ng (2016, JBES) transformation codes and Bai & Ng (2002, ECTA) factor extraction via a compiled C++ (RcppArmadillo) backend.
FRED-MD is a large monthly macroeconomic database maintained by the Federal Reserve Bank of St. Louis. This pipeline reads the raw CSV, applies the 7-code transformation scheme to each series, removes outliers, and extracts common factors using an EM algorithm with PCA. The output is a balanced panel (fred_md_balanced) ready for downstream modeling.
- R ≥ 4.0
- A C++ compiler (Xcode Command Line Tools on macOS,
build-essentialon Linux)
Install R packages once:
install.packages(c("tidyverse", "lubridate", "janitor", "Rcpp", "RcppArmadillo", "testthat"))Place the raw FRED-MD CSV in Data/ (e.g. Data/FRED_MD_062024.csv), then run:
Rscript load.Rsrc/fred_factors.cpp is compiled automatically on the first run (~10–30 s). Subsequent runs reuse the cached shared library.
All tunable parameters are at the top of load.R:
| Variable | Default | Description |
|---|---|---|
demean |
2 |
Normalization: 1 = demean only, 2 = demean + standardize, 3 = recursive, 0 = none |
jj |
2 |
Information criterion: 1 = PC_p1, 2 = PC_p2, 3 = PC_p3 |
kmax |
8 |
Max factors to consider; set to 99 to force exactly 8 factors |
The pipeline produces fred_md_balanced: a tibble with a date column followed by all imputed series (snake_case names), with any remaining NA rows dropped.
Rscript -e "testthat::test_file('tests/test_fred_factors.R')"Expected: [ FAIL 0 | WARN 0 | SKIP 0 | PASS 23 ]
load.R # Orchestrator
R/prepare-missing.R # Applies 7-code transformations per series
R/remove-outliers.R # Flags |x − median| > 10×IQR as NA
src/fred_factors.cpp # Compiled C++ (RcppArmadillo): EM + PCA kernels
tests/test_fred_factors.R # C++ vs. R reference regression tests
Data/FRED_MD_062024.csv # Raw input (127 monthly series, 1959–2024)
prepare_missing()— reads transformation codes from row 1 of the CSV, appliestransxf()to each series (levels, differences, log-differences, etc.)remove_outliers()— replaces values more than 10×IQR from the median withNAfactors_em_cpp()— compiled EM algorithm: imputes missing values, demeans/standardizes, runs PCA, iterates until convergence (tolerance < 1e-6)
| Function | Description |
|---|---|
factors_em_cpp(x, kmax, jj, DEMEAN) |
Main entry point; hot T×N inner loop; kmax=99 forces 8 factors |
baing_cpp(X, kmax, jj) |
Bai & Ng (2002) information-criterion factor count selection |
pc2_cpp(X, nfac) |
PCA via economy SVD of the gram matrix X'X |
transform_data_cpp(x2, DEMEAN) |
Normalization; returns T0×N mut/std matrices |
minindc_cpp(X) |
Column-wise argmin (1-based row index) |
The factors_em_cpp function expects a pre-stripped numeric matrix (no date column). load.R strips the date before the call and re-attaches it to the output tibble.
- McCracken, M. W., & Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics, 34(4), 574–589.
- Bai, J., & Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1), 191–221.