# Quantlab Data Pipeline Demo

Run the pipeline end to end: ingest raw data, export failures, refetch, transform, run quality checks, clean invalid rows, and sample the final data. Before starting, make sure `credentials.yml` is present at the repo root with Alpha Vantage and WRDS credentials.
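Since every later step depends on `credentials.yml`, it can help to confirm the file is discoverable before running anything. The helper below is a minimal sketch, not part of `quantlab_data_pipeline`: `find_credentials` is a hypothetical name, and it only checks that the file exists, not that its contents are valid.

```python
from pathlib import Path
from typing import Optional


def find_credentials(start: Path, filename: str = "credentials.yml") -> Optional[Path]:
    """Walk upward from `start` and return the first `filename` found, else None.

    Hypothetical helper for illustration; the pipeline's own loader may
    resolve the file differently.
    """
    start = start.resolve()
    for folder in [start, *start.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_credentials(Path.cwd())
    print(found if found else "credentials.yml not found above the working directory")
```

Searching upward means the check behaves the same whether the notebook is launched from the repo root or from a subfolder.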


In [1]:
from datetime import date
import logging
import sys
from pathlib import Path

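# Ensure the repo root is on sys.path for editable/develop installs or direct use.
# __file__ is undefined inside a notebook, so the except branch falls back to
# the parent of the current working directory.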
try:
    root = Path(__file__).resolve().parents[1]
except NameError:
    root = Path.cwd().resolve().parents[0]
if str(root) not in sys.path:
    sys.path.insert(0, str(root))

from quantlab_data_pipeline import run_ingestion, transform_raw_to_final, run_quality_checks, get_final_data
from quantlab_data_pipeline.failure_utils import export_all_failures, refetch_failures, clean_final_invalid_calls
from quantlab_data_pipeline.logging_utils import configure_logging

configure_logging(level=logging.INFO)


## 1) Ingest raw data

Fetch Alpha Vantage and WRDS data into `../data/data-raw/`. Pass `resume=False` to force a refetch; adjust `sleep_seconds` and `use_paid_key` as needed.


In [None]:
# run_ingestion(sleep_seconds=1.0, use_paid_key=True, resume=False)


## 2) Export failures (invalid API responses)

Scan the raw Parquet files for `Invalid API call` responses and write the results to `../data/final/failures_all.csv`. Run this after ingestion.


In [None]:
fail_path = export_all_failures()
fail_path
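To make the scan concrete, here is a minimal sketch of how rows containing the `Invalid API call` marker could be flagged in a single DataFrame. It is not the implementation behind `export_all_failures`; the function name `invalid_row_mask` and the sample column names are assumptions for illustration.

```python
import pandas as pd

INVALID_MARKER = "Invalid API call"  # marker string named in the step above


def invalid_row_mask(df: pd.DataFrame, marker: str = INVALID_MARKER) -> pd.Series:
    """Return a boolean Series, True where any cell in the row contains `marker`.

    Hypothetical helper: every column is cast to text and searched with a
    literal (non-regex) substring match; missing values count as no match.
    """
    as_text = df.astype(str)
    hits = as_text.apply(lambda col: col.str.contains(marker, regex=False, na=False))
    return hits.any(axis=1)


if __name__ == "__main__":
    # Made-up sample frame; the real raw files may use different columns.
    raw = pd.DataFrame(
        {
            "ticker": ["AAPL", "MSFT", "IBM"],
            "Information": [None, "Invalid API call. Please retry.", None],
            "close": [1.0, 2.0, 3.0],
        }
    )
    print(raw[invalid_row_mask(raw)])
```

Searching every column rather than one named column is a deliberate choice here, since the text does not say which field carries the error message.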


## 3) Refetch failures (optional)

Use the exported CSV to refetch the failed endpoints and overwrite the raw Parquet files, then re-export the failures if desired.


In [None]:
# refetch_failures(fail_path, use_paid_key=True, sleep_seconds=1.0)
# fail_path = export_all_failures()
# fail_path


## 4) Transform raw to final

Build the cleaned final Parquet files. The transform drops price rows with a null date and strips the noisy `Information` column from the fundamentals datasets.


In [None]:
outputs = transform_raw_to_final()
outputs
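The two cleaning rules described for this step can be sketched in pandas as follows. This is an illustration under assumptions, not the code inside `transform_raw_to_final`: the helper names are hypothetical, and the date column name `date` is a guess (only the `Information` column is named in the text).

```python
import pandas as pd


def drop_null_dates(prices: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Drop price rows whose date is missing. `date_col` is an assumed name."""
    return prices.dropna(subset=[date_col]).reset_index(drop=True)


def strip_information(fundamentals: pd.DataFrame) -> pd.DataFrame:
    """Remove the `Information` column if present; leave other frames unchanged."""
    return fundamentals.drop(columns=["Information"], errors="ignore")


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    prices = pd.DataFrame(
        {"date": ["2024-01-02", None, "2024-01-03"], "close": [1.0, 2.0, 3.0]}
    )
    print(drop_null_dates(prices))

    fundamentals = pd.DataFrame(
        {"fiscalDateEnding": ["2023-12-31"], "Information": [None]}
    )
    print(strip_information(fundamentals).columns.tolist())
```

Passing `errors="ignore"` keeps the second helper safe to call on datasets that never had an `Information` column.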


## 5) Quality checks

Run missing-value detail and price consistency/bounds checks across all final datasets.


In [4]:
reports = run_quality_checks("all", top_missing=5)
reports


2025-12-05 18:41:36,005 [INFO] quantlab_data_pipeline.quality - Quality checks passed for economic_indicators with no issues detected.
2025-12-05 18:41:36,006 [INFO] quantlab_data_pipeline.quality - Quality checks passed for fundamentals_dividends with no issues detected.
2025-12-05 18:41:36,006 [INFO] quantlab_data_pipeline.quality - Quality checks passed for fundamentals_income_statement with no issues detected.
2025-12-05 18:41:36,006 [INFO] quantlab_data_pipeline.quality - Quality checks passed for fundamentals_splits with no issues detected.
2025-12-05 18:41:36,007 [INFO] quantlab_data_pipeline.quality - Quality checks passed for price_daily with no issues detected.
2025-12-05 18:41:36,007 [INFO] quantlab_data_pipeline.quality - Quality checks passed for price_weekly with no issues detected.


{'company_overview': {'missing': ['Rows with any missing values: 5'],
  'consistency': [],
  'bounds': [],
  'missing_detail': ['Symbol: 5 (0.722%)',
   'PriceToBookRatio: 5 (0.722%)',
   'QuarterlyEarningsGrowthYOY: 5 (0.722%)',
   'QuarterlyRevenueGrowthYOY: 5 (0.722%)',
   'AnalystTargetPrice: 5 (0.722%)']},
 'economic_indicators': {'missing': [],
  'consistency': [],
  'bounds': [],
  'missing_detail': []},
 'fundamentals_balance_sheet': {'missing': ['Rows with any missing values: 18'],
  'consistency': [],
  'bounds': [],
  'missing_detail': ['fiscalDateEnding: 18 (10.286%)',
   'longTermDebtNoncurrent: 18 (10.286%)',
   'deferredRevenue: 18 (10.286%)',
   'currentDebt: 18 (10.286%)',
   'shortTermDebt: 18 (10.286%)']},
 'fundamentals_cash_flow': {'missing': ['Rows with any missing values: 19'],
  'consistency': [],
  'bounds': [],
  'missing_detail': ['fiscalDateEnding: 19 (10.795%)',
   'reportedCurrency: 19 (10.795%)',
   'netIncome: 19 (10.795%)',
   'changeInExchangeRate: 19 
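The `reports` mapping above is keyed by dataset, and each value maps a check name (`missing`, `consistency`, `bounds`, `missing_detail`) to a list of messages. A small helper can reduce it to just the datasets that need attention. This is a sketch based only on the shape visible in the output; `datasets_with_issues` is a hypothetical name, not a pipeline function.

```python
from typing import Dict, List

Report = Dict[str, List[str]]


def datasets_with_issues(reports: Dict[str, Report]) -> Dict[str, List[str]]:
    """Map each dataset to the names of checks that returned at least one message."""
    flagged: Dict[str, List[str]] = {}
    for name, report in reports.items():
        failing = [check for check, messages in report.items() if messages]
        if failing:
            flagged[name] = failing
    return flagged


if __name__ == "__main__":
    # Two entries copied from the output shown above, trimmed for brevity.
    sample = {
        "company_overview": {
            "missing": ["Rows with any missing values: 5"],
            "consistency": [],
            "bounds": [],
            "missing_detail": ["Symbol: 5 (0.722%)"],
        },
        "economic_indicators": {
            "missing": [],
            "consistency": [],
            "bounds": [],
            "missing_detail": [],
        },
    }
    print(datasets_with_issues(sample))
```

In the notebook, `datasets_with_issues(reports)` would give a compact view of the same information without scrolling through the full dictionary.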

## 6) Clean remaining invalid API rows in finals

Drop any rows in the final Parquet files that still contain `Invalid API call`.


In [2]:
removed = clean_final_invalid_calls(dataset="all")
removed


2025-12-05 18:38:43,131 [INFO] quantlab_data_pipeline.failure_utils - Removed 5 invalid-api-call rows from /Users/edl/Documents/dev/quantlab_v2/data/final/company_overview.parquet


{'company_overview': 5}

## 7) Sample final data

Query a subset of tickers/date range to validate the outputs.


In [6]:
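# Examples of pulling final datasets.
# A notebook cell renders only its last expression, so only df_co.head() is
# shown below; wrap the earlier .head() calls in display(...) to render them too.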
# Price sample
df_price = get_final_data(tickers=["AAPL", "MSFT"], start_date=date(2024, 1, 1), end_date=date(2024, 1, 5))
df_price.head()

# Fundamentals sample (income statement)
df_is = get_final_data(dataset="fundamentals_income_statement")
df_is.head()

# Company overview sample
df_co = get_final_data(dataset="company_overview", start_date=date(2024, 1, 1), end_date=date(2024, 1, 5))
df_co.head()


Unnamed: 0,Symbol,AssetType,Name,Description,CIK,Exchange,Currency,Country,Sector,Industry,...,52WeekLow,50DayMovingAverage,200DayMovingAverage,SharesOutstanding,SharesFloat,PercentInsiders,PercentInstitutions,DividendDate,ExDividendDate,ticker
0,CBRE,Common Stock,CBRE Group Inc Class A,"CBRE Group, Inc. is the preeminent global lead...",1138118,NYSE,USD,USA,REAL ESTATE,REAL ESTATE SERVICES,...,108.45,155.87,142.98,297593000,290697000,0.524,97.847,,2024-04-19,CBRE
1,CSR,Common Stock,Centerspace,Centerspace (CSR) is a leading real estate inv...,798359,NYSE,USD,USA,REAL ESTATE,REIT - RESIDENTIAL,...,52.07,61.38,60.76,16703500,16611900,0.55,90.508,2026-01-12,2025-12-30,CSR
2,STI,Common Stock,Solidion Technology Inc.,Solidion Technology Inc. (STI) is a premier fi...,1881551,NASDAQ,USD,USA,INDUSTRIALS,ELECTRICAL EQUIPMENT & PARTS,...,2.94,10.56,6.89,7252700,2191300,70.111,1.058,,,STI
3,LEG,Common Stock,Leggett & Platt Incorporated,"Leggett & Platt Incorporated (LEG), based in C...",58492,NYSE,USD,USA,CONSUMER CYCLICAL,"FURNISHINGS, FIXTURES & APPLIANCES",...,6.41,9.17,9.05,135436000,132405000,1.977,84.704,2026-01-15,2025-12-15,LEG
4,MCO,Common Stock,Moodys Corporation,Moody's Corporation (MCO) is a leading global ...,1059556,NYSE,USD,USA,FINANCIAL SERVICES,FINANCIAL DATA & STOCK EXCHANGES,...,376.55,481.84,482.38,178400000,164752000,13.948,79.262,2025-12-12,2025-11-21,MCO
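For readers who load a final Parquet file directly, the ticker and date-range selection used in this step can be approximated with plain pandas. The sketch below is an assumption-laden illustration, not how `get_final_data` is implemented: `filter_final` is a hypothetical name, the `ticker` column is taken from the output above, and the `date` column name is a guess.

```python
from datetime import date
from typing import Optional, Sequence

import pandas as pd


def filter_final(
    df: pd.DataFrame,
    tickers: Optional[Sequence[str]] = None,
    start_date: Optional[date] = None,
    end_date: Optional[date] = None,
    date_col: str = "date",  # assumed column name
) -> pd.DataFrame:
    """Keep rows matching `tickers` and falling within [start_date, end_date]."""
    out = df
    if tickers is not None:
        out = out[out["ticker"].isin(list(tickers))]
    if start_date is not None:
        out = out[pd.to_datetime(out[date_col]) >= pd.Timestamp(start_date)]
    if end_date is not None:
        out = out[pd.to_datetime(out[date_col]) <= pd.Timestamp(end_date)]
    return out.reset_index(drop=True)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    frame = pd.DataFrame(
        {
            "ticker": ["AAPL", "AAPL", "MSFT", "IBM"],
            "date": ["2024-01-02", "2024-02-01", "2024-01-03", "2024-01-04"],
            "close": [1.0, 2.0, 3.0, 4.0],
        }
    )
    print(filter_final(frame, ["AAPL", "MSFT"], date(2024, 1, 1), date(2024, 1, 5)))
```

Both date bounds are inclusive in this sketch; whether the pipeline treats `end_date` the same way is not stated in the notebook.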


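## 8) Fetch Fama-French factors (WRDS only)

Fetch the Fama-French factors from WRDS for 2000-01-01 through 2025-11-20 and write them to `../data/data-processed/fama_french_factors.parquet`.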


In [6]:
from datetime import date
from quantlab_data_pipeline.wrds_client import fetch_ff_factors
from quantlab_data_pipeline.paths import raw_data_dir
from quantlab_data_pipeline.config_loader import load_credentials

creds = load_credentials()
ff = fetch_ff_factors(
    start=date(2000, 1, 1),
    end=date(2025, 11, 20),
    username=creds['username'],
    password=creds['password'],
)
out_path = (raw_data_dir().parent / 'data-processed' / 'fama_french_factors.parquet')
out_path.parent.mkdir(parents=True, exist_ok=True)
ff.to_parquet(out_path, index=False)
out_path


Loading library list...
Done


PosixPath('/Users/edl/Documents/dev/quantlab_v2/data/data-processed/fama_french_factors.parquet')