# KaxaNuk Data Corrector: error analysis

## 00: Preamble
Data provider issues are a common problem that financial analysts need to address. Data quality translates into model quality, and model quality translates into alpha. The purpose of this challenge is to first understand the errors by analyzing them along with their underlying causes and patterns, in order to propose better-informed solutions for data correction or imputation.
Once an error is identified, an appropriate imputation method should be implemented. This approach explores deterministic, statistical, and machine-learning error detection and imputation. Regardless of the methodology, error correction in financial data should always be accompanied by a forensic analysis of the error log in order to minimize false positives. The forensic analysis in this work was done using a large language model (Claude Opus 4.5) to scrape financial news sites for events that may be related to the observed error logs. Even after the forensic analysis has been examined thoroughly, the decision to flag a false positive should always rest with the practitioner's judgment.

## 01: Provided error log analysis

Financial Modeling Prep's data errors are **not isolated incidents but systemic issues** concentrated in three categories: non-standard securities (preferred shares, warrants, senior notes), companies undergoing corporate actions, and illiquid micro-cap stocks. The errors stem from data architecture that struggles with securities that deviate from standard common stock data structures.

### The "sorted by date" errors

The 74 tickers generating "FundamentalData.rows not correctly sorted by date" errors reveal the same patterns. Nearly **40% are non-common equity securities** (preferred shares, warrants, or senior notes) that have fundamentally different data reporting requirements than common stock. The B. Riley Financial family alone contributes 8 tickers (RILY, RILYG, RILYK, RILYL, RILYN, RILYT, RILYZ, RILYP), spanning common stock, preferred shares, and tradeable senior notes. Federal Agricultural Mortgage (Farmer Mac) adds 8 more (AGM and seven preferred series), while Presidio Property Trust contributes common stock, preferred shares, and warrants.

**Corporate actions create data discontinuities** across this list. At least 12 tickers underwent mergers, acquisitions, or name changes in 2024-2025; the table below lists seven of them:

| Ticker | Event | Date |
|--------|-------|------|
| FARO | Acquired by AMETEK | July 2025 |
| IVAC | Acquired by Seagate | March 2025 |
| SASR | Acquired by Atlantic Union | April 2025 |
| APDN | Rebranded to BNBX | October 2025 |
| MICS | Became RIME | September 2024 |
| ATON | Rebranded AlphaTON Capital | September 2025 |
| NBP | Former I-Mab, now NovaBridge | October 2025 |

Several companies are in financial distress: B. Riley Financial suspended dividends and faces Nasdaq delisting risk after **$435-475M quarterly losses**; Ideanomics (IDEX) filed Chapter 11 bankruptcy in December 2024 following SEC fraud settlements; Staffing 360 Solutions (STAF) was delisted to OTC.

### Market cap distribution

The market cap breakdown exposes another pattern: **approximately 35-40% of affected tickers are micro-cap stocks** (under $300M market capitalization). These include XELB ($8-11M), EVTV ($20M), SOTK ($40M), TPCS ($35M), and DLPN ($50M). Micro-cap stocks typically have less rigorous data reporting, lower analyst coverage, and more frequent data quality issues due to limited institutional oversight.

| Market Cap Category | Percentage of Error Tickers |
|---------------------|---------------------------|
| Micro-cap (<$300M) | ~35-40% |
| Small-cap ($300M-$2B) | ~25-30% |
| Mid-cap ($2B-$10B) | ~20% |
| Large-cap (>$10B) | ~15% |

The large-cap tickers that appear (DOV at $27B, DG at $22-24B, JBL at $18B, RBA at $19.5B) likely experience errors due to corporate actions rather than data quality. RBA (RB Global) completed a major merger with IAA in 2023, and MTZ (MasTec) faced shareholder lawsuits creating reporting complexities.

### Preferred shares dominate the "no data returned" errors

All 12 tickers generating "No data returned by unadjusted market data endpoint" errors are **preferred shares or eliminated share classes**:

- **PEI series** (PEI-PB, PEI-PC, PEI-PD): Pennsylvania REIT preferred shares—company emerged from Chapter 11 bankruptcy in 2020 with restructured capital
- **PSB series** (PSB-PX, PSB-PY, PSB-PZ): PS Business Parks preferred depositary shares. The parent company was acquired by Blackstone for $7.6 billion in 2022, delisting the common stock but potentially leaving preferred shares trading
- **NRZ series** (NRZ-PA, NRZ-PB, NRZ-PC): Rithm Capital (formerly New Residential Investment) fixed-to-floating rate preferreds experiencing LIBOR transition complications
- **STZ-B**: Constellation Brands Class B stock—**eliminated entirely in November 2022** when the Sands family exchanged their super-voting shares for $64.64 cash plus Class A shares
- **PNC-PP**: PNC Financial Series P preferred with complex fixed-to-floating rate structure
- **ALP-PQ**: Appears to be an invalid or delisted ticker

The pattern is unmistakable: FMP's unadjusted market data endpoint cannot handle preferred share structures, depositary shares, or securities that no longer trade but retain historical data.

### Warrants create impossible price relationships

The three tickers with "MarketDataDailyRow low > high" errors are all **SPAC warrants trading at near-zero prices**:

| Ticker | Company | Current Price | Status |
|--------|---------|---------------|--------|
| UWMC-WT | UWM Holdings | ~$0.01 | NYSE delisting proceedings initiated December 19, 2025 |
| BFLY-WT | Butterfly Network | ~$0.02 | Extremely illiquid, ~44K average daily volume |
| ML-WT | MoneyLion | ~$0.26 | Thinly traded, expires September 2026 |

When securities trade at fractions of a penny with minimal volume, bad tick data becomes inevitable. Wide bid-ask spreads, stale quotes, and erroneous trade reports create situations where recorded daily lows can exceed daily highs. UWMC-WT is actively being delisted for "abnormally low selling price"—the security is essentially worthless.

### Negative shares outstanding traces to corporate restructuring

The three "Negative shares outstanding" errors (HELE, QLGN, ELDN) correlate directly with significant corporate events:

**HELE** (Helen of Troy, error date May 1, 2017): No stock splits, but the company operates on a February fiscal year-end. The error date falls during fiscal year transitions when share counts from buyback programs may create calculation discrepancies across FMP's data sources.

**QLGN** (Qualigen Therapeutics, error date November 14, 2025): This company has undergone **two reverse stock splits** (1-for-10 in 2022, 1-for-50 in 2024), was acquired by Faraday Future, which became a 55% stakeholder, and rebranded to AIxCrypto Holdings in November 2025. Together these events create massive data discontinuities.

**ELDN** (Eledon Pharmaceuticals, error date November 14, 2025): A **$50 million dilutive offering** closed around November 12, 2025, adding 15+ million shares plus warrants, increasing share count by over 100% year-over-year. The error date coincides exactly with this offering.

### The single negative price error points to data corruption

ASB-PF (Associated Banc-Corp Series F Preferred) showing a negative low price is simply **data corruption**. Preferred stocks have complex ex-dividend adjustments, and a calculation error in FMP's dividend adjustment pipeline likely produced an impossible negative value. The security trades normally around $21 with a 6.65% yield.

### Conclusions: systemic issues with non-standard securities

These errors demonstrate FMP has **architectural limitations handling three categories of securities**:

**Non-standard security types**: Preferred shares, warrants, senior notes, and depositary shares have different data structures, reporting requirements, and pricing mechanics than common stock. FMP's fundamental data infrastructure appears designed primarily for common equity.

**Corporate action transitions**: Mergers, acquisitions, reverse splits, name changes, and bankruptcies create data discontinuities. When Constellation Brands eliminated STZ-B or Blackstone acquired PSB, historical data must be handled differently—FMP's pipeline struggles with these transitions.

**Illiquid and penny securities**: When UWMC-WT trades at $0.0098 with minimal volume, standard data validation breaks down. The "low > high" errors are essentially the data provider acknowledging bad tick data from nearly untradeable securities.

The pattern suggests it would be beneficial to disable the native lock in the data curator that produced this error log and replace it with more robust handling: validation rules that account for non-common equity and illiquid instruments, with every detected error logged instead of blocking the download. For users, these errors serve as a useful audit tool for gauging the overall quality of a data provider.




## 02: Sanity check

The previous log analysis and the data correction strategies that follow came out of a thorough forensic examination of the critical errors that kept DataCurator from extracting the data. This section documents the functions used to detect and correct the error types found in the set of 6000+ tickers, covering both the errors found in the log and those that did not prevent DataCurator from downloading the data. In the latter case, financial logic still dictates that the encountered values should not be possible, so all of them were treated with a deterministic filter and imputation strategy.

These functions act as a post-download validation and correction layer. The data arrives intact but contains logical inconsistencies, impossible values, or violations of financial accounting identities. Left uncorrected, these issues propagate through backtests, valuations, and risk models—often without triggering obvious failures.

The functions below implement deterministic, auditable corrections with full logging of every modification made to the source data. These functions could be implemented as custom calculations within the DataCurator module (without explicit error logging) or separately within the EDA module (with explicit error logging).

---

### 1. sort_dates

#### Error Addressed
**Out-of-order fundamental data rows.** Financial statements may arrive with dates that are not chronologically sorted due to:
- Amended filings (10-K/A, 10-Q/A) inserted after original statements
- Fiscal year-end changes creating overlapping periods
- Data provider ingestion timing mismatches

#### Approach
0. Edit DataCurator so that dates do not need to be sorted for data to be downloaded. Ingest the uncurated data.
1. Establish a date hierarchy: primary date column (`m_date`) → filing date (`f_filing_date`)
2. Preserve null-dated rows in their original positions (they cannot be sorted)
3. Sort only valid-dated rows while maintaining relative positions
4. Deduplicate by keeping earliest or latest filing per period
5. Log all position changes for audit

#### Correction Method
- **Reordering only**—no values are modified
- Null rows remain untouched
- Deduplication uses filing date to determine which version to keep
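
The steps above can be sketched in plain Python. This is a minimal illustration, not the original implementation: rows are assumed to be dictionaries, the `keep` parameter and the log field names are hypothetical, and the filing date is used only to choose between duplicate periods.

```python
from datetime import date

def sort_dates(rows, date_key="m_date", filing_key="f_filing_date", keep="latest"):
    """Sort dated rows chronologically; null-dated rows keep their slots."""
    # Step 1: per period, pick the earliest or latest filing.
    chosen = {}  # period date -> (filing date, original index)
    for idx, row in enumerate(rows):
        period = row.get(date_key)
        if period is None:
            continue
        filed = row.get(filing_key) or date.min
        best = chosen.get(period)
        if best is None or (filed > best[0] if keep == "latest" else filed < best[0]):
            chosen[period] = (filed, idx)
    kept = {idx for _, idx in chosen.values()}
    survivors = [(i, r) for i, r in enumerate(rows)
                 if r.get(date_key) is None or i in kept]
    # Step 2: sort only the dated rows and write them back into the dated slots.
    slots = [pos for pos, (_, r) in enumerate(survivors) if r.get(date_key) is not None]
    ordered = sorted((survivors[pos] for pos in slots), key=lambda pair: pair[1][date_key])
    result = [r for _, r in survivors]
    log = []
    for pos, (orig_idx, row) in zip(slots, ordered):
        result[pos] = row
        if survivors[pos][0] != orig_idx:
            log.append({"date": row[date_key], "from_row": orig_idx, "to_row": pos})
    return result, log
```

No values are modified: the function only moves rows and drops superseded filings, and the log records each position change.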

---

### 2. fill_negatives_fundamentals

#### Error Addressed
**Negative values in fundamental data columns where negatives are impossible.** Examples include:
- Negative shares outstanding
- Negative total assets
- Negative revenue (in contexts where it should be gross revenue)

These typically result from data entry errors, sign convention mismatches, or incorrect aggregation.

#### Approach
1. Scan specified columns for values < 0
2. Replace negatives with `NULL`
3. Apply forward-fill to propagate the last valid (non-negative) value
4. Log every replaced value with ticker, date, and original value

#### Correction Method
- **Forward-fill from last known good value**
- Assumes temporal continuity: if shares outstanding was 100M yesterday and shows -5M today, yesterday's value is more trustworthy
- Since fundamental data follows a stepwise function, interpolation would introduce values which never existed. This is why forward-fill is preferred.
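
A minimal sketch of this replace-then-forward-fill logic for a single column, assuming the column is a plain list with `None` for nulls. The argument names and log fields are illustrative, and, like a literal forward-fill, the sketch also fills nulls that were already present.

```python
def fill_negatives_fundamentals(values, dates, ticker):
    """Replace negative entries with the last valid value; return (clean, log)."""
    clean, log = [], []
    last_valid = None
    for day, value in zip(dates, values):
        if value is not None and value < 0:
            log.append({"ticker": ticker, "date": day, "original": value})
            value = None  # treat the impossible value as missing
        if value is None:
            value = last_valid  # forward-fill; stays None if nothing precedes it
        else:
            last_valid = value
        clean.append(value)
    return clean, log
```

A negative value at the very start of a series has no prior value to inherit, so it stays null.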

---

### 3. fill_negatives_market

#### Error Addressed
**Negative prices in market data.** A stock price cannot be negative, but data corruption from:
- Dividend adjustment calculation errors
- Bad tick data propagation
- Corporate action misapplication

can produce impossible negative values in OHLC or VWAP columns.

#### Approach
1. Identify negative values in specified columns
2. Gather up to 4 previous valid (non-negative) data points
3. Fit a **backward-looking cubic spline** to extrapolate a replacement value
4. If fewer than 3 prior points exist, fall back to last valid value (if no such value exists, default to zero)
5. If spline produces negative or non-finite result, fall back to last valid value

#### Correction Method
- **Cubic spline extrapolation (backward-looking only)**
- Explicitly avoids look-forward bias for backtesting integrity
- Preserves original null positions
- Falls back gracefully when insufficient history exists
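
The sketch below illustrates the backward-looking idea with the standard library only. It swaps in a Lagrange polynomial for the spline: with exactly four knots a not-a-knot cubic spline reduces to the single cubic through those points, and with three points the sketch fits a quadratic, which is an assumption and may differ from the original code. Function and parameter names are illustrative.

```python
import math

def extrapolate_polynomial(xs, ys, x_new):
    """Evaluate the Lagrange polynomial through (xs, ys) at x_new."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total

def fill_negatives_market(prices, max_points=4, min_points=3):
    """Replace negative prices using only earlier valid observations."""
    clean, log = list(prices), []
    for pos, value in enumerate(prices):
        if value is None or value >= 0:
            continue  # nulls and valid prices are left untouched
        history = [(k, prices[k]) for k in range(pos)
                   if prices[k] is not None and prices[k] >= 0][-max_points:]
        replacement = history[-1][1] if history else 0.0  # last valid value, else zero
        method = "last_valid"
        if len(history) >= min_points:
            xs, ys = zip(*history)
            candidate = extrapolate_polynomial(xs, ys, pos)
            if math.isfinite(candidate) and candidate >= 0:
                replacement, method = candidate, "polynomial"
        clean[pos] = replacement
        log.append({"row": pos, "original": value, "method": method})
    return clean, log
```

Because only rows before the corrupted one are consulted, the replacement never depends on future prices.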

---

### 4. zero_wipeout

#### Error Addressed
**Zero values in share-related columns when trading volume is positive.** This paradox indicates data corruption:
- If volume > 0, trading occurred
- If shares outstanding = 0, the company has no equity
- Both cannot be true simultaneously

Common cause: placeholder zeros inserted during data pipeline failures.

#### Approach
1. Identify rows where ANY target column equals 0 AND `m_volume` > 0
2. Replace the zero with `NULL`
3. Apply forward-fill to restore continuity
4. Log all affected rows

#### Correction Method
- **Conditional forward-fill**
- Only triggers when the logical impossibility (zero shares + positive volume) is detected
- Leaves legitimate zeros (pre-IPO, delisted) untouched if volume is also zero
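
As a rough sketch of the conditional rule for one share-related column, assuming list inputs with `None` for nulls (the function signature and log fields are illustrative):

```python
def zero_wipeout(shares, volume):
    """Null out zeros that coexist with positive volume, then forward-fill."""
    clean, log = [], []
    last_valid = None
    for pos, (value, vol) in enumerate(zip(shares, volume)):
        if value == 0 and vol is not None and vol > 0:
            log.append({"row": pos, "volume": vol})
            value = None  # the zero contradicts the positive volume
        if value is None:
            value = last_valid  # forward-fill from the last non-null value
        else:
            last_valid = value
        clean.append(value)
    return clean, log
```

Zeros on rows with zero volume pass through unchanged, which is how pre-IPO and delisted periods are preserved.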

---

### 5. mkt_cap_scale_error

#### Error Addressed
**10x or greater jumps in market cap or shares outstanding.** These typically indicate:
- Unit conversion errors (shares in units vs. thousands vs. millions)
- Data source switches mid-series
- Incorrect corporate action adjustments

The 10x jump parameter was chosen arbitrarily and should be adjusted to the investment universe. A 5x overnight jump is implausible for a large cap (the largest on record, Volkswagen in 2008, was roughly 4x), but in small caps it is a reasonable outlier. Placing a 10x threshold on a universe that contains only large caps would therefore yield false negatives. A reasonable generalized approach is a statistical filter (see section 04).

#### Approach
1. Compare each row's value to the previous row
2. Flag rows where value ≥ 10× prior value
3. Detect **correlated jumps**: if both market cap AND shares outstanding jump together, the error likely spans multiple rows
4. For correlated jumps, identify the entire error span (values within 20% of jumped value)
5. Apply forward-fill to replace the corrupted span

#### Correction Method
- **Forward-fill with span detection**
- Single-row spikes: replace with prior value
- Multi-row plateaus: identify the elevated region and replace entirely
- Logs include error type classification
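
One possible way to express the jump and span detection is sketched below. It is a simplification under stated assumptions: the span search is applied to every detected jump (a single-row spike is simply a span of length one), the correlation with shares outstanding is only recorded in the log, and all names are hypothetical.

```python
def find_jump_spans(values, ratio=10.0, plateau_tol=0.20):
    """Return (start, end) pairs, end exclusive, of suspected scale errors."""
    spans, pos = [], 1
    while pos < len(values):
        prior, current = values[pos - 1], values[pos]
        if prior is not None and current is not None and prior > 0 and current >= ratio * prior:
            end = pos + 1
            # Extend the span while values stay within 20% of the jumped value.
            while (end < len(values) and values[end] is not None
                   and abs(values[end] - current) <= plateau_tol * current):
                end += 1
            spans.append((pos, end))
            pos = end
        else:
            pos += 1
    return spans

def mkt_cap_scale_error(market_cap, shares, ratio=10.0):
    """Forward-fill every jump span in market_cap; log whether shares jumped too."""
    share_starts = {start for start, _ in find_jump_spans(shares, ratio)}
    clean, log = list(market_cap), []
    for start, end in find_jump_spans(market_cap, ratio):
        for pos in range(start, end):
            clean[pos] = clean[start - 1]  # last value before the jump
        log.append({"start": start, "end": end, "correlated": start in share_starts})
    return clean, log
```

The `ratio` argument is the threshold discussed above, so a large-cap universe could pass a smaller value.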

---

### 6. ohlc_integrity

#### Error Addressed
**Violations of OHLC mathematical constraints:**
- High < max(Open, Close, Low) → High should be the maximum
- Low > min(Open, Close, High) → Low should be the minimum
- VWAP outside [Low, High] → VWAP must fall within the day's range

These violations break technical indicators, volatility calculations, and charting.

#### Approach
Validates three column groups independently:
1. Raw OHLC (`m_open`, `m_high`, `m_low`, `m_close`, `m_vwap`)
2. Split-adjusted OHLC
3. Dividend-and-split-adjusted OHLC

For each group:
- Compute actual max/min of OHLC values
- Compare against declared High/Low
- Check VWAP bounds

#### Correction Method
| Violation | Correction |
|-----------|------------|
| High < actual max | Set High = max(O, H, L, C) |
| Low > actual min | Set Low = min(O, H, L, C) |
| VWAP outside range | Set VWAP = (O + H + L + C) / 4 |

- Uses OHLC centroid as VWAP replacement (simple average, not volume-weighted, but mathematically valid)
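
The correction table can be sketched for a single raw-OHLC row as follows. The row is assumed to be a dictionary, and the sketch computes the centroid from the corrected values, which is an assumption since the table does not say whether raw or corrected inputs are used. The adjusted column groups would follow the same logic with their own column names.

```python
def ohlc_integrity(row):
    """Return a corrected copy of one OHLC(+VWAP) row and the list of fixes."""
    fixed, fixes = dict(row), []
    prices = [row["m_open"], row["m_high"], row["m_low"], row["m_close"]]
    if row["m_high"] < max(prices):
        fixed["m_high"] = max(prices)
        fixes.append("high")
    if row["m_low"] > min(prices):
        fixed["m_low"] = min(prices)
        fixes.append("low")
    vwap = row.get("m_vwap")
    if vwap is not None and not fixed["m_low"] <= vwap <= fixed["m_high"]:
        # Simple average of the corrected OHLC values, always inside [low, high].
        fixed["m_vwap"] = (fixed["m_open"] + fixed["m_high"]
                           + fixed["m_low"] + fixed["m_close"]) / 4
        fixes.append("vwap")
    return fixed, fixes
```

For a "low > high" row such as open 10, high 9, low 11, close 10, the sketch sets high to 11 and low to 9.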

---

### 7. validate_financial_equivalencies

#### Error Addressed
**Violations of fundamental accounting identities:**

**Hard Filters (corrected):**
- Assets ≠ Current Assets + Noncurrent Assets
- Liabilities ≠ Current Liabilities + Noncurrent Liabilities

**Soft Filters (flagged only):**
- Stockholder Equity ≠ Common Stock + APIC + Retained Earnings + Other Equity
- Period End Cash ≠ Cash and Cash Equivalents
- Assets ≠ Liabilities + Equity + Noncontrolling Interest

#### Approach
1. Compute component sums for each identity
2. Compare against declared totals with configurable tolerance (default 5%)
3. For hard filters: apply proportional scaling to force balance
4. For soft filters: set `data_warning` flag without modifying values

#### Correction Method (Hard Filters)
**Proportional Scaling:**
```
Factor = Total / (Current + Noncurrent)
Corrected_Current = Current × Factor
Corrected_Noncurrent = Noncurrent × Factor
```

**Edge Case:** If components sum to 0 but total ≠ 0, the entire total is assigned to the noncurrent bucket (residual plug).

Soft filter violations are logged but not corrected because:
- Equity components may have legitimate "other" buckets not captured
- Cash timing differences may reflect intra-period movements
- Balance sheet identity failures may indicate complex structures (e.g., variable interest entities)
- A sample of 500 tickers yielded 1M+ errors, the probability of false positives is too high to impute.
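
The hard-filter correction, including the residual-plug edge case, can be sketched as a small function. Measuring the tolerance relative to the declared total is an assumption of this sketch, as are the function name and return shape.

```python
def rebalance(total, current, noncurrent, tolerance=0.05):
    """Scale two components so they sum to the declared total.

    Returns (current, noncurrent, was_corrected).
    """
    component_sum = current + noncurrent
    if abs(total - component_sum) <= tolerance * abs(total):
        return current, noncurrent, False  # identity holds within tolerance
    if component_sum == 0:
        # Residual plug: nothing to scale, so the total goes to noncurrent.
        return 0.0, float(total), True
    factor = total / component_sum
    return current * factor, noncurrent * factor, True
```

With total assets of 1000 and components of 300 and 500, the factor is 1.25 and the corrected components are 375 and 625, so the ratio between the components is preserved.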

---

### 8. validate_market_split_consistency

#### Error Addressed
**Inconsistency between raw market data and split-adjusted market data.** The relationship should be deterministic:
- `Adjusted_Price = Raw_Price × K`
- `Adjusted_Volume = Raw_Volume / K`

Where K is the cumulative split adjustment factor. When K_implied (from raw/adjusted) ≠ K_expected (from split events), the data is internally inconsistent.

#### Approach
1. Calculate daily split factor: `factor = denominator / numerator` (1.0 if no split)
2. Calculate cumulative K: `K_expected = cumulative_product(daily_factors)`
3. For each price column pair: `K_implied = adjusted / raw`
4. For volume: `K_implied = raw / adjusted` (inverse relationship)
5. Flag rows where `|K_implied - K_expected| > tolerance × |K_expected|`

#### Correction Method
Recalculate adjusted values from raw values using K_expected:
- **Prices:** `corrected_adjusted = raw × K_expected`
- **Volume:** `corrected_adjusted = raw / K_expected`

Validated column pairs:
| Raw Column | Adjusted Column | Relationship |
|------------|-----------------|--------------|
| m_open | m_open_split_adjusted | Price (×K) |
| m_high | m_high_split_adjusted | Price (×K) |
| m_low | m_low_split_adjusted | Price (×K) |
| m_close | m_close_split_adjusted | Price (×K) |
| m_vwap | m_vwap_split_adjusted | Price (×K) |
| m_volume | m_volume_split_adjusted | Volume (÷K) |
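
A minimal sketch of the flag-and-recalculate step for one column pair is shown below. It takes `K_expected` as an input rather than deriving it from split events, and the 1% default tolerance and the `kind` argument are assumptions made for illustration.

```python
def check_split_consistency(raw, adjusted, k_expected, kind="price", tolerance=0.01):
    """Recompute adjusted values wherever K_implied disagrees with K_expected."""
    corrected, flagged = list(adjusted), []
    for pos, (r, a, k) in enumerate(zip(raw, adjusted, k_expected)):
        if r is None or a is None or r == 0 or a == 0:
            continue  # K_implied is undefined for this row
        # Prices scale by K, volume by 1/K.
        k_implied = a / r if kind == "price" else r / a
        if abs(k_implied - k) > tolerance * abs(k):
            corrected[pos] = r * k if kind == "price" else r / k
            flagged.append(pos)
    return corrected, flagged
```

For example, with a 2-for-1 split (K = 0.5) a raw close of 102 should have an adjusted close of 51; a row that still shows 102 in the adjusted column would be flagged and recalculated.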

---

### Summary Matrix

| Function | Error Type | Detection Method | Correction Method |
|----------|------------|------------------|-------------------|
| sort_dates | Unsorted rows | Date comparison | Reorder in place |
| fill_negatives_fundamentals | Negative fundamentals | value < 0 | Forward-fill |
| fill_negatives_market | Negative prices | value < 0 | Cubic spline / forward-fill |
| zero_wipeout | Zero shares + positive volume | shares = 0 AND volume > 0 | Forward-fill |
| mkt_cap_scale_error | 10x jumps | value ≥ 10 × prior | Forward-fill (span-aware) |
| ohlc_integrity | OHLC constraint violations | H < max, L > min, VWAP bounds | Set to computed bounds |
| validate_financial_equivalencies | Accounting identity failures | \|Total - Sum\| > tolerance | Proportional scaling |
| validate_market_split_consistency | Split adjustment mismatch | K_implied ≠ K_expected | Recalculate from K_expected |

---

### Design Principles

1. **Full Audit Trail:** Every function returns a log of all modifications, enabling forensic analysis and regulatory compliance.

2. **Type Preservation:** Functions accept and return the same type (DataFrame ↔ DataFrame, LazyFrame ↔ LazyFrame).

3. **No Look-Forward Bias:** Corrections use only historical data, preserving validity for backtesting.

4. **Graceful Degradation:** When ideal corrections aren't possible (insufficient history, missing columns), functions fall back to simpler methods rather than failing.

5. **Configurable Tolerance:** Validation thresholds are parameterized, allowing adjustment for different data quality contexts.


## 03: Sanity check forensic log audit

## 04: Statistical filter

The sanity check functions address deterministic, rule-based errors where values violate hard constraints (negative prices, accounting identities, OHLC ordering). Statistical filters complement this by identifying probabilistic outliers—values that are technically possible but statistically improbable given the surrounding data context. These outliers typically result from data transmission errors, calculation precision loss, or provider-side aggregation bugs that don't trigger logic violations but distort quantitative analysis.

The functions below implement statistical outlier detection using four complementary methods, each optimized for different data characteristics and error patterns. All methods preserve backtesting integrity through backward-looking analysis and provide full audit logging.

---

### 1. rolling_z_score

#### Outlier Pattern Addressed
**Time-series drift outliers in price and moving average data.** Market prices and technical indicators exhibit trending behavior—a stock trading at $50 may drift to $100 over months. Standard deviation calculated on the entire history treats $100 as an outlier relative to early $50 data, creating false positives. This method adapts to local trends using a rolling window.

Common causes:
- Flash crash recovery artifacts (partial tick data correction)
- Intraday quote consolidation errors creating spurious OHLC values
- Moving average calculation errors during corporate actions

#### Approach
1. For each target column, calculate rolling mean and standard deviation using a **63-day backward-looking window** (excludes current row via `.shift(1)` to prevent look-forward bias)
2. Compute Z-scores: `Z = (value - rolling_mean) / rolling_std`
3. Flag outliers where `|Z| > 3.5` and sufficient window data exists (`min_periods = 10`)
4. Treat outliers as missing values and fit a **cubic spline** on remaining valid data
5. Interpolate replacement values at outlier positions
6. Log all corrections with Z-score, rolling statistics, and interpolation method

#### Correction Method
- **Primary:** Cubic spline interpolation (fits smooth curve through clean data)
- **Fallback 1:** Nearest valid value (if spline fails due to geometric constraints)
- **Fallback 2:** Last valid value (if spline produces non-finite result)

All interpolation uses only historical data (positions before the outlier) to maintain backtest validity.
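
The detection half of this method can be sketched with the standard library. The slice `values[pos - window:pos]` plays the role of the shifted rolling window, so the current row never contributes to its own statistics. The function name is illustrative and the imputation step is omitted.

```python
import statistics

def rolling_z_flags(values, window=63, min_periods=10, threshold=3.5):
    """Flag values whose Z-score against the preceding window exceeds threshold."""
    flags = []
    for pos, value in enumerate(values):
        # Backward-looking window that excludes the current row.
        history = [v for v in values[max(0, pos - window):pos] if v is not None]
        if value is None or len(history) < min_periods:
            flags.append(False)
            continue
        mean = statistics.fmean(history)
        std = statistics.stdev(history)
        flags.append(std > 0 and abs(value - mean) / std > threshold)
    return flags
```

Note that in this sketch a flagged value still enters the window of later rows; a fuller version might exclude already-flagged points from the history.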

#### Columns Processed
**Time-series market data (OHLC prices, VWAP):**
- Raw: `m_open`, `m_high`, `m_low`, `m_close`, `m_vwap`
- Split-adjusted: `m_open_split_adjusted`, `m_high_split_adjusted`, `m_low_split_adjusted`, `m_close_split_adjusted`, `m_vwap_split_adjusted`
- Dividend-and-split-adjusted: `m_open_dividend_and_split_adjusted`, `m_high_dividend_and_split_adjusted`, `m_low_dividend_and_split_adjusted`, `m_close_dividend_and_split_adjusted`, `m_vwap_dividend_and_split_adjusted`

**Technical indicators (moving averages):**
- Simple Moving Averages: `c_simple_moving_average_5d_close_*`, `c_simple_moving_average_21d_close_*`, `c_simple_moving_average_63d_close_*`, `c_simple_moving_average_252d_close_*`
- Exponential Moving Averages: `c_exponential_moving_average_5d_close_*`, `c_exponential_moving_average_21d_close_*`, `c_exponential_moving_average_63d_close_*`, `c_exponential_moving_average_252d_close_*`

(The `*` suffix indicates both `split_adjusted` and `dividend_and_split_adjusted` variants)

---

### 2. mahalanobis_filter

#### Outlier Pattern Addressed
**Multivariate outliers in fundamental data violating cross-sectional relationships.** A company reporting $1B revenue with $10B operating expenses may pass univariate checks but violates industry norms. This method detects combinations of values that are improbable relative to sector peers.

Common causes:
- Unit scaling errors affecting multiple related fields (assets in millions, liabilities in thousands)
- Partial quarterly restatements creating temporary inconsistencies
- Wrong fiscal period data assigned to a calendar quarter

#### Approach
1. **Peer Group Identification:** Query metadata to identify all tickers in the same sector
2. **Robust Standardization:** For each quarter across all peer history, calculate robust Z-scores using median and MAD (Median Absolute Deviation) via Polars window functions
3. **Pooled Covariance Estimation:** Fit a single robust covariance matrix (MinCovDet) on the pooled Z-score matrix of all peers across all time
4. **Distance Calculation:** Compute Mahalanobis distance for the target ticker: `D² = (Z - μ) @ Σ⁻¹ @ (Z - μ)ᵀ`
5. **Outlier Detection:** Flag quarters where `D² > χ²(p, 1-α)` threshold (Chi-squared distribution with degrees of freedom = number of columns)
6. **Quarterly Imputation:** Replace entire flagged quarter with forward-filled values from the last clean quarter

#### Correction Method
- **Quarterly forward-fill:** When a quarter is flagged, all fundamental values for that quarter are replaced with the previous valid quarter's values
- Assumes fundamental data changes are gradual quarter-to-quarter; sudden multivariate shifts indicate corruption
- If no prior clean quarter exists, values remain unchanged (no correction possible)
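
The distance test can be illustrated with a deliberately reduced sketch: two columns only, a classical sample covariance in place of the robust MinCovDet estimate, and no robust Z-score standardization. For two degrees of freedom the chi-squared quantile has the closed form `-2 ln(α)`, which lets the sketch stay within the standard library. All names are hypothetical.

```python
import math

def mahalanobis_flags(peer_rows, target_rows, alpha=0.01):
    """Flag (x, y) target rows that are improbable given the peer distribution."""
    n = len(peer_rows)
    mean_x = sum(x for x, _ in peer_rows) / n
    mean_y = sum(y for _, y in peer_rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in peer_rows) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in peer_rows) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in peer_rows) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Chi-squared quantile at 1 - alpha for 2 degrees of freedom.
    cutoff = -2.0 * math.log(alpha)
    flags = []
    for x, y in target_rows:
        dx, dy = x - mean_x, y - mean_y
        # Squared distance using the closed-form inverse of the 2x2 covariance.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        flags.append(d2 > cutoff)
    return flags
```

A point can sit inside the univariate range of both columns and still be flagged, because the distance accounts for how the two columns move together across peers.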

#### Columns Processed
**Balance Sheet (`fbs_` prefix):**
- Core aggregates: `fbs_assets`, `fbs_current_assets`, `fbs_noncurrent_assets`, `fbs_liabilities`, `fbs_current_liabilities`, `fbs_noncurrent_liabilities`, `fbs_stockholder_equity`
- Cash components: `fbs_cash_and_cash_equivalents`, `fbs_cash_and_shortterm_investments`, `fbs_shortterm_investments`
- Operating assets: `fbs_net_inventory`, `fbs_net_property_plant_and_equipment`, `fbs_goodwill`, `fbs_net_intangible_assets_excluding_goodwill`, `fbs_net_intangible_assets_including_goodwill`
- Receivables/Payables: `fbs_current_accounts_receivable_after_doubtful_accounts`, `fbs_current_net_receivables`, `fbs_current_accounts_payable`, `fbs_current_accrued_expenses`, `fbs_other_receivables`, `fbs_other_payables`
- Debt components: `fbs_shortterm_debt`, `fbs_longterm_debt`, `fbs_total_debt_including_capital_lease_obligations`, `fbs_capital_lease_obligations`, `fbs_current_capital_lease_obligations`, `fbs_noncurrent_capital_lease_obligations`, `fbs_net_debt`
- Equity components: `fbs_common_stock_value`, `fbs_preferred_stock_value`, `fbs_treasury_stock_value`, `fbs_additional_paid_in_capital`, `fbs_retained_earnings`, `fbs_accumulated_other_comprehensive_income_after_tax`, `fbs_other_stockholder_equity`
- Other: `fbs_investments`, `fbs_longterm_investments`, `fbs_deferred_revenue`, `fbs_noncurrent_deferred_revenue`, `fbs_current_tax_payables`, `fbs_noncurrent_deferred_tax_assets`, `fbs_noncurrent_deferred_tax_liabilities`, `fbs_noncontrolling_interest`, `fbs_total_equity_including_noncontrolling_interest`, `fbs_total_liabilities_and_equity`, `fbs_other_assets`, `fbs_other_current_assets`, `fbs_other_noncurrent_assets`, `fbs_other_liabilities`, `fbs_other_current_liabilities`, `fbs_other_noncurrent_liabilities`, `fbs_total_payables_current_and_noncurrent`, `fbs_prepaid_expenses`

**Cash Flow Statement (`fcf_` prefix):**
- Core flows: `fcf_net_cash_from_operating_activities`, `fcf_net_cash_from_investing_activites`, `fcf_net_cash_from_financing_activities`, `fcf_free_cash_flow`
- Operating activities: `fcf_net_income`, `fcf_depreciation_and_amortization`, `fcf_stock_based_compensation`, `fcf_deferred_income_tax`, `fcf_other_noncash_items`, `fcf_working_capital_change`, `fcf_accounts_receivable_change`, `fcf_accounts_payable_change`, `fcf_inventory_change`, `fcf_other_working_capital`
- Investing activities: `fcf_capital_expenditure`, `fcf_property_plant_and_equipment_purchase`, `fcf_investments_purchase`, `fcf_investment_sales_maturities_and_collections_proceeds`, `fcf_net_business_acquisition_payments`, `fcf_other_investing_activities`
- Financing activities: `fcf_dividend_payments`, `fcf_common_stock_dividend_payments`, `fcf_preferred_stock_dividend_payments`, `fcf_common_stock_issuance_proceeds`, `fcf_common_stock_repurchase`, `fcf_preferred_stock_issuance_proceeds`, `fcf_net_common_stock_issuance_proceeds`, `fcf_net_stock_issuance_proceeds`, `fcf_net_debt_issuance_proceeds`, `fcf_net_shortterm_debt_issuance_proceeds`, `fcf_net_longterm_debt_issuance_proceeds`, `fcf_other_financing_activities`
- Tax and interest: `fcf_net_income_tax_payments`, `fcf_interest_payments`
- Cash reconciliation: `fcf_period_start_cash`, `fcf_period_end_cash`, `fcf_cash_and_cash_equivalents_change`, `fcf_cash_exchange_rate_effect`

**Income Statement (`fis_` prefix):**
- Revenue/Costs: `fis_revenues`, `fis_cost_of_revenue`, `fis_gross_profit`, `fis_costs_and_expenses`
- Operating expenses: `fis_operating_expenses`, `fis_selling_general_and_administrative_expense`, `fis_selling_and_marketing_expense`, `fis_general_and_administrative_expense`, `fis_research_and_development_expense`, `fis_depreciation_and_amortization`, `fis_other_expenses`
- Operating income: `fis_operating_income`, `fis_earnings_before_interest_and_tax`, `fis_earnings_before_interest_tax_depreciation_and_amortization`
- Non-operating items: `fis_interest_income`, `fis_interest_expense`, `fis_net_interest_income`, `fis_nonoperating_income_excluding_interest`, `fis_net_total_other_income`, `fis_other_net_income_adjustments`
- Pre-tax/Tax: `fis_income_before_tax`, `fis_income_tax_expense`
- Net income: `fis_net_income`, `fis_continuing_operations_income_after_tax`, `fis_discontinued_operations_income_after_tax`, `fis_net_income_deductions`, `fis_basic_net_income_available_to_common_stockholders`
- Per share: `fis_basic_earnings_per_share`, `fis_diluted_earnings_per_share`, `fis_weighted_average_basic_shares_outstanding`, `fis_weighted_average_diluted_shares_outstanding`

**Calculated Valuation Ratios (`c_` prefix):**
- Per-share metrics: `c_book_value_per_share`, `c_earnings_per_share`, `c_last_twelve_months_revenue_per_share`
- Valuation multiples: `c_book_to_price`, `c_earnings_to_price`, `c_sales_to_price`
- Trailing metrics: `c_last_twelve_months_net_income`, `c_last_twelve_months_revenue`
- Market cap: `c_market_cap`

---

### 3. mad_filter

#### Outlier Pattern Addressed
**Univariate spikes in bounded or sparse data.** Volume and technical oscillators often exhibit extreme spikes (10x-100x normal) that break standard deviation assumptions. RSI and similar indicators are bounded [0, 100], making them poorly suited to Gaussian modeling. MAD (Median Absolute Deviation) provides robust outlier detection for these distributions.

Common causes:
- End-of-day volume adjustments creating artificial spikes
- Oscillator calculation errors when price data contains gaps
- Dividend record date misalignments creating spurious dividend values

#### Approach
1. Calculate median and MAD for each column: `MAD = median(|X - median(X)|)`
2. Compute Modified Z-scores: `M = 0.6745 × (X - median) / MAD`
3. Flag outliers where `|M| > 3.5` (equivalent to ±3.5σ in Gaussian data)
4. Treat outliers as missing and fit **cubic spline** on valid data
5. Interpolate replacement values
6. Log all corrections with Modified Z-score, median, and MAD

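The constant 0.6745 ≈ Φ⁻¹(0.75) is the 75th percentile of the standard normal distribution. Since MAD ≈ 0.6745σ for Gaussian data, multiplying by it rescales the MAD-based score to standard-deviation units, making the 3.5 threshold comparable to the rolling Z-score threshold.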

#### Correction Method
- **Primary:** Cubic spline interpolation
- **Fallback 1:** Nearest valid value (if spline fails)
- **Fallback 2:** Last valid value (if spline produces non-finite result)

Same interpolation strategy as `rolling_z_score`, but the statistics are computed over the full history (not a rolling window), on the assumption that volume and oscillators do not drift the way price levels do.
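
The correction chain can be sketched as follows, assuming NumPy and SciPy are available. The function name `spline_correct`, the `min_points=4` guard (taken from the Summary Matrix) and the tie-breaking rule for the nearest value are assumptions made for illustration; the project's own implementation may differ.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def spline_correct(values, outlier_mask, min_points=4):
    """Replace flagged points using a cubic spline fitted on the unflagged ones."""
    values = np.asarray(values, dtype=float)
    mask = np.asarray(outlier_mask, dtype=bool)
    valid = ~mask & np.isfinite(values)
    corrected = values.copy()
    if not mask.any() or not valid.any():
        return corrected  # nothing to fix, or nothing to fit on

    x = np.arange(len(values))
    xv, yv, xo = x[valid], values[valid], x[mask]

    # Nearest valid value for every flagged index (ties go to the earlier point).
    nearest = yv[np.abs(xo[:, None] - xv[None, :]).argmin(axis=1)]

    try:
        if xv.size < min_points:
            raise ValueError("too few valid points for a cubic spline")
        estimate = np.asarray(CubicSpline(xv, yv)(xo), dtype=float)
    except ValueError:
        estimate = nearest.copy()  # Fallback 1: spline could not be fitted

    # Fallback 2: last valid value wherever the estimate is not finite.
    for k, i in enumerate(xo):
        if not np.isfinite(estimate[k]):
            earlier = yv[xv < i]
            estimate[k] = earlier[-1] if earlier.size else nearest[k]

    corrected[mask] = estimate
    return corrected


series = [1.0, 2.0, 3.0, 100.0, 5.0, 6.0, 7.0]
flags = [False, False, False, True, False, False, False]
fixed = spline_correct(series, flags)  # index 3 is re-estimated from its neighbours
```

Only flagged positions are overwritten; every unflagged observation passes through unchanged, which keeps the correction log limited to the values that were actually altered.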

#### Columns Processed
**Volume and traded value:**
- Volume: `m_volume`, `m_volume_split_adjusted`, `m_volume_dividend_and_split_adjusted`
- Traded value: `c_daily_traded_value`, `c_daily_traded_value_sma_5d`, `c_daily_traded_value_sma_21d`, `c_daily_traded_value_sma_63d`, `c_daily_traded_value_sma_252d`

**Technical oscillators and indicators:**
- RSI: `c_rsi_14d_dividend_and_split_adjusted`, `c_rsi_14d_split_adjusted`
- MACD: `c_macd_26d_12d_dividend_and_split_adjusted`, `c_macd_26d_12d_split_adjusted`, `c_macd_signal_9d_dividend_and_split_adjusted`, `c_macd_signal_9d_split_adjusted`
- Chaikin Money Flow: `c_chaikin_money_flow_21d_dividend_and_split_adjusted`, `c_chaikin_money_flow_21d_split_adjusted`

**Dividend amounts (magnitude checks):**
- Declaration date: `d_declaration_date_dividend`, `d_declaration_date_dividend_split_adjusted`
- Ex-dividend date: `d_ex_dividend_date_dividend`, `d_ex_dividend_date_dividend_split_adjusted`
- Record date: `d_record_date_dividend`, `d_record_date_dividend_split_adjusted`
- Payment date: `d_payment_date_dividend`, `d_payment_date_dividend_split_adjusted`

---

### 4. garch_residuals

#### Outlier Pattern Addressed
**Volatility-conditional price outliers.** A 5% daily return is normal during market crashes but anomalous during calm periods. Standard Z-scores fail to account for volatility clustering (periods of high volatility beget more high volatility). GARCH models dynamic volatility to detect returns that are extreme *given the current market regime*.

Common causes:
- Partial adjustment of stock splits or reverse splits (e.g., price halved but volume not doubled)
- Timestamp errors causing end-of-day vs. intraday price mismatches
- Consolidated tape errors during low-liquidity periods

#### Approach
1. Calculate percentage returns: `r = (P_t - P_{t-1}) / P_{t-1} × 100`
2. Fit **GARCH(1,1)** with Student's t-distribution: `σ²_t = ω + α·ε²_{t-1} + β·σ²_{t-1}`
3. Extract conditional volatility `σ_t` and compute standardized residuals: `z_t = r_t / σ_t`
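4. Calculate dynamic threshold from t-distribution: `threshold = t^{-1}(1 - p/2, ν)` where `p` is the significance level set by the `confidence` parameter (not to be confused with the GARCH coefficient α) and ν = degrees of freedom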
5. Flag outliers where `|z_t| > threshold`
6. **Validate stationarity:** If α + β ≥ 1, the model is non-stationary; skip correction
7. Interpolate using **cubic spline** on clean data (outliers masked)
8. Log corrections with standardized residual and threshold

GARCH operates on **returns** (first differences), not levels, to handle non-stationary price series. Outliers are identified in return space but corrections are applied to price levels.
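
The variance recursion, the standardized residuals and the stationarity check can be illustrated with a short standard-library sketch. Several things here are assumptions: the parameters ω, α, β are fixed by hand rather than fitted, returns are treated as zero-mean, the recursion starts at the unconditional variance ω / (1 - α - β), and the threshold is a plain number instead of a t-distribution quantile. The function names and the price series are hypothetical.

```python
import math


def pct_returns(prices):
    """Percentage returns r_t = (P_t - P_{t-1}) / P_{t-1} * 100."""
    return [(p1 - p0) / p0 * 100.0 for p0, p1 in zip(prices, prices[1:])]


def garch_standardized_residuals(returns, omega, alpha, beta):
    """Run the GARCH(1,1) recursion with given (assumed, not fitted) parameters.

    Returns None when alpha + beta >= 1, mirroring the skip-correction rule.
    """
    if alpha + beta >= 1.0:
        return None
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    z = []
    for r in returns:
        z.append(r / math.sqrt(var))
        var = omega + alpha * r * r + beta * var  # variance for the next step
    return z


def flag_return_outliers(returns, omega, alpha, beta, threshold):
    """Indices (in return space) whose standardized residual exceeds the threshold."""
    z = garch_standardized_residuals(returns, omega, alpha, beta)
    if z is None:
        return []
    return [i for i, v in enumerate(z) if abs(v) > threshold]


prices = [100.0, 100.5, 100.0, 100.4, 130.0, 100.6, 100.9]
returns = pct_returns(prices)
outlier_returns = flag_return_outliers(returns, omega=0.05, alpha=0.1, beta=0.85, threshold=3.0)
# Return index i is the move from prices[i] to prices[i + 1], so the
# price level to correct sits at index i + 1.
outlier_prices = [i + 1 for i in outlier_returns]
```

In this example only the jump into the bad price is flagged. The reversal on the next day is judged against the conditional variance that the jump itself inflated, so masking and re-interpolating the single price level is enough to repair both returns.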

#### Correction Method
- **Primary:** Cubic spline interpolation on price levels (using indices where return outliers were detected)
- **Fallback 1:** Nearest valid price (if spline fails)
- **Fallback 2:** Last valid price (if spline produces non-finite result)

The method requires **≥100 valid observations** to fit GARCH reliably and skips columns where returns have near-zero variance (e.g., delisted stocks with constant prices).
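
A guard of this kind might look like the sketch below. The 100-observation minimum comes from the text; the function name `garch_preconditions_met` and the variance tolerance `min_variance=1e-12` are assumptions chosen for illustration.

```python
import math
from statistics import pvariance


def garch_preconditions_met(returns, min_obs=100, min_variance=1e-12):
    """Return True only if there is enough non-degenerate data to fit GARCH."""
    valid = [r for r in returns if r is not None and math.isfinite(r)]
    if len(valid) < min_obs:
        return False  # fewer than the required number of valid observations
    return pvariance(valid) > min_variance  # reject (near-)constant series


too_short = garch_preconditions_met([0.1, -0.2] * 25)      # 50 observations
flat_series = garch_preconditions_met([0.0] * 200)         # constant returns
usable = garch_preconditions_met([0.5, -0.5] * 100)        # 200 varying returns
```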

#### Columns Processed
**Returns:**
- Daily log returns: `c_log_returns_dividend_and_split_adjusted`
- Intraday log range: `c_log_difference_high_to_low`

**Volatility measures (annualized rolling standard deviation of log returns):**
- 5-day: `c_annualized_volatility_5d_log_returns_dividend_and_split_adjusted`
- 21-day: `c_annualized_volatility_21d_log_returns_dividend_and_split_adjusted`
- 63-day: `c_annualized_volatility_63d_log_returns_dividend_and_split_adjusted`
- 252-day: `c_annualized_volatility_252d_log_returns_dividend_and_split_adjusted`

---

### Summary Matrix

| Function | Target Data Type | Detection Method | Correction Method | Min Data Required |
|----------|------------------|------------------|-------------------|-------------------|
| rolling_z_score | Time-series (prices, MAs) | Rolling Z-score (63d window, threshold=3.5) | Cubic spline / nearest / last valid | 10 valid points in window |
| mahalanobis_filter | Fundamental data (quarterly) | Mahalanobis distance on robust Z-scores (χ² threshold) | Quarterly forward-fill | ≥ (5 × columns) peer observations |
| mad_filter | Spiky univariate (volume, oscillators) | Modified Z-score via MAD (threshold=3.5) | Cubic spline / nearest / last valid | 4 valid points |
| garch_residuals | Volatility/returns | GARCH(1,1) standardized residuals (t-distribution) | Cubic spline / nearest / last valid | 100 valid returns |

---

### Design Principles

1. **Complementary Coverage:** Each method targets different data characteristics—time-series drift (rolling), multivariate relationships (Mahalanobis), univariate spikes (MAD), and volatility regimes (GARCH)—minimizing overlap and false negatives.

2. **Robust Statistics:** All methods use robust estimators (MAD, MinCovDet, rolling medians) resistant to the very outliers being detected, avoiding breakdown from contaminated data.

3. **Backward-Looking Only:** All calculations exclude future data (via `.shift(1)`, historical-only splines, forward-fill imputation) to preserve backtesting validity.

4. **Graceful Degradation:** Methods fail softly when conditions aren't met (insufficient data, non-stationary GARCH, zero MAD) by logging the issue and skipping correction rather than propagating NaNs.

5. **Sector-Aware (Mahalanobis):** Fundamental data outliers are evaluated against peer norms, not absolute thresholds, accounting for industry-specific accounting patterns (e.g., negative retained earnings normal for growth tech, abnormal for utilities).

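6. **Configurable Confidence:** The `confidence` parameter (default 0.01, i.e. a 99% confidence level) controls the false-positive vs. false-negative tradeoff for the methods with distributional thresholds: χ² quantiles (Mahalanobis) and t-distribution quantiles (GARCH). The rolling Z-score and MAD filters use the fixed threshold of 3.5 shown in the Summary Matrix.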

7. **Full Audit Trail:** Logs include method-specific diagnostics (Z-scores, Mahalanobis distances, GARCH parameters) enabling forensic analysis of why values were flagged and whether corrections were appropriate.
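
Principle 6 can be made concrete with SciPy's quantile functions. In this sketch the helper names are hypothetical, and using the number of columns as the χ² degrees of freedom (with the cutoff applied to the squared Mahalanobis distance) is an assumption. The two-sided t quantile follows the GARCH threshold formula given in the `garch_residuals` approach.

```python
from scipy.stats import chi2, t


def mahalanobis_cutoff(confidence, n_columns):
    """Chi-square quantile for a squared distance; df = n_columns is an assumption."""
    return float(chi2.ppf(1.0 - confidence, df=n_columns))


def garch_cutoff(confidence, nu):
    """Two-sided Student's t quantile with nu degrees of freedom."""
    return float(t.ppf(1.0 - confidence / 2.0, df=nu))


d2_cut = mahalanobis_cutoff(0.01, n_columns=2)  # cutoff for two fundamentals
z_cut = garch_cutoff(0.01, nu=5)                # cutoff for heavy-tailed residuals
```

Lowering `confidence` raises both cutoffs, so fewer points are flagged (fewer false positives, more false negatives); raising it does the opposite.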