# <img src="https://raw.githubusercontent.com/OlivierNDO/framecheck/main/images/logo.png" alt="FrameCheck" width="512" height="125">

# FrameCheck: Pandas DataFrame Validation

**FrameCheck** is a lightweight, flexible validation library for pandas DataFrames.

Instead of making you write dozens of repetitive checks or maintain complex schema configurations, FrameCheck offers a clean, fluent API that keeps validation readable and maintainable.

Key features:
- Simple, chainable validation methods
- Column and DataFrame-level validation
- Support for both error and warning-level assertions
- No configuration files or decorators

This notebook demonstrates how FrameCheck can help you implement robust validation with minimal code.

## Setup

In [4]:
# Install framecheck
!pip install framecheck -q

# Import required packages
import logging
import pandas as pd
from framecheck import FrameCheck

## Sample Data: Model Output Validation

Let's create a dataset representing ML model output that requires validation:

In [13]:
logger = logging.getLogger("model_validation")
logger.setLevel(logging.INFO)

df = pd.DataFrame({
    'transaction_id': ['TXN1001', 'TXN1002', 'TXN1003'],
    'user_id': [501, 502, 503],
    'transaction_time': ['2024-04-15 08:23:11', '2024-04-15 08:45:22', '2024-04-15 09:01:37'],
    'model_score': [0.0, 0.92, 0.95],
    'model_version': ['v2.1.0', 'v2.1.0', 'v2.1.0'],
    'flagged_for_review': [False, True, False]
})

## Validation Requirements

This data needs to meet these conditions:
- `transaction_id`: follows the TXN format (`TXN` followed by at least four digits)
- `user_id`: positive integer
- `transaction_time`: valid datetime, not in the future
- `model_score`: float between 0 and 1 **(warn if equal to 0)**
- `model_version`: string
- `flagged_for_review`: boolean
- No missing values
- Business rule: high scores (>0.9) must be flagged
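
For contrast, here is a rough sketch of a few of these requirements written by hand in plain pandas, without FrameCheck. The `manual_checks` helper and its message strings are hypothetical and exist only for illustration; they are not part of FrameCheck.

```python
import pandas as pd

def manual_checks(df: pd.DataFrame) -> list[str]:
    """Hand-written versions of some of the requirements above (illustrative only)."""
    problems = []
    # transaction_id must look like TXN followed by at least four digits
    if not df['transaction_id'].astype(str).str.fullmatch(r'TXN\d{4,}').all():
        problems.append("transaction_id has values not matching the TXN format")
    # user_id must be a positive integer
    if not pd.api.types.is_integer_dtype(df['user_id']) or (df['user_id'] < 1).any():
        problems.append("user_id must be a positive integer")
    # model_score must lie in [0, 1]
    if not df['model_score'].between(0.0, 1.0).all():
        problems.append("model_score must be between 0 and 1")
    # no missing values anywhere
    if df.isna().any().any():
        problems.append("DataFrame contains missing values")
    # business rule: scores above 0.9 must be flagged
    unflagged = (df['model_score'] > 0.9) & ~df['flagged_for_review'].astype(bool)
    if unflagged.any():
        problems.append(f"{int(unflagged.sum())} high-score row(s) not flagged for review")
    return problems

sample = pd.DataFrame({
    'transaction_id': ['TXN1001', 'TXN1002', 'TXN1003'],
    'user_id': [501, 502, 503],
    'model_score': [0.0, 0.92, 0.95],
    'flagged_for_review': [False, True, False],
})
problems = manual_checks(sample)
print(problems)
```

Each rule needs its own condition and message, and the helper still ignores the datetime, warning-level, and column-set requirements. That repetition is what a chained validator is meant to remove.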

Define the FrameCheck object and validate the DataFrame:

In [14]:
model_output_validator = (
    FrameCheck()
    .column('transaction_id', type='string', regex=r'^TXN\d{4,}$')
    .column('user_id', type='int', min=1)
    .column('transaction_time', type='datetime', before='now')
    .column('model_score', type='float', min=0.0, max=1.0)
    .column('model_score', type='float', not_in_set=[0.0], warn_only=True)
    .column('model_version', type='string')
    .column('flagged_for_review', type='bool')
    .custom_check(
        # bool(...) also handles numpy booleans, for which `is True` evaluates to False
        lambda row: row['model_score'] <= 0.9 or bool(row['flagged_for_review']),
        "flagged_for_review must be True when model_score > 0.9"
    )
    .not_null()
    .not_empty()
    .only_defined_columns()
)

result = model_output_validator.validate(df)

- Column 'model_score' contains disallowed values: [np.float64(0.0)].
- flagged_for_review must be True when model_score > 0.9 (failed on 1 row(s))


## Validation Results

FrameCheck shows two issues:

**Warning:**
- `model_score` contains value 0.0 (suspicious but allowed)

**Error:**
- Transaction with score > 0.9 not flagged for review (violates business rule)
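
To see why exactly one row violates the business rule, the rule can be applied to plain dictionaries, independent of pandas or FrameCheck. This is only a sketch; `high_scores_are_flagged` is a hypothetical name for the rule's logic.

```python
# The business rule as a standalone predicate: a row passes if its score is
# at most 0.9, or if it has been flagged for review.
def high_scores_are_flagged(row) -> bool:
    return row['model_score'] <= 0.9 or bool(row['flagged_for_review'])

rows = [
    {'transaction_id': 'TXN1001', 'model_score': 0.0,  'flagged_for_review': False},
    {'transaction_id': 'TXN1002', 'model_score': 0.92, 'flagged_for_review': True},
    {'transaction_id': 'TXN1003', 'model_score': 0.95, 'flagged_for_review': False},
]

failing = [r['transaction_id'] for r in rows if not high_scores_are_flagged(r)]
print(failing)  # ['TXN1003']
```

`TXN1002` also scores above 0.9 but passes because it is flagged; only `TXN1003` fails.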

## Using Validation Results

Get a summary of all validation issues:

In [16]:
print(result.summary())

Validation FAILED
Errors:
  - flagged_for_review must be True when model_score > 0.9 (failed on 1 row(s))
  - Column 'model_score' contains disallowed values: [np.float64(0.0)].


Identify the invalid rows:

In [17]:
invalid_rows = result.get_invalid_rows(df)
invalid_rows

| | transaction_id | user_id | transaction_time | model_score | model_version | flagged_for_review |
|---|---|---|---|---|---|---|
| 0 | TXN1001 | 501 | 2024-04-15 08:23:11 | 0.0 | v2.1.0 | False |
| 2 | TXN1003 | 503 | 2024-04-15 09:01:37 | 0.95 | v2.1.0 | False |


Log errors and warnings with the logger configured earlier:

In [20]:
if result.errors:
  logger.error(result.errors)

ERROR:model_validation:['flagged_for_review must be True when model_score > 0.9 (failed on 1 row(s))']


In [19]:
if result.warnings:
  logger.warning(result.warnings)
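
The logged output above suggests that `result.errors` is a list of message strings; assuming `result.warnings` has the same shape, logs can be easier to search with one record per issue. A minimal sketch using only the standard library; `log_issues` is a hypothetical helper, not part of FrameCheck, and the messages below are copied from the output shown earlier.

```python
import logging

def log_issues(logger: logging.Logger, errors, warnings) -> int:
    """Emit one log record per validation message; return the number emitted."""
    for message in warnings:
        logger.warning(message)
    for message in errors:
        logger.error(message)
    return len(warnings) + len(errors)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_validation")

# With FrameCheck this would be: log_issues(logger, result.errors, result.warnings)
emitted = log_issues(
    logger,
    errors=["flagged_for_review must be True when model_score > 0.9 (failed on 1 row(s))"],
    warnings=["Column 'model_score' contains disallowed values: [np.float64(0.0)]."],
)
```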



## Raising Exceptions

Use `.raise_on_error()` to raise an exception when the data is invalid.

In this example, we require that `id` be non-null and unique.

In [30]:
simple_df = pd.DataFrame({
    'id': ['A001', 'A001', 'A003', None],
    'value': [5, -1, 10, 7]
})

strict_validator = (
    FrameCheck()
    .column('id', type='string', regex=r'^A\d{3}$', not_null=True)
    .unique(columns=['id'])
    .raise_on_error()
)

# This will raise a ValueError with a detailed validation message
strict_validator.validate(simple_df)

ValueError: FrameCheck validation failed:
Column 'id' contains missing values.
Rows are not unique based on columns: ['id']
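
In a pipeline, the exception is usually caught at the boundary so the failure can be logged before the batch is rejected. The sketch below assumes only what is shown above: that `validate` raises `ValueError` when `.raise_on_error()` is set. `validate_or_reject` and the `_AlwaysFails` stand-in are hypothetical; the stand-in exists only so the snippet runs without FrameCheck installed.

```python
import logging

logger = logging.getLogger("model_validation")

def validate_or_reject(validator, df) -> bool:
    """Return True if validation passes, False if it raised ValueError."""
    try:
        validator.validate(df)
    except ValueError as exc:
        logger.error("Rejected batch: %s", exc)
        return False
    return True

class _AlwaysFails:
    """Stand-in for a FrameCheck validator built with .raise_on_error()."""
    def validate(self, df):
        raise ValueError("FrameCheck validation failed:\nColumn 'id' contains missing values.")

accepted = validate_or_reject(_AlwaysFails(), df=None)
print(accepted)  # False
```

With a real validator, the call would be `validate_or_reject(strict_validator, simple_df)`.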

## Design Patterns in Production

One of FrameCheck's principles is `No configuration files. Ever.`

To keep your codebase clean, you can define your FrameCheck objects in a module and import them.

#### validators.py

```python
from framecheck import FrameCheck

price_validator = (
    FrameCheck()
    .column('item_id', type='string')
    .column('price', type='float', min=0)
    .not_null()
)
```

#### main.py (or wherever)

```python
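import pandas as pd
from validators import price_validator

df = pd.DataFrame({'item_id': ['A1', 'A2'], 'price': [9.99, 4.50]})  # example data
result = price_validator.validate(df)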
```