# DataProf: In-Memory DataFrame Profiling

This notebook demonstrates profiling pandas DataFrames, polars DataFrames,
and PyArrow Tables in memory using DataProf's zero-copy Arrow PyCapsule interface.
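
For orientation, the Arrow PyCapsule interface is a set of dunder methods (`__arrow_c_schema__`, `__arrow_c_array__`, `__arrow_c_stream__`) that a producer exposes so a consumer can import its data through the Arrow C data interface. The sketch below shows a duck-typed check for that protocol. `supports_arrow_capsule` and `FakeTable` are hypothetical names used for illustration only and are not part of DataProf.

```python
def supports_arrow_capsule(obj) -> bool:
    """Return True if obj exposes an Arrow PyCapsule export method."""
    return any(
        callable(getattr(obj, method, None))
        for method in ("__arrow_c_stream__", "__arrow_c_array__")
    )


class FakeTable:
    """Hypothetical stand-in for a tabular object that exports Arrow data."""

    def __arrow_c_stream__(self, requested_schema=None):
        # Illustration only: a real producer would return a PyCapsule here.
        raise NotImplementedError("no real capsule is produced in this sketch")


print(supports_arrow_capsule(FakeTable()))  # True
print(supports_arrow_capsule([1, 2, 3]))    # False
```

A PyArrow Table, and recent pandas and polars releases, should pass the same check, though the exact versions that added support are worth confirming for your environment.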

### Requirements
```bash
pip install dataprof pandas polars pyarrow
```

In [1]:
import dataprof
import pandas as pd
import polars as pl
import pyarrow as pa

print(f"dataprof version: {dataprof.__version__}")

dataprof version: 0.5.0


## 1. Profiling a pandas DataFrame

Use `profile_dataframe()` to profile any pandas DataFrame directly,
without writing to disk first.

In [2]:
pdf = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", None],
    "age": [25, 30, 35, 28, None],
    "score": [85.5, 92.0, 78.3, 88.7, 95.2],
    "active": [True, False, True, True, False],
})

report = dataprof.profile_dataframe(pdf, name="pandas_example")
print(f"Quality Score: {report.quality_score():.1f}%")
print(f"Rows: {report.total_rows}, Columns: {report.total_columns}")
print(f"Scan time: {report.scan_time_ms}ms")

Quality Score: 100.0%
Rows: 5, Columns: 4
Scan time: 195ms




In [3]:
report.data_quality_metrics

## 2. Profiling a polars DataFrame

The same `profile_dataframe()` function auto-detects polars DataFrames
and uses the native `to_arrow()` path for zero-copy conversion.

In [4]:
pldf = pl.DataFrame({
    "product": ["Widget", "Gadget", "Doohickey", "Thingamajig"],
    "price": [9.99, 24.50, 5.75, 15.00],
    "quantity": [100, 50, 200, 75],
    "category": ["A", "B", "A", "C"],
})

report_pl = dataprof.profile_dataframe(pldf, name="polars_example")
print(f"Quality Score: {report_pl.quality_score():.1f}%")
print(f"Columns: {[col.name for col in report_pl.column_profiles]}")

Quality Score: 100.0%
Columns: ['quantity', 'product', 'category', 'price']
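
The auto-detection mentioned above is internal to DataProf and is not shown in this notebook. As a rough illustration of the idea, one common way to dispatch on the input library without importing it is to inspect the module that defines the object's type. The function name `detect_dataframe_library` and the stand-in class are hypothetical, and this is not DataProf's actual implementation.

```python
def detect_dataframe_library(obj) -> str:
    """Guess the source library from the top-level module of the object's type."""
    top_level = type(obj).__module__.split(".")[0]
    return top_level if top_level in {"pandas", "polars", "pyarrow"} else "unknown"


# Hypothetical stand-in: a class that reports a polars-like module path.
FakePolarsFrame = type("DataFrame", (), {"__module__": "polars.dataframe.frame"})

print(detect_dataframe_library(FakePolarsFrame()))  # polars
print(detect_dataframe_library({"a": [1, 2]}))      # unknown
```

Checking the module name avoids importing every supported library up front, at the cost of relying on naming conventions rather than on a formal protocol.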




In [5]:
report_pl.data_quality_metrics

## 3. Using `profile_arrow()` with a PyArrow Table

`profile_arrow()` is optimized for direct PyArrow input -- it skips
library auto-detection and goes straight to Arrow import.

In [6]:
table = pa.table({
    "id": pa.array([1, 2, 3, 4, 5]),
    "value": pa.array([10.5, 20.3, None, 40.1, 50.8]),
    "label": pa.array(["a", "b", "c", "d", "e"]),
})

report_arrow = dataprof.profile_arrow(table, name="arrow_example")
print(f"Quality Score: {report_arrow.quality_score():.1f}%")
print(f"Scan time: {report_arrow.scan_time_ms}ms")

Quality Score: 100.0%
Scan time: 0ms




## 4. Accessing Quality Metrics

DataProf provides quality metrics based on ISO 8000 and ISO/IEC 25012 across five dimensions:
- **Completeness** -- missing values, complete records
- **Consistency** -- data type conformance, format violations
- **Uniqueness** -- duplicate rows, key uniqueness
- **Accuracy** -- outliers, range violations
- **Timeliness** -- stale data, temporal violations
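
To make two of these dimensions concrete, here is a minimal pure-Python sketch of completeness and uniqueness ratios over a list of row tuples. The formulas (missing cells over total cells, rows without any `None` over total rows, repeated rows over total rows) are assumptions about what such metrics typically measure, not DataProf's actual definitions, and the sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical rows of (name, age); None marks a missing value.
rows = [
    ("Alice", 25),
    ("Bob", None),
    ("Alice", 25),  # exact duplicate of the first row
    (None, 40),
]

total_cells = sum(len(row) for row in rows)
missing_cells = sum(value is None for row in rows for value in row)
complete_rows = sum(all(value is not None for value in row) for row in rows)

# Every occurrence of a row beyond its first counts as a duplicate.
duplicate_rows = sum(count - 1 for count in Counter(rows).values())

missing_pct = 100 * missing_cells / total_cells
complete_pct = 100 * complete_rows / len(rows)
duplicate_pct = 100 * duplicate_rows / len(rows)

print(f"Missing values: {missing_pct:.1f}%")     # 25.0%
print(f"Complete records: {complete_pct:.1f}%")  # 50.0%
print(f"Duplicate rows: {duplicate_pct:.1f}%")   # 25.0%
```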

In [9]:
metrics = report.data_quality_metrics

print("=== Completeness ===")
print(metrics.completeness_summary())

print("\n=== Consistency ===")
print(metrics.consistency_summary())

print("\n=== Uniqueness ===")
print(metrics.uniqueness_summary())

print("\n=== Accuracy ===")
print(metrics.accuracy_summary())

print("\n=== Timeliness ===")
print(metrics.timeliness_summary())

print(f"\nOverall Quality Score: {metrics.overall_quality_score():.1f}%")

=== Completeness ===
Missing values: 0.0% | Complete records: 100.0% | Null columns: 0

=== Consistency ===
Type consistency: 100.0% | Format violations: 0 | Encoding issues: 0

=== Uniqueness ===

=== Accuracy ===
Outlier ratio: 0.0% | Range violations: 0 | Negative values in positive fields: 0

=== Timeliness ===
Future dates: 0 | Stale data: 0.0% | Temporal violations: 0

Overall Quality Score: 100.0%


## 5. Column-Level Profiles

In [8]:
for col in report.column_profiles:
    print(
        f"  {col.name}: type={col.data_type}, "
        f"nulls={col.null_percentage:.1f}%, "
        f"unique={col.unique_count}"
    )

  age: type=float, nulls=20.0%, unique=5
  score: type=float, nulls=0.0%, unique=5
  active: type=string, nulls=0.0%, unique=2
  name: type=string, nulls=20.0%, unique=4
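
Building on the attributes used in the loop above (`name` and `null_percentage`), a small helper can flag columns whose share of nulls exceeds a threshold. This is a sketch: `columns_above_null_threshold` is a hypothetical helper, and `SimpleNamespace` objects stand in for real column profiles so the snippet runs without DataProf installed. The values mirror the output printed above.

```python
from types import SimpleNamespace


def columns_above_null_threshold(column_profiles, threshold_pct=10.0):
    """Return the names of columns whose null_percentage exceeds threshold_pct."""
    return [
        col.name for col in column_profiles if col.null_percentage > threshold_pct
    ]


# Stand-ins for report.column_profiles, using the values printed above.
profiles = [
    SimpleNamespace(name="age", null_percentage=20.0),
    SimpleNamespace(name="score", null_percentage=0.0),
    SimpleNamespace(name="active", null_percentage=0.0),
    SimpleNamespace(name="name", null_percentage=20.0),
]

flagged = columns_above_null_threshold(profiles)
print(flagged)  # ['age', 'name']
```

With a real report, passing `report.column_profiles` should work the same way, assuming those attributes behave as they do in the loop.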
