# 📘 Data Quality Audit: `inspect_df`

This notebook produces a **client-facing data quality report** built on the `inspect_df` helper.

It is designed as a reusable asset for:
- Data audits
- Dataset onboarding diagnostics
- Pre-ML data health checks
- Client deliverables in data consulting missions


## 📑 Table of Contents

- [Part A: Executive & Strategic Overview](#part-a-2)
- [Part B: Technical Audit](#part-b-2)
- [Code Appendix & Execution](#part-c-2)
- [Run on cleaned animal dataset](#run-cleaned-2)
- [Generic usage for any CSV](#generic-2)


---


# 🧠 Part A: Executive & Strategic Overview
<a id="part-a-2"></a>

### 🎯 Purpose
This notebook provides an **end-to-end structural and quality assessment** of a tabular dataset.

Using a single function, `inspect_df`, it quickly answers the questions:
- *What does this dataset look like structurally?*
- *Which columns are present, and in which formats?*
- *How much missingness, duplication, or inconsistency should we expect?*
- *Is the dataset ready for downstream EDA or machine learning, or do we need cleaning first?*

### 💼 Value for stakeholders
- Reduces uncertainty around data readiness.
- Exposes data quality risks **before** investing time in modeling.
- Creates a transparent, shareable audit trail between data teams and business teams.
- Can be reused across projects as a standard *Data Quality Report*.


### 🔗 Position in the pipeline

This inspection sits **early in the data lifecycle**, typically after raw ingestion and basic structural fixes.

Example pipeline:
1. **Autofix**: normalize the CSV separator and column names (initial structural cleaning).
2. **`inspect_df` (this notebook)**: perform a full structural and statistical scan.
3. **Cleaning / Parsing**: advanced normalization, mapping, feature engineering.
4. **EDA & Modeling**: visualizations, feature selection, ML models.

Integrating `inspect_df` early avoids wasting time building models on broken data.


### ❓ Key questions this report answers

- Are there **enough rows and columns** to support the planned analysis?
- Which columns are **numeric vs categorical**, and are the types correct?
- Where do we see **missing values**, and how severe is the problem?
- Are there **duplicate rows** that might bias statistics or ML training?
- Do any columns show suspicious distributions (e.g. constant values, extreme imbalance)?
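
The last question (constant values, extreme imbalance) can be read off the `nunique()` and `describe()` output, but a dedicated check makes it explicit. The following is a minimal sketch, not part of `inspect_df`: the helper name `flag_suspicious_columns` and the `dominance_threshold` cutoff are assumptions chosen for illustration.

```python
import pandas as pd


def flag_suspicious_columns(df: pd.DataFrame, dominance_threshold: float = 0.95) -> pd.DataFrame:
    """Flag constant columns and columns dominated by a single value.

    `dominance_threshold` is an illustrative cutoff, not a rule from this notebook.
    """
    rows = []
    for col in df.columns:
        # Frequency of each distinct non-null value, most frequent first.
        counts = df[col].value_counts(dropna=True)
        n_non_null = int(counts.sum())
        top_share = float(counts.iloc[0] / n_non_null) if n_non_null else float("nan")
        rows.append(
            {
                "column": col,
                "n_unique": int(counts.size),
                "top_share": top_share,
                # A column with zero or one distinct non-null value carries no signal.
                "is_constant": counts.size <= 1,
                "is_dominated": bool(n_non_null) and bool(top_share >= dominance_threshold),
            }
        )
    columns = ["column", "n_unique", "top_share", "is_constant", "is_dominated"]
    return pd.DataFrame(rows, columns=columns).set_index("column")
```

Calling `flag_suspicious_columns(df)` returns one row per column, which can be filtered on `is_constant` or `is_dominated` before deciding what to drop or investigate.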


---

# 🛠️ Part B: Technical Audit (How It Works)
<a id="part-b-2"></a>

We now document the `inspect_df` utility function and how to interpret each block of output.


## 1. Original function

```python
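def inspect_df(df: pd.DataFrame, name: str | None = None, n: int = 20) -> None:
    # Assumes `import pandas as pd` and `from IPython.display import display`.
    title = f"=== DataFrame Inspection: {name} ===" if name else "=== DataFrame Inspection ==="
    print(f"\n{title}")  # title already carries its own === markers
    print("=" * len(title))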

    print("\n=== Dimension ===")
    print(df.shape)

    print("\n=== DF Info ===")
    df.info()

    print(f"\n=== {n} First Rows ===")
    display(df.head(n))

    print(f"\n=== {n} Random Rows ===")
    display(df.sample(min(n, len(df)), random_state=42))  # cap n: sample() raises if n > len(df)

    print("\n=== Descriptive Stats ===")
    display(df.describe(include="all").T)

    print("\n=== Unique Value ===")
    print(df.nunique())

    print("\n=== Number of NaN Values ===")
    print(df.isna().sum())

    print("\n=== Number of Duplicates Rows ===")
    print(df.duplicated().sum())

    print("\n=== Duplicates Rows ===")
    print(df[df.duplicated()])
```


### 1.1 Function signature and title

- `df: pd.DataFrame` → the dataset to inspect.
- `name: str | None` → optional label used in the report title.
- `n: int` → number of rows to display for head and random sampling.

The title banner:

```python
title = f"=== DataFrame Inspection: {name} ===" if name else "=== DataFrame Inspection ==="
print(f"\n{title}")  # title already carries its own === markers
print("=" * len(title))
```

gives a clear entry point for readers, especially when multiple datasets are inspected in the same notebook.


### 1.2 Dimensions and schema

```python
print("\n=== Dimension ===")
print(df.shape)

print("\n=== DF Info ===")
df.info()
```

- `df.shape` reports `(n_rows, n_cols)`.
- `df.info()` exposes:
  - column names
  - data types (`int64`, `float64`, `object`, etc.)
  - non-null counts

**Interpretation:**
- A sudden mismatch between expected columns and actual columns often signals ingestion issues.
- Incorrect dtypes (e.g. numeric columns stored as `object`) indicate the need for parsing / casting.
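
Both interpretation points can be turned into quick checks. This is a sketch under assumptions: the helper names `check_schema` and `numeric_looking_object_columns`, the expected column list, and the `min_share` cutoff are all hypothetical and not part of `inspect_df`.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def check_schema(df: pd.DataFrame, expected_columns: list[str]) -> dict[str, list[str]]:
    """Compare actual columns against an expected list supplied by the caller."""
    actual = list(df.columns)
    return {
        "missing": [c for c in expected_columns if c not in actual],
        "unexpected": [c for c in actual if c not in expected_columns],
    }


def numeric_looking_object_columns(df: pd.DataFrame, min_share: float = 0.9) -> list[str]:
    """Return text-typed columns whose non-null values mostly parse as numbers."""
    candidates = []
    for col in df.columns:
        if not (is_object_dtype(df[col]) or is_string_dtype(df[col])):
            continue  # already numeric, datetime, bool, etc.
        values = df[col].dropna()
        if values.empty:
            continue
        # Values that cannot be parsed become NaN instead of raising.
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= min_share:
            candidates.append(col)
    return candidates
```

Columns returned by `numeric_looking_object_columns` are candidates for casting in the cleaning step; the function only reports them and leaves the DataFrame unchanged.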


### 1.3 First and random rows

```python
display(df.head(n))
display(df.sample(min(n, len(df)), random_state=42))  # cap n: sample() raises if n > len(df)
```

- `head(n)` shows the **top rows**, where header and parsing problems (shifted columns, stray header lines) usually surface first.
- `sample(n)` shows **rows from across the dataset**; the fixed seed (`random_state=42`) makes the same rows appear on every run.

**Why both are needed:**
- Top rows may look clean while anomalies live deeper in the file.
- Sampling avoids a biased view based only on the first rows.


### 1.4 Descriptive statistics, cardinality, missingness, duplicates

```python
display(df.describe(include="all").T)

print(df.nunique())
print(df.isna().sum())
print(df.duplicated().sum())
print(df[df.duplicated()])
```

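- `df.describe(include="all").T` → numeric and categorical statistics in one table; statistics that do not apply to a column's type (for example `mean` on a text column) appear as `NaN`.
- `df.nunique()` → number of unique values per column (helps detect IDs, categories, binary flags).
- `df.isna().sum()` → missing values per column (critical for imputation strategy).
- `df.duplicated()` → flags fully duplicated rows; with the default `keep="first"`, the first occurrence is not flagged, only its later copies.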

**Interpretation tips for a client:**
- High missingness (as a rule of thumb, above 30–40%) on key features suggests the need for either domain-specific imputation or feature dropping.
- Extremely low cardinality columns (e.g. only 1 unique value) may not add value to ML models.
- Duplicates in transactional or event data can significantly bias metrics and forecasts.
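
These tips can be condensed into one per-column table. The sketch below is an illustration only: `quality_summary`, `duplicate_groups`, and the `missing_threshold` default of 0.30 (the lower end of the range quoted above) are assumptions, not part of `inspect_df`.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, missing_threshold: float = 0.30) -> pd.DataFrame:
    """One row per column: dtype, share of missing values, cardinality, and flags."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_pct": df.isna().mean(),  # fraction between 0 and 1
            "n_unique": df.nunique(),         # NaN is not counted as a value
        }
    )
    summary["high_missing"] = summary["missing_pct"] > missing_threshold
    summary["single_value"] = summary["n_unique"] <= 1
    return summary.sort_values("missing_pct", ascending=False)


def duplicate_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row that belongs to a fully duplicated group, originals included."""
    return df[df.duplicated(keep=False)]
```

`duplicate_groups` uses `keep=False`, so it shows the original row next to its copies, whereas `df[df.duplicated()]` shows only the copies.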


---

# 💻 Code Appendix & Execution
<a id="part-c-2"></a>
Below is a self-contained implementation of `inspect_df` and example execution cells.


```python
import os
import pandas as pd
from IPython.display import display
```


```python
def inspect_df(df: pd.DataFrame, name: str | None = None, n: int = 20) -> None:
    """Comprehensive DataFrame inspection utility.

    Parameters
    ----------
    df : pd.DataFrame
        The dataset to inspect.
    name : str | None, optional
        Optional label displayed in the report header.
    n : int, default=20
        Number of rows to show in head() and sample().
    """

    title = f"=== DataFrame Inspection: {name} ===" if name else "=== DataFrame Inspection ==="
    print(f"\n{title}")  # title already carries its own === markers
    print("=" * len(title))

    print("\n=== Dimension ===")
    print(df.shape)

    print("\n=== DF Info ===")
    df.info()

    print(f"\n=== {n} First Rows ===")
    display(df.head(n))

    print(f"\n=== {n} Random Rows ===")
    # Cap the sample size: sample() raises ValueError if n exceeds the row count.
    display(df.sample(min(n, len(df)), random_state=42))

    print("\n=== Descriptive Stats ===")
    display(df.describe(include="all").T)

    print("\n=== Unique Values ===")
    print(df.nunique())

    print("\n=== Number of NaN Values ===")
    print(df.isna().sum())

    print("\n=== Number of Duplicate Rows ===")
    print(df.duplicated().sum())

    print("\n=== Duplicate Rows ===")
    print(df[df.duplicated()])
```


## ▶️ Run the inspection on your cleaned animal dataset
<a id="run-cleaned-2"></a>

By default, we target the file produced by the `autofix` step:
`data/raw/animal_data_dirty_reworked.csv`.


```python
DATA_PATH = "data/raw/animal_data_dirty_reworked.csv"

if os.path.exists(DATA_PATH):
    df = pd.read_csv(DATA_PATH, sep=";")
    filename = os.path.basename(DATA_PATH)
    inspect_df(df, name=filename, n=20)
else:
    print(f"❌ File not found: {DATA_PATH}")
```


## 🔁 Generic usage for any CSV
<a id="generic-2"></a>
You can reuse this notebook across projects by changing `custom_path` below.


```python
# Example: adapt this path to any dataset you want to audit
custom_path = "data/raw/your_other_dataset.csv"

if os.path.exists(custom_path):
    df_custom = pd.read_csv(custom_path, sep=";")
    inspect_df(df_custom, name=os.path.basename(custom_path), n=20)
else:
    print(f"(Info) Custom path does not exist yet: {custom_path}")
```
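
The cells above hard-code `sep=";"`, which may not match files that did not go through the `autofix` step. A minimal sketch, assuming you would rather let pandas infer the delimiter: passing `sep=None` together with `engine="python"` makes `read_csv` sniff the separator from the file. The helper name `read_csv_any_separator` is hypothetical.

```python
import pandas as pd


def read_csv_any_separator(path: str) -> pd.DataFrame:
    """Read a CSV without knowing its delimiter in advance.

    With sep=None, pandas infers the delimiter using Python's csv.Sniffer;
    this is only supported by the slower "python" parsing engine.
    """
    return pd.read_csv(path, sep=None, engine="python")
```

Sniffing is a heuristic, so it is worth checking `df.shape` after loading: a single-column result usually means the delimiter was guessed wrong and should be passed explicitly.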
