<a href="https://colab.research.google.com/github/Renuka239/Deep-Learning-Project/blob/main/Pydantic_Dataset_Profiler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I built a dataset ingestion and profiling layer using Pydantic schemas. Any uploaded CSV is converted into a DatasetSchema capturing column metadata, inferred data types, missingness, and statistics, along with a DataPreview for downstream analytics, visualization, and LLM interaction.

Cell 1: Install packages
The -q flag just means “quiet” (pip prints less output).

In [8]:
!pip -q install pandas pydantic numpy


Cell 2: Imports
typing → for type hints in Pydantic models
datetime → timestamp when schema is created
io → read uploaded file bytes as a file-like object
pandas → read and manipulate CSV
numpy → handle NaN and numeric conversions
Pydantic → define schemas (your “key data structures”)
Colab files → browser-based file upload

In [9]:
from __future__ import annotations

from typing import Any, Dict, List, Optional, Literal
from datetime import datetime
import io

import pandas as pd
import numpy as np
from pydantic import BaseModel, Field, ConfigDict

from google.colab import files


Cell 3: Pydantic Schemas (KEY DATA STRUCTURES)
1) DataType - Restricts column types to a fixed set of logical values, so raw pandas dtype strings like "int64" or "object" are rejected.
2) ColumnSchema - Represents ONE column in the dataset. Each column stores:
name → column name
dtype → inferred logical type
nullable → does it contain missing values?
unique → is every non-null value distinct (does it look like an ID)?
Statistics:
non-null count
null count + percentage
distinct values
min / max / mean (only for numeric columns)
This is your column-level metadata.

3) DatasetSchema - Represents the ENTIRE dataset. It stores:
dataset name (uploaded filename)
timestamp (when the schema was created)
number of rows
number of columns
list of ColumnSchema
Think of this as a data catalog entry.

4) DataPreview - A safe sample of the dataset. Why it exists: you don’t want to send the full data to UIs/LLMs, just a few JSON-safe rows. It stores:
dataset name
sample size
first few rows as dictionaries

In [10]:
DataType = Literal["string", "integer", "number", "boolean", "datetime", "category", "unknown"]

class ColumnSchema(BaseModel):
    model_config = ConfigDict(extra="forbid")

    name: str
    dtype: DataType
    nullable: bool
    unique: bool

    non_null_count: int
    null_count: int
    null_pct: float

    distinct_count: int
    min: Optional[float] = None
    max: Optional[float] = None
    mean: Optional[float] = None


class DatasetSchema(BaseModel):
    model_config = ConfigDict(extra="forbid")

    dataset_name: str
    created_at: datetime = Field(default_factory=datetime.utcnow)

    row_count: int
    column_count: int
    columns: List[ColumnSchema]


class DataPreview(BaseModel):
    model_config = ConfigDict(extra="forbid")

    dataset_name: str
    sample_size: int
    rows: List[Dict[str, Any]]
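
To make the effect of the Literal type and extra="forbid" concrete, here is a minimal sketch. MiniColumn is a trimmed, hypothetical stand-in for ColumnSchema (fewer fields), and the rejected helper exists only for this illustration.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError

DataType = Literal["string", "integer", "number", "boolean", "datetime", "category", "unknown"]


class MiniColumn(BaseModel):
    """Hypothetical, trimmed stand-in for ColumnSchema."""
    model_config = ConfigDict(extra="forbid")

    name: str
    dtype: DataType
    nullable: bool


def rejected(**kwargs) -> bool:
    """Return True if MiniColumn refuses these arguments."""
    try:
        MiniColumn(**kwargs)
    except ValidationError:
        return True
    return False


# A logical type from the allowed set is accepted.
ok = MiniColumn(name="Sales", dtype="number", nullable=False)

# A raw pandas dtype string is not one of the allowed literals.
bad_dtype = rejected(name="Sales", dtype="float64", nullable=False)

# extra="forbid" rejects fields the schema does not declare.
extra_field = rejected(name="Sales", dtype="number", nullable=False, units="USD")

print(bad_dtype, extra_field)
```

Both calls are rejected, so a typo in a field name or an unmapped dtype fails loudly at construction time instead of flowing into downstream consumers.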


Cell 4: Helper function (data type inference)
This function decides:

“What kind of data is this column logically?”

Checks happen in order:
Integer → "integer"
Float → "number"
Boolean → "boolean"
Datetime → "datetime"
Low-cardinality (fewer than 20 distinct values) → "category"
Otherwise → "string"
Because it relies only on pandas dtypes and cardinality, the code is dataset-agnostic.

In [11]:
def infer_dtype(series: pd.Series) -> DataType:
    if pd.api.types.is_integer_dtype(series):
        return "integer"
    if pd.api.types.is_float_dtype(series):
        return "number"
    if pd.api.types.is_bool_dtype(series):
        return "boolean"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "datetime"
    if series.nunique(dropna=True) < 20:
        return "category"
    return "string"
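
One caveat: pandas reads CSV date columns as plain strings unless it is asked to parse them, so the datetime check in infer_dtype only fires for columns that were already converted (in the schema output further down, 'Order Date' is reported as 'string'). A possible extension, sketched here as an assumption rather than as part of the notebook, is to test whether most values parse as dates. The helper name looks_like_datetime and the 0.95 threshold are illustrative.

```python
import pandas as pd


def looks_like_datetime(series: pd.Series, min_parsed: float = 0.95) -> bool:
    """Hypothetical helper (not part of the notebook): True when at least
    `min_parsed` of the non-null values parse as dates."""
    values = series.dropna()
    # Numeric and boolean columns are already covered by the dtype checks.
    if values.empty or pd.api.types.is_numeric_dtype(values):
        return False
    # errors="coerce" turns unparseable entries into NaT instead of raising.
    parsed = pd.to_datetime(values.astype(str), errors="coerce")
    return bool(parsed.notna().mean() >= min_parsed)


iso_dates = pd.Series(["2017-11-08", "2017-11-11", None])
labels = pd.Series(["apple", "banana", "cherry"])

print(looks_like_datetime(iso_dates))
print(looks_like_datetime(labels))
```

Such a check could sit just before the low-cardinality test. Ambiguous day/month formats such as 08/11/2017 would still need an explicit format or dayfirst argument to pd.to_datetime to be interpreted correctly.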


Cell 5: Upload CSV (Colab browser upload)
This opens a browser upload dialog.
What’s happening:
files.upload() returns each uploaded file as bytes, keyed by filename
io.BytesIO converts bytes → file-like object
Pandas reads it like a normal CSV
Now df is a DataFrame.

In [19]:
uploaded = files.upload()  # choose a CSV file in the browser

if len(uploaded) == 0:
    raise ValueError("No file uploaded. Please upload a CSV file.")

filename = list(uploaded.keys())[0]
df = pd.read_csv(io.BytesIO(uploaded[filename]))

print("Loaded file:", filename)
df.head()


Saving OnlineRetail.csv to OnlineRetail (1).csv


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 79780: invalid start byte
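
The traceback above is what pandas raises when the uploaded file is not UTF-8: read_csv decodes as UTF-8 by default, and byte 0xa3 is the pound sign (£) in Latin-1 and Windows-1252. One possible workaround, sketched under the assumption that the file uses one of a few common encodings, is to retry with fallbacks. The helper name read_csv_bytes and the encoding list are illustrative, not part of the notebook.

```python
import io

import pandas as pd


def read_csv_bytes(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> pd.DataFrame:
    """Hypothetical helper: try each encoding in turn until one decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw), encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise ValueError(f"Could not decode CSV with any of {encodings}") from last_error


# Simulate a non-UTF-8 upload: "£" is the single byte 0xa3 in Latin-1.
raw_demo = "item,price\ntea,£5\n".encode("latin-1")
df_demo = read_csv_bytes(raw_demo)
print(df_demo)
```

In the upload cell this would replace the direct pd.read_csv call, for example df = read_csv_bytes(uploaded[filename]). Note that latin-1 maps every byte to a character, so as a last resort it never fails but may silently mis-decode text that is in some other encoding.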

Cell 6: Build DatasetSchema (core logic)
This is where raw data → structured schema happens. For each column:
Infer data type
Count missing values
Compute missing percentage
Count distinct values
Check uniqueness
If numeric → compute min, max, mean
All these values are passed into ColumnSchema.
Each column schema is appended to a list.
Create DatasetSchema - This bundles:
filename
row count
column count
all column schemas
Now you have one validated object representing the dataset.

In [17]:
columns = []
rows = len(df)

for col in df.columns:
    s = df[col]
    dtype = infer_dtype(s)

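    null_count = int(s.isna().sum())
    non_null_count = rows - null_count
    # guard against a header-only file (rows == 0) to avoid ZeroDivisionError
    null_pct = round((null_count / rows) * 100, 2) if rows else 0.0

    distinct_count = int(s.nunique(dropna=True))
    # a column with no non-null values should not be flagged as unique
    unique = non_null_count > 0 and distinct_count == non_null_count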

    min_v = max_v = mean_v = None
    if dtype in ["integer", "number"]:
        # numeric stats (safe conversion)
        numeric = pd.to_numeric(s, errors="coerce")
        if numeric.notna().any():
            min_v = float(numeric.min())
            max_v = float(numeric.max())
            mean_v = float(numeric.mean())

    columns.append(
        ColumnSchema(
            name=str(col),
            dtype=dtype,
            nullable=null_count > 0,
            unique=bool(unique),
            non_null_count=non_null_count,
            null_count=null_count,
            null_pct=null_pct,
            distinct_count=distinct_count,
            min=min_v,
            max=max_v,
            mean=mean_v,
        )
    )

dataset_schema = DatasetSchema(
    dataset_name=filename,
    row_count=rows,
    column_count=len(df.columns),
    columns=columns
)

dataset_schema


DatasetSchema(dataset_name='Superstore Sales Dataset-train.csv', created_at=datetime.datetime(2026, 2, 7, 0, 16, 34, 365875), row_count=9800, column_count=18, columns=[ColumnSchema(name='Row ID', dtype='integer', nullable=False, unique=True, non_null_count=9800, null_count=0, null_pct=0.0, distinct_count=9800, min=1.0, max=9800.0, mean=4900.5), ColumnSchema(name='Order ID', dtype='string', nullable=False, unique=False, non_null_count=9800, null_count=0, null_pct=0.0, distinct_count=4922, min=None, max=None, mean=None), ColumnSchema(name='Order Date', dtype='string', nullable=False, unique=False, non_null_count=9800, null_count=0, null_pct=0.0, distinct_count=1230, min=None, max=None, mean=None), ColumnSchema(name='Ship Date', dtype='string', nullable=False, unique=False, non_null_count=9800, null_count=0, null_pct=0.0, distinct_count=1326, min=None, max=None, mean=None), ColumnSchema(name='Ship Mode', dtype='category', nullable=False, unique=False, non_null_count=9800, null_count=0, nu
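
Since the schema is meant for downstream consumers, it helps to see how a validated object can be written to JSON and read back. The sketch below uses trimmed, hypothetical stand-ins (MiniColumn, MiniDataset) rather than the full models, and a timezone-aware timestamp in place of the notebook's datetime.utcnow default; model_dump_json and model_validate_json are the Pydantic v2 methods.

```python
from datetime import datetime, timezone
from typing import List

from pydantic import BaseModel, Field


class MiniColumn(BaseModel):
    """Hypothetical, trimmed stand-in for ColumnSchema."""
    name: str
    dtype: str
    null_pct: float


class MiniDataset(BaseModel):
    """Hypothetical, trimmed stand-in for DatasetSchema."""
    dataset_name: str
    # Assumption: timezone-aware UTC timestamp instead of datetime.utcnow.
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    row_count: int
    columns: List[MiniColumn]


schema = MiniDataset(
    dataset_name="example.csv",
    row_count=3,
    columns=[MiniColumn(name="Sales", dtype="number", null_pct=0.0)],
)

# Serialize to a JSON string (datetimes become ISO 8601 text).
payload = schema.model_dump_json(indent=2)

# Parse and re-validate the JSON back into a model instance.
restored = MiniDataset.model_validate_json(payload)

print(payload)
```

The same two calls should work on dataset_schema itself, which makes it straightforward to store the profile next to the data or pass it to another service.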

Cell 7: Create DataPreview
Key things here:
Takes first 5 rows
Converts NaN → None (JSON-safe)
Converts rows into list of dictionaries
This is useful for:
UI display
LLM prompts
Debugging
Logs

In [18]:
sample = df.head(5)
preview = DataPreview(
    dataset_name=filename,
    sample_size=len(sample),  # actual number of rows returned (may be fewer than 5)
    rows=sample.replace({np.nan: None}).to_dict(orient="records"),
)

preview


DataPreview(dataset_name='Superstore Sales Dataset-train.csv', sample_size=5, rows=[{'Row ID': 1, 'Order ID': 'CA-2017-152156', 'Order Date': '08/11/2017', 'Ship Date': '11/11/2017', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420.0, 'Region': 'South', 'Product ID': 'FUR-BO-10001798', 'Category': 'Furniture', 'Sub-Category': 'Bookcases', 'Product Name': 'Bush Somerset Collection Bookcase', 'Sales': 261.96}, {'Row ID': 2, 'Order ID': 'CA-2017-152156', 'Order Date': '08/11/2017', 'Ship Date': '11/11/2017', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420.0, 'Region': 'South', 'Product ID': 'FUR-CH-10000454', 'Category': 'Furniture', 'Sub-Category': 'Chairs', 'Product Name': 'Hon Deluxe 
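
If the preview has to survive json.dumps in every case (missing values, and dates once columns are parsed), one alternative is to let pandas do the JSON conversion and load the result back into Python objects. This is a sketch under that assumption; json_safe_rows is a hypothetical helper name and the small DataFrame is made up for the demonstration.

```python
import json

import pandas as pd


def json_safe_rows(df: pd.DataFrame, n: int = 5) -> list:
    """Hypothetical helper: first `n` rows as JSON-safe dictionaries."""
    # to_json writes NaN/NaT as null and datetimes as ISO 8601 strings;
    # json.loads then turns null into None.
    return json.loads(df.head(n).to_json(orient="records", date_format="iso"))


df_demo = pd.DataFrame(
    {"Row ID": [1, 2], "Postal Code": [42420.0, float("nan")]}
)
rows = json_safe_rows(df_demo)

print(rows)
print(len(rows))
```

Using len(rows) as the sample size keeps the reported count accurate when the dataset has fewer rows than requested.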