
ENH: preview_csv(***.csv) for Fast First-N-Line Preview on Large Plus Size (>100GB) #61281


Open
1 of 3 tasks
visheshrwl opened this issue Apr 13, 2025 · 3 comments
Labels
Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

@visheshrwl

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The current pandas.read_csv() implementation is designed for robust and complete CSV parsing. However, even when users request only a few lines using nrows=X, the function:

  • Initializes the full parsing engine
  • Performs column-wise type inference
  • Scans for delimiter/header consistency
  • May read a large portion or all of the file, even for small previews

For large datasets (10–100GB CSVs), this results in significant I/O, CPU, and memory overhead — all when the user likely just wants a quick preview of the data.

This is a common pattern in:

  • Exploratory Data Analysis (EDA)
  • Data cataloging and profiling
  • Schema validation or column sniffing
  • Dashboards and notebook tooling

Currently, users resort to workarounds like:

next(pd.read_csv(..., chunksize=5))

or shell-level hacks like:

head -n 5 large_file.csv

These workarounds are either non-intuitive, return unstructured data, or live outside the pandas ecosystem.
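A portable alternative to the shell hack is possible in pure Python: itertools.islice stops reading as soon as N lines have been consumed. A minimal sketch (using an in-memory buffer to stand in for a large file on disk):

```python
import csv
import io
from itertools import islice

# In-memory buffer standing in for a large CSV file on disk.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n")

reader = csv.reader(data)
header = next(reader)            # consume the header line
rows = list(islice(reader, 3))   # read exactly 3 more rows, then stop

print(header)  # ['a', 'b', 'c']
print(rows)    # [['1', '2', '3'], ['4', '5', '6'], ['7', '8', '9']]
```

This is fast because nothing past the requested lines is ever parsed, but it returns raw lists rather than a DataFrame.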

Feature Description

Introduce a new function:

pandas.preview_csv(filepath_or_buffer, nrows=5, ...)

Goals

  • Read only the first n rows plus the header line
  • Avoid loading, or inferring types from, the full dataset
  • No full column validation
  • Fall back to object dtype unless dtype_infer=True
  • Support basic options like delimiter, encoding, and header presence

Proposed API:

def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
    as_generator: bool = False
) -> pd.DataFrame:
    ...
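Under these semantics, a rough implementation could be little more than csv.reader with early stopping, wrapped in a DataFrame. The sketch below is illustrative only, not pandas API; the as_generator and dtype_infer=True paths are omitted for brevity:

```python
import csv
import os
import tempfile
from itertools import islice

import pandas as pd

def preview_csv(filepath_or_buffer, nrows=5, delimiter=",",
                encoding="utf-8", has_header=True):
    """Read only the header plus the first `nrows` data rows.

    All columns come back as object dtype: no type inference,
    no full-file validation, no reading past line nrows + 1.
    """
    with open(filepath_or_buffer, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader) if has_header else None
        rows = list(islice(reader, nrows))  # stop after nrows lines
    return pd.DataFrame(rows, columns=header, dtype=object)

# Demo: preview a tiny temporary CSV.
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
tmp.write("a,b\n1,2\n3,4\n5,6\n")
tmp.close()
df = preview_csv(tmp.name, nrows=2)
print(df.shape)  # (2, 2)
os.unlink(tmp.name)
```

Because the file handle is only ever advanced nrows + 1 lines, the cost is independent of total file size.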

Alternative Solutions

| Tool / Method | Behavior | Limitation |
| --- | --- | --- |
| pd.read_csv(nrows=X) | Reads entire file into memory, performs dtype inference and column validation | Not optimized for quick previews; incurs overhead even for small nrows |
| pd.read_csv(chunksize=X) | Returns an iterator of chunks (DataFrames of size X) | Requires non-intuitive iterator handling; users often want a DataFrame directly |
| csv.reader + slicing | Python's built-in CSV reader is lightweight and fast | Returns raw lists, not a DataFrame; lacks header handling and column inference |
| subprocess.run(["head", "-n", ...]) | OS-level utility that returns the first N lines | Not portable across platforms; doesn't integrate with the DataFrame workflow |
| Polars: pl.read_csv(..., n_rows=X) | Rust-based, blazing-fast CSV reader | Requires installing a new library; pandas users may not want to switch ecosystems |
| Dask: dd.read_csv(...).head() | Lazy, out-of-core loading with chunked processing | Overhead of a distributed engine is unnecessary for simple previews |
| open(...).readlines(N) | Naive Python read of the first N lines | Doesn't handle parsing, delimiters, or schema properly |
| pyarrow.csv.read_csv(...)[0:X] | Efficient Arrow-based preview | Requires Apache Arrow APIs; returns Arrow tables unless converted |

While workarounds exist, none provide a clean, idiomatic, native pandas function to:

  • Efficiently load the first N rows
  • Return a DataFrame immediately
  • Avoid dtype inference
  • Skip full file validation
  • Avoid requiring third-party dependencies

A dedicated pandas.preview_csv() would fill this gap and offer an elegant, performant solution for quick data previews.

Additional Context

No response

@visheshrwl visheshrwl added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 13, 2025
@rhshadrach
Member

Thanks for the request. Having to maintain an entirely different code path that does very similar things to read_csv seems to me to be a non-starter. I would like to understand why read_csv could not be improved to fit this purpose.

@rhshadrach rhshadrach added IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 14, 2025
@visheshrwl
Author

Thank you for the thoughtful feedback, @rhshadrach!

I completely understand the reluctance to maintain a separate code path - especially in a core function like read_csv(), which already carries significant complexity.

read_csv() is designed for full-fidelity, schema-validated and optionally type-inferred ingestion. Introducing conditional short circuits for preview-style use cases pollutes that logic and increases branching inside a hot, complex code path.

On the other hand, a dedicated preview_csv() function:

  • Defines a minimal contract: "Read the top N rows quickly with minimal parsing"
  • Requires no inference or post-processing logic
  • Makes the behaviour explicit, predictable, and easy to optimize separately.

From a user intent perspective:

  • read_csv(nrows=X) implies: "I want a truncated but fully parsed and inferred subset of the data"
  • preview_csv(nrows=X) would mean: "I just want to see the first X lines, as fast as possible - even if it's untyped or partially parsed."

This distinction matters - especially in workflows where previewing is decoupled from actual analysis, such as:

  • Data cataloging
  • EDA profiling
  • Schema sniffing
  • Logging pipelines

Any performance optimization embedded in read_csv() must:

  • Preserve dozens of edge cases
  • Remain compatible with all backends (C, Python, Arrow-based readers)
  • Honor ~50+ keyword arguments (dtype, parse_dates, converters, skiprows, etc.)

This would introduce non-trivial complexity and testing burden to a critical code path and create surface area for subtle regressions.

Both polars.read_csv(..., n_rows=X) and vaex.open(...).head(X) implement optimized preview semantics using fast readers with early stopping. These tools don't override their full read_csv() equivalents - they recognize the preview use case is distinct.

Pandas could adopt a similar design without breaking the existing contract of read_csv().

If approved, I'm happy to:

  • Own the implementation of preview_csv()
  • Benchmark it vs read_csv() under real workloads (10GB+)
  • Keep it behind a dedicated namespace (e.g. pandas.io.preview)
  • Ensure full test coverage and documentation.

I'd love your thoughts, and any preferred entry point you'd recommend so this stays modular and maintainable long-term.

Thanks again!

@rhshadrach
Member

Can you post sample data and benchmarks demonstrating the performance issue with specifying nrows=N?
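As a starting point, a self-contained harness along these lines could be used (illustrative only: the 100,000-row synthetic file stands in for the multi-GB case, and timings vary by machine and parser engine). Note that read_csv's C engine may already stop reading early when nrows is given, which is exactly what such a benchmark would settle:

```python
import csv
import os
import tempfile
import time
from itertools import islice

import pandas as pd

# Generate a modest synthetic CSV (scale num_rows up toward a real benchmark).
num_rows = 100_000
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
writer = csv.writer(tmp)
writer.writerow(["a", "b", "c"])
writer.writerows([i, i * 2, i * 3] for i in range(num_rows))
tmp.close()

# Time pandas with nrows=5.
t0 = time.perf_counter()
df = pd.read_csv(tmp.name, nrows=5)
t_pandas = time.perf_counter() - t0

# Time a raw early-stopping read of the first 6 lines (header + 5 rows).
t0 = time.perf_counter()
with open(tmp.name, newline="") as f:
    rows = list(islice(csv.reader(f), 6))
t_raw = time.perf_counter() - t0

print(f"read_csv(nrows=5): {t_pandas:.4f}s; raw islice: {t_raw:.4f}s")
os.unlink(tmp.name)
```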
