
ENH: preview_csv(***.csv) for Fast First-N-Line Preview on Large Plus Size (>100GB) #61281


Open
1 of 3 tasks
visheshrwl opened this issue Apr 13, 2025 · 3 comments
Labels
Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

@visheshrwl

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The current pandas.read_csv() implementation is designed for robust and complete CSV parsing. However, even when users request only a few lines using nrows=X, the function:

  • Initializes the full parsing engine
  • Performs column-wise type inference
  • Scans for delimiter/header consistency
  • May read a large portion or all of the file, even for small previews

For large datasets (10–100GB CSVs), this results in significant I/O, CPU, and memory overhead — all when the user likely just wants a quick preview of the data.

This is a common pattern in:

  • Exploratory Data Analysis (EDA)
  • Data cataloging and profiling
  • Schema validation or column sniffing
  • Dashboards and notebook tooling

Currently, users resort to workarounds like:

next(pd.read_csv(..., chunksize=5))

or shell-level hacks like:

head -n 5 large_file.csv

These workarounds are either non-intuitive, return unstructured data, or live outside the pandas ecosystem.
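A portable alternative to the shell hack is possible in pure Python: itertools.islice stops reading as soon as N lines have been consumed. A minimal sketch (using an in-memory buffer to stand in for a large file on disk):

```python
import csv
import io
from itertools import islice

# In-memory buffer standing in for a large CSV file on disk.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n")

reader = csv.reader(data)
header = next(reader)            # consume the header line
rows = list(islice(reader, 3))   # read exactly 3 more rows, then stop

print(header)  # ['a', 'b', 'c']
print(rows)    # [['1', '2', '3'], ['4', '5', '6'], ['7', '8', '9']]
```

This is fast because nothing past the requested lines is ever parsed, but it returns raw lists rather than a DataFrame.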

Feature Description

Introduce a new function:

pandas.preview_csv(filepath_or_buffer, nrows=5, ...)

Goals

  • Read only the first n rows plus the header line
  • Avoid loading, or inferring types from, the full dataset
  • No full column validation
  • Fall back to object dtype unless dtype_infer=True
  • Support basic options like delimiter, encoding, and header presence

Proposed API:

def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
    as_generator: bool = False
) -> pd.DataFrame:
    ...
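Under these semantics, a rough implementation could be little more than csv.reader with early stopping, wrapped in a DataFrame. The sketch below is illustrative only, not pandas API; the as_generator and dtype_infer=True paths are omitted for brevity:

```python
import csv
import os
import tempfile
from itertools import islice

import pandas as pd

def preview_csv(filepath_or_buffer, nrows=5, delimiter=",",
                encoding="utf-8", has_header=True):
    """Read only the header plus the first `nrows` data rows.

    All columns come back as object dtype: no type inference,
    no full-file validation, no reading past line nrows + 1.
    """
    with open(filepath_or_buffer, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader) if has_header else None
        rows = list(islice(reader, nrows))  # stop after nrows lines
    return pd.DataFrame(rows, columns=header, dtype=object)

# Demo: preview a tiny temporary CSV.
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
tmp.write("a,b\n1,2\n3,4\n5,6\n")
tmp.close()
df = preview_csv(tmp.name, nrows=2)
print(df.shape)  # (2, 2)
os.unlink(tmp.name)
```

Because the file handle is only ever advanced nrows + 1 lines, the cost is independent of total file size.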

Alternative Solutions

| Tool / Method | Behavior | Limitation |
| --- | --- | --- |
| pd.read_csv(nrows=X) | Reads entire file into memory, performs dtype inference and column validation | Not optimized for quick previews; incurs overhead even for small nrows |
| pd.read_csv(chunksize=X) | Returns an iterator of chunks (DataFrames of size X) | Requires non-intuitive iterator handling; users often want a DataFrame directly |
| csv.reader + slicing | Python's built-in CSV reader is lightweight and fast | Returns raw lists, not a DataFrame; lacks header handling and column inference |
| subprocess.run(["head", "-n", ...]) | OS-level utility that returns the first N lines | Not portable across platforms; doesn't integrate with the DataFrame workflow |
| Polars: pl.read_csv(..., n_rows=X) | Rust-based, blazing-fast CSV reader | Requires installing a new library; pandas users may not want to switch ecosystems |
| Dask: dd.read_csv(...).head() | Lazy, out-of-core loading with chunked processing | Overhead of a distributed engine is unnecessary for simple previews |
| open(...).readlines(N) | Naive Python read of the first N lines | Doesn't handle parsing, delimiters, or schema properly |
| pyarrow.csv.read_csv(...)[0:X] | Efficient Arrow-based preview | Requires Apache Arrow APIs; returns Arrow tables unless converted |

While workarounds exist, none provide a clean, idiomatic, native pandas function to:

  • Efficiently load the first N rows
  • Return a DataFrame immediately
  • Avoid dtype inference
  • Skip full file validation
  • Avoid requiring third-party dependencies

A dedicated pandas.preview_csv() would fill this gap and offer an elegant, performant solution for quick data previews.

Additional Context

No response

@visheshrwl visheshrwl added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 13, 2025
@rhshadrach
Member

Thanks for the request. Having to maintain an entirely different code path that does very similar things to read_csv seems to me to be a non-starter. I would like to understand why read_csv could not be improved to fit this purpose.

@rhshadrach rhshadrach added IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 14, 2025
@visheshrwl
Author

Thank you for the thoughtful feedback, @rhshadrach!

I completely understand the reluctance to maintain a separate code path - especially in a core function like read_csv(), which already carries significant complexity.

read_csv() is designed for full-fidelity, schema-validated and optionally type-inferred ingestion. Introducing conditional short circuits for preview-style use cases pollutes that logic and increases branching inside a hot, complex code path.

On the other hand, a dedicated preview_csv() function:

  • Defines a minimal contract: "Read the top N rows quickly with minimal parsing"
  • Requires no inference or post-processing logic
  • Makes the behaviour explicit, predictable, and easy to optimize separately.

From a user intent perspective:

  • read_csv(nrows=X) implies: "I want a truncated but fully parsed and inferred subset of the data"
  • preview_csv(nrows=X) would mean: "I just want to see the first X lines, as fast as possible - even if it's untyped or partially parsed."

This distinction matters - especially in workflows where previewing is decoupled from actual analysis, such as:

  • Data cataloging
  • EDA profiling
  • Schema sniffing
  • Logging pipelines

Any performance optimization embedded in read_csv() must:

  • Preserve dozens of edge cases
  • Remain compatible with all backends (C, Python, Arrow-based readers)
  • Honor ~50+ keyword arguments (dtype, parse_dates, converters, skiprows, etc.)

This would introduce non-trivial complexity and testing burden to a critical code path and create surface area for subtle regressions.

Both polars.read_csv(..., n_rows=X) and vaex.open(...).head(X) implement optimized preview semantics using fast readers with early stopping. These tools don't override their full read_csv() equivalents - they recognize the preview use case is distinct.

Pandas could adopt a similar design without breaking the existing contract of read_csv().

If approved, I'm happy to:

  • Own the implementation of preview_csv()
  • Benchmark it vs read_csv() under real workloads (10GB+)
  • Keep it behind a dedicated namespace (e.g. pandas.io.preview)
  • Ensure full test coverage and documentation.

I'd love your thoughts, and any preferred entry point you'd recommend so this stays modular and maintainable long-term.

Thanks again!

@rhshadrach
Member

Can you post sample data and benchmarks demonstrating the performance issue with specifying nrows=N?
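As a starting point, a self-contained harness along these lines could be used (illustrative only: the 100,000-row synthetic file stands in for the multi-GB case, and timings vary by machine and parser engine). Note that read_csv's C engine may already stop reading early when nrows is given, which is exactly what such a benchmark would settle:

```python
import csv
import os
import tempfile
import time
from itertools import islice

import pandas as pd

# Generate a modest synthetic CSV (scale num_rows up toward a real benchmark).
num_rows = 100_000
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
writer = csv.writer(tmp)
writer.writerow(["a", "b", "c"])
writer.writerows([i, i * 2, i * 3] for i in range(num_rows))
tmp.close()

# Time pandas with nrows=5.
t0 = time.perf_counter()
df = pd.read_csv(tmp.name, nrows=5)
t_pandas = time.perf_counter() - t0

# Time a raw early-stopping read of the first 6 lines (header + 5 rows).
t0 = time.perf_counter()
with open(tmp.name, newline="") as f:
    rows = list(islice(csv.reader(f), 6))
t_raw = time.perf_counter() - t0

print(f"read_csv(nrows=5): {t_pandas:.4f}s; raw islice: {t_raw:.4f}s")
os.unlink(tmp.name)
```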
