fileslicer is a lightweight Python library for efficiently reading and splitting large files using memory mapping. It allows you to iterate over lines within a file slice and split files into chunks without loading the entire file into memory, making it ideal for processing very large files.
- Memory-efficient line iteration using `mmap`.
- Split large files into chunks while respecting newline boundaries.
- Simple and Pythonic API.
- Works with files of arbitrary size.
Install via pip:

```shell
pip install fileslicer
```
```python
from fileslicer import FileSlice

# Create a FileSlice covering an entire file
file_slice = FileSlice.from_file("large_file.txt")

# Iterate over lines in the slice
for line in file_slice.iter_lines():
    print(line.decode().strip())
```
```python
from fileslicer import FileSlice

# Split a file into 4 chunks aligned to newline boundaries
chunks = FileSlice.split_file("large_file.txt", splits=4)

for chunk in chunks:
    print(f"Processing bytes {chunk.start_offset}-{chunk.end_offset}")
    for line in chunk.iter_lines():
        print(line.decode().strip())
```
```python
from fileslicer import FileSlice

# Only read bytes 1000 to 5000
file_slice = FileSlice("large_file.txt", 1000, 5000)

for line in file_slice.iter_lines():
    print(line.decode().strip())
```
- `FileSlice(file_path: str, start_offset: int, end_offset: int)`: Represents a slice of a file.
- `iter_lines() -> Generator[bytes]`: Iterate over lines in the file slice as `bytes`.
- `@staticmethod from_file(file_path: str) -> FileSlice`: Create a `FileSlice` covering the entire file.
- `@staticmethod split_file(file_path: str, splits: int) -> list[FileSlice]`: Split a file into multiple slices, aligned to newline boundaries (see the sketch after this list).
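To make the newline alignment concrete, here is an illustrative sketch of the idea, not fileslicer's actual implementation: pick approximate byte boundaries, then advance each boundary to the next newline so no line is cut in half. The helper name `newline_aligned_offsets` is hypothetical.

```python
import mmap
import os

def newline_aligned_offsets(path: str, splits: int) -> list[tuple[int, int]]:
    """Illustrative only: compute (start, end) byte ranges that never split a line."""
    size = os.path.getsize(path)  # assumes a non-empty file and splits >= 1
    approx = size // splits
    offsets: list[tuple[int, int]] = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        for i in range(1, splits):
            if start >= size:
                break
            # Jump to the approximate boundary, then advance to the next newline
            boundary = max(i * approx, start)
            nl = mm.find(b"\n", boundary)
            end = size if nl == -1 else nl + 1
            offsets.append((start, end))
            start = end
        if start < size:
            offsets.append((start, size))  # last chunk runs to end of file
    return offsets
```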
Processing extremely large files with standard file reading can be slow and memory-intensive. fileslicer uses memory mapping to efficiently slice and iterate over file data without reading everything into memory. Inspired by the "1 Billion Row Challenge" in Python, it is perfect for data processing pipelines, log analysis, and ETL tasks.
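Because `split_file` returns independent slices, each chunk can be handed to a separate worker process. Below is a minimal sketch using the standard library's `multiprocessing`, assuming `FileSlice` instances are picklable (plausible, since the constructor above takes only a path and two offsets); `count_lines` is an illustrative helper, not part of the library.

```python
from multiprocessing import Pool

from fileslicer import FileSlice

def count_lines(chunk: FileSlice) -> int:
    """Count the lines in one slice; runs in a worker process."""
    return sum(1 for _ in chunk.iter_lines())

if __name__ == "__main__":
    # One chunk per worker; each worker maps only its own byte range
    chunks = FileSlice.split_file("large_file.txt", splits=4)
    with Pool(len(chunks)) as pool:
        totals = pool.map(count_lines, chunks)
    print(f"Total lines: {sum(totals)}")
```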
`fileslicer` is distributed under the terms of the MIT license.