# Large DataSets Scaling (LDS) - Load less data - Chunking

In [28]:
import pandas as pd
import numpy as np
import pathlib

Some workloads can be handled by chunking: splitting a large problem into many small problems.
For example, you can convert an individual CSV file into a Parquet file and repeat that for each file in a directory. As long as each chunk fits in memory, you can work with datasets that are much larger than memory.

Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks. For more complicated workflows, you’re better off using another library.
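
The CSV-to-Parquet conversion mentioned above could look roughly like the sketch below. It is not part of the original notebook: the helper name convert_csv_dir_to_parquet and its two directory arguments are hypothetical, and it assumes a Parquet engine such as pyarrow or fastparquet is installed.

```python
import pathlib

import pandas as pd


def convert_csv_dir_to_parquet(src_dir, dst_dir):
    """Convert every CSV file in src_dir to a Parquet file in dst_dir.

    Hypothetical helper for illustration. Only one file is held in
    memory at a time, so each file acts as an independent chunk.
    """
    src = pathlib.Path(src_dir)
    dst = pathlib.Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(src.glob("*.csv")):
        df = pd.read_csv(csv_path)  # load a single chunk
        out_path = dst / f"{csv_path.stem}.parquet"
        df.to_parquet(out_path)  # requires pyarrow or fastparquet
        written.append(out_path)
    return written
```

Because no file depends on any other, the loop needs no coordination between chunks, which is the situation where chunking works best.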

## Create a chunked dataset

### Set a function to create data to work with

In [29]:
# Build a DataFrame of random data to use in the scaling examples
def make_timeseries(start="2024-01-01", end="2024-12-31", freq="1D", seed=None):
    """Build a DataFrame of random data indexed by timestamp."""
    # Build an index
    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    n = len(index)
    state = np.random.RandomState(seed)
    columns = {
        "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
        "id": state.poisson(1000, size=n),
        "x": state.rand(n) * 2 - 1,
        "y": state.rand(n) * 2 - 1,
    }
    df = pd.DataFrame(columns, index=index, columns=sorted(columns))
    if df.index[-1] == end:
        df = df.iloc[:-1]
    return df

### Build the data

In [30]:
N = 12
starts = [f"20{i+10:>02d}-01-01" for i in range(N)]
ends = [f"20{i+10:>02d}-12-31" for i in range(N)]

In [31]:
pathlib.Path("./timeseries").mkdir(exist_ok=True)

In [32]:
for i, (start, end) in enumerate(zip(starts, ends)):
    ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
    ts.to_parquet(f"timeseries/ts-{i:0>2d}.parquet")

The data is now built and stored as 12 Parquet files, one per year from 2010 to 2021, in the timeseries folder.

## Simulation of processing a Chunked File

I've implemented an out-of-core version of pandas.Series.value_counts() below. The peak memory usage of this workflow is the single largest chunk, plus a small series storing the unique value counts up to this point. As long as each individual file fits in memory, this will work for arbitrary-sized datasets.

In [36]:
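files = pathlib.Path("timeseries/").glob("ts*.parquet")
# Running total of value counts across all chunks
counts = pd.Series(dtype=int)
for path in files:
    # Only one file (chunk) is in memory at a time
    df = pd.read_parquet(path)
    # fill_value=0 handles names missing from either side of the sum
    counts = counts.add(df["name"].value_counts(), fill_value=0)
# add() with fill_value produces floats, so cast back to integers
counts.astype(int)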

name
Alice      2098234
Bob        2097382
Charlie    2098636
dtype: int32

### Final Considerations

Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
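
As a minimal sketch of that parameter, the example below repeats the value-count pattern on a single CSV source read in pieces. The CSV content is made-up data held in memory so the cell runs on its own; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file on disk (made-up data).
csv_data = io.StringIO(
    "name,x\nAlice,1\nBob,2\nAlice,3\nCharlie,4\nBob,5\nAlice,6\n"
)

counts = pd.Series(dtype=int)
# With chunksize set, read_csv yields DataFrames of at most
# `chunksize` rows instead of one DataFrame for the whole file.
for chunk in pd.read_csv(csv_data, chunksize=2):
    counts = counts.add(chunk["name"].value_counts(), fill_value=0)
counts = counts.astype(int)
```

Here chunksize=2 is only for demonstration; in practice you would pick a value large enough to keep overhead low while still fitting comfortably in memory.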

Manual chunking is an OK option for workflows that don’t require overly sophisticated operations. Some operations, like pandas.DataFrame.groupby(), are much harder to do chunkwise. In these cases, you may be better off switching to a different library that implements these out-of-core algorithms for you.
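
To illustrate why grouped operations need more care, here is a sketch (not from the original notebook) of a chunkwise per-name mean. It works only because a mean decomposes into a sum and a count that can be accumulated across chunks; aggregations such as a median cannot be combined this way. The three small DataFrames are made-up stand-ins for files read one at a time.

```python
import pandas as pd

# Three small made-up chunks standing in for files read one at a time.
chunks = [
    pd.DataFrame({"name": ["Alice", "Bob"], "x": [1.0, 2.0]}),
    pd.DataFrame({"name": ["Alice", "Charlie"], "x": [3.0, 4.0]}),
    pd.DataFrame({"name": ["Bob", "Alice"], "x": [6.0, 5.0]}),
]

totals = None
for chunk in chunks:
    # Partial sum and row count per group, for this chunk only.
    partial = chunk.groupby("name")["x"].agg(["sum", "count"])
    # Groups missing from one side are treated as zero.
    totals = partial if totals is None else totals.add(partial, fill_value=0)

# Averaging the per-chunk means would weight chunks incorrectly;
# dividing the combined sum by the combined count gives the exact mean.
mean_x = totals["sum"] / totals["count"]
```

Even in this simple case you have to carry intermediate state and choose the right decomposition by hand, which is the bookkeeping that out-of-core libraries take care of.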