# Handling Large Datasets with Chunking and Lazy Evaluation



## Introduction
A large dataset is not defined by file size alone.
It is large when it cannot be processed efficiently with naïve, in-memory methods on your system, that is, when it forces you to think explicitly about memory, disk I/O, or network constraints.

Such datasets often exceed the available RAM, which leads to memory errors. Two key strategies for managing this are:
- **Chunking**: Processing data in smaller pieces (chunks).
- **Lazy Evaluation**: Delaying computations until necessary, often using libraries like Dask.

We'll use Pandas for chunking and Dask for lazy evaluation.

## Chunking with Pandas

Pandas allows reading CSV files in chunks using the `chunksize` parameter in `pd.read_csv()`. This returns an iterator of DataFrames.

### Why Chunking?
- Reduces memory usage.
- Enables processing of files larger than memory.

### Example: Summing a Large CSV File

Assume we have a large CSV file 'large_data.csv' with columns 'A', 'B', 'C'. We'll compute the sum of column 'A' by processing in chunks.

In [1]:
import pandas as pd

# Generate a large CSV file for simulation. It is not large enough to require Dask, but simulates the scenario.

data = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000),
    'C': range(1000000)
})
data.to_csv('large_data.csv', index=False)

print('Simulated data created.')

Simulated data created.


### Detailed Explanation

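- `pd.read_csv(..., chunksize=n)`: Returns a `TextFileReader`, an iterator that yields DataFrames of up to `n` rows each.
- Loop over chunks: Process each DataFrame chunk individually.
- Aggregate results: Combine the per-chunk results; here, each chunk's sum is added to a running total.

This way, only one chunk is in memory at a time.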

In [2]:
# Reading in chunks
chunk_size = 100000  # Adjust based on memory
total_sum = 0

for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    print(f'Processing chunk with {len(chunk)} rows')
    total_sum += chunk['A'].sum()

print(f'Total sum of column A: {total_sum}')

Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Processing chunk with 100000 rows
Total sum of column A: 499999500000
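
Sums combine easily because partial sums can simply be added. Other aggregates need a little more state: a mean, for example, cannot be obtained by averaging per-chunk means, because the last chunk may be smaller than the others. A minimal sketch, assuming a hypothetical helper `chunked_mean` and a small temporary `demo.csv` (neither is part of the example above), keeps a running sum and count instead:

```python
import os
import tempfile

import pandas as pd


def chunked_mean(path, column, chunksize=1000):
    """Mean of one column, reading the CSV `chunksize` rows at a time."""
    total = 0.0
    count = 0
    # usecols limits parsing to the column we need, which saves memory
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
        count += chunk[column].count()  # count() skips missing values, as mean() does
    return total / count if count else float('nan')


# Small hypothetical file so the sketch runs on its own
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.csv')
pd.DataFrame({'A': range(10_000)}).to_csv(demo_path, index=False)

demo_mean = chunked_mean(demo_path, 'A')
print(f'Chunked mean of A: {demo_mean}')
```

The same pattern (carry the smallest state that can be merged across chunks) applies to minimum, maximum, and counts.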




### Advanced Chunking: Filtering and Writing

Example: Filter rows where 'A' > 500000 and write to a new file.

In [3]:
# newline='' lets the CSV writer control line endings (avoids blank rows on Windows)
with open('filtered_data.csv', 'w', newline='') as f:
    header = True
    for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
        filtered = chunk[chunk['A'] > 500000]
        filtered.to_csv(f, index=False, header=header)
        header = False  # Only write header once

print('Filtered data written.')

Filtered data written.
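
The filter-and-write loop can be wrapped in a reusable function. The sketch below is one possible shape, not the only one: `filter_csv`, its `keep` argument, and the temporary file names are hypothetical and chosen for illustration. It returns the number of rows written so the result can be checked.

```python
import os
import tempfile

import pandas as pd


def filter_csv(src, dst, keep, chunksize=1000):
    """Copy the rows selected by keep(chunk) from src to dst; return rows written."""
    written = 0
    # newline='' leaves line-ending handling to the CSV writer
    with open(dst, 'w', newline='') as out:
        for i, chunk in enumerate(pd.read_csv(src, chunksize=chunksize)):
            kept = chunk[keep(chunk)]
            # Write the header with the first chunk only, even if it keeps no rows
            kept.to_csv(out, index=False, header=(i == 0))
            written += len(kept)
    return written


# Small hypothetical input so the sketch runs on its own
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, 'src.csv')
dst_path = os.path.join(work_dir, 'dst.csv')
pd.DataFrame({'A': range(5000), 'B': range(5000)}).to_csv(src_path, index=False)

rows_written = filter_csv(src_path, dst_path, keep=lambda c: c['A'] > 2500)
filtered_back = pd.read_csv(dst_path)
print(f'{rows_written} rows written')
```

Passing the condition as a function keeps the chunk loop unchanged when the filter changes.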


## Lazy Evaluation with Dask

Lazy evaluation is a programming technique where expressions are not computed until their values are needed. Memory is used only for what’s needed right now, not the entire dataset.

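Python provides several lazy structures:
- Iterator: An object that implements `__next__()` and produces one value per call.
- Generator: A function that uses `yield`, which pauses the function after producing a value and resumes it when the next value is requested.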
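
To make the two structures above concrete, here is a small sketch using only the standard library. The function name `read_in_batches` is hypothetical; it groups any iterable into fixed-size batches and does no work until a value is requested.

```python
def read_in_batches(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch      # pause here until the caller asks for more
            batch = []
    if batch:                # emit the final, possibly shorter, batch
        yield batch


batches = read_in_batches(range(10), batch_size=4)  # nothing has run yet
is_iterator = hasattr(batches, '__next__')          # generators are iterators
first = next(batches)                               # runs up to the first yield
rest = list(batches)                                # drains the remaining batches
print(first, rest)
```

Only one batch exists in memory at a time, which is the same idea chunked reading applies to files.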

Dask provides parallel computing and lazy evaluation for larger-than-memory datasets. It mimics the Pandas API but computes lazily.

### Why Lazy Evaluation?
- Builds computation graphs without executing immediately.
- Optimizes and parallelizes computations.
- Handles big data with clusters or multi-core.
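
The first point can be illustrated with a toy, pure-Python sketch. The `Lazy` class below is hypothetical and far simpler than Dask's real task graphs; it only shows that building a graph runs nothing until `compute()` is called.

```python
class Lazy:
    """A toy deferred computation: stores a function and its inputs."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args

    def compute(self):
        # Evaluate dependencies first, then apply this node's function
        values = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.fn(*values)


calls = []  # records which functions actually ran


def load():
    calls.append('load')
    return [1, 2, 3, 4]


def total(xs):
    calls.append('total')
    return sum(xs)


graph = Lazy(total, Lazy(load))  # builds the graph; no function has run
calls_before = list(calls)
answer = graph.compute()         # load() and total() run now
print(calls_before, calls, answer)
```

Because the whole graph is known before anything runs, a real scheduler has the chance to reorder, merge, or parallelize tasks.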



In [4]:
# Installation
# If not installed:
# %pip install "dask[complete]"
# %pip install "pyarrow>=10.0.1"

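### How Does Dask Work?

- `dd.read_csv()`: Creates a Dask DataFrame, partitioned into smaller Pandas DataFrames.
- Operations like `.mean()` and `.sum()` are lazy: they build a task graph instead of returning a value.
- `dd.compute()`: Executes the graphs of the objects passed to it and returns their results as a tuple. A single lazy object can also be evaluated with its own `.compute()` method.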

This avoids loading the entire dataset into memory.


### Example: Basic Dask DataFrame

In [5]:
import dask.dataframe as dd

# Read the large CSV with Dask
ddf = dd.read_csv('large_data.csv')

# Lazy operations
mean_A = ddf['A'].mean()
sum_B = ddf['B'].sum()

# Compute only when needed
results = dd.compute(mean_A, sum_B)

print(f'Mean of A: {results[0]}, Sum of B: {results[1]}')

Mean of A: 499999.5, Sum of B: 499999500000


### GroupBy and Persist

Persisting means computing intermediate results and keeping them in memory (the RAM of one machine or the combined memory of a cluster) so that they can be reused later without recomputation.
- `persist()`: Computes the DataFrame's partitions and keeps them in memory, returning a new Dask DataFrame backed by those results. On a distributed cluster, workers may spill data to disk under memory pressure.
- Useful for datasets that fit in distributed memory but not on a single machine.

Example: Group by a category.

In [6]:
# Add category to data for demo
data['Category'] = ['Cat1' if i % 2 == 0 else 'Cat2' for i in range(len(data))]
data.to_csv('large_data_with_cat.csv', index=False)

ddf = dd.read_csv('large_data_with_cat.csv')

# Persist first so later computations reuse the loaded partitions
ddf = ddf.persist()

# Lazy groupby, built on the persisted DataFrame
grouped_mean = ddf.groupby('Category')['A'].mean()

# Compute
result = grouped_mean.compute()

print(result)

Category
Cat1    499999.0
Cat2    500000.0
Name: A, dtype: float64


## Combining Chunking and Lazy Evaluation

Dask internally uses chunking: a Dask DataFrame is split into partitions, each of which is a smaller Pandas DataFrame. `npartitions` is the number of these independent partitions, which Dask can process in parallel. You can change it with `repartition(npartitions=...)`.

With too few partitions, parallelism is limited and processing is slower; with too many, task-scheduling overhead grows and performance suffers. A common rule of thumb is 2–4 partitions per CPU core.


In [7]:
import dask.dataframe as dd

# Read a large CSV lazily
df = dd.read_csv('large_data_with_cat.csv')

# Check number of partitions
print(df.npartitions)
# Repartition to 10 partitions
df = df.repartition(npartitions=10)

print(df.npartitions)  # 10


1
10
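
As a rough illustration of the 2–4 partitions-per-core guideline, the sketch below derives a partition range from the machine's core count. `suggested_partitions` is a hypothetical helper, not part of Dask, and its result is only a starting point to tune against a real workload.

```python
import os


def suggested_partitions(n_cores=None, per_core=(2, 4)):
    """Return a (low, high) partition range for the given number of cores."""
    cores = n_cores or os.cpu_count() or 1  # os.cpu_count() can return None
    low, high = per_core
    return cores * low, cores * high


low, high = suggested_partitions()
print(f'Try between {low} and {high} partitions on this machine')
```

A value from this range could then be passed to `repartition(npartitions=...)`, keeping in mind that partition size in memory matters as much as partition count.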
