## Use Case
### Performance Comparison of Reading a Large CSV File with and without Chunking

In this demonstration, we will:

1. Simulate the creation of a large CSV file.
2. Read the CSV file into a Pandas DataFrame in two different ways:
    - Without chunking
    - With chunking
3. Compare the memory usage of both methods.

```python
import warnings
warnings.filterwarnings('ignore')

import time

import numpy as np
import pandas as pd
from memory_profiler import memory_usage
```

### Step 1: Simulating a Large CSV File

First, let's create a large CSV file for testing purposes. We will generate a DataFrame with a significant number of rows and save it as a CSV file.

```python
# Simulate a large DataFrame
num_rows = 10**6  # 1 million rows
df = pd.DataFrame({
    'A': np.random.rand(num_rows),
    'B': np.random.randint(1, 100, size=num_rows),
    'C': np.random.choice(['X', 'Y', 'Z'], size=num_rows)
})

# Save the DataFrame to a CSV file
csv_file_path = 'large_file.csv'
df.to_csv(csv_file_path, index=False)
```
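Before reading the file back, it can help to compare the size of the data on disk with its size in memory. The sketch below is only an illustration: it uses a small stand-in DataFrame and a temporary file (both assumptions, chosen so it runs quickly), and `sample`, `disk_mib`, and `mem_mib` are hypothetical names that do not appear in the notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the large file so the sketch runs quickly (assumption).
rng = np.random.default_rng(0)
n = 10_000
sample = pd.DataFrame({
    "A": rng.random(n),
    "B": rng.integers(1, 100, size=n),
    "C": rng.choice(["X", "Y", "Z"], size=n),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    sample.to_csv(path, index=False)
    disk_mib = os.path.getsize(path) / 2**20

# deep=True asks pandas to measure the string values in column C as well.
mem_mib = sample.memory_usage(deep=True).sum() / 2**20
print(f"On disk: {disk_mib:.2f} MiB, in memory: {mem_mib:.2f} MiB")
```

The same two measurements can be taken on `df` and `csv_file_path` from the cell above to get a rough idea of how much memory a full read will need.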

### Step 2: Reading the CSV File Without Chunking

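Now we read the entire CSV file into a DataFrame in a single call and measure how much the process's memory grows while it runs. The function also times the read, but `memory_usage` discards a function's return value unless it is called with `retval=True`, so only memory is reported here.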

```python
def read_without_chunking():
    # Read the whole file in one call and time it
    start_time = time.perf_counter()
    df_full = pd.read_csv(csv_file_path)
    end_time = time.perf_counter()

    return df_full, end_time - start_time

# Sample the process's memory while the function runs.
# The function's return value (DataFrame and elapsed time) is discarded here.
mem_usage_no_chunking = memory_usage(read_without_chunking)
print(f"Memory usage without chunking: {max(mem_usage_no_chunking) - min(mem_usage_no_chunking):.2f} MiB")
```

Memory usage without chunking: 102.02 MiB


### Step 3: Reading the CSV File With Chunking

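Next, we read the same CSV file in chunks of 100,000 rows, concatenate the chunks into one DataFrame, and measure the memory growth in the same way. As before, the elapsed time is computed inside the function but not reported.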

```python
def read_with_chunking():
    # Read the file chunk by chunk and time it
    start_time = time.perf_counter()
    chunks = []
    chunk_size = 100000  # Adjust chunk size as needed

    for chunk in pd.read_csv(csv_file_path, chunksize=chunk_size):
        chunks.append(chunk)

    # Concatenating rebuilds the full DataFrame, so all rows end up in memory
    df_chunked = pd.concat(chunks, ignore_index=True)
    end_time = time.perf_counter()

    return df_chunked, end_time - start_time

# Sample the process's memory while the function runs.
# The function's return value (DataFrame and elapsed time) is discarded here.
mem_usage_with_chunking = memory_usage(read_with_chunking)
print(f"Memory usage with chunking: {max(mem_usage_with_chunking) - min(mem_usage_with_chunking):.2f} MiB")
```

Memory usage with chunking: 35.65 MiB
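Both functions above compute an elapsed time that is never printed. One way to report it is sketched below. This is not the notebook's own code: `time_read` is a hypothetical helper, and the small temporary file is an assumption made so the example runs quickly. Timings vary between machines and runs, so no expected output is shown.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_read(path, chunksize=None):
    """Read `path` in one call or in chunks and return (DataFrame, seconds)."""
    start = time.perf_counter()
    if chunksize is None:
        df = pd.read_csv(path)
    else:
        # Collect every chunk, then stitch them together as in Step 3.
        chunks = list(pd.read_csv(path, chunksize=chunksize))
        df = pd.concat(chunks, ignore_index=True)
    return df, time.perf_counter() - start


# Small stand-in file so the sketch runs quickly (assumption).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "A": rng.random(50_000),
    "B": rng.integers(1, 100, size=50_000),
    "C": rng.choice(["X", "Y", "Z"], size=50_000),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    sample.to_csv(path, index=False)
    df_full, t_full = time_read(path)
    df_chunked, t_chunked = time_read(path, chunksize=10_000)

print(f"Single read:  {t_full:.3f} s")
print(f"Chunked read: {t_chunked:.3f} s")
```

Calling `time_read(csv_file_path)` and `time_read(csv_file_path, chunksize=100000)` would apply the same comparison to the one-million-row file. Repeating each call a few times gives a steadier picture than a single run.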


### Summary of Results

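1. **Performance Comparison:** In this run, process memory grew by 102.02 MiB during the single-call read and by 35.65 MiB during the chunked read. Treat the gap with caution: `read_with_chunking` concatenates every chunk, so it still ends up holding the full DataFrame, and the measured growth can depend on what the process had already allocated when each cell ran. Elapsed time was computed inside both functions but never reported, so this demonstration supports no conclusion about speed.

2. **Flexibility:** Chunking makes it possible to process a dataset that does not fit in memory, provided each chunk is processed and then discarded (for example, by updating a running aggregate) rather than collected and concatenated. This makes it a valuable technique for data processing in real-world applications.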
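The memory benefit of chunking is clearest when each chunk is reduced to a small result and then dropped. The sketch below illustrates that pattern on the same column layout (`A`, `B`, `C`). `summarize_in_chunks` is a hypothetical helper, and the small temporary file is an assumption made so the example runs quickly.

```python
import os
import tempfile
from collections import Counter

import numpy as np
import pandas as pd


def summarize_in_chunks(path, chunksize):
    """Return (mean of A, counts of C), reading one chunk at a time."""
    total_a = 0.0
    row_count = 0
    c_counts = Counter()

    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Fold the chunk into running totals; the chunk itself is not kept.
        total_a += float(chunk["A"].sum())
        row_count += len(chunk)
        c_counts.update(chunk["C"].value_counts().to_dict())

    return total_a / row_count, dict(c_counts)


# Small stand-in file so the sketch runs quickly (assumption).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "A": rng.random(20_000),
    "B": rng.integers(1, 100, size=20_000),
    "C": rng.choice(["X", "Y", "Z"], size=20_000),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    sample.to_csv(path, index=False)
    mean_a, counts_c = summarize_in_chunks(path, chunksize=5_000)
    full = pd.read_csv(path)  # full read, used here only to check the result

print(f"Mean of A: {mean_a:.4f}")
print(f"Counts of C: {counts_c}")
```

Because only the running totals survive each iteration, peak memory is governed by `chunksize` rather than by the size of the file. Aggregates that cannot be combined from per-chunk results, such as an exact median, need a different approach.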