## Pandas Optimization - Advanced Techniques

- Chunking
- Indexing
- Vectorization
- Memory Profiling

In [10]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
import time

import numpy as np
import pandas as pd
from memory_profiler import memory_usage, profile

### Chunking

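For datasets too large to load at once, pass the chunksize parameter to functions like pd.read_csv() so they return an iterator of DataFrames instead of one large DataFrame. Process each chunk independently and keep only the reduced result. Note that the timing example below concatenates every chunk back into a single DataFrame, so the full dataset still ends up in memory; it measures the cost of chunked reading, not the memory savings.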

In [12]:
csv_file_path = 'large_file.csv'
def read_with_chunking():
    # Measure time for reading with chunking
    start_time = time.time()
    chunks = []
    chunk_size = 100000  # Adjust chunk size as needed

    for chunk in pd.read_csv(csv_file_path, chunksize=chunk_size):
        chunks.append(chunk)

    df_chunked = pd.concat(chunks, ignore_index=True)
    end_time = time.time()
    
    return df_chunked, end_time - start_time

# Measure memory usage
mem_usage_with_chunking = memory_usage(read_with_chunking)
print(f"Memory usage with chunking: {max(mem_usage_with_chunking) - min(mem_usage_with_chunking):.2f} MiB")

Memory usage with chunking: 98.83 MiB
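To keep memory bounded, each chunk can be reduced as it arrives instead of being appended to a list. The following is a minimal sketch, assuming a running sum is all that is needed; the in-memory CSV and the column names `id` and `amount` are hypothetical stand-ins for `large_file.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for 'large_file.csv'
csv_data = io.StringIO(
    "id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(10))
)

total = 0
row_count = 0

# Each chunk is a regular DataFrame holding at most `chunksize` rows.
# Only the running totals survive between iterations.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += int(chunk["amount"].sum())
    row_count += len(chunk)

mean_amount = total / row_count
print(total, row_count, mean_amount)
```

Because no chunk is retained after its iteration, peak memory depends on the chunk size rather than on the file size.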


### Indexing

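Setting an appropriate index can speed up lookups, joins, and group operations, since pandas can align rows on the index directly. In the run below, merging two 10-million-row DataFrames on the index took less than half the time of merging on a column.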

In [20]:
import pandas as pd
import time

# Create large DataFrames
df1 = pd.DataFrame({'key': range(10000000), 'value1': range(10000000)})
df2 = pd.DataFrame({'key': range(10000000), 'value2': range(10000000)})

# Merge without indexing
start_time = time.time()
merged_df_no_index = pd.merge(df1, df2, on='key')
end_time = time.time()
print("Time taken without indexing:", end_time - start_time)

# Merge with indexing
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)

start_time = time.time()
merged_df_with_index = pd.merge(df1, df2, left_index=True, right_index=True)
end_time = time.time()
print("Time taken with indexing:", end_time - start_time)

Time taken without indexing: 0.15232062339782715
Time taken with indexing: 0.06646132469177246
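Beyond merges, an index also enables label-based lookups and `join` calls against a column of another DataFrame. The sketch below uses small hypothetical tables (`customers` and `orders` are illustrative names, not part of the benchmark above) to show both patterns.

```python
import pandas as pd

# Hypothetical lookup table, indexed by its key column
customers = pd.DataFrame(
    {"key": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]}
).set_index("key")

# Hypothetical fact table that refers to customers by key
orders = pd.DataFrame({"key": [3, 1, 3], "total": [10, 20, 30]})

# Label-based lookup on the index
name_for_2 = customers.loc[2, "name"]

# join() matches orders['key'] against the index of customers (left join)
joined = orders.join(customers, on="key")
print(name_for_2)
print(joined)
```

Setting the index once on the lookup table pays off when it is reused across many lookups or joins.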


### Vectorization

Vectorized operations allow you to perform computations on entire columns or arrays without explicit loops, which can significantly speed up operations.

In [7]:
# Create a DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 100, size=10000),
    'B': np.random.randint(0, 100, size=10000)
})

df['C'] = 0
start_time = time.time()
# Use a loop to add the values of columns 'A' and 'B'.
# df.at avoids chained assignment (df['C'][i] = ...), which may write to a copy.
for i in range(len(df)):
    df.at[i, 'C'] = df['A'][i] + df['B'][i]
end_time = time.time()

print(f"Time taken: {end_time-start_time:.4f}")

# Vectorized operation: adding two columns
start_time = time.time()
df['C'] = df['A'] + df['B']
end_time = time.time()

print(f"Time taken: {end_time-start_time:.4f}")

Time taken: 0.9784
Time taken: 0.0007


In [8]:
df = pd.DataFrame({
    'A': np.random.randint(0, 100, size=10000000),  # int64
})

start_time = time.time()
df['B'] = df['A'].apply(lambda x: x ** 2)
end_time = time.time()

print(f"Time taken: {end_time-start_time:.4f}")

# Vectorized equivalent
start_time = time.time()
df['B'] = df['A'] ** 2
end_time = time.time()

print(f"Time taken: {end_time-start_time:.4f}")

Time taken: 4.7028
Time taken: 0.0089
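Conditional logic is a common reason to reach for `apply`, but it can often be vectorized as well. The sketch below, on a small hypothetical DataFrame, uses `np.where` to pick the larger of two columns and checks it against a row-wise `apply`.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame for illustration
df_demo = pd.DataFrame({"A": [5, 50, 95], "B": [10, 20, 30]})

# Row-wise version: calls a Python function once per row
slow = df_demo.apply(
    lambda row: row["A"] if row["A"] > row["B"] else row["B"], axis=1
)

# Vectorized version: one comparison and one selection over whole columns
df_demo["max_ab"] = np.where(
    df_demo["A"] > df_demo["B"], df_demo["A"], df_demo["B"]
)

print(df_demo)
```

Both versions produce the same values; the vectorized one avoids the per-row Python call that dominates the `apply` timings above.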


### Memory Profiling

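Use memory_profiler to measure memory usage line by line and find the statements that allocate the most. Pandas-Profiling (now ydata-profiling) generates data-exploration reports rather than memory profiles, so it helps with understanding a dataset, not with locating memory hogs. The run below profiles a script, memory.py, whose process_data function is decorated with @profile.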

In [9]:
!python -m memory_profiler memory.py


Filename: memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
     4     94.5 MiB     94.5 MiB           1   @profile
     5                                         def process_data():
     6    125.2 MiB     30.7 MiB           1       df = pd.DataFrame({'a': range(1000000), 'b': range(1000000)})
     7    125.4 MiB      0.2 MiB           1       df['c'] = df['a'] + df['b']
     8    141.0 MiB     15.6 MiB           1       df = df.drop(columns=['a'])
     9    141.0 MiB      0.0 MiB           1       return df
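Alongside memory_profiler, pandas can report per-column memory through `DataFrame.memory_usage(deep=True)`, which helps decide which columns to shrink. The sketch below is an illustration on hypothetical data (the column names `small_int` and `label` are assumptions): it downcasts an integer column and converts a repetitive string column to `category`, then compares the totals.

```python
import numpy as np
import pandas as pd

# Hypothetical data: small integers stored as int64 and a repetitive string column
df_mem = pd.DataFrame({
    "small_int": np.arange(1000, dtype="int64") % 100,
    "label": ["red", "green", "blue", "red"] * 250,
})

# Bytes per column; deep=True also counts the string objects themselves
before = df_mem.memory_usage(deep=True)

df_small = df_mem.copy()
# Values 0..99 fit in the smallest signed integer type
df_small["small_int"] = pd.to_numeric(df_small["small_int"], downcast="integer")
# Few distinct strings: store integer codes plus one copy of each label
df_small["label"] = df_small["label"].astype("category")

after = df_small.memory_usage(deep=True)
print(before, after, sep="\n")
```

Whether these conversions are safe depends on the data: downcasting assumes the value range will not grow, and `category` only helps when the number of distinct values is small relative to the row count.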


