# Question 2 - Case Study II: High-Performance Data Processing (Week 2) @ SISNET:

Scenario: Your team is processing a 10GB transaction log file daily. The current Python script uses a standard for loop to iterate through rows, apply a currency conversion, and filter out failed transactions. This process currently takes 6 hours to complete, delaying downstream reporting.

Q2.1 (Performance Diagnosis): Analyse why native Python loops are computationally expensive for large datasets compared to NumPy/Pandas operations. In your explanation, reference concepts such as Type Checking, Interpreter Overhead, and SIMD/C-Level Execution.

Q2.2 (Refactoring Plan): Provide a conceptual refactoring plan (pseudocode or Python snippet) using Pandas Vectorisation. Estimate the theoretical performance gain and explain the trade-offs (e.g. Memory Usage).


Q2.1:
Native Python loops are significantly slower than NumPy or Pandas operations because Python is designed for flexibility rather than speed. In a standard `for` loop, the interpreter performs Type Checking on every iteration: because Python is dynamically typed, it does not know whether a value is an integer, a string, or a list until it inspects the object at runtime. Each value is also a full Python object, so it must be unboxed before the arithmetic and the result boxed again afterwards. This Interpreter Overhead means the CPU spends more time working out what kind of data it is handling, and which operation to dispatch, than performing the calculation itself.
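A small sketch of why the interpreter cannot skip this check: the same `*` operator means something different for each type, and the decision is only made at runtime. The helper name `scale` is hypothetical and exists only for this illustration.

```python
def scale(value, factor=2):
    # The same bytecode runs for every call; the interpreter has to look at
    # the runtime type of `value` to decide what `*` means.
    return value * factor

mixed = [3, 1.5, "ab", [1]]
results = [scale(v) for v in mixed]
print(results)  # [6, 3.0, 'abab', [1, 1]]
```

In a loop over millions of rows, this lookup is repeated for every element even when all of them happen to be floats.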

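In contrast, NumPy and Pandas use C-Level Execution: each column is stored as a contiguous array with a single fixed data type that is known in advance, so the per-element type check is done once for the whole array rather than once per value. The loop itself runs in compiled code, which makes SIMD (Single Instruction, Multiple Data) processing possible. SIMD allows the CPU to apply the same operation to a block of values at once, rather than one element at a time.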
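The fixed data type can be seen directly on a NumPy array. This is a minimal sketch with made-up prices; the exact size of a Python float object depends on the interpreter build, so only the comparison matters.

```python
import sys
import numpy as np

prices_list = [10.0, 20.5, 30.25]   # three separate Python float objects
prices_arr = np.array(prices_list)  # one typed buffer, dtype inferred once

print(prices_arr.dtype)     # float64
print(prices_arr.itemsize)  # 8 (bytes per element)

# A boxed Python float carries object overhead on top of its 8-byte value.
print(sys.getsizeof(prices_list[0]) > prices_arr.itemsize)  # True

# One expression; the element-wise loop runs in compiled code.
converted = prices_arr * 1.10
```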

In [3]:
'''
Q2.2: Benchmark of a native Python loop against Pandas vectorisation
on a synthetic price column of 1,000,000 rows.
'''

import pandas as pd
import numpy as np
import timeit

# 1. Create a sample dataset with 1000000 rows
df = pd.DataFrame({'price': np.random.uniform(10, 100, 1000000)})

# 2. Using Native Python Loop
def loop_test():
    taxed_prices = []
    for p in df['price']:
        taxed_prices.append(p * 1.10)
    df['total_loop'] = taxed_prices

# 3. Using Pandas Vectorisation
def vectorized_test():
    df['total_vector'] = df['price'] * 1.10

# 4. Use timeit to time a single run of each approach (timings vary by machine)
loop_time = timeit.timeit(loop_test, number=1)
vector_time = timeit.timeit(vectorized_test, number=1)

print(f"Loop Method: {loop_time:.4f} seconds")
print(f"Vectorized Method: {vector_time:.4f} seconds")
print(f"Speedup Factor: {loop_time / vector_time:.2f}x")

Loop Method: 0.2419 seconds
Vectorized Method: 0.0018 seconds
Speedup Factor: 136.68x
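
The benchmark above times a single arithmetic step on synthetic data. A minimal sketch of how the same idea might be applied to the scenario's transaction log follows. The column names (`txn_id`, `status`, `amount_usd`), the `FAILED` marker, the rate `USD_TO_EUR`, and the helper `process_log` are all assumptions made for illustration, since the scenario does not specify the log's schema. Reading with `chunksize` is one way to handle the Memory Usage trade-off: vectorisation works on whole columns held in RAM and creates temporary arrays for intermediate results, so a 10GB file may not fit in memory at once.

```python
import io
import pandas as pd

# Assumed fixed conversion rate, for illustration only.
USD_TO_EUR = 0.92

def process_log(source, chunksize=1_000_000):
    """Filter failed transactions and convert amounts, one chunk at a time."""
    cleaned = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize`
    # rows, so only part of the file is parsed into memory at once.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Vectorised filter: a boolean mask replaces the per-row `if`.
        ok = chunk[chunk["status"] != "FAILED"].copy()
        # Vectorised conversion: one multiplication over the whole column.
        ok["amount_eur"] = ok["amount_usd"] * USD_TO_EUR
        cleaned.append(ok)
    return pd.concat(cleaned, ignore_index=True)

# Tiny in-memory stand-in for the real log file (hypothetical schema).
sample = io.StringIO(
    "txn_id,status,amount_usd\n"
    "1,OK,100.0\n"
    "2,FAILED,50.0\n"
    "3,OK,200.0\n"
)
result = process_log(sample, chunksize=2)
print(result)
```

If the measured speedup of roughly 137x carried over directly, the 6-hour job (21,600 seconds) would shrink to about 158 seconds, or under three minutes. That is a theoretical ceiling rather than a prediction, because file parsing and disk I/O are not made faster by vectorising the arithmetic. The sketch also still concatenates every surviving row in memory; writing each cleaned chunk to disk instead would keep peak memory bounded by the chunk size.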
