# Week 01, Lab 01 — NumPy & Jupyter EDA Warm‑Up

**Your Name**  
**Date:** October 26, 2025

---

## Outcomes

By the end of this lab, you will be able to:

1. Create and reshape NumPy arrays; explain when and why to use `reshape` vs. `ravel`/`flatten`.
2. Demonstrate vectorization and broadcasting to replace Python `for` loops.
3. Compare **runtime** and **memory footprint** of lists vs. NumPy arrays.
4. Use key Jupyter magics for EDA & reproducibility: `%matplotlib inline`, `%timeit`, `%%time`, `%env`, `%%capture`, and `autoreload`.
5. Produce a short, reproducible EDA narrative that includes figures, random seeds, and environment/version stamps.

## Notebook Prologue — Environment & Reproducibility Setup

In [None]:
# %pip install pandas fastparquet matplotlib numpy

# Reproducibility & environment snapshot
import os, sys, platform, random
import numpy as np
import pandas as pd
import matplotlib

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print({
    'python': sys.version.split()[0],
    'platform': platform.platform(),
    'numpy': np.__version__,
    'pandas': pd.__version__,
    'matplotlib': matplotlib.__version__,
    'pid': os.getpid(),
})

---

## Part A — NumPy Quick Refresh

### A1. Arrays vs. Python lists (time & memory)

**Step 1: Create comparable data**

In [None]:
import sys, numpy as np
N = 1_000_000
py_list = list(range(N))
np_array = np.arange(N)

**Step 2: Memory footprint**

In [None]:
# List memory: container + objects (rough estimate)
list_container_bytes = sys.getsizeof(py_list)
int_object_bytes = sys.getsizeof(N)  # size of a typical int in this range (implementation dependent)
approx_list_bytes = list_container_bytes + N * int_object_bytes

# NumPy memory: contiguous buffer
numpy_bytes = np_array.nbytes

print({'approx_list_bytes': approx_list_bytes, 'numpy_bytes': numpy_bytes})

**Step 3: Runtime with `%timeit`**

In [None]:
%timeit sum(py_list)
%timeit np_array.sum()

**Checkpoint A1:** NumPy is faster and more memory-efficient because it stores data in contiguous memory blocks and uses vectorized C loops, avoiding Python object overhead. This results in better CPU cache locality and fewer allocations compared to Python lists.
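To make the memory argument concrete, the sketch below compares the per-element cost of a list and an array. It assumes CPython on a 64-bit platform; the small size `n` and the variable names `pointer_bytes` and `object_bytes` are illustrative and not part of the lab.

```python
import sys
import numpy as np

n = 1_000  # small illustrative size (the lab itself uses 1_000_000)
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores one pointer per element; each int is a separate Python object.
pointer_bytes = (sys.getsizeof(py_list) - sys.getsizeof([])) / n
object_bytes = sys.getsizeof(py_list[-1])

# A NumPy array stores the raw values back to back in one buffer.
print("list pointer bytes per element:", pointer_bytes)
print("int object bytes:", object_bytes)
print("numpy bytes per element:", np_array.itemsize)
```

The list pays for both the pointer and the object it points to, while the array pays only `itemsize` bytes per value.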

### A2. Shape, reshape, ravel, and flatten

**Step 1: Create and reshape arrays**

In [None]:
x = np.arange(12)
x2 = x.reshape(3, 4)
x3 = x.reshape(2, 2, 3)
x_ravel = x2.ravel()      # view if possible
x_flat = x2.flatten()     # always copy
print(x2.shape, x3.shape, x_ravel.base is x, x_flat.base is x)

**Step 2: Prove views vs copies by mutation**

In [None]:
# Mutate an element in x2 and observe x, x_ravel, and x_flat
x2[0, 0] = 999
print(f"x[0] = {x[0]}")
print(f"x_ravel[0] = {x_ravel[0]}")
print(f"x_flat[0] = {x_flat[0]}")
print("\nx2[0,0] changed to 999")
print("x and x_ravel reflect the change (views)")
print("x_flat does NOT reflect the change (copy)")

**Checkpoint A2:** Prefer `reshape` and `ravel` for performance when you don't need an independent copy, as they return views when possible. Use `flatten` only when you explicitly need a copy to avoid unintended mutations.
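The "when possible" qualifier matters: `ravel` falls back to a copy when the data cannot be described as a flat view. A minimal sketch, using a transposed array as the non-contiguous case and `np.shares_memory` to check (the names `r_view`, `r_copy`, and `f_copy` are illustrative):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

r_view = a.ravel()      # a is C-contiguous, so ravel can return a view
r_copy = a.T.ravel()    # the transpose is not C-contiguous, so ravel must copy
f_copy = a.flatten()    # flatten always copies

print(np.shares_memory(a, r_view))  # True
print(np.shares_memory(a, r_copy))  # False
print(np.shares_memory(a, f_copy))  # False
```

If code relies on mutations propagating, checking `np.shares_memory` first is safer than assuming `ravel` returned a view.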

### A3. Broadcasting & vectorization

**Step 1: Broadcasting scalar + vector**

In [None]:
v = np.arange(5)
v_plus = v + 10
v_scaled = v * 2
print(f"v = {v}")
print(f"v + 10 = {v_plus}")
print(f"v * 2 = {v_scaled}")

**Step 2: Broadcasting 2D + 1D**

In [None]:
M = np.arange(12).reshape(3,4)
col = np.array([1, 2, 3]).reshape(3,1)
M2 = M + col  # adds 1, 2, 3 to rows 0, 1, 2: col is stretched across the 4 columns
print("M:")
print(M)
print("\ncol:")
print(col)
print("\nM + col:")
print(M2)
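Broadcasting aligns shapes from the trailing dimension and stretches any dimension of size 1. The sketch below contrasts a `(3, 1)` column, a `(4,)` row, and a `(3,)` vector that does not line up with `M`; the `row` array and the `mismatch_raised` flag are illustrative additions.

```python
import numpy as np

M = np.arange(12).reshape(3, 4)
col = np.array([1, 2, 3]).reshape(3, 1)   # shape (3, 1): one value per row
row = np.array([10, 20, 30, 40])          # shape (4,): one value per column

print(np.broadcast(M, col).shape)  # result shape of M + col
print(np.broadcast(M, row).shape)  # result shape of M + row

# A shape-(3,) vector is compared against the trailing dimension of size 4,
# so NumPy refuses to broadcast it.
mismatch_raised = False
try:
    M + np.array([1, 2, 3])
except ValueError:
    mismatch_raised = True
print("shape mismatch raised:", mismatch_raised)
```

Reshaping the `(3,)` vector to `(3, 1)`, as the cell above does, is what makes it apply per row.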

**Step 3: Loop vs vectorized timing**

In [None]:
big = np.random.rand(2_000_000)

def py_loop_square(arr):
    out = [0.0]*len(arr)
    for i, val in enumerate(arr):
        out[i] = val*val
    return out

print("Python loop:")
%timeit py_loop_square(big)

print("\nNumPy vectorized:")
%timeit big*big

**Checkpoint A3:** The vectorized NumPy operation is typically 10-100× faster than the Python loop, depending on hardware, because the per-element multiply runs in compiled code instead of the Python interpreter.
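A speedup only counts if both versions compute the same thing. The following sketch checks that on a small array before any timing; the seed `0` and the size `1_000` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for this illustration
small = rng.random(1_000)

def py_loop_square(arr):
    # Same element-by-element loop as in the timing cell.
    out = [0.0] * len(arr)
    for i, val in enumerate(arr):
        out[i] = val * val
    return out

loop_result = np.array(py_loop_square(small))
vec_result = small * small

print(np.allclose(loop_result, vec_result))
```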

---

## Part B — Jupyter for EDA & Pipelines

### B1. Jupyter magics you'll actually use

In [None]:
# 1) Inline plotting for reports/notebooks
%matplotlib inline

In [None]:
%%time
# 2) Timing
import time
_ = [time.sleep(0.001) for _ in range(200)]

In [None]:
# 3) Micro-benchmarks (repeat/average)
import numpy as np
arr = np.random.rand(100_000)
%timeit arr.mean()

In [None]:
# 4) Environment variables for pipelines
%env DATA_DIR=./data

In [None]:
%%capture cap
# 5) Capture noisy cell output (useful when logging)
print('This will be captured, not printed.')

In [None]:
# Verify in a separate cell: %%capture swallows everything printed in its own cell
print(cap.stdout)

In [None]:
# 6) Autoreload during iterative development
%load_ext autoreload
%autoreload 2

> **Note:** `%%time` measures a single run of the cell (wall and CPU time). `%timeit` runs the statement many times and reports the mean and standard deviation per loop, which makes it the better choice for micro‑benchmarks.
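The same distinction can be reproduced outside a notebook with the standard-library `timeit` module. This is only a rough analogue of the magics, not how IPython implements them; the loop counts (`number=100`, `repeat=5`) are arbitrary.

```python
import timeit
import numpy as np

arr = np.random.default_rng(0).random(100_000)

# One sample, analogous to %%time: quick, but sensitive to noise.
single = timeit.timeit(arr.mean, number=1)

# Several samples of many loops each, analogous to %timeit.
samples = timeit.repeat(arr.mean, number=100, repeat=5)
per_call = [s / 100 for s in samples]

print(f"single run: {single:.6f} s")
print(f"best of 5:  {min(per_call):.6f} s per call")
```

Repeating the measurement shows how much the single-run number can vary from one sample to the next.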

### B2. Mini‑EDA narrative with reproducibility

**Step 1: Create a small synthetic dataset**

In [None]:
import pandas as pd, numpy as np
rng = np.random.default_rng(SEED)
n = 500
df = pd.DataFrame({
    'user_id': np.arange(n),
    'age': rng.integers(18, 70, size=n),
    'country': rng.choice(['US', 'SG', 'DE', 'BR', 'IN'], size=n, p=[0.35,0.15,0.2,0.15,0.15]),
    'sessions': rng.poisson(3, size=n),
    'avg_session_sec': rng.normal(300, 50, size=n).clip(30, 1200)
})
df.head()

**Step 2: Quick profile (no external libs)**

In [None]:
df.info()

In [None]:
df.describe(include='all').T

In [None]:
df['country'].value_counts(normalize=True).round(3)

**Step 3: Plot distributions**

In [None]:
import matplotlib.pyplot as plt
df['age'].hist(bins=20)
plt.title('Age Distribution')
plt.show()

In [None]:
df.plot.scatter(x='sessions', y='avg_session_sec', alpha=0.3)
plt.title('Sessions vs Avg Session Seconds')
plt.show()

**Step 4: Reproducibility stamp**

- **Seed:** 42
- **Versions:** See prologue cell output
- **DATA_DIR:** `./data` (set via `%env` magic)
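The seed in the stamp is only useful if re-running with it reproduces the data. A minimal check, assuming a hypothetical helper `make_ages` that mirrors the `age` column from Step 1:

```python
import numpy as np

SEED = 42

def make_ages(seed, n=5):
    # Hypothetical helper: a fresh Generator per call, so the draws
    # depend only on the seed and not on earlier random calls.
    rng = np.random.default_rng(seed)
    return rng.integers(18, 70, size=n)

first = make_ages(SEED)
second = make_ages(SEED)

print(np.array_equal(first, second))
```

Creating the generator from the seed inside the function is one way to keep results independent of cell execution order.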

### B3. Export artifacts

In [None]:
# Persist CSV and a compact Parquet for downstream steps
import os
data_dir = os.getenv('DATA_DIR', './data')  # fall back to ./data if %env was not run
os.makedirs(data_dir, exist_ok=True)

out_csv = os.path.join(data_dir, 'mini_eda_users.csv')
out_parquet = os.path.join(data_dir, 'mini_eda_users.parquet')

%time df.to_csv(out_csv, index=False)
%time df.to_parquet(out_parquet, index=False, engine='fastparquet')

out_csv, out_parquet

**Artifact paths:**

- CSV: `./data/mini_eda_users.csv`
- Parquet: `./data/mini_eda_users.parquet`

These will be used in later labs.
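Since later labs depend on these files, it can be worth confirming that an export reads back unchanged. The sketch below does a CSV round trip in a temporary directory with a small stand-in frame; the three-row data is illustrative, not the lab's dataset.

```python
import os
import tempfile
import pandas as pd

# Stand-in data with the same kind of columns as the lab's DataFrame.
df_small = pd.DataFrame({"user_id": [0, 1, 2], "sessions": [3, 1, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "mini_eda_users.csv")
    df_small.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

print(restored.equals(df_small))
```

CSV does not store dtypes, so a round trip on the full dataset may need explicit `dtype` arguments for columns that pandas cannot infer the same way.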

---

## Part C — Wrap‑Up

### Reflection Questions:

1. **When would you prefer a view (`ravel`, `reshape`) over a copy (`flatten`) and why?**
   
   *Your answer:* Prefer views when you need to conserve memory and ensure changes propagate to the original array. Views are more performant since they don't copy data. Use copies only when you need independent data that won't affect the original.

2. **What's an example where pure‑Python loops might still be acceptable?**
   
   *Your answer:* Pure-Python loops are acceptable for small datasets, complex control flow that can't be vectorized, or when readability is more important than performance (e.g., one-time scripts, prototyping).

3. **Which Jupyter magic would you use to:**
   - **(a) Benchmark two approaches:** `%timeit` for micro-benchmarks with averaged results
   - **(b) Hide verbose output:** `%%capture` to suppress cell output
   - **(c) Ensure figures render inline in exported HTML:** `%matplotlib inline`

---

## Final Thoughts

- **Confusion between `ravel` vs `flatten`:** Remember that `ravel` returns a view when possible (changes affect original), while `flatten` always returns a copy (independent data).
- **`%time` vs `%timeit`:** Use `%time` for single-run timing, `%timeit` for averaged benchmarks.
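- **Memory estimates for lists:** These are approximate. `sys.getsizeof` on a list counts only the container's pointer array, so the per-element `int` objects have to be added separately, and CPython shares some small integers between references. A NumPy array's `nbytes` is exact for its contiguous buffer.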

---

## Solution Snippets (reference)

**Why NumPy faster?**

- Contiguous memory + vectorized C/Fortran loops reduce Python interpreter overhead; fewer allocations; better CPU cache locality.

**View vs copy demo:**

In [None]:
x = np.arange(6)
x2 = x.reshape(2,3)
r = x2.ravel()
f = x2.flatten()

x2[0,0] = 999
assert x[0] == 999 and r[0] == 999    # view tracks source
assert f[0] != 999                     # copy is independent
print("View vs Copy demo passed!")
print(f"x[0] = {x[0]}, r[0] = {r[0]}, f[0] = {f[0]}")

**Magics mapping:**

- Benchmark: `%timeit` (and `%%time` for single‑run cells)
- Hide output: `%%capture`
- Inline plots: `%matplotlib inline`
- Re-run code after edits to imported modules: `%load_ext autoreload`, then `%autoreload 2` on its own line

**Speedup expectation:** On typical laptops, `np_array.sum()` runs much faster than `sum(py_list)`; the vectorized square is often **10–100×** faster than the loop, depending on hardware.