# Day 2 - Morning Session Exercises
## Advanced NumPy, Pandas, and Visualization

**Instructions:**
- Complete exercises appropriate to your skill level
- Experiment and modify the code
- Ask questions if you get stuck!
- Solutions are hidden below each exercise - try to solve them first!

---

## Exercise 2.1: Advanced NumPy Operations (40 min)

### Physics Context
In particle physics, we often need to calculate relationships between particles in an event - distances, angular separations, and invariant masses. Doing this efficiently requires vectorized operations.
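
As a concrete illustration of one such quantity, the sketch below computes the invariant mass of a particle pair directly from (pT, η, φ), assuming both particles are massless. The helper name `invariant_mass_massless` and the input values are made up for this example and are not used elsewhere in the exercises.

```python
import numpy as np

def invariant_mass_massless(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless particles (hypothetical helper).

    Uses m^2 = 2 * pT1 * pT2 * (cosh(d_eta) - cos(d_phi)).
    Works element-wise on arrays as well as on scalars.
    """
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    return np.sqrt(np.maximum(m2, 0.0))  # guard against tiny negative rounding

# Two back-to-back particles at eta = 0 with pT = 45 GeV/c each
m = invariant_mass_massless(45.0, 0.0, 0.0, 45.0, 0.0, np.pi)
print(f"m = {m:.1f} GeV/c^2")  # m = 90.0 GeV/c^2
```

Because the azimuthal difference only enters through a cosine, this formula needs no explicit wrap-around handling, unlike the ΔR calculation below.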

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import time
%matplotlib inline

### Beginner Version: Angular Separations and 2D Histograms

In [None]:
# Generate simulated particle data for one event
np.random.seed(42)
n_particles = 10

# Particle properties
pt = np.random.exponential(scale=30, size=n_particles)   # Transverse momentum (GeV/c)
eta = np.random.uniform(-2.5, 2.5, size=n_particles)     # Pseudorapidity
phi = np.random.uniform(-np.pi, np.pi, size=n_particles) # Azimuthal angle

print(f"Generated {n_particles} particles:")
for i in range(n_particles):
    print(f"  Particle {i}: pT={pt[i]:.1f} GeV/c, η={eta[i]:.2f}, φ={phi[i]:.2f}")

In [None]:
# TODO: Calculate angular separation ΔR between ALL pairs of particles
# ΔR = sqrt(Δη² + Δφ²)
# Remember: φ wraps around (-π to π), so Δφ needs special handling

def delta_phi(phi1, phi2):
    """
    Calculate Δφ accounting for wrap-around.
    Result is in range [-π, π]
    """
    dphi = phi1 - phi2
    # YOUR CODE HERE: handle wrap-around
    # Hint: use np.where to adjust values outside [-π, π]
    
    return dphi

def delta_r(eta1, eta2, phi1, phi2):
    """
    Calculate ΔR = sqrt(Δη² + Δφ²)
    """
    deta = eta1 - eta2
    dphi = delta_phi(phi1, phi2)
    # YOUR CODE HERE: return sqrt(deta² + dphi²)
    return None

# Calculate ΔR for all pairs using loops (for comparison)
n = len(eta)
delta_r_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        delta_r_matrix[i, j] = delta_r(eta[i], eta[j], phi[i], phi[j])

print("ΔR matrix (first 5x5):")
print(delta_r_matrix[:5, :5].round(2))

<details>
<summary>💡 Click to reveal solution</summary>

```python
def delta_phi(phi1, phi2):
    """
    Calculate Δφ accounting for wrap-around.
    Result is in range [-π, π]
    """
    dphi = phi1 - phi2
    # Handle wrap-around
    dphi = np.where(dphi > np.pi, dphi - 2*np.pi, dphi)
    dphi = np.where(dphi < -np.pi, dphi + 2*np.pi, dphi)
    return dphi

def delta_r(eta1, eta2, phi1, phi2):
    """
    Calculate ΔR = sqrt(Δη² + Δφ²)
    """
    deta = eta1 - eta2
    dphi = delta_phi(phi1, phi2)
    return np.sqrt(deta**2 + dphi**2)
```

</details>
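
An alternative to the two `np.where` calls is to wrap the difference with the modulo operator. The sketch below shows that variant; the name `delta_phi_mod` is made up here. It returns values in [-π, π), so a difference of exactly +π is reported as -π, which does not change ΔR.

```python
import numpy as np

def delta_phi_mod(phi1, phi2):
    """Wrap phi1 - phi2 into [-pi, pi) using modulo (alternative sketch)."""
    return (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi

# 3.0 - (-3.0) = 6.0 rad, which is the same direction as 6.0 - 2*pi
d = delta_phi_mod(3.0, -3.0)
print(np.isclose(d, 6.0 - 2 * np.pi))  # True

# Works on arrays too
print(delta_phi_mod(np.array([0.1, -3.0]), np.array([0.0, 3.0])))
```

Unlike a single `np.where` correction, the modulo form also handles differences that are off by more than one full turn, which can matter if the input angles are not already confined to [-π, π].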

In [None]:
# TODO: Find the closest pair of particles
# Hint: Set diagonal to infinity so we don't find self-pairs

# Make a copy to avoid modifying original
dr_matrix = delta_r_matrix.copy()
np.fill_diagonal(dr_matrix, np.inf)  # Exclude self-pairs

# YOUR CODE HERE: Find minimum ΔR value
min_dr = None  # Use np.min(...)

# YOUR CODE HERE: Find indices of minimum
# Hint: use np.argmin(...) then np.unravel_index to convert flat index to 2D
i, j = 0, 0  # Replace with correct code

print(f"\nClosest pair: particles {i} and {j}")
print(f"  ΔR = {delta_r_matrix[i, j]:.3f}")
print(f"  Particle {i}: η={eta[i]:.2f}, φ={phi[i]:.2f}")
print(f"  Particle {j}: η={eta[j]:.2f}, φ={phi[j]:.2f}")

<details>
<summary>💡 Click to reveal solution</summary>

```python
# Make a copy to avoid modifying original
dr_matrix = delta_r_matrix.copy()
np.fill_diagonal(dr_matrix, np.inf)  # Exclude self-pairs

# Find minimum
min_dr = np.min(dr_matrix)

# np.unravel_index converts flat index to 2D index
i, j = np.unravel_index(np.argmin(dr_matrix), dr_matrix.shape)

print(f"\nClosest pair: particles {i} and {j}")
print(f"  ΔR = {delta_r_matrix[i, j]:.3f}")
print(f"  Particle {i}: η={eta[i]:.2f}, φ={phi[i]:.2f}")
print(f"  Particle {j}: η={eta[j]:.2f}, φ={phi[j]:.2f}")
```

</details>
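
The same idea extends to listing the few closest pairs rather than only the single closest one. The sketch below is one possible approach, using `np.triu_indices` so that each unordered pair is considered once; the demo arrays and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eta_demo = rng.uniform(-2.5, 2.5, 6)
phi_demo = rng.uniform(-np.pi, np.pi, 6)

# Pairwise ΔR matrix with φ wrap-around
deta = eta_demo[:, None] - eta_demo[None, :]
dphi = (phi_demo[:, None] - phi_demo[None, :] + np.pi) % (2 * np.pi) - np.pi
dr_demo = np.sqrt(deta**2 + dphi**2)

# Upper triangle only (k=1 skips the diagonal): each pair appears once
rows, cols = np.triu_indices(len(eta_demo), k=1)
pair_dr = dr_demo[rows, cols]

# Indices of the three smallest separations
order = np.argsort(pair_dr)[:3]
for r, c, d in zip(rows[order], cols[order], pair_dr[order]):
    print(f"particles {r} and {c}: dR = {d:.3f}")
```

Working on the upper triangle avoids both the self-pairs and the duplicate (i, j)/(j, i) entries, so no `np.fill_diagonal` step is needed.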

In [None]:
# TODO: Create a 2D histogram of η-φ distribution
# Generate more particles for a better visualization

np.random.seed(123)
n_events = 10000

# Simulate particles across many events
eta_all = np.random.uniform(-2.5, 2.5, n_events)
phi_all = np.random.uniform(-np.pi, np.pi, n_events)

# Create 2D histogram
fig, ax = plt.subplots(figsize=(10, 6))

# YOUR CODE HERE: Create 2D histogram using plt.hist2d or ax.hist2d
# Use bins=30 for both dimensions
# Add colorbar with label 'Events'

ax.set_xlabel('η (pseudorapidity)')
ax.set_ylabel('φ (azimuthal angle)')
ax.set_title('Particle Distribution in η-φ Space')

plt.tight_layout()
plt.show()

<details>
<summary>💡 Click to reveal solution</summary>

```python
# Create 2D histogram
fig, ax = plt.subplots(figsize=(10, 6))

# Create 2D histogram
h = ax.hist2d(eta_all, phi_all, bins=30, cmap='viridis')
plt.colorbar(h[3], ax=ax, label='Events')

ax.set_xlabel('η (pseudorapidity)')
ax.set_ylabel('φ (azimuthal angle)')
ax.set_title('Particle Distribution in η-φ Space')

plt.tight_layout()
plt.show()
```

</details>
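
If you need the bin contents as numbers rather than as a picture, `np.histogram2d` returns them without drawing anything. The sketch below uses freshly generated demo arrays (not the `eta_all`/`phi_all` from the cell above) and fixes the bin range explicitly.

```python
import numpy as np

rng = np.random.default_rng(123)
eta_demo = rng.uniform(-2.5, 2.5, 10_000)
phi_demo = rng.uniform(-np.pi, np.pi, 10_000)

counts, eta_edges, phi_edges = np.histogram2d(
    eta_demo, phi_demo, bins=30,
    range=[[-2.5, 2.5], [-np.pi, np.pi]],
)

print(counts.shape)       # one row per eta bin, one column per phi bin
print(int(counts.sum()))  # every generated particle falls inside the range
print(f"mean entries per bin: {counts.mean():.1f}")
```

Note that `counts[i, j]` is indexed as (η bin, φ bin), so it has to be transposed before passing it to image-style plotting functions that expect rows to run along the vertical axis.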

### Advanced Version: Vectorized Jet Clustering

In [None]:
# Jet clustering: Group nearby particles into jets
# We'll implement a simple cone algorithm (not anti-kT, but educational)

np.random.seed(42)
n_particles = 50

# Generate particles (some clustered, some isolated)
# Create 3 "seed" jets and spread particles around them
jet_centers = [
    {'eta': 0.5, 'phi': 0.3},
    {'eta': -1.2, 'phi': -2.0},
    {'eta': 1.8, 'phi': 1.5}
]

eta_particles = []
phi_particles = []
pt_particles = []

for center in jet_centers:
    n_in_jet = 15
    eta_particles.extend(np.random.normal(center['eta'], 0.2, n_in_jet))
    phi_particles.extend(np.random.normal(center['phi'], 0.2, n_in_jet))
    pt_particles.extend(np.random.exponential(20, n_in_jet))

# Add some random particles
n_random = 5
eta_particles.extend(np.random.uniform(-2.5, 2.5, n_random))
phi_particles.extend(np.random.uniform(-np.pi, np.pi, n_random))
pt_particles.extend(np.random.exponential(10, n_random))

eta = np.array(eta_particles)
phi = np.array(phi_particles)
pt = np.array(pt_particles)

print(f"Generated {len(eta)} particles")

In [None]:
# TODO: Implement vectorized ΔR calculation for ALL pairs
# Use broadcasting instead of loops!

def compute_all_delta_r_vectorized(eta, phi):
    """
    Compute ΔR between all particle pairs using broadcasting.
    
    Parameters:
    -----------
    eta, phi : np.ndarray
        Arrays of particle coordinates
    
    Returns:
    --------
    np.ndarray : Matrix of ΔR values (shape: n x n)
    """
    # YOUR CODE HERE
    # Broadcasting hint: eta[:, None] has shape (n, 1), eta[None, :] has shape (1, n)
    # Result of subtraction has shape (n, n)
    
    deta = None  # Calculate using broadcasting
    dphi = None  # Calculate using broadcasting, handle wrap-around
    dr = None    # sqrt(deta² + dphi²)
    
    return dr

# Test it
dr_matrix = compute_all_delta_r_vectorized(eta, phi)
print(f"ΔR matrix shape: {dr_matrix.shape}")

<details>
<summary>💡 Click to reveal solution</summary>

```python
def compute_all_delta_r_vectorized(eta, phi):
    """
    Compute ΔR between all particle pairs using broadcasting.
    """
    # Broadcasting: eta[:, None] has shape (n, 1), eta[None, :] has shape (1, n)
    # Result of subtraction has shape (n, n)
    
    deta = eta[:, None] - eta[None, :]  # Shape (n, n)
    
    # Calculate dphi with wrap-around
    dphi = phi[:, None] - phi[None, :]
    dphi = np.where(dphi > np.pi, dphi - 2*np.pi, dphi)
    dphi = np.where(dphi < -np.pi, dphi + 2*np.pi, dphi)
    
    dr = np.sqrt(deta**2 + dphi**2)
    return dr
```

</details>
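
To see what the broadcasting step does in isolation, here is a minimal sketch on a three-element array. The array `x` is a made-up example; the indexing pattern is the same one used for `eta` and `phi` in the solution.

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0])

col = x[:, None]   # shape (3, 1): a column
row = x[None, :]   # shape (1, 3): a row
diff = col - row   # broadcast to (3, 3), with diff[i, j] = x[i] - x[j]

print(col.shape, row.shape, diff.shape)
print(diff)
```

The result is antisymmetric (`diff[i, j] == -diff[j, i]`) with zeros on the diagonal, which is why the ΔR matrix built from it is symmetric with a zero diagonal.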

In [None]:
# TODO: Implement simple cone jet clustering
# Algorithm:
# 1. Start with highest pT particle as seed
# 2. Find all particles within ΔR < R_cone
# 3. Mark them as used, compute jet 4-momentum
# 4. Repeat with remaining particles

def cone_clustering(pt, eta, phi, R_cone=0.4, pt_min=5.0):
    """
    Simple cone jet clustering algorithm.
    
    Parameters:
    -----------
    pt, eta, phi : np.ndarray
        Particle properties
    R_cone : float
        Cone radius for clustering
    pt_min : float
        Minimum pT for jet seed
    
    Returns:
    --------
    list of dict : List of jets with properties
    """
    n = len(pt)
    used = np.zeros(n, dtype=bool)
    jets = []
    
    # Compute all ΔR values once
    dr_matrix = compute_all_delta_r_vectorized(eta, phi)
    
    while True:
        # Find highest pT unused particle
        pt_masked = np.where(used, 0, pt)
        seed_idx = np.argmax(pt_masked)
        
        if pt_masked[seed_idx] < pt_min:
            break  # No more seeds above threshold
        
        # YOUR CODE HERE: Find unused particles within R_cone of seed
        # Use dr_matrix[seed_idx] to get distances from seed, and ~used to skip taken particles
        # Note: the seed is inside its own cone, because ΔR(seed, seed) = 0
        in_cone = None  # Boolean mask for particles in cone (seed included)
        
        # Mark particles as used
        used[in_cone] = True
        used[seed_idx] = True
        
        # YOUR CODE HERE: Calculate jet properties (pT-weighted average)
        jet_pt = None   # Sum of pT over pt[in_cone]
        jet_eta = None  # pT-weighted average eta
        jet_phi = None  # pT-weighted average phi
        
        jets.append({
            'pt': jet_pt,
            'eta': jet_eta,
            'phi': jet_phi,
            'n_constituents': np.sum(in_cone)  # seed is already counted in in_cone
        })
    
    return jets

# Run clustering
jets = cone_clustering(pt, eta, phi, R_cone=0.4)
print(f"\nFound {len(jets)} jets:")
for i, jet in enumerate(jets):
    print(f"  Jet {i}: pT={jet['pt']:.1f}, η={jet['eta']:.2f}, φ={jet['phi']:.2f}, n={jet['n_constituents']}")

<details>
<summary>💡 Click to reveal solution</summary>

```python
def cone_clustering(pt, eta, phi, R_cone=0.4, pt_min=5.0):
    """
    Simple cone jet clustering algorithm.
    """
    n = len(pt)
    used = np.zeros(n, dtype=bool)
    jets = []
    
    # Compute all ΔR values once
    dr_matrix = compute_all_delta_r_vectorized(eta, phi)
    
    while True:
        # Find highest pT unused particle
        pt_masked = np.where(used, 0, pt)
        seed_idx = np.argmax(pt_masked)
        
        if pt_masked[seed_idx] < pt_min:
            break  # No more seeds above threshold
        
        # Find unused particles within R_cone of the seed.
        # The seed itself is included because dr_matrix[seed_idx, seed_idx] == 0.
        in_cone = (dr_matrix[seed_idx] < R_cone) & ~used
        
        # Mark particles as used
        used[in_cone] = True
        used[seed_idx] = True
        
        # Jet pT is the scalar sum over constituents (seed counted once)
        jet_pt = np.sum(pt[in_cone])
        
        # pT-weighted averages for eta and phi
        # Note: the plain phi average ignores wrap-around at ±π
        jet_eta = np.average(eta[in_cone], weights=pt[in_cone])
        jet_phi = np.average(phi[in_cone], weights=pt[in_cone])
        
        jets.append({
            'pt': jet_pt,
            'eta': jet_eta,
            'phi': jet_phi,
            'n_constituents': int(np.sum(in_cone))
        })
    
    return jets
```

</details>
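
A plain weighted average of φ can give a misleading jet axis when constituents sit on both sides of the ±π boundary. One common workaround, sketched below under the assumption that a pT-weighted direction is what you want, is to average unit vectors and convert back with `np.arctan2`. The helper name `weighted_circular_mean` and the demo values are made up for this example.

```python
import numpy as np

def weighted_circular_mean(phi, weights):
    """Weighted mean angle via the weighted sum of unit vectors (hypothetical helper)."""
    x = np.sum(weights * np.cos(phi))
    y = np.sum(weights * np.sin(phi))
    return np.arctan2(y, x)  # result lies in [-pi, pi]

# Two particles close together in direction, but on opposite sides of the boundary
phi_demo = np.array([3.0, -3.1])
pt_demo = np.array([10.0, 10.0])

naive = np.average(phi_demo, weights=pt_demo)
circular = weighted_circular_mean(phi_demo, pt_demo)
print(f"naive average:    {naive:.3f}")     # near 0, pointing the wrong way
print(f"circular average: {circular:.3f}")  # near pi, between the two particles
```

For the clustered demo particles in this exercise the jet centres are far from ±π, so the simple average in the solution gives nearly the same answer.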

In [None]:
# Visualize the jets
fig, ax = plt.subplots(figsize=(10, 8))

# Plot all particles
scatter = ax.scatter(eta, phi, c=pt, s=pt*2, cmap='viridis', alpha=0.6, label='Particles')
plt.colorbar(scatter, ax=ax, label='pT (GeV/c)')

# Plot jet cones
for i, jet in enumerate(jets):
    circle = plt.Circle((jet['eta'], jet['phi']), 0.4, 
                        fill=False, color='red', linewidth=2, linestyle='--')
    ax.add_patch(circle)
    ax.plot(jet['eta'], jet['phi'], 'r*', markersize=15)
    ax.annotate(f"Jet {i}\npT={jet['pt']:.0f}", 
                (jet['eta'], jet['phi']), 
                xytext=(10, 10), textcoords='offset points',
                fontsize=9, color='red')

ax.set_xlabel('η')
ax.set_ylabel('φ')
ax.set_xlim(-3, 3)
ax.set_ylim(-np.pi - 0.5, np.pi + 0.5)
ax.set_title('Jet Clustering Visualization')
ax.set_aspect('equal')

plt.tight_layout()
plt.show()

In [None]:
# TODO: Benchmark vectorized vs loop approach

def delta_r_loops(eta, phi):
    """Loop-based ΔR calculation."""
    n = len(eta)
    result = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            deta = eta[i] - eta[j]
            dphi = phi[i] - phi[j]
            if dphi > np.pi:
                dphi -= 2*np.pi
            if dphi < -np.pi:
                dphi += 2*np.pi
            result[i, j] = np.sqrt(deta**2 + dphi**2)
    return result

# Test with different sizes
sizes = [50, 100, 200, 500]

print("Performance comparison:")
print("-" * 50)

for size in sizes:
    eta_test = np.random.uniform(-2.5, 2.5, size)
    phi_test = np.random.uniform(-np.pi, np.pi, size)
    
    # Time loops (perf_counter is a high-resolution clock meant for timing)
    start = time.perf_counter()
    result_loops = delta_r_loops(eta_test, phi_test)
    time_loops = time.perf_counter() - start
    
    # Time vectorized
    start = time.perf_counter()
    result_vec = compute_all_delta_r_vectorized(eta_test, phi_test)
    time_vec = time.perf_counter() - start
    
    speedup = time_loops / time_vec if time_vec > 0 else float('inf')
    print(f"n={size:4d}: Loops={time_loops*1000:8.2f}ms, Vec={time_vec*1000:6.2f}ms, Speedup={speedup:6.1f}x")
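
A single timed run can be noisy, especially for the fast vectorized call. As a hedged alternative, the sketch below uses the standard-library `timeit` module to repeat the measurement and take the best result. The function `pairwise_deta` and the array size are made up for illustration.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
eta_demo = rng.uniform(-2.5, 2.5, 200)

def pairwise_deta(eta):
    """All pairwise eta differences via broadcasting."""
    return eta[:, None] - eta[None, :]

# 5 repeats of 100 calls each; the minimum is the least disturbed measurement
times = timeit.repeat(lambda: pairwise_deta(eta_demo), repeat=5, number=100)
best_ms = min(times) / 100 * 1000
print(f"best of 5: {best_ms:.4f} ms per call")
```

In a notebook, the `%timeit` magic does the same kind of repeated measurement with less typing.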

---
## Exercise 2.2: Advanced Pandas Techniques (45 min)

### Physics Context
Real particle physics data has hierarchical structure: runs contain events, events contain particles. We need efficient ways to work with this structure.

In [None]:
import pandas as pd
import numpy as np

### Beginner Version: Energy Calibrations and Pivot Tables

In [None]:
# Generate simulated detector data
np.random.seed(42)
n_events = 1000

data = pd.DataFrame({
    'run': np.random.choice([1, 2, 3, 4], n_events),
    'event': range(n_events),
    'detector': np.random.choice(['barrel', 'endcap'], n_events),
    'energy_raw': np.random.exponential(50, n_events),
    'eta': np.random.uniform(-2.5, 2.5, n_events)
})

print("Raw data:")
data.head(10)

In [None]:
# TODO: Apply energy calibrations
# Different calibration factors for each detector region and run

calibration = {
    (1, 'barrel'): 1.02,
    (1, 'endcap'): 1.05,
    (2, 'barrel'): 1.01,
    (2, 'endcap'): 1.04,
    (3, 'barrel'): 1.03,
    (3, 'endcap'): 1.06,
    (4, 'barrel'): 1.00,
    (4, 'endcap'): 1.03,
}

# Method 1: Using apply (slow but simple)
def get_calibration(row):
    key = (row['run'], row['detector'])
    return row['energy_raw'] * calibration.get(key, 1.0)

# YOUR CODE HERE: use apply method to calibrate energy
data['energy_cal_v1'] = None

# Method 2: More efficient - create a calibration DataFrame and merge
cal_df = pd.DataFrame([
    {'run': k[0], 'detector': k[1], 'cal_factor': v}
    for k, v in calibration.items()
])

# YOUR CODE HERE: Merge calibration factors and calculate calibrated energy
# Step 1: Merge data with cal_df on ['run', 'detector']
# Step 2: Calculate energy_cal_v2 = energy_raw * cal_factor

print("\nWith calibration:")
data[['run', 'detector', 'energy_raw', 'energy_cal_v1']].head(10)

<details>
<summary>💡 Click to reveal solution</summary>

```python
# Method 1: Using apply (slow but simple)
data['energy_cal_v1'] = data.apply(get_calibration, axis=1)

# Method 2: More efficient - create a calibration DataFrame and merge
cal_df = pd.DataFrame([
    {'run': k[0], 'detector': k[1], 'cal_factor': v}
    for k, v in calibration.items()
])

# Merge calibration factors
data = pd.merge(data, cal_df, on=['run', 'detector'], how='left')

# Calculate calibrated energy
data['energy_cal_v2'] = data['energy_raw'] * data['cal_factor']

print("\nWith calibration:")
data[['run', 'detector', 'energy_raw', 'cal_factor', 'energy_cal_v1', 'energy_cal_v2']].head(10)
```

</details>
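
When calibrating by merge, it is worth checking that the calibration table has exactly one row per key and that no event is left without a factor. The sketch below shows one way to do this with the `validate` and `indicator` options of `pd.merge`; the small `events` and `cal` tables are made up, and run 5 is deliberately missing from the calibration.

```python
import pandas as pd

events = pd.DataFrame({
    'run': [1, 1, 2, 5],
    'detector': ['barrel', 'endcap', 'barrel', 'barrel'],
    'energy_raw': [10.0, 20.0, 30.0, 40.0],
})
cal = pd.DataFrame({
    'run': [1, 1, 2],
    'detector': ['barrel', 'endcap', 'barrel'],
    'cal_factor': [1.02, 1.05, 1.01],
})

# validate raises MergeError if cal has duplicate (run, detector) keys;
# indicator adds a '_merge' column saying where each row was found
merged = pd.merge(events, cal, on=['run', 'detector'], how='left',
                  validate='many_to_one', indicator=True)

missing = merged[merged['_merge'] == 'left_only']
print(f"{len(missing)} row(s) without a calibration factor")

# Fall back to a factor of 1.0, matching calibration.get(key, 1.0) in Method 1
merged['cal_factor'] = merged['cal_factor'].fillna(1.0)
merged['energy_cal'] = merged['energy_raw'] * merged['cal_factor']
print(merged[['run', 'detector', 'energy_raw', 'cal_factor', 'energy_cal']])
```

Without the `fillna` step, rows with no matching calibration would end up with `NaN` energies, which is easy to miss in a large dataset.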

In [None]:
# TODO: Create pivot tables for run-by-run statistics

# Pivot table: mean energy by run and detector
pivot_mean = pd.pivot_table(
    data,
    values='energy_cal_v1',
    index='run',
    columns='detector',
    aggfunc='mean'
)

print("Mean calibrated energy by run and detector:")
print(pivot_mean.round(2))

# YOUR CODE HERE: Create pivot table with multiple statistics
# Use aggfunc=['mean', 'std', 'count']
pivot_multi = None

print("\nMultiple statistics:")
# print(pivot_multi.round(2))

<details>
<summary>💡 Click to reveal solution</summary>

```python
# Create pivot table with multiple statistics
pivot_multi = pd.pivot_table(
    data,
    values='energy_cal_v1',
    index='run',
    columns='detector',
    aggfunc=['mean', 'std', 'count']
)

print("\nMultiple statistics:")
print(pivot_multi.round(2))
```

</details>

In [None]:
# Visualize run-by-run differences
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Energy distribution by run
for run in sorted(data['run'].unique()):
    subset = data[data['run'] == run]['energy_cal_v1']
    axes[0].hist(subset, bins=30, alpha=0.5, label=f'Run {run}', range=(0, 200))

axes[0].set_xlabel('Calibrated Energy (GeV)')
axes[0].set_ylabel('Events')
axes[0].set_title('Energy Distribution by Run')
axes[0].legend()

# Plot 2: Mean energy vs run (bar chart)
mean_by_run = data.groupby(['run', 'detector'])['energy_cal_v1'].mean().unstack()
mean_by_run.plot(kind='bar', ax=axes[1])
axes[1].set_xlabel('Run')
axes[1].set_ylabel('Mean Energy (GeV)')
axes[1].set_title('Mean Energy by Run and Detector')
axes[1].legend(title='Detector')

plt.tight_layout()
plt.show()

### Advanced Version: Hierarchical Data Structure

In [None]:
# Create hierarchical dataset: Run → Event → Particle
np.random.seed(42)

# Generate data
data_list = []

for run in range(1, 4):  # 3 runs
    n_events_in_run = np.random.randint(30, 50)
    for event in range(n_events_in_run):
        n_particles = np.random.poisson(5)  # ~5 particles per event
        for particle in range(n_particles):
            data_list.append({
                'run': run,
                'event': event,
                'particle': particle,
                'particle_type': np.random.choice(['electron', 'muon', 'photon', 'jet']),
                'pt': np.random.exponential(30),
                'eta': np.random.uniform(-2.5, 2.5),
                'phi': np.random.uniform(-np.pi, np.pi),
                'energy': np.random.exponential(50)
            })

df = pd.DataFrame(data_list)
print(f"Total particles: {len(df)}")
print(f"Runs: {df['run'].nunique()}, Events: {df.groupby('run')['event'].nunique().sum()}")
df.head(10)

In [None]:
# TODO: Create MultiIndex DataFrame

# YOUR CODE HERE: Set index to ['run', 'event', 'particle']
df_multi = None

print("MultiIndex DataFrame:")
# print(df_multi.head(15))

# Access data at different levels
# print("\n--- Run 1 data ---")
# print(df_multi.loc[1].head())

# print("\n--- Run 1, Event 0 ---")
# print(df_multi.loc[(1, 0)])

<details>
<summary>💡 Click to reveal solution</summary>

```python
# Create MultiIndex DataFrame
df_multi = df.set_index(['run', 'event', 'particle'])

print("MultiIndex DataFrame:")
print(df_multi.head(15))

# Access data at different levels
print("\n--- Run 1 data ---")
print(df_multi.loc[1].head())

print("\n--- Run 1, Event 0 ---")
print(df_multi.loc[(1, 0)])

print("\n--- Using xs (cross-section) ---")
print(df_multi.xs(1, level='run').head())
```

</details>
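
For selections on an inner level, such as "event 0 in every run", `pd.IndexSlice` is a convenient companion to `.loc`. The sketch below uses a small made-up DataFrame with the same three index levels; sorting the index first is assumed here because slicing a MultiIndex generally requires it.

```python
import pandas as pd

demo = pd.DataFrame({
    'run':      [1, 1, 1, 2, 2],
    'event':    [0, 0, 1, 0, 1],
    'particle': [0, 1, 0, 0, 0],
    'pt':       [25.0, 12.0, 40.0, 8.0, 33.0],
}).set_index(['run', 'event', 'particle']).sort_index()

idx = pd.IndexSlice

# All particles from event 0, in every run
event0 = demo.loc[idx[:, 0, :], :]
print(event0)

# Event 1 in runs 1 through 2 (label slices include both ends)
event1 = demo.loc[idx[1:2, 1, :], :]
print(event1)
```

The trailing `, :` selects all columns; leaving it out can make the tuple ambiguous between index levels and columns.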

In [None]:
# TODO: Calculate event-level quantities using groupby

# Group by run and event
event_grouped = df.groupby(['run', 'event'])

# YOUR CODE HERE: Calculate event-level quantities
# - Total pT (sum)
# - Leading pT (max)
# - Total energy (sum)
# - Number of particles (count)

event_summary = event_grouped.agg({
    # YOUR CODE HERE
})

print("Event-level summary:")
# print(event_summary.head(10))

<details>
<summary>💡 Click to reveal solution</summary>

```python
# Group by run and event
event_grouped = df.groupby(['run', 'event'])

# Calculate event-level quantities
event_summary = event_grouped.agg({
    'pt': ['sum', 'max'],  # Total pT, leading pT
    'energy': 'sum',        # Total energy
    'particle': 'count'     # Number of particles
})

# Flatten column names
event_summary.columns = ['_'.join(col).strip() for col in event_summary.columns]
event_summary = event_summary.rename(columns={'particle_count': 'n_particles'})

print("Event-level summary:")
print(event_summary.head(10))
```

</details>
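
pandas also supports named aggregation, which sets the output column names directly and makes the flattening and renaming steps unnecessary. The sketch below applies it to a small made-up DataFrame; the output names are chosen to match the solution above.

```python
import pandas as pd

demo = pd.DataFrame({
    'run':    [1, 1, 1, 2, 2],
    'event':  [0, 0, 1, 0, 0],
    'pt':     [25.0, 12.0, 40.0, 8.0, 33.0],
    'energy': [30.0, 15.0, 55.0, 9.0, 41.0],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = demo.groupby(['run', 'event']).agg(
    pt_sum=('pt', 'sum'),
    pt_max=('pt', 'max'),
    energy_sum=('energy', 'sum'),
    n_particles=('pt', 'count'),
)
print(summary)
```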

In [None]:
# TODO: Find leading particle in each event

# YOUR CODE HERE: Use idxmax to find index of max pT in each event
# Then use .loc to get those particles

leading_idx = None  # df.groupby(['run', 'event'])['pt'].idxmax()
leading_particles = None  # df.loc[leading_idx]

print("Leading particles (first 10 events):")
# print(leading_particles[['run', 'event', 'particle_type', 'pt', 'eta']].head(10))

# Statistics of leading particles
print("\nLeading particle type distribution:")
# print(leading_particles['particle_type'].value_counts())

<details>
<summary>💡 Click to reveal solution</summary>

```python
# Find leading particle in each event using idxmax
leading_idx = df.groupby(['run', 'event'])['pt'].idxmax()
leading_particles = df.loc[leading_idx]

print("Leading particles (first 10 events):")
print(leading_particles[['run', 'event', 'particle_type', 'pt', 'eta']].head(10))

# Statistics of leading particles
print("\nLeading particle type distribution:")
print(leading_particles['particle_type'].value_counts())
```

</details>
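
Another way to pick the leading particle per event, sketched below on made-up data, is to sort by pT and keep the first row of each (run, event) pair. For events without exact pT ties this should select the same particles as the `idxmax` approach.

```python
import pandas as pd

demo = pd.DataFrame({
    'run':   [1, 1, 1, 2, 2],
    'event': [0, 0, 1, 0, 0],
    'particle_type': ['muon', 'jet', 'photon', 'electron', 'jet'],
    'pt':    [25.0, 12.0, 40.0, 8.0, 33.0],
})

leading = (
    demo.sort_values('pt', ascending=False)                      # highest pT first
        .drop_duplicates(subset=['run', 'event'], keep='first')  # one row per event
        .sort_values(['run', 'event'])                           # restore event order
)
print(leading)

# Cross-check against the idxmax approach
via_idxmax = demo.loc[demo.groupby(['run', 'event'])['pt'].idxmax()]
print(leading.index.tolist() == via_idxmax.index.tolist())
```

The sort-based variant extends naturally to the two or three leading particles by replacing `drop_duplicates` with `groupby(['run', 'event']).head(2)`.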

In [None]:
# TODO: Memory optimization with categorical types

print("Memory usage before optimization:")
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

# YOUR CODE HERE: Convert 'particle_type' to categorical
df_optimized = df.copy()
# df_optimized['particle_type'] = ...

print("\nMemory usage after optimization:")
print(df_optimized.memory_usage(deep=True))
print(f"Total: {df_optimized.memory_usage(deep=True).sum() / 1024:.1f} KB")

<details>
<summary>💡 Click to reveal solution</summary>

```python
print("Memory usage before optimization:")
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

# Convert string columns to categorical
df_optimized = df.copy()
df_optimized['particle_type'] = df_optimized['particle_type'].astype('category')

print("\nMemory usage after optimization:")
print(df_optimized.memory_usage(deep=True))
print(f"Total: {df_optimized.memory_usage(deep=True).sum() / 1024:.1f} KB")

reduction = 1 - df_optimized.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum()
print(f"\nMemory reduction: {reduction*100:.1f}%")
```

</details>
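
Numeric columns can often be shrunk as well. The sketch below, on made-up data, downcasts a small integer column with `pd.to_numeric` and stores a float column as `float32`. Whether the reduced precision of `float32` (roughly seven significant digits) is acceptable depends on the analysis, so treat this as an option rather than a default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'run': rng.integers(1, 4, 1000),   # small integers, stored as int64 by default
    'pt': rng.exponential(30, 1000),   # float64 by default
})

small = demo.copy()
small['run'] = pd.to_numeric(small['run'], downcast='unsigned')  # smallest unsigned type that fits
small['pt'] = small['pt'].astype('float32')

before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(demo.dtypes.to_dict(), f"{before / 1024:.1f} KB")
print(small.dtypes.to_dict(), f"{after / 1024:.1f} KB")
```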

---
## Exercise 2.3: Visualization with Matplotlib and Seaborn (30 min)

In [None]:
import seaborn as sns
sns.set_theme(style='whitegrid')

### Beginner Version: Standard Analysis Plots

In [None]:
# Generate simulated Z boson data
np.random.seed(42)
n_events = 5000

# Signal: Z boson mass peak
mass_signal = np.random.normal(91.2, 2.5, int(n_events * 0.7))
# Background: exponential
mass_background = np.random.exponential(30, int(n_events * 0.3)) + 60
mass_background = mass_background[mass_background < 120]

mass_all = np.concatenate([mass_signal, mass_background])
pt_all = np.random.exponential(40, len(mass_all))

print(f"Generated {len(mass_all)} events")

In [None]:
# TODO: Create a mass peak plot with proper formatting

fig, ax = plt.subplots(figsize=(10, 7))

# YOUR CODE HERE: Create histogram with histtype='step'
# counts, bins, _ = ax.hist(...)

# YOUR CODE HERE: Add error bars (Poisson errors = sqrt(N))
# bin_centers = (bins[:-1] + bins[1:]) / 2
# errors = np.sqrt(counts)
# ax.errorbar(...)

# YOUR CODE HERE: Add proper labels
ax.set_xlabel(r'$m_{\mu\mu}$ (GeV/c²)', fontsize=14)
ax.set_ylabel('Events / (1.5 GeV/c²)', fontsize=14)
ax.set_title('Dimuon Invariant Mass Distribution', fontsize=16)

# Add annotation for Z peak
# ax.annotate(...)

ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<details>
<summary>💡 Click to reveal solution</summary>

```python
fig, ax = plt.subplots(figsize=(10, 7))

# Create histogram
counts, bins, _ = ax.hist(mass_all, bins=40, range=(60, 120), 
                          histtype='step', linewidth=2, color='black',
                          label='Data')

# Add error bars
bin_centers = (bins[:-1] + bins[1:]) / 2
errors = np.sqrt(counts)
ax.errorbar(bin_centers, counts, yerr=errors, fmt='none', 
            capsize=2, color='black', label='Stat. uncertainty')

# Add proper labels
ax.set_xlabel(r'$m_{\mu\mu}$ (GeV/c²)', fontsize=14)
ax.set_ylabel('Events / (1.5 GeV/c²)', fontsize=14)
ax.set_title('Dimuon Invariant Mass Distribution', fontsize=16)

# Add annotation
ax.annotate('Z peak', xy=(91.2, max(counts)*0.9), fontsize=12,
            ha='center', color='red')
ax.axvline(91.2, color='red', linestyle='--', alpha=0.5)

ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

</details>

In [None]:
# TODO: Create pT distribution plot with log scale

fig, ax = plt.subplots(figsize=(10, 7))

# YOUR CODE HERE: Create histogram of pT
# Use bins=50, range=(0, 200)

# YOUR CODE HERE: Add log scale
# ax.set_yscale('log')

ax.set_xlabel(r'$p_T$ (GeV/c)', fontsize=14)
ax.set_ylabel('Events', fontsize=14)
ax.set_title(r'Transverse Momentum Distribution', fontsize=16)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<details>
<summary>ðŸ’¡ Click to reveal solution</summary>

```python
fig, ax = plt.subplots(figsize=(10, 7))

# Create histogram of pT
ax.hist(pt_all, bins=50, range=(0, 200), histtype='step', 
        linewidth=2, color='blue', label='All events')

# Add log scale
ax.set_yscale('log')

ax.set_xlabel(r'$p_T$ (GeV/c)', fontsize=14)
ax.set_ylabel('Events', fontsize=14)
ax.set_title(r'Transverse Momentum Distribution', fontsize=16)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

</details>

### Advanced Version: Multi-Panel Data/MC Comparison

In [None]:
# Generate "data" and "Monte Carlo" samples
np.random.seed(42)

# Data
n_data = 10000
data_mass = np.concatenate([
    np.random.normal(91.2, 2.5, int(n_data * 0.7)),
    np.random.exponential(25, int(n_data * 0.3)) + 60
])
data_mass = data_mass[(data_mass > 60) & (data_mass < 120)]

# MC Signal (normalized to data)
n_mc_sig = int(n_data * 0.7 * 1.2)  # Slightly more for better stats
mc_signal = np.random.normal(91.2, 2.5, n_mc_sig)

# MC Background
n_mc_bkg = int(n_data * 0.3 * 1.2)
mc_background = np.random.exponential(25, n_mc_bkg) + 60

print(f"Data: {len(data_mass)}, MC Signal: {len(mc_signal)}, MC Background: {len(mc_background)}")

In [None]:
# TODO: Create publication-quality comparison plot with:
# - Top panel: Data points with error bars + stacked MC histograms
# - Bottom panel: Data/MC ratio

fig, axes = plt.subplots(2, 1, figsize=(10, 10), 
                         gridspec_kw={'height_ratios': [3, 1]},
                         sharex=True)

# Define binning
bins = np.linspace(60, 120, 41)
bin_centers = (bins[:-1] + bins[1:]) / 2
bin_width = bins[1] - bins[0]

# ===== Top panel: Data and MC =====
ax1 = axes[0]

# YOUR CODE HERE: Create data histogram and get counts
# data_counts, _ = np.histogram(data_mass, bins=bins)
# data_errors = np.sqrt(data_counts)

# YOUR CODE HERE: Plot data as points with error bars
# ax1.errorbar(...)

# YOUR CODE HERE: Create MC histograms and scale to data
# mc_sig_counts, _ = np.histogram(mc_signal, bins=bins)
# mc_bkg_counts, _ = np.histogram(mc_background, bins=bins)
# scale = data_counts.sum() / (mc_sig_counts.sum() + mc_bkg_counts.sum())

# YOUR CODE HERE: Plot stacked MC using ax1.bar()

ax1.set_ylabel('Events / (1.5 GeV/c²)', fontsize=12)
ax1.set_title('Z → μμ: Data vs Monte Carlo', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10, loc='upper right')
ax1.grid(True, alpha=0.3)

# ===== Bottom panel: Data/MC ratio =====
ax2 = axes[1]

# YOUR CODE HERE: Calculate and plot ratio
# ratio = data_counts / mc_total
# ax2.errorbar(...)
# ax2.axhline(1.0, ...)

ax2.set_xlabel(r'$m_{\mu\mu}$ (GeV/c²)', fontsize=12)
ax2.set_ylabel('Data / MC', fontsize=12)
ax2.set_ylim(0.5, 1.5)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<details>
<summary>💡 Click to reveal solution</summary>

```python
fig, axes = plt.subplots(2, 1, figsize=(10, 10), 
                         gridspec_kw={'height_ratios': [3, 1]},
                         sharex=True)

# Define binning
bins = np.linspace(60, 120, 41)
bin_centers = (bins[:-1] + bins[1:]) / 2
bin_width = bins[1] - bins[0]

# ===== Top panel: Data and MC =====
ax1 = axes[0]

# Data histogram
data_counts, _ = np.histogram(data_mass, bins=bins)
data_errors = np.sqrt(data_counts)
ax1.errorbar(bin_centers, data_counts, yerr=data_errors, 
             fmt='ko', markersize=4, label='Data')

# MC histograms (stacked)
mc_sig_counts, _ = np.histogram(mc_signal, bins=bins)
mc_bkg_counts, _ = np.histogram(mc_background, bins=bins)

# Scale MC to data
scale = data_counts.sum() / (mc_sig_counts.sum() + mc_bkg_counts.sum())
mc_sig_scaled = mc_sig_counts * scale
mc_bkg_scaled = mc_bkg_counts * scale

# Plot stacked MC
ax1.bar(bin_centers, mc_bkg_scaled, width=bin_width, alpha=0.5, 
        color='orange', label='MC Background')
ax1.bar(bin_centers, mc_sig_scaled, width=bin_width, alpha=0.5,
        bottom=mc_bkg_scaled, color='blue', label='MC Signal')

ax1.set_ylabel('Events / (1.5 GeV/c²)', fontsize=12)
ax1.set_title('Z → μμ: Data vs Monte Carlo', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10, loc='upper right')
ax1.grid(True, alpha=0.3)

# Add experiment label
ax1.text(0.05, 0.95, 'Simulation\n' + r'$\sqrt{s}$ = 13 TeV',
         transform=ax1.transAxes, verticalalignment='top',
         fontsize=11, family='sans-serif')

# ===== Bottom panel: Data/MC ratio =====
ax2 = axes[1]

mc_total = mc_sig_scaled + mc_bkg_scaled
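# Empty MC bins become NaN; without out=, np.divide leaves those entries uninitialized
ratio = np.divide(data_counts, mc_total,
                  out=np.full_like(mc_total, np.nan), where=mc_total > 0)
ratio_err = np.divide(data_errors, mc_total,
                      out=np.full_like(mc_total, np.nan), where=mc_total > 0)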

ax2.errorbar(bin_centers, ratio, yerr=ratio_err, fmt='ko', markersize=4)
ax2.axhline(1.0, color='red', linestyle='--', linewidth=1)
ax2.fill_between([60, 120], [0.9, 0.9], [1.1, 1.1], alpha=0.2, color='gray')

ax2.set_xlabel(r'$m_{\mu\mu}$ (GeV/c²)', fontsize=12)
ax2.set_ylabel('Data / MC', fontsize=12)
ax2.set_ylim(0.5, 1.5)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('data_mc_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```

</details>
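
The ratio uncertainty in the solution only accounts for the statistical error on the data. If the MC sample is not much larger than the data, its own statistical uncertainty can matter too. The sketch below shows one common way to combine the two, assuming they are uncorrelated; all arrays, including `mc_err`, are made-up illustrative values rather than quantities from the cells above.

```python
import numpy as np

data_counts = np.array([100.0, 400.0, 0.0, 25.0])
mc_total = np.array([110.0, 380.0, 0.0, 20.0])
mc_err = np.array([5.0, 10.0, 0.0, 2.0])   # hypothetical MC statistical uncertainties

data_err = np.sqrt(data_counts)            # Poisson uncertainty on the data
valid = (mc_total > 0) & (data_counts > 0)

# Start from NaN so that empty bins are skipped when plotting
ratio = np.full_like(mc_total, np.nan)
ratio_err = np.full_like(mc_total, np.nan)

ratio[valid] = data_counts[valid] / mc_total[valid]

# Uncorrelated relative uncertainties add in quadrature
rel = np.sqrt((data_err[valid] / data_counts[valid])**2
              + (mc_err[valid] / mc_total[valid])**2)
ratio_err[valid] = ratio[valid] * rel

print(ratio)
print(ratio_err)
```

Many analyses instead draw the MC uncertainty as a separate band around 1.0 and keep only the data uncertainty on the points; which convention to use is a presentation choice.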

In [None]:
# Seaborn statistical visualization

# Create DataFrame for Seaborn
df_plot = pd.DataFrame({
    'pt': np.concatenate([pt_all, np.random.exponential(35, len(pt_all))]),
    'eta': np.concatenate([
        np.random.uniform(-2.5, 2.5, len(pt_all)),
        np.random.uniform(-2.5, 2.5, len(pt_all))
    ]),
    'source': ['Data'] * len(pt_all) + ['MC'] * len(pt_all)
})

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# KDE comparison
sns.kdeplot(data=df_plot, x='pt', hue='source', ax=axes[0], fill=True, alpha=0.5)
axes[0].set_xlabel(r'$p_T$ (GeV/c)')
axes[0].set_title('pT Distribution Comparison')
axes[0].set_xlim(0, 200)

# Box plot
sns.boxplot(data=df_plot, x='source', y='pt', ax=axes[1])
axes[1].set_ylabel(r'$p_T$ (GeV/c)')
axes[1].set_title('pT Box Plot')

# Violin plot
sns.violinplot(data=df_plot, x='source', y='eta', ax=axes[2])
axes[2].set_ylabel('η')
axes[2].set_title('η Distribution')

plt.tight_layout()
plt.show()

---
## Summary

Today you learned:

✅ **Advanced NumPy**: Vectorized operations, broadcasting, fancy indexing  
✅ **Performance**: Vectorized NumPy code is often 10-100x faster than equivalent Python loops; the exact speedup depends on array size and hardware  
✅ **Advanced Pandas**: MultiIndex, groupby aggregation, apply vs. merge, memory optimization  
✅ **Visualization**: Publication-quality plots, Data/MC comparisons, Seaborn  

**This afternoon:** Functions and Object-Oriented Programming for organizing analysis code!

---

**Great work! 🎉**