# Week 1: Multiple Random Variables

**Course:** Statistics for Data Science II (BSST1002)  
**Week:** 1

## Learning Objectives
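- Define joint, marginal, and conditional distributions for a pair of random variables
- Compute marginal and conditional PMFs from a joint PMF table
- Implement and visualize joint distributions using NumPy, SciPy, and Pandas
- Interpret the joint behavior of two variables in a data science setting (simulated stock returns)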


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print('✓ Libraries loaded')

## 1. Joint Distributions

### Definition
**Joint PMF** for discrete $(X, Y)$:
$$p(x, y) = P(X=x, Y=y)$$

**Properties:**
- $p(x, y) \geq 0$ for all $(x, y)$
- $\sum_x\sum_y p(x, y) = 1$
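
The two properties above can be checked numerically for any candidate table. The following is a minimal sketch; the helper name `is_valid_joint_pmf`, the tolerance `tol`, and both example tables are illustrative assumptions, not part of the course material.

```python
import numpy as np

def is_valid_joint_pmf(table, tol=1e-12):
    """Return True if `table` is nonnegative and sums to 1 (within `tol`)."""
    table = np.asarray(table, dtype=float)
    nonnegative = bool(np.all(table >= 0))
    normalized = bool(np.isclose(table.sum(), 1.0, atol=tol))
    return nonnegative and normalized

# Hypothetical tables: rows index Y, columns index X
valid_table = np.array([[1/12, 2/12, 1/12],
                        [2/12, 3/12, 3/12]])
invalid_table = np.array([[0.5, 0.3],
                          [0.3, 0.1]])  # entries sum to 1.2, so not a PMF

print(is_valid_joint_pmf(valid_table))
print(is_valid_joint_pmf(invalid_table))
```

The first table passes both checks; the second fails the normalization check.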

**Joint PDF** for continuous $(X, Y)$: a function $f(x, y) \geq 0$ that integrates to 1 over the plane and satisfies, for any region $A$:
$$P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy$$
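
To make the double integral concrete, here is a small sketch using `scipy.integrate.dblquad`. The density $f(x, y) = x + y$ on the unit square is an assumed textbook-style example chosen for illustration; it does not come from the notebook's data.

```python
from scipy.integrate import dblquad

def f(y, x):
    # dblquad expects the integrand as func(y, x): inner variable first.
    # Assumed example density: f(x, y) = x + y on [0, 1] x [0, 1].
    return x + y

# Integrate over the whole support: x in [0, 1], y in [0, 1]
total, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: 1)

# Probability of the region A = [0, 0.5] x [0, 0.5]
prob_A, _ = dblquad(f, 0, 0.5, lambda x: 0, lambda x: 0.5)

print(f'Total probability: {total:.4f}')
print(f'P(X <= 0.5, Y <= 0.5): {prob_A:.4f}')
```

For this density the total should come out as 1 and the probability of the lower-left quarter of the square as 0.125, which can be confirmed by integrating by hand.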


In [None]:
# Discrete joint distribution
x_vals = [0, 1, 2]
y_vals = [0, 1]

# Joint PMF (illustrative table, not derived from a specific experiment)
# Rows index Y (0, 1); columns index X (0, 1, 2)
joint_pmf = np.array([[1/12, 2/12, 1/12],
                      [2/12, 3/12, 3/12]])

print('Joint PMF:')
print(pd.DataFrame(joint_pmf, index=[f'Y={y}' for y in y_vals], 
                   columns=[f'X={x}' for x in x_vals]))
print(f'\n✓ Sum = {joint_pmf.sum():.2f}')

# Visualize
plt.figure(figsize=(10, 6))
sns.heatmap(joint_pmf, annot=True, fmt='.3f', cmap='Blues',
            xticklabels=x_vals, yticklabels=y_vals, cbar_kws={'label': 'Probability'})
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.title('Joint Probability Mass Function', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 2. Marginal Distributions

### Definition
**Marginal PMF of $X$:**
$$p_X(x) = \sum_y p(x, y)$$

**Marginal PMF of $Y$:**
$$p_Y(y) = \sum_x p(x, y)$$

To obtain a marginal, sum the joint PMF over all values of the other variable.
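
As a rough sketch of the same idea with labeled axes, a Pandas `DataFrame` makes it explicit which variable is being summed out. The table values reuse the illustrative joint PMF from Section 1; the variable names `joint`, `p_x`, and `p_y` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Joint PMF with named axes: rows are values of Y, columns are values of X
joint = pd.DataFrame(
    [[1/12, 2/12, 1/12],
     [2/12, 3/12, 3/12]],
    index=pd.Index([0, 1], name='Y'),
    columns=pd.Index([0, 1, 2], name='X'),
)

p_x = joint.sum(axis=0)  # sum down the rows (over Y) -> marginal of X
p_y = joint.sum(axis=1)  # sum across the columns (over X) -> marginal of Y

print(p_x)
print(p_y)
```

Here $p_X = (3/12, 5/12, 4/12)$ and $p_Y = (4/12, 8/12)$, and each marginal sums to 1.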


In [None]:
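# Calculate marginal distributions (rows of joint_pmf index Y, columns index X)
marginal_x = joint_pmf.sum(axis=0)  # sum over Y (rows) -> p_X(x)
marginal_y = joint_pmf.sum(axis=1)  # sum over X (columns) -> p_Y(y)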

print('Marginal Distribution of X:')
for i, x in enumerate(x_vals):
    print(f'  P(X={x}) = {marginal_x[i]:.3f}')

print('\nMarginal Distribution of Y:')
for i, y in enumerate(y_vals):
    print(f'  P(Y={y}) = {marginal_y[i]:.3f}')

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(x_vals, marginal_x, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('X', fontsize=12)
axes[0].set_ylabel('Probability', fontsize=12)
axes[0].set_title('Marginal Distribution of X', fontsize=13, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

axes[1].bar(y_vals, marginal_y, color='salmon', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Y', fontsize=12)
axes[1].set_ylabel('Probability', fontsize=12)
axes[1].set_title('Marginal Distribution of Y', fontsize=13, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Conditional Distributions

### Definition
**Conditional PMF of $Y$ given $X=x$:**
$$p_{Y|X}(y|x) = \frac{p(x, y)}{p_X(x)}$$

This is the probability that $Y=y$ given that $X=x$ is known. It is defined only when $p_X(x) > 0$.

**Properties:**
- $p_{Y|X}(y|x) \geq 0$
- $\sum_y p_{Y|X}(y|x) = 1$
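
The definition can be applied to every value of $x$ at once. The sketch below is one possible way to do this with NumPy broadcasting; it reuses the illustrative joint PMF from Section 1, and the name `cond_y_given_x` is an assumption for this example.

```python
import numpy as np

# Illustrative joint PMF: rows index Y (0, 1), columns index X (0, 1, 2)
joint_pmf = np.array([[1/12, 2/12, 1/12],
                      [2/12, 3/12, 3/12]])

p_x = joint_pmf.sum(axis=0)  # marginal of X, shape (3,)

# Broadcasting divides column j by p_X(x_j), so column j holds p(y | X = x_j)
cond_y_given_x = joint_pmf / p_x

print(cond_y_given_x)
print('Column sums:', cond_y_given_x.sum(axis=0))
```

Each column is a valid PMF in $y$, so the column sums are all 1. For $X=1$ the column is $(2/5, 3/5)$.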


In [None]:
# Conditional distribution
x_given = 1  # Condition on X=1
idx = x_vals.index(x_given)

conditional_y_given_x = joint_pmf[:, idx] / marginal_x[idx]

print(f'Conditional Distribution of Y given X={x_given}:')
for i, y in enumerate(y_vals):
    print(f'  P(Y={y}|X={x_given}) = {conditional_y_given_x[i]:.3f}')
print(f'\n✓ Sum = {conditional_y_given_x.sum():.3f}')

# Compare with marginal
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(y_vals, marginal_y, color='salmon', alpha=0.7, label='Marginal')
axes[0].set_xlabel('Y', fontsize=12)
axes[0].set_ylabel('Probability', fontsize=12)
axes[0].set_title('Marginal P(Y)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

axes[1].bar(y_vals, conditional_y_given_x, color='green', alpha=0.7, label=f'Conditional (X={x_given})')
axes[1].set_xlabel('Y', fontsize=12)
axes[1].set_ylabel('Probability', fontsize=12)
axes[1].set_title(f'Conditional P(Y|X={x_given})', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Continuous Joint Distributions

### Bivariate Normal
The bivariate normal is the most widely used continuous joint distribution:

$$(X, Y) \sim N(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho)$$

where $\rho$ is the correlation coefficient between $X$ and $Y$, with $|\rho| < 1$.

**PDF:**
$$f(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right)\right]$$
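
One way to check the formula is to code it directly and compare it against `scipy.stats.multivariate_normal`. This is a minimal sketch: the function name `bivariate_normal_pdf`, the parameter values, and the evaluation point are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Evaluate the bivariate normal density term by term from the formula."""
    zx = (x - mu_x) / sigma_x  # standardized x
    zy = (y - mu_y) / sigma_y  # standardized y
    quad = (zx**2 - 2 * rho * zx * zy + zy**2) / (1 - rho**2)
    norm = 2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho**2)
    return np.exp(-0.5 * quad) / norm

# Assumed parameters for the comparison
mu_x, mu_y, sigma_x, sigma_y, rho = 0.0, 0.0, 1.0, 1.0, 0.8

# Covariance matrix implied by (sigma_x, sigma_y, rho)
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]

point = [0.5, -0.25]
manual = bivariate_normal_pdf(point[0], point[1], mu_x, mu_y, sigma_x, sigma_y, rho)
reference = multivariate_normal(mean=[mu_x, mu_y], cov=cov).pdf(point)

print(f'Formula: {manual:.6f}')
print(f'SciPy:   {reference:.6f}')
```

The two values should agree to floating-point precision, which confirms that the off-diagonal covariance entry is $\rho\sigma_X\sigma_Y$.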


In [None]:
# Bivariate normal distribution
from scipy.stats import multivariate_normal

mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # Unit variances, so the covariance 0.8 equals the correlation

rv = multivariate_normal(mean, cov)

# Generate grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))
Z = rv.pdf(pos)

# 3D visualization
fig = plt.figure(figsize=(14, 6))

ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
ax1.set_xlabel('X', fontsize=10)
ax1.set_ylabel('Y', fontsize=10)
ax1.set_zlabel('Density', fontsize=10)
ax1.set_title('Bivariate Normal PDF (ρ=0.8)', fontsize=13, fontweight='bold')

# Contour plot
ax2 = fig.add_subplot(122)
contour = ax2.contourf(X, Y, Z, levels=20, cmap='viridis')
ax2.set_xlabel('X', fontsize=12)
ax2.set_ylabel('Y', fontsize=12)
ax2.set_title('Contour Plot', fontsize=13, fontweight='bold')
plt.colorbar(contour, ax=ax2, label='Density')

plt.tight_layout()
plt.show()

## 5. Application: Stock Returns

This example simulates 500 days of returns for two correlated stocks from a bivariate normal distribution and examines their joint distribution.


In [None]:
# Simulate stock returns
np.random.seed(42)
n_days = 500

# Correlated returns
mean = [0.001, 0.0008]
cov = [[0.0004, 0.00025], [0.00025, 0.0003]]
returns = np.random.multivariate_normal(mean, cov, n_days)

stock_A = returns[:, 0]
stock_B = returns[:, 1]

# Analysis
print('Summary Statistics:')
print(f'Stock A: μ={stock_A.mean():.5f}, σ={stock_A.std():.5f}')
print(f'Stock B: μ={stock_B.mean():.5f}, σ={stock_B.std():.5f}')
print(f'Correlation: {np.corrcoef(stock_A, stock_B)[0,1]:.3f}')

# Visualize joint distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot
axes[0].scatter(stock_A, stock_B, alpha=0.5, s=30)
axes[0].set_xlabel('Stock A Returns', fontsize=12)
axes[0].set_ylabel('Stock B Returns', fontsize=12)
axes[0].set_title('Joint Distribution of Returns', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].axhline(0, color='red', linestyle='--', linewidth=0.5)
axes[0].axvline(0, color='red', linestyle='--', linewidth=0.5)

# 2D histogram
axes[1].hist2d(stock_A, stock_B, bins=30, cmap='Blues')
axes[1].set_xlabel('Stock A Returns', fontsize=12)
axes[1].set_ylabel('Stock B Returns', fontsize=12)
axes[1].set_title('2D Histogram', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()
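
The sample correlation printed in this section can be compared with the value implied by the covariance matrix used in the simulation, since $\rho = \sigma_{AB} / (\sigma_A \sigma_B)$. The following is a small sketch of that conversion; the names `sigma` and `corr` are assumptions for this example.

```python
import numpy as np

# Covariance matrix used to simulate the two return series
cov = np.array([[0.0004, 0.00025],
                [0.00025, 0.0003]])

sigma = np.sqrt(np.diag(cov))         # standard deviations of A and B
corr = cov / np.outer(sigma, sigma)   # divide each entry by sigma_i * sigma_j
rho = corr[0, 1]

print(f'sigma_A = {sigma[0]:.4f}, sigma_B = {sigma[1]:.4f}')
print(f'Implied correlation: {rho:.3f}')
```

The implied correlation is about 0.72, so a sample correlation from 500 simulated days should land near that value, with some sampling variation.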

## Summary

### Key Concepts
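1. **Joint Distribution:** $p(x, y)$ (discrete) or $f(x, y)$ (continuous)
2. **Marginal Distribution:** Sum (discrete) or integrate (continuous) the joint over the other variable
3. **Conditional Distribution:** Distribution of one variable given a known value of the other
4. **Bivariate Normal:** The most widely used continuous joint distribution, parameterized by two means, two variances, and the correlation $\rho$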

### Formulas
$$p_X(x) = \sum_y p(x, y) \quad p_{Y|X}(y|x) = \frac{p(x,y)}{p_X(x)}$$

### Applications
- Portfolio analysis (stocks)
- Multivariate regression
- Machine learning (joint feature distributions)

**Next:** Week 2 - Independence and Conditional Probability
