# Week 2: Covariance, Correlation & Independence

**Course:** Statistics for Data Science II (BSMA1004)  
**Week:** 2 of 12

## Learning Objectives
- Master covariance and correlation concepts
- Understand independence vs uncorrelated variables
- Apply to feature selection and multicollinearity detection
- Implement correlation analysis in Python
- Interpret correlation matrices for real datasets

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.datasets import load_diabetes

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print('✓ Libraries loaded successfully')

## 1. Review: Independence

### Mathematical Definition
X and Y are **independent** if:
$$P(X \in A, Y \in B) = P(X \in A) \cdot P(Y \in B)$$

**Equivalently:**
- Discrete: $p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y)$ for all x,y
- Continuous: $f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)$ for all x,y
- Conditional: $p_{Y|X}(y|x) = p_Y(y)$ (Y doesn't depend on X)

**Key Property:** Independence ⇒ Zero Covariance (but NOT vice versa!)
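
The conditional form of the definition can also be checked directly on a joint PMF table. The sketch below is illustrative only: the helper names `conditional_y_given_x` and `rows_match_marginal` are made up for this example, and the two tables are the same small 2×2 distributions used in the next cell. It divides each row of the joint table by $p_X(x)$ and compares the result with $p_Y(y)$.

```python
import numpy as np

# Joint PMF tables: rows index values of X, columns index values of Y
joint_indep = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
joint_dep = np.array([[0.3, 0.2],
                      [0.1, 0.4]])

def conditional_y_given_x(joint_pmf):
    """Return p(y | x): each row of the joint table divided by p_X(x)."""
    marginal_x = joint_pmf.sum(axis=1, keepdims=True)
    return joint_pmf / marginal_x

def rows_match_marginal(joint_pmf):
    """True if every conditional row p(y | x) equals the marginal p_Y(y)."""
    marginal_y = joint_pmf.sum(axis=0)
    return np.allclose(conditional_y_given_x(joint_pmf), marginal_y)

print(rows_match_marginal(joint_indep))  # conditionals equal the marginal
print(rows_match_marginal(joint_dep))    # conditionals change with x
```

For the dependent table, the row for the first value of X is (0.6, 0.4) while the marginal of Y is (0.4, 0.6), so knowing X changes the distribution of Y.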

In [None]:
# Test independence function
def test_independence(joint_pmf, marginal_x, marginal_y):
    """Test if P(X,Y) = P(X)P(Y) for all cells"""
    expected = np.outer(marginal_x, marginal_y)
    return np.allclose(joint_pmf, expected)

# Independent case: two fair coin flips
joint_indep = np.array([
    [0.25, 0.25],
    [0.25, 0.25]
])

marginal_X = joint_indep.sum(axis=1)  # [0.5, 0.5]
marginal_Y = joint_indep.sum(axis=0)  # [0.5, 0.5]

print(f"Independent? {test_independence(joint_indep, marginal_X, marginal_Y)}")
# True

# Dependent case
joint_dep = np.array([
    [0.3, 0.2],
    [0.1, 0.4]
])
marginal_X_dep = joint_dep.sum(axis=1)
marginal_Y_dep = joint_dep.sum(axis=0)

print(f"Dependent? {not test_independence(joint_dep, marginal_X_dep, marginal_Y_dep)}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(joint_indep, annot=True, cmap='Greens', ax=axes[0], cbar=False)
axes[0].set_title('Independent Distribution', fontweight='bold')
sns.heatmap(joint_dep, annot=True, cmap='Reds', ax=axes[1], cbar=False)
axes[1].set_title('Dependent Distribution', fontweight='bold')
plt.tight_layout()
plt.show()

## 2. Covariance

### Definition
**Covariance** measures how two variables vary together:

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

### Interpretation
- **Cov(X,Y) > 0**: Positive association (X↑ → Y↑)
- **Cov(X,Y) < 0**: Negative association (X↑ → Y↓)
- **Cov(X,Y) = 0**: No linear relationship (uncorrelated)

### Properties
1. $\text{Cov}(X, X) = \text{Var}(X)$
2. $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ (symmetric)
3. $\text{Cov}(aX, bY) = ab \cdot \text{Cov}(X, Y)$
4. $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$
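
These four properties also hold exactly for sample moments, as long as the same divisor (here $n$) is used throughout. The following is a minimal numerical check on simulated data; the helper `cov`, the constants `a` and `b`, and the simulated arrays are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)   # y is built to co-vary with x

def cov(u, v):
    """Population-style covariance (divide by n), matching np.var's default."""
    return np.mean((u - u.mean()) * (v - v.mean()))

a, b = 3.0, -2.0
prop1 = np.isclose(cov(x, x), np.var(x))                       # Cov(X, X) = Var(X)
prop2 = np.isclose(cov(x, y), cov(y, x))                       # symmetry
prop3 = np.isclose(cov(a * x, b * y), a * b * cov(x, y))       # scaling
prop4 = np.isclose(np.var(x + y),
                   np.var(x) + np.var(y) + 2 * cov(x, y))      # variance of a sum

print(prop1, prop2, prop3, prop4)
```

Note that `np.var` divides by $n$ by default, whereas `np.cov` divides by $n-1$ unless `bias=True` is passed; mixing the two conventions would break property 1.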

In [None]:
# Example: study hours vs exam scores
study_hours = np.array([2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([65, 70, 75, 80, 85, 90, 95])

# Method 1: Manual calculation
mean_hours = study_hours.mean()
mean_scores = exam_scores.mean()
cov_manual = np.mean((study_hours - mean_hours) * (exam_scores - mean_scores))
print(f"Covariance (manual): {cov_manual:.2f}")

# Method 2: Using numpy
cov_matrix = np.cov(study_hours, exam_scores, bias=True)
cov_numpy = cov_matrix[0, 1]
print(f"Covariance (numpy): {cov_numpy:.2f}")

# Interpretation
print("\n✓ Positive covariance → more study hours associated with higher scores")

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(study_hours, exam_scores, s=100, alpha=0.6)
plt.plot(study_hours, exam_scores, '--', alpha=0.3)
plt.xlabel('Study Hours', fontsize=12)
plt.ylabel('Exam Scores', fontsize=12)
plt.title(f'Study Hours vs Exam Scores (Cov = {cov_numpy:.2f})', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

## 3. Correlation Coefficient

### Definition
**Pearson Correlation Coefficient** (normalized covariance):

$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}$$

### Properties
1. **Range**: $-1 \leq \rho \leq 1$
2. **Scale-invariant**: Changing units doesn't change ρ
3. **Special values**:
   - ρ = 1: Perfect positive linear (Y = aX + b, a > 0)
   - ρ = -1: Perfect negative linear (Y = aX + b, a < 0)
   - ρ = 0: No linear relationship

### Why Normalize?
Covariance is scale-dependent:
- Cov(height in cm, weight in kg) ≠ Cov(height in m, weight in g)
- Correlation solves this: always between -1 and 1
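
A quick way to see the scale dependence is to convert units and recompute both quantities. The sketch below uses simulated heights and weights (the numbers are made up for illustration): going from cm to m divides by 100 and going from kg to g multiplies by 1000, so the covariance should change by a factor of 10 while the correlation stays the same.

```python
import numpy as np

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 10, 300)
weight_kg = 0.5 * height_cm + rng.normal(0, 5, 300)

# Same measurements expressed in different units
height_m = height_cm / 100
weight_g = weight_kg * 1000

cov_cm_kg = np.cov(height_cm, weight_kg)[0, 1]
cov_m_g = np.cov(height_m, weight_g)[0, 1]
corr_cm_kg = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m_g = np.corrcoef(height_m, weight_g)[0, 1]

print(f"Cov  (cm, kg): {cov_cm_kg:.2f}   Cov  (m, g): {cov_m_g:.2f}")
print(f"Corr (cm, kg): {corr_cm_kg:.4f}   Corr (m, g): {corr_m_g:.4f}")
```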

In [None]:
# Calculate correlation
correlation = np.corrcoef(study_hours, exam_scores)[0, 1]
print(f"Correlation: {correlation:.4f}")
print(f"Interpretation: Perfect positive linear relationship (ρ ≈ {correlation:.2f})")

# Different correlation scenarios
np.random.seed(42)
n = 100

# Create 4 scenarios
x1 = np.linspace(0, 10, n)
y1 = 2 * x1 + 5  # Perfect positive

x2 = np.random.normal(0, 1, n)
y2 = 0.8 * x2 + np.random.normal(0, 0.3, n)  # Strong positive

x3 = np.random.normal(0, 1, n)
y3 = np.random.normal(0, 1, n)  # No correlation

x4 = np.random.normal(0, 1, n)
y4 = -0.9 * x4 + np.random.normal(0, 0.2, n)  # Strong negative

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
scenarios = [(x1, y1, 'Perfect Positive'), (x2, y2, 'Strong Positive'),
             (x3, y3, 'No Correlation'), (x4, y4, 'Strong Negative')]
colors = ['blue', 'green', 'gray', 'red']

for ax, (x, y, title), color in zip(axes.flat, scenarios, colors):
    corr = np.corrcoef(x, y)[0, 1]
    ax.scatter(x, y, alpha=0.6, color=color, s=30)
    ax.set_title(f'{title}\n(ρ = {corr:.3f})', fontweight='bold')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Independence vs Uncorrelation

### Critical Distinction

| Property | Independence | Zero Correlation |
|----------|-------------|------------------|
| Definition | P(X,Y) = P(X)P(Y) | Cov(X,Y) = 0 |
| Meaning | No relationship of any kind | No LINEAR relationship |
| Strength | Stronger condition | Weaker condition |
| Direction | **Independence ⇒ Zero Cov** | **Zero Cov ⇏ Independence** |

### Classic Example: Y = X²
- X ~ Uniform(-2, 2)
- Y = X²
- Result: Cov(X,Y) = 0 BUT Y completely depends on X!
- Reason: Symmetry cancels linear relationship
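
The zero covariance can be derived by hand from the covariance formula in Section 2, using only the moments of a Uniform(-2, 2) variable (density $1/4$ on $[-2, 2]$):

$$E[X] = 0, \qquad E[X^2] = \int_{-2}^{2} \frac{x^2}{4}\,dx = \frac{4}{3}, \qquad E[X^3] = \int_{-2}^{2} \frac{x^3}{4}\,dx = 0$$

$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[X^3] - E[X]\,E[X^2] = 0 - 0 \cdot \frac{4}{3} = 0$$

The odd moments vanish because the density is symmetric about zero. Independence still fails: for example, $P(Y \leq 1) = P(|X| \leq 1) = 1/2$, yet $P(Y \leq 1 \mid X = 1.5) = 0$.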

In [None]:
# Y = X² example
np.random.seed(42)
X = np.random.uniform(-2, 2, 1000)
Y = X**2

# Check correlation
corr = np.corrcoef(X, Y)[0, 1]
print(f"Correlation between X and Y=X²: {corr:.4f}")
print("\n⚠️ Nearly ZERO correlation, but Y is 100% determined by X!")
print("This shows: Zero Correlation ≠ Independence")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot
axes[0].scatter(X, Y, alpha=0.5, s=20)
axes[0].set_xlabel('X', fontsize=12)
axes[0].set_ylabel('Y = X²', fontsize=12)
axes[0].set_title(f'Uncorrelated (ρ = {corr:.3f}) but DEPENDENT', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Show symmetry
axes[1].scatter(X, Y, c=np.sign(X), cmap='RdBu', alpha=0.6, s=20)
axes[1].set_xlabel('X', fontsize=12)
axes[1].set_ylabel('Y = X²', fontsize=12)
axes[1].set_title('Symmetry Cancels Linear Relationship', fontsize=13, fontweight='bold')
axes[1].axvline(0, color='black', linestyle='--', alpha=0.5)
plt.colorbar(axes[1].collections[0], ax=axes[1], label='Sign of X')

plt.tight_layout()
plt.show()

print("\n💡 Key Lesson: Always visualize data! Correlation only measures LINEAR relationships.")

## 5. Correlation Matrix Analysis

### Real Dataset Application
Correlation matrices are essential for:
- Feature selection
- Multicollinearity detection
- Understanding data relationships
- Principal Component Analysis (PCA)

In [None]:
# Create realistic dataset
np.random.seed(42)
n = 200

df = pd.DataFrame({
    'height_cm': np.random.normal(170, 10, n),
    'age': np.random.randint(18, 65, n),
    'exercise_hrs': np.random.exponential(2, n)
})

# Add correlated variables
df['weight_kg'] = 0.5 * df['height_cm'] + np.random.normal(0, 5, n)
df['income'] = 800 * df['age'] + np.random.normal(0, 10000, n)
df['bmi'] = df['weight_kg'] / (df['height_cm']/100)**2

# Correlation matrix
corr_matrix = df.corr()

print("📊 Correlation Matrix:")
print(corr_matrix.round(3))

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, vmin=-1, vmax=1, square=True, linewidths=1,
            cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Find strong correlations
print("\n🔍 Strong Correlations (|ρ| > 0.5):")
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        corr_val = corr_matrix.iloc[i, j]
        if abs(corr_val) > 0.5:
            print(f"  {corr_matrix.columns[i]} ↔ {corr_matrix.columns[j]}: {corr_val:.3f}")

## 6. Feature Selection Application

### Using Correlation for ML
- High correlation with the target → potentially useful feature
- High correlation between features → potential multicollinearity problem

In [None]:
# Load diabetes dataset
diabetes = load_diabetes(as_frame=True)
X = diabetes.data
y = diabetes.target

# Correlation with target
target_corr = X.corrwith(y).abs().sort_values(ascending=False)
print("📈 Features ranked by correlation with target:")
print(target_corr)

# Visualize
plt.figure(figsize=(10, 6))
target_corr.plot(kind='barh', color='steelblue')
plt.xlabel('|Correlation| with Target', fontsize=12)
plt.title('Feature Importance by Correlation', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n✓ Top 3 features: {', '.join(target_corr.head(3).index.tolist())}")

## 7. Multicollinearity Detection

### Problem
When predictors are highly correlated:
- Unstable coefficient estimates
- Hard to interpret individual effects
- Inflated standard errors

### Solution
Detect pairs with |ρ| > 0.8 and consider removing one feature from each pair.
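
The "unstable coefficient estimates" point can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not part of the course code: the function name `coef_spread`, the true model `y = x1 + x2 + noise`, and the sample sizes are all made up. It refits ordinary least squares on many fresh samples and measures how much the fitted coefficient on `x1` varies.

```python
import numpy as np

def coef_spread(rho, n=100, reps=300, seed=0):
    """Std. dev. of the fitted coefficient on x1 across repeated samples.

    x1 and x2 are standard normal with correlation rho; y = x1 + x2 + noise.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(reps):
        predictors = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = predictors[:, 0] + predictors[:, 1] + rng.normal(0, 1, n)
        design = np.column_stack([np.ones(n), predictors])  # intercept + 2 predictors
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates.append(beta[1])                           # coefficient on x1
    return np.std(estimates)

spread_low = coef_spread(rho=0.0)
spread_high = coef_spread(rho=0.95)
print(f"Spread of coefficient, uncorrelated predictors: {spread_low:.3f}")
print(f"Spread of coefficient, rho = 0.95:              {spread_high:.3f}")
```

With highly correlated predictors the model cannot tell which of the two deserves the credit, so the individual coefficient swings much more from sample to sample even though the true value is the same in both settings.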

In [None]:
def detect_multicollinearity(df, threshold=0.8):
    """Find highly correlated feature pairs"""
    corr_matrix = df.corr().abs()
    upper_tri = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    high_corr = [(column, row, upper_tri.loc[row, column])
                 for column in upper_tri.columns
                 for row in upper_tri.index
                 if upper_tri.loc[row, column] > threshold]
    
    return high_corr

# Test on diabetes dataset
# (threshold lowered from the default 0.8 to 0.5 so moderately correlated pairs are listed too)
high_corr_pairs = detect_multicollinearity(X, threshold=0.5)

print("⚠️ Multicollinearity Warning (|ρ| > 0.5):")
if high_corr_pairs:
    for feat1, feat2, corr_val in high_corr_pairs:
        print(f"  {feat1} <-> {feat2}: {corr_val:.3f}")
else:
    print("  None found (good!)")

# Visualize feature correlations
plt.figure(figsize=(10, 8))
sns.heatmap(X.corr(), annot=True, fmt='.2f', cmap='RdYlGn',
            center=0, vmin=-1, vmax=1, square=True)
plt.title('Feature Correlation Matrix - Diabetes Dataset', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 8. Practice Problems

### Problem 1: Compute Correlation
Given Cov(X,Y) = 10, Var(X) = 25, Var(Y) = 16, find ρ(X,Y)

### Problem 2: Independence Test
If X and Y are independent, prove Cov(X+Y, X-Y) = Var(X) - Var(Y). (The identity in fact holds for any X and Y with finite variances; independence is not needed.)

### Problem 3: Real Data
Analyze correlation in the dataset below

In [None]:
print("="*60)
print("SOLUTIONS")
print("="*60)

# Problem 1
print("\n📝 Problem 1: Correlation")
cov_xy = 10
var_x = 25
var_y = 16
rho = cov_xy / (np.sqrt(var_x) * np.sqrt(var_y))
print("ρ = Cov(X,Y) / (σ_X × σ_Y)")
print(f"ρ = {cov_xy} / (√{var_x} × √{var_y})")
print(f"ρ = {cov_xy} / ({np.sqrt(var_x):.0f} × {np.sqrt(var_y):.0f})")
print(f"ρ = {rho:.2f}")

# Problem 2
print("\n📝 Problem 2: Proof")
print("Cov(X+Y, X-Y) = E[(X+Y)(X-Y)] - E[X+Y]E[X-Y]")
print("              = E[X² - Y²] - (E[X]+E[Y])(E[X]-E[Y])")
print("              = E[X²] - E[Y²] - E[X]² + E[Y]²")
print("              = (E[X²] - E[X]²) - (E[Y²] - E[Y]²)")
print("              = Var(X) - Var(Y) ✓")

# Problem 3
print("\n📝 Problem 3: Dataset Analysis")
data = pd.DataFrame({
    'temperature': [20, 22, 25, 27, 30],
    'ice_cream_sales': [50, 60, 75, 85, 100],
    'crime_rate': [10, 12, 15, 17, 20]
})

corr = data.corr()
print("\nCorrelation Matrix:")
print(corr.round(3))

print("\n⚠️ Warning: ice_cream_sales ↔ crime_rate correlation = {:.3f}".format(
    corr.loc['ice_cream_sales', 'crime_rate']
))
print("This is a SPURIOUS correlation (both are driven by temperature)!")
print("Correlation ≠ Causation")

print("\n" + "="*60)

## 9. Summary & Key Takeaways

### 📚 Core Concepts
1. **Covariance**: $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$ - measures joint variability
2. **Correlation**: $\rho = \text{Cov}(X,Y)/(\sigma_X\sigma_Y)$ - normalized, scale-free (-1 to 1)
3. **Independence**: $P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$ for all A, B - the strongest form of "no relationship" between X and Y

### 🔑 Key Relationships
```
Independence     ⇒  Zero Covariance (Zero Correlation)   ✓
Zero Covariance  ⇏  Independence                         ✗
(the reverse implication does not hold!)
```

### ⚠️ Common Pitfalls
1. **Y = X²**: Zero correlation but fully dependent
2. **Spurious correlation**: Both variables caused by third factor
3. **Correlation ≠ Causation**: Always investigate mechanism
4. **Outliers**: Can heavily influence correlation
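
The outlier pitfall is easy to reproduce. In the sketch below (simulated data; the outlier location (20, 20) is an arbitrary choice for illustration), `x` and `y` are generated independently, and then a single extreme point is appended to both.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 50)
y = rng.normal(0, 1, 50)          # generated independently of x

corr_clean = np.corrcoef(x, y)[0, 1]

# Append one extreme point far away from the main cloud
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)
corr_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"Correlation without outlier: {corr_clean:.3f}")
print(f"Correlation with 1 outlier:  {corr_outlier:.3f}")
```

One point out of 51 is enough to turn an essentially unrelated pair into an apparently strong positive correlation, which is another reason to plot the data before trusting the coefficient.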

### 🎯 Data Science Applications
- **Feature Selection**: Prefer features with high correlation with the target
- **Multicollinearity**: Remove high inter-feature correlation (|ρ| > 0.8)
- **PCA**: Uses covariance matrix for dimensionality reduction
- **Portfolio Theory**: Diversification benefits from low or negative correlation between assets

### 📖 Important Formulas
- $\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\text{Cov}(X,Y)$
- $\text{Cov}(X, Y) = 0$ if X, Y independent
- $|\rho_{X,Y}| = 1 \iff Y = aX + b$ (with probability 1) for some $a \neq 0$, b
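
As a worked use of the first formula, the sketch below applies it to a two-asset portfolio with weights $w$ and $1-w$, writing $\text{Cov}(X,Y) = \rho\,\sigma_X\sigma_Y$. The function name `portfolio_variance` and the volatility of 0.2 are hypothetical values chosen for illustration.

```python
import numpy as np

def portfolio_variance(w, sigma_x, sigma_y, rho):
    """Var(wX + (1-w)Y) with Cov(X, Y) = rho * sigma_x * sigma_y."""
    cov_xy = rho * sigma_x * sigma_y
    return (w**2 * sigma_x**2
            + (1 - w)**2 * sigma_y**2
            + 2 * w * (1 - w) * cov_xy)

# Equal weights, both assets with standard deviation 0.2
for rho in (1.0, 0.5, 0.0, -0.5, -1.0):
    var = max(portfolio_variance(0.5, 0.2, 0.2, rho), 0.0)  # guard against rounding below 0
    print(f"rho = {rho:+.1f} -> portfolio std = {np.sqrt(var):.3f}")
```

The portfolio's standard deviation shrinks as the correlation falls, which is the diversification effect mentioned under Data Science Applications.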

### 🚀 Next Week
**Week 3: Expectations and Variance of Functions**
- Law of Total Expectation
- Variance decomposition
- Moment generating functions

---
**🎓 End of Week 2**