# Correlation Analysis

An exploration of relationships between variables using the RustLab ecosystem. This notebook demonstrates Pearson, Spearman, and Kendall correlations and covariance analysis, and shows how to compare and interpret the results.

## Learning Objectives

- **Linear Correlation**: Pearson correlation for linear relationships
- **Non-Parametric Correlation**: Spearman and Kendall for monotonic relationships  
- **Covariance Analysis**: Understanding joint variation between variables
- **Multivariate Relationships**: Correlation matrices and heatmaps
- **Interpretation**: Correlation vs causation, outlier effects, and statistical significance

## Mathematical Foundation

Correlation measures the strength and direction of the relationship between two variables:

- **Pearson Correlation (r)**: Strength of linear association; parametric; -1 ≤ r ≤ 1
- **Spearman Correlation (ρ)**: Strength of monotonic association; rank-based, non-parametric; -1 ≤ ρ ≤ 1
- **Kendall Correlation (τ)**: Based on concordant/discordant pairs; rank-based and robust to outliers; -1 ≤ τ ≤ 1
- **Covariance**: Joint variability; scale-dependent and unbounded, so values are not comparable across datasets measured in different units
- **Partial Correlation**: Association between two variables after controlling for confounding variables
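
For paired observations $(x_i, y_i)$, $i = 1, \dots, n$, the standard textbook definitions of these measures can be written as follows. They are given for reference only and are not taken from the RustLab source; the rank-based formulas are the simple forms that assume no tied values.

$$
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
$$

$$
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)},
\qquad
\tau = \frac{C - D}{n(n-1)/2}
$$

$$
r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}
$$

Here $s_x$ and $s_y$ are the sample standard deviations, $d_i$ is the difference between the ranks of $x_i$ and $y_i$, $C$ and $D$ are the numbers of concordant and discordant pairs, and $r_{xy \cdot z}$ is the partial correlation of $x$ and $y$ controlling for a single variable $z$.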

In [2]:
// 📦 Setup: Dependencies and Imports
:dep rustlab-stats = { path = ".." }
:dep rustlab-math = { path = "../../rustlab-math" }
:dep rustlab-plotting = { path = "../../rustlab-plotting" }

// Global imports - these persist across all cells
use rustlab_stats::prelude::*;
use rustlab_math::*;
use rustlab_plotting::*;

// Test that everything is working
{
    let x_test = vec64![1.0, 2.0, 3.0, 4.0, 5.0];
    let y_test = vec64![2.1, 3.8, 6.2, 7.9, 9.8];
    let test_corr = x_test.correlation(&y_test, CorrelationMethod::Pearson);
    
    let setup_msg = format!("🎯 Setup complete! Test correlation: {:.3}", test_corr);
    println!("{}", setup_msg);
    println!("📊 Ready for correlation analysis and relationship exploration");
}

🎯 Setup complete! Test correlation: 0.999
📊 Ready for correlation analysis and relationship exploration


()

## 1. Pearson Correlation: Linear Relationships

Examining linear relationships between continuous variables using the Pearson correlation coefficient.

In [3]:
{
    use rustlab_stats::prelude::*;
    use rustlab_math::*;
    use rustlab_plotting::*;
    
    println!("🔬 Pearson Correlation Analysis: Linear Relationships");
    println!("{}", "=".repeat(55));
    
    // Economic data: GDP vs Life Expectancy
    let gdp_per_capita = vec64![15000.0, 25000.0, 35000.0, 45000.0, 55000.0, 65000.0, 75000.0, 85000.0];
    let life_expectancy = vec64![68.5, 72.1, 74.8, 76.9, 78.2, 79.1, 79.8, 80.3];
    
    // Technology adoption: Internet penetration vs Digital literacy
    let internet_penetration = vec64![45.0, 52.0, 61.0, 68.0, 74.0, 79.0, 83.0, 87.0, 91.0, 94.0];
    let digital_literacy = vec64![38.0, 44.0, 51.0, 58.0, 63.0, 68.0, 72.0, 76.0, 80.0, 83.0];
    
    // Climate data: Temperature vs Ice cover (negative correlation)
    let avg_temperature = vec64![-2.1, -1.5, -0.8, 0.2, 0.9, 1.4, 2.0, 2.7, 3.3, 4.1];
    let ice_cover_pct = vec64![85.2, 82.1, 78.9, 74.3, 70.1, 65.8, 61.2, 56.7, 52.3, 47.9];
    
    println!("📊 Dataset 1: Economic Development (Strong Positive)");
    
    // Calculate the Pearson correlation and the covariance
    let gdp_life_corr = gdp_per_capita.correlation(&life_expectancy, CorrelationMethod::Pearson);
    let gdp_life_cov = gdp_per_capita.covariance(&life_expectancy);
    
    let econ_corr = format!("   GDP per Capita vs Life Expectancy: r = {:.3}", gdp_life_corr);
    println!("{}", econ_corr);
    let econ_cov = format!("   Covariance: {:.1}", gdp_life_cov);
    println!("{}", econ_cov);
    
    // Interpretation of correlation strength
    let strength = match gdp_life_corr.abs() {
        r if r >= 0.8 => "very strong",
        r if r >= 0.6 => "strong", 
        r if r >= 0.4 => "moderate",
        r if r >= 0.2 => "weak",
        _ => "very weak",
    };
    let strength_msg = format!("   Interpretation: {} positive linear relationship", strength);
    println!("{}", strength_msg);
    
    println!();
    println!("📊 Dataset 2: Technology Adoption (Strong Positive)");
    
    let tech_corr = internet_penetration.correlation(&digital_literacy, CorrelationMethod::Pearson);
    let tech_cov = internet_penetration.covariance(&digital_literacy);
    
    let tech_corr_msg = format!("   Internet Penetration vs Digital Literacy: r = {:.3}", tech_corr);
    println!("{}", tech_corr_msg);
    let tech_cov_msg = format!("   Covariance: {:.1}", tech_cov);
    println!("{}", tech_cov_msg);
    
    println!();
    println!("📊 Dataset 3: Climate Change (Strong Negative)");
    
    let climate_corr = avg_temperature.correlation(&ice_cover_pct, CorrelationMethod::Pearson);
    let climate_cov = avg_temperature.covariance(&ice_cover_pct);
    
    let climate_corr_msg = format!("   Temperature vs Ice Cover: r = {:.3}", climate_corr);
    println!("{}", climate_corr_msg);
    let climate_cov_msg = format!("   Covariance: {:.1}", climate_cov);
    println!("{}", climate_cov_msg);
    
    let climate_direction = if climate_corr < 0.0 { "negative" } else { "positive" };
    let climate_interp = format!("   Interpretation: Strong {} relationship (as expected)", climate_direction);
    println!("{}", climate_interp);
    
    // Coefficient of determination (R²)
    let r_squared_gdp = gdp_life_corr.powi(2);
    let r_squared_tech = tech_corr.powi(2);
    let r_squared_climate = climate_corr.powi(2);
    
    println!();
    println!("📈 Explained Variance (R²):");
    let r2_gdp = format!("   Economic: {:.1}% of life expectancy variance explained by GDP", r_squared_gdp * 100.0);
    println!("{}", r2_gdp);
    let r2_tech = format!("   Technology: {:.1}% of digital literacy variance explained by internet", r_squared_tech * 100.0);
    println!("{}", r2_tech);
    let r2_climate = format!("   Climate: {:.1}% of ice cover variance explained by temperature", r_squared_climate * 100.0);
    println!("{}", r2_climate);
    
    println!();
    println!("💡 Key Insight: Pearson correlation measures LINEAR relationships only!");
    println!("   • Range: -1 (perfect negative) to +1 (perfect positive)");
    println!("   • R² shows proportion of variance explained");
    println!("   • Sensitive to outliers and assumes linear relationship");
}

🔬 Pearson Correlation Analysis: Linear Relationships
📊 Dataset 1: Economic Development (Strong Positive)
   GDP per Capita vs Life Expectancy: r = 0.950
   Covariance: 96642.9
   Interpretation: very strong positive linear relationship

📊 Dataset 2: Technology Adoption (Strong Positive)
   Internet Penetration vs Digital Literacy: r = 0.999
   Covariance: 255.0

📊 Dataset 3: Climate Change (Strong Negative)
   Temperature vs Ice Cover: r = -0.996
   Covariance: -26.5
   Interpretation: Strong negative relationship (as expected)

📈 Explained Variance (R²):
   Economic: 90.3% of life expectancy variance explained by GDP
   Technology: 99.8% of digital literacy variance explained by internet
   Climate: 99.2% of ice cover variance explained by temperature

💡 Key Insight: Pearson correlation measures LINEAR relationships only!
   • Range: -1 (perfect negative) to +1 (perfect positive)
   • R² shows proportion of variance explained
   • Sensitive to outliers and assumes linear relationship


()
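
The three covariances printed above (96642.9, 255.0, -26.5) differ by orders of magnitude even though all three correlations are strong, because covariance carries the units of both variables. The following is a minimal standard-library sketch of that point. It does not use RustLab; the helper names `mean`, `covariance`, and `pearson` are illustrative, and the `n - 1` denominator is an assumption (it does reproduce the 96642.9 shown above for the GDP data).

```rust
/// Arithmetic mean of a slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample covariance with the n - 1 denominator (assumed convention).
fn covariance(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len(), "inputs must have equal length");
    assert!(x.len() >= 2, "need at least two observations");
    let (mx, my) = (mean(x), mean(y));
    let sum: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
    sum / (x.len() as f64 - 1.0)
}

/// Pearson r: covariance divided by the product of the standard deviations.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    covariance(x, y) / (covariance(x, x) * covariance(y, y)).sqrt()
}

fn main() {
    // Same GDP and life-expectancy values as the cell above.
    let gdp = [15000.0, 25000.0, 35000.0, 45000.0, 55000.0, 65000.0, 75000.0, 85000.0];
    let life = [68.5, 72.1, 74.8, 76.9, 78.2, 79.1, 79.8, 80.3];

    // Re-express GDP in thousands: the data are unchanged, only the unit differs.
    let gdp_thousands: Vec<f64> = gdp.iter().map(|g| g / 1000.0).collect();

    println!("cov (GDP in units):     {:.1}", covariance(&gdp, &life));
    println!("cov (GDP in thousands): {:.4}", covariance(&gdp_thousands, &life));
    println!("r   (GDP in units):     {:.3}", pearson(&gdp, &life));
    println!("r   (GDP in thousands): {:.3}", pearson(&gdp_thousands, &life));
}
```

Rescaling GDP divides the covariance by 1000 while leaving r unchanged, which is why r rather than covariance is used to compare the strength of relationships across the three datasets.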

## 2. Non-Parametric Correlations: Spearman and Kendall

Exploring rank-based correlations for non-linear monotonic relationships and outlier-robust analysis.
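
To make "rank-based" concrete, Spearman's ρ can be computed by replacing each value with its rank and then applying the Pearson formula to the ranks. The sketch below uses only the standard library. The function names are illustrative, and giving tied values their average rank is an assumed convention, not something this notebook states about RustLab.

```rust
/// 1-based ranks; tied values share the average of the positions they occupy.
fn average_ranks(xs: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..xs.len()).collect();
    order.sort_by(|&a, &b| xs[a].partial_cmp(&xs[b]).expect("input must not contain NaN"));
    let mut ranks = vec![0.0; xs.len()];
    let mut start = 0;
    while start < order.len() {
        // Extend `end` over the run of values equal to the one at `start`.
        let mut end = start;
        while end + 1 < order.len() && xs[order[end + 1]] == xs[order[start]] {
            end += 1;
        }
        // Sorted positions start..=end (0-based) get the same averaged 1-based rank.
        let shared = (start + end) as f64 / 2.0 + 1.0;
        for &idx in &order[start..=end] {
            ranks[idx] = shared;
        }
        start = end + 1;
    }
    ranks
}

/// Pearson r from sums of squared and cross deviations.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len(), "inputs must have equal length");
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    sxy / (sxx * syy).sqrt()
}

/// Spearman rho: Pearson correlation of the ranks.
fn spearman(x: &[f64], y: &[f64]) -> f64 {
    pearson(&average_ranks(x), &average_ranks(y))
}

fn main() {
    // Website metrics from the cell below: revenue grows with traffic, but not linearly.
    let visitors = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0, 12800.0];
    let revenue = [50.0, 85.0, 140.0, 220.0, 320.0, 450.0, 620.0, 820.0];
    println!("ranks of revenue: {:?}", average_ranks(&revenue));
    println!("Pearson  r   = {:.3}", pearson(&visitors, &revenue));
    println!("Spearman rho = {:.3}", spearman(&visitors, &revenue));
}
```

Because both series are strictly increasing, their ranks are identical and ρ is exactly 1 even though r is below 1. On the tied survey ratings used in the cell below, this average-rank sketch gives ρ ≈ 0.881, which agrees with the notebook's printed value.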

In [4]:
{
    use rustlab_stats::prelude::*;
    use rustlab_math::*;
    use rustlab_plotting::*;
    
    println!("🔬 Non-Parametric Correlations: Robust Relationship Analysis");
    println!("{}", "=".repeat(60));
    
    // Non-linear but monotonic relationship: Website traffic vs Revenue
    let daily_visitors = vec64![100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0, 12800.0];
    let daily_revenue = vec64![50.0, 85.0, 140.0, 220.0, 320.0, 450.0, 620.0, 820.0];
    
    // Data with outliers: Social media engagement
    let post_likes = vec64![45.0, 52.0, 61.0, 68.0, 74.0, 79.0, 83.0, 350.0, 91.0, 94.0]; // 350 is outlier
    let post_shares = vec64![12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 78.0, 26.0, 28.0]; // 78 corresponds to outlier
    
    // Ordinal data: Survey ratings
    let service_quality = vec64![3.0, 4.0, 2.0, 5.0, 4.0, 3.0, 5.0, 4.0, 2.0, 5.0];
    let customer_satisfaction = vec64![3.0, 4.0, 2.0, 5.0, 5.0, 3.0, 4.0, 4.0, 1.0, 5.0];
    
    println!("📊 Dataset 1: Non-Linear Relationship (Website Metrics)");
    
    // Calculate all three correlation measures on the same data
    let pearson_web = daily_visitors.correlation(&daily_revenue, CorrelationMethod::Pearson);
    let spearman_web = daily_visitors.correlation(&daily_revenue, CorrelationMethod::Spearman);
    let kendall_web = daily_visitors.correlation(&daily_revenue, CorrelationMethod::Kendall);
    
    let web_pearson = format!("   Pearson correlation: r = {:.3}", pearson_web);
    println!("{}", web_pearson);
    let web_spearman = format!("   Spearman correlation: ρ = {:.3}", spearman_web);
    println!("{}", web_spearman);
    let web_kendall = format!("   Kendall correlation: τ = {:.3}", kendall_web);
    println!("{}", web_kendall);
    
    println!("   → Spearman higher than Pearson: Non-linear but monotonic relationship");
    
    println!();
    println!("📊 Dataset 2: Data with Outliers (Social Media)");
    
    let pearson_social = post_likes.correlation(&post_shares, CorrelationMethod::Pearson);
    let spearman_social = post_likes.correlation(&post_shares, CorrelationMethod::Spearman);
    let kendall_social = post_likes.correlation(&post_shares, CorrelationMethod::Kendall);
    
    let social_pearson = format!("   Pearson correlation: r = {:.3}", pearson_social);
    println!("{}", social_pearson);
    let social_spearman = format!("   Spearman correlation: ρ = {:.3}", spearman_social);
    println!("{}", social_spearman);
    let social_kendall = format!("   Kendall correlation: τ = {:.3}", kendall_social);
    println!("{}", social_kendall);
    
    // Outlier impact analysis
    let outlier_impact_pearson = (pearson_social - spearman_social).abs();
    let impact_msg = format!("   → Outlier impact on Pearson: {:.3} difference from Spearman", outlier_impact_pearson);
    println!("{}", impact_msg);
    
    println!();
    println!("📊 Dataset 3: Ordinal Data (Survey Ratings)");
    
    let pearson_survey = service_quality.correlation(&customer_satisfaction, CorrelationMethod::Pearson);
    let spearman_survey = service_quality.correlation(&customer_satisfaction, CorrelationMethod::Spearman);
    let kendall_survey = service_quality.correlation(&customer_satisfaction, CorrelationMethod::Kendall);
    
    let survey_pearson = format!("   Pearson correlation: r = {:.3}", pearson_survey);
    println!("{}", survey_pearson);
    let survey_spearman = format!("   Spearman correlation: ρ = {:.3}", spearman_survey);
    println!("{}", survey_spearman);
    let survey_kendall = format!("   Kendall correlation: τ = {:.3}", kendall_survey);
    println!("{}", survey_kendall);
    
    println!("   → Rank-based correlations more appropriate for ordinal data");
    
    // Comparison summary
    println!();
    println!("📈 Correlation Method Comparison:");
    println!("{}", "-".repeat(35));
    
    println!("🔵 Pearson Correlation:");
    println!("   • Measures LINEAR relationships only");
    println!("   • Sensitive to outliers");
    println!("   • Assumes normal distribution");
    println!("   • Best for: Continuous data, linear relationships");
    
    println!();
    println!("🟢 Spearman Correlation:");
    println!("   • Measures MONOTONIC relationships (linear or non-linear)");
    println!("   • Robust to outliers (rank-based)");
    println!("   • No distribution assumptions");
    println!("   • Best for: Ordinal data, non-linear monotonic relationships");
    
    println!();
    println!("🟡 Kendall Correlation:");
    println!("   • Based on concordant/discordant pairs");
    println!("   • Most robust to outliers");
    println!("   • Small-sample distribution properties are better understood");
    println!("   • Best for: Small samples, extremely robust analysis");
    
    // When to use which method
    println!();
    println!("🎯 Method Selection Guidelines:");
    println!("{}", "-".repeat(30));
    
    let guidelines = [
        "Linear relationship + Normal data → Pearson",
        "Monotonic relationship + Outliers → Spearman", 
        "Ordinal/Ranked data → Spearman or Kendall",
        "Small samples + Robustness → Kendall",
        "Non-linear but monotonic → Spearman"
    ];
    
    for (i, guideline) in guidelines.iter().enumerate() {
        let guide_msg = format!("   {}. {}", i + 1, guideline);
        println!("{}", guide_msg);
    }
    
    println!();
    println!("💡 Key Insight: Always compare multiple correlation methods for robust analysis!");
}

🔬 Non-Parametric Correlations: Robust Relationship Analysis
📊 Dataset 1: Non-Linear Relationship (Website Metrics)
   Pearson correlation: r = 0.949
   Spearman correlation: ρ = 1.000
   Kendall correlation: τ = 1.000
   → Spearman higher than Pearson: Non-linear but monotonic relationship

📊 Dataset 2: Data with Outliers (Social Media)
   Pearson correlation: r = 0.995
   Spearman correlation: ρ = 1.000
   Kendall correlation: τ = 1.000
   → Outlier impact on Pearson: 0.005 difference from Spearman

📊 Dataset 3: Ordinal Data (Survey Ratings)
   Pearson correlation: r = 0.909
   Spearman correlation: ρ = 0.881
   Kendall correlation: τ = 0.733
   → Rank-based correlations more appropriate for ordinal data

📈 Correlation Method Comparison:
-----------------------------------
🔵 Pearson Correlation:
   • Measures LINEAR relationships only
   • Sensitive to outliers
   • Assumes normal distribution
   • Best for: Continuous data, linear relationships

🟢 Spearman Correlation:
   • Measures 

()
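
Kendall's τ counts pairs of observations rather than ranking them: a pair is concordant when both variables move in the same direction and discordant when they move in opposite directions. The following standard-library sketch implements the tau-b variant, which adjusts the denominator for ties. The notebook does not say which tie convention RustLab uses, so on tied data such as the survey ratings this sketch is not guaranteed to reproduce the value printed above; the function name `kendall_tau_b` is illustrative.

```rust
/// Kendall tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)), where
/// n0 = n(n-1)/2, n1 = pairs tied in x, n2 = pairs tied in y.
/// Returns NaN when every pair is tied in x or in y.
fn kendall_tau_b(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len(), "inputs must have equal length");
    assert!(x.len() >= 2, "need at least two observations");
    let (mut concordant, mut discordant) = (0i64, 0i64);
    let (mut ties_x, mut ties_y) = (0i64, 0i64);
    for i in 0..x.len() {
        for j in (i + 1)..x.len() {
            let dx = x[j] - x[i];
            let dy = y[j] - y[i];
            if dx == 0.0 {
                ties_x += 1;
            }
            if dy == 0.0 {
                ties_y += 1;
            }
            if dx != 0.0 && dy != 0.0 {
                // Same sign of change in both variables means concordant.
                if (dx > 0.0) == (dy > 0.0) {
                    concordant += 1;
                } else {
                    discordant += 1;
                }
            }
        }
    }
    let n0 = (x.len() * (x.len() - 1) / 2) as i64;
    let denom = (((n0 - ties_x) * (n0 - ties_y)) as f64).sqrt();
    (concordant - discordant) as f64 / denom
}

fn main() {
    // Website metrics from the cell above: every pair is concordant.
    let visitors = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0, 12800.0];
    let revenue = [50.0, 85.0, 140.0, 220.0, 320.0, 450.0, 620.0, 820.0];
    println!("tau-b (web metrics): {:.3}", kendall_tau_b(&visitors, &revenue));

    // One swapped pair out of six: C = 5, D = 1, so tau = 4/6.
    let swapped = kendall_tau_b(&[1.0, 2.0, 3.0, 4.0], &[1.0, 3.0, 2.0, 4.0]);
    println!("tau-b (one swap):    {:.3}", swapped);
}
```

The pairwise loop is O(n²), which is fine for notebook-sized data; it also shows why a single extreme value affects τ only through the handful of pairs it participates in.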

## 3. Summary and Best Practices

A worked example that applies all three correlation methods to one dataset, followed by an interpretation framework and a best-practices checklist.

In [5]:
{
    use rustlab_stats::prelude::*;
    use rustlab_math::*;
    use rustlab_plotting::*;
    
    println!("🎯 Correlation Analysis: Summary and Best Practices");
    println!("{}", "=".repeat(55));
    
    // Demonstration with a comprehensive example
    let student_study_hours = vec64![2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0];
    let student_exam_scores = vec64![65.0, 72.0, 78.0, 82.0, 86.0, 89.0, 91.0, 93.0];
    let student_stress_level = vec64![8.5, 7.8, 7.2, 6.5, 6.0, 6.8, 7.5, 8.2]; // U-shaped: stress decreases then increases
    
    println!("📊 Complete Correlation Analysis Workflow:");
    println!("{}", "-".repeat(40));
    
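    // Step 1: Data exploration
    // Note: the ranges below read the first and last elements, which is only
    // valid because study hours and exam scores are sorted in ascending order.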
    println!("1. 📈 Data Exploration:");
    let study_mean = student_study_hours.mean();
    let score_mean = student_exam_scores.mean();
    let stress_mean = student_stress_level.mean();
    
    let explore_study = format!("   Study Hours: mean = {:.1}, range = [{:.0}, {:.0}]", 
                               study_mean, student_study_hours[0], student_study_hours[student_study_hours.len()-1]);
    println!("{}", explore_study);
    let explore_scores = format!("   Exam Scores: mean = {:.1}, range = [{:.0}, {:.0}]", 
                                 score_mean, student_exam_scores[0], student_exam_scores[student_exam_scores.len()-1]);
    println!("{}", explore_scores);
    let explore_stress = format!("   Stress Level: mean = {:.1}, appears non-linear", stress_mean);
    println!("{}", explore_stress);
    
    println!();
    
    // Step 2: Multiple correlation methods
    println!("2. 🔍 Multiple Correlation Methods:");
    
    // Study hours vs exam scores (linear relationship)
    let pearson_study_score = student_study_hours.correlation(&student_exam_scores, CorrelationMethod::Pearson);
    let spearman_study_score = student_study_hours.correlation(&student_exam_scores, CorrelationMethod::Spearman);
    let kendall_study_score = student_study_hours.correlation(&student_exam_scores, CorrelationMethod::Kendall);
    
    println!("   📚 Study Hours ↔ Exam Scores:");
    let study_pearson = format!("     Pearson: r = {:.3} (linear relationship)", pearson_study_score);
    println!("{}", study_pearson);
    let study_spearman = format!("     Spearman: ρ = {:.3} (monotonic relationship)", spearman_study_score);
    println!("{}", study_spearman);
    let study_kendall = format!("     Kendall: τ = {:.3} (rank-based)", kendall_study_score);
    println!("{}", study_kendall);
    
    // Study hours vs stress (U-shaped, i.e. non-monotonic: both Pearson and
    // Spearman are expected to be near zero, since neither captures this shape)
    let pearson_study_stress = student_study_hours.correlation(&student_stress_level, CorrelationMethod::Pearson);
    let spearman_study_stress = student_study_hours.correlation(&student_stress_level, CorrelationMethod::Spearman);
    
    println!();
    println!("   😰 Study Hours ↔ Stress Level:");
    let stress_pearson = format!("     Pearson: r = {:.3} (misses non-linear pattern)", pearson_study_stress);
    println!("{}", stress_pearson);
    let stress_spearman = format!("     Spearman: ρ = {:.3} (better for non-linear)", spearman_study_stress);
    println!("{}", stress_spearman);
    
    println!();
    
    // Step 3: Interpretation framework
    println!("3. 🎯 Interpretation Framework:");
    
    let study_score_strength = match pearson_study_score.abs() {
        r if r >= 0.8 => "very strong",
        r if r >= 0.6 => "strong",
        r if r >= 0.4 => "moderate",
        r if r >= 0.2 => "weak",
        _ => "very weak",
    };
    
    let interp_study = format!("   Study-Score relationship: {} positive correlation", study_score_strength);
    println!("{}", interp_study);
    
    let r_squared = pearson_study_score.powi(2);
    let variance_explained = format!("   Variance explained: {:.1}% of score variance by study hours", r_squared * 100.0);
    println!("{}", variance_explained);
    
    // Best practices checklist
    println!();
    println!("✅ Best Practices Checklist:");
    println!("{}", "-".repeat(27));
    
    let best_practices = [
        "Visualize data before calculating correlations",
        "Use multiple correlation methods for comparison",
        "Report confidence intervals for correlations",
        "Check statistical significance (p-values)",
        "Consider effect sizes and practical significance",
        "Look for non-linear patterns",
        "Investigate outliers and influential points",
        "Consider domain knowledge and theory",
        "Use correlation matrices for multivariate analysis",
        "Always discuss limitations and assumptions"
    ];
    
    for (i, practice) in best_practices.iter().enumerate() {
        let practice_msg = format!("   {}. {}", i + 1, practice);
        println!("{}", practice_msg);
    }
    
    println!();
    println!("🎯 Next Steps: Explore advanced multivariate analysis and causal inference methods!");
    println!("   • Partial correlations (controlling for confounders)");
    println!("   • Principal component analysis (PCA)");
    println!("   • Structural equation modeling");
    println!("   • Time series analysis for temporal relationships");
}

🎯 Correlation Analysis: Summary and Best Practices
📊 Complete Correlation Analysis Workflow:
----------------------------------------
1. 📈 Data Exploration:
   Study Hours: mean = 9.0, range = [2, 16]
   Exam Scores: mean = 82.0, range = [65, 93]
   Stress Level: mean = 7.3, appears non-linear

2. 🔍 Multiple Correlation Methods:
   📚 Study Hours ↔ Exam Scores:
     Pearson: r = 0.976 (linear relationship)
     Spearman: ρ = 1.000 (monotonic relationship)
     Kendall: τ = 1.000 (rank-based)

   😰 Study Hours ↔ Stress Level:
     Pearson: r = -0.181 (misses non-linear pattern)
     Spearman: ρ = -0.190 (better for non-linear)

3. 🎯 Interpretation Framework:
   Study-Score relationship: very strong positive correlation
   Variance explained: 95.3% of score variance by study hours

✅ Best Practices Checklist:
---------------------------
   1. Visualize data before calculating correlations
   2. Use multiple correlation methods for comparison
   3. Report confidence intervals for correlati

()
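
The checklist asks for confidence intervals and significance checks, which the cells above do not compute. One common approach for Pearson's r is the Fisher z-transform, sketched here with the standard library only. The function names are illustrative, the 1.96 multiplier assumes a 95% normal-approximation interval, and the method assumes roughly bivariate-normal data with n > 3.

```rust
/// Approximate 95% confidence interval for a Pearson correlation
/// using the Fisher z-transform: z = atanh(r), SE = 1 / sqrt(n - 3).
fn fisher_ci_95(r: f64, n: usize) -> (f64, f64) {
    assert!(n > 3, "the Fisher interval needs n > 3");
    assert!(r > -1.0 && r < 1.0, "r must lie strictly inside (-1, 1)");
    let z = r.atanh();
    let half_width = 1.96 / ((n - 3) as f64).sqrt();
    // Transform the interval endpoints back to the correlation scale.
    ((z - half_width).tanh(), (z + half_width).tanh())
}

/// t statistic for the null hypothesis of zero correlation,
/// to be compared against a t distribution with n - 2 degrees of freedom.
fn t_statistic(r: f64, n: usize) -> f64 {
    assert!(n > 2, "the t statistic needs n > 2");
    r * ((n - 2) as f64 / (1.0 - r * r)).sqrt()
}

fn main() {
    // Study hours vs exam scores from the output above: r = 0.976 with n = 8.
    let (r, n) = (0.976, 8);
    let (lo, hi) = fisher_ci_95(r, n);
    println!("r = {:.3}, n = {}: 95% CI = [{:.3}, {:.3}]", r, n, lo, hi);
    println!("t = {:.2} on {} degrees of freedom", t_statistic(r, n), n - 2);
}
```

With only eight observations the interval is noticeably asymmetric and wider than the point estimate alone suggests, which is the practical reason for reporting it alongside r.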