After training and evaluating different models using various metrics, a common next step is to compare their performance to determine which one is genuinely better for your specific task. Simply observing that Model A has an accuracy of 92% and Model B has 91% on a test set isn't always enough to conclude that Model A is definitively superior. This difference might be due to random chance or the specific way your test set was sampled. Statistical significance tests help us determine if the observed differences in performance are likely real or just a result of random variability.

**Concept:**

The core idea is to use statistical hypothesis testing to assess whether the difference in performance metrics between models is statistically significant. Concretely, we ask how probable a difference at least as large as the observed one would be if there were no *true* underlying difference in the models' capabilities.

* **Why it's important:**
    * **Robustness:** Ensures decisions about model selection are not based on random fluctuations in test scores.
    * **Resource Allocation:** Justifies choosing a more complex or computationally expensive model only if its performance improvement is statistically real.
    * **Scientific Rigor:** Essential for research and publications to validate claims about new algorithms or model improvements.

* **Key Components of Hypothesis Testing:**
    1.  **Null Hypothesis ($H_0$):** States that there is no true difference in performance between the models. For example, the mean accuracy of Model A is equal to the mean accuracy of Model B.
    2.  **Alternative Hypothesis ($H_A$ or $H_1$):** States that there *is* a true difference. This can be two-sided (Model A $\neq$ Model B) or one-sided (Model A > Model B).
    3.  **Test Statistic:** A value calculated from your sample data (e.g., metric scores from cross-validation folds) that measures how far your observed data deviates from what is expected under the null hypothesis.
    4.  **p-value:** The probability of observing a test statistic as extreme as (or more extreme than) the one calculated from your data, *assuming the null hypothesis is true*.
    5.  **Significance Level ($\alpha$):** A pre-determined threshold (commonly 0.05, 0.01, or 0.001). If the p-value is less than $\alpha$, we reject the null hypothesis ($H_0$) and conclude that the observed difference is statistically significant. Otherwise we fail to reject $H_0$, which is not the same as showing that the models perform equally.

* **Using Cross-Validation Scores:** To perform these tests effectively, we usually need multiple performance scores for each model. These are typically obtained from the folds of a cross-validation (CV) procedure (e.g., 10-fold CV gives 10 accuracy scores for each model). This provides an estimate of the metric's variability for each model.
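
As a minimal sketch of how paired per-fold scores might be collected with scikit-learn: the synthetic dataset, the two estimators, and the variable names below are illustrative assumptions, not part of the original text. The point it shows is that both models are scored on the same `KFold` splits, so score *i* of one model pairs with score *i* of the other.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One splitter with a fixed seed, so both models see identical folds
cv = KFold(n_splits=10, shuffle=True, random_state=0)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=cv, scoring="accuracy")
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                           cv=cv, scoring="accuracy")

# Per-fold paired differences: the input to the tests described below
diffs = scores_a - scores_b
print("Model A per-fold accuracy:", scores_a.round(3))
print("Model B per-fold accuracy:", scores_b.round(3))
print("Paired differences (A - B):", diffs.round(3))
```

Passing the same splitter object to both calls is what makes the scores paired; calling `cross_val_score` with different shuffles for each model would break the pairing.
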

---

## 39. Statistical Tests on Metrics

Here are common techniques used for comparing model performance statistically, primarily focusing on scenarios where you have scores from cross-validation:

### A. Paired t-test (for Comparing Two Models using Cross-Validation Scores)

* **Concept:** Used when you have *paired* observations of a metric for two models. In the context of k-fold cross-validation, the scores from the same fold are paired (e.g., Model A's accuracy on fold 1 vs. Model B's accuracy on fold 1). The test determines if the mean of the differences between these paired scores is significantly different from zero.
* **Formula (Conceptual for the test statistic $t$):**
    $t = \frac{\bar{d}}{s_d / \sqrt{k}}$
    Where:
    * $\bar{d}$ is the mean of the differences in performance scores ($Metric_{A_i} - Metric_{B_i}$) for each fold $i$.
    * $s_d$ is the standard deviation of these differences.
    * $k$ is the number of folds (i.e., number of paired differences).
    The p-value is then derived from this t-statistic and the degrees of freedom ($k-1$).
* **Interpretation:**
    * $H_0$: The true mean difference between the paired scores is zero.
    * $H_A$: The true mean difference is not zero (two-sided) or is greater/less than zero (one-sided).
    * If p-value < $\alpha$, reject $H_0$. For example, if testing if Model A is better than Model B ($H_A: \text{mean\_diff} > 0$) and p-value is small, conclude Model A is statistically significantly better.
* **Pros:**
    * Relatively simple to understand and implement.
    * Accounts for pairing, which can reduce variance compared to independent sample tests.
* **Cons:**
    * **Assumption of Normality:** Assumes the *differences* between the paired scores are approximately normally distributed. This is less critical with a larger number of folds (e.g., >15-30) due to the Central Limit Theorem.
    * **Violation of Independence with K-fold CV:** Standard k-fold CV partitions the data, but the training sets for different folds overlap significantly. This means the performance scores from different folds are not truly independent. This violation often leads to an underestimation of the true variance, potentially increasing the Type I error rate (finding a significant difference when none exists). Specialized tests like Dietterich's 5x2cv t-test or the corrected repeated k-fold CV test aim to address this, but the standard paired t-test is still commonly (though imperfectly) used.
* **Example:**
    Model A (Accuracy on 5 folds): `[0.92, 0.90, 0.93, 0.91, 0.92]`
    Model B (Accuracy on 5 folds): `[0.90, 0.89, 0.91, 0.90, 0.91]`
    Differences (A - B): `[0.02, 0.01, 0.02, 0.01, 0.01]`
    Mean difference $\bar{d} = 0.014$. Standard deviation $s_d \approx 0.00548$.
    Perform a one-sample t-test on these differences against a mean of 0.


* **Implementation (Scipy):**
    ```python
    from scipy import stats
    import numpy as np

    scores_model_A = np.array([0.92, 0.90, 0.93, 0.91, 0.92])
    scores_model_B = np.array([0.90, 0.89, 0.91, 0.90, 0.91])

    # Paired t-test on the per-fold differences (A - B).
    # alternative='greater' tests H_A: mean(A - B) > 0; the default is two-sided.
    # Note: the `alternative` argument requires SciPy >= 1.6.
    t_statistic, p_value = stats.ttest_rel(scores_model_A, scores_model_B, alternative='greater')

    print("Paired t-test:")
    print(f"  t-statistic: {t_statistic:.3f}")
    print(f"  p-value: {p_value:.3f}")

    alpha = 0.05
    if p_value < alpha:
        print(f"  Reject H0: Model A is statistically significantly better than Model B (at alpha={alpha}).")
    else:
        print(f"  Fail to reject H0: No statistically significant difference favoring Model A (at alpha={alpha}).")
    ```
    *Expected Output (approximate for the example data):*
    * t-statistic: 5.715
    * p-value: 0.002
    * Reject H0: Model A is statistically significantly better than Model B.
* **Context:** A common first-line approach for comparing two models based on k-fold CV scores. Be mindful of its limitations, especially the independence assumption.
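
The worked example can be checked by evaluating the formula for $t$ directly. The sketch below does that and then, to illustrate the kind of corrected test mentioned under Cons, applies the variance correction from Nadeau and Bengio's corrected resampled t-test, which replaces the factor $1/k$ with $1/k + n_{test}/n_{train}$. Applying it to a single 5-fold run (where $n_{test}/n_{train} = 1/(k-1)$) is an assumption made here for illustration; the correction is usually applied to repeated resampling.

```python
import numpy as np
from scipy import stats

scores_a = np.array([0.92, 0.90, 0.93, 0.91, 0.92])
scores_b = np.array([0.90, 0.89, 0.91, 0.90, 0.91])

d = scores_a - scores_b          # paired per-fold differences
k = len(d)                       # number of folds
d_bar = d.mean()                 # mean difference
s_d = d.std(ddof=1)              # sample standard deviation of the differences

# Standard paired t-test: t = d_bar / (s_d / sqrt(k)), df = k - 1
t_plain = d_bar / (s_d / np.sqrt(k))
p_plain = stats.t.sf(t_plain, df=k - 1)      # one-sided, H_A: A > B

# Corrected variant: inflate the variance term to account for overlapping
# training sets. For k-fold CV, n_test / n_train = 1 / (k - 1).
test_train_ratio = 1.0 / (k - 1)
t_corrected = d_bar / np.sqrt((1.0 / k + test_train_ratio) * s_d**2)
p_corrected = stats.t.sf(t_corrected, df=k - 1)

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.5f}")
print(f"standard paired t:  t = {t_plain:.3f}, one-sided p = {p_plain:.4f}")
print(f"corrected variant:  t = {t_corrected:.3f}, one-sided p = {p_corrected:.4f}")
```

The corrected statistic is smaller than the standard one for the same data, which reflects the more conservative variance estimate.
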
---

### B. Permutation Test (Randomization Test)

* **Concept:** A non-parametric test that does not assume a particular distribution (such as normality) for the scores. It estimates the p-value directly by simulating the null hypothesis: if there is no difference between the models, the labels "Model A" and "Model B" are interchangeable within each fold, so randomly swapping the paired scores should frequently produce a statistic at least as extreme as the observed one. If it rarely does, the observed difference is unlikely under $H_0$.
* **Procedure (Conceptual for paired differences from CV):**
    1.  **Calculate Observed Statistic:** Compute the mean difference (or another statistic) between Model A's and Model B's scores from the CV folds. Let this be $D_{obs}$.
    2.  **Permutations:** Repeat for a large number of iterations (e.g., B = 1000 to 10,000):
        a.  For each pair of scores (A_i, B_i) from fold $i$, randomly decide (with 50% probability) whether to keep them as is or swap them. This creates a new "permuted" set of scores for Model A' and Model B'.
        b.  Equivalently, for the list of differences $d_i = A_i - B_i$, randomly flip the sign of each $d_i$ (multiply it by $+1$ or $-1$ with equal probability); swapping a pair is the same as negating its difference.
        c.  Calculate the mean difference $D_{perm}$ for this permuted set.
    3.  **Calculate p-value:** The p-value is the proportion of permuted mean differences ($D_{perm}$) that are as extreme as or more extreme than the originally observed mean difference ($D_{obs}$).
        * For a one-sided test ($H_A$: Model A > Model B), p-value = $(\text{count}(D_{perm} \ge D_{obs}) + 1) / (B + 1)$.
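        * For a two-sided test, p-value = $(\text{count}(|D_{perm}| \ge |D_{obs}|) + 1) / (B + 1)$. (Adding 1 to the numerator and denominator counts the observed arrangement itself, which keeps the Monte Carlo p-value from ever being exactly zero.)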
* **Interpretation:** If p-value < $\alpha$, reject $H_0$.
* **Pros:**
    * Non-parametric: Does not assume data normality.
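    * More robust than t-tests with non-normal data or when t-test assumptions are clearly violated, because the null distribution is built from the data itself.
    * Intuitive concept.
* **Cons:**
    * Computationally more intensive than a t-test when many permutations are sampled.
    * With $k$ paired scores there are only $2^k$ distinct sign assignments. For small $k$ the exact test (enumerating all of them) is cheap, but the smallest attainable one-sided p-value is $1/2^k$ (about 0.031 for $k = 5$). For large $k$ exhaustive enumeration becomes infeasible and a Monte Carlo approximation (sampling permutations) is used instead.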
* **Example:** Using the same CV scores:
    Observed differences: `[0.02, 0.01, 0.02, 0.01, 0.01]`. $D_{obs} = 0.014$.
    Permutation example: Randomly flip signs: `[-0.02, 0.01, -0.02, 0.01, -0.01]`. New mean: -0.006. Repeat many times.


* **Implementation (Conceptual Python):**
    ```python
    import numpy as np

    def permutation_test_paired(scores_a, scores_b, n_permutations=10000,
                                alternative='greater', seed=None):
        """Monte Carlo sign-flip permutation test for paired scores."""
        rng = np.random.default_rng(seed)

        # Paired test: work with the per-fold differences, not the two means
        differences = np.asarray(scores_a) - np.asarray(scores_b)
        observed_mean_diff = np.mean(differences)

        # Under H0 the sign of each difference is arbitrary: flip signs at random
        signs = rng.choice([-1, 1], size=(n_permutations, len(differences)))
        permuted_means = (signs * differences).mean(axis=1)

        tol = 1e-12  # guards against floating-point noise when comparing means
        if alternative == 'greater':
            count_extreme = np.sum(permuted_means >= observed_mean_diff - tol)
        elif alternative == 'less':
            count_extreme = np.sum(permuted_means <= observed_mean_diff + tol)
        elif alternative == 'two-sided':
            count_extreme = np.sum(np.abs(permuted_means) >= abs(observed_mean_diff) - tol)
        else:
            raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

        p_value = (count_extreme + 1) / (n_permutations + 1)
        return observed_mean_diff, p_value

    scores_model_A = np.array([0.92, 0.90, 0.93, 0.91, 0.92])  # Example data
    scores_model_B = np.array([0.90, 0.89, 0.91, 0.90, 0.91])  # Example data
    observed_mean_diff_perm, p_value_perm = permutation_test_paired(
        scores_model_A, scores_model_B, n_permutations=10000, alternative='greater')
    print("Permutation Test:")
    print(f"  Observed Mean Difference: {observed_mean_diff_perm:.3f}")
    # The p-value varies slightly from run to run; with 5 folds it is roughly 1/32 ≈ 0.031
    print(f"  p-value: {p_value_perm:.4f}")
    ```
    Specialized libraries like `mlxtend.evaluate.permutation_test` offer more optimized implementations.
* **Context:** A powerful and robust alternative to parametric tests like the t-test, especially when assumptions are questionable. Recommended when computational resources allow.
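
With only five folds, the sign-flip procedure has just $2^5 = 32$ possible outcomes, so the permutation distribution can be enumerated exactly instead of sampled. The following is a minimal sketch of that exact one-sided test; the function name `exact_sign_flip_test` is hypothetical and chosen for this illustration.

```python
from itertools import product

import numpy as np

def exact_sign_flip_test(scores_a, scores_b):
    """Exact one-sided (H_A: A > B) paired permutation test.

    Enumerates all 2^k sign assignments of the paired differences, so it is
    only practical for a small number of folds k.
    """
    d = np.asarray(scores_a) - np.asarray(scores_b)
    observed = d.mean()

    # Every possible combination of +1 / -1 signs, one row per assignment
    patterns = np.array(list(product([-1, 1], repeat=len(d))))
    permuted_means = (patterns * d).mean(axis=1)

    tol = 1e-12  # tolerance for floating-point comparison
    # The observed arrangement is one of the enumerated rows, so no +1 correction
    p_value = np.mean(permuted_means >= observed - tol)
    return observed, p_value

observed, p_exact = exact_sign_flip_test(
    [0.92, 0.90, 0.93, 0.91, 0.92],
    [0.90, 0.89, 0.91, 0.90, 0.91],
)
print(f"Observed mean difference: {observed:.3f}")
print(f"Exact one-sided p-value:  {p_exact:.5f}")  # 1/32 = 0.03125 for this data
```

Because every difference in the example is positive, only the unflipped arrangement reaches the observed mean, giving the minimum possible p-value of 1/32. This is noticeably larger than the t-test's p-value for the same data, which shows how coarse the permutation distribution is with so few folds.
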
---

### C. McNemar's Test (for Two Classifiers on a Single Test Set)

* **Concept:** Used for comparing the error rates of two binary classifiers based on their performance on a *single, shared test set*. It's a non-parametric test for paired nominal data (correct/incorrect). It focuses on the instances where the two classifiers *disagree*.
* **Setup:** Create a 2x2 contingency table based on disagreements:

    |                     | Model B Correct | Model B Incorrect |
    | :------------------ | :-------------- | :---------------- |
    | **Model A Correct** | $n_{00}$          | $n_{01}$            |
    | **Model A Incorrect** | $n_{10}$          | $n_{11}$            |

    * $n_{00}$: Number of instances where both A and B are correct.
    * $n_{01}$: Number of instances where A is correct, B is incorrect.
    * $n_{10}$: Number of instances where A is incorrect, B is correct.
    * $n_{11}$: Number of instances where both A and B are incorrect.
* **Formula (Test Statistic $\chi^2$ with continuity correction):**
    $\chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}$
    This statistic follows a chi-squared distribution with 1 degree of freedom.
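    $H_0$: Among the instances on which the models disagree, each model is equally likely to be the correct one, i.e., $P(\text{A correct, B incorrect}) = P(\text{A incorrect, B correct})$, so the expected counts satisfy $E[n_{01}] = E[n_{10}]$.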
* **Interpretation:** If p-value < $\alpha$, reject $H_0$. This means there's a statistically significant difference in the error rates of the two models (one model makes significantly more errors on instances where the other is correct).
* **Pros:**
    * Simple to compute.
    * Appropriate for paired nominal data (correct/incorrect predictions on the same set of instances).
    * Accounts for the pairing of the two models' predictions on each instance. The test instances themselves are still assumed to be independent of one another.
* **Cons:**
    * Only considers disagreements ($n_{01}, n_{10}$), ignoring cases where both models agree.
    * Primarily for error rates; not directly for comparing continuous metrics like AUC on a single test set.
    * Limited to comparing two classifiers.
    * Relies on a single train/test split, so unlike CV-based approaches it does not capture variability due to the choice of training set.
* **Example:**
    Suppose on a test set of 100 instances:
    * $n_{01}$ (A correct, B incorrect) = 15
    * $n_{10}$ (A incorrect, B correct) = 5
    $\chi^2 = \frac{(|15 - 5| - 1)^2}{15 + 5} = \frac{(10 - 1)^2}{20} = \frac{9^2}{20} = \frac{81}{20} = 4.05$
    Looking up p-value for $\chi^2=4.05$ with df=1 gives p $\approx$ 0.044.


* **Implementation (Statsmodels or MLxtend):**
    ```python
    from statsmodels.stats.contingency_tables import mcnemar
    # Or from mlxtend.evaluate import mcnemar_table, mcnemar

    # Contingency table for McNemar's test
    #                   Model B: Correct | Incorrect
    # Model A Correct          [[n00,       n01],
    # Model A Incorrect         [n10,       n11]]
    # Only the off-diagonal counts n01 and n10 enter the test statistic.
    table = [[70, 15],  # n00, n01
             [5,  10]]  # n10, n11

    # exact=False uses the chi-squared approximation; correction=True applies
    # the continuity correction from the formula above.
    result = mcnemar(table, exact=False, correction=True)
    print("McNemar's Test:")
    print(f"  chi-squared statistic: {result.statistic:.3f}")
    print(f"  p-value: {result.pvalue:.3f}")
    ```
    *Expected output for the example $n_{01}=15, n_{10}=5$:*
    * p-value $\approx$ 0.044 (if using chi-squared approximation with continuity correction).
* **Context:** Useful for a quick comparison of two classifiers' error patterns on the *same test set*. It's less common in modern ML where CV-based comparisons are favored for estimating how well models generalize.
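
In practice the disagreement counts come from per-instance predictions rather than being given directly. The sketch below builds them from two boolean "was this prediction correct" arrays and evaluates the formula by hand. The arrays are hypothetical and arranged only to reproduce the counts in the example; with real data they would be computed as `predictions == y_true`. An exact binomial version is included as well, since the chi-squared approximation can be rough when $n_{01} + n_{10}$ is small.

```python
import numpy as np
from scipy import stats

# Hypothetical correctness indicators for 100 test instances, ordered as:
# 70 both correct, 15 only A correct, 5 only B correct, 10 both incorrect
a_correct = np.array([True] * 70 + [True] * 15 + [False] * 5 + [False] * 10)
b_correct = np.array([True] * 70 + [False] * 15 + [True] * 5 + [False] * 10)

# Disagreement counts
n01 = int(np.sum(a_correct & ~b_correct))   # A correct, B incorrect
n10 = int(np.sum(~a_correct & b_correct))   # A incorrect, B correct

# Chi-squared statistic with continuity correction, 1 degree of freedom
chi2_stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
p_chi2 = stats.chi2.sf(chi2_stat, df=1)

# Exact two-sided version: under H0, n01 ~ Binomial(n01 + n10, 0.5)
p_exact = min(1.0, 2 * stats.binom.cdf(min(n01, n10), n01 + n10, 0.5))

print(f"n01 = {n01}, n10 = {n10}")
print(f"chi-squared = {chi2_stat:.2f}, p (chi-squared approx.) = {p_chi2:.3f}")
print(f"p (exact binomial) = {p_exact:.3f}")
```
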

---

### Comparing Multiple Models (>2 Models)

When you have more than two models and want to see if *any* of them perform differently, and if so, which ones:

* **ANOVA (Analysis of Variance) with post-hoc tests (e.g., Tukey HSD):**
    * **Concept:** ANOVA tests if there's a significant difference between the means of three or more groups (e.g., mean CV scores of Model A, B, C). If ANOVA is significant, post-hoc tests (like Tukey's Honestly Significant Difference) are used to find out which specific pairs of models differ significantly.
    * **Caveats:** ANOVA has assumptions (normality of data within groups, homogeneity of variances across groups, independence of observations) that are often violated by CV scores.
* **Friedman Test with post-hoc tests (e.g., Nemenyi, Bonferroni-Dunn):**
    * **Concept:** A non-parametric alternative to ANOVA for repeated measures (like models evaluated on the same CV folds). It ranks the models' performance on each fold, then tests if there are significant differences in these mean ranks. If the Friedman test is significant, post-hoc tests are used to compare models pairwise.
    * **Pros:** More robust to violations of ANOVA's assumptions.
    * **Visualization:** Often accompanied by Critical Difference (CD) diagrams.
* **Bayesian Approaches:**
    * **Concept:** Instead of p-values, these methods estimate the posterior probability distributions of the performance metrics for each model. You can then directly calculate probabilities like "P(Model A > Model B)" or define a Region Of Practical Equivalence (ROPE) to see if models are practically equivalent.
    * **Pros:** Provides richer information than p-values, allows direct probability statements about model superiority.
    * **Cons:** Can be more complex to set up and interpret.
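
As a minimal sketch of the Friedman test on fold-wise scores, the example below reuses the scores for Models A and B and adds a third set of scores for a hypothetical Model C, invented purely for illustration. It uses `scipy.stats.friedmanchisquare` and also reports each model's mean rank, which is the quantity the test compares.

```python
import numpy as np
from scipy import stats

scores = {
    "Model A": [0.92, 0.90, 0.93, 0.91, 0.92],
    "Model B": [0.90, 0.89, 0.91, 0.90, 0.91],
    "Model C": [0.88, 0.88, 0.90, 0.89, 0.89],  # hypothetical scores
}

# Friedman test: each argument holds one model's scores over the same folds
stat, p_value = stats.friedmanchisquare(*scores.values())

# Mean rank per model (rank 1 = best score on a fold)
score_matrix = np.column_stack(list(scores.values()))  # shape: (folds, models)
ranks = stats.rankdata(-score_matrix, axis=1)
mean_ranks = ranks.mean(axis=0)

print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")
for name, rank in zip(scores, mean_ranks):
    print(f"  {name}: mean rank {rank:.1f}")
```

A significant result only says that at least one model differs; a post-hoc procedure such as Nemenyi, or pairwise tests with a multiple-comparison correction, is still needed to identify which pairs differ.
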

---

### Practical Considerations & Best Practices

1.  **Baseline Model:** Always compare your models against a sensible baseline (e.g., a simple rule-based system, logistic regression, or a previous production model).
2.  **Corrected Tests:** For CV scores, consider using tests designed to better handle dependencies, such as Dietterich's 5x2cv paired t-test, or corrected repeated k-fold CV approaches if computational resources permit.
3.  **Effect Size vs. Statistical Significance:** A statistically significant difference doesn't automatically mean it's *practically* important. A tiny improvement (e.g., 0.01% accuracy) might be statistically significant with a very large dataset or many CV folds but offer no real-world benefit. Always consider the magnitude of the difference (effect size) alongside the p-value.
4.  **Multiple Comparisons Problem:** If you perform many pairwise tests (e.g., comparing 5 models pairwise = 10 tests), the chance of getting a false positive (Type I error) increases. Use corrections like Bonferroni correction or Holm-Bonferroni method for post-hoc tests, or use tests designed for multiple comparisons (like Friedman + Nemenyi).
5.  **Single Test Set vs. Cross-Validation:** Comparing models on a single train/test split is highly susceptible to sampling variability: the result depends heavily on which instances happen to land in the test set. Cross-validation provides more robust estimates of generalization performance and the variability of that performance, which is essential for statistical testing.
6.  **Reporting:** Clearly state your null hypothesis, the test used, the significance level ($\alpha$), the obtained test statistic, the p-value, and your conclusion in the context of the problem.
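
To make the multiple-comparisons point concrete, here is a minimal sketch of the Holm-Bonferroni step-down procedure next to the plain Bonferroni correction. The p-values are hypothetical, and the helper name `holm_bonferroni` is chosen for this illustration rather than taken from a library.

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Return a boolean array: True where H0 is rejected under Holm's method."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    # Visit p-values from smallest to largest, relaxing the threshold each step
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values from four pairwise model comparisons
p_values = [0.001, 0.012, 0.02, 0.2]
alpha = 0.05

bonferroni_reject = np.asarray(p_values) <= alpha / len(p_values)
holm_reject = holm_bonferroni(p_values, alpha)

print("Bonferroni rejects:", bonferroni_reject.tolist())
print("Holm rejects:      ", holm_reject.tolist())
```

Both procedures control the family-wise error rate at $\alpha$, but Holm's method is never less powerful than plain Bonferroni: in this example it rejects three hypotheses where Bonferroni rejects two.
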

Choosing the right statistical test and interpreting its results correctly is vital for making sound decisions about model selection and deployment.