# 📊 Statistical Analysis

**🎯 Objective:** To evaluate whether model performance (e.g., Accuracy, F1) varies significantly across demographic or dialectal groups.

> 🧠 Use non-parametric tests when normality assumptions may not hold (e.g., with small, imbalanced, or ordinal data).

---

**📌 Step 1: Kruskal–Wallis H Test**
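- **Purpose:** Non-parametric alternative to one-way ANOVA. Tests whether scores in at least one group tend to be higher or lower than in the others (e.g., AAVE vs White vs AAE-no-AAVE). It can be read as a test of medians when the group distributions have similar shapes.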

- **Test Used:** `scipy.stats.kruskal(group1_scores, group2_scores, group3_scores)`

- **Interpretation:**
  - **H-statistic:** Measures how far the groups' mean ranks spread apart; larger values indicate stronger separation between groups.
  - **p-value:** If **p < 0.05**, at least one group differs significantly from the others. Otherwise, the test found no evidence of a difference, which is not the same as showing that the groups perform equally.
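
To make the rank-based nature of the H-statistic concrete, the sketch below computes it by hand from pooled ranks and compares the result with `scipy.stats.kruskal`. The scores and the helper name `kruskal_h_by_hand` are hypothetical and only serve as an illustration; the hand computation omits the tie correction, so it matches SciPy only when no values are tied.

```python
import numpy as np
from scipy import stats

def kruskal_h_by_hand(*groups):
    """Compute the Kruskal-Wallis H statistic from pooled ranks (no tie correction)."""
    pooled = np.concatenate(groups)
    ranks = stats.rankdata(pooled)           # ranks 1..N over all observations
    n_total = len(pooled)
    weighted_rank_sums, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + len(g)]      # ranks that belong to this group
        weighted_rank_sums += r.sum() ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)

# Hypothetical scores, for illustration only (no ties)
group_a = [0.81, 0.77, 0.82]
group_b = [0.80, 0.76, 0.83]
group_c = [0.79, 0.74, 0.825]

h_manual = kruskal_h_by_hand(group_a, group_b, group_c)
h_scipy, p_scipy = stats.kruskal(group_a, group_b, group_c)
print(f"H (by hand) = {h_manual:.4f}, H (scipy) = {h_scipy:.4f}, p = {p_scipy:.4f}")
```

Because only the ranks enter the formula, rescaling the scores or replacing the largest value with an extreme outlier leaves H unchanged.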

---

**📌 Step 2: Dunn’s Test (Post-Hoc Analysis)**
- **Purpose:** If Kruskal–Wallis is significant, Dunn’s test identifies which **specific pairs** of groups differ.

- **Test Used:**  
  `scikit_posthocs.posthoc_dunn(df, val_col="metric", group_col="group", p_adjust="bonferroni")`

- **Interpretation:**
  - **p-adj:** Corrected p-value for each group pair. If **p-adj < 0.05**, the pair differs significantly.
  - **Correction:** Use Bonferroni or Holm correction to control the family-wise Type I error rate across the pairwise comparisons.
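
The difference between the two corrections can be sketched in plain Python. The raw p-values below are hypothetical, and the helper functions `bonferroni` and `holm` are written for illustration; in the notebook itself the adjustment is delegated to the `p_adjust` argument of `posthoc_dunn`.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of comparisons (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down: the k-th smallest p-value is multiplied by (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)   # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for three pairwise comparisons
raw_p = [0.010, 0.020, 0.040]
alpha = 0.05

bonf_p = bonferroni(raw_p)
holm_p = holm(raw_p)
n_sig_bonf = sum(p < alpha for p in bonf_p)
n_sig_holm = sum(p < alpha for p in holm_p)
print("Bonferroni:", [round(p, 3) for p in bonf_p], "significant:", n_sig_bonf)
print("Holm:      ", [round(p, 3) for p in holm_p], "significant:", n_sig_holm)
```

Holm never yields larger adjusted p-values than Bonferroni, so it is the less conservative of the two while still controlling the family-wise error rate.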

---

**📝 Reporting:**
- **Kruskal–Wallis:** Report H-statistic and p-value per metric/model.
- **Dunn’s Test:** Report all pairwise comparisons and highlight those with adjusted p < 0.05.
- **Conclusion:** Summarize group-wise disparities and potential fairness concerns across dialects or demographics.

---

**🔍 When to prefer Kruskal–Wallis + Dunn’s over ANOVA + Tukey:**
- Your data is **not normally distributed**, or
- You have **ordinal data**, or
- Sample sizes are **small or unequal**, or
- You want to test **median differences** rather than means.
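
One way to turn this checklist into a quick screening step is sketched below, using Shapiro–Wilk for normality and Levene's test for equal variances. The scores, the 0.05 threshold, and the decision rule are assumptions made for illustration; with very small samples these checks have little power, so a non-significant result should not be taken as proof that the ANOVA assumptions hold.

```python
from scipy import stats

# Hypothetical per-group scores, for illustration only
scores_by_group = {
    "group_a": [0.81, 0.77, 0.82, 0.79, 0.80],
    "group_b": [0.80, 0.76, 0.83, 0.78, 0.81],
    "group_c": [0.79, 0.74, 0.82, 0.60, 0.80],
}
alpha = 0.05

# Shapiro-Wilk per group: small p-values suggest non-normal data
normality_p = {}
for name, values in scores_by_group.items():
    _, p_value = stats.shapiro(values)
    normality_p[name] = p_value

# Levene's test across groups: a small p-value suggests unequal variances
_, levene_p = stats.levene(*scores_by_group.values())

assumptions_hold = all(p > alpha for p in normality_p.values()) and levene_p > alpha
test_family = "ANOVA + Tukey HSD" if assumptions_hold else "Kruskal-Wallis + Dunn"

print("Shapiro-Wilk p-values:", {k: round(v, 4) for k, v in normality_p.items()})
print(f"Levene p-value: {levene_p:.4f}")
print("Suggested test family:", test_family)
```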

<!-- # Statistical Analysis (OLD SCRIPT)

**🎯 Objective:** To evaluate whether model performance (e.g., Accuracy, F1) varies significantly across demographic or dialectal groups.

**📌 Step 1: ANOVA (Analysis of Variance)**
- Purpose: Test whether the mean performance differs across multiple groups (e.g., AAVE vs White vs AAE-no-AAVE).

- Test Used: scipy.stats.f_oneway(group1_scores, group2_scores, group3_scores)

- Interpretation:
  - F-statistic: The ratio of between-group to within-group variance.
  - p-value: The probability that any observed difference is due to chance. If p < 0.05, at least one group differs significantly. Else, no evidence of difference.

**📌 Step 2: Tukey HSD (Post-Hoc Analysis)**
- Purpose: If ANOVA is significant, Tukey HSD tells us which specific groups differ.

- Test Used: statsmodels.stats.multicomp.pairwise_tukeyhsd(endog, groups, alpha=0.05)

- Interpretation:
  - meandiff: Difference in means between each group pair
  - p-adj: Corrected p-value for multiple comparisons. p-adj < 0.05 means the difference is significant
  - lower/upper: Confidence Interval (CI) for the difference in means. If the CI (lower, upper) does not include 0, then difference is likely significant.
  - reject = True: Difference is statistically significant; False: otherwise.

**📝 Reporting:**
- ANOVA: Report F-statistic and p-value per metric/model.
- Tukey HSD: Report all pairwise comparisons and note which are statistically significant.
- Conclusion: Summarize which models (if any) show bias or performance disparities by group. -->

## 1. Import Essential Libraries and Load The Results Summary

In [1]:
pip install scikit-posthocs



In [2]:
import os
import pandas as pd

In [3]:
import scipy.stats as stats
from scipy.stats import kruskal
try:
    import scikit_posthocs as sp
    HAS_SCPH = True
except ImportError:
    HAS_SCPH = False
    print("scikit-posthocs not installed correctly. Re-run: pip install scikit-posthocs")

In [4]:
RESULTS_DIR = "/content/drive/MyDrive/Colab Notebooks/Dialect_Sentiment_Analysis/results/sentiment_analysis"

# check validity of the path
print("Exists:", os.path.exists(RESULTS_DIR))
print("Is directory:", os.path.isdir(RESULTS_DIR))
print("Contents:", os.listdir(RESULTS_DIR))

# load CSVs
df_full = pd.read_csv(f"{RESULTS_DIR}/sentiment_bias_results_full.csv")
df_balanced = pd.read_csv(f"{RESULTS_DIR}/sentiment_bias_results_balanced.csv")
df_all = pd.concat([df_full, df_balanced])
df_all

Exists: True
Is directory: True
Contents: ['Full_Set_RoBERTa_White_confusion_matrix (1).png', 'Full_Set_RoBERTa_AAE-no-AAVE_confusion_matrix (1).png', 'Full_Set_RoBERTa_AAVE_confusion_matrix (1).png', 'Full_Set_RoBERTa-Latest_White_confusion_matrix (1).png', 'Full_Set_RoBERTa-Latest_AAE-no-AAVE_confusion_matrix (1).png', 'Full_Set_RoBERTa-Latest_AAVE_confusion_matrix (1).png', 'Full_Set_BERTweet_White_confusion_matrix (1).png', 'Full_Set_BERTweet_AAE-no-AAVE_confusion_matrix (1).png', 'Full_Set_BERTweet_AAVE_confusion_matrix (1).png', 'Balanced_Set_RoBERTa_White_confusion_matrix (1).png', 'accuracy_full.png', 'accuracy_balanced.png', 'f1_score_full.png', 'f1_score_balanced.png', 'sentiment_bias_results_full.csv', 'sentiment_bias_results_balanced.csv', 'sentiment_bias_results_all.csv', 'Full_Set_BERTweet_AAE-no-AAVE_confusion_matrix.png', 'Full_Set_BERTweet_AAVE_confusion_matrix.png', 'Full_Set_BERTweet_White_confusion_matrix.png', 'Full_Set_RoBERTa-Latest_White_confusion_matrix.png', '

Unnamed: 0,model,group,accuracy,f1_score,mode
0,RoBERTa,White,0.808815,0.807437,full
1,RoBERTa,AAE-no-AAVE,0.795628,0.787915,full
2,RoBERTa,AAVE,0.789287,0.78972,full
3,RoBERTa-Latest,White,0.768141,0.766732,full
4,RoBERTa-Latest,AAE-no-AAVE,0.755306,0.746919,full
5,RoBERTa-Latest,AAVE,0.73785,0.739475,full
6,BERTweet,White,0.818611,0.817249,full
7,BERTweet,AAE-no-AAVE,0.824915,0.820304,full
8,BERTweet,AAVE,0.821923,0.823498,full
0,RoBERTa,White,0.815537,0.815987,balanced


## 2. Perform Kruskal–Wallis and Dunn's Tests

In [5]:
def run_kruskal_dunn(df, df_type="balanced", metric_name="accuracy",
                     correction="bonferroni", alpha=0.05):
    """Run a Kruskal–Wallis test across groups; run Dunn's test only if it is significant."""
    print(f"\n=== {metric_name.upper()} on the {df_type} set ===")

    # Get unique groups
    groups = df["group"].unique()

    # Collect one array of scores per group
    group_scores = [df.loc[df["group"] == g, metric_name].values for g in groups]

    # Kruskal–Wallis test
    h_stat, p_val = stats.kruskal(*group_scores)
    print(f"Kruskal–Wallis H-test:\nH = {h_stat:.4f}, p = {p_val:.4f}")

    if p_val < alpha:
        print("Significant differences found. Proceeding to Dunn's test...\n")

        # Dunn's test needs scikit-posthocs; stop cleanly if the import failed earlier
        if not HAS_SCPH:
            print("scikit-posthocs is not available. Skipping Dunn's test.")
            return

        # Dunn's post-hoc test
        dunn_results = sp.posthoc_dunn(
            df,
            val_col=metric_name,
            group_col="group",
            p_adjust=correction
        )
        print("Dunn's test (adjusted p-values):")
        print(dunn_results)
    else:
        print("No statistically significant difference detected. Skipping Dunn's test.")


In [6]:
# For accuracy on the balanced set
run_kruskal_dunn(df_balanced, df_type="Balanced", metric_name="accuracy")


=== ACCURACY on the Balanced set ===
Kruskal–Wallis H-test:
H = 0.6222, p = 0.7326
No statistically significant difference detected. Skipping Dunn's test.


In [7]:
# For F1 on the balanced set
run_kruskal_dunn(df_balanced, df_type="Balanced", metric_name="f1_score")


=== F1_SCORE on the Balanced set ===
Kruskal–Wallis H-test:
H = 0.6222, p = 0.7326
No statistically significant difference detected. Skipping Dunn's test.


In [8]:
# For accuracy on the full set
run_kruskal_dunn(df_full, df_type="Full", metric_name="accuracy")


=== ACCURACY on the Full set ===
Kruskal–Wallis H-test:
H = 0.2667, p = 0.8752
No statistically significant difference detected. Skipping Dunn's test.


In [9]:
# For F1 on the full set
run_kruskal_dunn(df_full, df_type="Full", metric_name="f1_score")


=== F1_SCORE on the Full set ===
Kruskal–Wallis H-test:
H = 0.0889, p = 0.9565
No statistically significant difference detected. Skipping Dunn's test.
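
The full-set accuracy result above can be reproduced directly from the nine rows of the results table, which also shows how little data the test sees: each group contributes one score per model, so three values. The sketch below reruns the test on those values and then on a perfectly separated set of ranks, which gives the largest H that three groups of three can reach. The variable names are chosen for illustration.

```python
import math
from scipy import stats

# Full-set accuracy per group, copied from the results table (one value per model)
white       = [0.808815, 0.768141, 0.818611]
aae_no_aave = [0.795628, 0.755306, 0.824915]
aave        = [0.789287, 0.737850, 0.821923]

h_stat, p_val = stats.kruskal(white, aae_no_aave, aave)
print(f"Observed:           H = {h_stat:.4f}, p = {p_val:.4f}")

# Best case for three groups of three: the groups' ranks do not overlap at all
h_max, p_min = stats.kruskal([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(f"Perfect separation: H = {h_max:.4f}, p = {p_min:.4f}")
```

With this design the smallest p-value the chi-square approximation can return is about 0.027, and only when every score in one group outranks every score in the next. The non-significant results are therefore weak evidence of equal performance; they mostly reflect that three models per group give the test very little power.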


# 📊 Extended Sentiment Analysis & Fairness Evaluation

In addition to F1-score, accuracy, and confusion matrices per model, this notebook explores model performance disparities across dialect groups (White, AAE-no-AAVE, AAVE), including:
- More in-depth per-class metrics
- Top confusion pairs per model/group
- Cross-model disagreement
- The most "difficult" tweets (those the models misclassified)

In [10]:
import re
import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_recall_fscore_support,
    confusion_matrix,
    ConfusionMatrixDisplay,
    classification_report
)

In [11]:
# data paths of the predictions records
RESULTS_DIR = "/content/drive/MyDrive/Colab Notebooks/Dialect_Sentiment_Analysis/results/sentiment_analysis/"
FULL_CSV      = RESULTS_DIR + "sentiment_bias_predictions_full.csv"
BALANCED_CSV  = RESULTS_DIR + "sentiment_bias_predictions_balanced.csv"

# integer to label mapping and vice versa
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

In [12]:
# load the predictions
df_full = pd.read_csv(FULL_CSV)
df_balanced  = pd.read_csv(BALANCED_CSV)
df_all  = pd.concat([df_full, df_balanced], ignore_index=True)

# map ints to strings for readability
if "true_label" in df_all and np.issubdtype(df_all["true_label"].dtype, np.integer):
    df_all["true_label_str"] = df_all["true_label"].map(ID2LABEL)
    df_all["pred_label_str"] = df_all["pred_label"].map(ID2LABEL)

df_all.head(10)

Unnamed: 0,model,group,mode,text,true_label,pred_label,true_label_str,pred_label_str
0,RoBERTa,White,full,"""QT @user In the original draft of the 7th boo...",2,2,positive,positive
1,RoBERTa,White,full,"""Ben Smith / Smith (concussion) remains out of...",1,1,neutral,neutral
2,RoBERTa,White,full,Sorry bout the stream last night I crashed out...,1,1,neutral,neutral
3,RoBERTa,White,full,Chase Headley's RBI double in the 8th inning o...,1,1,neutral,neutral
4,RoBERTa,White,full,@user Alciato: Bee will invest 150 million in ...,2,1,positive,neutral
5,RoBERTa,White,full,So disappointed in wwe summerslam! I want to s...,0,0,negative,negative
6,RoBERTa,White,full,@user @user CENA & AJ sitting in a tree K-I-S-...,1,1,neutral,neutral
7,RoBERTa,White,full,@user Well said on HMW. Can you now address wh...,1,1,neutral,neutral
8,RoBERTa,White,full,Just said hello to Dennis Kucinich as he walke...,1,2,neutral,positive
9,RoBERTa,White,full,Super excited for homecoming Saturday with Mon...,2,2,positive,positive


In [13]:
# recompute the metrics, e.g. accuracy and f1 score, to verify the validity of the predictions data file
def aggregate_metrics(df):
    rows = []
    for (model, group, mode), g in df.groupby(["model", "group", "mode"]):
        acc = accuracy_score(g["true_label"], g["pred_label"])
        f1  = f1_score(g["true_label"], g["pred_label"], average="macro")
        rows.append({"model": model, "group": group, "mode": mode, "accuracy": acc, "f1_macro": f1})
    return pd.DataFrame(rows)

In [14]:
metrics_df = aggregate_metrics(df_all)
display(metrics_df)

Unnamed: 0,model,group,mode,accuracy,f1_macro
0,BERTweet,AAE-no-AAVE,balanced,0.820504,0.816501
1,BERTweet,AAE-no-AAVE,full,0.824915,0.820304
2,BERTweet,AAVE,balanced,0.821923,0.823498
3,BERTweet,AAVE,full,0.821923,0.823498
4,BERTweet,White,balanced,0.830791,0.831364
5,BERTweet,White,full,0.818611,0.817249
6,RoBERTa,AAE-no-AAVE,balanced,0.79851,0.788645
7,RoBERTa,AAE-no-AAVE,full,0.795628,0.787915
8,RoBERTa,AAVE,balanced,0.789287,0.78972
9,RoBERTa,AAVE,full,0.789287,0.78972


In [15]:
# compute the error df for error analysis
errors_df = df_all[df_all["true_label"] != df_all["pred_label"]].copy()
errors_df.to_csv(os.path.join(RESULTS_DIR, "error_rates.csv"), index=False)

# Error rate per (model, group, mode)
err_rates = (
    df_all.assign(correct=lambda d: d["true_label"] == d["pred_label"])
          .groupby(["model", "group", "mode"])["correct"]
          .apply(lambda x: 1 - x.mean())
          .reset_index(name="error_rate")
)
display(err_rates)

Unnamed: 0,model,group,mode,error_rate
0,BERTweet,AAE-no-AAVE,balanced,0.179496
1,BERTweet,AAE-no-AAVE,full,0.175085
2,BERTweet,AAVE,balanced,0.178077
3,BERTweet,AAVE,full,0.178077
4,BERTweet,White,balanced,0.169209
5,BERTweet,White,full,0.181389
6,RoBERTa,AAE-no-AAVE,balanced,0.20149
7,RoBERTa,AAE-no-AAVE,full,0.204372
8,RoBERTa,AAVE,balanced,0.210713
9,RoBERTa,AAVE,full,0.210713


In [16]:
# produce more in-depth metrics per class
def per_class_metrics(df):
    rows = []
    for (model, group, mode), g in df.groupby(["model", "group", "mode"]):
        p, r, f1, support = precision_recall_fscore_support(
            g["true_label"], g["pred_label"], labels=[0,1,2], zero_division=0
        )
        for i, lab in enumerate([0,1,2]):
            rows.append({
                "model": model, "group": group, "mode": mode,
                "label": lab, "label_str": ID2LABEL[lab],
                "precision": p[i], "recall": r[i], "f1": f1[i], "support": support[i]
            })
    return pd.DataFrame(rows)

per_class_df = per_class_metrics(df_all)
display(per_class_df)
per_class_df.to_csv(os.path.join(RESULTS_DIR, "per_class_metrics.csv"), index=False)

Unnamed: 0,model,group,mode,label,label_str,precision,recall,f1,support
0,BERTweet,AAE-no-AAVE,balanced,0,negative,0.772401,0.832046,0.801115,518
1,BERTweet,AAE-no-AAVE,balanced,1,neutral,0.83345,0.821899,0.827634,1443
2,BERTweet,AAE-no-AAVE,balanced,2,positive,0.830549,0.811189,0.820755,858
3,BERTweet,AAE-no-AAVE,full,0,negative,0.789474,0.820981,0.804919,877
4,BERTweet,AAE-no-AAVE,full,1,neutral,0.833402,0.834786,0.834094,2409
5,BERTweet,AAE-no-AAVE,full,2,positive,0.833453,0.810659,0.821898,1426
6,BERTweet,AAVE,balanced,0,negative,0.79062,0.858182,0.823017,550
7,BERTweet,AAVE,balanced,1,neutral,0.818106,0.782051,0.799672,1248
8,BERTweet,AAVE,balanced,2,positive,0.844509,0.851126,0.847805,1021
9,BERTweet,AAVE,full,0,negative,0.79062,0.858182,0.823017,550


In [17]:
# top confusion pairs per model/group
def top_confusions(df, top_k=5):
    rows = []
    for (model, group, mode), g in df.groupby(["model", "group", "mode"]):
        cm = confusion_matrix(g["true_label"], g["pred_label"], labels=[0,1,2])
        # Flatten pairs
        for i_true in range(cm.shape[0]):
            for i_pred in range(cm.shape[1]):
                if i_true != i_pred and cm[i_true, i_pred] > 0:
                    rows.append({
                        "model": model, "group": group, "mode": mode,
                        "true": ID2LABEL.get(i_true, i_true),
                        "pred": ID2LABEL.get(i_pred, i_pred),
                        "count": cm[i_true, i_pred]
                    })
    out = pd.DataFrame(rows).sort_values("count", ascending=False)
    return out.groupby(["model", "group", "mode"]).head(top_k)

display(top_confusions(df_all))

Unnamed: 0,model,group,mode,true,pred,count
107,RoBERTa-Latest,White,full,positive,neutral,3503
104,RoBERTa-Latest,White,full,neutral,negative,3261
71,RoBERTa,White,full,positive,neutral,3228
105,RoBERTa-Latest,White,full,neutral,positive,3146
33,BERTweet,White,full,neutral,positive,2949
...,...,...,...,...,...,...
4,BERTweet,AAE-no-AAVE,balanced,positive,negative,8
25,BERTweet,White,balanced,negative,positive,8
13,BERTweet,AAVE,balanced,negative,positive,7
19,BERTweet,AAVE,full,negative,positive,7
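
Raw confusion counts scale with group size, so the largest group tends to dominate a ranking by `count`. A possible complement, sketched below with NumPy only, is to divide each row of the confusion matrix by its support so that confusions are expressed as a share of the true class. The labels and the helper name `confusion_rates` are hypothetical.

```python
import numpy as np

def confusion_rates(y_true, y_pred, n_classes=3):
    """Return the count matrix and a row-normalized version (rate per true class)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)   # tally (true, pred) pairs
    support = cm.sum(axis=1, keepdims=True)                      # examples per true class
    rates = np.divide(cm, support,
                      out=np.zeros(cm.shape, dtype=float),
                      where=support > 0)                          # avoid division by zero
    return cm, rates

# Hypothetical labels (0 = negative, 1 = neutral, 2 = positive)
y_true = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 1, 2, 1, 2, 2]

cm, rates = confusion_rates(y_true, y_pred)
print(cm)
print(np.round(rates, 2))
```

In this toy example the negative-to-neutral confusion has the same count as the others (one tweet) but the highest rate (0.5), because the negative class has only two examples.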


In [27]:
# compute disagreement rate between models

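# pivot to get one column of predictions per model
# note: pivot_table uses aggfunc="mean" by default, so if the same
# (text, group, mode, model) combination appears more than once, its labels are averaged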
pivot_df = df_all.pivot_table(
    index=["text", "group", "mode"],
    columns="model",
    values="pred_label"
).reset_index()

# compute disagreement: at least one model differs from the others
# (nunique skips NaN, so rows with a prediction from only one model never count as disagreement)
pivot_df["disagree"] = pivot_df.iloc[:, 3:].nunique(axis=1) > 1

# overall disagreement rate
overall_disagreement = pivot_df["disagree"].mean()

# grouped disagreement rates
grouped_disagreement = pivot_df.groupby(["group", "mode"])["disagree"].mean()

print("Disagreement Rate Between Models")
print(f"\nOverall disagreement rate: {overall_disagreement:.4f}")
print(grouped_disagreement)

Disagreement Rate Between Models

Overall disagreement rate: 0.1931
group        mode    
AAE-no-AAVE  balanced    0.133862
             full        0.246337
AAVE         balanced    0.240866
             full        0.240866
White        balanced    0.007875
             full        0.216499
Name: disagree, dtype: float64
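
The disagreement logic can be checked on a small toy table. The sketch below uses hypothetical predictions, keeps the first label per cell with `aggfunc="first"` instead of averaging, and restricts the rate to tweets that every model predicted; both choices are assumptions made for this illustration rather than what the cell above does.

```python
import pandas as pd

# Hypothetical predictions: three models (A, B, C) on three tweets
toy = pd.DataFrame({
    "text":       ["t1", "t1", "t1", "t2", "t2", "t2", "t3", "t3", "t3"],
    "group":      ["AAVE"] * 3 + ["White"] * 6,
    "mode":       ["full"] * 9,
    "model":      ["A", "B", "C"] * 3,
    "pred_label": [0, 0, 0, 1, 2, 1, 2, 2, 2],
})

# One column per model; "first" keeps a label as-is instead of averaging duplicates
wide = toy.pivot_table(
    index=["text", "group", "mode"],
    columns="model",
    values="pred_label",
    aggfunc="first",
)

# Keep only tweets for which every model produced a prediction
n_models = wide.notna().sum(axis=1)
wide_complete = wide[n_models == wide.shape[1]]

disagree = wide_complete.nunique(axis=1) > 1
rate = disagree.mean()
print(disagree)
print(f"Disagreement rate: {rate:.4f}")
```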


In [19]:
# find the most "difficult" tweets, i.e., those the models misclassified most often

# work on a copy and flag each prediction as correct (1) or wrong (0)
merged = df_all.copy()
merged["correct"] = (merged["true_label"] == merged["pred_label"]).astype(int)

# compute mean correctness across models
mean_correct = (
    merged.groupby(["text", "group", "mode"])["correct"]
    .mean()
    .reset_index()
    .rename(columns={"correct": "mean_correct"})
)

# sort to show the most consistently misclassified tweets
worst_examples = mean_correct.sort_values("mean_correct").head(10)

print("Consistently Misclassified Tweets")
display(worst_examples)

Consistently Misclassified Tweets


Unnamed: 0,text,group,mode,mean_correct
75097,😑Ellison swings at Trump in Thanksgiving tweet...,White,full,0.0
22593,"@user I don't cry very often , bit your anti a...",White,full,0.0
22595,"@user I don't get it, they hate Christians and...",White,full,0.0
22596,@user I don't get the Davis love. I mean sure ...,White,full,0.0
22601,@user I don't know why they haven't put Siri o...,White,balanced,0.0
75055,🏀@KoponenPetteri is ready #fcblive #GSOFCB,White,balanced,0.0
51,"""#BlueJays WIN 9-2! Oh, and David Price &amp; ...",White,full,0.0
57,"""#Brewers Ryan Braun went 2-for-5, left on bas...",White,full,0.0
75024,"‣ Chomsky and Hillary | ""Why are some leftist...",AAE-no-AAVE,full,0.0
75025,…but I’m still surprised anybody cares about m...,AAE-no-AAVE,balanced,0.0
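
A mean correctness of 0.0 is more informative when it is known how many models contributed to it. The sketch below, on hypothetical data, adds a model count next to the mean and keeps only tweets scored by at least `min_models` models; the threshold and the column name `n_models` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical predictions: t2 was scored by a single model only
toy = pd.DataFrame({
    "text":       ["t1", "t1", "t1", "t2", "t3", "t3", "t3"],
    "model":      ["A", "B", "C", "A", "A", "B", "C"],
    "true_label": [2, 2, 2, 1, 0, 0, 0],
    "pred_label": [1, 1, 0, 2, 0, 1, 0],
})
toy["correct"] = (toy["true_label"] == toy["pred_label"]).astype(int)

# Mean correctness plus the number of distinct models behind each mean
summary = (
    toy.groupby("text")
       .agg(mean_correct=("correct", "mean"), n_models=("model", "nunique"))
       .reset_index()
)

min_models = 3  # assumed threshold: require predictions from all three models
hardest = summary[summary["n_models"] >= min_models].sort_values("mean_correct")
print(hardest)
```

Here `t2` is also always wrong, but it is dropped because a single model's error says nothing about whether the tweet is consistently hard.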
