# Dialect Perplexity Analysis

This notebook analyzes the results of the dialect perplexity calculation to assess potential model bias against African American English (AAE) relative to Standard American English (SAE).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
import os

# Set visualization style
sns.set(style="whitegrid", palette="muted", font_scale=1.2)
plt.rcParams['figure.figsize'] = (12, 8)

## Load the Results

In [None]:
# Path to results file
results_path = "../output_datasets/dialect_perplexity_results.csv"

# Fail early if the results file is missing, so later cells never see an undefined `df`
if not os.path.exists(results_path):
    raise FileNotFoundError(
        f"Results file not found at {results_path}. "
        "Please run the perplexity calculation script first."
    )

df = pd.read_csv(results_path)
print(f"Loaded {len(df)} examples from {results_path}")
df.head()  # last expression in the cell, so the notebook displays it
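
Later cells assume that specific columns exist in the CSV. A small check right after loading can surface a schema mismatch early. The sketch below is illustrative only: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is simply the set of columns referenced elsewhere in this notebook, not a schema guaranteed by the calculation script.

```python
import pandas as pd

# Columns referenced by later cells in this notebook (collected by hand;
# an assumption about the CSV schema, not a guarantee from the script)
REQUIRED_COLUMNS = [
    "aae_text", "sae_text",
    "aae_perplexity", "sae_perplexity",
    "perplexity_difference", "perplexity_ratio",
]

def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`, in order."""
    return [col for col in required if col not in frame.columns]

# Example usage with the loaded results:
# missing = missing_columns(df)
# if missing:
#     raise KeyError(f"Results file is missing columns: {missing}")
```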

## Basic Statistics

In [None]:
# Calculate summary statistics
summary_stats = pd.DataFrame({
    'AAE Perplexity': df['aae_perplexity'].describe(),
    'SAE Perplexity': df['sae_perplexity'].describe(),
    'Perplexity Difference': df['perplexity_difference'].describe(),
    'Perplexity Ratio': df['perplexity_ratio'].describe()
})

summary_stats
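
Before interpreting the derived columns, it can help to confirm how they were computed. The sketch below assumes that `perplexity_difference` is AAE minus SAE (as the plot titles later in this notebook suggest) and that `perplexity_ratio` is AAE divided by SAE (an assumption; the ratio's definition is not stated here). `check_derived_columns` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def check_derived_columns(frame: pd.DataFrame, rtol: float = 1e-6) -> dict:
    """Compare the stored derived columns against values recomputed from the raw perplexities."""
    aae = frame["aae_perplexity"].to_numpy(dtype=float)
    sae = frame["sae_perplexity"].to_numpy(dtype=float)
    return {
        # Assumed definition: difference = AAE - SAE
        "difference_is_aae_minus_sae": bool(
            np.allclose(frame["perplexity_difference"], aae - sae, rtol=rtol)
        ),
        # Assumed definition: ratio = AAE / SAE
        "ratio_is_aae_over_sae": bool(
            np.allclose(frame["perplexity_ratio"], aae / sae, rtol=rtol)
        ),
    }

# Example usage with the loaded results:
# check_derived_columns(df)
```

If either check comes back `False`, the sign or direction of the stored column differs from the assumption, and the axis labels and interpretations below should be adjusted accordingly.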

## Statistical Significance Testing

Let's test whether the difference in perplexity between AAE and SAE is statistically significant.

In [None]:
# Perform paired t-test
t_stat, p_value = stats.ttest_rel(df['aae_perplexity'], df['sae_perplexity'])
print(f"Paired t-test results: t-statistic = {t_stat:.4f}, p-value = {p_value:.8f}")
print(f"Is the difference statistically significant at α=0.05? {p_value < 0.05}")

# Calculate effect size (Cohen's d for paired samples)
diff = df['aae_perplexity'] - df['sae_perplexity']
d = diff.mean() / diff.std()
print(f"Effect size (Cohen's d): {d:.4f}")
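
Perplexity values are often right-skewed, so a paired t-test on raw scores can be sensitive to a few extreme pairs. As a possible robustness check (not part of the original analysis), the sketch below runs a Wilcoxon signed-rank test on the raw pairs and a paired t-test on log perplexities. `paired_robustness_checks` is a hypothetical helper name, and it assumes all perplexities are strictly positive.

```python
import numpy as np
from scipy import stats

def paired_robustness_checks(aae, sae) -> dict:
    """Run two alternative paired tests on matched AAE/SAE perplexities."""
    aae = np.asarray(aae, dtype=float)
    sae = np.asarray(sae, dtype=float)

    # Non-parametric paired test on the raw perplexities
    w_stat, w_p = stats.wilcoxon(aae, sae)

    # Paired t-test on log perplexities (requires strictly positive values)
    log_diff = np.log(aae) - np.log(sae)
    log_t, log_p = stats.ttest_rel(np.log(aae), np.log(sae))

    return {
        "wilcoxon_stat": float(w_stat),
        "wilcoxon_p": float(w_p),
        "log_t_stat": float(log_t),
        "log_t_p": float(log_p),
        # exp(mean log ratio): the geometric mean of AAE/SAE across pairs
        "geometric_mean_ratio": float(np.exp(log_diff.mean())),
    }

# Example usage with the loaded results:
# paired_robustness_checks(df['aae_perplexity'], df['sae_perplexity'])
```

If these checks agree with the paired t-test above, the conclusion is less likely to be driven by outliers; if they disagree, the extreme pairs deserve a closer look.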

## Visualizations

### Distribution of Perplexity Scores

In [None]:
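plt.figure(figsize=(14, 7))

# Distribution of perplexities
# Rename the columns so seaborn's automatic legend is labeled correctly;
# overriding it with plt.legend([...]) can mislabel or reorder the series
plot_df = df[['aae_perplexity', 'sae_perplexity']].rename(
    columns={'aae_perplexity': 'AAE', 'sae_perplexity': 'SAE'})
sns.histplot(data=plot_df, bins=30, kde=True, alpha=0.7)
plt.title('Distribution of Perplexity Scores', fontsize=16)
plt.xlabel('Perplexity', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
# Make sure the output directory exists before saving any figures
os.makedirs('../results/dialect_perplexity_analysis', exist_ok=True)
plt.savefig('../results/dialect_perplexity_analysis/perplexity_distribution.png', dpi=300)
plt.show()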

### Boxplot Comparison of AAE vs SAE Perplexity

In [None]:
plt.figure(figsize=(10, 8))

df_melted = pd.melt(df, value_vars=['aae_perplexity', 'sae_perplexity'], 
                     var_name='Dialect', value_name='Perplexity')
sns.boxplot(x='Dialect', y='Perplexity', data=df_melted)
plt.title('Perplexity by Dialect', fontsize=16)
plt.ylabel('Perplexity', fontsize=14)
plt.xlabel('Dialect', fontsize=14)
plt.xticks([0, 1], ['AAE', 'SAE'], fontsize=12)
plt.tight_layout()
plt.savefig('../results/dialect_perplexity_analysis/perplexity_boxplot.png', dpi=300)
plt.show()

### Scatter Plot: AAE vs SAE Perplexity

In [None]:
plt.figure(figsize=(10, 10))

# Scatter plot of AAE vs SAE perplexity
plt.scatter(df['sae_perplexity'], df['aae_perplexity'], alpha=0.5)

# Add diagonal line (where both perplexities are equal)
max_val = max(df['aae_perplexity'].max(), df['sae_perplexity'].max())
plt.plot([0, max_val], [0, max_val], 'r--', alpha=0.7, label='Equal Perplexity Line')

plt.title('AAE vs. SAE Perplexity', fontsize=16)
plt.xlabel('SAE Perplexity', fontsize=14)
plt.ylabel('AAE Perplexity', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../results/dialect_perplexity_analysis/perplexity_scatter.png', dpi=300)
plt.show()

### Distribution of Perplexity Differences

In [None]:
plt.figure(figsize=(12, 7))

# Distribution of perplexity differences
sns.histplot(df['perplexity_difference'], bins=30, kde=True)
plt.axvline(x=0, color='r', linestyle='--', alpha=0.7, label='No Difference')
plt.title('Distribution of Perplexity Differences (AAE - SAE)', fontsize=16)
plt.xlabel('Perplexity Difference', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../results/dialect_perplexity_analysis/perplexity_differences.png', dpi=300)
plt.show()

## Examples with Highest Perplexity Difference

Let's look at examples where the model shows the highest difference in perplexity between AAE and SAE.

In [None]:
# Find examples with highest difference (AAE perplexity > SAE perplexity)
df_highest_diff = df.sort_values('perplexity_difference', ascending=False).head(10)
df_highest_diff[['aae_text', 'sae_text', 'aae_perplexity', 'sae_perplexity', 'perplexity_difference']]

## Examples with Lowest Perplexity Difference (or Negative Difference)

Let's also look at examples where AAE perplexity is similar to or lower than SAE perplexity.

In [None]:
# Find examples with the lowest difference (most negative first, i.e. SAE perplexity > AAE perplexity where such pairs exist)
df_lowest_diff = df.sort_values('perplexity_difference', ascending=True).head(10)
df_lowest_diff[['aae_text', 'sae_text', 'aae_perplexity', 'sae_perplexity', 'perplexity_difference']]

## Analysis by Text Length

Let's check if the perplexity difference is correlated with text length.

In [None]:
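# Add text length columns (in characters)
df['aae_length'] = df['aae_text'].str.len()
df['sae_length'] = df['sae_text'].str.len()
# Note: length_diff is SAE - AAE, the opposite sign convention to the perplexity difference (AAE - SAE)
df['length_diff'] = df['sae_length'] - df['aae_length']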

# Plot correlation
plt.figure(figsize=(10, 8))
plt.scatter(df['aae_length'], df['perplexity_difference'], alpha=0.5)
plt.title('Correlation between Text Length and Perplexity Difference', fontsize=16)
plt.xlabel('AAE Text Length (characters)', fontsize=14)
plt.ylabel('Perplexity Difference (AAE - SAE)', fontsize=14)
plt.axhline(y=0, color='r', linestyle='--', alpha=0.7)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate the Pearson correlation (the default method for Series.corr)
correlation = df['aae_length'].corr(df['perplexity_difference'])
print(f"Correlation between AAE text length and perplexity difference: {correlation:.4f}")
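
Pearson correlation only captures a linear relationship, and perplexity differences may have heavy tails. One way to cross-check the result is to report a rank-based (Spearman) correlation alongside it. The sketch below is an optional addition, and `length_correlations` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def length_correlations(lengths, differences) -> dict:
    """Return Pearson and Spearman correlations, with p-values, for two paired arrays."""
    lengths = np.asarray(lengths, dtype=float)
    differences = np.asarray(differences, dtype=float)

    pearson_r, pearson_p = stats.pearsonr(lengths, differences)
    spearman_rho, spearman_p = stats.spearmanr(lengths, differences)

    return {
        "pearson_r": float(pearson_r),
        "pearson_p": float(pearson_p),
        "spearman_rho": float(spearman_rho),
        "spearman_p": float(spearman_p),
    }

# Example usage with the columns created above:
# length_correlations(df['aae_length'], df['perplexity_difference'])
```

A large gap between the two coefficients would suggest that the relationship is monotonic but not linear, or that a few outlying examples dominate the Pearson value.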

## Conclusion

Summarize the findings from your analysis here. Some questions to address:

1. Is there a consistent bias in perplexity against AAE compared to SAE?
2. If yes, how significant is this bias?
3. What features of the text seem to correlate with higher perplexity differences?
4. What are the implications for the research question about LLM dialect bias?
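
To help answer the first two questions, it may be useful to report how often AAE perplexity exceeds SAE perplexity, together with an interval for the mean difference. The sketch below uses a percentile bootstrap over pairs; `summarize_paired_bias`, `n_boot`, and `seed` are hypothetical names chosen for illustration, and this is one possible approach rather than the notebook's established method.

```python
import numpy as np

def summarize_paired_bias(aae, sae, n_boot: int = 2000, seed: int = 0) -> dict:
    """Share of pairs with AAE > SAE, plus a 95% percentile-bootstrap CI for the mean difference."""
    aae = np.asarray(aae, dtype=float)
    sae = np.asarray(sae, dtype=float)
    diff = aae - sae  # positive values mean higher perplexity on the AAE text

    # Resample pairs with replacement and recompute the mean difference each time
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    return {
        "share_aae_higher": float(np.mean(diff > 0)),
        "mean_diff": float(diff.mean()),
        "ci_low": float(ci_low),
        "ci_high": float(ci_high),
    }

# Example usage with the loaded results:
# summarize_paired_bias(df['aae_perplexity'], df['sae_perplexity'])
```

A share well above 0.5 combined with an interval that excludes zero would point toward a consistent difference, while the interval's width gives a sense of how precisely that difference is estimated.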