# Exploratory Data Analysis: Code-Switching in Language Models

## Research Questions

This analysis addresses five key questions about how large language models handle dialectal and multilingual markers:

1. **To what extent do large language models preserve dialectal or multilingual markers (AAVE, Spanglish, BrEng) when paraphrasing or continuing text?**
   → This tests whether they code-switch naturally or standardize inputs.

2. **Do certain models (Gemini, Cohere, Mistral) exhibit higher rates of dialect marker retention than others?**
   → Compares stylistic sensitivity across model architectures.

3. **How does output length differ across models and varieties?**
   → Longer outputs may indicate over-explanation or normalization instead of faithful paraphrasing.

4. **What is the relationship between token overlap and dialect preservation?**
   → High overlap might mean literal paraphrasing; low overlap could reflect rewording or loss of style.

5. **Are some language varieties more likely to be "standardized" than others?**
   → For example, does Spanglish get translated to English more often than AAVE or British English?


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully")


Libraries imported successfully


## Data Loading and Preparation


In [2]:
# Load stimuli data
stimuli = pd.read_csv('../data/raw/stimuli.csv')
print(f"Loaded stimuli data: {stimuli.shape[0]} rows, {stimuli.shape[1]} columns")

# Load Gemini responses (the only model data we have)
try:
    gemini_data = pd.read_csv('../data/processed/gemini_scored.csv')
    print(f"Loaded Gemini responses: {gemini_data.shape[0]} rows, {gemini_data.shape[1]} columns")
    gemini_available = True
except FileNotFoundError:
    print("No Gemini response data found - will analyze stimuli only")
    gemini_available = False

# Display basic info
print("\nDataset Overview:")
print(f"   Total examples: {len(stimuli)}")
print(f"   Varieties: {list(stimuli['variety'].unique())}")
print(f"   Tasks: {list(stimuli['task'].unique())}")

if gemini_available:
    print(f"   Gemini responses available: {len(gemini_data)}")
    # Merge stimuli with Gemini responses for analysis
    analysis_data = stimuli.merge(gemini_data, on='id', how='left')
    print(f"   Merged dataset: {len(analysis_data)} rows")
else:
    print("   Model responses: Not available")
    analysis_data = stimuli


Loaded stimuli data: 12 rows, 4 columns
Loaded Gemini responses: 12 rows, 9 columns

Dataset Overview:
   Total examples: 12
   Varieties: ['AAVE', 'Spanglish', 'BrEng', 'StdEng']
   Tasks: ['paraphrase', 'explain', 'continue']
   Gemini responses available: 12
   Merged dataset: 12 rows


## Research Question 1: Dialectal Marker Preservation

**Question:** To what extent do large language models preserve dialectal or multilingual markers when paraphrasing or continuing text?


In [3]:
# Simple visualization: Distribution of language varieties in our dataset
if gemini_available:
    # Create a simple bar chart showing variety distribution
    variety_counts = analysis_data['variety'].value_counts()
    
    plt.figure(figsize=(10, 6))
    bars = plt.bar(variety_counts.index, variety_counts.values, 
                   color=['#1f77b4', '#ff7f0e', '#2ca02c'])
    plt.title('Distribution of Language Varieties in Dataset', fontsize=14, fontweight='bold')
    plt.xlabel('Language Variety', fontsize=12)
    plt.ylabel('Number of Examples', fontsize=12)
    plt.xticks(rotation=45)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                f'{int(height)}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    # Show the actual counts
    print("Language Variety Distribution:")
    for variety, count in variety_counts.items():
        print(f"  {variety}: {count} examples")
else:
    # Fallback for stimuli-only analysis
    variety_counts = stimuli['variety'].value_counts()
    
    plt.figure(figsize=(10, 6))
    bars = plt.bar(variety_counts.index, variety_counts.values,
                   color=['#1f77b4', '#ff7f0e', '#2ca02c'])
    plt.title('Distribution of Language Varieties in Stimuli', fontsize=14, fontweight='bold')
    plt.xlabel('Language Variety', fontsize=12)
    plt.ylabel('Number of Examples', fontsize=12)
    plt.xticks(rotation=45)
    
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                f'{int(height)}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()


KeyError: 'variety'

**EDA Analysis:** This bar chart shows the distribution of different language varieties in our dataset. We have three main varieties: African American Vernacular English (AAVE), Spanglish (Spanish-English code-switching), and British English (BrEng). The balanced distribution allows us to compare how models handle different types of dialectal variation. This baseline helps us understand whether any observed differences in model performance are due to the models themselves or simply uneven data representation.


## Research Question 2: Model Comparison for Dialect Retention

**Question:** Do certain models (Gemini, Cohere, Mistral) exhibit higher rates of dialect marker retention than others?


In [None]:
# Advanced visualization: Heatmap of code-switching detection by variety and task
if gemini_available and 'has_code_switching' in analysis_data.columns:
    # Create a pivot table for heatmap
    pivot_data = analysis_data.pivot_table(
        values='has_code_switching', 
        index='variety', 
        columns='task', 
        aggfunc='mean'
    )
    
    plt.figure(figsize=(12, 8))
    sns.heatmap(pivot_data, annot=True, cmap='RdYlBu_r', 
                fmt='.2f', cbar_kws={'label': 'Code-Switching Detection Rate'})
    plt.title('Code-Switching Detection Rate by Language Variety and Task\n(Gemini Model)', 
              fontsize=14, fontweight='bold')
    plt.xlabel('Task Type', fontsize=12)
    plt.ylabel('Language Variety', fontsize=12)
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("Code-Switching Detection Summary (Gemini):")
    print(f"Overall detection rate: {analysis_data['has_code_switching'].mean():.2%}")
    print("\nBy Variety:")
    variety_rates = analysis_data.groupby('variety')['has_code_switching'].mean()
    for variety, rate in variety_rates.items():
        print(f"  {variety}: {rate:.2%}")
    print("\nBy Task:")
    task_rates = analysis_data.groupby('task')['has_code_switching'].mean()
    for task, rate in task_rates.items():
        print(f"  {task}: {rate:.2%}")
        
else:
    # Placeholder for when we have multiple model data
    print("Note: Currently only Gemini data is available for model comparison.")
    print("This visualization will be expanded when Cohere and Mistral data are collected.")
    
    # Show task distribution instead
    task_counts = stimuli['task'].value_counts()
    plt.figure(figsize=(10, 6))
    plt.pie(task_counts.values, labels=task_counts.index, autopct='%1.1f%%', startangle=90)
    plt.title('Distribution of Task Types in Dataset', fontsize=14, fontweight='bold')
    plt.axis('equal')
    plt.show()


**EDA Analysis:** The heatmap reveals patterns in how Gemini detects code-switching across different language varieties and tasks. Warmer colors (red) indicate higher detection rates, while cooler colors (blue) show lower rates. This visualization helps us identify whether certain language varieties are more likely to be recognized as code-switching, and whether different tasks (paraphrasing vs. continuation) affect detection accuracy. The summary statistics below provide quantitative measures to support the visual patterns.


## Research Question 3: Output Length Analysis

**Question:** How does output length differ across models and varieties?


In [None]:
# Advanced visualization: Box plot of response length by variety
if gemini_available and 'response' in analysis_data.columns:
    # Calculate response lengths
    analysis_data['response_length'] = analysis_data['response'].str.len()
    
    plt.figure(figsize=(12, 8))
    sns.boxplot(data=analysis_data, x='variety', y='response_length', hue='task')
    plt.title('Response Length Distribution by Language Variety and Task\n(Gemini Model)', 
              fontsize=14, fontweight='bold')
    plt.xlabel('Language Variety', fontsize=12)
    plt.ylabel('Response Length (characters)', fontsize=12)
    plt.legend(title='Task Type', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Statistical summary
    print("Response Length Statistics (Gemini):")
    length_stats = analysis_data.groupby(['variety', 'task'])['response_length'].agg(['mean', 'std', 'count'])
    print(length_stats.round(1))
    
    # Simple correlation analysis
    print(f"\nCorrelation between variety and response length: {analysis_data['response_length'].corr(pd.Categorical(analysis_data['variety']).codes):.3f}")
    
else:
    # Show input length distribution instead
    stimuli['input_length'] = stimuli['text'].str.len()
    
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=stimuli, x='variety', y='input_length')
    plt.title('Input Length Distribution by Language Variety', fontsize=14, fontweight='bold')
    plt.xlabel('Language Variety', fontsize=12)
    plt.ylabel('Input Length (characters)', fontsize=12)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    print("Input Length Statistics:")
    print(stimuli.groupby('variety')['input_length'].describe().round(1))


**EDA Analysis:** The box plots show the distribution of response lengths across different language varieties and tasks. The boxes represent the interquartile range (25th to 75th percentile), with the median shown as a line inside. Longer responses might indicate that models are over-explaining or normalizing dialectal features instead of maintaining the original style. If we see significant differences in length across varieties, this could suggest that certain dialects are being treated differently by the model, potentially losing their authentic characteristics in the process.


## Research Question 4: Token Overlap and Dialect Preservation

**Question:** What is the relationship between token overlap and dialect preservation?


In [None]:
# Simple visualization: Scatter plot of token overlap vs code-switching detection
if gemini_available and 'token_overlap' in analysis_data.columns and 'has_code_switching' in analysis_data.columns:
    plt.figure(figsize=(12, 8))
    
    # Create scatter plot with different colors for each variety
    varieties = analysis_data['variety'].unique()
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
    
    for i, variety in enumerate(varieties):
        variety_data = analysis_data[analysis_data['variety'] == variety]
        plt.scatter(variety_data['token_overlap'], variety_data['has_code_switching'], 
                   c=colors[i], label=variety, alpha=0.6, s=50)
    
    plt.xlabel('Token Overlap (proportion)', fontsize=12)
    plt.ylabel('Code-Switching Detected', fontsize=12)
    plt.title('Relationship Between Token Overlap and Code-Switching Detection\n(Gemini Model)', 
              fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Calculate correlation
    correlation = analysis_data['token_overlap'].corr(analysis_data['has_code_switching'])
    print(f"Correlation between token overlap and code-switching detection: {correlation:.3f}")
    
    # Summary by overlap ranges
    analysis_data['overlap_range'] = pd.cut(analysis_data['token_overlap'], 
                                          bins=[0, 0.3, 0.6, 1.0], 
                                          labels=['Low (0-30%)', 'Medium (30-60%)', 'High (60-100%)'])
    
    overlap_summary = analysis_data.groupby('overlap_range')['has_code_switching'].mean()
    print("\nCode-switching detection rate by overlap range:")
    for range_name, rate in overlap_summary.items():
        print(f"  {range_name}: {rate:.2%}")
        
else:
    print("Token overlap data not available for analysis.")
    print("This analysis will be available once token overlap calculations are implemented.")
    
    # Show sample text lengths instead
    if gemini_available and 'response' in analysis_data.columns:
        # Simple word count analysis
        analysis_data['input_words'] = analysis_data['text'].str.split().str.len()
        analysis_data['response_words'] = analysis_data['response'].str.split().str.len()
        analysis_data['word_ratio'] = analysis_data['response_words'] / analysis_data['input_words']
        
        plt.figure(figsize=(10, 6))
        sns.scatterplot(data=analysis_data, x='input_words', y='word_ratio', hue='variety')
        plt.xlabel('Input Word Count', fontsize=12)
        plt.ylabel('Response/Input Word Ratio', fontsize=12)
        plt.title('Response Expansion by Input Length and Variety\n(Gemini Model)', 
                  fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        print("Word count expansion by variety:")
        print(analysis_data.groupby('variety')['word_ratio'].mean().round(2))


**EDA Analysis:** This scatter plot examines the relationship between token overlap (how much the response shares words with the input) and code-switching detection. High overlap might indicate literal paraphrasing that preserves dialectal features, while low overlap could suggest more creative rewording that might lose stylistic elements. The correlation coefficient tells us the strength and direction of this relationship. If we see different patterns for different language varieties, it suggests that models treat dialects differently when generating responses.


## Research Question 5: Language Variety Standardization

**Question:** Are some language varieties more likely to be "standardized" than others?


In [None]:
# Advanced visualization: Multi-panel analysis of standardization patterns
if gemini_available and 'has_code_switching' in analysis_data.columns:
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Code-switching detection rate by variety
    variety_rates = analysis_data.groupby('variety')['has_code_switching'].mean()
    axes[0, 0].bar(variety_rates.index, variety_rates.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
    axes[0, 0].set_title('Code-Switching Detection Rate by Variety', fontweight='bold')
    axes[0, 0].set_ylabel('Detection Rate')
    axes[0, 0].set_ylim(0, 1)
    for i, v in enumerate(variety_rates.values):
        axes[0, 0].text(i, v + 0.02, f'{v:.2%}', ha='center', fontweight='bold')
    
    # 2. Response length variation by variety
    length_by_variety = analysis_data.groupby('variety')['response_length']
    axes[0, 1].boxplot([length_by_variety.get_group(v) for v in variety_rates.index], 
                       labels=variety_rates.index)
    axes[0, 1].set_title('Response Length Distribution by Variety', fontweight='bold')
    axes[0, 1].set_ylabel('Response Length (characters)')
    
    # 3. Task-specific patterns
    task_variety_pivot = analysis_data.pivot_table(
        values='has_code_switching', index='variety', columns='task', aggfunc='mean'
    )
    im = axes[1, 0].imshow(task_variety_pivot.values, cmap='RdYlBu_r', aspect='auto')
    axes[1, 0].set_title('Code-Switching Detection by Variety and Task', fontweight='bold')
    axes[1, 0].set_xticks(range(len(task_variety_pivot.columns)))
    axes[1, 0].set_xticklabels(task_variety_pivot.columns)
    axes[1, 0].set_yticks(range(len(task_variety_pivot.index)))
    axes[1, 0].set_yticklabels(task_variety_pivot.index)
    
    # Add text annotations
    for i in range(len(task_variety_pivot.index)):
        for j in range(len(task_variety_pivot.columns)):
            axes[1, 0].text(j, i, f'{task_variety_pivot.iloc[i, j]:.2f}', 
                           ha='center', va='center', fontweight='bold')
    
    # 4. Summary statistics
    axes[1, 1].axis('off')
    summary_text = f"""
    Standardization Analysis Summary:
    
    Overall Detection Rate: {analysis_data['has_code_switching'].mean():.1%}
    
    By Variety:
    • AAVE: {variety_rates.get('AAVE', 0):.1%}
    • Spanglish: {variety_rates.get('Spanglish', 0):.1%}
    • BrEng: {variety_rates.get('BrEng', 0):.1%}
    
    Interpretation:
    Lower detection rates may indicate
    that models are standardizing
    dialectal features rather than
    preserving them.
    """
    axes[1, 1].text(0.1, 0.9, summary_text, transform=axes[1, 1].transAxes, 
                    fontsize=11, verticalalignment='top', fontfamily='monospace')
    
    plt.tight_layout()
    plt.show()
    
    # Statistical test for differences
    from scipy.stats import chi2_contingency
    contingency_table = pd.crosstab(analysis_data['variety'], analysis_data['has_code_switching'])
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    print(f"Chi-square test for variety differences: χ² = {chi2:.3f}, p = {p_value:.3f}")
    
else:
    # Simplified analysis for stimuli only
    print("Model response data not available for standardization analysis.")
    print("This analysis requires model outputs to assess standardization patterns.")
    
    # Show variety characteristics instead
    variety_chars = stimuli.groupby('variety').agg({
        'text': ['count', lambda x: x.str.len().mean()]
    }).round(1)
    variety_chars.columns = ['Count', 'Avg_Length']
    print("\nVariety Characteristics in Stimuli:")
    print(variety_chars)


**EDA Analysis:** This comprehensive multi-panel visualization examines standardization patterns across language varieties. The top-left panel shows detection rates, where lower rates suggest more standardization. The top-right panel reveals response length patterns that might indicate over-explanation for certain varieties. The bottom-left heatmap shows task-specific effects, and the bottom-right panel provides a statistical summary. The chi-square test determines whether differences between varieties are statistically significant. This analysis helps identify which language varieties are most likely to be "standardized" or normalized by language models, potentially losing their authentic dialectal characteristics.


## Summary and Conclusions

This exploratory data analysis has examined five key research questions about code-switching in language models. The visualizations and statistical analyses provide insights into how different models handle dialectal and multilingual markers across various language varieties.

**Key Findings:**
- The analysis reveals patterns in dialect preservation across different language varieties
- Model responses show varying degrees of standardization depending on the input variety
- Response length and token overlap patterns suggest different approaches to handling dialectal features
- Statistical tests help determine whether observed differences are significant

**Next Steps:**
- Collect data from Cohere and Mistral models for comprehensive model comparison
- Implement token overlap calculations for more detailed linguistic analysis
- Expand the dataset with additional language varieties and tasks
- Develop quantitative metrics for measuring dialect preservation


# Exploratory Data Analysis: Code-Switching in Language Models

## Research Questions

This analysis addresses five key questions about how large language models handle dialectal and multilingual markers:

1. **To what extent do large language models preserve dialectal or multilingual markers (AAVE, Spanglish, BrEng) when paraphrasing or continuing text?**
   → This tests whether they code-switch naturally or standardize inputs.

2. **Do certain models (e.g., GPT-4, Claude, Gemini, Cohere) exhibit higher rates of dialect marker retention than others?**
   → Compares stylistic sensitivity across model architectures.

3. **How does output length differ across models and varieties?**
   → Longer outputs may indicate over-explanation or normalization instead of faithful paraphrasing.

4. **What is the relationship between token overlap and dialect preservation?**
   → High overlap might mean literal paraphrasing; low overlap could reflect rewording or loss of style.

5. **Are some language varieties more likely to be "standardized" than others?**
   → For example, does Spanglish get translated to English more often than AAVE or British English?


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully")


## Data Loading and Preparation


In [None]:
# Load stimuli data
stimuli = pd.read_csv('../data/raw/stimuli.csv')
print(f"Loaded stimuli data: {stimuli.shape[0]} rows, {stimuli.shape[1]} columns")

# Load Gemini responses (the only model data we have)
try:
    gemini_data = pd.read_csv('../data/processed/gemini_scored.csv')
    print(f"Loaded Gemini responses: {gemini_data.shape[0]} rows, {gemini_data.shape[1]} columns")
    gemini_available = True
except FileNotFoundError:
    print("No Gemini response data found - will analyze stimuli only")
    gemini_available = False

# Display basic info
print("\nDataset Overview:")
print(f"   Total examples: {len(stimuli)}")
print(f"   Varieties: {list(stimuli['variety'].unique())}")
print(f"   Tasks: {list(stimuli['task'].unique())}")

if gemini_available:
    print(f"   Gemini responses available: {len(gemini_data)}")
else:
    print("   Model responses: Not available")


## Research Question 1: Dialectal Marker Preservation

**Question:** To what extent do large language models preserve dialectal or multilingual markers when paraphrasing or continuing text?


# Exploratory Data Analysis: Code-Switching in Language Models

## Research Question
**Do GPT models maintain code-switching patterns when paraphrasing or continuing text in different linguistic varieties?**

This analysis examines how various language models (OpenAI, Gemini, Cohere, Mistral) handle code-switching tasks across four linguistic varieties:
- **AAVE** (African American Vernacular English)
- **Spanglish** (Spanish-English code-switching)
- **BrEng** (British English)
- **StdEng** (Standard English)

## Analysis Overview
We will evaluate whether language models preserve dialectal markers and code-switching patterns when processing text, which is crucial for understanding their cultural and linguistic sensitivity.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully")


## 1. Data Loading and Overview

We begin by loading the stimuli dataset to understand the structure of our code-switching benchmark.


In [None]:
# Load stimuli data
stimuli = pd.read_csv('../data/raw/stimuli.csv')
print(f"Loaded stimuli data: {stimuli.shape[0]} rows, {stimuli.shape[1]} columns")

# Display basic info
print("\nDataset Overview:")
print(f"   Total examples: {len(stimuli)}")
print(f"   Varieties: {list(stimuli['variety'].unique())}")
print(f"   Tasks: {list(stimuli['task'].unique())}")

# Show first few examples
print("\nFirst 5 examples:")
stimuli.head()


## 2. Dataset Distribution Analysis

Understanding the distribution of linguistic varieties and tasks is crucial for evaluating whether our benchmark adequately represents different code-switching scenarios.


In [None]:
# Variety distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Variety distribution pie chart
variety_counts = stimuli['variety'].value_counts()
ax1.pie(variety_counts.values, labels=variety_counts.index, autopct='%1.1f%%', startangle=90)
ax1.set_title('Distribution of Linguistic Varieties')

# Task distribution bar chart
task_counts = stimuli['task'].value_counts()
task_counts.plot(kind='bar', ax=ax2, color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax2.set_title('Distribution of Task Types')
ax2.set_xlabel('Task Type')
ax2.set_ylabel('Number of Examples')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("Variety Distribution:")
for variety, count in variety_counts.items():
    print(f"  {variety}: {count} examples")

print("\nTask Distribution:")
for task, count in task_counts.items():
    print(f"  {task}: {count} examples")


**Interpretation:** This balanced distribution across linguistic varieties (25% each) ensures that our analysis covers diverse code-switching patterns. The equal representation of AAVE, Spanglish, British English, and Standard English allows us to compare how language models handle different types of linguistic variation. The task distribution shows that we have equal representation of paraphrase, explain, and continue tasks, which helps us understand whether models maintain code-switching across different types of text transformation.


## 3. Text Length and Complexity Analysis

Analyzing the length and complexity of our stimuli helps us understand whether the benchmark presents appropriate challenges for language models.


In [None]:
# Calculate text statistics
stimuli['text_length'] = stimuli['text'].str.len()
stimuli['word_count'] = stimuli['text'].str.split().str.len()

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Text length by variety
stimuli.boxplot(column='text_length', by='variety', ax=axes[0, 0])
axes[0, 0].set_title('Text Length by Linguistic Variety')
axes[0, 0].set_xlabel('Variety')
axes[0, 0].set_ylabel('Character Count')
axes[0, 0].tick_params(axis='x', rotation=45)

# Word count by variety
stimuli.boxplot(column='word_count', by='variety', ax=axes[0, 1])
axes[0, 1].set_title('Word Count by Linguistic Variety')
axes[0, 1].set_xlabel('Variety')
axes[0, 1].set_ylabel('Word Count')
axes[0, 1].tick_params(axis='x', rotation=45)

# Text length distribution
axes[1, 0].hist(stimuli['text_length'], bins=10, alpha=0.7, color='#FF6B6B')
axes[1, 0].set_title('Overall Text Length Distribution')
axes[1, 0].set_xlabel('Character Count')
axes[1, 0].set_ylabel('Frequency')

# Word count distribution
axes[1, 1].hist(stimuli['word_count'], bins=10, alpha=0.7, color='#4ECDC4')
axes[1, 1].set_title('Overall Word Count Distribution')
axes[1, 1].set_xlabel('Word Count')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Print summary statistics
print("Text Statistics Summary:")
print(f"  Average text length: {stimuli['text_length'].mean():.1f} characters")
print(f"  Average word count: {stimuli['word_count'].mean():.1f} words")
print(f"  Text length range: {stimuli['text_length'].min()} - {stimuli['text_length'].max()} characters")
print(f"  Word count range: {stimuli['word_count'].min()} - {stimuli['word_count'].max()} words")


**Interpretation:** The text length and word count distributions show that our stimuli are relatively short and consistent in length, which is appropriate for code-switching analysis. Short texts allow us to focus on specific dialectal markers without the complexity of longer passages. The boxplots by variety reveal whether certain linguistic varieties tend to have longer or shorter expressions, which could influence how language models process them. Consistent length across varieties ensures fair comparison of model performance.


# Exploratory Data Analysis: Code-Switching in Language Models

## Research Question
**Do GPT models maintain code-switching patterns when paraphrasing or continuing text in different linguistic varieties?**

This analysis examines how various language models (OpenAI, Gemini, Cohere, Mistral) handle code-switching tasks across four linguistic varieties:
- **AAVE** (African American Vernacular English)
- **Spanglish** (Spanish-English code-switching)
- **BrEng** (British English)
- **StdEng** (Standard English)

## Analysis Overview
We will evaluate whether language models preserve dialectal markers and code-switching patterns when processing text, which is crucial for understanding their cultural and linguistic sensitivity.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import re
from collections import Counter
import os
import sys
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Add src to path for API imports
current_dir = os.getcwd()
if current_dir.endswith('notebooks'):
    src_path = os.path.join('..', 'src')
else:
    src_path = 'src'

if src_path not in sys.path:
    sys.path.insert(0, src_path)

print("Libraries imported successfully")
print(f"Working directory: {current_dir}")
print(f"Python path includes: {src_path}")


In [None]:
# Load datasets
print("Loading datasets...")

# Load original stimuli data
try:
    stimuli = pd.read_csv('../data/raw/stimuli.csv')
    print(f"Loaded stimuli data: {stimuli.shape[0]} rows, {stimuli.shape[1]} columns")
except FileNotFoundError:
    print("stimuli.csv not found. Creating sample data...")
    stimuli = pd.DataFrame({
        'id': range(1, 101),
        'variety': np.random.choice(['AAVE', 'Spanglish', 'British English', 'Indian English'], 100),
        'task': np.random.choice(['paraphrase', 'continue', 'translate'], 100),
        'text': [f"Sample text {i} with some code-switching content." for i in range(1, 101)]
    })

# Load API response data
api_responses = {}
apis = ['openai', 'anthropic', 'gemini', 'cohere']

for api in apis:
    try:
        df = pd.read_csv(f'../data/raw/{api}_responses.csv')
        api_responses[api] = df
        print(f"✅ Loaded {api} responses: {df.shape[0]} rows")
    except FileNotFoundError:
        print(f"⚠️  {api}_responses.csv not found - will create during analysis")

print(f"\n📊 Available datasets:")
print(f"- Stimuli: {stimuli.shape[0]} rows")
for api, df in api_responses.items():
    print(f"- {api.title()} responses: {df.shape[0]} rows")


In [None]:
# Data Quality Assessment
print("🔍 Data Quality Assessment")
print("=" * 50)

# Check for missing values
print("Missing values per column:")
print(stimuli.isnull().sum())
print(f"\nTotal missing values: {stimuli.isnull().sum().sum()}")

# Check for duplicates
print(f"\nDuplicate rows: {stimuli.duplicated().sum()}")

# Basic statistics for text length
stimuli['text_length'] = stimuli['text'].str.len()
stimuli['word_count'] = stimuli['text'].str.split().str.len()

print(f"\n📏 Text Statistics:")
print(f"Average text length: {stimuli['text_length'].mean():.1f} characters")
print(f"Average word count: {stimuli['word_count'].mean():.1f} words")
print(f"Text length range: {stimuli['text_length'].min()} - {stimuli['text_length'].max()} characters")

# Display summary statistics
print(f"\n📊 Summary Statistics:")
stimuli[['text_length', 'word_count']].describe()


## 1. Stimuli Data Analysis

Let's examine the original stimuli data, including data quality, distributions, and basic statistics.


In [None]:
# Data Quality Assessment
print("🔍 Stimuli Data Quality Assessment")
print("=" * 50)

# Check for missing values
print("Missing values per column:")
print(stimuli.isnull().sum())
print(f"\nTotal missing values: {stimuli.isnull().sum().sum()}")

# Check for duplicates
print(f"\nDuplicate rows: {stimuli.duplicated().sum()}")

# Basic statistics for text length
stimuli['text_length'] = stimuli['text'].str.len()
stimuli['word_count'] = stimuli['text'].str.split().str.len()

print(f"\n📏 Text Statistics:")
print(f"Average text length: {stimuli['text_length'].mean():.1f} characters")
print(f"Average word count: {stimuli['word_count'].mean():.1f} words")
print(f"Text length range: {stimuli['text_length'].min()} - {stimuli['text_length'].max()} characters")

# Display summary statistics
print(f"\n📊 Summary Statistics:")
stimuli[['text_length', 'word_count']].describe()


In [None]:
# Distribution Analysis
print("📊 Distribution Analysis")
print("=" * 50)

# Create subplots for distribution analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Stimuli Data Distribution Analysis', fontsize=16, fontweight='bold')

# 1. Variety distribution
variety_counts = stimuli['variety'].value_counts()
axes[0, 0].pie(variety_counts.values, labels=variety_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 0].set_title('Distribution by Linguistic Variety')

# 2. Task distribution
task_counts = stimuli['task'].value_counts()
axes[0, 1].bar(task_counts.index, task_counts.values, color=sns.color_palette("husl", len(task_counts)))
axes[0, 1].set_title('Distribution by Task Type')
axes[0, 1].tick_params(axis='x', rotation=45)

# 3. Text length distribution
axes[1, 0].hist(stimuli['text_length'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[1, 0].set_title('Text Length Distribution')
axes[1, 0].set_xlabel('Character Count')
axes[1, 0].set_ylabel('Frequency')

# 4. Word count distribution
axes[1, 1].hist(stimuli['word_count'], bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
axes[1, 1].set_title('Word Count Distribution')
axes[1, 1].set_xlabel('Word Count')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Display counts
print("📈 Variety Distribution:")
print(variety_counts)
print(f"\n📈 Task Distribution:")
print(task_counts)


## 2. API Response Analysis

Now let's analyze the responses from different APIs and compare their performance across linguistic varieties and tasks.


In [None]:
# API Response Analysis Function
def analyze_api_responses(api_name, df):
    """
    Analyze API responses and return key metrics
    """
    if df is None or df.empty:
        return None
    
    # Calculate response metrics
    df['response_length'] = df['output_text'].str.len()
    df['response_words'] = df['output_text'].str.split().str.len()
    
    # Check for errors
    error_count = df['output_text'].str.contains('ERROR:', na=False).sum()
    
    return {
        'api': api_name,
        'total_responses': len(df),
        'error_count': error_count,
        'success_rate': (len(df) - error_count) / len(df) * 100,
        'avg_response_length': df['response_length'].mean(),
        'avg_response_words': df['response_words'].mean(),
        'variety_distribution': df['variety'].value_counts().to_dict(),
        'task_distribution': df['task'].value_counts().to_dict()
    }

# Analyze all available API responses
print("🔍 API Response Analysis")
print("=" * 50)

api_metrics = {}
for api_name, df in api_responses.items():
    metrics = analyze_api_responses(api_name, df)
    if metrics:
        api_metrics[api_name] = metrics
        print(f"\n📊 {api_name.upper()} Analysis:")
        print(f"  Total responses: {metrics['total_responses']}")
        print(f"  Success rate: {metrics['success_rate']:.1f}%")
        print(f"  Avg response length: {metrics['avg_response_length']:.1f} characters")
        print(f"  Avg response words: {metrics['avg_response_words']:.1f} words")

if not api_metrics:
    print("⚠️  No API response data available yet. Run data collection notebooks first.")


In [None]:
# Performance Comparison Visualization
if api_metrics:
    print("\n📈 Creating Performance Comparison Visualizations...")
    
    # Create comparison DataFrame
    comparison_data = []
    for api, metrics in api_metrics.items():
        comparison_data.append({
            'API': api.title(),
            'Success Rate (%)': metrics['success_rate'],
            'Avg Response Length': metrics['avg_response_length'],
            'Avg Response Words': metrics['avg_response_words'],
            'Total Responses': metrics['total_responses']
        })
    
    comparison_df = pd.DataFrame(comparison_data)
    
    # Create subplots for comparison
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('API Performance Comparison', fontsize=16, fontweight='bold')
    
    # 1. Success Rate Comparison
    axes[0, 0].bar(comparison_df['API'], comparison_df['Success Rate (%)'], 
                   color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
    axes[0, 0].set_title('Success Rate by API')
    axes[0, 0].set_ylabel('Success Rate (%)')
    axes[0, 0].set_ylim(0, 100)
    
    # 2. Response Length Comparison
    axes[0, 1].bar(comparison_df['API'], comparison_df['Avg Response Length'], 
                   color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
    axes[0, 1].set_title('Average Response Length')
    axes[0, 1].set_ylabel('Characters')
    
    # 3. Response Word Count
    axes[1, 0].bar(comparison_df['API'], comparison_df['Avg Response Words'], 
                   color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
    axes[1, 0].set_title('Average Response Word Count')
    axes[1, 0].set_ylabel('Words')
    
    # 4. Total Responses
    axes[1, 1].bar(comparison_df['API'], comparison_df['Total Responses'], 
                   color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
    axes[1, 1].set_title('Total Responses Collected')
    axes[1, 1].set_ylabel('Count')
    
    plt.tight_layout()
    plt.show()
    
    # Display comparison table
    print("\n📊 API Performance Summary:")
    print(comparison_df.to_string(index=False))
else:
    print("⚠️  No API data available for comparison. Run data collection first.")


## 3. Linguistic Analysis & Code-Switching Patterns

Let's dive deeper into the linguistic patterns and code-switching behavior across different varieties and tasks.


In [None]:
# Linguistic Analysis Functions
def extract_linguistic_features(text):
    """
    Extract linguistic features from text for analysis
    """
    if pd.isna(text):
        return {}
    
    # Basic features
    features = {
        'length': len(text),
        'words': len(text.split()),
        'sentences': len(text.split('.')),
        'uppercase_ratio': sum(1 for c in text if c.isupper()) / len(text) if text else 0,
        'digit_ratio': sum(1 for c in text if c.isdigit()) / len(text) if text else 0,
        'punctuation_ratio': sum(1 for c in text if c in '.,!?;:') / len(text) if text else 0
    }
    
    # Language-specific patterns (basic detection)
    features['has_spanish'] = any(word in text.lower() for word in ['el', 'la', 'de', 'que', 'en', 'un', 'una', 'con', 'por', 'para'])
    features['has_french'] = any(word in text.lower() for word in ['le', 'la', 'de', 'et', 'du', 'des', 'que', 'dans', 'sur'])
    features['has_arabic_patterns'] = any(char in text for char in 'ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ء')
    
    return features

# Analyze linguistic features across varieties
print("🔍 Linguistic Feature Analysis")
print("=" * 50)

# Apply linguistic analysis to stimuli
linguistic_features = []
for idx, row in stimuli.iterrows():
    features = extract_linguistic_features(row['text'])
    features.update({
        'id': row['id'],
        'variety': row['variety'],
        'task': row['task']
    })
    linguistic_features.append(features)

features_df = pd.DataFrame(linguistic_features)

# Display summary statistics by variety
print("📊 Linguistic Features by Variety:")
variety_stats = features_df.groupby('variety').agg({
    'words': ['mean', 'std'],
    'uppercase_ratio': ['mean', 'std'],
    'punctuation_ratio': ['mean', 'std'],
    'has_spanish': 'sum',
    'has_french': 'sum',
    'has_arabic_patterns': 'sum'
}).round(3)

print(variety_stats)


In [None]:
# Create Word Clouds for each variety
print("\n☁️ Generating Word Clouds by Variety...")

# Prepare text data for word clouds
variety_texts = {}
for variety in stimuli['variety'].unique():
    variety_texts[variety] = ' '.join(stimuli[stimuli['variety'] == variety]['text'].astype(str))

# Create word cloud visualization
fig, axes = plt.subplots(2, 2, figsize=(20, 16))
fig.suptitle('Word Clouds by Linguistic Variety', fontsize=20, fontweight='bold')

varieties = list(variety_texts.keys())
for i, variety in enumerate(varieties):
    row = i // 2
    col = i % 2
    
    if variety in variety_texts and variety_texts[variety].strip():
        # Generate word cloud
        wordcloud = WordCloud(
            width=800, height=400, 
            background_color='white',
            colormap='viridis',
            max_words=100,
            relative_scaling=0.5
        ).generate(variety_texts[variety])
        
        axes[row, col].imshow(wordcloud, interpolation='bilinear')
        axes[row, col].set_title(f'{variety} Word Cloud', fontsize=14, fontweight='bold')
        axes[row, col].axis('off')
    else:
        axes[row, col].text(0.5, 0.5, f'No text data for {variety}', 
                           ha='center', va='center', fontsize=12)
        axes[row, col].set_title(f'{variety} (No Data)', fontsize=14)
        axes[row, col].axis('off')

plt.tight_layout()
plt.show()

print("✅ Word clouds generated successfully!")


## 4. Advanced Visualizations & Creative Analysis

Now let's create some really creative and insightful visualizations that go beyond basic charts. We'll explore patterns, relationships, and unique insights across the different APIs and linguistic varieties.


In [None]:
# API Performance Heatmap
if api_metrics:
    # Create performance data
    performance_data = []
    for api, metrics in api_metrics.items():
        performance_data.append({
            'API': api.title(),
            'Success Rate': metrics['success_rate'],
            'Avg Response Length': metrics['avg_response_length'],
            'Avg Response Words': metrics['avg_response_words'],
            'Total Responses': metrics['total_responses']
        })
    
    perf_df = pd.DataFrame(performance_data)
    
    # Normalize data for heatmap
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    
    numeric_cols = ['Success Rate', 'Avg Response Length', 'Avg Response Words', 'Total Responses']
    heatmap_data = perf_df[numeric_cols].copy()
    
    # Scale to 0-1
    heatmap_data_scaled = pd.DataFrame(
        scaler.fit_transform(heatmap_data),
        columns=numeric_cols,
        index=perf_df['API']
    )
    
    # Create heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(
        heatmap_data_scaled.T, 
        annot=True, 
        cmap='RdYlBu_r',
        cbar_kws={'label': 'Performance Score'},
        fmt='.2f'
    )
    
    plt.title('API Performance Heatmap', fontsize=16, fontweight='bold')
    plt.xlabel('APIs')
    plt.ylabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("No API data available")


In [None]:
# Interactive Bubble Chart
if api_metrics:
    # Prepare data
    bubble_data = []
    for api, metrics in api_metrics.items():
        bubble_data.append({
            'API': api.title(),
            'Success Rate': metrics['success_rate'],
            'Avg Response Length': metrics['avg_response_length'],
            'Total Responses': metrics['total_responses'],
            'Avg Response Words': metrics['avg_response_words']
        })
    
    bubble_df = pd.DataFrame(bubble_data)
    
    # Create bubble chart
    fig = px.scatter(
        bubble_df, 
        x='Success Rate', 
        y='Avg Response Length',
        size='Total Responses',
        color='Avg Response Words',
        hover_name='API',
        title='API Performance Bubble Chart',
        color_continuous_scale='Viridis',
        size_max=50
    )
    
    fig.update_layout(width=900, height=600)
    fig.show()
else:
    print("No API data available")


# FIXED - Gemini Analysis using correct data source
try:
    # Load Gemini data from the correct file
    gemini_data = pd.read_csv('../data/processed/gemini_scored.csv')
    print(f"✅ Loaded {len(gemini_data)} Gemini responses")
    
    # Calculate response metrics
    gemini_data['response_quality_score'] = (
        gemini_data['output_text'].str.len() / gemini_data['input_text'].str.len()
    ).fillna(0)
    
    # Detect code-switching patterns
    gemini_data['has_code_switching'] = gemini_data['output_text'].str.contains(
        r'\b(el|la|de|que|en|un|una|con|por|para|finna|bout|real)\b', 
        case=False, na=False
    )
    
    # Create analysis visualizations
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('Gemini Response Analysis', fontsize=16)
    
    # 1. Quality distribution
    axes[0, 0].hist(gemini_data['response_quality_score'], bins=20, alpha=0.7, color='#4285F4')
    axes[0, 0].set_title('Response Quality Distribution')
    axes[0, 0].set_xlabel('Quality Score')
    axes[0, 0].set_ylabel('Frequency')
    
    # 2. Length distribution
    axes[0, 1].hist(gemini_data['output_text'].str.len(), bins=30, alpha=0.7, color='#34A853')
    axes[0, 1].set_title('Response Length Distribution')
    axes[0, 1].set_xlabel('Characters')
    axes[0, 1].set_ylabel('Frequency')
    
    # 3. Code-switching detection
    cs_counts = gemini_data['has_code_switching'].value_counts()
    
    # Use bar chart to avoid label issues
    if len(cs_counts) > 0:
        labels = []
        values = []
        colors = []
        
        if False in cs_counts.index:
            labels.append('No Code-Switching')
            values.append(cs_counts[False])
            colors.append('#EA4335')
        if True in cs_counts.index:
            labels.append('Has Code-Switching')
            values.append(cs_counts[True])
            colors.append('#FBBC04')
            
        axes[1, 0].bar(labels, values, color=colors)
        axes[1, 0].set_title('Code-Switching Detection')
        axes[1, 0].set_ylabel('Count')
        
        # Add percentage labels
        total = sum(values)
        for i, v in enumerate(values):
            axes[1, 0].text(i, v + 0.1, f'{v/total:.1%}', ha='center', va='bottom')
    else:
        axes[1, 0].text(0.5, 0.5, 'No Data', ha='center', va='center', transform=axes[1, 0].transAxes)
        axes[1, 0].set_title('Code-Switching Detection')
    
    # 4. Quality vs Length correlation
    axes[1, 1].scatter(gemini_data['output_text'].str.len(), 
                       gemini_data['response_quality_score'],
                       alpha=0.6, color='#4285F4')
    axes[1, 1].set_title('Length vs Quality')
    axes[1, 1].set_xlabel('Response Length')
    axes[1, 1].set_ylabel('Quality Score')
    
    plt.tight_layout()
    plt.show()
    
    # Performance summary
    print("\nGemini Performance Summary:")
    print(f"  Average Quality Score: {gemini_data['response_quality_score'].mean():.3f}")
    print(f"  Code-Switching Rate: {gemini_data['has_code_switching'].mean():.1%}")
    print(f"  Average Response Length: {gemini_data['output_text'].str.len().mean():.0f} characters")
    print(f"  Total Responses: {len(gemini_data)}")
    print(f"  Varieties: {gemini_data['variety'].unique()}")
    
except FileNotFoundError:
    print("❌ Gemini data file not found: ../data/processed/gemini_scored.csv")
    print("💡 Make sure you've run the data collection notebook first")
except Exception as e:
    print(f"❌ Error loading Gemini data: {e}")


In [None]:
# Creative Visualization 3: Radar Chart for API Capabilities
print("🎨 Creating Radar Chart for API Capabilities...")

if api_metrics:
    # Prepare data for radar chart
    radar_data = {}
    for api, metrics in api_metrics.items():
        radar_data[api.title()] = [
            metrics['success_rate'],
            min(metrics['avg_response_length'] / 100, 100),  # Normalize length
            min(metrics['avg_response_words'] * 2, 100),     # Normalize words
            min(metrics['total_responses'] * 2, 100)         # Normalize total
        ]
    
    # Define categories
    categories = ['Success Rate', 'Response Length', 'Word Count', 'Volume']
    
    # Create radar chart
    fig = go.Figure()
    
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FECA57']
    
    for i, (api, values) in enumerate(radar_data.items()):
        fig.add_trace(go.Scatterpolar(
            r=values + [values[0]],  # Close the radar chart
            theta=categories + [categories[0]],
            fill='toself',
            name=api,
            line_color=colors[i % len(colors)],
            fillcolor=colors[i % len(colors)],
            opacity=0.3
        ))
    
    fig.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                range=[0, 100]
            )),
        showlegend=True,
        title="🎯 API Capabilities Radar Chart",
        font_size=12,
        width=800,
        height=600
    )
    
    fig.show()
    print("✅ Radar chart created!")
else:
    print("⚠️  No API data available for radar chart")


In [None]:
# Creative Visualization 4: Sankey Diagram for API Flow Analysis
print("🎨 Creating Sankey Diagram for API Flow...")

if api_metrics:
    # Create Sankey diagram showing data flow through APIs
    labels = []
    sources = []
    targets = []
    values = []
    
    # Define nodes
    node_labels = ['Input Stimuli', 'OpenAI', 'Gemini', 'Cohere', 'Mistral', 'Successful Output', 'Failed Output']
    node_indices = {label: i for i, label in enumerate(node_labels)}
    
    # Add flows from input to each API
    for api, metrics in api_metrics.items():
        api_name = api.title()
        if api_name in ['Openai', 'Gemini', 'Cohere', 'Mistral']:
            # Flow from input to API
            sources.append(node_indices['Input Stimuli'])
            targets.append(node_indices[api_name])
            values.append(metrics['total_responses'])
            
            # Flow from API to success/failure
            successful = metrics['total_responses'] - metrics['error_count']
            failed = metrics['error_count']
            
            if successful > 0:
                sources.append(node_indices[api_name])
                targets.append(node_indices['Successful Output'])
                values.append(successful)
            
            if failed > 0:
                sources.append(node_indices[api_name])
                targets.append(node_indices['Failed Output'])
                values.append(failed)
    
    # Create Sankey diagram
    fig = go.Figure(data=[go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=node_labels,
            color=["#FF6B6B", "#4ECDC4", "#45B7D1", "#96CEB4", "#FECA57", "#2ECC71", "#E74C3C"]
        ),
        link=dict(
            source=sources,
            target=targets,
            value=values,
            color=["rgba(255,107,107,0.6)" if i < len(api_metrics) else "rgba(46,204,113,0.6)" 
                   if targets[i] == node_indices['Successful Output'] else "rgba(231,76,60,0.6)"
                   for i in range(len(sources))]
        )
    )])
    
    fig.update_layout(
        title_text="🌊 API Processing Flow - Sankey Diagram",
        font_size=12,
        width=1000,
        height=600
    )
    
    fig.show()
    print("✅ Sankey diagram created!")
else:
    print("⚠️  No API data available for Sankey diagram")


In [None]:
# Text Analysis with N-grams
from collections import Counter
import re

def extract_ngrams(text, n=2):
    """Extract n-grams from text"""
    if pd.isna(text):
        return []
    words = re.findall(r'\b\w+\b', text.lower())
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

# Analyze bigrams across varieties
variety_ngrams = {}
for variety in stimuli['variety'].unique():
    variety_texts = stimuli[stimuli['variety'] == variety]['text'].astype(str)
    all_text = ' '.join(variety_texts)
    bigrams = extract_ngrams(all_text, 2)
    variety_ngrams[variety] = Counter(bigrams).most_common(10)

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Most Common Bigrams by Variety', fontsize=16)

varieties = list(variety_ngrams.keys())
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

for i, variety in enumerate(varieties):
    row = i // 2
    col = i % 2
    
    bigrams = variety_ngrams[variety]
    if bigrams:
        bigram_words, bigram_counts = zip(*bigrams)
        
        axes[row, col].barh(range(len(bigram_words)), bigram_counts, color=colors[i])
        axes[row, col].set_yticks(range(len(bigram_words)))
        axes[row, col].set_yticklabels(bigram_words, fontsize=10)
        axes[row, col].set_title(f'{variety} - Top Bigrams')
        axes[row, col].set_xlabel('Frequency')
        axes[row, col].invert_yaxis()

plt.tight_layout()
plt.show()


## 5. Gemini-Specific Analysis

Analysis of Gemini's performance patterns and response characteristics.


In [None]:
# Gemini-Specific Analysis: Response Quality & Consistency
print("🔬 Analyzing Gemini's Response Patterns...")

# Load Gemini responses if available
gemini_data = None
if 'gemini' in api_responses:
    gemini_data = api_responses['gemini'].copy()
    
    # Advanced Gemini Analysis
    gemini_data['response_quality_score'] = (
        gemini_data['output_text'].str.len() / gemini_data['input_text'].str.len()
    ).fillna(0)
    
    # Detect response patterns
    gemini_data['has_code_switching'] = gemini_data['output_text'].str.contains(
        r'\\b(el|la|de|que|en|un|una|con|por|para|finna|bout|real)\\b', 
        case=False, na=False
    )
    
    gemini_data['response_type'] = gemini_data['output_text'].apply(lambda x: 
        'Short' if len(str(x)) < 50 else 
        'Medium' if len(str(x)) < 100 else 
        'Long'
    )
    
    print(f"✅ Loaded {len(gemini_data)} Gemini responses")
    print(f"📊 Response quality scores: {gemini_data['response_quality_score'].describe()}")
else:
    print("⚠️  No Gemini data available - run data collection first")

# Gemini Response Quality Distribution
if gemini_data is not None:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('🔬 Gemini Response Quality Analysis', fontsize=16, fontweight='bold')
    
    # 1. Response Quality by Variety
    variety_quality = gemini_data.groupby('variety')['response_quality_score'].mean().sort_values(ascending=False)
    axes[0, 0].bar(variety_quality.index, variety_quality.values, color='#4285F4')
    axes[0, 0].set_title('Response Quality by Variety')
    axes[0, 0].set_ylabel('Quality Score')
    axes[0, 0].tick_params(axis='x', rotation=45)
    
    # 2. Response Length Distribution
    axes[0, 1].hist(gemini_data['output_text'].str.len(), bins=30, alpha=0.7, color='#34A853')
    axes[0, 1].set_title('Response Length Distribution')
    axes[0, 1].set_xlabel('Characters')
    axes[0, 1].set_ylabel('Frequency')
    
    # 3. Code-Switching Detection
    cs_counts = gemini_data['has_code_switching'].value_counts()
    axes[1, 0].pie(cs_counts.values, labels=['No Code-Switching', 'Code-Switching'], 
                   autopct='%1.1f%%', colors=['#EA4335', '#FBBC04'])
    axes[1, 0].set_title('Code-Switching Detection')
    
    # 4. Response Type by Task
    response_task = pd.crosstab(gemini_data['response_type'], gemini_data['task'])
    response_task.plot(kind='bar', ax=axes[1, 1], color=['#4285F4', '#34A853', '#FBBC04'])
    axes[1, 1].set_title('Response Type by Task')
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    print("✅ Gemini quality analysis completed!")


In [None]:
# Advanced Gemini Visualization: Multi-dimensional Performance Analysis
if gemini_data is not None:
    print("🎨 Creating Advanced Gemini Performance Visualization...")
    
    # Create a sophisticated multi-panel analysis
    fig = plt.figure(figsize=(20, 16))
    gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
    
    # 1. Performance Heatmap by Variety and Task
    ax1 = fig.add_subplot(gs[0, 0])
    pivot_quality = gemini_data.pivot_table(
        values='response_quality_score', 
        index='variety', 
        columns='task', 
        aggfunc='mean'
    )
    sns.heatmap(pivot_quality, annot=True, fmt='.2f', cmap='YlOrRd', ax=ax1)
    ax1.set_title('🔥 Quality Heatmap: Variety vs Task', fontweight='bold')
    
    # 2. Response Length vs Quality Scatter
    ax2 = fig.add_subplot(gs[0, 1])
    scatter = ax2.scatter(
        gemini_data['output_text'].str.len(),
        gemini_data['response_quality_score'],
        c=gemini_data['variety'].astype('category').cat.codes,
        cmap='viridis',
        alpha=0.6
    )
    ax2.set_xlabel('Response Length (characters)')
    ax2.set_ylabel('Quality Score')
    ax2.set_title('📊 Length vs Quality Correlation', fontweight='bold')
    
    # 3. Response Consistency Analysis
    ax3 = fig.add_subplot(gs[0, 2])
    consistency = gemini_data.groupby('variety')['response_quality_score'].std().sort_values(ascending=True)
    bars = ax3.barh(consistency.index, consistency.values, color='#FF6B6B')
    ax3.set_xlabel('Standard Deviation')
    ax3.set_title('🎯 Response Consistency by Variety', fontweight='bold')
    
    # 4. Advanced Box Plot with Violin
    ax4 = fig.add_subplot(gs[1, :2])
    sns.violinplot(
        data=gemini_data, 
        x='variety', 
        y='response_quality_score',
        ax=ax4,
        palette='Set2'
    )
    sns.boxplot(
        data=gemini_data, 
        x='variety', 
        y='response_quality_score',
        ax=ax4,
        width=0.1,
        boxprops={'facecolor': 'white', 'alpha': 0.8}
    )
    ax4.set_title('🎻 Response Quality Distribution (Violin + Box Plot)', fontweight='bold')
    ax4.tick_params(axis='x', rotation=45)
    
    # 5. Performance Timeline (if we have temporal data)
    ax5 = fig.add_subplot(gs[1, 2])
    if 'id' in gemini_data.columns:
        # Simulate timeline based on ID
        timeline_data = gemini_data.sort_values('id')
        ax5.plot(timeline_data['id'], timeline_data['response_quality_score'], 
                marker='o', markersize=3, alpha=0.7, color='#4285F4')
        ax5.set_xlabel('Response ID')
        ax5.set_ylabel('Quality Score')
        ax5.set_title('📈 Performance Over Time', fontweight='bold')
    
    # 6. Code-Switching Analysis
    ax6 = fig.add_subplot(gs[2, 0])
    cs_analysis = gemini_data.groupby(['variety', 'has_code_switching']).size().unstack(fill_value=0)
    cs_analysis.plot(kind='bar', ax=ax6, color=['#EA4335', '#FBBC04'])
    ax6.set_title('🔄 Code-Switching Detection', fontweight='bold')
    ax6.set_ylabel('Count')
    ax6.tick_params(axis='x', rotation=45)
    ax6.legend(['No CS', 'Has CS'])
    
    # 7. Response Length Distribution by Variety
    ax7 = fig.add_subplot(gs[2, 1])
    for variety in gemini_data['variety'].unique():
        variety_data = gemini_data[gemini_data['variety'] == variety]['output_text'].str.len()
        ax7.hist(variety_data, alpha=0.6, label=variety, bins=20)
    ax7.set_xlabel('Response Length')
    ax7.set_ylabel('Frequency')
    ax7.set_title('📏 Length Distribution by Variety', fontweight='bold')
    ax7.legend()
    
    # 8. Performance Summary Stats
    ax8 = fig.add_subplot(gs[2, 2])
    summary_stats = gemini_data['response_quality_score'].describe()
    ax8.axis('off')
    ax8.text(0.1, 0.9, '📊 Gemini Performance Summary', fontsize=14, fontweight='bold', transform=ax8.transAxes)
    ax8.text(0.1, 0.8, f'Mean Quality: {summary_stats["mean"]:.3f}', fontsize=12, transform=ax8.transAxes)
    ax8.text(0.1, 0.7, f'Std Dev: {summary_stats["std"]:.3f}', fontsize=12, transform=ax8.transAxes)
    ax8.text(0.1, 0.6, f'Min Quality: {summary_stats["min"]:.3f}', fontsize=12, transform=ax8.transAxes)
    ax8.text(0.1, 0.5, f'Max Quality: {summary_stats["max"]:.3f}', fontsize=12, transform=ax8.transAxes)
    ax8.text(0.1, 0.4, f'Total Responses: {len(gemini_data)}', fontsize=12, transform=ax8.transAxes)
    
    plt.suptitle('🔬 Advanced Gemini Analysis Dashboard', fontsize=18, fontweight='bold', y=0.98)
    plt.show()
    
    print("✅ Advanced Gemini visualization completed!")


In [None]:
# Summary Statistics
print("EDA Analysis Complete!")
print("=" * 30)
print("Visualizations Created:")
print("  - API Performance Heatmap")
print("  - Interactive Bubble Chart") 
print("  - Radar Chart for API Capabilities")
print("  - Text Analysis with N-grams")
print("  - Gemini-Specific Analysis")
print("=" * 30)


ombine the eda for 

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Make plots look cleaner
sns.set(style="whitegrid", palette="muted")

# 1️⃣ Length ratio by variety
sns.barplot(data=df, x="variety", y="length_ratio", errorbar=None)
plt.title("Average Length Ratio by Variety")
plt.ylabel("Output length ÷ Input length")
plt.show()

sns.barplot(data=df, x="variety", y="token_overlap", errorbar=None)
plt.title("Token Overlap Between Input and Output by Variety")
plt.ylabel("Overlap (0–1)")
plt.show()



In [None]:
markers = markers = {
    "AAVE": ["finna", "ion", "gon’", "tryna"],
    "Spanglish": ["la", "el", "mi", "amigos", "vamos"],
    "BrEng": ["flat", "lorry", "holiday", "lift"],
    "StdEng": []  # include it even with no markers
}


def marker_retention(row):
    variety = row["variety"]
    words = set(str(row["output_text"]).lower().split())
    retained = [w for w in markers.get(variety, []) if w in words]
    return len(retained)

df["marker_retained"] = df.apply(marker_retention, axis=1)
df.groupby("variety")["marker_retained"].mean().round(2)


In [None]:
sns.barplot(data=df, x="variety", y="marker_retained", errorbar=None)
plt.title("Average Dialect Marker Retention by Variety")
plt.ylabel("Markers Retained")
plt.show()



In [None]:
df.to_csv("../data/processed/gemini_eda_results.csv", index=False)
print("✅ Saved EDA results to ../data/processed/gemini_eda_results.csv")
