# Thesis-Quality Plots for BioMoQA Results

This notebook takes the results from `biomoqa_results.ipynb` and regenerates the figures at thesis quality. The improvements focus on clarity, readability, and a clear narrative.

## 1. Setup and Data Loading

First, we install the necessary libraries and load the experimental results into a pandas DataFrame. The data is assumed to be in the same format as the original notebook.

In [None]:
!pip install autorank matplotlib pandas seaborn tabulate

In [None]:
import pandas as pd
import numpy as np
import autorank
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data from the CSV file generated in the previous notebook
# We assume the csv is saved in the 'reports' folder.
# If you saved it elsewhere, please update the path.
try:
    results_df = pd.read_csv('../reports/biomoqa_results.csv')
except FileNotFoundError:
    print("Error: '../reports/biomoqa_results.csv' not found.")
    print("Please run the 'biomoqa_results.ipynb' notebook first to generate the results file.")
    # As a fallback for demonstration, create a dummy dataframe
    # In a real run, you should stop and generate the file.
    data = {'Ensemble_BCE_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'roberta-base_BCE_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'BiomedNLP-BiomedBERT-base-uncased-abstract_focal_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'biobert-v1.1_BCE_with_title_run-0_opt_neg-500': np.random.rand(10),
            'Ensemble_BCE_with_title_run-0_opt_neg-500': np.random.rand(10),
            'BiomedNLP-BiomedBERT-base-uncased-abstract_BCE_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'roberta-base_BCE_with_title_run-0_opt_neg-500': np.random.rand(10),
            'Ensemble_focal_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'biobert-v1.1_BCE_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'roberta-base_focal_with_title_run-0_opt_neg-500': np.random.rand(10),
            'BiomedNLP-BiomedBERT-base-uncased-abstract_BCE_with_title_run-0_opt_neg-500': np.random.rand(10),
            'BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext_BCE_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext_focal_with_title_run-0_opt_neg-500': np.random.rand(10),
            'roberta-base_focal_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'bert-base-uncased_focal_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'BiomedNLP-BiomedBERT-base-uncased-abstract_focal_with_title_run-0_opt_neg-500': np.random.rand(10),
            'BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext_BCE_with_title_run-0_opt_neg-500': np.random.rand(10),
            'biobert-v1.1_focal_with_title_run-0_opt_neg-500': np.random.rand(10),
            'biobert-v1.1_focal_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'bert-base-uncased_BCE_with_title_run-0_opt_neg-1000': np.random.rand(10),
            'bert-base-uncased_BCE_with_title_run-0_opt_neg-500': np.random.rand(10),
            'Ensemble_focal_with_title_run-0_opt_neg-500': np.random.rand(10),
            'bert-base-uncased_focal_with_title_run-0_opt_neg-500': np.random.rand(10)}
    results_df = pd.DataFrame(data)
    
original_models = results_df.columns.tolist()
results_df.head()
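Because everything downstream assumes one numeric column of F1 scores per model configuration, a quick sanity check after loading can catch format drift early. The sketch below is only an illustration: `validate_results` is a hypothetical helper, not part of the original notebook, and the checks (numeric columns, no missing values, scores in [0, 1]) are assumptions about what the CSV should contain.

```python
import numpy as np
import pandas as pd

def validate_results(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame looks usable."""
    problems = []
    if df.shape[1] < 2:
        problems.append("need at least two model columns to compare")
    # Columns that are not numeric cannot hold F1 scores
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    numeric = df.drop(columns=non_numeric)
    if numeric.isna().any().any():
        problems.append("missing values present")
    # F1 scores are expected to lie in [0, 1]
    if ((numeric < 0) | (numeric > 1)).any().any():
        problems.append("scores outside [0, 1]")
    return problems

# Demo frame with made-up scores, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({"model_a": rng.random(10), "model_b": rng.random(10)})
print(validate_results(demo))
```

Calling `validate_results(results_df)` right after loading would, for instance, flag a stray index column written by `to_csv`, since its integer values fall outside [0, 1].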

## 2. Helper Functions for Clarity

We define three functions:
1.  `get_model_family`: To map a raw model name to its base architecture.
2.  `abbreviate_name`: To shorten the long, programmatic model names into readable labels.
3.  `get_color_map`: To assign colors based on the model's base architecture, making it easy to compare model families.

In [None]:
def get_model_family(model_name):
    model_name = model_name.lower()
    if 'ensemble' in model_name:
        return 'Ensemble'
    if 'roberta' in model_name:
        return 'RoBERTa'
    if 'biomedbert-base-uncased-abstract-fulltext' in model_name:
        return 'BiomedBERT-AF'
    if 'biomedbert' in model_name:
        return 'BiomedBERT-A'
    if 'biobert' in model_name:
        return 'BioBERT'
    if 'bert-base-uncased' in model_name:
        return 'BERT'
    return 'Other'

def abbreviate_name(model_name):
    family = get_model_family(model_name)
    
    # The family name doubles as the first part of the label
    label = family
        
    # Handle loss function part
    if 'BCE' in model_name:
        label += ' +BCE'
    elif 'focal' in model_name:
        label += ' +FL'
        
    # Handle other options
    if 'with_title' in model_name:
        label += ' +T'
    if 'opt_neg-1000' in model_name:
        label += ' (N1k)'
    elif 'opt_neg-500' in model_name:
        label += ' (N500)'
        
    return label

def get_color_map(model_names):
    palette = {
        'Ensemble': '#5e3c99', # Purple
        'RoBERTa': '#2c7fb8',   # Blue
        'BiomedBERT-A': '#fdb863', # Light Orange
        'BiomedBERT-AF': '#e66101', # Dark Orange
        'BioBERT': '#c51b7d',   # Pink/Red
        'BERT': '#4dac26'       # Green
    }
    
    color_map = {}
    for name in model_names:
        family = get_model_family(name)
        if family in palette:
            color_map[abbreviate_name(name)] = palette[family]
    return color_map

# Generate new labels and the color map
abbreviated_models = [abbreviate_name(m) for m in original_models]
results_df.columns = abbreviated_models
color_map = get_color_map(original_models)

# Display the mapping as a table for the thesis appendix
abbreviation_key = pd.DataFrame({
    'Abbreviated Name': abbreviated_models,
    'Original Name': original_models
})
abbreviation_key
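Note that `abbreviate_name` drops some parts of the original name, such as the run index, so two different columns could end up with the same label. A small check like the one below can reveal that before plotting. It is a sketch under assumptions: `find_label_collisions` is a hypothetical helper, and the demo names and stand-in shortener are invented for illustration.

```python
from collections import defaultdict

def find_label_collisions(names, shorten):
    """Map each short label to the original names producing it, keeping only labels shared by 2+ names."""
    groups = defaultdict(list)
    for name in names:
        groups[shorten(name)].append(name)
    return {label: originals for label, originals in groups.items() if len(originals) > 1}

# Hypothetical names: the first two differ only in the run index
demo_names = [
    "roberta-base_BCE_with_title_run-0_opt_neg-500",
    "roberta-base_BCE_with_title_run-1_opt_neg-500",
    "roberta-base_focal_with_title_run-0_opt_neg-500",
]

def demo_shorten(name):
    # Stand-in shortener that ignores the run index, as abbreviate_name does
    return name.replace("_run-0", "").replace("_run-1", "")

collisions = find_label_collisions(demo_names, demo_shorten)
print(collisions)
```

In the notebook itself, the call would be `find_label_collisions(original_models, abbreviate_name)`; an empty dictionary means every label is unique.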

## 3. Improved Main Critical Difference Plot

This is the primary plot, enhanced with:
- **Concise Labels:** Using the abbreviations defined above.
- **Purposeful Colors:** Grouping models by their base architecture.
- **Clearer Titles:** A descriptive title and an explicit x-axis label.

In [None]:
# Run autorank analysis
result = autorank.autorank(results_df, alpha=0.05, verbose=False)

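# Create the plot. Note: the documented autorank signature is
# plot_stats(result, allow_insignificant=False, ax=None, width=None) and it returns a
# matplotlib Axes, so size, colors and font size are applied around the call.
# Check this against your installed version. plot_stats raises a ValueError for
# non-significant results unless allow_insignificant=True is passed.
ax = autorank.plot_stats(result, width=10)
fig = ax.get_figure()

# Color each model label by family (assumes labels are drawn verbatim as text objects)
for text in ax.texts:
    if text.get_text() in color_map:
        text.set_color(color_map[text.get_text()])
        text.set_fontsize(11)

# Improve titles and labels (the x-label only shows if the plot draws its axis)
ax.set_title('Critical Difference Diagram of Model F1 Ranks (Nemenyi Test, α=0.05)', fontsize=14)
ax.set_xlabel('Average Rank based on F1 Score', fontsize=12)
fig.savefig('../reports/cd_plot_main_improved.png', dpi=300, bbox_inches='tight')
plt.show()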
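To make the x-axis of the CD diagram less of a black box, the average ranks and the Nemenyi critical difference can be reproduced by hand. The sketch below uses invented toy scores; `average_ranks` and `nemenyi_cd` are hypothetical helper names, and the value 2.343 is the tabulated critical value for three models at α=0.05. autorank computes these quantities itself, so this is only meant to show where the numbers come from.

```python
import math
import pandas as pd

def average_ranks(scores: pd.DataFrame) -> pd.Series:
    """Rank models within each row (1 = highest score, ties share the mean rank), then average over rows."""
    return scores.rank(axis=1, ascending=False).mean().sort_values()

def nemenyi_cd(q_alpha: float, k: int, n: int) -> float:
    """Critical difference CD = q_alpha * sqrt(k * (k + 1) / (6 * n)) for k models and n rows."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Toy scores, invented purely for illustration: 4 rows x 3 models
toy = pd.DataFrame({
    "A": [0.90, 0.85, 0.88, 0.91],
    "B": [0.80, 0.86, 0.82, 0.83],
    "C": [0.70, 0.75, 0.72, 0.74],
})
ranks = average_ranks(toy)
cd = nemenyi_cd(q_alpha=2.343, k=3, n=4)
print(ranks)          # A: 1.25, B: 1.75, C: 3.00
print(round(cd, 3))   # 1.657
```

Two models are connected by a bar in the diagram when their average ranks differ by less than the critical difference. In this toy case A and B (difference 0.5) are not separated, while A and C (difference 1.75) are.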

---

## 4. Focused Comparison: Best Model per Family

To provide a high-level summary, this plot compares only the single best-performing variant of each major model family. This makes it much easier to see the top-tier comparison between base architectures.

In [None]:
# Restore original long names to work with the data
results_df.columns = original_models

# Add a 'family' column for grouping
mean_scores = results_df.mean().reset_index()
mean_scores.columns = ['model', 'mean_f1']
mean_scores['family'] = mean_scores['model'].apply(get_model_family)

# Find the best model in each family based on mean F1
best_models_per_family = mean_scores.loc[mean_scores.groupby('family')['mean_f1'].idxmax()]
best_model_names = best_models_per_family['model'].tolist()

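# Create a new dataframe with only the best models (copy so renaming leaves results_df untouched)
best_of_family_df = results_df[best_model_names].copy()

# Abbreviate the names for the plot
best_of_family_df.columns = [abbreviate_name(m) for m in best_model_names]
best_of_family_color_map = get_color_map(best_model_names)

# Run autorank and plot. The documented plot_stats signature takes only
# `allow_insignificant`, `ax` and `width` and returns a matplotlib Axes,
# so colors are applied to the labels afterwards (check against your version).
result_best_family = autorank.autorank(best_of_family_df, alpha=0.05, verbose=False)
ax_best_family = autorank.plot_stats(result_best_family, width=8)
fig_best_family = ax_best_family.get_figure()
for text in ax_best_family.texts:
    if text.get_text() in best_of_family_color_map:
        text.set_color(best_of_family_color_map[text.get_text()])

ax_best_family.set_title('CD Plot of Best Performing Model from Each Family', fontsize=14)
ax_best_family.set_xlabel('Average Rank based on F1 Score', fontsize=12)
fig_best_family.savefig('../reports/cd_plot_best_family.png', dpi=300, bbox_inches='tight')
plt.show()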

---

## 5. Focused Comparison: Loss Function Ablation Study

This plot serves as an ablation study to specifically investigate the impact of the loss function (BCE vs. Focal Loss) within a single model family (`BiomedBERT-A`). This helps to justify the choice of loss function in your thesis.

In [None]:
# Filter for BiomedBERT-A models only
biomedbert_a_models = [m for m in original_models if get_model_family(m) == 'BiomedBERT-A']
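ablation_df = results_df[biomedbert_a_models].copy()
ablation_df.columns = [abbreviate_name(m) for m in biomedbert_a_models]

# Create a color map based on loss function (green = focal loss, purple = BCE)
ablation_color_map = {name: '#31a354' if '+FL' in name else '#756bb1' for name in ablation_df.columns}

# Run autorank and plot. The documented plot_stats signature takes only
# `allow_insignificant`, `ax` and `width` and returns a matplotlib Axes,
# so colors are applied to the labels afterwards (check against your version).
result_ablation = autorank.autorank(ablation_df, alpha=0.05, verbose=False)
ax_ablation = autorank.plot_stats(result_ablation, width=8)
fig_ablation = ax_ablation.get_figure()
for text in ax_ablation.texts:
    if text.get_text() in ablation_color_map:
        text.set_color(ablation_color_map[text.get_text()])

ax_ablation.set_title('Ablation Study: BCE vs. Focal Loss for BiomedBERT-A', fontsize=14)
ax_ablation.set_xlabel('Average Rank based on F1 Score', fontsize=12)
fig_ablation.savefig('../reports/cd_plot_ablation_loss.png', dpi=300, bbox_inches='tight')
plt.show()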
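Alongside the rank-based plot, it can help to look at the raw paired differences between the two loss functions for otherwise identical configurations. The following is a minimal sketch that relies on the column naming convention used in this notebook (`_BCE_` versus `_focal_`). `paired_loss_differences` is a hypothetical helper and the demo scores are invented.

```python
import pandas as pd

def paired_loss_differences(scores: pd.DataFrame) -> pd.DataFrame:
    """For each column containing '_BCE_', subtract its '_focal_' twin row by row (BCE minus focal)."""
    diffs = {}
    for bce_col in scores.columns:
        if "_BCE_" not in bce_col:
            continue
        focal_col = bce_col.replace("_BCE_", "_focal_")
        # Only keep configurations that exist with both loss functions
        if focal_col in scores.columns:
            diffs[bce_col.replace("_BCE_", "_")] = scores[bce_col] - scores[focal_col]
    return pd.DataFrame(diffs)

# Invented scores for illustration only; real values would come from results_df
demo = pd.DataFrame({
    "m_BCE_with_title_run-0_opt_neg-500": [0.80, 0.82, 0.81],
    "m_focal_with_title_run-0_opt_neg-500": [0.78, 0.83, 0.80],
})
diffs = paired_loss_differences(demo)
print(diffs.mean())
```

A positive mean indicates that BCE scored higher on average for that configuration. Whether the difference is meaningful is still a question for the statistical test above.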

---

## 6. Companion Bar Chart of Mean F1 Scores

The CD plot shows relative ranks, but not the magnitude of the performance difference. This bar chart complements the CD plot by showing the absolute mean F1 scores for the top models, with error bars indicating the standard deviation. Together, the two figures cover both statistical significance and practical effect size. Note that the y-axis is zoomed in on the range of the plotted scores, so bar heights exaggerate the differences; mention this in the caption.

In [None]:
# Get mean and std dev for the top models
top_n = 8
mean_f1s = results_df[original_models].mean().sort_values(ascending=False)
std_f1s = results_df[original_models].std()

top_models = mean_f1s.head(top_n)
top_model_names = top_models.index
top_model_stds = std_f1s[top_model_names]
top_model_abbrevs = [abbreviate_name(m) for m in top_model_names]

# Get colors (family palette keyed by abbreviated label)
family_colors = get_color_map(original_models)
bar_colors = [family_colors[label] for label in top_model_abbrevs]

# Create plot (set the style first so it applies to the new figure)
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))
bars = plt.bar(top_model_abbrevs, top_models.values, yerr=top_model_stds.values, capsize=5, color=bar_colors)

plt.ylabel('Mean F1 Score', fontsize=12)
plt.xlabel('Model Configuration', fontsize=12)
plt.title(f'Mean F1 Score of Top {top_n} Models', fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=11)

# Zoom the y-axis, but leave room for the error bars so they are not clipped
lower = (top_models - top_model_stds).min()
upper = (top_models + top_model_stds).max()
plt.ylim(lower * 0.98, upper * 1.02)
plt.tight_layout()
plt.savefig('../reports/barchart_top_models_f1.png', dpi=300)
plt.show()

---

## 7. Text for Thesis

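Below are markdown-formatted examples of a figure caption, an abbreviation key, and a discussion paragraph that you can adapt for your thesis. The specific numbers in these examples (the number of configurations and the average ranks) are placeholders; replace them with the values from your own run, for example from `result.rankdf`. Also note that autorank selects the statistical tests from the data, and the Nemenyi test is only used as the post-hoc test after a Friedman test, so check which test was actually applied (for example with `autorank.create_report(result)`) before naming it in a caption.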

### Figure Caption Example

**Figure X: Critical Difference diagram comparing the average F1 ranks of all 22 model configurations.** Ranks were determined across all BioMoQA task subsets. The Nemenyi post-hoc test (α=0.05) was used to determine statistical significance. Models are grouped by color based on their base architecture. Models connected by the thick horizontal bar are not statistically significantly different from one another. Model names are abbreviated for clarity; a full key is provided in Table Y.

### Abbreviation Key (for a Table)

You can generate this table in your thesis using the `abbreviation_key` DataFrame created in section 2. Below is the markdown version.

In [None]:
print(abbreviation_key.to_markdown(index=False))

### Results and Discussion Example

To evaluate the performance of our models, we conducted a statistical analysis using the Nemenyi test, with the results visualized in the Critical Difference (CD) diagram in Figure X. The diagram shows the average F1 rank for each of the 22 model configurations across all experimental tasks. 

As shown in Figure X, the ensemble model utilizing Binary Cross-Entropy (`Ensemble +BCE +T (N1k)`) achieved the best average rank of 5.5. However, the CD diagram reveals that a large group of the top-performing models are statistically indistinguishable from one another. Specifically, the performance of the top ensemble is not statistically superior to several single-model configurations, including `RoBERTa +BCE +T (N1k)` (rank 6.5) and `BiomedBERT-A +FL +T (N1k)` (rank 6.8). This is a crucial finding: the test found no significant difference between the ensemble and a well-configured but less computationally expensive single model such as RoBERTa. Because a non-significant result does not prove equivalence, the finding is best read as the absence of evidence for an ensemble advantage on this task.

Furthermore, by grouping the models by their base architecture (indicated by color), we observe that variants of RoBERTa, BioBERT, and BiomedBERT are all competitive, with no single family demonstrating a clear statistical advantage over the others. The models based on the original BERT architecture consistently ranked lower than the biomedical-domain-specific models, which is consistent with a benefit from domain-specific pre-training. Figure Z [the bar chart] complements these findings by illustrating the small absolute differences in mean F1 score among the top-ranked models.