# Polish Language Model Evaluation

This notebook evaluates three language models on Polish language tasks:

1. **Bielik-11B-v2.3-Instruct** - A specialized Polish language model
2. **Google Gemma-3-4B-IT** - A multilingual model
3. **Microsoft Phi-4-mini-instruct** - A multilingual model

We compare their performance on four Polish-language datasets from the KLEJ benchmark: DYK, PolEmo2.0, PSC, and CDSC.

## Environment Setup

First, we'll set up the environment and check that all requirements are met.

In [None]:
import os
import sys
import logging
import torch
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add project root to path for imports
sys.path.append('..')

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('model_evaluation')

# Create results directory if it doesn't exist
os.makedirs('../results', exist_ok=True)

In [None]:
# Check if GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    # Print GPU memory info
    gpu_properties = torch.cuda.get_device_properties(0)
    print(f"Total memory: {gpu_properties.total_memory / 1e9:.2f} GB")
    print(f"CUDA capability: {gpu_properties.major}.{gpu_properties.minor}")
else:
    device = torch.device("cpu")
    print("No GPU available. Using CPU (this will be very slow for large models).")

In [None]:
# Import project modules
from src.utils import check_environment_setup, optimize_memory, check_gpu_compatibility
from src.datasets import PolishDatasetLoader, prepare_datasets_for_evaluation
from src.init_models import setup_huggingface_auth, initialize_model, MODEL_CONFIGS
from src.evaluation import run_evaluation, aggregate_results, visualize_results, generate_summary_report, save_results

In [None]:
# Check environment setup
setup_ok = check_environment_setup()
if not setup_ok:
    print("There are issues with your environment setup. See warnings above.")
else:
    print("Environment setup looks good!")

## Model Size and Memory Requirements

Let's check the memory requirements for each model to determine the best loading strategy.

In [None]:
# Display model size information and memory requirements
model_sizes = {
    'bielik': 11,    # 11B parameters
    'gemma': 4,      # 4B parameters
    'phi': 3.8       # 3.8B parameters (approximate)
}

for model_key, size in model_sizes.items():
    print(f"\n{model_key.upper()} - {MODEL_CONFIGS[model_key]['model_id']}")
    print(f"Size: {size}B parameters")
    print(check_gpu_compatibility(size))
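
As a rough cross-check on the figures above, the memory needed for model weights can be estimated from the parameter count and the bytes used per parameter. The sketch below is a back-of-the-envelope estimate only; `estimate_weight_memory_gb` and `BYTES_PER_PARAM` are hypothetical names for illustration and are not part of `src.utils`, and `check_gpu_compatibility` may compute its result differently.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_memory_gb(params_billions, precision="fp16"):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    Weights only: activations, the KV cache, and framework overhead are
    excluded, so real usage will be higher.
    """
    # (params_billions * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billions * BYTES_PER_PARAM[precision]


# Parameter counts in billions, as listed in the cell above.
sizes_in_billions = {"bielik": 11, "gemma": 4, "phi": 3.8}

for name, size in sizes_in_billions.items():
    fp16 = estimate_weight_memory_gb(size, "fp16")
    int8 = estimate_weight_memory_gb(size, "int8")
    print(f"{name}: ~{fp16:.1f} GB in fp16, ~{int8:.1f} GB in int8")
```

By this estimate, Bielik's weights alone take about 22 GB in fp16 and about 11 GB in 8-bit, while the two smaller models need roughly 8 GB in fp16.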

## Load Datasets

Now, let's load the Polish language datasets for evaluation.

In [None]:
# Set up Hugging Face authentication
setup_huggingface_auth()

In [None]:
# Create dataset loader
dataset_loader = PolishDatasetLoader(cache_dir='../data')

In [None]:
# Define number of samples per dataset (adjust based on your needs and time constraints)
samples_per_dataset = {
    'dyk': 50,      # Question-answer correctness
    'polemo2': 50,  # Sentiment analysis
    'psc': 50,      # Text similarity
    'cdsc': 50      # Entailment
}

# Load and sample datasets
evaluation_datasets = dataset_loader.create_evaluation_dataset(
    samples_per_dataset=samples_per_dataset,
    split='test'
)
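
Because all three models are compared on the same 50-example subsets, the sampling should be reproducible between runs. Whether `create_evaluation_dataset` fixes a random seed is not shown here, so the following is only a sketch of one way to draw a repeatable sample; `sample_indices` and its `seed` parameter are hypothetical.

```python
import random
from typing import List


def sample_indices(dataset_size: int, n_samples: int, seed: int = 42) -> List[int]:
    """Pick up to n_samples distinct row indices, reproducibly."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    k = min(n_samples, dataset_size)  # never ask for more rows than exist
    return sorted(rng.sample(range(dataset_size), k))


first = sample_indices(dataset_size=1000, n_samples=50)
second = sample_indices(dataset_size=1000, n_samples=50)
print(len(first), first == second)  # prints: 50 True
```

With a Hugging Face `Dataset`, such a list of indices could be passed to `.select(...)`, the same method used in the example cells of this notebook.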

### Examine Dataset Examples

Let's look at a few examples from each dataset to understand the tasks better.

In [None]:
# View examples from DYK dataset
dyk_examples = evaluation_datasets['dyk'].select(range(3))
for i, example in enumerate(dyk_examples):
    print(f"Example {i+1}")
    print(f"Question: {example['question']}")
    print(f"Answer: {example['answer']}")
    print(f"Target: {example['target']}")
    print("---")

In [None]:
# View examples from POLEMO dataset
polemo_examples = evaluation_datasets['polemo2'].select(range(3))
for i, example in enumerate(polemo_examples):
    print(f"Example {i+1}")
    print(f"Sentence: {example['sentence']}")
    print(f"Target: {example['target']}")
    print("---")

In [None]:
# View examples from PSC dataset
psc_examples = evaluation_datasets['psc'].select(range(3))
for i, example in enumerate(psc_examples):
    print(f"Example {i+1}")
    print(f"Extract Text: {example['extract_text'][:200]}...")
    print(f"Summary Text: {example['summary_text']}")
    print(f"Label: {example['label']}")
    print("---")

In [None]:
# View examples from CDSC dataset
cdsc_examples = evaluation_datasets['cdsc'].select(range(3))
for i, example in enumerate(cdsc_examples):
    print(f"Example {i+1}")
    print(f"Sentence A: {example['sentence_A']}")
    print(f"Sentence B: {example['sentence_B']}")
    print(f"Entailment: {example['entailment_judgment']}")
    print("---")

## Model Evaluation

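Now, let's evaluate each model on the datasets. Models are loaded one at a time and removed from memory after evaluation, so only one model occupies the GPU at any moment. This will take some time, especially for the 11B model.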

### Model 1: Bielik-11B-v2.3-Instruct

In [None]:
# Initialize Bielik model (using 8-bit quantization due to its size)
print("Loading Bielik-11B model...")
bielik_model, bielik_tokenizer = initialize_model(
    model_key='bielik',
    device=device.type,  # "cuda" or "cpu", as detected during setup
    load_in_8bit=True,  # Use 8-bit quantization for memory efficiency
    cache_dir='../models'
)

In [None]:
# Evaluate Bielik model
bielik_results = run_evaluation(
    model_key='bielik',
    model=bielik_model,
    tokenizer=bielik_tokenizer,
    datasets=evaluation_datasets,
    device=device.type,  # "cuda" or "cpu", as detected during setup
    max_samples_per_dataset=None,  # Use all sampled examples
    output_dir='../results'
)

In [None]:
# Free up GPU memory
del bielik_model
del bielik_tokenizer
optimize_memory()
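
The implementation of `optimize_memory` lives in `src.utils` and is not shown in this notebook. A common pattern for this kind of cleanup, sketched below under that caveat, is to run Python's garbage collector and then ask PyTorch to release its cached CUDA memory. The name `release_gpu_memory` is hypothetical.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def release_gpu_memory() -> int:
    """Collect unreachable Python objects, then release cached CUDA blocks.

    Returns the number of unreachable objects found by the garbage collector.
    """
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Return cached but unused blocks to the driver; memory still held
        # by live tensors is not affected.
        torch.cuda.empty_cache()
    return collected


print(f"Collected {release_gpu_memory()} unreachable objects")
```

The order matters: the `del` statements must come first, because `torch.cuda.empty_cache()` can only release memory that no live tensor still references.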

### Model 2: Google Gemma-3-4B-IT

In [None]:
# Initialize Gemma model
print("Loading Gemma-3-4B model...")
gemma_model, gemma_tokenizer = initialize_model(
    model_key='gemma',
    device=device.type,  # "cuda" or "cpu", as detected during setup
    load_in_8bit=False,  # Don't need 8-bit for this smaller model
    cache_dir='../models'
)

In [None]:
# Evaluate Gemma model
gemma_results = run_evaluation(
    model_key='gemma',
    model=gemma_model,
    tokenizer=gemma_tokenizer,
    datasets=evaluation_datasets,
    device=device.type,  # "cuda" or "cpu", as detected during setup
    max_samples_per_dataset=None,
    output_dir='../results'
)

In [None]:
# Free up GPU memory
del gemma_model
del gemma_tokenizer
optimize_memory()

### Model 3: Microsoft Phi-4-mini-instruct

In [None]:
# Initialize Phi model
print("Loading Phi-4-mini model...")
phi_model, phi_tokenizer = initialize_model(
    model_key='phi',
    device=device.type,  # "cuda" or "cpu", as detected during setup
    load_in_8bit=False,  # Don't need 8-bit for this smaller model
    cache_dir='../models'
)

In [None]:
# Evaluate Phi model
phi_results = run_evaluation(
    model_key='phi',
    model=phi_model,
    tokenizer=phi_tokenizer,
    datasets=evaluation_datasets,
    device=device.type,  # "cuda" or "cpu", as detected during setup
    max_samples_per_dataset=None,
    output_dir='../results'
)

In [None]:
# Free up GPU memory
del phi_model
del phi_tokenizer
optimize_memory()

## Results Analysis

Now let's analyze the results of all three models.

In [None]:
# Combine all results
all_results = {
    'bielik': bielik_results,
    'gemma': gemma_results,
    'phi': phi_results
}

# Save complete results
save_results(
    results=all_results,
    output_dir='../results',
    prefix='full_evaluation'
)

In [None]:
# Create aggregated results DataFrame
results_df = aggregate_results(all_results)
results_df
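
The later cells filter `results_df` by a `dataset` column and plot a `model` column against metric columns, which suggests one row per model and dataset pair. The sketch below shows one way nested results could be flattened into that shape. It assumes each model's results are a dictionary of the form `{dataset: {metric: value}}`, which may not match what `run_evaluation` actually returns; `flatten_results` is a hypothetical helper and the numbers are made up.

```python
def flatten_results(all_results: dict) -> list:
    """Turn {model: {dataset: {metric: value}}} into one row per (model, dataset)."""
    rows = []
    for model, per_dataset in all_results.items():
        for dataset, metrics in per_dataset.items():
            # Metric names become columns next to the two key columns.
            rows.append({"model": model, "dataset": dataset, **metrics})
    return rows


# Made-up numbers, for illustration only.
example = {
    "model_a": {"dyk": {"accuracy": 0.5, "f1": 0.25}},
    "model_b": {"dyk": {"accuracy": 0.75, "f1": 0.5}},
}
rows = flatten_results(example)
print(rows[0])
```

Passing such a list of dictionaries to `pd.DataFrame(rows)` yields a table with `model`, `dataset`, and one column per metric.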

In [None]:
# Visualize results
visualize_results(results_df, output_path='../results/performance_comparison.png')

In [None]:
# Generate and display summary report
summary_report = generate_summary_report(results_df)
print(summary_report)

In [None]:
# Save summary report to file
with open("../results/summary_report.md", "w", encoding="utf-8") as f:
    f.write(summary_report)

## Dataset-Specific Analysis

Let's look at the performance on each dataset separately.

In [None]:
# Analyze performance on DYK dataset
dyk_results = results_df[results_df['dataset'] == 'dyk']
plt.figure(figsize=(10, 6))
sns.barplot(x='model', y='f1', data=dyk_results)
plt.title('Performance on DYK Dataset (F1 Score)')
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Analyze performance on POLEMO dataset
polemo_results = results_df[results_df['dataset'] == 'polemo2']
plt.figure(figsize=(10, 6))
sns.barplot(x='model', y='accuracy', data=polemo_results)
plt.title('Performance on POLEMO Dataset (Accuracy)')
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Analyze performance on PSC dataset
psc_results = results_df[results_df['dataset'] == 'psc']
plt.figure(figsize=(10, 6))
sns.barplot(x='model', y='f1', data=psc_results)
plt.title('Performance on PSC Dataset (F1 Score)')
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Analyze performance on CDSC dataset
cdsc_results = results_df[results_df['dataset'] == 'cdsc']
plt.figure(figsize=(10, 6))
sns.barplot(x='model', y='accuracy', data=cdsc_results)
plt.title('Performance on CDSC Dataset (Accuracy)')
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
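
The plots above report F1 for DYK and PSC and accuracy for POLEMO and CDSC. The two metrics can diverge sharply when classes are imbalanced, as the sketch below illustrates for the binary case. It uses toy labels rather than real evaluation output; `binary_metrics` is a hypothetical helper, and the project's `src.evaluation` module may compute its scores differently (for example, with macro averaging).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no positive predictions/labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 8 negatives, 2 positives; the predictor always answers "negative".
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(binary_metrics(y_true, y_pred))
# prints: {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

A model that never predicts the positive class reaches 80% accuracy here while scoring 0 on F1, so the two kinds of plots should not be compared directly.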

## Performance Comparison

Let's compare the models' overall performance by averaging accuracy, F1, and precision across the four datasets.

In [None]:
# Calculate average performance per model
avg_performance = results_df.groupby('model')[['accuracy', 'f1', 'precision']].mean()
avg_performance

In [None]:
# Visualization of overall performance
avg_performance_long = avg_performance.reset_index().melt(
    id_vars=['model'],
    value_vars=['accuracy', 'f1', 'precision'],
    var_name='metric',
    value_name='score'
)

plt.figure(figsize=(12, 8))
sns.barplot(x='model', y='score', hue='metric', data=avg_performance_long)
plt.title('Overall Model Performance Across Metrics')
plt.xlabel('Model')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(title='Metric')
plt.tight_layout()
plt.savefig('../results/overall_performance.png', dpi=300, bbox_inches='tight')
plt.show()

## Conclusion

Based on the evaluation results, we can draw several conclusions:

1. **Overall Performance Comparison**: [To be filled after running the evaluation]
2. **Task-Specific Strengths**: [To be filled after running the evaluation]
3. **Polish vs. Multilingual Models**: [To be filled after running the evaluation]
4. **Size vs. Performance Trade-off**: [To be filled after running the evaluation]

The evaluation demonstrates [overall conclusion to be added after running].