# Arabic LLM Stress Test – Evaluation Pipeline
This notebook evaluates any Large Language Model (LLM) using the stress-test suite included in this repository.

### **Features:**
- Multi-step reasoning evaluation
- Arabic dialect understanding
- Logic & math tests
- Cultural sensitivity tests
- Benchmark scoring system


In [1]:
import json
import pandas as pd
from datetime import datetime

# The Arabic text reads: "The experiment starts now"
print('Notebook Ready – تبدأ التجربة الآن')

## 🔧 Load Test Files
This cell loads the four JSONL test files, one per test category, from the `tests/` folder.

In [2]:
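def load_jsonl(path):
    """Read a JSONL file and return a list with one parsed object per line."""
    data = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines, which json.loads would reject
                continue
            data.append(json.loads(line))
    return data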

tests = {
    'multistep': load_jsonl('tests/multistep_reasoning.jsonl'),
    'dialects': load_jsonl('tests/arabic_dialect_understanding.jsonl'),
    'logic_math': load_jsonl('tests/logic_and_math.jsonl'),
    'cultural': load_jsonl('tests/cultural_sensitivity.jsonl')
}

print('Test files loaded successfully!')
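
The evaluation loop further down reads a `prompt` and an `ideal_answer` field from every record, so a malformed record would only surface as a `KeyError` in the middle of a run. A minimal sketch of an up-front check follows; the helper name `find_invalid_records` and the sample records are hypothetical and not part of the repository.

```python
# Hypothetical helper (not part of the repository): report records that lack
# the two fields the evaluation loop reads, or that hold empty/non-string values.
REQUIRED_FIELDS = ('prompt', 'ideal_answer')

def find_invalid_records(records):
    """Return (index, bad_fields) pairs for records that fail the check."""
    problems = []
    for i, record in enumerate(records):
        bad_fields = [
            field for field in REQUIRED_FIELDS
            if not isinstance(record.get(field), str) or not record[field].strip()
        ]
        if bad_fields:
            problems.append((i, bad_fields))
    return problems

# Made-up sample records, for illustration only.
sample = [
    {'prompt': 'كم يساوي 2 + 2؟', 'ideal_answer': '4'},
    {'prompt': 'A prompt without an answer'},
]
print(find_invalid_records(sample))
```

In the notebook, the same check could be applied per category, for example `find_invalid_records(tests['multistep'])`, before running the evaluation cell.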

## 🔥 Evaluate a Model (GPT, Llama, Grok…)
Define a function that takes a prompt string and returns the model's answer as a string, then assign it to `model`.
You can integrate Grok, GPT-4, Gemini, Llama, or any other API this way.

In [3]:
def mock_model(prompt):
    """ Temporary mock function for demonstration. Replace with real API call. """
    return "Mocked answer – integrate real model API here."

model = mock_model

print('Model function loaded. Replace mock_model with actual API integration.')
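
Any callable that maps a prompt string to an answer string can stand in for `mock_model`. The sketch below wraps such a callable so that a failing request is retried and, if it keeps failing, yields an empty answer instead of aborting the whole run. The names `with_retries` and `flaky_backend`, as well as the retry count and delay, are illustrative assumptions; no specific provider SDK is implied.

```python
import time

def with_retries(send_prompt, max_attempts=3, delay_seconds=0.0):
    """Wrap a prompt -> answer callable so that failures do not abort the run.

    `send_prompt` is whatever function talks to your model (hypothetical here).
    """
    def wrapped(prompt):
        for attempt in range(1, max_attempts + 1):
            try:
                return str(send_prompt(prompt))
            except Exception as exc:  # broad on purpose: backends raise different errors
                print(f'attempt {attempt} failed: {exc}')
                time.sleep(delay_seconds)
        # Empty answer, which scores 0 against any non-empty ideal answer.
        return ''
    return wrapped

# Stand-in backend that fails once, then answers; replace with a real API call.
calls = {'n': 0}

def flaky_backend(prompt):
    calls['n'] += 1
    if calls['n'] == 1:
        raise RuntimeError('simulated timeout')
    return f'echo: {prompt}'

robust_model = with_retries(flaky_backend)
print(robust_model('مرحبا'))
```

Assigning `model = with_retries(your_function)` would then leave the rest of the notebook unchanged.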

## 🧪 Run Evaluation
This will:
- send each prompt to the model
- score an output 1 if it contains the first 15 characters of the ideal answer (case-insensitive), otherwise 0
- collect the results in a scoring table


In [4]:
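def score_answer(model_output, ideal_answer):
    """Return 1 if the output contains the first 15 characters of the ideal answer, else 0."""
    model_output = model_output.lower()
    # Only a short prefix is compared; lower() has no effect on Arabic script.
    prefix = ideal_answer.lower()[:15]
    if prefix and prefix in model_output:  # an empty prefix would match every output
        return 1
    return 0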

results = []
for category, items in tests.items():
    for item in items:
        out = model(item['prompt'])
        s = score_answer(out, item['ideal_answer'])
        results.append({
            'category': category,
            'prompt': item['prompt'],
            'output': out,
            'score': s
        })

df = pd.DataFrame(results)
df
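
Because `score_answer` compares lowercased text, and lowercasing does not change Arabic script, answers that differ from the ideal answer only in diacritics (tashkeel) or elongation (tatweel) will not match. One possible pre-processing step, sketched here as an assumption rather than as part of the suite, strips those marks before the prefix comparison. The function names `normalize_arabic` and `prefix_match` are hypothetical.

```python
import unicodedata

TATWEEL = '\u0640'  # Arabic elongation character (kashida)

def normalize_arabic(text):
    """Lowercase, drop combining marks (tashkeel included) and tatweel, collapse spaces."""
    kept = [
        ch for ch in text.lower()
        if ch != TATWEEL and not unicodedata.combining(ch)
    ]
    return ' '.join(''.join(kept).split())

def prefix_match(model_output, ideal_answer, prefix_len=15):
    """Same prefix rule as the notebook's scorer, applied to normalized text."""
    output = normalize_arabic(model_output)
    prefix = normalize_arabic(ideal_answer)[:prefix_len]
    return int(bool(prefix) and prefix in output)

# The word "kitab" (book): once with kasra, fatha and dammatan, once without.
with_marks = '\u0643\u0650\u062A\u064E\u0627\u0628\u064C'
plain = '\u0643\u062A\u0627\u0628'

print(with_marks == plain)              # the raw strings differ
print(prefix_match(with_marks, plain))  # they match after normalization
```

This deliberately leaves letter variants such as the different alef forms untouched; whether to unify those as well is a separate scoring decision.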

## 📊 Summary Scores


In [5]:
summary = df.groupby('category')['score'].mean().reset_index()
summary

## 💾 Save Results
The full results table is written to a timestamped CSV file, `results_<timestamp>.csv`, in the current working directory.

In [6]:
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M')
df.to_csv(f'results_{timestamp}.csv', index=False)
print('Results saved!')