# üá™üá¨ EgySentiment: Transfer Learning for Egyptian Financial Sentiment

**Author:** AI Research Scientist  
**Goal:** Fine-tune `Llama-3-8b-Instruct` on the gold-standard **Financial PhraseBank** dataset and evaluate its performance on **Egyptian Financial News**.  
**Environment:** Google Colab (Free Tier - T4 GPU)  

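### 🚀 Strategy: Transfer Learning
1. **Train:** Use the `sentences_allagree` subset of `Financial PhraseBank` (about 2,260 of the corpus's 4,800+ English financial sentences, the ones every annotator labeled identically) to teach the model general financial sentiment.
2. **Test:** Evaluate the model on your local `training_data.jsonl` (Egyptian context), which despite its file name is used only as a held-out test set, to see how well the model transfers.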

### 📋 Workflow
1. **Setup:** Install Unsloth & Dependencies.
2. **Data:** Load PhraseBank (Train) and Local JSONL (Test).
3. **Model:** Load 4-bit Quantized Llama-3.
4. **Train:** SFT with LoRA on PhraseBank.
5. **Eval:** Run inference on Egyptian data & plot Confusion Matrix.
6. **Export:** Save GGUF.

## 1. Setup & Installation

In [None]:
%%capture
# Install Unsloth, xformers (memory-efficient attention kernels), and other deps
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install scikit-learn matplotlib seaborn datasets

In [None]:
from unsloth import FastLanguageModel
import torch

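max_seq_length = 2048  # Maximum tokens per training example
dtype = None           # None lets Unsloth auto-detect (float16 on a T4)
load_in_4bit = True    # Load 4-bit quantized weights to fit the T4's memory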

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
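
As a rough sanity check on why the 4-bit checkpoint is loaded on a T4, the sketch below estimates the memory taken by the weights alone. The parameter count (about 8.03 billion for Llama-3-8B) and the T4's nominal 16 GB are assumptions that this notebook does not state, and the estimate is a lower bound: it ignores quantization constants, layers kept in higher precision, activations, LoRA weights and optimizer state.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given precision, in GB."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8.03e9      # assumption: approximate Llama-3-8B parameter count
T4_MEMORY_GB = 16.0    # assumption: nominal T4 memory; Colab exposes slightly less

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"fp16 weights : {fp16_gb:.2f} GB")
print(f"4-bit weights: {int4_gb:.2f} GB")
print(f"fp16 fits on a T4 with room to train: {fp16_gb < T4_MEMORY_GB}")
```

Under these assumptions the half-precision weights alone already reach the card's capacity, while the 4-bit weights leave most of the memory free for activations and the LoRA adapters.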

## 2. Data Loading (Transfer Learning Setup)

In [None]:
import json
from datasets import load_dataset, Dataset
import pandas as pd

# 1. Load Training Data (Financial PhraseBank - Gold Standard)
print("üìö Loading Financial PhraseBank (Training Data)...")
train_dataset = load_dataset("takala/financial_phrasebank", "sentences_allagree", split="train")

# 2. Load Test Data (Your Local Egyptian Data)
print("üá™üá¨ Loading Local Egyptian Data (Test Data)...")
dataset_path = "training_data.jsonl"
local_data = []

try:
    with open(dataset_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # Skip blank lines, which json.loads would reject
                local_data.append(json.loads(line))

    test_df = pd.DataFrame(local_data)
    print(f"✓ Loaded {len(test_df)} local samples for testing")

except FileNotFoundError:
    print("⚠️ 'training_data.jsonl' not found. Using dummy test data.")
    test_df = pd.DataFrame([
        {"text": "EGX30 rises 2% on strong CIB earnings", "sentiment": "positive", "reasoning": "Market rise"},
        {"text": "EGP devalues against dollar", "sentiment": "negative", "reasoning": "Currency fall"}
    ])

# 3. Format Prompts
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Analyze the sentiment of the following financial news. Provide the sentiment (positive/negative/neutral) and a brief reasoning.

### Input:
{}

### Response:
{{"sentiment": "{}", "reasoning": "{}"}}"""

EOS_TOKEN = tokenizer.eos_token

def format_phrasebank(examples):
    # Map integer labels to string
    label_map = {0: "negative", 1: "neutral", 2: "positive"}
    
    inputs = examples["sentence"]
    labels = examples["label"]
    texts = []
    
    for input_text, label in zip(inputs, labels):
        sentiment = label_map[label]
        # PhraseBank doesn't have reasoning, so we provide a generic one for training context
        text = alpaca_prompt.format(input_text, sentiment, "Sentiment inferred from financial context.") + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

def format_local_test(examples):
    inputs = examples["text"]
    sentiments = examples["sentiment"]
    reasonings = examples["reasoning"]
    texts = []
    
    for input_text, sentiment, reasoning in zip(inputs, sentiments, reasonings):
        text = alpaca_prompt.format(input_text, sentiment, reasoning) + EOS_TOKEN
        texts.append(text)
    return {"text": texts, "ground_truth_sentiment": sentiments}

# Apply formatting
print("üîÑ Formatting datasets...")
train_dataset = train_dataset.map(format_phrasebank, batched=True)

test_dataset = Dataset.from_pandas(test_df)
test_dataset = test_dataset.map(format_local_test, batched=True)

print(f"✓ Training Set (PhraseBank): {len(train_dataset)} samples")
print(f"✓ Test Set (EgySentiment):   {len(test_dataset)} samples")
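
To make the prompt formatting concrete, here is a minimal sketch of how the template's doubled braces behave under `str.format`, and how a rendered example splits into an inference prompt and a JSON response. The `template` below is a shortened stand-in for `alpaca_prompt` (the instruction block is omitted), and the sample record is one of the dummy rows from the loading cell.

```python
import json

# Shortened stand-in for alpaca_prompt: "{{" and "}}" become literal braces,
# while each "{}" is filled positionally by str.format.
template = """### Input:
{}

### Response:
{{"sentiment": "{}", "reasoning": "{}"}}"""

rendered = template.format(
    "EGX30 rises 2% on strong CIB earnings", "positive", "Market rise"
)

# At inference time only the text up to the response marker is fed to the model.
prompt_part, response_part = rendered.split("### Response:")
inference_prompt = prompt_part + "### Response:\n"

# The training target after the marker is a JSON object.
parsed = json.loads(response_part)

print(inference_prompt)
print(parsed["sentiment"])
```

One caveat with this style of template: a `reasoning` string that contains a double quote would render as invalid JSON, so such records may need escaping (for example with `json.dumps`) before formatting.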

## 3. Model Configuration (LoRA)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
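
For a sense of scale, the sketch below counts the trainable parameters this LoRA configuration adds. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices totalling `r * (in_features + out_features)` weights. The layer dimensions are assumptions taken from the published Llama-3-8B architecture (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers) and are not stated in this notebook.

```python
R = 16            # LoRA rank, as configured above
LORA_ALPHA = 16   # LoRA alpha, as configured above

# Assumed Llama-3-8B dimensions (not taken from this notebook)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024     # 8 key/value heads x 128 head dim (grouped-query attention)
N_LAYERS = 32

# (in_features, out_features) for every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Weights in the A (r x in) and B (out x r) matrices of one adapter."""
    return r * (in_features + out_features)

per_layer = sum(lora_param_count(i, o, R) for i, o in shapes.values())
total = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"Trainable LoRA parameters per layer: {per_layer:,}")
print(f"Trainable LoRA parameters in total : {total:,}")
print(f"LoRA scaling (alpha / r)           : {scaling}")
```

Under these assumptions the adapters come to roughly 42 million weights, about half a percent of the base model, which is why the fine-tune fits next to the frozen 4-bit weights.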

## 4. Training on Financial PhraseBank

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Quick demo run: 60 steps x effective batch of 8 (2 x 4) = 480 examples, well under one PhraseBank epoch; raise for a full pass
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()
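
To choose `max_steps` deliberately, it helps to translate steps into epochs. The sketch below does that arithmetic for the batch settings used above. The training-set size of 2,264 is an assumption (the commonly reported size of the `sentences_allagree` subset); substitute `len(train_dataset)` from the data-loading cell for the real value. The helper names are illustrative, not part of any library.

```python
import math

def effective_batch_size(per_device_batch: int, grad_accum_steps: int, n_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * n_devices

def epochs_covered(max_steps: int, batch: int, n_examples: int) -> float:
    """Fraction of the dataset seen after max_steps optimizer steps."""
    return max_steps * batch / n_examples

def steps_for_epochs(epochs: float, batch: int, n_examples: int) -> int:
    """Optimizer steps needed to cover the requested number of epochs."""
    return math.ceil(epochs * n_examples / batch)

N_TRAIN = 2264  # assumption: replace with len(train_dataset)

batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
fraction = epochs_covered(max_steps=60, batch=batch, n_examples=N_TRAIN)
full_pass = steps_for_epochs(1, batch, N_TRAIN)

print(f"Effective batch size     : {batch}")
print(f"Epochs covered in 60 steps: {fraction:.2f}")
print(f"Steps for one full epoch  : {full_pass}")
```

With these numbers, 60 steps cover only about a fifth of the training set, so the evaluation that follows reflects a very light fine-tune rather than a converged one.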

## 5. Evaluation on Egyptian Data (Critical)
We now measure how well the model, fine-tuned only on English PhraseBank sentences, classifies your Egyptian dataset.

In [None]:
FastLanguageModel.for_inference(model)

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import re
from tqdm import tqdm

y_true = []
y_pred = []

print(f"Running inference on {len(test_dataset)} Egyptian samples...")

for i in tqdm(range(len(test_dataset))):
    # Prepare input
    input_text = test_dataset[i]["text"].split("### Response:")[0] + "### Response:\n"
    ground_truth = test_dataset[i]["ground_truth_sentiment"]
    
    inputs = tokenizer([input_text], return_tensors = "pt").to("cuda")

    # Generate
    outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
    response = tokenizer.batch_decode(outputs)[0]
    
    # Extract the predicted label from the text generated after the prompt
    parts = response.split("### Response:", 1)
    generated_part = parts[1] if len(parts) > 1 else ""
    match = re.search(r'"sentiment":\s*"(positive|negative|neutral)"', generated_part, re.IGNORECASE)
    # Unparseable outputs fall back to "neutral"; note this can inflate the neutral count
    pred_sentiment = match.group(1).lower() if match else "neutral"
        
    y_true.append(ground_truth)
    y_pred.append(pred_sentiment)

# Metrics
print("\nüá™üá¨ Egyptian Data Performance Report:")
print(classification_report(y_true, y_pred, labels=["positive", "neutral", "negative"]))

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred, labels=["positive", "neutral", "negative"])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=["positive", "neutral", "negative"], 
            yticklabels=["positive", "neutral", "negative"])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix: Egyptian Financial News')
plt.show()
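
Because the loop above maps every unparseable generation to `neutral`, parse failures are indistinguishable from genuine neutral predictions in the report. One possible alternative, sketched below, returns `None` when no label can be recovered so that failures can be counted separately. The function name `extract_sentiment` and the sample strings are hypothetical and only illustrate the idea.

```python
import json
import re

LABELS = ("positive", "neutral", "negative")
_PATTERN = re.compile(r'"sentiment"\s*:\s*"(positive|negative|neutral)"', re.IGNORECASE)

def extract_sentiment(generated: str):
    """Return the predicted label, or None if nothing parseable was generated."""
    # First attempt: parse the outermost {...} span as strict JSON.
    start, end = generated.find("{"), generated.rfind("}")
    if start != -1 and end > start:
        try:
            obj = json.loads(generated[start:end + 1])
            if isinstance(obj, dict):
                label = str(obj.get("sentiment", "")).lower()
                if label in LABELS:
                    return label
        except json.JSONDecodeError:
            pass  # Fall through to the regex for truncated or malformed JSON
    # Second attempt: regex, which tolerates output cut off mid-object.
    match = _PATTERN.search(generated)
    return match.group(1).lower() if match else None

# Hypothetical generations: well-formed, truncated, and free text
samples = [
    '\n{"sentiment": "Positive", "reasoning": "Market rise"}<|eot_id|>',
    '\n{"sentiment": "negative", "reasoning": "unterminated',
    "The news looks good.",
]
preds = [extract_sentiment(s) for s in samples]
n_unparsed = sum(p is None for p in preds)

print(preds)
print(f"Unparsed generations: {n_unparsed}")
```

Reporting the unparsed count next to the classification report makes it clear how much of the neutral column, if any, comes from formatting failures rather than from the model's judgment.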

## 6. Export to GGUF

In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

if True: 
    model.save_pretrained_gguf(
        "model_gguf", 
        tokenizer, 
        quantization_method = "q4_k_m",
        maximum_memory_usage = 0.6,
    )
    print("✅ GGUF saved!")