## Setup and Configuration
We define the paths and ensure we are targeting the correct folders. If this cell hangs, it means your computer is struggling to talk to the R: drive or initialize the llama-cpp library.

In [1]:
import os
import json
import re
import pandas as pd
from llama_cpp import Llama

# --- PATHS ---
BASE_DIR = r"R:\Files Ruben\GitRepos\DeepDiveV2AI"
MODELS_DIR = os.path.join(BASE_DIR, "TrainedAndMerged")
TRAIN_DATA_PATH = os.path.join(BASE_DIR, "lore_training_data_v2.json")

print(f"✅ Environment Ready.\nTarget Directory: {MODELS_DIR}")

✅ Environment Ready.
Target Directory: R:\Files Ruben\GitRepos\DeepDiveV2AI\TrainedAndMerged
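
Since the rest of the notebook depends on these two paths, it can help to confirm they exist before loading anything. The following is a minimal sketch, not part of the original notebook: `check_paths` is a hypothetical helper name, and the demo paths are placeholders. Note that it will not fix a hanging network drive; it only turns a missing folder or file into a readable message.

```python
import os

def check_paths(models_dir, train_data_path):
    """Return a list of human-readable problems; an empty list means both paths exist."""
    problems = []
    if not os.path.isdir(models_dir):
        problems.append(f"Models directory not found: {models_dir}")
    if not os.path.isfile(train_data_path):
        problems.append(f"Training data file not found: {train_data_path}")
    return problems

# Demo with placeholder paths (assumed not to exist on this machine)
for problem in check_paths("no_such_models_dir", "no_such_file.json"):
    print(problem)
```

In the notebook itself, the call would be `check_paths(MODELS_DIR, TRAIN_DATA_PATH)`.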


## Smart Model Scanner
This cell scans each version folder for a model file. Version1 is labeled "BASE_MODEL", the .gguf file is found even if its name changes between folders, and folders without a .gguf file are skipped automatically.

In [2]:
model_mapping = {}

# We want to sort Version1, Version2, Version10 correctly
def natural_sort_key(s):
    return [int(text) if text.isdigit() else text.lower() for text in re.split('([0-9]+)', s)]

folders = sorted(os.listdir(MODELS_DIR), key=natural_sort_key)

for folder in folders:
    folder_path = os.path.join(MODELS_DIR, folder)
    
    if os.path.isdir(folder_path):
        # Look for any GGUF file inside the version folder.
        # Sorting makes the choice of ggufs[0] deterministic if a folder holds several files.
        ggufs = sorted(f for f in os.listdir(folder_path) if f.endswith(".gguf"))
        
        if ggufs:
            full_path = os.path.join(folder_path, ggufs[0])
            # Special naming for Version 1
            label = "BASE_MODEL" if folder.lower() == "version1" else folder
            model_mapping[label] = full_path

print("📂 Found Models:")
for label, path in model_mapping.items():
    print(f"  > {label}: {os.path.basename(path)}")

📂 Found Models:
  > BASE_MODEL: Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
  > Version3: Llama-3-8B-Gloom-Lore.Q4_K_M.gguf
  > Version5: Llama-3-8B-Gloom-Lore.Q4_K_M.gguf
  > Version9: Llama-3-8B-Gloom-Lore.Q4_K_M.gguf
  > Version10: Llama-3-8B-Gloom-Lore.Q4_K_M.gguf
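
To see why the scanner needs `natural_sort_key`, the short sketch below compares plain string sorting with the natural sort on a few folder names. The folder names are illustrative examples, not a listing of the actual directory.

```python
import re

def natural_sort_key(s):
    # Split into text and digit runs; digit runs compare as integers
    return [int(text) if text.isdigit() else text.lower()
            for text in re.split(r'([0-9]+)', s)]

names = ["Version10", "Version2", "Version1", "Version9"]

plain = sorted(names)
natural = sorted(names, key=natural_sort_key)

print(plain)    # ['Version1', 'Version10', 'Version2', 'Version9']
print(natural)  # ['Version1', 'Version2', 'Version9', 'Version10']
```

Plain sorting compares character by character, so "Version10" lands before "Version2". The key function compares the numeric part as an integer instead.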


## Extract Test Questions and Keywords
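This cell parses your JSON file to create the "exam" for the AI. For each of the first 15 entries it keeps the user question, the expected answer, and a set of lore keywords (every word of five or more characters in the expected answer) to check against the AI's answer. The mood tag is not extracted here; it is checked later, during validation.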

In [3]:
# Load Lore Data
with open(TRAIN_DATA_PATH, 'r', encoding='utf-8') as f:
    lore_data = json.load(f)

test_cases = []
# Use the first 15 entries in the file as test questions
for entry in lore_data[:15]: 
    messages = entry['messages']
    user_q = next(m['content'] for m in messages if m['role'] == 'user')
    expected_a = next(m['content'] for m in messages if m['role'] == 'assistant')
    
    # Identify key lore words (words of 5 or more characters)
    keywords = set(re.findall(r'\w{5,}', expected_a.lower()))
    
    test_cases.append({
        "question": user_q,
        "keywords": keywords,
        "expected": expected_a
    })

print(f"📝 Prepared {len(test_cases)} test cases.")

📝 Prepared 15 test cases.
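
The keyword rule is easiest to judge on a concrete sentence. The sketch below applies the same `\w{5,}` pattern to a made-up answer; the sample text, the `extract_keywords` helper, and the optional `stopwords` parameter are illustrative assumptions, not part of the notebook.

```python
import re

def extract_keywords(text, min_len=5, stopwords=frozenset()):
    """Collect lowercase words of at least min_len characters, minus any stopwords."""
    words = re.findall(r"\w{%d,}" % min_len, text.lower())
    return {w for w in words if w not in stopwords}

sample = "The Broker guards every hatch."  # hypothetical expected answer

print(sorted(extract_keywords(sample)))
# ['broker', 'every', 'guards', 'hatch']

print(sorted(extract_keywords(sample, stopwords={"every"})))
# ['broker', 'guards', 'hatch']
```

The length rule also picks up generic words such as "every", so a small stopword set is one possible way to keep the keyword list focused on lore terms.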


## Run Validation (GPU Accelerated)
This cell loads each model in turn and runs every test case through it. Setting n_gpu_layers=-1 offloads all model layers to the GPU, so the 4070 Ti handles the heavy lifting.

In [4]:
results_list = []

# System prompt to set the AI's persona
system_msg = "You are a survivor on the Ark submarine. You are gritty and superstitious."

for model_label, model_path in model_mapping.items():
    print(f"🚀 Loading {model_label} onto GPU...", end=" ")
    
    try:
        # n_gpu_layers=-1 offloads all layers to your 4070 Ti
        # n_ctx=2048 gives the model enough "memory" for context
        llm = Llama(
            model_path=model_path, 
            n_ctx=2048, 
            n_gpu_layers=-1, 
            verbose=False
        )
        print("Ready.")
        
        for test in test_cases:
            # Constructing the Llama 3 specific chat template
            prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_msg}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{test['question']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
            
            # Generate response
            output = llm(prompt, max_tokens=150, stop=["<|eot_id|>", "<|start_header_id|>"])
            response = output['choices'][0]['text'].strip()
            
            # Metrics Calculation
            hit_count = sum(1 for word in test['keywords'] if word in response.lower())
            has_mood = 1 if "[Mood:" in response else 0
            
            results_list.append({
                "Model": model_label,
                "Question": test['question'],
                "Response": response,
                "Lore_Hits": hit_count,
                "Format_Correct": bool(has_mood)
            })
            
        # Clean up GPU VRAM before loading the next model
        del llm
        import gc
        gc.collect()
        
    except Exception as e:
        # The try block covers both loading and generation, so either can land here
        print(f"❌ {model_label} failed during load or generation: {e}")

print("\n✨ GPU Validation Complete.")

🚀 Loading BASE_MODEL onto GPU... 

llama_context: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized


Ready.
🚀 Loading Version3 onto GPU... 

llama_context: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized


Ready.
🚀 Loading Version5 onto GPU... 

llama_context: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized


Ready.
🚀 Loading Version9 onto GPU... 

llama_context: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized


Ready.
🚀 Loading Version10 onto GPU... 

llama_context: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized


Ready.

✨ GPU Validation Complete.
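
The one-line prompt string in the validation loop is hard to read and easy to break. As a sketch, the same Llama 3 template can be assembled by a small helper; `build_llama3_prompt` is a hypothetical name, and the strings are copied from the cell above rather than taken from any library.

```python
def build_llama3_prompt(system_msg, user_msg):
    """Assemble the same Llama 3 chat prompt used in the validation loop."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_msg}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        # End with an open assistant header so the model writes the reply
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

system_msg = "You are a survivor on the Ark submarine. You are gritty and superstitious."
prompt = build_llama3_prompt(system_msg, "Who is the Broker?")
print(prompt)
```

Inside the loop this would replace the f-string with `build_llama3_prompt(system_msg, test['question'])`. An alternative worth considering is llama-cpp-python's `create_chat_completion`, which takes a list of role/content messages and applies a chat template itself. Whether it picks the right template for a given GGUF file should be checked against the model's metadata.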


## Result Visualization
This final cell picks the "Winner": it averages Lore_Hits and Format_Correct for each model and ranks the models by mean Lore_Hits.

In [5]:
df = pd.DataFrame(results_list)

# Calculate average hits per model
summary = df.groupby("Model").agg({
    "Lore_Hits": "mean",
    "Format_Correct": "mean"
}).sort_values(by="Lore_Hits", ascending=False)

print("🏆 Model Comparison Summary:")
display(summary)

# Display a specific comparison
print("\n🔍 Sample Comparison for: 'Who is the Broker?'")
display(df[df['Question'] == "Who is the Broker?"][['Model', 'Response', 'Lore_Hits']])

🏆 Model Comparison Summary:

| Model | Lore_Hits | Format_Correct |
|---|---|---|
| Version5 | 3.533333 | 1.0 |
| Version9 | 2.066667 | 1.0 |
| Version3 | 1.6 | 0.0 |
| Version10 | 1.466667 | 1.0 |
| BASE_MODEL | 0.8 | 0.0 |

🔍 Sample Comparison for: 'Who is the Broker?'

| | Model | Response | Lore_Hits |
|---|---|---|---|
| 1 | BASE_MODEL | (sharply) Ah, the Broker? You mean that slippe... | 2 |
| 16 | Version3 | The Broker is the middleman. They control the ... | 2 |
| 31 | Version5 | [Mood: Warning] *Eyes you seriously.* The Brok... | 3 |
| 46 | Version9 | [Mood: Warning] *Eyes narrowing.* The Broker? ... | 4 |
| 61 | Version10 | [Mood: Suspicious] *Spits on the deck.* He's t... | 4 |
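
Raw Lore_Hits counts depend on how many keywords each expected answer happens to contain, so questions with long answers weigh more in the average. One possible refinement, sketched below under that assumption, is to score the fraction of keywords found and to match whole words rather than substrings. `lore_hit_rate` is a hypothetical helper and the sample strings are made up; this is a different metric from the one the notebook computes.

```python
import re

def lore_hit_rate(response, keywords):
    """Fraction of expected keywords that appear as whole words in the response."""
    if not keywords:
        return 0.0
    response_words = set(re.findall(r"\w+", response.lower()))
    hits = sum(1 for word in keywords if word in response_words)
    return hits / len(keywords)

keywords = {"broker", "guards", "hatch", "every"}            # hypothetical keyword set
response = "[Mood: Warning] The Broker? He guards the hatch."  # hypothetical reply

print(lore_hit_rate(response, keywords))  # 0.75
```

Unlike the substring check in the validation loop, whole-word matching would not count "hatch" inside "hatches", so scores from the two approaches are not directly comparable.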
