# 3. Model Testing (Inference Simulation)

This notebook simulates inference on the target hardware (4 CPU cores, 4 GB RAM).
We assume the model has already been converted to GGUF format (e.g., `qwen2.5-3b-reminder-bot-q4_k_m.gguf`).

If the GGUF file is missing, the loading cell reports an error and no inference is run.
The notebook uses `llama-cpp-python` to run GGUF models from Python.
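
Before loading, it can help to confirm that the file really is a GGUF file. The sketch below is a minimal check, not part of `llama-cpp-python`: it assumes the usual layout in which a GGUF file starts with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number. The helper name `gguf_version` is hypothetical.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # assumed: the first four bytes of a GGUF file


def gguf_version(path):
    """Return the GGUF version number, or None if the file is missing or not GGUF."""
    p = Path(path)
    if not p.is_file():
        return None
    with p.open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumed: the magic is followed by a little-endian uint32 version field.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    # "model_q4_k_m.gguf" is the path used in this notebook.
    version = gguf_version("model_q4_k_m.gguf")
    print("not a readable GGUF file" if version is None else f"GGUF v{version}")
```

A `None` result means the conversion step has not produced a usable file yet, so there is no point in constructing `Llama`.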

## Prerequisites
```bash
pip install llama-cpp-python
```


In [2]:
from llama_cpp import Llama
import json

# Path to your GGUF model
# You must have run the conversion in step 2 or downloaded a GGUF version
MODEL_PATH = "model_q4_k_m.gguf" 

try:
    # Load the model with settings sized for the 4 GB RAM / 4 CPU target.
    # n_ctx=2048 keeps the context window (and its KV cache) small.
    llm = Llama(
        model_path=MODEL_PATH,
        n_ctx=2048,
        n_threads=4,      # 4 CPU cores
        n_gpu_layers=0    # 0 GPU layers (CPU only simulation)
    )
    print("Model loaded successfully on CPU.")
except Exception as e:
    print(f"Error loading model (Did you convert it to GGUF?): {e}")
    llm = None

llama_model_loader: loaded meta data with 26 key-value pairs and 434 tensors from model_q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.800000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.700000
llama_model_loader: - kv   5:                               general.name str              = Merged_Model
llama_model_loader: - kv   6:                         general.size_label str              = 3.1B
llama_model_loader: - kv   7:                          qwen2.block_count u32    

Model loaded successfully on CPU.


CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Model metadata: {'general.file_type': '15', 'tokenizer.ggml.bos_token_id': '151643', 'qwen2.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.eos_token_id': '151645', 'qwen2.rope.freq_base': '1000000.000000', 'general.architecture': 'qwen2', 'general.sampling.top_k': '20', 'tokenizer.ggml.padding_token_id': '151645', 'qwen2.embedding_length': '2048', 'general.sampling.top_p': '0.800000', 'qwen2.context_length': '32768', 'tokenizer.chat_template': '{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0][\'role\'] == \'system\' %}\n        {{- messages[0][\'content\'] }}\n    {%- else %}\n        {{- \'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\' }}\n    {%- endif %}\n    {{- "\\n\\n# Tools\\n\\nYou may call one or more func

In [3]:
import pandas as pd

def extract_reminder(text, context_date="2026-02-18"):
    if not llm:
        return "Model not loaded."
    
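    # The system prompt is kept in Russian, the language of the test messages.
    # English translation:
    #   You are a system for extracting reminder parameters.
    #   Your task: extract the text, date, time and recurrence from the user's
    #   message and return JSON.
    #   Use the current date (Context Date) to resolve relative dates.
    #   Format: {"text": "...", "date": "YYYY-MM-DD", "time": "HH:MM", "repeat": "..."}
    system_prompt = """Ты — система для извлечения параметров напоминаний.
Твоя задача: извлечь текст, дату, время и периодичность из сообщения пользователя и вернуть JSON.
Используй текущую дату (Context Date) для разрешения относительных дат.
Формат: {"text": "...", "date": "YYYY-MM-DD", "time": "HH:MM", "repeat": "..."}"""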
    
    user_prompt = f"Context Date: {context_date}\nMessage: \"{text}\"\n\nJSON:"
    
    prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n"
    
    output = llm(
        prompt,
        max_tokens=256,
        stop=["<|im_end|>"],
        temperature=0.1,
        echo=False
    )
    
    return output['choices'][0]['text']

# Load up to 25 random messages from the CSV
try:
    df = pd.read_csv("user_messages.csv")
    # Keep only non-empty messages longer than 10 characters
    df = df.dropna(subset=['message_text'])
    df = df[df['message_text'].str.len() > 10]
    
    # Sample up to 25 random messages (fewer if the file has fewer rows)
    sample_df = df.sample(min(25, len(df)))
    test_messages = sample_df['message_text'].tolist()
    print(f"Loaded {len(test_messages)} random messages from CSV.")
    
except Exception as e:
    print(f"Error loading CSV, using default examples: {e}")
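    test_messages = [
        "Напомни купить молоко завтра в 10 утра",  # "Remind me to buy milk tomorrow at 10 am"
        "Встреча с боссом 25 февраля в 14:00",     # "Meeting with the boss on February 25 at 14:00"
        "Каждый день в 9:00 пить таблетки"         # "Take pills every day at 9:00"
    ]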

for msg in test_messages:
    if llm:
        print(f"Input: {msg}")
        # Strip surrounding whitespace and newlines for cleaner output display
        print(f"Output: {extract_reminder(msg).strip()}")
        print("-" * 20)

Error loading CSV, using default examples: [Errno 2] No such file or directory: 'user_messages.csv'
Input: Напомни купить молоко завтра в 10 утра


llama_perf_context_print:        load time =     555.72 ms
llama_perf_context_print: prompt eval time =     555.48 ms /   151 tokens (    3.68 ms per token,   271.84 tokens per second)
llama_perf_context_print:        eval time =    1864.96 ms /    44 runs   (   42.39 ms per token,    23.59 tokens per second)
llama_perf_context_print:       total time =    2462.21 ms /   195 tokens
llama_perf_context_print:    graphs reused =         41
Llama.generate: 127 prefix-match hit, remaining 27 prompt tokens to eval


Output: {"date": "2026-02-19", "repeat": "none", "text": "Напомни купить молоко", "time": "10:00"}
--------------------
Input: Встреча с боссом 25 февраля в 14:00


llama_perf_context_print:        load time =     555.72 ms
llama_perf_context_print: prompt eval time =     118.27 ms /    27 tokens (    4.38 ms per token,   228.30 tokens per second)
llama_perf_context_print:        eval time =    1866.07 ms /    43 runs   (   43.40 ms per token,    23.04 tokens per second)
llama_perf_context_print:       total time =    2013.43 ms /    70 tokens
llama_perf_context_print:    graphs reused =         40
Llama.generate: 127 prefix-match hit, remaining 24 prompt tokens to eval


Output: {"date": "2026-02-25", "repeat": null, "text": "Встреча с боссом", "time": "14:00"}
--------------------
Input: Каждый день в 9:00 пить таблетки


llama_perf_context_print:        load time =     555.72 ms
llama_perf_context_print: prompt eval time =     146.31 ms /    24 tokens (    6.10 ms per token,   164.04 tokens per second)
llama_perf_context_print:        eval time =    1780.58 ms /    42 runs   (   42.39 ms per token,    23.59 tokens per second)
llama_perf_context_print:       total time =    1954.91 ms /    66 tokens
llama_perf_context_print:    graphs reused =         39


Output: {"date": null, "repeat": "daily", "text": "Каждый день в 9:00 пить таблетки", "time": "09:00"}
--------------------
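
The outputs above are raw strings, and the `repeat` field appears both as the string `"none"` and as JSON `null`. A downstream consumer would likely want to parse and normalize them. The following is a minimal sketch under two assumptions that the notebook does not state: all four keys from the prompt's format line are required, and `"none"` and `null` both mean "no repetition". The name `parse_reminder` is hypothetical.

```python
import json

# Keys taken from the "Format:" line of the system prompt.
REQUIRED_KEYS = {"text", "date", "time", "repeat"}


def parse_reminder(raw):
    """Parse the model's raw text into a dict, or return None if it is unusable."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    # Assumption: the string "none" and JSON null both mean "no repetition".
    if isinstance(data["repeat"], str) and data["repeat"].lower() == "none":
        data["repeat"] = None
    return data


# First output from the run above, copied verbatim.
raw = '{"date": "2026-02-19", "repeat": "none", "text": "Напомни купить молоко", "time": "10:00"}'
print(parse_reminder(raw))
```

Returning `None` for malformed output lets the caller decide whether to retry the generation or fall back to asking the user.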


## Production setup notes

For the actual 4 GB / 4 CPU deployment, you only need:
1. The `.gguf` file.
2. A lightweight Python script using `llama-cpp-python` (like the one above).
3. No heavy dependencies such as `torch` or `transformers`; `llama-cpp-python` alone is enough.
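
As a rough illustration of item 2, such a script might look like the sketch below. It is not the deployed code: `build_messages` and `run` are hypothetical names, the context date defaults to today instead of the fixed `2026-02-18` used in testing, and it assumes that `create_chat_completion` applies the chat template stored in the GGUF metadata and that this template renders the same ChatML layout as the hand-built prompt in the test cell. That equivalence should be verified before relying on it.

```python
from datetime import date


def build_messages(system_prompt, text, context_date=None):
    """Build chat messages with the same user-prompt layout as the test cell."""
    if context_date is None:
        # Assumption: production uses the current date as Context Date.
        context_date = date.today().isoformat()  # YYYY-MM-DD
    user_prompt = f'Context Date: {context_date}\nMessage: "{text}"\n\nJSON:'
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def run(model_path, system_prompt, text):
    """Load the model on CPU and return the raw completion text."""
    # Imported lazily so build_messages works without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(
        model_path=model_path,
        n_ctx=2048,
        n_threads=4,      # 4 CPU cores
        n_gpu_layers=0,   # CPU only
        verbose=False,    # suppress the loader and perf logs
    )
    result = llm.create_chat_completion(
        messages=build_messages(system_prompt, text),
        max_tokens=256,
        temperature=0.1,
    )
    return result["choices"][0]["message"]["content"]
```

In a long-running service the `Llama` object would normally be created once at startup and reused, since the run above shows a load time of roughly half a second per initialization.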
