In [1]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import os

In [2]:
# --- Model Download Setup ---
# 1. Define where you want to save the model
local_model_dir = r"D:\VulnScanAI\pretrained_language_model"
os.makedirs(local_model_dir, exist_ok=True)

In [3]:
# 2. Specify the model repo and the specific GGUF file
model_id = "TheBloke/OpenHermes-2.5-Mistral-7B-GGUF"
# Using Q4_K_M is a good balance for 16GB RAM for CPU inference.
# It's about 4.6GB.
filename = "openhermes-2.5-mistral-7b.Q4_K_M.gguf"

In [4]:
# 3. Download the GGUF file
print(f"Downloading {filename} from {model_id} to {local_model_dir}...")
model_path = hf_hub_download(
    repo_id=model_id,
    filename=filename,
    local_dir=local_model_dir,
    local_dir_use_symlinks=False # Ensure it's copied
)
print(f"Model downloaded to: {model_path}")

Downloading openhermes-2.5-mistral-7b.Q4_K_M.gguf from TheBloke/OpenHermes-2.5-Mistral-7B-GGUF to D:\VulnScanAI\pretrained_language_model...


For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


openhermes-2.5-mistral-7b.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

Model downloaded to: D:\VulnScanAI\pretrained_language_model\openhermes-2.5-mistral-7b.Q4_K_M.gguf


In [5]:
# --- Model Loading and Inference ---
# 4. Load the model using llama-cpp-python
print("Loading the model...")
llm = Llama(
    model_path=model_path,
    n_ctx=2048,  # Context window size. You can increase this if you need longer conversations,
                 # but it will use more RAM.
    n_gpu_layers=10, # <--- THIS IS THE KEY CHANGE for your MX550
                     # Start with a small number like 10-15 for a 7B model on 2GB VRAM.
                     # If it gives a "CUDA out of memory" error, reduce this number.
                     # If it runs, you can try increasing it slightly (e.g., 20, 25)
                     # to find the maximum that fits.
    verbose=True # Set to True to see more details during loading, including GPU layer offloading
)
print("Model loaded successfully!")

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from D:\VulnScanAI\pretrained_language_model\openhermes-2.5-mistral-7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = teknium_openhermes-2.5-mistral-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_m

Loading the model...


load_tensors:  CPU_AARCH64 model buffer size =  3204.00 MiB
load_tensors:   CPU_Mapped model buffer size =  4165.38 MiB
repack: repack tensor blk.0.attn_q.weight with q4_K_8x8
repack: repack tensor blk.0.attn_k.weight with q4_K_8x8
repack: repack tensor blk.0.attn_output.weight with q4_K_8x8
repack: repack tensor blk.0.ffn_gate.weight with q4_K_8x8
.repack: repack tensor blk.0.ffn_up.weight with q4_K_8x8
repack: repack tensor blk.1.attn_q.weight with q4_K_8x8
.repack: repack tensor blk.1.attn_k.weight with q4_K_8x8
repack: repack tensor blk.1.attn_output.weight with q4_K_8x8
repack: repack tensor blk.1.ffn_gate.weight with q4_K_8x8
.repack: repack tensor blk.1.ffn_up.weight with q4_K_8x8
repack: repack tensor blk.2.attn_q.weight with q4_K_8x8
.repack: repack tensor blk.2.attn_k.weight with q4_K_8x8
repack: repack tensor blk.2.attn_output.weight with q4_K_8x8
repack: repack tensor blk.2.ffn_gate.weight with q4_K_8x8
.repack: repack tensor blk.2.ffn_up.weight with q4_K_8x8
repack: repack

Model loaded successfully!


In [7]:
# 5. Generate text for tagline
print("\n--- Generating Tagline ---")
prompt_tagline = "Write a creative tagline for a new coffee shop called 'The Daily Grind'."
output_tagline = llm(
    prompt_tagline,
    max_tokens=50,
    stop=["\n", "Q:"], # Stop generation at a newline or "Q:"
    echo=False # Don't echo the prompt back
)
print("Generated tagline:")
print(output_tagline["choices"][0]["text"].strip()) # Use .strip() to clean up whitespace

# Example for a simple chat interaction
print("\n--- Chat Example ---")

# Define a function for cleaner chat interaction
def generate_chat_response(llm_instance, user_input_text, max_tokens=150):
    # This format is common for Mistral-like models for chat
    formatted_prompt = f"<s>[INST] {user_input_text} [/INST]"
    chat_output = llm_instance(
        formatted_prompt,
        max_tokens=max_tokens,
        stop=["</s>"], # Stop at the end of response marker for Mistral
        echo=False
    )
    return chat_output["choices"][0]["text"].strip()

user_query = "What are the benefits of learning Python?"
print(f"User: {user_query}")
chat_response_text = generate_chat_response(llm, user_query)
print(f"LLM: {chat_response_text}")

user_query_2 = "Can you give me a simple example of a Python for loop?"
print(f"User: {user_query_2}")
chat_response_text_2 = generate_chat_response(llm, user_query_2, max_tokens=100)
print(f"LLM: {chat_response_text_2}")


--- Generating Tagline ---


Llama.generate: 1 prefix-match hit, remaining 17 prompt tokens to eval
llama_perf_context_print:        load time =    1992.72 ms
llama_perf_context_print: prompt eval time =     935.86 ms /    17 tokens (   55.05 ms per token,    18.17 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =     936.71 ms /    18 tokens
Llama.generate: 1 prefix-match hit, remaining 16 prompt tokens to eval


Generated tagline:


--- Chat Example ---
User: What are the benefits of learning Python?


llama_perf_context_print:        load time =    1992.72 ms
llama_perf_context_print: prompt eval time =     835.41 ms /    16 tokens (   52.21 ms per token,    19.15 tokens per second)
llama_perf_context_print:        eval time =   27705.87 ms /   149 runs   (  185.95 ms per token,     5.38 tokens per second)
llama_perf_context_print:       total time =   28631.14 ms /   165 tokens
Llama.generate: 5 prefix-match hit, remaining 17 prompt tokens to eval


LLM: Python is a popular programming language that has gained widespread use across various industries due to its versatility and ease of use. Here are some of the benefits of learning Python:

1. Easy to Learn: Python has a simple syntax and is easy to learn compared to other languages like C++ or Java. It has a clear and concise syntax, making it an excellent choice for beginners.

2. High Demand: Python is in high demand in various industries, including finance, healthcare, science, and artificial intelligence, making it an excellent choice for professionals looking to enter these fields.

3. Versatility: Python can be used for various purposes, including web development, data analysis, scientific computing
User: Can you give me a simple example of a Python for loop?


llama_perf_context_print:        load time =    1992.72 ms
llama_perf_context_print: prompt eval time =    1229.05 ms /    17 tokens (   72.30 ms per token,    13.83 tokens per second)
llama_perf_context_print:        eval time =   18448.05 ms /    99 runs   (  186.34 ms per token,     5.37 tokens per second)
llama_perf_context_print:       total time =   19731.75 ms /   116 tokens


LLM: [INST] Sure! Here's a simple example of a Python for loop:

```python
for i in range(1, 11):
    print(i)
```

In this example, we use the `range()` function to create a sequence of numbers from 1 to 10. The `for` loop then iterates over this sequence and assigns each value to the variable `i`. Finally, we print the
