To convert an existing GGML model to GGUF, run the following command from a checkout of llama.cpp:

https://github.com/ggml-org/llama.cpp


python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin
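
After the conversion it can help to confirm that the output file really is GGUF before loading it. The sketch below is a minimal check, not part of llama.cpp: it relies on GGUF files starting with the four magic bytes `GGUF` followed by a little-endian 32-bit version number, and the helper name `read_gguf_version` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file does not look like GGUF."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The magic is followed by an unsigned 32-bit version (assumed little-endian here).
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("models/openorca-platypus2-13b.gguf.q4_0.bin")` on the converted file should return an integer, while a legacy GGML file should return `None`.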

pip install --upgrade --quiet llama-cpp-python

In [4]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

In [5]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

In [14]:
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="../LLM-Quantize-Model/qwen2.5-1.5b-instruct-q5_k_m.gguf",
    temperature=0.75,
    max_tokens=4096,
    top_p=1,
    callback_manager=callback_manager,
    verbose=False,  # Reduces llama.cpp logging; the streamed answer below was still printed with this setting
)

llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
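
The second log line reports a 512-token context window, while the cell above asks for `max_tokens=4096`. Prompt and completion share the same window, so the usable completion length is smaller than the requested value. The sketch below only shows that arithmetic; `completion_budget` is a hypothetical helper and the prompt length of 20 tokens is an assumed figure. LangChain's `LlamaCpp` also accepts an `n_ctx` argument, which is probably the more direct fix if a longer answer is needed.

```python
def completion_budget(n_ctx, prompt_tokens, max_tokens):
    """Largest number of new tokens that still fits in the context window."""
    if prompt_tokens >= n_ctx:
        return 0  # the prompt alone already fills the window
    return min(max_tokens, n_ctx - prompt_tokens)


# Values from the cell above, with an assumed prompt length of 20 tokens.
budget = completion_budget(n_ctx=512, prompt_tokens=20, max_tokens=4096)
print(budget)  # 492
```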


In [15]:
question = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm.invoke(question)

Answer: A rap battle between Stephen Colbert and John Oliver was never actually held. The idea of a "rap battle" as described in the question is not a form of music, but rather an event where two people perform rhyming lyrics for entertainment.

The events Stephen Colbert and John Oliver are known for engaging in a variety of formats including debates on current issues, television interviews, and public appearances such as the 2014 World Cup. These engagements are meant to engage in dialogue about different topics like politics, sports, culture, etc.

In this context, we cannot hold a rap battle between Stephen Colbert and John Oliver as described in the question because they are not artists who perform rhyming lyrics for entertainment. Rather, they are hosts of political talk shows where they debate current issues.
  
So to conclude: A rap battle between Stephen Colbert and John Oliver is an idea that does not exist because their roles as hosts on political talk shows do not involve p

'Answer: A rap battle between Stephen Colbert and John Oliver was never actually held. The idea of a "rap battle" as described in the question is not a form of music, but rather an event where two people perform rhyming lyrics for entertainment.\n\nThe events Stephen Colbert and John Oliver are known for engaging in a variety of formats including debates on current issues, television interviews, and public appearances such as the 2014 World Cup. These engagements are meant to engage in dialogue about different topics like politics, sports, culture, etc.\n\nIn this context, we cannot hold a rap battle between Stephen Colbert and John Oliver as described in the question because they are not artists who perform rhyming lyrics for entertainment. Rather, they are hosts of political talk shows where they debate current issues.\n  \nSo to conclude: A rap battle between Stephen Colbert and John Oliver is an idea that does not exist because their roles as hosts on political talk shows do not in
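
`PromptTemplate` is imported above but the question is passed as a raw string. A template with an explicit answer slot is one way to make the prompt shape repeatable. The sketch below uses plain `str.format` so it runs without LangChain; the template text and the name `build_prompt` are assumptions, and `PromptTemplate.from_template(TEMPLATE).format(question=...)` is expected to produce the same string.

```python
TEMPLATE = "Question: {question}\n\nAnswer:"


def build_prompt(question):
    """Fill the question into the template, trimming surrounding whitespace."""
    return TEMPLATE.format(question=question.strip())


prompt = build_prompt("A rap battle between Stephen Colbert and John Oliver")
print(prompt)
# The result could then be passed to the model, e.g. llm.invoke(prompt)
```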

In [16]:
# pip install llama-cpp-python 
# huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF qwen2.5-1.5b-instruct-q5_k_m.gguf --local-dir . --local-dir-use-symlinks False

from llama_cpp import Llama

# Initialize the Llama model
llm = Llama(
    model_path='../LLM-Quantize-Model/whiterabbitneo.gguf',  # Adjust to the location of your GGUF file
    n_gpu_layers=4,  # Number of layers to offload to the GPU; 0 keeps the model on the CPU
    n_ctx=32768,  # Context window in tokens
    verbose=False
)

# Define the user input (prompt); note that the messages list in this cell does not use it
user_input = "Explain the importance of AI in education."

# Prepare the message.
# Caution: this image/text content list follows the Qwen-VL convention. A text-only
# GGUF model loaded without a multimodal chat handler cannot read the image.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Generate a response from the model. Sampling options are per-call arguments,
# so they are set here rather than in the Llama constructor.
response = llm.create_chat_completion(
    messages=messages,
    temperature=1.2,
    top_p=0.5,
    repeat_penalty=1.178,
    stop=['<|im_end|>'],
    max_tokens=1500,
)

# Extract and print the response in one line
response_content = response['choices'][0]['message']['content']
print(response_content)

llama_new_context_with_model: n_ctx_pre_seq (32768) > n_ctx_train (8192) -- possible training context overflow


[{'type': 'image', 'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, {'type': 'text', 'text': "This is an example image for AI explanation."}]
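
The output above echoes the message structure instead of answering, and `user_input` never reaches the model. For a text-only model, a plain string as the message content is the usual shape. The sketch below builds such a message list and extracts the reply; `build_text_messages` and `first_reply` are hypothetical helpers, and the commented-out call assumes `llm` is the `Llama` instance from the cell above.

```python
def build_text_messages(user_input, system_prompt=None):
    """Build an OpenAI-style chat message list with plain-string content."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_input})
    return messages


def first_reply(response):
    """Return the text of the first choice in a chat-completion response dict."""
    return response["choices"][0]["message"]["content"]


messages = build_text_messages("Explain the importance of AI in education.")

# With the model from the cell above loaded as `llm`:
# response = llm.create_chat_completion(messages=messages, max_tokens=1500)
# print(first_reply(response))
```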
