# 🦙 Local LLM inference with LlamaCpp

Download the model in GGUF format (about 15 GB). The code below loads it from `../models/`, so save it there:

```bash
wget -P ../models https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q2_K.gguf
```

Make sure to pick a model that is already fine-tuned for chat. Such models usually have `instruct` or `chat` in the name.
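The naming convention above can be checked with a quick heuristic before loading a multi-gigabyte file. This is only a sketch: `looks_chat_tuned` is a hypothetical helper, not part of LangChain or llama.cpp, and a file name is no guarantee of how the model was trained.

```python
from pathlib import Path


def looks_chat_tuned(model_path: str) -> bool:
    """Heuristic: True if the file name contains 'instruct' or 'chat'."""
    name = Path(model_path).name.lower()
    return any(tag in name for tag in ("instruct", "chat"))


# The model used in this notebook passes the check.
print(looks_chat_tuned("../models/mixtral-8x7b-instruct-v0.1.Q2_K.gguf"))
```

If the check fails, the model may still work, but it is worth reading the model card before using it for chat-style prompts.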

In [1]:
import sys
!{sys.executable} -m pip install langchain langchain-community llama-cpp-python

from langchain.prompts import ChatPromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_core.output_parsers import StrOutputParser



In [2]:
prompt = ChatPromptTemplate.from_template("""You are an assistant, answer the question briefly.

    User: {input}
    AI Assistant:"""
)

llm = LlamaCpp(
    model_path="../models/mixtral-8x7b-instruct-v0.1.Q2_K.gguf",
    temperature=0.01,
    max_tokens=2000,
    top_p=1,
    n_threads=8,
    n_ctx=2048,
    f16_kv=True,
    # n_gpu_layers=40,  # Change this value based on your model and your GPU VRAM pool.
    # n_batch=512,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
)

output_parser = StrOutputParser()

# Basic chain https://python.langchain.com/docs/expression_language/get_started
chain = prompt | llm | output_parser

llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from ../models/mixtral-8x7b-instruct-v0.1.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:      

In [3]:
# Query the chain and stream the response
for chunk in chain.stream({"input": "What is the capital of the Netherlands?"}):
    print(chunk, end="", flush=True)

# Query the LLM directly:
# for chunk in llm.stream("What is the capital of the Netherlands?"):
#     print(chunk, end="", flush=True)

# Not streaming:
# print(chain.invoke({"input": "What is the capital of the Netherlands?"}))

 The capital of the Netherlands is Amsterdam.


llama_print_timings:        load time =    7223.89 ms
llama_print_timings:      sample time =       8.27 ms /     9 runs   (    0.92 ms per token,  1087.74 tokens per second)
llama_print_timings: prompt eval time =   29350.15 ms /    31 tokens (  946.78 ms per token,     1.06 tokens per second)
llama_print_timings:        eval time =   24805.58 ms /     8 runs   ( 3100.70 ms per token,     0.32 tokens per second)
llama_print_timings:       total time =   54238.11 ms
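To compare runs (for example CPU-only versus a run with `n_gpu_layers` enabled), it can help to recompute throughput from the timing lines instead of reading them by eye. The sketch below parses the three per-token lines copied from the output above. The regular expression and the `tokens_per_second` helper are assumptions written for this exact log layout, which may differ between llama.cpp versions.

```python
import re

# Timing lines copied from the llama.cpp output above.
LOG = """\
llama_print_timings:      sample time =       8.27 ms /     9 runs   (    0.92 ms per token,  1087.74 tokens per second)
llama_print_timings: prompt eval time =   29350.15 ms /    31 tokens (  946.78 ms per token,     1.06 tokens per second)
llama_print_timings:        eval time =   24805.58 ms /     8 runs   ( 3100.70 ms per token,     0.32 tokens per second)
"""

# Captures the stage name, the elapsed milliseconds and the token/run count.
PATTERN = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms /"
    r"\s+(?P<count>\d+) (?:runs|tokens)"
)


def tokens_per_second(log: str) -> dict[str, float]:
    """Return tokens per second for each stage found in the log text."""
    rates = {}
    for match in PATTERN.finditer(log):
        seconds = float(match["ms"]) / 1000.0
        rates[match["stage"]] = int(match["count"]) / seconds
    return rates


rates = tokens_per_second(LOG)
for stage, rate in rates.items():
    print(f"{stage}: {rate:.2f} tokens/s")
```

For this run the `eval` stage works out to roughly 0.32 tokens per second, matching the logged figure: generation on CPU with this quantization is the bottleneck, not sampling.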
