# Llama CPP Python: Run LLMs on a Local Machine

## Table of Contents

1. **Generate Text**
2. **Streaming Response**
3. **Pulling Model from HuggingFace**
4. **Chat Completion**
5. **Generate Embeddings**

## Installation


* **pip install llama-cpp-python**

In [2]:
pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.7.tar.gz (66.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.7/66.7 MB 11.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 3.9 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.7-cp311-cp311-linux_x86_64.whl size=4552831 sha256=c372501a4a2eed047552

In [3]:
import llama_cpp

llama_cpp.__version__

'0.3.7'

## 1. Generate Text

* **TheBloke/Llama-2-7B-Chat-GGUF**

In [5]:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q2_K.gguf",
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


llama-2-7b-chat.Q2_K.gguf:   0%|          | 0.00/2.83G [00:00<?, ?B/s]

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/./llama-2-7b-chat.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_co

In [6]:
resp = llm("Q: Write a short paragraph introducing Elon Musk. A: ",
           max_tokens=256,
           stop=["Q:", "\n"])

llama_perf_context_print:        load time =    5861.05 ms
llama_perf_context_print: prompt eval time =    5860.81 ms /    17 tokens (  344.75 ms per token,     2.90 tokens per second)
llama_perf_context_print:        eval time =   86805.93 ms /   165 runs   (  526.10 ms per token,     1.90 tokens per second)
llama_perf_context_print:       total time =   92761.04 ms /   182 tokens
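The per-token figures in the timing log follow directly from the elapsed time and the token count. As a small worked example, the sketch below recomputes the numbers on the `eval time` line; `throughput` is a hypothetical helper written for illustration and is not part of llama-cpp-python.

```python
def throughput(total_ms, n_tokens):
    """Return (milliseconds per token, tokens per second) for a timed run."""
    ms_per_token = total_ms / n_tokens
    tokens_per_second = n_tokens / (total_ms / 1000.0)
    return ms_per_token, tokens_per_second


# Values copied from the `eval time` line of the log above.
ms_tok, tok_s = throughput(86805.93, 165)
print(f"{ms_tok:.2f} ms per token, {tok_s:.2f} tokens per second")
```

At roughly 2 tokens per second, a 256-token answer takes about two minutes on this machine, which is why streaming the output is attractive.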


In [7]:
resp

{'id': 'cmpl-3d3be2d2-ad94-4844-b13d-78c3769d6ac5',
 'object': 'text_completion',
 'created': 1738937069,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/./llama-2-7b-chat.Q2_K.gguf',
 'choices': [{'text': " Elon Musk is a South African-born entrepreneur, inventor, and business magnate who has made a significant impact in various industries, including transportation, energy, and space exploration. He is best known for his innovative ideas and technological advancements in the electric car industry through his company Tesla, where he has revolutionized the way people think about and interact with electric vehicles. He is also the CEO of SpaceX, a private aerospace company that has developed cutting-edge technology to make human spaceflight more accessible and affordable. Musk's vision and leadership have transformed the way we think about transportation, energy, and space travel, and he continues to push t

In [8]:
resp["choices"][0]["text"]

" Elon Musk is a South African-born entrepreneur, inventor, and business magnate who has made a significant impact in various industries, including transportation, energy, and space exploration. He is best known for his innovative ideas and technological advancements in the electric car industry through his company Tesla, where he has revolutionized the way people think about and interact with electric vehicles. He is also the CEO of SpaceX, a private aerospace company that has developed cutting-edge technology to make human spaceflight more accessible and affordable. Musk's vision and leadership have transformed the way we think about transportation, energy, and space travel, and he continues to push the boundaries of innovation and technological advancement in his various ventures."

## 2. Streaming Response

In [9]:
resp = llm("Q: Write a short paragraph introducing Elon Musk. A: ",
           max_tokens=256,
           stop=["Q:", "\n"],
           stream=True)

resp

<generator object Llama._create_completion at 0x3b63680>

In [10]:
for r in resp:
    print(r["choices"][0]["text"], end="")

Llama.generate: 16 prefix-match hit, remaining 1 prompt tokens to eval


 Elon Musk is a South African-born entrepreneur, inventor, and business magnate who has made a significant impact in the fields of technology and engineering. He is best known for his involvement in the development of electric cars, solar energy, and space exploration. As the CEO of Tesla and SpaceX, Musk has revolutionized the automotive and aerospace industries through innovative products and cutting-edge technology. He is also known for his vision of a sustainable future, where renewable energy and advanced transportation systems play a critical role in reducing humanity's reliance on fossil fuels. With a reputation for being a visionary leader and a driving force behind some of the most exciting and innovative companies in the world, Elon Musk is undoubtedly one of the most influential figures in modern technology.

llama_perf_context_print:        load time =    5861.05 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =   98303.90 ms /   185 runs   (  531.37 ms per token,     1.88 tokens per second)
llama_perf_context_print:       total time =   98497.89 ms /   186 tokens
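When streaming, each chunk carries only a small piece of text, so it is often useful to print the pieces as they arrive and still keep the full answer at the end. The sketch below shows one way to do that; `collect_stream` is a hypothetical helper, and `fake_stream` is a stand-in generator that only mimics the shape of the chunks used in the loop above, so the example runs without a model.

```python
def collect_stream(chunks, echo=True):
    """Print each streamed piece as it arrives and return the full text."""
    pieces = []
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        if echo:
            print(piece, end="", flush=True)
        pieces.append(piece)
    return "".join(pieces)


def fake_stream():
    """Stand-in generator yielding chunks shaped like the streamed output above."""
    for piece in [" Elon", " Musk", " is", " an", " entrepreneur."]:
        yield {"choices": [{"text": piece, "index": 0, "finish_reason": None}]}


full_text = collect_stream(fake_stream())
```

With a real model, the generator returned by the `stream=True` call would be passed in place of `fake_stream()`.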


## 3. Pulling Model from HuggingFace

* **pip install huggingface-hub**

By default, **from_pretrained** downloads the model to the **Hugging Face cache directory**. You can then manage the installed model files with the **huggingface-cli tool**.

* **TheBloke/Mistral-7B-v0.1-GGUF**
* **TheBloke/Mistral-7B-Instruct-v0.2-GGUF**

In [11]:
from llama_cpp import Llama

mistral = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="*Q4_K_S.gguf",
    verbose=False,
)

resp = mistral("Q: Write a short paragraph introducing Elon Musk. A: ",
               max_tokens=256,
               stop=["Q:", "\n"])

mistral-7b-v0.1.Q4_K_S.gguf:   0%|          | 0.00/4.14G [00:00<?, ?B/s]

llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


In [12]:
resp

{'id': 'cmpl-7b7a2f7b-0b44-401f-b4c6-1f9cb5273fe9',
 'object': 'text_completion',
 'created': 1738937405,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/./mistral-7b-v0.1.Q4_K_S.gguf',
 'choices': [{'text': '2018, 2018, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk, Elon Musk,',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 16, 'completion_tokens': 256, 'total_tokens': 272}}
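The `finish_reason` of `'length'` together with `completion_tokens` of 256 indicates that generation stopped at `max_tokens` rather than at a stop sequence. Mistral-7B-v0.1 is a base model rather than an instruction-tuned one, which may explain the repetitive output; the Instruct variant listed above is likely a better fit for this kind of prompt. The sketch below shows one way to check a response for truncation before using it. `summarize_completion` is a hypothetical helper, and `sample` is a hand-written stand-in with the same shape as the dictionary above.

```python
def summarize_completion(resp):
    """Extract the text, truncation flag, and token usage from a completion dict."""
    choice = resp["choices"][0]
    return {
        "text": choice["text"].strip(),
        # 'length' means max_tokens was reached before a stop sequence.
        "truncated": choice.get("finish_reason") == "length",
        "completion_tokens": resp["usage"]["completion_tokens"],
    }


# Hand-written stand-in mirroring the structure of `resp` above.
sample = {
    "choices": [
        {"text": "2018, 2018, Elon Musk, Elon Musk,", "index": 0,
         "logprobs": None, "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 16, "completion_tokens": 256, "total_tokens": 272},
}

summary = summarize_completion(sample)
if summary["truncated"]:
    print("Output hit the token limit; consider a different model or prompt.")
```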

## 4. Chat Completion

In [14]:
from llama_cpp import Llama

# llama2_chat = Llama(model_path="./llama-2-7b-chat.Q4_K_S.gguf", verbose=False)

llama2_chat = Llama.from_pretrained(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q2_K.gguf",
    verbose=False,
)

resp = llama2_chat.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes individuals."},
        {"role": "user", "content": "Write a short paragraph introducing Elon Musk."},
    ]
)

resp

llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (4096) -- the full capacity of the model will not be utilized


{'id': 'chatcmpl-f118b813-8e44-422d-91b7-d803aa340da1',
 'object': 'chat.completion',
 'created': 1738937696,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/./llama-2-7b-chat.Q2_K.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "  Ah, Elon Musk! The visionary entrepreneur and innovator who has left an indelible mark on the tech and automotive industries. With a keen eye for detail and a boundless enthusiasm for disrupting traditional models, Elon has a unique ability to identify and capitalize on emerging trends. His unwavering commitment to sustainability and technological advancement has inspired countless others to join him in his quest to redefine the future. From revolutionizing the electric car industry with Tesla to making space travel accessible with SpaceX, Elon's unparalleled vision and leadership have cemented his status as a true pioneer in his field. Meet 

In [15]:
resp["choices"][0]["message"]

{'role': 'assistant',
 'content': "  Ah, Elon Musk! The visionary entrepreneur and innovator who has left an indelible mark on the tech and automotive industries. With a keen eye for detail and a boundless enthusiasm for disrupting traditional models, Elon has a unique ability to identify and capitalize on emerging trends. His unwavering commitment to sustainability and technological advancement has inspired countless others to join him in his quest to redefine the future. From revolutionizing the electric car industry with Tesla to making space travel accessible with SpaceX, Elon's unparalleled vision and leadership have cemented his status as a true pioneer in his field. Meet Elon Musk – a true original, and a force to be reckoned with!"}

In [16]:
new_resp = llama2_chat.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes individuals."},
        {"role": "user", "content": "Write a short paragraph introducing Elon Musk."},
        resp["choices"][0]["message"],
        {"role": "user", "content": "Can you please rephrase your previous response?"},
    ]
)

new_resp

{'id': 'chatcmpl-c8d9d8d4-fca5-41fc-8eab-5ea20530b3dc',
 'object': 'chat.completion',
 'created': 1738937832,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/./llama-2-7b-chat.Q2_K.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "  Of course! Here is a rephrased version of my previous response:\nElon Musk is a visionary entrepreneur and innovator who has had a profound impact on the tech and automotive industries. With a keen eye for detail and an unwavering commitment to disrupting traditional models, he has consistently identified and capitalized on emerging trends. His unwavering commitment to sustainability and technological advancement has inspired countless others to join him in his quest to redefine the future. From revolutionizing the electric car industry with Tesla to making space travel accessible with SpaceX, Elon's unparalleled vision and leadership have so

In [17]:
new_resp["choices"][0]["message"]

{'role': 'assistant',
 'content': "  Of course! Here is a rephrased version of my previous response:\nElon Musk is a visionary entrepreneur and innovator who has had a profound impact on the tech and automotive industries. With a keen eye for detail and an unwavering commitment to disrupting traditional models, he has consistently identified and capitalized on emerging trends. His unwavering commitment to sustainability and technological advancement has inspired countless others to join him in his quest to redefine the future. From revolutionizing the electric car industry with Tesla to making space travel accessible with SpaceX, Elon's unparalleled vision and leadership have solidified his status as a true pioneer in his field. Meet Elon Musk – a true original and a force to be reckoned with!"}
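The follow-up call works because the model has no memory between calls: the whole conversation, including the assistant's earlier reply, is sent again each time. A small helper can keep that bookkeeping tidy. This is a minimal sketch; `extend_history` is a hypothetical name, and `fake_resp` is a hand-written stand-in shaped like the chat response above so the example runs without a model.

```python
def extend_history(messages, resp, next_user_content):
    """Return a new message list with the assistant reply and the next user turn appended."""
    assistant_message = resp["choices"][0]["message"]
    return messages + [
        assistant_message,
        {"role": "user", "content": next_user_content},
    ]


history = [
    {"role": "system", "content": "You are an assistant who perfectly describes individuals."},
    {"role": "user", "content": "Write a short paragraph introducing Elon Musk."},
]

# Stand-in for the dictionary returned by create_chat_completion.
fake_resp = {
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Elon Musk is an entrepreneur."}}
    ]
}

next_messages = extend_history(
    history, fake_resp, "Can you please rephrase your previous response?"
)
print([m["role"] for m in next_messages])
```

The resulting list can be passed as `messages` to the next `create_chat_completion` call. Because the history grows with every turn, long conversations will eventually exceed the context window.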

## 5. Generate Embeddings

In [19]:
from llama_cpp import Llama

llama2_chat = Llama.from_pretrained(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q2_K.gguf",
    verbose=False,
    embedding=True,
)

embeddings = llama2_chat.create_embedding("Hello, world!")

embeddings

llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (4096) -- the full capacity of the model will not be utilized


{'object': 'list',
 'data': [{'object': 'embedding',
   'embedding': [[0.13589385151863098,
     0.07271996140480042,
     0.23335610330104828,
     0.451901912689209,
     -0.245610773563385,
     -0.0956047773361206,
     0.5299339294433594,
     0.15107591450214386,
     0.11003810912370682,
     0.07717527449131012,
     0.03652452677488327,
     0.1150326132774353,
     -0.02587956190109253,
     0.150251105427742,
     -0.2494143396615982,
     0.300008624792099,
     -0.14846795797348022,
     0.06267750263214111,
     -0.0946870744228363,
     -0.14687997102737427,
     -0.05173931643366814,
     0.17650100588798523,
     0.15701404213905334,
     0.30206722021102905,
     0.1083223968744278,
     0.0253159049898386,
     0.0018567458027973771,
     0.5550348162651062,
     -0.11904758214950562,
     -0.21553106606006622,
     -0.008001069538295269,
     0.31545400619506836,
     -0.15748965740203857,
     -0.14883887767791748,
     -0.13407330214977264,
     0.1013902872800827

In [20]:
len(embeddings["data"][0]["embedding"])

5

In [21]:
embeddings = llama2_chat.create_embedding(["Hello, world!", "Goodbye, world!"])

len(embeddings["data"]), len(embeddings["data"][0]["embedding"]), len(embeddings["data"][1]["embedding"])

(2, 5, 6)
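The lengths 5 and 6 above look like token counts rather than the embedding dimension: the output is a nested list, which suggests one vector per token for each input. A common way to get a single vector per input is to average the token vectors (mean pooling) and then compare inputs with cosine similarity. The sketch below does this in pure Python on toy 2-dimensional vectors; `mean_pool` and `cosine_similarity` are hypothetical helpers, and the toy data stands in for `embeddings["data"][i]["embedding"]`.

```python
import math


def mean_pool(token_vectors):
    """Average a list of per-token vectors into a single vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins: two "texts" with 2 and 3 token vectors respectively.
tokens_a = [[1.0, 0.0], [0.0, 1.0]]
tokens_b = [[2.0, 2.0], [4.0, 4.0], [0.0, 0.0]]

vec_a = mean_pool(tokens_a)
vec_b = mean_pool(tokens_b)
similarity = cosine_similarity(vec_a, vec_b)
print(vec_a, vec_b, similarity)
```

Mean pooling is only one option, and a chat model such as Llama 2 is not trained specifically for embeddings, so a dedicated embedding model may give better similarity scores.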

## Summary

In this video, I explained how you can run **open-source LLMs** on a **local machine** using the Python library **llama-cpp-python**. It's a wrapper around the **llama.cpp** library.