The command below uses pip to install the llama-cpp-python package. The CMAKE_ARGS and FORCE_CMAKE environment variables make pip build the package from source rather than install a pre-built binary.

-DLLAMA_CUBLAS=on enables GPU acceleration using the cuBLAS library. This allows the models to leverage NVIDIA GPUs for faster inference.

--force-reinstall forces a fresh install even if the package is already installed.

--upgrade upgrades the package to the latest available version.

--no-cache-dir avoids using cached package files.

--verbose prints more details about the installation process.

In [None]:
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
# For downloading the models


This installs the huggingface_hub package using pip. This package provides access to Hugging Face's model hub for easy downloading of pretrained models like Llama 2.

In [None]:
!pip install huggingface_hub
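
Before moving on, it can help to confirm that both packages were actually installed. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here, not part of either package.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages installed above.
for name in ("llama-cpp-python", "huggingface_hub"):
    print(name, "->", installed_version(name) or "not installed")
```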

Here we specify the identifier for the pretrained Llama 2 model we want to download from the model hub.

model_name_or_path is the repository ID on the hub, in the format username/model_name.

model_basename is the actual file name of the model file we want to download.

In [2]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # the model is in bin format
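
To make the two identifiers concrete, the sketch below splits them into their parts. It assumes the file name follows the pattern name.format.quantization.extension, which holds for this file but is only a naming convention, not something the hub enforces.

```python
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

# The repository ID has the form "username/model_name".
username, model_name = model_name_or_path.split("/", 1)

# Assumed naming convention: <name>.<format tag>.<quantization tag>.<extension>
stem, format_tag, quantization, extension = model_basename.rsplit(".", 3)

print(username, model_name)
print(stem, format_tag, quantization, extension)
```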

This imports the hf_hub_download function from the huggingface_hub module. This function will handle downloading the model file from the hub.

In [None]:
from huggingface_hub import hf_hub_download

Import the Llama class from the llama_cpp package. This provides the API for loading and running Llama models.

In [4]:
from llama_cpp import Llama

Call hf_hub_download to download the model file specified by model_name_or_path and model_basename from the hub.

The path to the downloaded model file is returned and stored in model_path for later use.

In [None]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
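
Since the model file is large, it may be worth checking the size of the downloaded file before loading it. This is a small sketch with the standard library; `file_size_gib` is a hypothetical helper, and the commented usage line assumes model_path was set by the cell above.

```python
import os


def file_size_gib(path):
    """Return the size of the file at `path` in GiB."""
    return os.path.getsize(path) / 1024**3


# Usage in this notebook (model_path comes from hf_hub_download above):
# print(f"{model_path}: {file_size_gib(model_path):.2f} GiB")
```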

This creates an instance of the Llama class and loads the model.

model_path points to the downloaded model file.

n_threads controls how many CPU threads to use.

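n_batch sets how many prompt tokens are processed per batch. It should be between 1 and n_ctx and tuned to the available GPU memory.

n_gpu_layers sets how many of the model's layers are offloaded to the GPU; the remaining layers run on the CPU. Offloading more layers generally means faster inference, as long as they fit in VRAM.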

In [None]:
# GPU
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32 # Change this value based on your model and your GPU VRAM pool.
    )
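
One rough way to pick a starting value for n_gpu_layers is to divide the model file size evenly across its layers and see how many fit in VRAM. The sketch below does that; `estimate_gpu_layers` is a hypothetical helper, the equal-size-per-layer assumption is a simplification, and the numbers are illustrative rather than measured. Treat the result as a first guess to adjust by trial.

```python
def estimate_gpu_layers(model_size_gb, total_layers, vram_gb, reserve_gb=1.0):
    """Rough estimate of how many layers fit in VRAM.

    Assumes every layer takes an equal share of the model file and keeps
    `reserve_gb` of VRAM free for the context and scratch buffers.
    """
    per_layer_gb = model_size_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable_gb // per_layer_gb))


# Illustrative numbers only: a 10 GB model file with 40 layers and 6 GB of VRAM.
print(estimate_gpu_layers(model_size_gb=10, total_layers=40, vram_gb=6))
```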

Check how many layers are offloaded to the GPU.

In [None]:

lcpp_llm.params.n_gpu_layers

Here we define a prompt containing the user's query. We then construct prompt_template, which gives the model its instructions and places the user prompt inside the template.

In [8]:
prompt = "write a SQL query to fetch data from a patient table for patients who have been in the hospital for more than a month"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''
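
If the same template is needed for several queries, it can be wrapped in a function. This is a minimal sketch; `build_prompt` and its default system prompt are hypothetical names chosen here, and the layout simply mirrors the SYSTEM / USER / ASSISTANT template above.

```python
def build_prompt(user_prompt, system_prompt="You are a helpful, respectful and honest assistant."):
    """Place a user query inside the SYSTEM / USER / ASSISTANT template."""
    return f"SYSTEM: {system_prompt}\n\nUSER: {user_prompt}\n\nASSISTANT:\n"


print(build_prompt("write a SQL query to count the rows in a patient table"))
```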

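Pass prompt_template to lcpp_llm, which is a callable Llama instance, to generate a response.

We set parameters such as max_tokens, temperature, top_p, top_k and repeat_penalty to control the response generation. Because echo=True, the returned text contains the prompt followed by the completion.

The call returns a dictionary containing the generated text along with metadata; it is stored in response.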

In [None]:
response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)
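
To give an intuition for what temperature, top_k and top_p do, here is a toy sketch in plain Python. It is not llama.cpp's sampler, which may apply these steps in a different order and works on the full vocabulary; `filter_distribution` and the example logits are made up for illustration.

```python
import math


def filter_distribution(logits, temperature=0.5, top_k=150, top_p=0.95):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Made-up logits for three candidate tokens.
print(filter_distribution({"SELECT": 2.0, "select": 1.0, "Here": 0.0}))
```

Lower temperature and smaller top_k or top_p all narrow the set of tokens the model can choose from, which makes the output more deterministic.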

Print the full response object.

In [None]:
print(response)

This shows how to access the output text from the model's response object.

In [None]:
print(response["choices"][0]["text"])
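
Because the call above uses echo=True, the text field starts with the prompt. The sketch below strips that echoed prompt so only the completion remains; `extract_completion` is a hypothetical helper, and the response dictionary is a hand-written stand-in for illustration, not real model output.

```python
def extract_completion(response, prompt):
    """Return the generated text without the echoed prompt."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# A hand-written stand-in with the same shape as the response object.
fake_prompt = "USER: hi\n\nASSISTANT:\n"
fake_response = {"choices": [{"text": fake_prompt + "Hello!"}]}
print(extract_completion(fake_response, fake_prompt))
```

In the notebook this would be called as extract_completion(response, prompt_template).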