<a href="https://colab.research.google.com/github/AarifCha/RAG-HF-Langchain/blob/main/GGUF_Model_with_Langchain_LlamaCpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing the Dependencies

In [6]:
# This might take a few minutes!
!pip install -q langchain langchain_community
!pip install -q llama-cpp-python



# Downloading the model to Google Drive

In [3]:
# First we mount our Google Drive
from google.colab import drive
drive.mount("/content/drive")

# The following line downloads the model's GGUF file from Hugging Face. You can also download it manually and upload it to your Google Drive.
!huggingface-cli download bartowski/UNA-ThePitbull-21.4B-v2-GGUF --include "UNA-ThePitbull-21.4B-v2-Q6_K.gguf" --local-dir /content/drive/MyDrive/Colab_Notebooks/NLP_Projects/

Mounted at /content/drive
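If you would rather stay in Python than call `huggingface-cli`, the same download can be sketched with `huggingface_hub.hf_hub_download`. The helper name `fetch_gguf` is our own (hypothetical), and we assume the `huggingface_hub` package is available in the runtime. The helper skips the download when the file is already on Drive, which helps when the notebook is re-run.

```python
from pathlib import Path


def fetch_gguf(repo_id: str, filename: str, local_dir: str) -> str:
    """Return the local path of `filename`, downloading it only if it is missing.

    `fetch_gguf` is a hypothetical helper, not part of any library.
    """
    target = Path(local_dir) / filename
    if target.is_file():
        # Already present: skip the multi-gigabyte download.
        return str(target)
    # Imported lazily so the existence check works even without the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


# Example call with the repository and file used in this notebook
# (left commented out because it triggers a large download):
# fetch_gguf(
#     "bartowski/UNA-ThePitbull-21.4B-v2-GGUF",
#     "UNA-ThePitbull-21.4B-v2-Q6_K.gguf",
#     "/content/drive/MyDrive/Colab_Notebooks/NLP_Projects/",
# )
```

The returned path can be assigned directly to the model path variable used when loading the model.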


# Loading the Model

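We will load the GGUF model with `llama-cpp-python`, the Python bindings for llama.cpp. The model itself is not converted: llama.cpp is an inference engine written in C/C++, so inference runs in compiled code rather than in Python.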

In [10]:
# IGNORE THIS CELL: an attempt to rebuild llama-cpp-python with cuBLAS so it could use Colab's free GPU.
# It did not work reliably on Colab, so it is left commented out.
# !set FORCE_CMAKE=1
# !set CMAKE_ARGS=-DLLAMA_CUBLAS=ON
# !CMAKE_ARGS="-DLLAMA_CUBLAS=ON" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

In [4]:
READER_MODEL_PATH = "/content/drive/MyDrive/Colab_Notebooks/NLP_Projects/UNA-ThePitbull-21.4B-v2-Q6_K.gguf"

In [7]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

# We can load the model with LangChain's LlamaCpp wrapper. It runs the model
# through llama.cpp, a C/C++ inference engine, which speeds up inference.

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# This will take almost 5 minutes!
llm = LlamaCpp(
    model_path=READER_MODEL_PATH,
    temperature=0.75,
    max_tokens=2000,
    n_batch=512,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llama_model_loader: loaded meta data with 30 key-value pairs and 471 tensors from /content/drive/MyDrive/Colab_Notebooks/NLP_Projects/UNA-ThePitbull-21.4B-v2-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = UNA-ThePitbull-21.4B-v2
llama_model_loader: - kv   2:                          llama.block_count u32              = 52
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 6144
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 48
llama_mode
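The `kv` lines in the loader log above carry the GGUF metadata, including the context length the model was trained with. As a rough sketch, a small regex can pull those values out of the captured log text, for example to compare `llama.context_length` against the `n_ctx` value passed to `LlamaCpp`. The function name `parse_loader_metadata` is hypothetical, and the pattern assumes the log layout shown above.

```python
import re

# Matches lines such as:
# "llama_model_loader: - kv   3:   llama.context_length u32   = 32768"
KV_LINE = re.compile(
    r"llama_model_loader: - kv\s+\d+:\s+(?P<key>\S+)\s+(?P<type>\S+)\s+=\s+(?P<value>.+)"
)


def parse_loader_metadata(log_text: str) -> dict[str, str]:
    """Collect key/value pairs from llama_model_loader `kv` log lines."""
    meta = {}
    for line in log_text.splitlines():
        match = KV_LINE.match(line.strip())
        if match:
            meta[match.group("key")] = match.group("value").strip()
    return meta


# Two lines copied from the log output above.
sample_log = """\
llama_model_loader: - kv   2:                          llama.block_count u32              = 52
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
"""

meta = parse_loader_metadata(sample_log)
print(int(meta["llama.context_length"]))  # 32768
```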

# Running the Model

Now we create a prompt and test the model. Since this runs on CPU, expect a long wait: the run shown here took about 19 minutes in total to produce a one-sentence answer.
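The prompt in the next cell uses the ChatML layout, where each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. As a minimal sketch, the same string can be assembled from a list of `(role, content)` pairs, which makes it easier to add more turns later. The function name `render_chatml` is our own (hypothetical), and we assume the model expects exactly the layout used in this notebook.

```python
def render_chatml(messages: list[tuple[str, str]]) -> str:
    """Render (role, content) pairs as a ChatML prompt ending with an open assistant turn."""
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages]
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant")
    return "".join(parts)


example = render_chatml(
    [
        ("system", "Answer the question asked and only what is asked."),
        ("user", "What is 2+2?"),
    ]
)
print(example)
```

For this system and user message, the result is identical to the string produced by the template in the following cell.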

In [8]:
prompt = "<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
final_prompt = prompt.format(system_prompt="Answer the question asked and only what is asked.", prompt="What is 2+2?")
llm.invoke(final_prompt)
# Because of the relatively high temperature (0.75), expect different outputs between runs.


The result of 2 + 2 is 4.


llama_print_timings:        load time =  164302.08 ms
llama_print_timings:      sample time =      25.56 ms /    14 runs   (    1.83 ms per token,   547.77 tokens per second)
llama_print_timings: prompt eval time =  164301.94 ms /    53 tokens ( 3100.04 ms per token,     0.32 tokens per second)
llama_print_timings:        eval time =  984532.46 ms /    13 runs   (75733.27 ms per token,     0.01 tokens per second)
llama_print_timings:       total time = 1149037.47 ms /    66 tokens


'\nThe result of 2 + 2 is 4.'
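To put the `llama_print_timings` lines above into more readable units, here is a small sketch that parses them and converts the figures to seconds per generated token and total minutes. The function name `parse_timings` is hypothetical, and the regex assumes the log format shown in this notebook's output.

```python
import re

# Captures the timing name, the milliseconds, and the optional run/token count.
TIMING_LINE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text: str) -> dict:
    """Map each timing name to a (milliseconds, count or None) tuple."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_LINE.search(line)
        if match:
            count = int(match.group("count")) if match.group("count") else None
            timings[match.group("name")] = (float(match.group("ms")), count)
    return timings


# Lines copied from the output above (parenthesised per-token figures omitted).
sample_log = """\
llama_print_timings:        load time =  164302.08 ms
llama_print_timings: prompt eval time =  164301.94 ms /    53 tokens
llama_print_timings:        eval time =  984532.46 ms /    13 runs
llama_print_timings:       total time = 1149037.47 ms /    66 tokens
"""

timings = parse_timings(sample_log)
eval_ms, eval_runs = timings["eval"]
total_ms, _ = timings["total"]
print(f"{eval_ms / eval_runs / 1000:.1f} s per generated token")
print(f"{total_ms / 60000:.1f} min total")
```

Most of the time in this run went into generation (the `eval` line), not into processing the prompt.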