## ***Unleashing the Power of the Quantized Models from the Hugging Face Community: A Step-by-Step Guide*** 💻

This notebook shows how to use a quantized Llama-2-13B chat model. The code installs the necessary libraries, downloads the quantized model, configures llama-cpp-python for GPU acceleration, and queries the model with a sample prompt.

### 🔲 ***Goal :***

*This notebook aims to :*

* Install required libraries for managing the Llama LLM and interacting with Hugging Face Hub.

* Download the Quantized Llama-2-13B model specifically fine-tuned for chat.

* Set up the Llama LLM for GPU processing to maximize performance.

* Craft a prompt template incorporating user input and assistant role definition.

* Generate a response using the model and extract the answer.

### 🔲 ***Steps :***

✅ ***Install and Upgrade Libraries :***

* The code starts by installing (or upgrading) necessary libraries:

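 * **llama-cpp-python**: Python bindings for llama.cpp, used to load the quantized model and run inference.

 * **numpy**: Numerical library, pinned here to a version that works with this llama-cpp-python release.

 * **huggingface_hub**: Enables downloading pre-trained models from the Hugging Face Hub.

* The `--force-reinstall` and `--upgrade` flags make pip reinstall the packages even if another version is already present. Because the versions are pinned (`llama-cpp-python==0.1.78`, `numpy==1.23.4`), you get exactly those versions rather than the latest ones.

* `CMAKE_ARGS="-DLLAMA_CUBLAS=on"` together with `FORCE_CMAKE=1` builds llama-cpp-python with cuBLAS (CUDA) support.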

In [None]:
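# Build llama-cpp-python with cuBLAS (CUDA) support so model layers can be offloaded to the GPU.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
# Hugging Face Hub client, used later to download the model file.
%pip install huggingface_hub
# Same pins as above; these are no-ops if the build above succeeded.
%pip install llama-cpp-python==0.1.78
%pip install numpy==1.23.4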
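
Before moving on, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Pins copied from the install cell; huggingface_hub is left unpinned there.
expected = {"llama-cpp-python": "0.1.78", "numpy": "1.23.4", "huggingface_hub": None}

for name, pin in expected.items():
    found = installed_version(name)
    if found is None:
        status = "NOT INSTALLED"
    elif pin is None or found == pin:
        status = "ok"
    else:
        status = f"expected {pin}"
    print(f"{name}: {found} ({status})")
```

If a package shows a different version, restarting the runtime after installation is usually worth trying before reinstalling.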

✅ ***Define Model Details:***

The code specifies the repository name `model_name` and the filename `model_basename` of the quantized Llama-2-13B variant fine-tuned for chat. Both values come from the model card on the Hugging Face Hub.

In [None]:
model_name = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

✅ ***Download Quantized Model:***

* The `hf_hub_download` function from `huggingface_hub` is used to download the Quantized model from the Hugging Face Hub repository specified by `model_name`.

* The local path of the downloaded file is stored in `model_path`.

In [None]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
model_path = hf_hub_download(repo_id=model_name, filename=model_basename)
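
A quick sanity check on the downloaded file can catch an interrupted download early. This is a minimal sketch; `describe_model_file` is a hypothetical helper, and the fallback filename is only a placeholder so the snippet also runs outside the notebook.

```python
import os


def describe_model_file(path):
    """Return (exists, size_in_GiB) for a local model file."""
    if not os.path.isfile(path):
        return False, 0.0
    return True, os.path.getsize(path) / 1024**3


# `model_path` comes from the download cell; the fallback is a placeholder
# (assumption) so that this sketch also runs standalone.
path = globals().get("model_path", "llama-2-13b-chat.ggmlv3.q5_1.bin")
exists, size_gib = describe_model_file(path)
print(f"{path}: exists={exists}, size={size_gib:.2f} GiB")
```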

✅ ***Initialize Llama LLM (GPU Configuration):***

* A `Llama` object (`lcpp_llm`) is created to represent the Llama LLM instance.

* Key arguments for initialization include:

 * `model_path`: Path to the downloaded model.

 * `n_threads`: Number of CPU threads to use (set to 2 in this case).

 * `n_batch`: Number of prompt tokens processed per batch. It should be between 1 and `n_ctx`; larger values need more GPU VRAM.

 * `n_gpu_layers`: Number of model layers to offload to the GPU. *This value should be adjusted based on your specific model and available GPU VRAM.*

In [None]:
N_THREADS = 2
N_BATCH = 512
N_GPU_LAYERS = 32

# GPU
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=N_THREADS, # CPU cores
    n_batch=N_BATCH, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=N_GPU_LAYERS # Change this value based on your model and your GPU VRAM pool.
    )
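
Choosing `n_gpu_layers` is mostly trial and error, but a back-of-the-envelope estimate gives a starting point. The sketch below is a rough heuristic, not something llama-cpp-python provides: it assumes all layers are the same size, and the function name `estimate_gpu_layers`, the `reserve_gib` headroom, and the example numbers are all illustrative assumptions.

```python
def estimate_gpu_layers(model_size_gib, total_layers, vram_gib, reserve_gib=1.5):
    """Rough guess at how many layers fit in VRAM, assuming equal-sized layers.

    `reserve_gib` is headroom kept free for the context and scratch buffers;
    the default is an arbitrary illustrative value, not a measured one.
    """
    if model_size_gib <= 0 or total_layers <= 0:
        raise ValueError("model_size_gib and total_layers must be positive")
    per_layer_gib = model_size_gib / total_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(total_layers, int(usable_gib // per_layer_gib))


# Illustrative numbers only: a 10 GiB model file with 40 layers on a 16 GiB GPU.
print(estimate_gpu_layers(model_size_gib=10.0, total_layers=40, vram_gib=16.0))
```

If loading fails with an out-of-memory error, lowering `n_gpu_layers` step by step is the practical fix; the estimate only narrows the search.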

✅ ***Craft Prompt Template:***

* A variable `prompt_template` is defined as an f-string. This template serves two purposes:

 * Defines roles: `SYSTEM` carries the instructions that shape the assistant's behavior, `USER` carries the user's question, and `ASSISTANT:` marks where the model's reply begins.

 * Incorporates the actual user prompt (`prompt`) within the template.

In [None]:
prompt = "What are the trends in the AI/ML field?"
prompt_template=f'''SYSTEM: You are a helpful, mathematical and reasonable assistant. Always answer as accurately as possible. The answer should be given as a non-negative integer modulo 1000.

USER: {prompt}

ASSISTANT:
'''
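
To reuse the same layout with different instructions or questions, the template can be wrapped in a small function. This is a minimal sketch; `build_prompt` is a hypothetical helper that mirrors the SYSTEM/USER/ASSISTANT layout used in this notebook, and the example messages are made up.

```python
def build_prompt(system_message, user_message):
    """Assemble a SYSTEM/USER/ASSISTANT prompt in the layout used by this notebook."""
    return (
        f"SYSTEM: {system_message.strip()}\n\n"
        f"USER: {user_message.strip()}\n\n"
        "ASSISTANT:\n"
    )


# Example messages are illustrative only.
example = build_prompt("You are a helpful assistant.", "What is 17 * 3?")
print(example)
```

Keeping the system message and the user message as separate arguments makes it easy to swap one without retyping the other.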

✅ ***Generate Response and Extract Answer:***

* The `lcpp_llm` object is called with the `prompt_template` as input.

* Additional arguments for generation control:

 * `max_tokens`: Maximum number of tokens to generate (225 in this case).

 * `temperature`: Controls randomness in the generated text (0.5 here; lower values make the output more deterministic).

 * `top_p`: Restricts sampling to the smallest set of tokens whose cumulative probability reaches this value (0.95 here).

 * `repeat_penalty`: Penalizes tokens that have already appeared (1.2 here; values above 1 discourage repetition).

 * `top_k`: Considers only the `k` most likely tokens at each step (150 here).

 * `echo`: When `True`, the prompt is included at the start of the returned text, so prompt and completion appear together.

* The `response` variable holds the dictionary returned by the model; the generated text is stored inside it.

* The code then reads the text (the echoed prompt followed by the model's answer) from `response["choices"][0]["text"]`.

In [None]:
params = {
    'prompt': prompt_template,
    'max_tokens': 225,
    'temperature': 0.5,
    'top_p': 0.95,
    'repeat_penalty': 1.2,
    'top_k': 150,
    'echo': True
}
response = lcpp_llm(**params)
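
To build intuition for how `temperature`, `top_k`, and `top_p` narrow down the candidate tokens, here is a simplified pure-Python illustration. It is not llama.cpp's actual implementation (the real sampler differs in ordering and details), and the function name `filter_candidates` and the toy logits are assumptions made for this sketch.

```python
import math


def filter_candidates(logits, temperature=0.5, top_k=150, top_p=0.95):
    """Return (token, probability) pairs that survive top-k and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax over the scaled logits (subtract the max for numerical stability).
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return [(tok, prob / norm) for tok, prob in kept]


# Toy logits for four made-up tokens.
toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
for tok, prob in filter_candidates(toy_logits, temperature=1.0, top_k=3, top_p=0.8):
    print(tok, round(prob, 3))
```

The model would then draw the next token at random from the surviving candidates, which is why tighter settings give more predictable text.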

In [None]:
print(response)

In [None]:
print(response["choices"][0]["text"])
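
Because `echo` is `True`, the returned text starts with the prompt itself. A minimal sketch of separating the answer from the echoed prompt follows; `extract_answer` is a hypothetical helper, and the response dictionary is a hand-written stand-in that contains only the fields used here.

```python
def extract_answer(response, prompt_text):
    """Return only the completion, dropping the echoed prompt if it is present."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt_text):
        text = text[len(prompt_text):]
    return text.strip()


# Hand-written stand-in for a real response (assumption: only these fields matter here).
fake_prompt = "SYSTEM: Be brief.\n\nUSER: What is 2 + 2?\n\nASSISTANT:\n"
fake_response = {"choices": [{"text": fake_prompt + " 4\n"}]}
print(extract_answer(fake_response, fake_prompt))
```

In the notebook itself, the equivalent call would be `extract_answer(response, prompt_template)`. Setting `'echo': False` in `params` avoids the need for this step.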

### ***THANKS*** and good luck !!
