**Llama 2**

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging from 7 billion to 70 billion parameters. The fine-tuned variants, called Llama-2-Chat, are optimized for dialogue use cases.

Llama-2-Chat outperforms open-source chat models on most of the benchmarks Meta tested and, in human evaluations for helpfulness and safety, is on par with some popular closed-source models.

[Llama 2 13B-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat)

The main goal of `llama.cpp` is to run the LLaMA model with 4-bit integer quantization on a MacBook. It is a plain C/C++ implementation optimized for Apple silicon and x86 architectures, with support for several integer quantization formats and BLAS libraries. Originally a web chat example, it now serves as a development playground for features of the ggml library.

`GGML` is a C library for machine learning that also defines a binary file format for distributing large language models (LLMs). It relies on quantization so that LLMs can run efficiently on consumer hardware. A GGML file contains binary-encoded data: a version number, the hyperparameters, the vocabulary, and the weights. The vocabulary lists the tokens the model reads and generates, while the number of weights, together with the precision at which each one is stored, determines the size of the file. Quantization stores the weights at reduced precision, which lowers memory and compute requirements at some cost in accuracy.
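To make the idea of reduced precision concrete, here is a minimal sketch of block-wise quantization in plain Python. It is only an illustration of the principle: the block size of 32, the 5-bit codes, and the helper names `quantize_block` and `dequantize_block` are assumptions chosen for this example, and the code does not reproduce the actual GGML bit layout.

```python
import random

BLOCK_SIZE = 32   # assumption: weights are quantized in blocks of 32 values
LEVELS = 2 ** 5   # 5-bit codes, i.e. integers in the range [0, 31]


def quantize_block(block):
    """Map floats to 5-bit integer codes using a per-block minimum and scale."""
    lo, hi = min(block), max(block)
    # Fall back to 1.0 for a constant block to avoid dividing by zero.
    scale = (hi - lo) / (LEVELS - 1) or 1.0
    codes = [round((x - lo) / scale) for x in block]
    return lo, scale, codes


def dequantize_block(lo, scale, codes):
    """Recover approximate floats from the stored minimum, scale, and codes."""
    return [lo + scale * c for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)]
lo, scale, codes = quantize_block(weights)
restored = dequantize_block(lo, scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f} (half a step is {scale / 2:.4f})")
```

Each weight is stored as a small integer plus two shared floats per block, so the reconstruction error is bounded by half a quantization step.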

# Quantized Models from the Hugging Face Community

The Hugging Face community provides quantized models, which allow us to run the model efficiently on a single GPU such as the T4. Check that a model comes from a reliable source before using it.

Several variations are available; the ones that interest us are based on the GGML library.

The different GGML variations of Llama 2 are listed [here](https://huggingface.co/models?search=llama%202%20ggml).



In this case, we will use the model called [Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML).
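A back-of-the-envelope calculation shows why a quantized 13B model is practical on a single GPU. The sketch below assumes a round figure of 13 billion parameters and approximate storage costs per weight (16 bits for fp16, about 6 bits for q5_1 and about 4.5 bits for q4_0 once per-block metadata is included); these figures and the helper name `approx_size_gb` are assumptions for illustration, and real files also contain the vocabulary and other metadata.

```python
def approx_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 13e9  # assumption: round parameter count for a "13B" model

# Assumed average bits per weight, including per-block scale metadata.
FORMATS = [("fp16", 16.0), ("q5_1", 6.0), ("q4_0", 4.5)]

for name, bits in FORMATS:
    print(f"{name:>5}: about {approx_size_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions the fp16 weights alone would not fit in the 16 GB of a T4, while the quantized variants leave room for the context and working buffers.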

# **Step 1: Install All the Required Packages**

In [None]:
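# Build llama-cpp-python with cuBLAS so that model layers can be offloaded to the GPU.
# Version 0.1.78 is pinned because it still reads GGML (.bin) files.
# numpy is pinned in the same command, so no separate reinstall is needed afterwards.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
# Used to download the model file from the Hugging Face Hub.
!pip install huggingface_hub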

In [1]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # GGML v3 file (.bin) with q5_1 (5-bit) quantization

# **Step 2: Import All the Required Libraries**

In [2]:
import os

In [3]:
from huggingface_hub import hf_hub_download

In [4]:
from llama_cpp import Llama

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6


# **Step 3: Download the Model**

In [5]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

# **Step 4: Load the Model**

In [6]:
# GPU
lcpp_llm = None  # release a previously loaded model when this cell is re-run
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=os.cpu_count(),  # number of logical CPUs reported by the OS
    n_batch=512,  # prompt tokens processed per batch; between 1 and n_ctx, limited by your GPU's VRAM
    n_gpu_layers=-1,  # -1 offloads every layer; lower it if the model does not fit in VRAM
)

llama.cpp: loading model from /home/xodiec/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_inte

In [7]:
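# Check how many layers were requested for GPU offload.
# n_gpu_layers=-1 is stored as 2147483647 (the largest 32-bit signed integer), meaning "all layers".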
lcpp_llm.params.n_gpu_layers

2147483647
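If the model does not fit entirely in VRAM, `n_gpu_layers` has to be lowered. One rough way to pick a starting value is sketched below. It assumes that all layers are about the same size and that a fixed reserve covers the context and scratch buffers. The helper name `estimate_gpu_layers`, the 9.76 GB file size, and the 1.5 GB reserve are assumptions for illustration; `n_layer = 40` comes from the load log above.

```python
def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many equally sized layers fit in the available VRAM."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


MODEL_SIZE_GB = 9.76  # assumption: approximate size of the q5_1 file
N_LAYERS = 40         # n_layer reported in the load log

for vram_gb in (8.0, 12.0, 16.0):
    layers = estimate_gpu_layers(MODEL_SIZE_GB, N_LAYERS, vram_gb)
    print(f"{vram_gb:>4.0f} GB VRAM: try n_gpu_layers={layers}")
```

The result is only a starting point: if loading fails with an out-of-memory error, reduce the value further.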

# **Step 5: Create a Prompt Template**

In [8]:
prompt = "Tell me about the best films of 2021"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
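The `SYSTEM:`/`USER:`/`ASSISTANT:` template above produced a usable answer in this run. The Llama-2-Chat models, however, were fine-tuned with their own markers, `[INST]` and `<<SYS>>`, so a prompt in that layout may be a closer match to the training data. The sketch below builds such a prompt for a single turn; the helper name `build_llama2_chat_prompt` is hypothetical, and the beginning-of-sequence token is assumed to be added by the tokenizer.

```python
def build_llama2_chat_prompt(system_message, user_message):
    """Wrap one system message and one user turn in Llama 2 chat markers."""
    return (
        f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


llama2_prompt = build_llama2_chat_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Tell me about the best films of 2021",
)
print(llama2_prompt)
```

The resulting string can be passed as `prompt` in place of `prompt_template`.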

# **Step 6: Generate the Response**

In [None]:
'''
LLM parameters:
prompt: The input text given to the model to generate a response. Here, prompt_template is the template that provides
the context for the response.
max_tokens: The maximum number of tokens the model may generate. A token is a subword unit of the model's vocabulary,
not a word or a character, so the limit of 256 tokens set here usually corresponds to fewer than 256 words.
temperature: Controls the randomness of the model's predictions during text generation.
Lower values make the text more deterministic, higher values make it more random. It is set to 0.5 here, so the
generated text is moderately random.
top_p: A sampling technique (nucleus sampling) that restricts the choice to the most probable tokens.
With top_p set to 0.95, the model samples only from the smallest set of tokens whose cumulative probability
reaches 95%.
repeat_penalty: Penalizes tokens that already appear in the recent context, which discourages repeated words and
phrases. A value of 1.0 applies no penalty; 1.2 is a moderate penalty.
top_k: A sampling technique that limits the choice to the k tokens with the highest probability.
It is set to 150 here, so the model considers only the 150 most probable tokens at each generation step.
'''
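The interaction between `temperature`, `top_k` and `top_p` can be sketched in a few lines of plain Python. This is a simplified illustration on a toy distribution: the function name `filter_distribution` and the example logits are assumptions, and llama.cpp applies its samplers in its own order, which this sketch does not try to reproduce.

```python
import math


def filter_distribution(logits, temperature=0.5, top_k=150, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: dividing the logits by a value below 1 sharpens the distribution.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so that the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
candidates = filter_distribution(toy_logits)
print(candidates)
```

With these toy values only the two most likely tokens survive the top-p cut; the next token would then be drawn at random from that reduced set.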

In [9]:
# echo=True returns the prompt together with the generated text
response = lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                    repeat_penalty=1.2, top_k=150,
                    echo=True)


llama_print_timings:        load time =   808.75 ms
llama_print_timings:      sample time =    93.89 ms /   256 runs   (    0.37 ms per token,  2726.62 tokens per second)
llama_print_timings: prompt eval time =   808.68 ms /    45 tokens (   17.97 ms per token,    55.65 tokens per second)
llama_print_timings:        eval time = 14269.65 ms /   255 runs   (   55.96 ms per token,    17.87 tokens per second)
llama_print_timings:       total time = 15612.41 ms
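The throughput figures in the timing log follow directly from the token counts and elapsed times. As a quick check, the sketch below recomputes them from the values printed above; the helper name `tokens_per_second` is an assumption for this example.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and an elapsed time in milliseconds to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values copied from the llama_print_timings output above.
prompt_speed = tokens_per_second(45, 808.68)         # prompt evaluation
generation_speed = tokens_per_second(255, 14269.65)  # token generation

print(f"prompt eval: {prompt_speed:.2f} tokens/s")
print(f"generation:  {generation_speed:.2f} tokens/s")
```

Prompt evaluation is faster per token than generation because the prompt tokens are processed in a batch, while generated tokens are produced one at a time.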


In [10]:
print(response)

{'id': 'cmpl-0d1010f0-f19d-45e6-9ad3-32c5701f7bb0', 'object': 'text_completion', 'created': 1713270779, 'model': '/home/xodiec/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_1.bin', 'choices': [{'text': 'SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.\n\nUSER: Tell me about the best films of 2021\n\nASSISTANT:\n\nThere were many great films released in 2021! Here are some that received high praise from critics and audiences alike:\n\n1. "The Power of the Dog" - A psychological drama directed by Jane Campion, starring Benedict Cumberbatch and Kirsten Dunst. It tells the story of a wealthy family in 1920s Montana and their struggles with love, greed, and power.\n2. "The Matrix Resurrections" - A science fiction action film directed by Lana Wachowski, starring Keanu Reeves, Carrie-Anne Moss, and Yahya Abdul-Mateen II. It is the fourth installment in The Mat
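Because the call used `echo=True`, the `text` field of the first choice starts with the prompt itself. If only the model's answer is needed, the prompt can be stripped afterwards, as in the sketch below. The helper name `completion_only` and the small `fake_response` dictionary are hypothetical; the dictionary only mimics the shape of the response printed above.

```python
def completion_only(response, prompt):
    """Return the generated text without the echoed prompt, if it is present."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hypothetical stand-in with the same structure as the real response.
fake_prompt = "USER: Say hello\n\nASSISTANT:\n"
fake_response = {
    "object": "text_completion",
    "choices": [{"text": fake_prompt + "\nHello there!"}],
}

print(completion_only(fake_response, fake_prompt))
```

With the real objects from this notebook the call would be `completion_only(response, prompt_template)`. Passing `echo=False` to the model call avoids the echoed prompt in the first place.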

In [11]:
print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Tell me about the best films of 2021

ASSISTANT:

There were many great films released in 2021! Here are some that received high praise from critics and audiences alike:

1. "The Power of the Dog" - A psychological drama directed by Jane Campion, starring Benedict Cumberbatch and Kirsten Dunst. It tells the story of a wealthy family in 1920s Montana and their struggles with love, greed, and power.
2. "The Matrix Resurrections" - A science fiction action film directed by Lana Wachowski, starring Keanu Reeves, Carrie-Anne Moss, and Yahya Abdul-Mateen II. It is the fourth installment in The Matrix series and follows Neo as he navigates a new reality.
3. "The French Dispatch" - A comedy-drama film directed by Wes Anderson, set in a fictional French city. It features an all-star cast, including Timothée Chalamet, Benicio del Toro, Adrien Brody, and Tilda Swinton. The story revolves around the staf