<a href="https://colab.research.google.com/github/CognitiveByte/learn_langchain/blob/main/LLaMA_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama 2

Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging from 7 billion to 70 billion parameters. The fine-tuned variants, called Llama-2-Chat, are optimized for dialogue use cases.

According to Meta's model card, the Llama-2-Chat models outperform open-source chat models on most benchmarks tested and are on par with some popular closed-source models in human evaluations for helpfulness and safety.

[Llama 2 13B-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat)

The main goal of `llama.cpp` is to run the LLaMA model with 4-bit integer quantization on a MacBook. It is a plain C/C++ implementation, optimized for Apple silicon and x86 architectures, that supports several integer quantization formats and BLAS libraries. Originally a web chat example, it now serves as a development playground for ggml library features.

`GGML` is a C library for machine learning; the same name is used for the binary file format in which large language models (LLMs) are distributed. It relies on quantization so that LLMs can run efficiently on consumer hardware. A GGML file contains binary-encoded data: a version number, the hyperparameters, the vocabulary, and the weights. The vocabulary lists the tokens the model can read and generate, and the weights account for most of the file's size. Quantization stores the weights at reduced precision, trading some accuracy for lower memory and compute requirements.
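To make the idea of quantization concrete, here is a minimal pure-Python sketch of symmetric 4-bit block quantization. It is an illustration only, not GGML's actual storage format: the function names `quantize_block` and `dequantize_block` and the block size of 32 are assumptions chosen for the example.

```python
import random


def quantize_block(values, bits=4):
    """Symmetric quantization of one block of floats (illustrative, not GGML's format).

    Returns (scale, codes), where every code is a signed integer that fits in `bits` bits.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one shared scale per block
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes."""
    return [scale * c for c in codes]


random.seed(0)
block = [random.uniform(-1.0, 1.0) for _ in range(32)]  # assumed block size of 32
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

Each weight is stored as a small integer plus one scale shared by the block, so the rounding error per weight is at most half the scale.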

## Quantized Models from the Hugging Face Community

(Note: In this notebook, `llama.cpp` refers to the C/C++ inference implementation for LLaMA models, used here through its Python bindings, `llama-cpp-python`.)

The Hugging Face community provides quantized versions of many models, which make it practical to run them on a wide range of hardware. Community uploads are not official releases, so check where a model comes from before using it.

## Loading the Model

### Step 1: Install All the Required Packages

In [None]:
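# Build llama-cpp-python with GPU (cuBLAS) support.
# The version is pinned because GGML (.bin) files load only with llama-cpp-python 0.1.78 or earlier; later releases expect GGUF.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 --force-reinstall --upgrade --no-cache-dir --verbose
# For downloading the models
%pip install huggingface_hub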

In [2]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # GGML v3 .bin file with q5_1 quantization

### Step 2: Import All the Required Libraries

In [3]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

### Step 3: Download the Model

In [4]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

Downloading (…)chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]
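The 9.76 GB download size is close to what a back-of-the-envelope estimate predicts. The sketch below assumes about 13 billion parameters and about 6 bits per weight for `q5_1` (5-bit values plus a per-block scale and minimum); both figures are approximations, and the helper name `estimate_size_gb` is hypothetical.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB for a model stored at the given precision."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 13e9  # assumed parameter count for the 13B model

q5_1_size = estimate_size_gb(N_PARAMS, 6.0)   # assumed effective bits per weight for q5_1
fp16_size = estimate_size_gb(N_PARAMS, 16.0)  # unquantized half precision, for comparison

print(f"q5_1: {q5_1_size:.2f} GB")  # prints "q5_1: 9.75 GB"
print(f"fp16: {fp16_size:.2f} GB")  # prints "fp16: 26.00 GB"
```

Under these assumptions the quantized file needs well under half the space of a half-precision copy, which is what makes a 13B model usable on a single consumer or Colab GPU.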

### Step 4: Load the Model

In [5]:
# Load the LLaMA model with specified parameters
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2,       # Number of CPU threads to use
    n_batch=512,       # Prompt tokens processed per batch; should be between 1 and n_ctx, and larger values need more VRAM
    n_gpu_layers=32    # Number of layers to offload to the GPU; adjust to your model and available VRAM
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
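The load parameters above are hard-coded for the machine the notebook ran on. As a minimal sketch, they could instead be derived with a few sanity checks. The helper `suggest_load_params` is hypothetical, the default context size of 512 is an assumption about the library's default, and using every reported core is a simple heuristic rather than a tuned setting.

```python
import os


def suggest_load_params(n_ctx=512, n_batch=512, n_gpu_layers=32):
    """Build keyword arguments for Llama() with basic sanity checks (hypothetical helper)."""
    return {
        "n_ctx": n_ctx,
        # Use every CPU core the runtime reports; fall back to 2 if unknown.
        "n_threads": os.cpu_count() or 2,
        # Keep n_batch between 1 and n_ctx.
        "n_batch": max(1, min(n_batch, n_ctx)),
        # Layers offloaded to the GPU; 0 keeps everything on the CPU.
        "n_gpu_layers": max(0, n_gpu_layers),
    }


load_params = suggest_load_params(n_batch=1024)  # 1024 is clamped to n_ctx
print(load_params)
# These could then be passed as: Llama(model_path=model_path, **load_params)
```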


In [6]:
# Check how many layers are configured to be offloaded to the GPU
lcpp_llm.params.n_gpu_layers

32

## Create a Prompt Template

In [7]:
prompt = "Write a linear regression in python"
prompt_template = f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
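The template above uses plain `SYSTEM:` / `USER:` / `ASSISTANT:` labels. The Llama 2 chat models were fine-tuned with a different layout, built from `[INST]` and `<<SYS>>` markers, so it may be worth comparing both. The sketch below formats a single turn in that style; the helper name `build_llama2_chat_prompt` is hypothetical, and the leading `<s>` token is left out on the assumption that the tokenizer adds it.

```python
def build_llama2_chat_prompt(user_message, system_message):
    """Format one user turn in the Llama 2 chat style ([INST] / <<SYS>> markers)."""
    return (
        f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


llama2_prompt = build_llama2_chat_prompt(
    user_message="Write a linear regression in python",
    system_message="You are a helpful, respectful and honest assistant.",
)
print(llama2_prompt)
```

The resulting string can be passed as `prompt` in the same way as `prompt_template`. Whether it improves the answers from this quantized model is not something this notebook measures.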

## Generate a Response with the Model

### Step 5: Generate the Response

In [8]:
# Generate a response using the loaded model and the prompt template
response = lcpp_llm(
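    prompt=prompt_template,
    max_tokens=256,        # Upper limit on the number of generated tokens
    temperature=0.5,       # Lower values make sampling less random
    top_p=0.95,            # Nucleus sampling: keep the smallest token set with cumulative probability of at least 0.95
    repeat_penalty=1.2,    # Values above 1 discourage repeating recent tokens
    top_k=150,             # Sample only from the 150 most likely tokens
    echo=True              # Include the prompt in the returned text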
)
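To give a feel for what `temperature`, `top_k`, and `top_p` do, here is a conceptual pure-Python sketch that filters a toy distribution over four tokens. It is not how `llama.cpp` implements sampling (the order of the steps and many details differ there), `filter_distribution` is a hypothetical name, and `repeat_penalty` is not covered.

```python
import math


def filter_distribution(logits, temperature=0.5, top_k=150, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax; values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so their probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
filtered = filter_distribution(toy_logits)
print(filtered)
```

With these toy logits only the two most likely tokens survive the `top_p=0.95` cut, so the sampler would choose between them and never pick the unlikely tail.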

In [9]:
print(response)

{'id': 'cmpl-397994ae-f112-488d-80bc-aa30594ce797', 'object': 'text_completion', 'created': 1692126145, 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/47d28ef5de4f3de523c421f325a2e4e039035bab/llama-2-13b-chat.ggmlv3.q5_1.bin', 'choices': [{'text': 'SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.\n\nUSER: Write a linear regression in python\n\nASSISTANT:\n\nTo write a linear regression in Python, you can use scikit-learn library. Here is an example of how to do it:\n```\nfrom sklearn.linear_model import LinearRegression\nimport numpy as np\n\n# Generate some sample data\nX = np.random.rand(100, 5)\ny = np.random.randint(0, 2, size=100)\n\n# Create a Linear Regression object and fit the data\nreg = LinearRegression()\nreg.fit(X, y)\n\n# Print the coefficients\nprint(reg.coef_)\n```\nThis will output the coefficients of the linear regression model. You can also use the `predict()` method to make predictions 

In [10]:
print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a linear regression in python

ASSISTANT:

To write a linear regression in Python, you can use scikit-learn library. Here is an example of how to do it:
```
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate some sample data
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Create a Linear Regression object and fit the data
reg = LinearRegression()
reg.fit(X, y)

# Print the coefficients
print(reg.coef_)
```
This will output the coefficients of the linear regression model. You can also use the `predict()` method to make predictions on new data:
```
# Generate some new data to predict
new_x = np.random.rand(5, 5)

# Make predictions using the trained model
preds = reg.predict(new_x)
print(preds)
```
This will output the predicted values for the new data.

Note: This is just a simple example to illustrate how to use linear regression in Pyth
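The printed text repeats the prompt because of `echo=True`, and it stops mid-word, most likely because generation reached the `max_tokens=256` limit. A small helper can strip the echoed prompt and report whether the answer was cut off. This is a sketch: `extract_completion` is a hypothetical name, and it assumes each entry in `choices` carries a `finish_reason` field whose value is `"length"` when the token limit is hit. The example uses a hand-written dictionary rather than real model output.

```python
def extract_completion(response, prompt):
    """Return (answer_text, was_truncated) from a completion-style response dict."""
    choice = response["choices"][0]
    text = choice["text"]
    # With echo=True the returned text starts with the prompt, so remove it.
    if text.startswith(prompt):
        text = text[len(prompt):]
    # Assumption: finish_reason is "length" when max_tokens cut the answer short.
    was_truncated = choice.get("finish_reason") == "length"
    return text.strip(), was_truncated


# Hand-written stand-in for a real response, for illustration only.
example_prompt = "USER: hi\n\nASSISTANT:\n"
example_response = {
    "choices": [
        {"text": example_prompt + "Hello there", "finish_reason": "length"}
    ]
}

answer, was_truncated = extract_completion(example_response, example_prompt)
print(answer)
print("truncated:", was_truncated)
```

In the notebook this would be called as `extract_completion(response, prompt_template)`. If the answer turns out to be truncated, raising `max_tokens` is the obvious next step.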