# This Notebook Shows How to Load an LLM and Run It Locally

## Common Knowledge

- Large Language Models (LLMs) are foundational machine learning models that use deep learning algorithms to process and understand natural language.
- Llama 2 is Meta's open source large language model (LLM). It's basically the Facebook parent company's response to OpenAI's GPT models and Google's AI models like PaLM 2—but with one key difference: it's freely available for almost anyone to use for research and commercial purposes.

- Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging from 7 billion to 70 billion parameters. The fine-tuned chat variants are designed for dialogue use cases.

- According to Meta, the fine-tuned chat models outperform open-source chat models on most benchmarks tested and are on par with some popular closed-source models in human evaluations for helpfulness and safety.

- LangChain is an open source framework that lets developers combine large language models with external components, such as data sources and tools, to build LLM-powered applications.


- llama.cpp

    Its original objective was to run the LLaMA model with 4-bit integer quantization on a MacBook. It is a plain C/C++ implementation optimized for Apple silicon and x86 architectures, with support for several integer quantization schemes and BLAS libraries. Originally a web chat example, it now also serves as a development playground for ggml library features.

- GGML

    GGML is a C library for machine learning that also defines a file format used to distribute large language models (LLMs). It relies on quantization to run LLMs efficiently on consumer hardware. GGML files contain binary-encoded data, including a version number, hyperparameters, the vocabulary, and the weights. The vocabulary holds the tokens used for language generation, while the weights make up most of the file and so determine its size. Quantization stores the weights at reduced precision to cut memory and compute requirements.
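To make the idea of quantization concrete, here is a minimal pure-Python sketch of symmetric block-wise 4-bit quantization. It is an illustration only, not GGML's actual file layout or kernels; the function names and the tiny example block are assumptions chosen for readability.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to small signed integers plus one float scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed values
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quants]


# Hypothetical block of weights, just for illustration
weights = [0.7, -0.3, 0.12, 0.0]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
print(quants)
print(restored)
```

Each weight is stored as a small integer and shares one scale with its block, so the restored values are close to, but not exactly, the originals. That rounding error is the price paid for the smaller file.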

## Quantized Models from the Hugging Face Community
- The Hugging Face community provides quantized models, which allow us to efficiently and effectively utilize the model on the T4 GPU. It is important to consult reliable sources before using any model.

- There are several variations available, but the ones that interest us are based on the GGML library.

- We can see the different variations that [Llama-2-13B-GGML](https://huggingface.co/models?search=llama%202%20ggml) has here.

- In this case, we will use the model called [Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML).

## Other models you can use
- If you have problems loading the 13B-parameter model, you can choose a model with 7B parameters instead.
- A popular 7B model is:
  - [Llama-2-7B-Chat-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML)

## Note
- I am using Google Colab with a T4 GPU.
- The 13B model generally gives better responses than the 7B model, but takes longer to run.
- The 70B models are the largest and most capable in the family.
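A rough way to see why the 13B model is heavier than the 7B one is to estimate file size from the parameter count and the bits stored per weight. The sketch below is back-of-the-envelope arithmetic; the bits-per-weight figures (about 5 for `q4_1` and about 6 for `q5_1`, counting per-block scale data) and the rounded parameter counts are assumptions, not values read from the model files.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate on-disk size in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


size_13b_q5_1 = estimate_size_gb(13e9, 6.0)   # assumed ~6 bits per weight
size_7b_q4_1 = estimate_size_gb(7e9, 5.0)     # assumed ~5 bits per weight
print(f"13B at ~6 bits/weight: {size_13b_q5_1:.2f} GB")
print(f"7B at ~5 bits/weight: {size_7b_q4_1:.2f} GB")
```

The first estimate lands close to the roughly 9.7 GB download size noted in the download cell of this notebook.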


### Install libraries

If the install cell below fails (for example with a UTF-8 locale error), run these lines first and then try the install again:
```python
import locale
locale.getpreferredencoding = lambda: "UTF-8"
```

In [6]:
! CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
! pip install huggingface_hub

Collecting huggingface_hub
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.17.3


### Helper methods
#### Set ROOT_PATH variable to save your model

In [3]:
import os
import json
import pandas as pd
ROOT_PATH = "/content/"
csv_folder = f"{ROOT_PATH}data/"
MODELS_PATH = f"{ROOT_PATH}models/"
# Create the folders on first run; exist_ok avoids an error if they already exist
os.makedirs(MODELS_PATH, exist_ok=True)
os.makedirs(csv_folder, exist_ok=True)
print('models available : ', os.listdir(MODELS_PATH))

def make_folder_for_model_name(model_name):
    # MODELS_PATH already ends with "/", so this path contains a harmless double slash
    model_path = f"{MODELS_PATH}/{model_name}/"
    if not os.path.isdir(model_path):
        print('creating a new folder for this model')
        os.mkdir(model_path)
    else:
        print('folder is already present for model')
    return model_path


def split_text(text, max_length=512):
    chunks = []
    for i in range(0, len(text), max_length):
        chunks.append(text[i:i+max_length])
    return chunks

models available :  []
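`split_text` is not called in the cells shown here, but a quick check shows how it behaves: it cuts on character counts, not on tokens or word boundaries. The function is repeated below (as a list comprehension) so the snippet runs on its own; the sample string is made up.

```python
def split_text(text, max_length=512):
    """Split text into consecutive chunks of at most max_length characters."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]


sample = "abcdefghij" * 3                 # 30 characters of made-up text
chunks = split_text(sample, max_length=12)
print([len(c) for c in chunks])           # length of each chunk
```

Because the split ignores word boundaries, a chunk can end in the middle of a word; joining the chunks back together always reproduces the original text.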


## Load Model

In [5]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # the model is in bin format
model_name = "Llama-2-13B-chat-GGML"

# model_name_or_path = "TheBloke/Llama-2-7B-Chat-GGML"
# model_basename = 'llama-2-7b-chat.ggmlv3.q4_1.bin'
# model_name = "Llama-2-7B-chat-GGML"

model_folder_in_drive = make_folder_for_model_name(model_name)
model_bin_path = f"{model_folder_in_drive}{model_basename}"

creating a new folder for this model


### Load Model with hf_hub_download api (HUGGINGFACE HUB)

In [7]:
import shutil

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the model from the Hugging Face Hub (cached after the first run). Model size is ~9.7GB
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

# Copy the downloaded file into the model folder created above
shutil.copy(model_path, model_bin_path)

Downloading (…)chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

In [8]:
print(f'{model_name_or_path} model is saved to -- {model_folder_in_drive} in your file system / google drive')

TheBloke/Llama-2-13B-chat-GGML model is saved to -- /content/models//Llama-2-13B-chat-GGML/ in your file system / google drive


### If you have already saved the .bin file, you can set `model_path` to the local file directly
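One way to automate that choice is a small helper that prefers the local copy and only downloads when it is missing. This is a sketch; `resolve_model_path` is a hypothetical name, and the download step is passed in as a function so the helper itself has no dependency on `huggingface_hub`.

```python
import os


def resolve_model_path(local_path, download_fn):
    """Return local_path if the file exists, otherwise call download_fn()."""
    if os.path.isfile(local_path):
        return local_path
    return download_fn()


# In this notebook the call might look like this (names from the cells above):
# model_path = resolve_model_path(
#     model_bin_path,
#     lambda: hf_hub_download(repo_id=model_name_or_path, filename=model_basename),
# )
```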

In [9]:
# GPU
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32 # Change this value based on your model and your GPU VRAM pool.
    )

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
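One rough way to pick `n_gpu_layers` is to divide the model file size by the number of transformer layers and see how many layers fit in the VRAM you are willing to spend. This is a heuristic sketch, not how llama.cpp actually allocates memory: the 40-layer count for Llama 2 13B, the 15 GB of usable T4 memory, and the 2 GB reserve for the context cache and scratch buffers are all assumptions.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many equally sized layers fit in the available VRAM."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


# Assumed figures: 9.76 GB file, 40 layers, 15 GB usable VRAM
suggested = layers_that_fit(model_size_gb=9.76, n_layers=40, vram_gb=15.0)
print(f"suggested n_gpu_layers: {suggested}")
```

Treat the result as a starting point, and lower it if loading fails with an out-of-memory error.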


## Model is loaded for action

### prompt - 1

In [12]:
prompt = "Who was 11th indian prime minister"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
response = lcpp_llm(prompt=prompt_template,
                    max_tokens=256,
                    temperature=0.5,
                    top_p=0.95,
                    repeat_penalty=1.2,
                    top_k=150,
                    echo=True)

response

{'id': 'cmpl-d1c80041-a9c2-4058-a502-8c6af164a8b1',
 'object': 'text_completion',
 'created': 1696926448,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_1.bin',
 'choices': [{'text': 'SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.\n\nUSER: Who was 11th indian prime minister\n\nASSISTANT:\nThe 11th Prime Minister of India was Manmohan Singh. He served from 2004 to 2014, leading the country through a period of rapid economic growth and social reform. Prior to his tenure as Prime Minister, Dr. Singh held various positions in government and academia, including serving as Finance Minister under Prime Minister Rajiv Gandhi.',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 43, 'completion_tokens': 80, 'total_tokens': 123}}
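Because the call uses `echo=True`, the returned text starts with the prompt itself. A small helper can pull out just the answer. This is a sketch that assumes the response dictionary has the shape printed above; `completion_only` is a hypothetical helper name, and the sample dictionary is a shortened stand-in for a real response.

```python
def completion_only(response, prompt):
    """Return the generated text without the echoed prompt."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Shortened stand-in for a real response dictionary
sample_prompt = "USER: hello\n\nASSISTANT:\n"
sample_response = {
    "choices": [{"text": sample_prompt + "Hi there!", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 8, "completion_tokens": 3, "total_tokens": 11},
}
answer = completion_only(sample_response, sample_prompt)
print(answer)
print(sample_response["usage"]["total_tokens"])
```

In this notebook the call would be `completion_only(response, prompt_template)`. Passing `echo=False` to `lcpp_llm` avoids the echo in the first place.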

In [15]:
print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Who was 11th indian prime minister

ASSISTANT:
The 11th Prime Minister of India was Manmohan Singh. He served from 2004 to 2014, leading the country through a period of rapid economic growth and social reform. Prior to his tenure as Prime Minister, Dr. Singh held various positions in government and academia, including serving as Finance Minister under Prime Minister Rajiv Gandhi.

Note that this answer is factually wrong: by the usual count, the 11th Prime Minister of India was H. D. Deve Gowda, and Manmohan Singh served as Finance Minister under P. V. Narasimha Rao, not Rajiv Gandhi. Factual answers from the model should be verified.


### prompt - 2

In [16]:
prompt = "write an sql query to get all the unique brands from table 'SALES' , Column name is 'BRAND'"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
response = lcpp_llm(prompt=prompt_template,
                    max_tokens=256,
                    temperature=0.5,
                    top_p=0.95,
                    repeat_penalty=1.2,
                    top_k=150,
                    echo=True)

Llama.generate: prefix-match hit
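The same SYSTEM/USER/ASSISTANT template is rebuilt by hand for every prompt in this notebook. A small function, sketched below, keeps it in one place; `build_prompt` and `DEFAULT_SYSTEM` are illustrative names, and the layout simply mirrors the template used in these cells.

```python
DEFAULT_SYSTEM = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully."
)


def build_prompt(user_message, system_message=DEFAULT_SYSTEM):
    """Format a prompt in the SYSTEM/USER/ASSISTANT layout used in this notebook."""
    return f"SYSTEM: {system_message}\n\nUSER: {user_message}\n\nASSISTANT:\n"


prompt_template = build_prompt("Who was 11th indian prime minister")
print(prompt_template)
```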


In [17]:
response

{'id': 'cmpl-6222887f-c1dd-4fae-8d9b-faef5d06cf3c',
 'object': 'text_completion',
 'created': 1696926647,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_1.bin',
 'choices': [{'text': "SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.\n\nUSER: write an sql query to get all the unique brands from table 'SALES' , Column name is 'BRAND'\n\nASSISTANT:\nHi there! I can definitely help you with that. Here's a SQL query that will retrieve all the unique brands from the 'SALES' table based on the 'BRAND' column:\n```sql\nSELECT DISTINCT BRAND FROM SALES;\n```\nThis query uses the `DISTINCT` keyword to return only unique values in the 'BRAND' column. It should give you the list of all distinct brands present in the 'SALES' table. Let me know if you have any further questions or need more help!",
   'index': 0,
   'logprobs': None,
   'finish_reason'

In [18]:
print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: write an sql query to get all the unique brands from table 'SALES' , Column name is 'BRAND'

ASSISTANT:
Hi there! I can definitely help you with that. Here's a SQL query that will retrieve all the unique brands from the 'SALES' table based on the 'BRAND' column:
```sql
SELECT DISTINCT BRAND FROM SALES;
```
This query uses the `DISTINCT` keyword to return only unique values in the 'BRAND' column. It should give you the list of all distinct brands present in the 'SALES' table. Let me know if you have any further questions or need more help!
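The prompts in this notebook use a generic SYSTEM/USER/ASSISTANT layout. Llama 2 chat models were fine-tuned with a different layout built from `[INST]` and `<<SYS>>` markers, so it may be worth comparing results with that format. The sketch below builds a single-turn prompt; the function name is illustrative, and the leading `<s>` token is left out on the assumption that the tokenizer adds the beginning-of-sequence token itself.

```python
def build_llama2_chat_prompt(user_message, system_message):
    """Single-turn prompt in the [INST] / <<SYS>> layout of Llama 2 chat models."""
    return (
        f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


chat_prompt = build_llama2_chat_prompt(
    "write an sql query to get all the unique brands from table 'SALES'",
    "You are a helpful, respectful and honest assistant.",
)
print(chat_prompt)
```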


### prompt - 3

In [19]:
prompt = "write me a linkedin recomendation in 150 words for a person , who is senior to me , \
has expertise in data science , use keywords enthusiastic , problem solver , great team player , passionate "
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
response = lcpp_llm(prompt=prompt_template,
                    max_tokens=256,
                    temperature=0.5,
                    top_p=0.95,
                    repeat_penalty=1.2,
                    top_k=150,
                    echo=True)
response

Llama.generate: prefix-match hit


{'id': 'cmpl-38844fca-9d9a-49e6-af5d-1e25cea6a519',
 'object': 'text_completion',
 'created': 1696926870,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_1.bin',
 'choices': [{'text': 'SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.\n\nUSER: write me a linkedin recomendation in 150 words for a person , who is senior to me , has expertise in data science , use keywords enthusiastic , problem solver , great team player , passionate \n\nASSISTANT:\n\nI\'d be happy to help you draft a LinkedIn recommendation for your colleague! Here\'s a sample recommendation that highlights their strengths as a senior data scientist, problem solver, and great team player:\n\n"I have had the pleasure of working with [Name] for several years now, and I can confidently say that they are one of the most enthusiastic and talented data scientists I know. With expe

In [20]:
print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: write me a linkedin recomendation in 150 words for a person , who is senior to me , has expertise in data science , use keywords enthusiastic , problem solver , great team player , passionate 

ASSISTANT:

I'd be happy to help you draft a LinkedIn recommendation for your colleague! Here's a sample recommendation that highlights their strengths as a senior data scientist, problem solver, and great team player:

"I have had the pleasure of working with [Name] for several years now, and I can confidently say that they are one of the most enthusiastic and talented data scientists I know. With expertise in machine learning and statistics, they have consistently demonstrated a passion for solving complex problems and driving business results. As a senior member of our team, [Name] has been an invaluable resource to their colleagues, offering guidance and support whenever needed. They are truly a gr

## Use with LangChain

In [21]:
! pip install langchain

Collecting langchain
  Downloading langchain-0.0.311-py3-none-any.whl (1.8 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.43 (from langchain)
  Downloading langsmith-0.0.43-py3-none-any.whl (40 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langch

In [22]:
from langchain.prompts import PromptTemplate

In [23]:
first_input_prompt = PromptTemplate(
    input_variables=['name'],
    template="Tell me about celebrity {name}"
)
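For a simple template like this one, `PromptTemplate.format` fills in the named variables much as Python's built-in `str.format` does. The stand-alone sketch below uses only the standard library to show the resulting string; it is an analogy, not LangChain's implementation.

```python
template = "Tell me about celebrity {name}"
filled = template.format(name="Shahrukh Khan")
print(filled)
```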

In [30]:
response = lcpp_llm(
    first_input_prompt.format(name = 'Shahrukh Khan')
    )
response

Llama.generate: prefix-match hit


{'id': 'cmpl-ba959e71-acea-494e-965b-f03c6df4b70b',
 'object': 'text_completion',
 'created': 1696927614,
 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_1.bin',
 'choices': [{'text': '.\nShahrukh Khan is an Indian film actor, producer and television personality who works in Bollywood films. He has been referred to in the media as the "Badshah of Bollywood" and has been called the "King of Romance." He began his career in the late 1980s and has since become one of the most successful actors in Bollywood history, with an estimated net worth of over $600 million.\nHere are some interesting facts about Shahrukh Khan:\n\n1. Early life: Shahrukh was born on November 2, 19',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 10, 'completion_tokens': 128, 'total_tokens': 138}}
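
The last response ends mid-sentence with `finish_reason` set to `'length'`, which means generation stopped at the token limit (128 completion tokens here, since no `max_tokens` was passed). A small check like the one below can flag such cases; `was_truncated` is a hypothetical helper, and the sample dictionaries are trimmed stand-ins for real responses.

```python
def was_truncated(response):
    """True if any choice stopped because it ran out of tokens."""
    return any(c.get("finish_reason") == "length" for c in response["choices"])


# Trimmed stand-ins for real response dictionaries
cut_off = {"choices": [{"text": "...", "finish_reason": "length"}]}
complete = {"choices": [{"text": "...", "finish_reason": "stop"}]}
print(was_truncated(cut_off), was_truncated(complete))
```

When the check returns `True`, calling the model again with a larger `max_tokens` value gives it room to finish the answer.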