# Bringing it in: Running Large Language Models Locally

This notebook shows how to run quantized LLMs locally. The open-source LLMs are provided by <a href='https://github.com/TheBloke'>TheBloke</a> and are downloaded from his Hugging Face <a href='https://huggingface.co/TheBloke'>page</a>.

Since this notebook illustrates a <a href='https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/'>RAG</a> solution, we use Streamlit's publicly accessible <a href='https://github.com/streamlit/docs.git'>documentation</a> as the RAG knowledge base. The example builds a conversational agent that can answer questions about the Streamlit documentation.

### CPU Cores
Because a quantized model's performance on CPU scales with the number of cores, we first check how many cores are available in this environment.

In [1]:
import multiprocessing

cores = multiprocessing.cpu_count()
cores

16
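
One way to act on this number is to derive a thread count for inference. The sketch below is an illustration and not part of the original workflow: the helper name `pick_thread_count` and the choice to keep one core free are assumptions. LangChain's `LlamaCpp` wrapper accepts an `n_threads` argument, so the result could be passed there. Note that `cpu_count()` reports logical cores, which may be more than the process is actually allowed to use inside a container.

```python
import multiprocessing


def pick_thread_count(total_cores: int, reserve: int = 1) -> int:
    """Return a thread count that leaves `reserve` cores free (hypothetical helper)."""
    # Never go below one thread, even on a single-core machine.
    return max(1, total_cores - reserve)


n_threads = pick_thread_count(multiprocessing.cpu_count())
print(f"Using {n_threads} threads for inference")
```

Whether leaving a core free actually helps depends on what else runs on the machine, so it is worth timing a few settings.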

### Dependencies
Install the required libraries.

In [1]:
!pip install faiss-cpu llama-cpp-python llama-index langchain "unstructured[md]" sentence-transformers

## LLM Setup
Download the appropriate quantized file for the LLM and load it into memory.

##### PS: These are large files, so expect some wait time.

In [None]:
BASE_FOLDER = "/home/notebooks/storage/"

In [4]:
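import os
import requests

#model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q2_K.bin"
#model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf?download=true"
#model_url = "https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf?download=true"
model_url = "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf?download=true"
model_path = BASE_FOLDER + "new_llama.gguf"

# Stream to disk in 1 MiB chunks so the ~4 GB file is never held in memory all at once.
with requests.get(model_url, allow_redirects=True, stream=True) as r:
    r.raise_for_status()  # fail early on HTTP errors
    with open(model_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)
os.path.getsize(model_path)  # size of the downloaded file in bytes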


4081004224
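
Before loading a multi-gigabyte download, it can be worth confirming that the file really is a GGUF model and not, for example, an HTML error page. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number. The function name `read_gguf_version` is hypothetical, and the check only inspects the first eight bytes; it says nothing about whether the rest of the file is intact.

```python
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} does not look like a GGUF file (magic={magic!r})")
        # The version is stored as an unsigned 32-bit little-endian integer.
        (version,) = struct.unpack("<I", f.read(4))
    return version
```

For the file downloaded above, a call such as `read_gguf_version(BASE_FOLDER + "new_llama.gguf")` would be expected to return the same version that the model loader reports in its log.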

Because some of these models are base (text-completion) models rather than chat models, we use the simple prompt template shown below.

In [5]:
from langchain.llms import LlamaCpp
from langchain import LLMChain, PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import (
    StreamingStdOutCallbackHandler
)

template = """Question: {question} Answer:"""

prompt = PromptTemplate(template=template, input_variables=["question"])
# callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

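llm = LlamaCpp(
    model_path=BASE_FOLDER + "new_llama.gguf",
    n_batch=60,       # prompt tokens processed per batch
    temperature=0,    # favor deterministic output
    max_tokens=4095,  # upper bound on generated tokens
    n_parts=1,
    n_ctx=2048,       # context window size in tokens
    #n_gpu_layers=512,
    #callback_manager=callback_manager,
)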

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/notebooks/storage/new_llama.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1
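
To make the template concrete, the sketch below fills the same `Question: {question} Answer:` string using plain Python string formatting. It is only an illustration: the helper `render_prompt` is hypothetical, and in the notebook itself the equivalent call would be `prompt.format(question=...)` on the `PromptTemplate` object.

```python
# Same template string as in the cell above.
template = """Question: {question} Answer:"""


def render_prompt(question: str) -> str:
    """Substitute the question into the template (hypothetical helper)."""
    return template.format(question=question)


print(render_prompt("Does streamlit connect to SQL Server?"))
# prints: Question: Does streamlit connect to SQL Server? Answer:
```

Ending the prompt with `Answer:` gives a text-completion model a natural place to continue from.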

### Knowledge Base
Download the Streamlit documentation.

In [None]:
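import shutil
import requests

r = requests.get(
    "https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/streamlit_md.zip",
    allow_redirects=True,
)
r.raise_for_status()  # stop here if the download failed
# Use a context manager so the archive is flushed and closed before unpacking.
with open(BASE_FOLDER + "streamlit_md.zip", "wb") as f:
    f.write(r.content)

shutil.unpack_archive(BASE_FOLDER + "streamlit_md.zip", BASE_FOLDER)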

Using <a href='https://www.langchain.com/'>LangChain</a>'s Markdown text splitter, we split the documents into chunks for the vector database.

In [7]:
import re
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import MarkdownTextSplitter

SOURCE_DOCUMENTS_DIR = BASE_FOLDER+"streamlit_md"
SOURCE_DOCUMENTS_FILTER = "**/*.md"

loader = DirectoryLoader(SOURCE_DOCUMENTS_DIR, glob=SOURCE_DOCUMENTS_FILTER)
splitter = MarkdownTextSplitter(
    chunk_size=2000,
    chunk_overlap=1000,
)

print(f"Loading {SOURCE_DOCUMENTS_DIR} directory")
data = loader.load()
print(f"Splitting {len(data)} documents")
docs = splitter.split_documents(data)
#for doc in docs:
    #doc.metadata['source'] = re.sub(r'storage/datarobot_docs/en/(.+)\.md', r'https://docs.datarobot.com/en/docs/\1.html', doc.metadata['source'])
print(f"Created {len(docs)} documents")

Loading /home/notebooks/storage/streamlit_md directory
[nltk_data] Downloading package punkt to /home/notebooks/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/notebooks/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting 278 documents
Created 781 documents
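
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a deliberately simplified character-window chunker. This is not how `MarkdownTextSplitter` works internally (it splits on Markdown-aware separators and then merges pieces); the function `sliding_chunks` below is a hypothetical stand-in that only shows how the overlap makes neighboring chunks share text.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows where consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints: ['abcd', 'cdef', 'efgh', 'ghij']
```

With the settings used above (`chunk_size=2000`, `chunk_overlap=1000`), the overlap is half the chunk size, so most of the text ends up in two chunks. That increases the index size but reduces the chance that a relevant passage is cut in half at a chunk boundary.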


Now we create vector embeddings for the document chunks and load them into a vector database. This notebook uses <a href='https://ai.meta.com/tools/faiss/'>FAISS</a> as the vector database.

In [8]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores.faiss import FAISS
from langchain.docstore.document import Document
import torch

EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

embedding_function = SentenceTransformerEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    cache_folder=BASE_FOLDER+"sentencetransformers",
)

texts = [doc.page_content for doc in docs]
metadatas = [doc.metadata for doc in docs]
db = FAISS.from_texts(texts, embedding_function, metadatas=metadatas)
db.save_local(BASE_FOLDER+"faiss-db")

print(f"FAISS VectorDB has {db.index.ntotal} documents")


FAISS VectorDB has 781 documents
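
Conceptually, a similarity search embeds the query and returns the stored vectors closest to it. The sketch below shows that idea as a brute-force nearest-neighbor search over toy two-dimensional vectors using Euclidean (L2) distance. It is an assumption-based illustration: the helper names are hypothetical, real embeddings from `all-MiniLM-L6-v2` have several hundred dimensions, and FAISS uses optimized index structures instead of a Python loop.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[int]:
    """Return the indices of the k stored vectors closest to the query."""
    ranked = sorted(enumerate(vectors), key=lambda item: l2_distance(query, item[1]))
    return [index for index, _ in ranked[:k]]


# Toy "embeddings" standing in for document chunks.
stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
print(top_k([0.9, 0.1], stored, k=2))
# prints: [1, 0]
```

The returned indices play the role of the document chunks that a call such as `db.similarity_search(query, k=2)` hands back.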


Test the vector database with a sample query.

In [9]:
db.similarity_search("Does streamlit connect to sql server?", k=2)

[Document(page_content="title: Connect Streamlit to Microsoft SQL Server\nslug: /knowledge-base/tutorials/databases/mssql\n\nConnect Streamlit to Microsoft SQL Server\n\nIntroduction\n\nThis guide explains how to securely access a remote Microsoft SQL Server database from Streamlit Community Cloud. It uses the pyodbc library and Streamlit's Secrets management.\n\nCreate an SQL Server database\n\nIf you already have a remote database that you want to use, feel free\nto skip to the next step.\n\nFirst, follow the Microsoft documentation to install SQL Server and the sqlcmd Utility. They have detailed installation guides on how to:\n\nInstall SQL Server on Windows\n\nInstall on Red Hat Enterprise Linux\n\nInstall on SUSE Linux Enterprise Server\n\nInstall on Ubuntu\n\nRun on Docker\n\nProvision a SQL VM in Azure\n\nOnce you have SQL Server installed, note down your SQL Server name, username, and password during setup.\n\nConnect locally\n\nIf you are connecting locally, use sqlcmd to conn

### LLM Endpoint
Using LangChain's `ConversationalRetrievalChain`, we create an endpoint that connects the LLM to the vector database retriever.

In [None]:
from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores.base import VectorStoreRetriever


retriever = VectorStoreRetriever(vectorstore=db)
chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever,
                                              return_source_documents=True)
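
At a high level, a retrieval chain like this fetches the chunks most relevant to the question, places them in a prompt together with the question, and asks the LLM to complete the answer. The sketch below mimics that flow with plain Python stand-ins. Everything in it is hypothetical (the function names, the prompt wording, and the fake retriever and generator); the real chain uses its own prompts and, when the chat history is not empty, first rewrites the question into a standalone one.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer_with_context(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> Dict[str, object]:
    """Retrieve context, build the prompt, and return the answer with its sources."""
    chunks = retrieve(question)
    prompt = build_prompt(question, chunks)
    return {"answer": generate(prompt), "source_documents": chunks}


# Stand-ins for the vector store retriever and the LLM.
def fake_retrieve(question: str) -> List[str]:
    return ["Connect Streamlit to Microsoft SQL Server using pyodbc."]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


result = answer_with_context("Does streamlit connect to SQL Server?", fake_retrieve, fake_generate)
print(result["answer"])
print(result["source_documents"])
```

Because the retrieved chunks are inserted into the prompt, their combined length directly affects how many tokens the model has to process before it can start answering.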


In [None]:
pretext = "You are the technical support for Streamlit product. Using \
context provided answer the following question from a streamlit user: "

### LLM Inference
We ask the chain a question about Streamlit and measure how long the answer takes.

In [12]:
from datetime import datetime

start_ts = datetime.now()
rv = {
    'completion': chain(
        inputs={
            'question': pretext + "Does streamlit connect to SQL Server",
            'chat_history': [],
        },
    )
}
end_ts = datetime.now()
delta = end_ts - start_ts


llama_print_timings:        load time =    5308.41 ms
llama_print_timings:      sample time =       9.05 ms /    17 runs   (    0.53 ms per token,  1879.08 tokens per second)
llama_print_timings: prompt eval time =   97539.66 ms /  1012 tokens (   96.38 ms per token,    10.38 tokens per second)
llama_print_timings:        eval time =    2879.18 ms /    16 runs   (  179.95 ms per token,     5.56 tokens per second)
llama_print_timings:       total time =  100525.64 ms


In [13]:
print('\n Difference in seconds:', delta.total_seconds())
print(rv['completion']['answer'])


 Difference in seconds: 100.566378
 Yes, you can use Streamlit to connect to Microsoft SQL Server databases.
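
The timing log printed by llama.cpp can be turned into throughput figures with a little arithmetic. The sketch below plugs in the numbers from the log above; the helper name `tokens_per_second` is hypothetical.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and an elapsed time in milliseconds into tokens per second."""
    return tokens / (elapsed_ms / 1000.0)


# Values copied from the llama_print_timings output above.
prompt_tps = tokens_per_second(1012, 97539.66)  # prompt evaluation
gen_tps = tokens_per_second(16, 2879.18)        # answer generation
prompt_share = 97539.66 / 100525.64             # fraction of total time spent on the prompt

print(f"prompt eval: {prompt_tps:.2f} tokens/s")
print(f"generation:  {gen_tps:.2f} tokens/s")
print(f"prompt share of total time: {prompt_share:.1%}")
```

In this run, about 97% of the wall-clock time went into evaluating the 1012-token prompt, not into generating the 16-token answer. On CPU, the amount of retrieved context placed in the prompt therefore dominates the response time.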
