# How to Run Llama 2 Locally with Python (Quickstart)

This Jupyter notebook accompanies a blog post on https://swharden.com:

https://swharden.com/blog/2023-07-29-ai-chat-locally-with-python/

In [4]:

from llama_cpp import Llama

from IPython.display import display, HTML
import json
import time
import pathlib

In [15]:
# load the large language model file

LLM = Llama(model_path="models/llama-2-7b-chat.ggmlv3.q8_0.bin")

# create a text prompt
prompt = "Q: What are the names of the days of the week? A:"

# generate a response (takes several seconds)
output = LLM(prompt)

# display the response
print(output["choices"][0]["text"])

llama.cpp: loading model from models/llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 6828.73 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_n

 The names of the days of the week, in order, are: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.



llama_print_timings:        load time =   580.29 ms
llama_print_timings:      sample time =    24.99 ms /    36 runs   (    0.69 ms per token,  1440.63 tokens per second)
llama_print_timings: prompt eval time =   580.26 ms /    16 tokens (   36.27 ms per token,    27.57 tokens per second)
llama_print_timings:        eval time =  2374.99 ms /    35 runs   (   67.86 ms per token,    14.74 tokens per second)
llama_print_timings:       total time =  3022.81 ms
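
The value returned by `LLM(prompt)` is a plain dictionary laid out like an OpenAI-style completion, which is why the cell above indexes `output["choices"][0]["text"]`. The sketch below pulls out the text along with a few diagnostics. The helper name `summarize_completion` and the sample dictionary are made up for illustration; the `usage` and `finish_reason` fields are what llama-cpp-python normally returns, but that is an assumption worth checking against the installed version.

```python
def summarize_completion(output):
    """Return the generated text plus a few diagnostics from a completion dict."""
    choice = output["choices"][0]
    usage = output.get("usage", {})  # assumed field; tolerate its absence
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# hand-written stand-in for the dictionary returned by LLM(prompt)
sample_output = {
    "choices": [{"text": " Monday, Tuesday, Wednesday.", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 16, "completion_tokens": 8, "total_tokens": 24},
}

summary = summarize_completion(sample_output)
print(summary["text"])
```

With a real model this would be called as `summarize_completion(LLM(prompt))`. A `finish_reason` of `"length"` usually indicates that the response hit a token limit instead of ending naturally.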


In [7]:
# create a text prompt
prompt = "Q: Why are Jupyter notebooks difficult to maintain? A:"

# set max_tokens to 0 to remove the explicit response size limit
# (the response is still capped by the context window, n_ctx = 512 in the log above)
output = LLM(prompt, max_tokens=0)

# display the response
print(output["choices"][0]["text"])

Llama.generate: prefix-match hit


 Jupyter notebooks can be difficult to maintain for several reasons. Here are some of the common challenges that developers face when working with Jupyter notebooks:

1. Lack of structure: Jupyter notebooks are flexible and allow users to organize their code in any way they like. However, this flexibility can also lead to a lack of structure, making it difficult to navigate and maintain the notebook over time.
2. Multiple versions of code: Since Jupyter notebooks are dynamic and allow users to edit and run code interactively, it's easy for multiple versions of code to exist within a single notebook. This can make it challenging to track changes and maintain consistency across different sections of the notebook.
3. Inconsistent formatting: Jupyter notebooks use Markdown syntax for formatting, which can lead to inconsistencies in formatting across different sections of the notebook. This can make it difficult to maintain a consistent look and feel throughout the notebook.
4. Lack of vers


llama_print_timings:        load time =  3983.67 ms
llama_print_timings:      sample time =   359.94 ms /   496 runs   (    0.73 ms per token,  1378.00 tokens per second)
llama_print_timings: prompt eval time =   379.48 ms /    13 tokens (   29.19 ms per token,    34.26 tokens per second)
llama_print_timings:        eval time = 34094.93 ms /   495 runs   (   68.88 ms per token,    14.52 tokens per second)
llama_print_timings:       total time = 35494.21 ms
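
The response above stops mid-sentence ("4. Lack of vers"). A likely reason is that `max_tokens=0` lifts only the explicit cap, while generation remains bounded by the context window (`n_ctx = 512` in the load log). The arithmetic below is a sketch of that budget. The prompt length of 16 tokens is an inference (13 tokens evaluated plus, presumably, 3 reused through the prefix-match hit), not something the log states directly, and `generation_budget` is a hypothetical helper.

```python
def generation_budget(n_ctx, prompt_tokens):
    """Tokens left for the response once the prompt occupies part of the context."""
    return max(n_ctx - prompt_tokens, 0)


n_ctx = 512          # from the model load log
prompt_tokens = 16   # assumed: 13 evaluated + 3 presumably reused from the cache

budget = generation_budget(n_ctx, prompt_tokens)
print(budget)  # 496, the same as the number of sample runs in the timings above

# throughput check from the eval line: 495 tokens in 34094.93 ms
tokens_per_second = 495 / (34094.93 / 1000)
print(round(tokens_per_second, 2))  # 14.52, as reported in the log
```

If this reading is right, passing a larger `n_ctx` to the `Llama` constructor (for example `n_ctx=2048`) should allow longer responses, at the cost of more memory.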


In [11]:
"""
This script creates a database of information gathered from local PDF files.
"""

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS


# define what documents to load
loader = DirectoryLoader("./", glob="*.pdf", loader_cls=PyPDFLoader)

# load the documents, split them into overlapping chunks,
# and set up the embedding model used to index them
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                          chunk_overlap=50)
texts = splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'})

# create and save the local database
db = FAISS.from_documents(texts, embeddings)
db.save_local("faiss")
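
To make `chunk_size=500` and `chunk_overlap=50` concrete, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at paragraph, line, and word boundaries, so its chunks are often shorter than 500 characters. The function name `chunk_text` and the placeholder text are illustrative only.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))  # 1200 placeholder characters
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The last 50 characters of each chunk repeat at the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.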

In [16]:
"""
This script reads the database built from the local PDF files
and uses a large language model to answer questions about their content.
"""

from langchain.llms import CTransformers
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain import PromptTemplate
from langchain.chains import RetrievalQA

# prepare the template we will use when prompting the AI
template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful answer below and nothing else.
Helpful answer:
"""

# load the language model
llm = CTransformers(model='models/llama-2-7b-chat.ggmlv3.q8_0.bin',
                    model_type='llama',
                    config={'max_new_tokens': 256, 'temperature': 0.01})

# load the interpreted information from the local database
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'})
db = FAISS.load_local("faiss", embeddings)

# build a question-answering chain that retrieves the 2 most similar chunks
# for each query and inserts them into the prompt template
retriever = db.as_retriever(search_kwargs={'k': 2})
prompt = PromptTemplate(
    template=template,
    input_variables=['context', 'question'])
qa_llm = RetrievalQA.from_chain_type(llm=llm,
                                     chain_type='stuff',
                                     retriever=retriever,
                                     return_source_documents=True,
                                     chain_type_kwargs={'prompt': prompt})

# ask the AI chat about information in our local files
prompt = "what is the key results from the pdf file?"
output = qa_llm({'query': prompt})
print(output["result"])

The key results on page 3 of the PDF file are:

* Total value of production (£ million)
* Volume of output (thousand tons)
* Value added (£ million)
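
Roughly, the chain above does two things for each query: it fetches the `k` chunks closest to the question, then "stuffs" them into the `{context}` slot of the template before calling the model. The sketch below imitates that flow in plain Python. The 2-D vectors stand in for real sentence embeddings, Euclidean distance and the blank-line join are assumptions, and `top_k_chunks` and `build_stuffed_prompt` are hypothetical helpers, not LangChain functions.

```python
import math

TEMPLATE = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful answer below and nothing else.
Helpful answer:
"""


def top_k_chunks(query_vector, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are closest to the query vector."""
    ranked = sorted(indexed_chunks, key=lambda item: math.dist(query_vector, item[0]))
    return [text for _, text in ranked[:k]]


def build_stuffed_prompt(question, chunks):
    """Join the retrieved chunks and place them in the template's context slot."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


# toy 2-D vectors standing in for real sentence embeddings
indexed_chunks = [
    ((0.9, 0.1), "Chunk about production value."),
    ((0.1, 0.9), "Chunk about housing programmes."),
    ((0.8, 0.2), "Chunk about output volume."),
]

chunks = top_k_chunks((1.0, 0.0), indexed_chunks, k=2)
final_prompt = build_stuffed_prompt("What are the key results?", chunks)
print(final_prompt)
```

Because only two chunks of at most 500 characters reach the model, an answer can only reflect what those chunks contain, which is worth keeping in mind when reading the responses in this notebook.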


In [17]:
# ask the AI chat about information in our local files
prompt = "give me information about Purposes and uses of the pdf"
output = qa_llm({'query': prompt})
print(output["result"])

The purposes and uses of the PDF are to provide information on how the statistics are produced, managed, and used in accordance with the Statistics and Registration Service Act 2007 and the Code of Practice for Official Statistics. The PDF also explains how the data is collected, processed, and analyzed to meet identified user needs and to ensure that the statistics are well explained, readily accessible, produced according to sound methods, and managed impartially and objectively in the public interest.


In [14]:
# ask the AI chat about information in our local files
prompt = "What is included in this release?"
output = qa_llm({'query': prompt})
print(output["result"])

The release includes data relating to delivery for the six-month period ending 30 September 2021, covering all current and historical programmes delivered by Homes England, including the acquisition of existing land or property as well as new house building, and some programmes provide a mix of affordable and market housing.
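
The chain was created with `return_source_documents=True`, but the cells above print only `output["result"]`. The sketch below shows one way the retrieved chunks could be listed alongside an answer. `FakeDocument` is a stand-in for LangChain's document objects, and the `source` and `page` metadata keys are assumed to be what `PyPDFLoader` records; both are illustrative, and `describe_sources` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain document: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe_sources(source_documents, preview_chars=60):
    """Build one line per retrieved chunk: file, page, and a short text preview."""
    lines = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown file")
        page = doc.metadata.get("page", "?")
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{source} (page {page}): {preview}")
    return lines


# hypothetical result shaped like the dictionary returned by qa_llm(...)
fake_output = {
    "result": "An example answer.",
    "source_documents": [
        FakeDocument("First retrieved\nchunk of text.", {"source": "example.pdf", "page": 2}),
        FakeDocument("Second retrieved chunk."),
    ],
}

lines = describe_sources(fake_output["source_documents"])
for line in lines:
    print(line)
```

With the real chain, the equivalent call would be `describe_sources(output["source_documents"])`, assuming the result dictionary uses that key. Checking the sources is a quick way to see which document an answer actually came from.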
