<img src="../images/obt-banner.png" width=1200>

# Build Your Own Private ChatGPT Super-Assistant Using Streamlit, LangChain, Chroma & Llama 2
## Chroma Demo
**Questions?** contact@coefficient.ai / [@CoefficientData](https://twitter.com/CoefficientData)

---

## 0. Imports

In [2]:
import chromadb
from dotenv import load_dotenv

from utils import scrape_page

## 1. Chroma Basics

In [3]:
# Get the Chroma client
chroma_client = chromadb.Client()

In [4]:
# Create a collection
collection = chroma_client.create_collection(name="my_collection")

Collections are where you'll store your embeddings, documents, and any additional metadata. 

In [5]:
# Add some text documents to the collection
collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"],
)

Chroma will store your text, and handle tokenization, embedding, and indexing automatically.

In [6]:
collection2 = chroma_client.create_collection(name="another_collection")

In [7]:
# Load in pre-generated embeddings
collection2.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"],
)
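
When you supply your own embeddings, Chroma skips its built-in embedding step and compares the vectors directly. The sketch below illustrates the idea in plain Python, using the two three-dimensional vectors from the cell above. It assumes a squared Euclidean (L2) distance, which as far as we know is Chroma's default metric. The helper `squared_l2` and the query vector are made up for illustration, and a real index uses approximate nearest-neighbour search rather than this brute-force loop.

```python
# Brute-force nearest-neighbour lookup over pre-generated embeddings.
# Assumption: squared L2 distance, where a smaller value means "more similar".

stored_embeddings = {
    "id1": [1.2, 2.3, 4.5],
    "id2": [6.7, 8.2, 9.2],
}


def squared_l2(a, b):
    """Sum of squared differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same number of dimensions")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical query embedding; it must have the same dimensionality (3) as the stored ones
query_embedding = [1.0, 2.0, 4.0]

# Rank every stored vector by its distance to the query, closest first
ranked = sorted(
    ((doc_id, squared_l2(query_embedding, vector)) for doc_id, vector in stored_embeddings.items()),
    key=lambda pair: pair[1],
)

for doc_id, distance in ranked:
    print(f"{doc_id}: {distance:.2f}")
```

One practical consequence: a collection built from pre-generated embeddings should probably be queried with `query_embeddings` of the same dimensionality, not with `query_texts`, because text queries would be embedded by a different model.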

In [8]:
# Query the collection
results = collection.query(query_texts=["This is a query document"], n_results=2)

In [9]:
results

{'ids': [['id1', 'id2']],
 'distances': [[0.7111212015151978, 1.0109771490097046]],
 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]],
 'embeddings': None,
 'documents': [['This is a document', 'This is another document']],
 'uris': None,
 'data': None}
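
Each field in the result holds one inner list per query, which is why everything is nested one level deep. The sketch below shows one way to flatten the hits for a single query into records. `flatten_results` is a hypothetical helper (not part of Chroma), and `example_results` is copied from the output above.

```python
# Copied from the query output above
example_results = {
    "ids": [["id1", "id2"]],
    "distances": [[0.7111212015151978, 1.0109771490097046]],
    "metadatas": [[{"source": "my_source"}, {"source": "my_source"}]],
    "embeddings": None,
    "documents": [["This is a document", "This is another document"]],
    "uris": None,
    "data": None,
}


def flatten_results(results, query_index=0):
    """Return one dict per hit for the query at position `query_index`."""
    return [
        {"id": doc_id, "distance": distance, "metadata": metadata, "document": document}
        for doc_id, distance, metadata, document in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


hits = flatten_results(example_results)
for hit in hits:
    # Hits arrive ordered by distance: the smaller the distance, the closer the match
    print(f'{hit["id"]}  {hit["distance"]:.3f}  {hit["document"]}')
```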

- **Where is data stored?** By default, data stored in Chroma is ephemeral (held in memory only), which makes it easy to prototype scripts.
- **Can data be persisted?** It's easy to make Chroma persistent so you can reuse every collection you create and add more documents to it later. It will load your data automatically when you start the client, and save it automatically when you close it.

Check out the [Usage Guide](https://docs.trychroma.com/usage-guide) for more info.

In [10]:
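persistent_client = chromadb.PersistentClient(path=".")
# get_or_create_collection avoids an error when this cell is re-run and the collection already exists on disk
persistent_collection = persistent_client.get_or_create_collection(name="persistent_collection")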

---

## 2. Create embeddings with LangChain

### Create embeddings with Llama

In [11]:
from langchain.embeddings.llamacpp import LlamaCppEmbeddings

In [12]:
# Make sure the model path is correct!
llama_embedder = LlamaCppEmbeddings(model_path="../models/llama-2-7b-chat.Q4_K_M.gguf")

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 

In [13]:
text = "This is a test document."
query_result = llama_embedder.embed_query(text)


llama_print_timings:        load time =    8155.61 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    8155.23 ms /     7 tokens ( 1165.03 ms per token,     0.86 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    8156.12 ms


In [14]:
len(query_result)

4096

In [15]:
query_result[:10]

[-0.07228467613458633,
 0.524428129196167,
 1.821649193763733,
 -0.3776538670063019,
 0.24015973508358002,
 -0.5863308906555176,
 -1.637908935546875,
 1.1341742277145386,
 0.24986322224140167,
 0.5149620175361633]

In [16]:
doc_result = llama_embedder.embed_documents([text])


llama_print_timings:        load time =    8155.61 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     986.80 ms /     7 tokens (  140.97 ms per token,     7.09 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     988.22 ms


In [17]:
len(doc_result)

1

In [18]:
doc_result[0][:10]

[-0.07228467613458633,
 0.524428129196167,
 1.821649193763733,
 -0.3776538670063019,
 0.24015973508358002,
 -0.5863308906555176,
 -1.637908935546875,
 1.1341742277145386,
 0.24986322224140167,
 0.5149620175361633]
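
The first ten values from `embed_query` and `embed_documents` match, which suggests both methods return the same vector for the same text. To compare two embeddings numerically rather than by eye, cosine similarity is a common choice. The sketch below is plain Python, and `cosine_similarity` is a helper written for this example, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First ten values of the embedding shown above (the full vector has 4096 entries)
head = [
    -0.07228467613458633,
    0.524428129196167,
    1.821649193763733,
    -0.3776538670063019,
    0.24015973508358002,
    -0.5863308906555176,
    -1.637908935546875,
    1.1341742277145386,
    0.24986322224140167,
    0.5149620175361633,
]

same_direction = cosine_similarity(head, head)
opposite_direction = cosine_similarity(head, [-value for value in head])
print(same_direction, opposite_direction)
```

In the notebook, `cosine_similarity(query_result, doc_result[0])` would compare the two full vectors.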

### Load, split and embed a document with LangChain

In [85]:
# Let's get some more interesting data
url = "https://www.gov.uk/government/publications/frontier-ai-capabilities-and-risks-discussion-paper/frontier-ai-capabilities-and-risks-discussion-paper"
paper = scrape_page(url)

In [88]:
# Take a peek
print(f"{len(paper)=}\n\nExtract:")
print(paper[10000:15000])

len(paper)=128444

Extract:
ful tasks Simply from being trained to predict the next word across diverse datasets, models develop sophisticated capabilities. [footnote 21] For example, frontier AI can (with varying degrees of success and reliability): Converse fluently and at length, drawing on extensive information contained in training data. Write long sequences of well-functioning code from natural language instructions,including making new apps. [footnote 22] Score highly on high-school and undergraduate examinations in many subjects. [footnote 23] Generate plausible news articles. [footnote 24] Creatively combine ideas together from very different domains. [footnote 25] Explain why novel sophisticated jokes are funny. [footnote 26] Translate between multiple languages. [footnote 27] Direct the activities of robots via reasoning, planning and movement control. [footnote 28] Analyse data by plotting graphs and calculating key quantities. [footnote 29] Answer questions about images th

In [87]:
# Save it to disk - we only do 5000 characters as Llama is very slow at embedding
with open("frontier-ai-paper.txt", "w") as f:
    f.write(paper[10000:15000])

In [72]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the document
raw_documents = TextLoader("frontier-ai-paper.txt").load()

# Split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

In [73]:
len(documents)

11
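
To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-size splitter. It is not how `RecursiveCharacterTextSplitter` works: the real splitter prefers to break at separators such as paragraph breaks, newlines and spaces, so its chunks are at most `chunk_size` characters and often shorter. That would explain why the 5,000-character extract produced 11 chunks above where the naive version below produces 10. `fixed_size_chunks` is a name invented for this sketch.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=0):
    """Cut `text` every `chunk_size - chunk_overlap` characters, ignoring word boundaries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "x" * 5000  # stand-in for the 5,000-character extract

no_overlap = fixed_size_chunks(sample, chunk_size=500, chunk_overlap=0)
with_overlap = fixed_size_chunks(sample, chunk_size=500, chunk_overlap=50)

print(len(no_overlap))    # 10 chunks of exactly 500 characters
print(len(with_overlap))  # 12 chunks, because each one starts 450 characters after the last
```

Overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of storing and embedding more text.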

In [74]:
documents[:2]

[Document(page_content='ful tasks\nSimply from being trained to predict the next word across diverse datasets, models develop sophisticated capabilities.\n[footnote 21]\nFor example, frontier\nAI\ncan (with varying degrees of success and reliability):\nConverse fluently and at length, drawing on extensive information contained in training data.\nWrite long sequences of well-functioning code from natural language instructions,including making new apps.\n[footnote 22]', metadata={'source': 'frontier-ai-paper.txt'}),
 Document(page_content='Score highly on high-school and undergraduate examinations in many subjects.\n[footnote 23]\nGenerate plausible news articles.\n[footnote 24]\nCreatively combine ideas together from very different domains.\n[footnote 25]\nExplain why novel sophisticated jokes are funny.\n[footnote 26]\nTranslate between multiple languages.\n[footnote 27]\nDirect the activities of robots via reasoning, planning and movement control.\n[footnote 28]\nAnalyse data by plott

In [55]:
%%time
from langchain.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(documents, llama_embedder)


llama_print_timings:        load time =    8155.61 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    9091.22 ms /    99 tokens (   91.83 ms per token,    10.89 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    9110.24 ms

llama_print_timings:        load time =    8155.61 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   10842.17 ms /   131 tokens (   82.76 ms per token,    12.08 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   10861.88 ms


CPU times: user 4min 53s, sys: 36.8 s, total: 5min 30s
Wall time: 1min 56s



llama_print_timings:        load time =    8155.61 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    9688.22 ms /   106 tokens (   91.40 ms per token,    10.94 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    9706.77 ms


### Similarity search

In [98]:
query = "What are scaffolds in AI?"
docs = db.similarity_search(query)
print(docs[0].page_content.replace("\n", " "))

Score highly on high-school and undergraduate examinations in many subjects. [footnote 23] Generate plausible news articles. [footnote 24] Creatively combine ideas together from very different domains. [footnote 25] Explain why novel sophisticated jokes are funny. [footnote 26] Translate between multiple languages. [footnote 27] Direct the activities of robots via reasoning, planning and movement control. [footnote 28] Analyse data by plotting graphs and calculating key quantities.



llama_print_timings:        load time =    8155.61 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    1092.65 ms /    10 tokens (  109.26 ms per token,     9.15 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    1097.11 ms


### Using SentenceTransformerEmbeddings

In [70]:
# Initialise the new embedder
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

st_embedder = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [90]:
%%time
# Compare this with SentenceTransformerEmbeddings
db2 = Chroma.from_documents(documents, st_embedder, collection_name="st_embeddings")

CPU times: user 5.46 s, sys: 1.77 s, total: 7.23 s
Wall time: 3.26 s


**Note: `SentenceTransformerEmbeddings` embeds the same chunks in about 3 seconds of wall time, while Llama 2 took nearly 2 minutes!**

In [92]:
# Save the whole paper this time, Sentence-Transformers can handle it
print(f"{len(paper)=}")
with open("frontier-ai-paper.txt", "w") as f:
    f.write(paper)

len(paper)=128444


In [93]:
raw_documents = TextLoader("frontier-ai-paper.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

In [94]:
len(documents)

131

In [95]:
%%time
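# Use a fresh collection name: reusing "st_embeddings" may append these chunks to the 11 stored earlier
db2 = Chroma.from_documents(documents, st_embedder, collection_name="st_embeddings_full")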

CPU times: user 5.27 s, sys: 1.71 s, total: 6.98 s
Wall time: 2.79 s


In [97]:
docs = db2.similarity_search("What are scaffolds in AI?")
print(docs[0].page_content.replace("\n", " "))

. Scaffolding software programs (‘scaffolds’) structure the information flow of an AI model, leaving the model itself unchanged. [footnote 53] Better scaffolds could, for example, help an AI agent self-correct when they have made a mistake, [footnote 54] or improve their long-term memory. New fine-tuning data . Fine-tuning on high-quality data can significantl


In [100]:
docs = db2.similarity_search("What are the top risks of frontier models?")
print(docs[0].page_content.replace("\n", " "), "\n\n")
print(docs[1].page_content.replace("\n", " "))

the current state and key trends relating to frontier AI capabilities, and then explores how frontier AI capabilities might evolve in the future and reviews some key risks. There is significant uncertainty around both the capabilities and risks from AI , including some experts who believe that some of these risks are overstated. This report focuses on evidence for risks and concludes that doing further research is necessary. 


In this section, we first review several cross-cutting risk factors – technical and societal conditions that could aggravate a number of particular risks. We then discuss individual risks under 3 headings: societal harms misuse loss of control We do not comprehensively cover all important AI risks and only highlight some salient examples. Cross cutting risk factors There are many long-standing technical challenges [footnote 104] to building safe AI systems, evaluating whether they are safe, and understanding how they make decisions. They exhibit unexpected failu

### Maximal marginal relevance search (MMR)
Maximal marginal relevance optimizes for both similarity to the query and diversity among the selected documents. It is also supported in the async API.

In [102]:
query = "What are the top risks of frontier models?"
retriever = db2.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents(query)

print(docs[0].page_content.replace("\n", " "), "\n\n")
print(docs[1].page_content.replace("\n", " "))

the current state and key trends relating to frontier AI capabilities, and then explores how frontier AI capabilities might evolve in the future and reviews some key risks. There is significant uncertainty around both the capabilities and risks from AI , including some experts who believe that some of these risks are overstated. This report focuses on evidence for risks and concludes that doing further research is necessary. 


systems, especially open release models, could also be used privately, and such use would likely remain undetected. Frontier AI models embody extremely valuable intellectual property. Even if frontier developers intend to limit deployment, the information security practices of frontier developers will influence the likelihood that the full model is exfiltrated by employees or external actors. Much more investment in security would be needed for frontier AI developers to defend against attacks from the most well-resourced actors. [footnote 135] After exfiltration
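
To make the trade-off between relevance and diversity concrete, here is a minimal sketch of the greedy MMR selection loop on toy two-dimensional vectors. It is an illustration under stated assumptions, not LangChain's implementation: `mmr_select` and the toy vectors are invented for this example, and the name `lambda_mult` mirrors the parameter we believe LangChain uses (1.0 means pure relevance, lower values favour diversity).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest document already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


toy_query = [1.0, 0.0]
toy_docs = [
    [1.0, 0.10],   # very relevant
    [1.0, 0.12],   # nearly a duplicate of the first
    [0.8, -0.60],  # less relevant, but different
]

mmr_picks = mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5)
relevance_only_picks = mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0)
print(mmr_picks, relevance_only_picks)
```

With `lambda_mult=1.0` the near-duplicate is chosen second, as in a plain similarity search. With `lambda_mult=0.5` the more distinct third document is chosen instead, which matches the behaviour seen above, where the second MMR result differs from the second similarity-search result.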

### Deep linking

In [105]:
docs = db2.similarity_search("Which model has the best benchmark?")
result = docs[0].page_content
print(result.replace("\n", " "))

on code and mathematics data, it is more likely to be good at solving programming puzzles and mathematics problems. Figure 6. Performance on broad benchmarks such as BIG-Bench and MMLU improves with more training compute. This figure was taken from Owen 2023. See figure 6 in an accessible format. Although average performance, aggregated across many downstream tasks, improves fairly predictably with scale, it is much harder to predict performance improvements at specific real-world problems. The development of frontier AI systems has involved many examples of surprising capabilities, unanticipated by model developers before training and often only discovered by users after deployment. There are documented examples of unexpected capabilities where models were not showing any signs of improvement before a certain scale and then rapidly improved suddenly [footnote 84] – though the interpretation of these examples is contested. [footnote 85] In any case, we cannot currently reliably


In [106]:
import urllib.parse

In [113]:
encoded_result = urllib.parse.quote(result[:50])
encoded_result

'on%20code%20and%20mathematics%20data%2C%20it%20is%20more%20likely%20to'

In [114]:
deeplink = f"{url}#:~:text={encoded_result}"
deeplink

'https://www.gov.uk/government/publications/frontier-ai-capabilities-and-risks-discussion-paper/frontier-ai-capabilities-and-risks-discussion-paper#:~:text=on%20code%20and%20mathematics%20data%2C%20it%20is%20more%20likely%20to'
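
The steps above can be wrapped into a small helper. This is a sketch, and `build_text_fragment_link` is a hypothetical name. Compared with the cells above, it makes three assumptions about text fragments that are worth checking against the specification: newlines in the chunk should be collapsed to spaces, the snippet should end on a whole word, and `-` needs percent-encoding because it has a special meaning inside the `text=` directive (`urllib.parse.quote` leaves it unencoded).

```python
import urllib.parse


def build_text_fragment_link(page_url, text, max_chars=50):
    """Return `page_url` with a #:~:text= fragment pointing at the start of `text`."""
    normalised = " ".join(text.split())  # collapse newlines and repeated spaces
    snippet = normalised[:max_chars]
    # Drop a trailing partial word when the cut falls in the middle of one
    if len(normalised) > max_chars and normalised[max_chars] != " " and " " in snippet:
        snippet = snippet.rsplit(" ", 1)[0]
    snippet = snippet.strip()
    # safe="" also encodes "/"; "-" is encoded by hand because quote() never encodes it
    encoded = urllib.parse.quote(snippet, safe="").replace("-", "%2D")
    return f"{page_url}#:~:text={encoded}"


example_link = build_text_fragment_link(
    "https://example.com/page",  # placeholder URL
    "on code and mathematics data, it is more likely to be good at solving programming puzzles",
)
print(example_link)
```

In the notebook, `build_text_fragment_link(url, result)` would produce the same kind of link as the cells above.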

---

## 3. Exercise: Q&A bot with vector database

> Combine the Chroma vector database with a Llama-based LangChain LLM to create a Q&A bot for the provided (or any other) URL.
> Tips:
> - Encode your queries using the Sentence-Transformer embedding & return the top documents
> - Include the question alongside the top N documents in your LangChain LLM's context window
> - Use Llama 2 to synthesise a coherent answer
>
> This approach enables LLMs to answer questions about things they haven't been trained on, by using the vector database as an "encyclopedia" that they can reference as needed. This is known as "retrieval-augmented generation" or "RAG".
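
As a starting point for the second tip, the sketch below shows one possible way to combine a question with retrieved chunks into a single prompt. It is plain Python with no LangChain dependency. `build_rag_prompt` and the prompt wording are assumptions made for illustration, and the example chunks are taken from the search output earlier in this notebook (footnote markers removed).

```python
def build_rag_prompt(question, chunks, max_chunks=3):
    """Combine the top retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(
        f"[{position}] {chunk.strip()}"
        for position, chunk in enumerate(chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = [
    "Scaffolding software programs ('scaffolds') structure the information flow "
    "of an AI model, leaving the model itself unchanged.",
    "Better scaffolds could, for example, help an AI agent self-correct when they "
    "have made a mistake, or improve their long-term memory.",
]

prompt = build_rag_prompt("What are scaffolds in AI?", example_chunks)
print(prompt)
```

In the notebook, the chunks could come from `[doc.page_content for doc in db2.similarity_search(question)]`, and the resulting prompt would then be passed to your Llama 2 LLM. Keeping `max_chunks` small helps the prompt fit inside the model's context window.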

---

## Where next?

LangChain is far more powerful than we've seen so far! Here's an idea of what else you can do:
- [Learn to use agents and tools with LangChain](https://python.langchain.com/docs/modules/agents/tools/) such as searching the web, querying APIs, reading papers on ArXiv, checking the weather, digesting articles on Wikipedia, making (and transcribing) calls with Twilio, accessing financial data and much more. Check out the [list of integrations here](https://python.langchain.com/docs/integrations/tools).
- [Query a SQL database](https://python.langchain.com/docs/expression_language/cookbook/sql_db) with LangChain Runnables
- [Write Python code](https://python.langchain.com/docs/expression_language/cookbook/code_writing) with LangChain
- [Learn more about RAG](https://python.langchain.com/docs/expression_language/cookbook/retrieval) or use [this example to combine agents with the Chroma vector store](https://python.langchain.com/docs/modules/agents/how_to/agent_vectorstore)