### Notebook Summary

This notebook demonstrates how to build a question-answering system using Retrieval-Augmented Generation (RAG). It involves the following steps:

1.  **Setting up the environment**: Installing necessary libraries like `chromadb`, `openai`, and `langchain`.
2.  **Getting the dataset**: Downloading and unzipping a collection of articles.
3.  **Loading and processing data**: Loading the articles, splitting them into smaller chunks, and creating embeddings for these chunks.
4.  **Creating and loading a vector database**: Using ChromaDB to store the document chunks and their embeddings.
5.  **Setting up a retriever**: Configuring a retriever to fetch relevant document chunks based on a query.
6.  **Using an LLM for structured answers**: Integrating an OpenAI language model (LLM) with the retriever to generate answers based on the retrieved document chunks.
7.  **Exploring different models**: Showing how to use different OpenAI models, such as the default completion model (`gpt-3.5-turbo-instruct`) and the chat model `gpt-5`, for question answering.
8.  **Zipping and unzipping the database for reuse**: Showing how to zip and then unzip the database so the stored embeddings can be reused.

In essence, the notebook shows how to build a system that can answer questions by first finding relevant information in a large set of documents and then using an LLM to synthesize that information into a coherent answer.

### **Installing necessary packages**

In [None]:
!pip install -q chromadb openai langchain tiktoken langchain-community langchain-chroma langchain-openai --upgrade


In [None]:
!pip show chromadb

Name: chromadb
Version: 1.0.17
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: bcrypt, build, grpcio, httpx, importlib-resources, jsonschema, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-sdk, orjson, overrides, posthog, pybase64, pydantic, pypika, pyyaml, rich, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 


### **Getting the dataset of articles**

In [None]:
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip

In [None]:
!unzip -q new_articles.zip -d new_articles

### **Setting up environment**

In [None]:
from langchain_openai import OpenAI
from langchain_community.document_loaders import DirectoryLoader, TextLoader

In [None]:
import os
import openai

# include your openai api key in the secrets section
# of this notebook and turn the notebook access on.
from google.colab import userdata
openai.api_key = userdata.get('OPENAI_API_KEY')
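
The cell above relies on `google.colab.userdata`, which is only available inside Colab. As a minimal sketch for other environments, assuming the key has been exported as the environment variable `OPENAI_API_KEY`, a small helper can read it and fail with a clear message when it is missing. The name `resolve_api_key` is hypothetical and not part of any library.

```python
import os

def resolve_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    # Fall back to the process environment when no mapping is passed in
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return key
```

Outside Colab, `openai.api_key = resolve_api_key()` could then stand in for the `userdata.get(...)` call.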

### **Load Data**

In [None]:
# Load every .txt article in the directory, one document per file
loader = DirectoryLoader("/content/new_articles/", glob="./*.txt",
                         loader_cls=TextLoader)
document = loader.load()
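
To make the loading step less opaque, here is a rough sketch of what loading a directory of text files amounts to: each file becomes one record holding its text and its source path. This is a simplified illustration, not LangChain's implementation. `load_text_files` is a hypothetical helper, and the temporary files exist only for the demonstration.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, pattern="*.txt"):
    """Read every file matching `pattern` into a dict with its text and source path."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records

# Demonstration on throwaway files; only the .txt files are picked up
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first article", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second article", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    records = load_text_files(tmp)
```

The `source` entry in the metadata is what later makes it possible to report which article an answer came from.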

### **Splitting into Chunks**

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split into chunks of up to 1000 characters, with up to 200 characters shared between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(document)

In [None]:
print(f"Chunk 1: {chunks[0].page_content}")
print(f"\n==================================================================\n")
print(f"Chunk 2: {chunks[1].page_content}")

Chunk 1: Welcome back to This Week in Apps, the weekly TechCrunch series that recaps the latest in mobile OS news, mobile applications and the overall app economy.

The app economy in 2023 hit a few snags, as consumer spending last year dropped for the first time by 2% to $167 billion, according to data.ai’s “State of Mobile” report. However, downloads are continuing to grow, up 11% year-over-year in 2022 to reach 255 billion. Consumers are also spending more time in mobile apps than ever before. On Android devices alone, hours spent in 2022 grew 9%, reaching 4.1 trillion.

This Week in Apps offers a way to keep up with this fast-moving industry in one place with the latest from the world of apps, including news, updates, startup fundings, mergers and acquisitions, and much more.

Do you want This Week in Apps in your inbox every Saturday? Sign up here: techcrunch.com/newsletters

Top Stories

Dorsey criticizes Twitter, Musk on the alternative social networks he’s backing


Chunk 2: Do

In [None]:
len(chunks)

233
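
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window over characters. This is a deliberately simplified sketch: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph and line breaks, so its real chunks will differ. `sliding_window_chunks` is a hypothetical helper written for this illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size windows whose neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window start advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
# pieces == ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.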

### **Creating DB object**

In [None]:
from langchain_openai import OpenAIEmbeddings

# Directory where Chroma persists the collection on disk
persist_directory = 'chromaDB'

embeddings = OpenAIEmbeddings(openai_api_key=openai.api_key)


In [None]:
from langchain_chroma import Chroma

# Embed the chunks and persist them in the Chroma directory
vectordb = Chroma.from_documents(documents=chunks,
                                 embedding=embeddings,
                                 persist_directory=persist_directory)

### **Loading Data from the DB**

In [None]:
# Now we can load the persisted database from disk, and use it as normal.
from langchain_chroma import Chroma

vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embeddings)

### **Make a retriever**

In [None]:
# the default k (number of chunks returned) is 4; here we ask for 3
retriever = vectordb.as_retriever(search_kwargs={"k":3})

In [None]:
docs = retriever.invoke("How much money did Microsoft raise?")

In [None]:
len(docs)

3

In [None]:
import textwrap

def wrap_text(text, width):
  return textwrap.fill(text, width=width)

for index in range(len(docs)):
  print(f"Document_{index+1}\n\n{wrap_text(docs[index].page_content, 80)}")
  print("\n================================================================\n")

Document_1

April 28, 2023  VC firms including Sequoia Capital, Andreessen Horowitz, Thrive
and K2 Global are picking up new shares, according to documents seen by
TechCrunch. A source tells us Founders Fund is also investing. Altogether the
VCs have put in just over $300 million at a valuation of $27 billion to $29
billion. This is separate to a big investment from Microsoft announced earlier
this year, a person familiar with the development told TechCrunch, which closed
in January. The size of Microsoft’s investment is believed to be around $10
billion, a figure we confirmed with our source.  April 25, 2023  Called ChatGPT
Business, OpenAI describes the forthcoming offering as “for professionals who
need more control over their data as well as enterprises seeking to manage their
end users.”


Document_2

The amount that Google invested in the project was never disclosed, nor was the
valuation of the exit to the parent company from the incubator, but the company
has confirmed that the

In [None]:
retriever.search_type
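# 'similarity' means plain nearest-neighbour search (the alternative is 'mmr');
# the distance metric itself comes from the Chroma collection settings
# (L2 by default unless the collection is configured for cosine)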

'similarity'
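
To see what a similarity search computes, the sketch below ranks a few toy vectors against a query vector. The vectors and the helper names are made up for illustration and are far smaller than real embeddings. Which metric the vector store actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2_distance(a, b):
    """Squared Euclidean distance between two vectors (0.0 means identical)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, vectors, k=3):
    """Return the indices of the k vectors closest to the query by squared L2 distance."""
    order = sorted(range(len(vectors)),
                   key=lambda i: squared_l2_distance(query_vec, vectors[i]))
    return order[:k]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [0.8, 0.6]
nearest = top_k(query_vec, toy_vectors, k=2)
# nearest == [2, 0]
```

For unit-length vectors the two metrics rank results identically, because the squared L2 distance equals 2 minus twice the cosine similarity.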

### **Using LLM to get structured answer**

The vector DB returns the chunks most similar to the question, but not a definitive answer. To get one, we pass the question together with the retrieved chunks to an LLM and ask it for a concise answer.
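
Conceptually, handing the question and the retrieved chunks to an LLM means placing them together in one prompt. The sketch below shows one way this could look. The prompt wording and the helper `build_prompt` are assumptions made for illustration, not the template LangChain uses internally.

```python
def build_prompt(question, chunks):
    """Place every retrieved chunk into a single prompt, followed by the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

example_prompt = build_prompt(
    "How much money did Microsoft raise?",
    ["The size of Microsoft's investment is believed to be around $10 billion."],
)
```

Because every chunk goes into the same prompt, the number of retrieved chunks and the chunk size together determine how much of the model's context window is consumed.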

#### **Making a chain**

In [None]:
from langchain_openai import OpenAI
from langchain.chains import RetrievalQA

# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=openai.api_key),
                                       chain_type="stuff",
                                       retriever=retriever,
                                       return_source_documents=True)

In [None]:
# Full example
query = "How much money did Microsoft raise?"
result = qa_chain.invoke({"query": query})

In [None]:
result

In [None]:
def query_output(result):
  """
  Prints the result and source metadata from a RetrievalQA chain result.

  Args:
    result: A dictionary containing the result from a RetrievalQA chain.
            Expected keys are 'result' and 'source_documents'.
  """
  print("Answer:")
  print(result['result'])
  print("\nSource Documents:")
  for doc in result['source_documents']:
    print(f"- {doc.metadata['source']}")

In [None]:
query_output(result)

Answer:
 $10 billion

Source Documents:
- /content/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt
- /content/new_articles/05-03-checks-the-ai-powered-data-protection-project-incubated-in-area-120-officially-exits-to-google.txt
- /content/new_articles/05-07-3one4-capital-driven-by-contrarian-bets-raises-200-million-new-fund.txt
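
When several retrieved chunks come from the same article, the source list repeats that file. A small sketch of collapsing it to unique file names follows. `unique_source_names` is a hypothetical helper, and the `fake_result` below uses placeholder paths and stand-in objects rather than real chain output.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace

def unique_source_names(result):
    """List the file name of each source document once, in first-seen order."""
    names = []
    for doc in result["source_documents"]:
        name = PurePosixPath(doc.metadata["source"]).name
        if name not in names:
            names.append(name)
    return names

# Stand-in for a chain result; real results hold LangChain Document objects
fake_result = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "/content/new_articles/a.txt"}),
        SimpleNamespace(metadata={"source": "/content/new_articles/b.txt"}),
        SimpleNamespace(metadata={"source": "/content/new_articles/a.txt"}),
    ],
}
names = unique_source_names(fake_result)
```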


### **Using different models**

In [None]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Create the GPT-5 LLM instance
llm = ChatOpenAI(
    model="gpt-5",               # Use GPT-5
    openai_api_key=openai.api_key,
    temperature=0                # Optional: deterministic responses
)

# create the chain to answer questions with a different model
qa_chain_gpt5 = RetrievalQA.from_chain_type(llm=llm,
                                           chain_type="stuff",
                                           retriever=retriever,
                                           return_source_documents=True)

# Chat models such as gpt-4 and gpt-5 are accessed through ChatOpenAI
# rather than the completion-style OpenAI class

In [None]:
# Full example
query = "How much money did Microsoft raise?"
result_gpt5 = qa_chain_gpt5.invoke({"query": query})

In [None]:
query_output(result_gpt5)

### **Zipping and Deleting the DB**

In [None]:
!zip -r chromaDB.zip ./chromaDB

  adding: chromaDB/ (stored 0%)
  adding: chromaDB/chroma.sqlite3 (deflated 41%)
  adding: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/ (stored 0%)
  adding: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/data_level0.bin (deflated 100%)
  adding: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/header.bin (deflated 61%)
  adding: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/length.bin (deflated 100%)
  adding: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/link_lists.bin (stored 0%)


In [None]:
# delete the directory
!rm -rf chromaDB/

### **Reloading the DB from Zip File**

In [None]:
!unzip chromaDB.zip

Archive:  chromaDB.zip
   creating: chromaDB/
  inflating: chromaDB/chroma.sqlite3  
   creating: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/
  inflating: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/data_level0.bin  
  inflating: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/header.bin  
  inflating: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/length.bin  
 extracting: chromaDB/3a72c731-8121-4b82-9c79-fa4932685bd0/link_lists.bin  
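
The same round trip can be done from Python instead of shell commands, which helps when `zip` and `unzip` are not installed. This is a minimal sketch using the standard library. The helper names are hypothetical, and the demonstration works on a throwaway directory with a placeholder file rather than the real database.

```python
import shutil
import tempfile
from pathlib import Path

def archive_directory(directory, archive_base):
    """Zip `directory` (keeping its top-level folder name) and return the archive path."""
    directory = Path(directory)
    return shutil.make_archive(str(archive_base), "zip",
                               root_dir=directory.parent, base_dir=directory.name)

def restore_directory(archive_path, destination):
    """Unpack the archive into `destination`."""
    shutil.unpack_archive(str(archive_path), str(destination))

# Demonstration: archive, delete, and restore a stand-in database folder
with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp, "chromaDB")
    db_dir.mkdir()
    (db_dir / "chroma.sqlite3").write_bytes(b"placeholder")
    archive = archive_directory(db_dir, Path(tmp, "chromaDB_backup"))
    shutil.rmtree(db_dir)
    restore_directory(archive, tmp)
    restored = (db_dir / "chroma.sqlite3").read_bytes()
```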


After unzipping (for example, after restarting the runtime), we can reload the persisted database using:

In [None]:
#from langchain_chroma import Chroma

#vectordb = Chroma(persist_directory=persist_directory,
#                  embedding_function=embeddings)