# **04. Retrieval**

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
! pip install lark

Collecting lark
  Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)
Downloading lark-1.2.2-py3-none-any.whl (111 kB)
Installing collected packages: lark
Successfully installed lark-1.2.2


In [4]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

persist_directory = 'docs/chroma/'


In [6]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    embedding_function=embedding,
    persist_directory=persist_directory
)


In [7]:
print(vectordb._collection.count())

6087


## **Maximal Marginal Relevance (MMR)**
In addition to similarity search, we can run a maximal marginal relevance (MMR) search, which tries to return a diverse set of documents. Rather than simply taking the top matches, it balances each candidate's relevance to the query against its similarity to the documents already selected.

`fetch_k` sets how many of the most similar documents are fetched as candidates, and `k` sets how many of those candidates are returned after re-ranking for diversity.
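
To make the roles of `fetch_k` and `k` concrete, here is a minimal pure-Python sketch of MMR selection over toy 2-D vectors. It is not the vector store's implementation: `mmr_select`, `cosine` and the example vectors are hypothetical, and `lambda_mult` is assumed to weight relevance against redundancy (1.0 means relevance only).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return the indices of k documents chosen by a greedy MMR pass."""
    # Step 1: shortlist the fetch_k documents most similar to the query.
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:fetch_k]

    # Step 2: repeatedly pick the candidate with the best trade-off between
    # relevance to the query and redundancy with what is already selected.
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: documents 0 and 1 are near-duplicates, document 2 is related
# but different, document 3 is unrelated to the query.
query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.88, 0.12], [0.5, 0.5], [0.0, 1.0]]

top_similar = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=1.0)
top_diverse = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=0.3)
print(top_similar, top_diverse)
```

With `lambda_mult=1.0` the sketch behaves like plain similarity search and returns the two near-duplicates. Lowering it swaps the second near-duplicate for the related but different document, while `fetch_k=3` keeps the unrelated document out of the candidate pool entirely.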

In [15]:
question = "what are the prerequisites to study an accounting postgraduate degree?"
docs_ss = vectordb.similarity_search(question,k=4)

In [16]:
docs_ss[0].page_content[:100]

'20    RULES FOR POSTGRADUATE DIPLOMAS \nEntrance requirements:  \n1. A graduate of this Un iversity wh'

In [17]:
docs_ss[1].page_content[:100]

'40    RULES FOR POSTGRADUATE DEGREES\nBachelor of Commerce Honours \nspecialising in ACCOUNTI NG [CH00'

In [60]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [61]:
docs_mmr[0].page_content[:100]

'and professions. Topics covered include: exploratory data analysis and summary statistics; probabili'

In [62]:
# response shows diversity
docs_mmr[1].page_content[:100]

'Science or Life Sciences.   NOTE: Preference will be given to students registered in the Science \nFa'

## **LLM aided retrieval**
Use an LLM to split the user's question into a semantic search query and a metadata filter. This helps when the question has a semantic component as well as a filter parameter (such as a date or a specific type of document).

Many vectorstores support operations on `metadata`, which provides context for each embedded chunk.
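
As a rough illustration of filtering on `metadata` before ranking, the sketch below uses plain dictionaries in place of a vector store and word overlap in place of embedding similarity. The chunk texts and the `matches` and `filtered_search` helpers are made up for this example.

```python
# Toy chunks; the texts are invented, the metadata layout mirrors
# the {'page': ..., 'source': ...} dictionaries used in this notebook.
chunks = [
    {"text": "Entrance requirements for the honours degree",
     "metadata": {"source": "assets/2.Commerce-postgrad.pdf", "page": 23}},
    {"text": "Entrance requirements for the masters degree",
     "metadata": {"source": "assets/4.Engineering-postgrad.pdf", "page": 99}},
    {"text": "Course outline for structural engineering",
     "metadata": {"source": "assets/4.Engineering-postgrad.pdf", "page": 258}},
]

def matches(metadata, conditions):
    """True when every key in conditions has the same value in metadata."""
    return all(metadata.get(key) == value for key, value in conditions.items())

def filtered_search(query, chunks, conditions, k=3):
    """Drop chunks that fail the metadata filter, then rank the rest."""
    candidates = [c for c in chunks if matches(c["metadata"], conditions)]
    query_words = set(query.lower().split())

    def overlap(chunk):
        # Stand-in for embedding similarity: count of shared words.
        return len(query_words & set(chunk["text"].lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:k]

results = filtered_search(
    "entrance requirements",
    chunks,
    {"source": "assets/4.Engineering-postgrad.pdf"},
    k=2,
)
for r in results:
    print(r["metadata"])
```

The commerce chunk matches the query text just as well as the engineering one, but it never reaches the ranking step because the filter removes it first.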

### **Manual implementation using filtering**

In [21]:
question = "what 5 courses can i take if i am an engineering postgrad degree?"

In [22]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"assets/4.Engineering-postgrad.pdf"}
)

In [23]:
for d in docs:
    print(d.metadata)

{'page': 99, 'source': 'assets/4.Engineering-postgrad.pdf'}
{'page': 99, 'source': 'assets/4.Engineering-postgrad.pdf'}
{'page': 258, 'source': 'assets/4.Engineering-postgrad.pdf'}


### **Using self-query retriever**

We can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well
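
The two outputs listed above can be pictured as a small structured object. The sketch below is only an illustration of that shape: `StructuredQuery`, `SOURCE_HINTS` and `parse_question` are hypothetical names, and a keyword lookup stands in for the LLM that does the real extraction.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    """What a self-querying step is expected to produce."""
    query: str                                   # text used for vector search
    filter: dict = field(default_factory=dict)   # metadata conditions

# Hypothetical keyword-to-source mapping; an LLM would infer this from
# the metadata field descriptions instead of a fixed table.
SOURCE_HINTS = {
    "engineering": "assets/4.Engineering-postgrad.pdf",
    "commerce": "assets/2.Commerce-postgrad.pdf",
}

def parse_question(question):
    """Split a question into a search string and a metadata filter."""
    words = question.lower().replace("?", "").split()
    metadata_filter = {}
    remaining = []
    for word in words:
        if word in SOURCE_HINTS and "source" not in metadata_filter:
            metadata_filter["source"] = SOURCE_HINTS[word]
        else:
            remaining.append(word)
    return StructuredQuery(query=" ".join(remaining), filter=metadata_filter)

parsed = parse_question("what courses can an engineering postgrad take?")
print(parsed.query)
print(parsed.filter)
```

The filter half of the result has the same form as the dictionary passed by hand to `similarity_search` in the manual example.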

In [30]:
! pip install -U langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.2.1-py3-none-any.whl.metadata (2.6 kB)
Downloading langchain_openai-0.2.1-py3-none-any.whl (49 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.2.1


In [31]:
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [25]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `assets/1.Commerce-undergrad.pdf` or `assets/2.Commerce-postgrad.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [32]:
document_content_description = "Lecture notes"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [33]:
question = "what postgrad course can i take after studying an undergrad in accounting and finance?"

In [34]:
docs = retriever.get_relevant_documents(question)

In [35]:
for d in docs:
    print(d.metadata)

{'page': 254, 'source': 'assets/2.Commerce-postgrad.pdf'}
{'page': 23, 'source': 'assets/2.Commerce-postgrad.pdf'}
{'page': 350, 'source': 'assets/2.Commerce-postgrad.pdf'}
{'page': 36, 'source': 'assets/2.Commerce-postgrad.pdf'}


## **Compression**
Compression shrinks each retrieved document down to the parts most relevant to the query, so that more results fit in the context window. An LLM performs this extraction as an intermediate step between retrieval and the final answer.
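
The idea can be sketched without an LLM by using a crude keyword test as the extractor. This is an assumption-laden stand-in, not how the LLM-based compressor works internally: `extract_relevant` and `compress` are hypothetical helpers and the sample documents are invented.

```python
import re

def extract_relevant(text, question):
    """Keep only sentences that share a longer word with the question."""
    question_terms = {
        w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 4
    }
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        s for s in sentences
        if question_terms & set(re.findall(r"[a-z]+", s.lower()))
    ]
    return " ".join(kept)

def compress(docs, question):
    """Compress every document and drop those with nothing relevant left."""
    compressed = [extract_relevant(d, question) for d in docs]
    return [c for c in compressed if c]

docs = [
    "The diploma requires a prior accounting degree. "
    "Lectures are held on Monday. Parking is limited.",
    "The campus cafe opens at eight.",
]
question = "What does the accounting diploma require?"

compressed_docs = compress(docs, question)
print(compressed_docs)
```

Three sentences and a whole document are discarded, which is the effect the compression step aims for: fewer tokens per result, so more results fit in the prompt.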

In [36]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [37]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [38]:
# The compressor reduces the size of retrieved documents by extracting only the most relevant information
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
compressor = LLMChainExtractor.from_llm(llm)

In [39]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [40]:
question = "what postgrad course can i take after studying an undergrad in accounting and finance?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- Bachelor of Commerce Honours specialising in ACCOUNTI NG [CH001ACC0 1]
- Postgraduate Diplo ma in Accounting (PGDA)
- Financial Reporting IV, ACC4023, Corporate Governance III, ACC4025, and Specialised Topics in Accounting and Research Report, ACC4050
- Initial Test of Competence (ITC) of the South African Institute of Chartered Accountants (SAICA)
- Entrance requirements: A graduate of this University who has completed the prescribed courses for the BCom degree (Financial Accounting CA option) or the BBusSc degree (Finance with Accounting option) and who has obtained: a minimum average mark of 65% for the following courses: ACC3009 Financial Reporting III, ACC3022 Corporate Governance II, ACC3004 Taxation II and ACC3023
----------------------------------------------------------------------------------------------------
Document 2:

1. A graduate of this Un iversity who has completed the prescribed courses for the BCom 
degree (Financial Accounting CA option) or the BBus

## **Combining various techniques**

In [41]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [42]:
question = "what postgrad course can i take after studying an undergrad in accounting and finance?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- Bachelor of Commerce Honours specialising in ACCOUNTI NG [CH001ACC0 1]
- Postgraduate Diplo ma in Accounting (PGDA)
- Financial Reporting IV, ACC4023, Corporate Governance III, ACC4025, and Specialised Topics in Accounting and Research Report, ACC4050
- Initial Test of Competence (ITC) of the South African Institute of Chartered Accountants (SAICA)
- Entrance requirements: A graduate of this University who has completed the prescribed courses for the BCom degree (Financial Accounting CA option) or the BBusSc degree (Finance with Accounting option) and who has obtained: a minimum average mark of 65% for the following courses: ACC3009 Financial Reporting III, ACC3022 Corporate Governance II, ACC3004 Taxation II and ACC3023
----------------------------------------------------------------------------------------------------
Document 2:

- GSB3004Z FINANCE AND ACCOUNTING MANAGEMENT
- Course outline: This course focuses on developing literacy in matters pertaining to accountin

## **Other types of retrieval**
These retrievers don't use a vector store. Instead they rely on more traditional techniques: a support vector machine (SVM) trained over the embeddings, and TF-IDF term weighting.
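
For intuition about the TF-IDF side, here is a small self-contained sketch of term-weighted retrieval. It is not the retriever's actual code: `tokenize`, `build_index` and `tfidf_search` are hypothetical helpers, the texts are invented, and the smoothed IDF formula is an assumption chosen to resemble a common default.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(texts):
    """Return IDF weights and one sparse TF-IDF vector per text."""
    tokenized = [tokenize(t) for t in texts]
    n = len(texts)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed form): rare terms get larger weights.
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0

def tfidf_search(question, texts, k=1):
    """Rank texts by TF-IDF cosine similarity to the question."""
    idf, vectors = build_index(texts)
    counts = Counter(tokenize(question))
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    ranked = sorted(
        range(len(texts)),
        key=lambda i: cosine(q_vec, vectors[i]),
        reverse=True,
    )
    return [texts[i] for i in ranked[:k]]

texts = [
    "Vector calculus covers gradient, divergence and curl.",
    "Ethics introduces students to moral philosophy.",
    "Financial reporting and corporate governance for accounting students.",
]
top = tfidf_search("moral philosophy course", texts, k=1)
print(top)
```

Because the match is purely lexical, a question that shares no words with the right passage scores zero against it, which is worth keeping in mind when reading the generic "major topics" results in the cells that follow.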

In [46]:
! pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.5.2-cp312-cp312-macosx_12_0_arm64.whl.metadata (13 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Downloading scipy-1.14.1-cp312-cp312-macosx_14_0_arm64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.5.2-cp312-cp312-macosx_12_0_arm64.whl (11.0 MB)
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading scipy-1.14.1-cp312-cp312-macosx_14_0_arm64.whl (23.1 MB)
Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Install

In [43]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [44]:
# Load the PDF and join all pages into a single string
loader = PyPDFLoader("assets/1.Commerce-undergrad.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split the text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [47]:
# Build both retrievers from the same text chunks
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [48]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(metadata={}, page_content="surfaces. Partial derivatives, chain rule, maxima and minima, Lagrange multipliers. Gradient, divergence and curl.  Taylor's theorem for one \nand several variables, Jacobians, Newton's method for several variables. Multiple integrals and change of variable. Surface i ntegrals. Line \nintegrals, work done by a force, potentials. Green's theorem, divergence theorem, and Stok es' theorem.  \nLecture times: Monday -Friday 1st period,  1 afternoon tutorial, optional additional mini -tutorials in 2nd or 3rd period  \nDP requirements: 35% class record; attendance of tutorials  \nAssessment: One paper written in June or November no longer than 2.5 hours: up to 80%, year mark: up to 40%.  \n \nMAM2085S     VECTOR CALCULUS FOR ASPECT  \n16 NQF credits at NQF level 6  \nConvener: Associate Professor P Padayachee  \nCourse entry requirements: MAM1023 and MAM1024  \nCourse outline:  \nThis course aims to develop an understanding of vector calculus. Topics includ

In [49]:
question = "What are major topics for this class?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(metadata={}, page_content="Mathematics.  \n   Departments offering courses to the Faculty of Commerce  164 \nDEPARTMENT OF PHILOSOPHY  \n \nPHI1010S     ETHICS  \nThis course may also be offered in Summer/ Winter Term for limited numbers of students - please consult the department.  \n18 NQF credits at NQF level 5  \nConvener: O Mogomotsi  \nCourse entry requirements: None  \nCourse outline:  \nThis course introduces students to moral philosophy and to the questions it asks. These may include: What makes an action rig ht? Is morality \nrelative (to one's own views or to one's culture) or is it objective? What is the relationship between religion and  ethics? What is it to be a good \nperson?  \nLecture times: Monday, Tuesday, Wednesday, 5th period.  \nDP requirements: Regular attendance at lectures and tutorials; completion of all tests, submission of all essays and assignments by due dates,  \nand an average mark of at least 35% for the coursework.  \nAssessment: Coursework c