## Retrieval


Retrieval Augmented Generation (RAG) is a model architecture that combines elements of both retrieval-based and generation-based approaches in natural language processing (NLP). It is designed to enhance the performance of language models in generating coherent and contextually relevant responses.

In traditional generation-based models, such as sequence-to-sequence models or transformers, the model generates responses from scratch based on the input context. However, these models often struggle with generating accurate or contextually appropriate responses, especially when dealing with complex queries or rare topics.

RAG addresses this limitation by incorporating a retrieval component into the generation process. It combines a language model with a retrieval model, typically based on dense vector representations (embeddings), such as the ones stored in a vector store. The retrieval model helps retrieve relevant passages or documents from a large knowledge source, such as a collection of documents or a search engine index.

RAG models have shown promising results in various NLP tasks, including question-answering, dialogue systems, and document summarization, where the combination of generation and retrieval proves beneficial in producing high-quality outputs.
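
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than a real library API: `embed` is a toy bag-of-words stand-in for an embedding model, `retrieve` and `build_prompt` are hypothetical helper names, and the resulting prompt would be handed to whichever LLM you use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, corpus, k=2):
    """Return the k passages most similar to the question."""
    q = embed(question)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def build_prompt(question, passages):
    """Place the retrieved passages in front of the question for the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


corpus = [
    "Chroma is a vector store that persists embeddings on disk.",
    "Football matches usually last ninety minutes.",
    "A vector store returns the embeddings closest to a query embedding.",
]
question = "What does a vector store return?"
top_passages = retrieve(question, corpus, k=2)
prompt = build_prompt(question, top_passages)
print(prompt)
```

The generator only ever sees the passages that survive retrieval, which is why the quality of the retrieval step matters so much for the final answer.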

### Maximal Marginal Relevance (MMR)

MMR in NLP stands for **Maximal Marginal Relevance**. It is a technique used in information retrieval and text summarization to select and rank documents or sentences based on their relevance and diversity.

The goal of MMR is to create a summary or a set of results that maximizes both the relevance of the selected items to a given query and the diversity among the selected items. It aims to strike a balance between including highly relevant items and avoiding redundancy or duplication.

Here's a general overview of how MMR works:

1. **Relevance Scoring**: each document or sentence in the collection is first scored for relevance to the query using a similarity measure such as cosine similarity or BM25. The higher the score, the more relevant the item is to the query.

2. **Selecting the Most Relevant Item**: the item with the highest relevance score is selected first and added to the summary or result set.

3. **Calculating Redundancy**: to ensure diversity, MMR measures how similar each remaining candidate is to the items already selected, usually by taking its maximum similarity to any selected item. A candidate that closely resembles something already chosen receives a large penalty.

4. **Selecting the Next Item**: the next item added is the one with the highest MMR score, the weighted difference `lambda * relevance - (1 - lambda) * redundancy`, where `lambda` (between 0 and 1) sets the trade-off. This maximizes relevance while minimizing redundancy with previously selected items.

5. **Iterative Selection**: steps 3 and 4 are repeated until a predefined number of items is selected or a stopping criterion is met. The weight `lambda` stays fixed, but the redundancy penalty is recomputed at each iteration against the growing set of selected items.
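
The selection loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the code any particular library runs: the score used is `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, the function name `mmr` and the 2-D vectors are made up for the example, and the parameter name `lambda_mult` is chosen to resemble the one LangChain exposes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical 2-D embeddings: docs 0 and 1 are near-duplicates, doc 2 differs.
query = [1.0, 0.2]
docs = [[1.0, 0.1], [1.0, 0.15], [0.6, 0.8]]
picked = mmr(query, docs, k=2, lambda_mult=0.5)
similarity_only = mmr(query, docs, k=2, lambda_mult=1.0)
print(picked, similarity_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the function degenerates to plain similarity search, returning both near-duplicates. With `lambda_mult=0.5` the second pick is the dissimilar document instead.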

By using MMR, text summarization systems can produce summaries that are not only informative and relevant but also diverse and non-redundant. It can help avoid repetition of information and provide a more comprehensive overview of the content. Similarly, in information retrieval systems, MMR can help present a diverse set of results that cover different aspects of the query while maintaining high relevance.

MMR has been widely used in various NLP applications, including text summarization, document clustering, and result diversification in search engines, to enhance the quality and diversity of the generated summaries or selected items.

**The main benefit of MMR is diversity.** The returned set covers semantically different aspects of the query instead of several near-duplicates.

### LLM Aided Retrieval

LLM (large language model)-aided retrieval is an approach that combines the power of language models with traditional retrieval methods to improve the effectiveness of information retrieval systems. It leverages the contextual understanding and language generation capabilities of large pre-trained language models, such as GPT-3, to enhance the retrieval process.

By incorporating language models into the retrieval process, LLM-aided retrieval aims to capture the semantic meaning and contextual relevance of queries and documents. It allows for a more sophisticated understanding of the information needs and provides more accurate and contextually relevant results to users.

LLM-aided retrieval has been successfully applied in various information retrieval tasks, including web search, question answering, and document retrieval. It enables better matching of user intent and provides improved retrieval performance by leveraging the advanced language understanding capabilities of large pre-trained models.

In short, LLM-aided retrieval uses an LLM to formulate better queries so that the retriever finds the most accurate answers possible.
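
One simple form of this idea is query rewriting, sketched below. The prompt text, the `rewrite_query` helper, and `fake_llm` are all hypothetical: `llm` can be any callable that maps a prompt string to a completion string, and `fake_llm` returns a canned answer so the example runs without an API key.

```python
REWRITE_PROMPT = (
    "Rewrite the user's question as a short keyword search query.\n"
    "Question: {question}\n"
    "Search query:"
)


def rewrite_query(llm, question):
    """Ask an LLM to turn a conversational question into a retrieval-friendly query."""
    completion = llm(REWRITE_PROMPT.format(question=question))
    return completion.strip()


def fake_llm(prompt):
    """Offline stand-in for a real model; always returns the same canned completion."""
    return " TypeScript function parameters \n"


search_query = rewrite_query(
    fake_llm, "Is there a point where we go over function parameters?"
)
print(search_query)
```

The rewritten query, rather than the original question, is then sent to the vector store.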

### Compression

In the context of NLP, compression refers to the process of reducing the size or complexity of textual data while retaining its essential information. It involves techniques that aim to represent text in a more compact form, typically by removing redundancy, exploiting patterns, or utilizing specialized algorithms.

However, it is important to note that compression techniques involve trade-offs. Lossy compression can lead to the loss of certain details or nuances in the text, while lossless compression may not achieve the same level of reduction in size as lossy methods.

Overall, compression techniques in NLP enable more efficient storage, transmission, and processing of textual data, facilitating tasks such as data management, summarization, and information retrieval in resource-constrained environments.

**Compression involves multiple calls to the LLM, which can be expensive, but done well it leads to great results in the final run.**
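
The cost comes from the per-document call, as the sketch below tries to show. It is an offline illustration under stated assumptions: `compress_documents` is a hypothetical helper, and `keyword_extract` is a crude keyword matcher standing in for the LLM call a real contextual compressor would make.

```python
def compress_documents(docs, question, extract):
    """Run `extract` (normally an LLM call) on every retrieved document and keep
    only the non-empty extracts. One call per document is why this is costly."""
    compressed = []
    for doc in docs:
        kept = extract(doc, question)
        if kept:
            compressed.append(kept)
    return compressed


def keyword_extract(doc, question):
    """Offline stand-in for the LLM: keep sentences that share a word
    (longer than three characters) with the question."""
    keywords = {
        w.strip("?.,").lower() for w in question.split() if len(w.strip("?.,")) > 3
    }
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    kept = [
        s for s in sentences if keywords & {w.strip(",").lower() for w in s.split()}
    ]
    return ". ".join(kept)


retrieved = [
    "Rest parameters accept a variable number of arguments. Arrow functions are concise.",
    "Welcome back to the channel. Please like and subscribe.",
]
result = compress_documents(retrieved, "How do rest parameters work?", keyword_extract)
print(result)
```

The second document contributes nothing relevant and is dropped entirely, while the first is trimmed to the one sentence that bears on the question.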

In [21]:
from dotenv import load_dotenv
import os

%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [22]:
openai_api_key = os.environ['OPENAI_API_KEY']

In [27]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = './vecstores/chroma'

In [28]:
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [29]:
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [30]:
print(vectordb._collection.count())

26


In [42]:
toy_text = [
    "I like football matches, they entertain me alot this days.",
    "I like the fact that football matches have fans and excitements around it, its very entertaining",
    "Injuries in football can be dangerous events",
    "Football is good but I hate the violence that comes with it when fans clash"
]

In [43]:
toy_db = Chroma.from_texts(texts=toy_text, embedding=embedding)

In [44]:
question = "Tell me about football matches."

In [45]:
toy_db.similarity_search(query=question, k=2)

[Document(page_content='I like football matches, they entertain me alot this days.', metadata={}),
 Document(page_content='I like the fact that football matches have fans and excitements around it, its very entertaining', metadata={})]

This only returns the good things about football, not the injuries. Let's use MMR to try to get a more diverse answer.

### MMR

In [47]:
toy_db.max_marginal_relevance_search(query=question, k=2)

Number of requested results 20 is greater than number of elements in index 4, updating n_results = 4


[Document(page_content='I like football matches, they entertain me alot this days.', metadata={}),
 Document(page_content='Injuries in football can be dangerous events', metadata={})]

Now we get back not only the good things about football but also the bad side of things.

In [53]:
question = "Is ther a point where we go over function parameters"

In [54]:
docs = vectordb.similarity_search(question,k=6)

In [55]:
for i, doc in enumerate(docs):
    doc.page_content = doc.page_content.replace("\n", " ")
    docs[i] = doc

In [56]:
for i, doc in enumerate(docs):
    print(f"Doc {i}: {doc.page_content} \n {doc.metadata}", end="\n\n")

Doc 0: #TypeScriptRestParameters #TypeScriptFunctionOverloading #TypeScriptArrowFunctions 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 1: #TypeScriptFunctionParameters #TypeScriptOptionalParameters #TypeScriptDefaultParameters 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 2: variable number of arguments. 5. Function Overloading: Understand function overloading in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 3: which can accept functions as parameters or return functions, enabling powerful abstractions in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 4: range of topics related to functions in TypeScript: 1. Introduction to Functions: Understand the 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 5: and Parameters: Learn how to declare functions, define parameters, and specify return types in 
 {'source': './datasets/example_doc.pdf', 'page': 0}



In [57]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=6)

In [58]:
for doc in docs_mmr:
    # Flatten newlines in place; no need to write back into a list
    doc.page_content = doc.page_content.replace("\n", " ")

In [59]:
for i, doc in enumerate(docs_mmr):
    print(f"Doc {i}: {doc.page_content} \n {doc.metadata}", end="\n\n")

Doc 0: #TypeScriptRestParameters #TypeScriptFunctionOverloading #TypeScriptArrowFunctions 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 1: variable number of arguments. 5. Function Overloading: Understand function overloading in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 2: which can accept functions as parameters or return functions, enabling powerful abstractions in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 3: the fundamentals of functions and their significance in programming. 2. Function Declaration and 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 4: for the same function name. 6. Arrow Functions: Learn about the concise syntax and benefits of 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 5: 4. Rest Parameters: Explore the rest parameter syntax, enabling functions to accept a variable 
 {'source': './datasets/example_doc.pdf', 'page': 0}



Responses are more diverse with MMR: chunks about arrow functions and rest parameters now appear in place of the second hashtag chunk.

## Self-query Retriever

This uses metadata to query the documents.

The SelfQueryRetriever is a component in the LangChain library that uses an LLM to turn a natural-language question into a structured query. The structured query has two parts: a search string that is compared semantically against the document contents, and a metadata filter built from the attributes you describe to the retriever. Filtering on metadata overcomes some of the limitations of purely distance-based retrieval, because a constraint such as "only from this source" is applied exactly rather than approximated through embeddings.

To use the SelfQueryRetriever, you specify the LLM used for query construction, the vector store, a short description of the document contents, and a list of `AttributeInfo` entries describing each metadata field. The retriever then generates the structured query and runs it against the vector store.
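
What a self-query retriever ultimately executes is a structured query: a search string plus a metadata filter. The toy sketch below shows what applying one means. The `Doc` class, `apply_structured_query`, the hand-written `structured` dictionary, and the `./datasets/other.pdf` path are all hypothetical; in a real retriever the LLM generates the structured query and the vector store applies it.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


def apply_structured_query(docs, query_terms, metadata_filter):
    """Keep documents whose metadata matches every filter key, then rank the
    survivors by how many query terms they contain (toy stand-in for vector search)."""
    matching = [
        d
        for d in docs
        if all(d.metadata.get(key) == value for key, value in metadata_filter.items())
    ]

    def overlap(d):
        text = d.page_content.lower()
        return sum(term in text for term in query_terms)

    return sorted(matching, key=overlap, reverse=True)


docs = [
    Doc("Optional and default parameters", {"source": "./datasets/example_doc.pdf", "page": 0}),
    Doc("Arrow functions", {"source": "./datasets/example_doc.pdf", "page": 0}),
    Doc("Function parameters in Python", {"source": "./datasets/other.pdf", "page": 3}),
]

# Roughly what an LLM might extract from
# "What does example_doc.pdf say about parameters?" (hand-written here).
structured = {"query": ["parameters"], "filter": {"source": "./datasets/example_doc.pdf"}}
hits = apply_structured_query(docs, structured["query"], structured["filter"])
print([d.page_content for d in hits])
```

The third document mentions parameters but is excluded outright because its `source` does not match the filter, which is something pure similarity search cannot guarantee.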

In [78]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI

In [79]:
metadata_field_info = [
    AttributeInfo(
        # The name must match a metadata key stored with the documents (here: 'source')
        name="source",
        description="The file the chunk came from, e.g. `./datasets/example_doc.pdf`",
        type="string",
    )
]

In [80]:
document_content_description = "YouTube Video Description Data"

# creating LLM
llm = ChatOpenAI(openai_api_key=openai_api_key)



In [81]:
# Create the retriever
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectordb,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True
)

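TypeError: 'NoneType' object is not callable

In LangChain releases from this period, this error usually means that the optional `lark` package, which the query constructor relies on for parsing, is not installed. Installing it with `pip install lark` and restarting the kernel typically resolves it.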

## Contextual Compression Technique

In [83]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [95]:
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

# Build a compressor that uses the LLM to extract the relevant parts of each document
compressor = LLMChainExtractor.from_llm(llm)

In [88]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [92]:
question = "Is ther a point where we go over function parameters"

Get the compressed documents

In [93]:
compressed_docs = compression_retriever.get_relevant_documents(question)

In [94]:
compressed_docs

[Document(page_content='#TypeScriptFunctionOverloading', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='#TypeScriptFunctionParameters #TypeScriptOptionalParameters #TypeScriptDefaultParameters', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='5. Function Overloading: Understand function overloading in', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='accept functions as parameters', metadata={'source': './datasets/example_doc.pdf', 'page': 0})]

In [96]:
for i, doc in enumerate(compressed_docs):
    print(f"Doc {i}: {doc.page_content} \n {doc.metadata}", end="\n\n")

Doc 0: #TypeScriptFunctionOverloading 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 1: #TypeScriptFunctionParameters #TypeScriptOptionalParameters #TypeScriptDefaultParameters 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 2: 5. Function Overloading: Understand function overloading in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 3: accept functions as parameters 
 {'source': './datasets/example_doc.pdf', 'page': 0}



Setting search type to MMR

In [97]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [98]:
compressed_docs = compression_retriever.get_relevant_documents(question)



In [99]:
for i, doc in enumerate(compressed_docs):
    print(f"Doc {i}: {doc.page_content} \n {doc.metadata}", end="\n\n")

Doc 0: #TypeScriptFunctionOverloading 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 1: 5. Function Overloading: Understand function overloading in 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 2: accept functions as parameters 
 {'source': './datasets/example_doc.pdf', 'page': 0}

Doc 3: "rest parameter syntax, enabling functions to accept a variable" 
 {'source': './datasets/example_doc.pdf', 'page': 0}



### Other Types Of Retrieval Techniques We Can Use

Other than the vector store and the retrievers we have used with it, there are more traditional NLP retrievers we can use. These include:

#### SVM Retriever

The SVM (Support Vector Machine) retriever is a component in natural language processing (NLP) systems that utilizes Support Vector Machines for information retrieval tasks. It is designed to retrieve relevant documents or passages from a large collection of text based on a given query.

In LangChain's implementation, the texts and the query are embedded, and a linear SVM is fitted at query time with the query as the only positive example and every document as a negative example. Documents are then ranked by their score under the fitted decision function, so the ones the classifier finds hardest to separate from the query come first. This makes it a drop-in alternative to nearest-neighbour search for tasks such as question answering, document retrieval, and passage retrieval.

#### TF-IDF Retriever

The TF-IDF (Term Frequency-Inverse Document Frequency) retriever is a component used in natural language processing (NLP) systems for information retrieval tasks. It leverages the TF-IDF weighting scheme to retrieve relevant documents or passages from a collection of text based on a given query.

The TF-IDF retriever is a simple yet effective method for information retrieval tasks. It assigns higher importance to terms that are both frequent in a document and rare in the overall corpus. By considering both term frequency and document frequency, it identifies documents that are likely to be relevant to the query based on content overlap, and it needs no embedding model.
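
The weighting can be sketched by hand as follows. This is the textbook formulation (`tf * log(N / df)`) written for illustration; `tfidf_rank` and the three sample documents are made up for the example, and library implementations typically use smoothed and normalized variants, so their exact scores will differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (enough for this toy corpus)."""
    return text.lower().split()


def tfidf_rank(query, documents):
    """Rank document indices by the summed TF-IDF weight of the query terms."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term])  # rare terms get a larger weight
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranking, scores


documents = [
    "functions in typescript accept parameters",
    "rest parameters accept a variable number of arguments",
    "arrow functions have a concise syntax",
]
ranking, scores = tfidf_rank("rest parameters", documents)
print(ranking, scores)
```

The second document ranks first because it contains the rare term `rest`, while `parameters`, which appears in two of the three documents, contributes much less to any score.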

In [100]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [103]:
# Load PDF
loader = PyPDFLoader("./datasets/example_doc.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

In [105]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)

In [106]:
splits = text_splitter.split_text(joined_page_text)

In [107]:
splits

['Mastering\nFunctions\nin\nTypeScript:\nA\nComprehensive\nGuide\n|\nCode\nwith\nPrince\nDescription:\nWelcome\nback\nto\n"Code\nwith\nPrince"!\nIn\nthis\nsecond\nvideo\nof\nour\nTypeScript\ntutorial\nseries,\nwe\'ll\ndelve\ninto\nthe\npowerful\nworld\nof\nfunctions\nin\nTypeScript.\nFunctions\nare\nthe\nbackbone\nof\nany\nprogramming\nlanguage,\nand\nTypeScript\nprovides\nadditional\nfeatures\nand\nenhancements\nto\nmake\nyour\ncode\nmore\nmaintainable\nand\nscalable.\nIn\nthis\ntutorial,\nwe\'ll\nexplore\na\nwide\nrange\nof\ntopics\nrelated\nto\nfunctions\nin\nTypeScript:\n1.\nIntroduction\nto\nFunctions:\nUnderstand\nthe\nfundamentals\nof\nfunctions\nand\ntheir\nsignificance\nin\nprogramming.\n2.\nFunction\nDeclaration\nand\nParameters:\nLearn\nhow\nto\ndeclare\nfunctions,\ndefine\nparameters,\nand\nspecify\nreturn\ntypes\nin\nTypeScript.\n3.\nOptional\nand\nDefault\nParameters:\nDiscover\nTypeScript\'s\nsupport\nfor\noptional\nand\ndefault\nparameters,\nallowing\nfor\nmore\nflexibl

In [108]:
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [109]:
question = "Is ther a point where we go over function parameters"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content='Mastering\nFunctions\nin\nTypeScript:\nA\nComprehensive\nGuide\n|\nCode\nwith\nPrince\nDescription:\nWelcome\nback\nto\n"Code\nwith\nPrince"!\nIn\nthis\nsecond\nvideo\nof\nour\nTypeScript\ntutorial\nseries,\nwe\'ll\ndelve\ninto\nthe\npowerful\nworld\nof\nfunctions\nin\nTypeScript.\nFunctions\nare\nthe\nbackbone\nof\nany\nprogramming\nlanguage,\nand\nTypeScript\nprovides\nadditional\nfeatures\nand\nenhancements\nto\nmake\nyour\ncode\nmore\nmaintainable\nand\nscalable.\nIn\nthis\ntutorial,\nwe\'ll\nexplore\na\nwide\nrange\nof\ntopics\nrelated\nto\nfunctions\nin\nTypeScript:\n1.\nIntroduction\nto\nFunctions:\nUnderstand\nthe\nfundamentals\nof\nfunctions\nand\ntheir\nsignificance\nin\nprogramming.\n2.\nFunction\nDeclaration\nand\nParameters:\nLearn\nhow\nto\ndeclare\nfunctions,\ndefine\nparameters,\nand\nspecify\nreturn\ntypes\nin\nTypeScript.\n3.\nOptional\nand\nDefault\nParameters:\nDiscover\nTypeScript\'s\nsupport\nfor\noptional\nand\ndefault\nparameters,\nallowin

In [113]:
for doc in docs_svm:
    # Flatten newlines in place; no need to write back into a list
    doc.page_content = doc.page_content.replace("\n", " ")

In [114]:
for i, doc in enumerate(docs_svm):
    print(f"Doc {i}: {doc.page_content} \n {doc.metadata}", end="\n\n")

Doc 0: Mastering Functions in TypeScript: A Comprehensive Guide | Code with Prince Description: Welcome back to "Code with Prince"! In this second video of our TypeScript tutorial series, we'll delve into the powerful world of functions in TypeScript. Functions are the backbone of any programming language, and TypeScript provides additional features and enhancements to make your code more maintainable and scalable. In this tutorial, we'll explore a wide range of topics related to functions in TypeScript: 1. Introduction to Functions: Understand the fundamentals of functions and their significance in programming. 2. Function Declaration and Parameters: Learn how to declare functions, define parameters, and specify return types in TypeScript. 3. Optional and Default Parameters: Discover TypeScript's support for optional and default parameters, allowing for more flexible function signatures. 4. Rest Parameters: Explore the rest parameter syntax, enabling functions to accept a variable num

In [115]:
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content='Mastering\nFunctions\nin\nTypeScript:\nA\nComprehensive\nGuide\n|\nCode\nwith\nPrince\nDescription:\nWelcome\nback\nto\n"Code\nwith\nPrince"!\nIn\nthis\nsecond\nvideo\nof\nour\nTypeScript\ntutorial\nseries,\nwe\'ll\ndelve\ninto\nthe\npowerful\nworld\nof\nfunctions\nin\nTypeScript.\nFunctions\nare\nthe\nbackbone\nof\nany\nprogramming\nlanguage,\nand\nTypeScript\nprovides\nadditional\nfeatures\nand\nenhancements\nto\nmake\nyour\ncode\nmore\nmaintainable\nand\nscalable.\nIn\nthis\ntutorial,\nwe\'ll\nexplore\na\nwide\nrange\nof\ntopics\nrelated\nto\nfunctions\nin\nTypeScript:\n1.\nIntroduction\nto\nFunctions:\nUnderstand\nthe\nfundamentals\nof\nfunctions\nand\ntheir\nsignificance\nin\nprogramming.\n2.\nFunction\nDeclaration\nand\nParameters:\nLearn\nhow\nto\ndeclare\nfunctions,\ndefine\nparameters,\nand\nspecify\nreturn\ntypes\nin\nTypeScript.\n3.\nOptional\nand\nDefault\nParameters:\nDiscover\nTypeScript\'s\nsupport\nfor\noptional\nand\ndefault\nparameters,\nallowin

In [118]:
for doc in docs_tfidf:
    # Flatten newlines in place; no need to write back into a list
    doc.page_content = doc.page_content.replace("\n", " ")

In [119]:
for i, doc in enumerate(docs_tfidf):
    print(f"Doc {i}: {doc.page_content} \n {doc.metadata}", end="\n\n")

Doc 0: Mastering Functions in TypeScript: A Comprehensive Guide | Code with Prince Description: Welcome back to "Code with Prince"! In this second video of our TypeScript tutorial series, we'll delve into the powerful world of functions in TypeScript. Functions are the backbone of any programming language, and TypeScript provides additional features and enhancements to make your code more maintainable and scalable. In this tutorial, we'll explore a wide range of topics related to functions in TypeScript: 1. Introduction to Functions: Understand the fundamentals of functions and their significance in programming. 2. Function Declaration and Parameters: Learn how to declare functions, define parameters, and specify return types in TypeScript. 3. Optional and Default Parameters: Discover TypeScript's support for optional and default parameters, allowing for more flexible function signatures. 4. Rest Parameters: Explore the rest parameter syntax, enabling functions to accept a variable num

Both retrievers return more or less the same results.