# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. Even the most powerful LLMs, such as GPT-4, know nothing about events that occurred after their training data was collected.

An LLM's knowledge is frozen in time: it is a static snapshot of the world as captured in its training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.
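
Before building the real pipeline, the idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the three-sentence `knowledge_base`, the word-overlap scoring in `retrieve`, and the prompt layout in `build_prompt` are all hypothetical stand-ins for the PDF corpus, embedding search, and prompt template used later in this notebook.

```python
# Toy knowledge base (hypothetical); the real one is built from PDFs below.
knowledge_base = [
    "Pinecone is a managed vector database.",
    "LangChain provides document loaders and text splitters.",
    "all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.",
]


def retrieve(query, documents, k=1):
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents):
    """Place the retrieved text in front of the question for the LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


prompt_text = build_prompt("what is pinecone", knowledge_base)
print(prompt_text)
```

The rest of the notebook replaces word overlap with embedding similarity and the list with a Pinecone index, but the retrieve-then-prompt shape stays the same.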

In [1]:
print("hello world")

hello world


In [2]:
%pwd

'd:\\VisualCode\\chatbot_ai\\chatbot_ai\\research'

In [3]:
import os
os.chdir("../")

In [4]:
%pwd

'd:\\VisualCode\\chatbot_ai\\chatbot_ai'

## Building The Knowledge Base

In [5]:
# Newer LangChain releases ship the loaders and splitters in separate packages
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [47]:
# Extract text from PDFs using langchain
def load_pdf_file(data):
    loader = DirectoryLoader(data,
                             glob="*.pdf",
                             loader_cls=PyPDFLoader)

    documents = loader.load()
    
    return documents

In [48]:
extracted_text = load_pdf_file("data/")

In [33]:
# extracted_text

In [49]:
# Split the text into chunks 
def text_split(extracted_text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512,
                                                   chunk_overlap=128)
    text_chunks = text_splitter.split_documents(extracted_text)
    return text_chunks
    

In [50]:
text_chunks = text_split(extracted_text)
print("Length of Text Chunks: ", len(text_chunks))

Length of Text Chunks:  18761
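
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks); the function name `sliding_window_chunks` and the sample string are assumptions made for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. That is the reason for the overlap of 128 used above.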


In [51]:
text_chunks[0]

Document(metadata={'producer': 'Antenna House PDF Output Library 2.6.0 (Linux64)', 'creator': 'AH CSS Formatter V6.0 MR2 for Linux64 : 6.0.2.5372 (2012/05/16 18:26JST)', 'creationdate': '2017-03-01T18:01:20+00:00', 'author': 'Martin Kleppmann', 'moddate': '2017-03-01T13:07:54-05:00', 'title': 'Designing Data-Intensive Applications', 'trapped': '/False', 'source': 'data\\Designing_Data-Intensive_Applications_TH.pdf', 'total_pages': 613, 'page': 0, 'page_label': 'Cover'}, page_content='Martin Kleppmann\nDesigning \nData-Intensive \nApplications\nTHE BIG IDEAS BEHIND RELIABLE, SCALABLE,  \nAND MAINTAINABLE SYSTEMS')

## Creating the Embedding Model and Querying

In [52]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

In [79]:
# Load the embedding model from Hugging Face (weights are downloaded on first use)
embed_model = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {"device": "cuda"}  # assumes a CUDA-capable GPU; use "cpu" otherwise
def download_embeddings():
    embeddings = HuggingFaceEmbeddings(model_name=embed_model, model_kwargs=model_kwargs)
    return embeddings

In [54]:
embeddings = download_embeddings()

In [55]:
query = "This is an example sentence"
query_embedding = embeddings.embed_query(query)
print("Query length: ", len(query_embedding)) # dimension of the embedding

Query length:  384


In [56]:
query_embedding[:10]

[0.06765691936016083,
 0.0634959414601326,
 0.0487130731344223,
 0.07930496335029602,
 0.037448056042194366,
 0.0026527722366154194,
 0.039374940097332,
 -0.007098493631929159,
 0.05936146154999733,
 0.03153702989220619]
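
Embedding vectors like the one above are compared by similarity; the Pinecone index in this notebook is created with `metric='cosine'`. The sketch below computes cosine similarity by hand on small made-up vectors (the three-dimensional values are assumptions for illustration, not real embeddings).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v_query = [1.0, 0.0, 1.0]
v_scaled = [2.0, 0.0, 2.0]      # same direction, different length
v_orthogonal = [0.0, 3.0, 0.0]  # shares no direction with v_query

print(cosine_similarity(v_query, v_scaled))
print(cosine_similarity(v_query, v_orthogonal))
```

Cosine similarity depends only on direction, so the scaled vector scores as a near-perfect match while the orthogonal one scores zero.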

## Creating Index and Initializing Connection to Pinecone

In [78]:
from dotenv import load_dotenv
load_dotenv()

True

In [77]:
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
GROQ_API_KEY = os.environ.get("GROQ_API_KEY")
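
`os.environ.get` returns `None` when a variable is missing, and the failure then surfaces later in a less obvious place. One option is to fail early with a clear message. The helper name `require_env` and the variable `DEMO_API_KEY` below are hypothetical and used only to keep the sketch self-contained.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file")
    return value


# Hypothetical variable set here only so the example runs on its own.
os.environ["DEMO_API_KEY"] = "demo-value"
demo_key = require_env("DEMO_API_KEY")
print(demo_key)
```

In this notebook the same helper could be called with `"PINECONE_API_KEY"` and `"GROQ_API_KEY"` after `load_dotenv()`.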

In [None]:
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=PINECONE_API_KEY)

In [59]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

In [60]:
index_name = 'asura'

In [68]:
import time

if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

# we create a new index
pc.create_index(
        index_name,
        dimension=384,  # dimensionality of all-MiniLM-L6-v2
        metric='cosine',
        spec=spec
    )

# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
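
The wait loop above polls forever if the index never becomes ready. A possible variant adds a timeout. The helper `wait_until_ready` and the stand-in check `fake_ready` are assumptions for illustration; they are not part of the Pinecone client.

```python
import time


def wait_until_ready(is_ready, timeout_s=60.0, poll_interval_s=1.0):
    """Call is_ready() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"resource not ready after {timeout_s} seconds")
        time.sleep(poll_interval_s)


# Stand-in readiness check that succeeds on the third call.
calls = {"count": 0}


def fake_ready():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until_ready(fake_ready, timeout_s=5.0, poll_interval_s=0.01)
print("ready after", calls["count"], "checks")
```

With the client from this notebook, the check could be passed as `lambda: pc.describe_index(index_name).status['ready']`.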

In [64]:
import time
index = pc.Index(index_name)
# wait a moment for connection
time.sleep(1)

index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 20529}},
 'total_vector_count': 20529}

## Creating and Initializing the Vector Store

In [62]:
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY

In [63]:
# Embed each chunk and upsert the embeddings into your Pinecone index

from langchain_pinecone import PineconeVectorStore


vectorestore_from_docs = PineconeVectorStore.from_documents(
    documents=text_chunks,
    index_name=index_name,
    embedding=embeddings,
)

In [65]:
# Load Existing Index

vectorestore_from_docs = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embeddings
)

In [None]:
# Retriever

retriever = vectorestore_from_docs.as_retriever(
    search_type="similarity", 
    search_kwargs={"k":3}
    )
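
Conceptually, a similarity retriever with `k=3` scores every stored vector against the query vector and keeps the three best. The sketch below does this in memory with made-up two-dimensional unit vectors (for unit vectors the dot product equals cosine similarity). The names `top_k` and `toy_store` are hypothetical, and a real vector database uses approximate indexes rather than scanning every vector.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, store, k=3):
    """Return the k (score, doc_id) pairs with the highest scores, best first."""
    scored = ((dot(query_vector, vector), doc_id) for doc_id, vector in store.items())
    return heapq.nlargest(k, scored)


# Made-up unit vectors standing in for stored chunk embeddings.
toy_store = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.6, 0.8],
    "doc-c": [0.0, 1.0],
    "doc-d": [-1.0, 0.0],
}

hits = top_k([1.0, 0.0], toy_store, k=3)
for score, doc_id in hits:
    print(f"{doc_id}: {score:.2f}")
```

The vector pointing away from the query (`doc-d`) is the one left out, which mirrors how the retriever above returns only the three closest chunks.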

In [73]:
retrieved_docs1 = retriever.invoke("What is Data Flow")
retrieved_docs2 = retriever.invoke("What is Neural Network")
retrieved_docs3 = retriever.invoke("Who is Data Engineer")


In [74]:
retrieved_docs1

[Document(id='acf8490e-3b47-4751-9bf7-138a127690f4', metadata={'author': 'Martin Kleppmann', 'creationdate': '2017-03-01T18:01:20+00:00', 'creator': 'AH CSS Formatter V6.0 MR2 for Linux64 : 6.0.2.5372 (2012/05/16 18:26JST)', 'moddate': '2017-03-01T13:07:54-05:00', 'page': 150.0, 'page_label': '129', 'producer': 'Antenna House PDF Output Library 2.6.0 (Linux64)', 'source': 'data\\Designing_Data-Intensive_Applications_TH.pdf', 'title': 'Designing Data-Intensive Applications', 'total_pages': 613.0, 'trapped': '/False'}, page_content='That’s a fairly abstract idea—there are many ways data can flow from one process to\nanother. Who encodes the data, and who decodes it? In the rest of this chapter we\nwill explore some of the most common ways how data flows between processes:\n• Via databases (see “Dataflow Through Databases” on page 129)\n• Via service calls (see “Dataflow Through Services: REST and RPC” on page 131)\n• Via asynchronous message passing (see “Message-Passing Dataflow” on pag

In [75]:
retrieved_docs2

[Document(id='ca0e5a25-291b-4c9b-a997-2e02af8388cb', metadata={'author': 'Rabunal, Juan Ramon; Dorado, Julian; Pazos Sierra, Alejandro.', 'creationdate': '2008-07-02T14:52:44-04:00', 'creator': 'Adobe InDesign CS2 (4.0.5)', 'moddate': '2008-09-23T21:53:54+02:00', 'page': 70.0, 'page_label': '35', 'producer': 'Adobe PDF Library 7.0', 'source': 'data\\encyclopedia-of-artificial-intelligence.pdf', 'title': 'Encyclopedia of Artificial Intelligence', 'total_pages': 1677.0}, page_content='and Applications ISDA ‘05, 141 – 146. Institute of \nElectrical & Electronics Engineering Publisher.\nKEy TERMS\nArtificial Neural Networks (ANN): An artificial \nneural network, often just called a “neural network” \n(NN), is an interconnected group of artificial neurons \nthat uses a mathematical model or computational model \nfor information processing based on a connectionist \napproach to computation. Knowledge is acquired by \nthe network from its environment through a learning'),
 Document(id='c702a7

In [76]:
retrieved_docs3

[Document(id='013b1f02-80f2-409c-8353-106b350048e0', metadata={'author': 'Martin Kleppmann', 'creationdate': '2017-03-01T18:01:20+00:00', 'creator': 'AH CSS Formatter V6.0 MR2 for Linux64 : 6.0.2.5372 (2012/05/16 18:26JST)', 'moddate': '2017-03-01T13:07:54-05:00', 'page': 565.0, 'page_label': '544', 'producer': 'Antenna House PDF Output Library 2.6.0 (Linux64)', 'source': 'data\\Designing_Data-Intensive_Applications_TH.pdf', 'title': 'Designing Data-Intensive Applications', 'total_pages': 613.0, 'trapped': '/False'}, page_content='data breaches, and we may find that a well-intentioned use of data has unintended\nconsequences.\nAs software and data are having such a large impact on the world, we engineers must\nremember that we carry a responsibility to work toward the kind of world that we\nwant to live in: a world that treats people with humanity and respect. I hope that we\ncan work together toward that goal. \n544 | Chapter 12: The Future of Data Systems'),
 Document(id='a7a53198-b4

## Generative Question-Answering

All of these are good, relevant results. But what can we do with them? There are many possible tasks; one of the most interesting (and well supported by LangChain) is called _"Generative Question-Answering"_ or GQA.

In GQA we pass the query to an LLM as a question, but the LLM must base its answer on the documents returned from the vector store.

**Setup Groq Model**

To do this we first initialize a Groq-hosted chat model, then combine it with the retriever using `create_retrieval_chain`:

In [82]:
# !pip install langchain-groq
from langchain_groq import ChatGroq

llm_model = "llama3-8b-8192"
max_tokens = 512
temperature = 0.4
llm = ChatGroq(temperature=temperature, model_name=llm_model, max_tokens=max_tokens)

In [83]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Adjacent string literals are concatenated, so each sentence ends with a space.
system_prompt = (
    "You are an expert-level Data Engineering and Artificial Intelligence Consultant tasked with providing precise, comprehensive, and actionable insights on complex technical queries. "
    "Your domain expertise encompasses advanced data architectures, cloud-based data solutions, ETL processes, big data frameworks, and cutting-edge machine learning and deep learning methodologies. "
    "In every response, adhere strictly to corporate communication standards; employ a formal, respectful tone and use industry-specific terminology. "
    "Your explanations must be methodically reasoned, grounded in current best practices, and aligned with the highest standards of technical accuracy and ethical responsibility. "
    "Ensure that each recommendation is both innovative and pragmatic, supporting strategic decision-making in the realms of data engineering and AI."
    "\n\n"  # blank line separating the instructions from the retrieved context
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [84]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
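
The "stuff" strategy is simple: the text of every retrieved document is joined together and substituted into the `{context}` placeholder of the prompt. The plain-Python sketch below mirrors that idea only; the function `stuff_documents`, the toy template, and the two toy chunks are assumptions for illustration and not LangChain's implementation.

```python
def stuff_documents(system_template, page_contents, separator="\n\n"):
    """Join all retrieved chunks and substitute them for {context} in the template."""
    context = separator.join(page_contents)
    return system_template.format(context=context)


# Hypothetical template and chunks standing in for the prompt and retriever output.
toy_template = "Answer using only the context below.\n\n{context}"
toy_pages = ["Chunk one about dataflow.", "Chunk two about message passing."]

stuffed = stuff_documents(toy_template, toy_pages)
print(stuffed)
```

Because every chunk is placed in a single prompt, the number of retrieved documents (`k`) and the chunk size together determine how much of the model's context window is used.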

In [90]:
response = rag_chain.invoke({"input": "What is Data Pipeline?"})
print(response["answer"])

A data pipeline, also known as a dataflow or data stream, is a sequence of processes that extract, transform, and load (ETL) data from various sources, transform it into a desired format, and load it into a target system, such as a data warehouse, database, or data lake. The primary goal of a data pipeline is to efficiently and reliably move data from its source to its destination, enabling data-driven decision-making and analytics.

A data pipeline typically consists of several components, including:

1. Data Sources: These are the systems, applications, or devices that generate data, such as databases, files, APIs, or IoT devices.
2. Data Processing: This involves transforming and manipulating the data to ensure it is in the desired format and meets the required quality standards.
3. Data Storage: This is where the processed data is stored, such as a data warehouse, database, or data lake.
4. Data Consumers: These are the applications, services, or users that consume the processed da

In [89]:
response = rag_chain.invoke({"input": "What is Acne?"})
print(response["answer"])

I'm happy to help! However, I notice that the provided text doesn't mention Acne. Instead, it appears to be discussing biometric systems, face detection, and related concepts.

If you meant to ask about acne, I'd be happy to provide information. Acne is a common skin condition characterized by the appearance of pimples, blackheads, and whiteheads on the skin. It occurs when the pores on the skin become clogged with dead skin cells, oil, and bacteria, leading to inflammation and infection.

If you'd like to know more about acne or have any specific questions, please feel free to ask!
