### RAG with LangChain and Llama

---
### 1. Introduction

#### Limitation of LLMs

- know nothing beyond their training data, e.g. up-to-date information or classified/private data
- are not specialized for specific use cases
- tend to hallucinate confidently, which can lead to misinformation
- produce black-box output: they do not show what led to a particular piece of generated content

#### Fine-Tuning
- Improves model performance for a specific use case through transfer learning on additional data.
- Changes the model parameters, which can improve speed and reduce cost for specific tasks.
- Powerful tool for:
  - Incorporating static or historical data.
  - Adapting to specific industries with nuances in writing style, and reducing costs for specific tasks.
- The knowledge cut-off problem persists: the model still lacks up-to-date information.

#### Retrieval Augmented Generation
- Extends model capabilities by:
  - **Retrieving** external and up-to-date information.
  - **Augmenting** the original prompt given to the model with that information.
  - **Generating** a response from the original prompt plus the retrieved context.
- The parameters of the underlying LLM remain unchanged (no transfer learning).
- Powerful tool for making use of dynamic, up-to-date information.
- White-box output: answers can be traced back to the retrieved sources, which reduces (but does not eliminate) hallucination.
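
The three steps above can be sketched in plain Python. This is a toy illustration only: the knowledge base, the word-overlap ranking, and the `generate` placeholder are all assumptions made for the example, and a real pipeline would use embeddings and an actual LLM call instead.

```python
import re

# Hypothetical mini knowledge base standing in for external, up-to-date documents.
KNOWLEDGE_BASE = [
    "The cafeteria opens at 8 am on weekdays.",
    "The library closes at 10 pm during exam season.",
    "Parking permits are renewed every September.",
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Retrieve: rank documents by the number of words shared with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def augment(question, context_docs):
    """Augment: put the retrieved context in front of the original question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def generate(prompt):
    """Generate: placeholder for the LLM call (a real system would call the model here)."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"


question = "When does the library close?"
context_docs = retrieve(question, KNOWLEDGE_BASE)
prompt = augment(question, context_docs)
answer = generate(prompt)
print(prompt)
```

Note that the model itself is untouched throughout: only the prompt changes, which is what distinguishes this approach from fine-tuning.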

#### RAG Framework
<img src="../documents/rag.png" width="950"/>


#### Technology Stack


##### [LangChain](https://python.langchain.com/docs/introduction/)

> Framework for developing applications powered by LLMs

> [What is LangChain? By IBM Technology](https://www.youtube.com/watch?v=1bUy-1hGZpI)

##### [FAISS (Facebook AI Similarity Search)](https://ai.meta.com/tools/faiss/)

>  Library for efficient similarity search over dense embedding vectors; used here as the vector store

>  [Facebook AI Similarity Search FAISS | OpenAI's Embeddings Endpoint | Gen AI OpenAI API in Python](https://www.youtube.com/watch?v=Gx3TzYFaCS8)

##### [Groq](https://groq.com/about-us/)

> Engine providing fast AI inference (applying a trained model to new data) in the cloud

### 2. Warm up

#### Load Credentials

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

#### Define LLM

In [2]:
import warnings
warnings.filterwarnings("ignore")
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama3-8b-8192",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2
)

#### Define Prompt template

**What is a Prompt?**
>- a set of instructions or input that a user provides to an LLM to guide its response
>- helps the model understand the context and generate relevant, coherent output



In [3]:
from langchain.prompts.prompt import PromptTemplate

In [4]:
query = """
    given the information {information} about a person I want you to create:
    1. A short summary
    2. two interesting facts about them
    """

In [5]:
prompt_template = PromptTemplate(
    input_variables=["information"],
    template=query
)
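
To make the role of the `{information}` placeholder concrete, the sketch below fills the same template text with plain Python string formatting. It is an approximation under the assumption that, for a simple template like this one, the result is comparable to what the template produces; `sample_information` is a made-up input for illustration.

```python
# Same template text as in the notebook, with one named placeholder.
query = """
    given the information {information} about a person I want you to create:
    1. A short summary
    2. two interesting facts about them
    """

# Hypothetical input text used only for this illustration.
sample_information = "Ada Lovelace is often described as the first computer programmer."

# Plain-Python equivalent of substituting a value for the placeholder.
filled_prompt = query.format(information=sample_information)
print(filled_prompt)
```

The filled string is what ultimately reaches the model; the template only fixes the surrounding instructions.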

#### Define Chain
**What is a Chain?**

> - lets you feed the output of one component (e.g. an LLM call) into the next one as its input

**Note:**
The `|` symbol chains together the different components, feeding the output from one component as input into the next component.
In this chain the user input is passed to the prompt template, then the prompt template output is passed to the model. 

In [8]:
chain = prompt_template | llm
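
How a `|` operator can compose components is easier to see in a toy version. The `Step` class below is a hypothetical stand-in written for this illustration, not LangChain's own implementation; it only shows the general idea of overloading `__or__` so that one step's output becomes the next step's input.

```python
class Step:
    """Toy stand-in for a chain component; not LangChain's own class."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        """Run this step on a single input."""
        return self.func(value)

    def __or__(self, other):
        """`a | b` builds a new step that runs a first, then feeds its output to b."""
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical components: a prompt formatter and a fake model that upper-cases text.
toy_prompt = Step(lambda inputs: f"Summarise: {inputs['information']}")
toy_llm = Step(lambda prompt_text: prompt_text.upper())

toy_chain = toy_prompt | toy_llm
result = toy_chain.invoke({"information": "pipes compose steps"})
print(result)  # SUMMARISE: PIPES COMPOSE STEPS
```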

#### Invoke Chain

In [9]:
text_data = """
Geoffrey Everest Hinton (born 6 December 1947) is a British-Canadian computer scientist, cognitive scientist, 
cognitive psychologist, known for his work on artificial neural networks which earned him the title as the 
"Godfather of AI". Hinton is University Professor Emeritus at the University of Toronto. From 2013 to 2023, 
he divided his time working for Google (Google Brain) and the University of Toronto, before publicly announcing 
his departure from Google in May 2023, citing concerns about the risks of artificial intelligence (AI) technology.
In 2017, he co-founded and became the chief scientific advisor of the Vector Institute in Toronto.

With David Rumelhart and Ronald J. Williams, Hinton was co-author of a highly cited paper published in 1986 
that popularised the backpropagation algorithm for training multi-layer neural networks, although they were 
not the first to propose the approach. Hinton is viewed as a leading figure in the deep learning community.
The image-recognition milestone of the AlexNet designed in collaboration with his students Alex Krizhevsky 
and Ilya Sutskever for the ImageNet challenge 2012 was a breakthrough in the field of computer vision.

Hinton received the 2018 Turing Award, often referred to as the "Nobel Prize of Computing", together with 
Yoshua Bengio and Yann LeCun, for their work on deep learning. They are sometimes referred to as the 
"Godfathers of Deep Learning", and have continued to give public talks together. He was also awarded 
the 2024 Nobel Prize in Physics, shared with John Hopfield.
"""

In [10]:
output = chain.invoke(input={"information": text_data})

In [11]:
print(output.content)

Here is the information about Geoffrey Everest Hinton:

**Summary:** Geoffrey Everest Hinton is a British-Canadian computer scientist, cognitive scientist, and cognitive psychologist known for his work on artificial neural networks and deep learning. He is often referred to as the "Godfather of AI" and has made significant contributions to the field of computer vision and machine learning.

**Interesting Facts:**

1. **Co-author of a highly cited paper:** Hinton co-authored a paper with David Rumelhart and Ronald J. Williams in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks. This paper has had a significant impact on the field of artificial intelligence.
2. **Recipient of prestigious awards:** Hinton has received several prestigious awards, including the 2018 Turing Award, often referred to as the "Nobel Prize of Computing", and the 2024 Nobel Prize in Physics, shared with John Hopfield. He is also known as the "Godfather of AI" and has bee

### 3. Summarizing Text Data

#### Load data

In [12]:
from langchain_community.document_loaders import PyPDFLoader

def load_pdf_data(pdf_path):
    """
    this function loads the text of a pdf file and returns
    a list of documents, one per page
    """
    loader = PyPDFLoader(file_path=pdf_path)
    documents = loader.load()
    return documents

In [13]:
react_docs = load_pdf_data(pdf_path = "../documents/react_paper.pdf")

In [14]:
print(f"number of loaded pages: {len(react_docs)}")

number of loaded pages: 33


In [15]:
print(react_docs[0].page_content)

Published as a conference paper at ICLR 2023
REAC T: S YNERGIZING REASONING AND ACTING IN
LANGUAGE MODELS
Shunyu Yao∗*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2
1Department of Computer Science, Princeton University
2Google Research, Brain team
1{shunyuy,karthikn}@princeton.edu
2{jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com
ABSTRACT
While large language models (LLMs) have demonstrated impressive performance
across tasks in language understanding and interactive decision making, their
abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action
plan generation) have primarily been studied as separate topics. In this paper, we
explore the use of LLMs to generate both reasoning traces and task-speciﬁc actions
in an interleaved manner, allowing for greater synergy between the two: reasoning
traces help the model induce, track, and update action plans as well as handle
exceptions, while actions allow it to interface with an

#### Split Document into Chunks
>- the whole content cannot be fed into the LLM at once because the context window is finite
>- even models with large context windows may struggle to find information in very long inputs, and answer quality can degrade
>- chunking the document into pieces makes it possible to retrieve only the relevant parts of the corpus

In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents, chunk_size=800, chunk_overlap=80):
    """
    this function splits documents into chunks of given size and overlap
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = text_splitter.split_documents(documents=documents)
    return chunks

In [17]:
react_chunks = split_documents(react_docs)

In [18]:
print(f"number of chunks created: {len(react_chunks)}")

number of chunks created: 170
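
The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified sliding-window splitter. This is a sketch of the general idea only: it cuts at fixed character positions, whereas `RecursiveCharacterTextSplitter` tries to break at natural separators such as paragraph and line boundaries first. The function name and the sample text are assumptions made for the example.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=80):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.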


#### Create Embeddings
>  converting text chunks into numerical vectors (embeddings) so that semantically similar chunks lie close together

In [19]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import os

def create_embedding_vector_db(chunks, db_name, target_directory="../vector_databases"):
    """
    this function embeds the chunks with an open-source sentence-transformers
    model (loaded through HuggingFaceEmbeddings) and stores the vectors in a
    FAISS vector store, which allows for efficient similarity search
    """
    # instantiate embedding model
    embedding = HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-mpnet-base-v2'
    )
    # create the vector store
    vectorstore = FAISS.from_documents(
        documents=chunks,
        embedding=embedding
    )
    # save vector database locally, creating the directory if needed
    os.makedirs(target_directory, exist_ok=True)
    vectorstore.save_local(f"{target_directory}/{db_name}_vector_db")

In [20]:
create_embedding_vector_db(chunks=react_chunks, db_name="react")
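
What "numerical representation" and "similarity" mean can be illustrated with a deliberately crude embedding. The word-count vectors below are an assumption made for the example; the sentence-transformers model used in this notebook produces dense learned vectors instead, but the comparison by cosine similarity works the same way in principle.

```python
import math
import re
from collections import Counter


def toy_embed(text, vocabulary):
    """Map text to a vector of word counts over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary and sentences chosen for illustration.
vocabulary = ["pain", "fever", "tablet", "reasoning", "agent"]
vec_medicine = toy_embed("The tablet relieves pain and fever", vocabulary)
vec_dosage = toy_embed("Take one tablet against fever", vocabulary)
vec_agents = toy_embed("The agent interleaves reasoning and acting", vocabulary)

print(vec_medicine)  # [1, 1, 1, 0, 0]
sim_related = cosine_similarity(vec_medicine, vec_dosage)
sim_unrelated = cosine_similarity(vec_medicine, vec_agents)
print(round(sim_related, 3), sim_unrelated)  # 0.816 0.0
```

Texts about the same topic end up with a high similarity score, while unrelated texts score near zero, which is the property the vector store relies on.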

#### Retrieve from Vector Database

In [21]:
def retrieve_from_vector_db(vector_db_path):
    """
    this function loads a local vector database and returns a retriever object for it
    """
    # instantiate the same embedding model that was used to build the database
    embeddings = HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-mpnet-base-v2'
    )
    # loading unpickles data, so only enable this flag for files you created yourself
    vectorstore = FAISS.load_local(
        folder_path=vector_db_path,
        embeddings=embeddings,
        allow_dangerous_deserialization=True
    )
    retriever = vectorstore.as_retriever()
    return retriever

In [22]:
react_retriever = retrieve_from_vector_db("../vector_databases/react_vector_db")

In [23]:
type(react_retriever)

langchain_core.vectorstores.base.VectorStoreRetriever
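
Conceptually, a retriever embeds the question and returns the stored chunks whose vectors lie closest to it. The brute-force search below is a minimal sketch of that idea under simplifying assumptions: the two-dimensional vectors, the texts, and the function name are invented for the example, and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def top_k_nearest(query_vector, stored_vectors, k=2):
    """Return the k (distance, text) pairs closest to the query by Euclidean distance."""
    scored = [
        (math.dist(query_vector, vector), text)
        for text, vector in stored_vectors
    ]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Hypothetical 2-D "embeddings"; real ones have hundreds of dimensions.
toy_store = [
    ("dosage for adults", [0.9, 0.1]),
    ("overdose symptoms", [0.7, 0.4]),
    ("ReAct reasoning traces", [0.1, 0.9]),
]

query_vector = [1.0, 0.0]  # pretend this is the embedded user question
nearest = top_k_nearest(query_vector, toy_store, k=2)
for distance, text in nearest:
    print(f"{distance:.3f}  {text}")
```

Only the closest chunks are passed on to the model, which keeps the prompt short and focused on the question.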

#### Generation
**chain passing documents to llm**
[`create_stuff_documents_chain`](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html#langchain.chains.combine_documents.stuff.create_stuff_documents_chain)

- takes a list of documents, formats them all into one prompt, and passes that prompt to an LLM
- passes ALL documents, so make sure the combined prompt fits within the context window of the LLM being used
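
The "stuffing" strategy described above amounts to concatenating every document into a single prompt. The sketch below shows the idea together with a simple size guard; `stuff_documents` and `max_chars` are hypothetical names for this illustration, and the character budget is only a rough stand-in for a context window, which is measured in tokens.

```python
def stuff_documents(documents, question, max_chars=2000):
    """Join all documents into one context block and build a single prompt.

    max_chars is a rough, hypothetical stand-in for the context-window limit;
    real limits are measured in tokens, not characters.
    """
    context = "\n\n".join(documents)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    if len(prompt) > max_chars:
        raise ValueError(
            f"prompt has {len(prompt)} characters, budget is {max_chars}"
        )
    return prompt


# Made-up document snippets for illustration.
toy_documents = [
    "Paracetamol relieves pain and reduces fever.",
    "Do not exceed the recommended daily dose.",
]
stuffed_prompt = stuff_documents(toy_documents, "What is paracetamol used for?")
print(stuffed_prompt)
```

Because nothing is summarized or dropped, this approach is simple and loses no information, but it stops working once the retrieved documents outgrow the model's context window.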

In [24]:
from langchain import hub
from langchain.chains.combine_documents import create_stuff_documents_chain

[`create_retrieval_chain`](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html#langchain.chains.retrieval.create_retrieval_chain)

- takes in a user inquiry, which is then passed to the retriever to fetch relevant documents
- those documents (and original inputs) are then passed to an LLM to generate a response

In [25]:
from langchain.chains.retrieval import create_retrieval_chain

In [26]:
def connect_chains(retriever):
    """
    this function builds a stuff_documents_chain (using the global `llm`)
    and connects it with the given retriever in a retrieval_chain
    """
    stuff_documents_chain = create_stuff_documents_chain(
        llm=llm,
        prompt=hub.pull("langchain-ai/retrieval-qa-chat")
    )
    retrieval_chain = create_retrieval_chain(
        retriever=retriever,
        combine_docs_chain=stuff_documents_chain
    )
    return retrieval_chain

In [27]:
react_retrieval_chain = connect_chains(react_retriever)

In [28]:
output = react_retrieval_chain.invoke(
    {"input": "Give me the summary of ReAct in 5 sentences"}
)

In [29]:
type(output)

dict

In [30]:
output.keys()

dict_keys(['input', 'context', 'answer'])

In [31]:
print(output['answer'])

ReAct is a framework that augments an agent's action space to include language-based actions, which it calls "thoughts" or "reasoning traces". These thoughts do not affect the external environment, but instead aim to compose useful information by reasoning over the current context. ReAct allows the agent to update its context and support future reasoning or acting. The framework shows strong generalization to new task instances and is robust to prompt selections, and also promises an interpretable sequential decision-making and reasoning process. Additionally, humans can control or correct the agent's behavior on the go by editing its thoughts.


### 4. Chat with PDF

In [32]:
paracetamol_docs = load_pdf_data(pdf_path = "../documents/paracetamol.pdf")

#### Split Document into Chunks and Create Embeddings

In [33]:
paracetamol_chunks = split_documents(paracetamol_docs)

In [34]:
create_embedding_vector_db(chunks=paracetamol_chunks, db_name="paracetamol")

In [35]:
paracetamol_retriever = retrieve_from_vector_db("../vector_databases/paracetamol_vector_db")

#### Generation

In [36]:
paracetamol_retrieval_chain = connect_chains(paracetamol_retriever)

In [37]:
def print_output(
    inquiry,
    retrieval_chain=paracetamol_retrieval_chain
):
    result = retrieval_chain.invoke({"input": inquiry})
    print(result['answer'].strip("\n"))

#### Different Questions/Queries

In [38]:
print_output("Give me the summary of Paracetamol in 3 sentences.")

Paracetamol 500mg Tablets are used to relieve pain and reduce fever in adults, adolescents, and children aged 4 years and above. The recommended daily dose for adults and adolescents is 4,000mg (8 tablets) and for children, it is 60mg/kg/day. In case of overdose, symptoms may include nausea, vomiting, loss of appetite, paleness, and abdominal pain, and it is recommended to seek medical help immediately if a larger amount is ingested than recommended.


In [39]:
# German for: "Give me the summary of Paracetamol in 3 sentences."
print_output("Geb mir die Zusammenfassung von Paracetamol in 3 Sätzen.")

Hier ist eine Zusammenfassung von Paracetamol in 3 Sätzen:

Paracetamol 500 mg Die Apotheke hilft Schmerztabletten ist ein Schmerzmittel, das bei Erwachsenen und Jugendlichen ab 12 Jahren verwendet werden kann. Es sollte nicht zusammen mit Alkohol eingenommen werden und kann die Blutzucker- und Harnsäurebestimmung beeinflussen. Wenn eine größere Menge als empfohlen eingenommen wurde, sollte sofort der nächstgelegene Arzt aufgerufen werden, und bei Verschlimmerung der Beschwerden oder Fieber sollte ein Arzt aufgesucht werden.


In [40]:
# German for: "Which chemical ingredients does Paracetamol contain?"
print_output("Welche chemischen Inhaltsstoffe enthält Paracetamol?")

Nach dem Text enthält Paracetamol 500 mg Die Apotheke hilft Schmerztabletten den Wirkstoff Paracetamol und die sonstigen Bestandteile:

* Vorverkleisterte Stärke [Mais] (Ph. Eur.)
* Stearinsäure (Ph. Eur.)
* Povidon K 30 (Ph. Eur.)


In [41]:
# German for: "Which ingredients does Paracetamol contain?"
print_output("Welche Inhaltsstoffe enthält Paracetamol?")

Nach dem Text enthält Paracetamol 500 mg Die Apotheke hilft Schmerztabletten den Wirkstoff Paracetamol und die sonstigen Bestandteile:

* Vorverkleisterte Stärke [Mais] (Ph. Eur.)
* Stearinsäure (Ph. Eur.)
* Povidon K 30 (Ph. Eur.)


In [42]:
# German for: "When should Paracetamol not be taken?"
print_output("Wann sollte man Paracetamol nicht einnehmen?")

Nach dem Text sollte man Paracetamol 500 mg Die Apotheke hilft Schmerztabletten nicht einnehmen:

* zusammen mit Alkohol
* wenn man schwanger ist oder stillt, oder wenn man vermutet, schwanger zu sein oder beabsichtigt, schwanger zu werden (es wird empfohlen, vor der Einnahme einen Arzt oder Apotheker zu fragen)


## References

1. [RAG vs. Fine Tuning](https://www.youtube.com/watch?v=00Q0G84kq3M)
2. [How to Use Langchain Chain Invoke: A Step-by-Step Guide](https://medium.com/@asakisakamoto02/how-to-use-langchain-chain-invoke-a-step-by-step-guide-9a6f129d77d1)
3. [Implementing RAG using Langchain and Ollama](https://medium.com/@imabhi1216/implementing-rag-using-langchain-and-ollama-93bdf4a9027c)