## RAG Fusion Technique Implementation

[RAG Fusion paper for reference](https://arxiv.org/abs/2402.03367)

In [None]:
pip install --upgrade --quiet  langchain langchain-community langchainhub langchain-openai langchain-chroma bs4

In [None]:
pip install unstructured



In [None]:
from langchain_openai import ChatOpenAI

# Replace the placeholder string with your own OpenAI API key.
llm = ChatOpenAI(model="gpt-4o-mini", api_key="enter openai api key")

In [None]:
pip install langsmith



In [None]:
# WebBaseLoader fetches each URL and parses the HTML with BeautifulSoup (bs4).
from langchain_community.document_loaders import WebBaseLoader



import os
os.environ['USER_AGENT'] = 'myagent'
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = 'Enter your langchain api key'
os.environ['LANGCHAIN_PROJECT'] = 'rag-fusion-implementation'


# I'm using essays by Paul Graham, a renowned American engineer, entrepreneur, investor and co-founder of Y Combinator.
# Any other dataset of your choice can be used instead.
file_paths = [
    "https://www.paulgraham.com/start.html",
    "https://www.paulgraham.com/ds.html",
    "https://www.paulgraham.com/makersschedule.html",
    "https://www.paulgraham.com/startupmistakes.html",
    "https://www.paulgraham.com/founders.html",
]


docs = []
for file_path in file_paths:
    loader = WebBaseLoader(file_path)
    docs.extend(loader.load())

docs

[Document(metadata={'source': 'https://www.paulgraham.com/start.html', 'title': 'How to Start a Startup', 'language': 'No language found.'}, page_content='How to Start a Startup\n\n\n\nWant to start a startup?  Get funded by\nY Combinator.\n\n\n\n\nMarch 2005(This essay is derived from a talk at the Harvard Computer\nSociety.)You need three things to create a successful startup: to start with\ngood people, to make something customers actually want, and to spend\nas little money as possible.  Most startups that fail do it because\nthey fail at one of these.  A startup that does all three will\nprobably succeed.And that\'s kind of exciting, when you think about it, because all\nthree are doable.  Hard, but doable.  And since a startup that\nsucceeds ordinarily makes its founders rich, that implies getting\nrich is doable too.  Hard, but doable.If there is one message I\'d like to get across about startups,\nthat\'s it.  There is no magically difficult step that requires\nbrilliance to so

In [None]:
len(docs[0].page_content)



54553

### Split the documents

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000 , chunk_overlap=200 , add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
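
To make the `chunk_size`, `chunk_overlap` and `add_start_index` settings more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: `chunk_with_overlap` is a hypothetical helper, not part of LangChain, and `RecursiveCharacterTextSplitter` is more sophisticated because it prefers to split on separators such as paragraph and line breaks rather than at fixed offsets.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs for fixed-size windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input: 25 characters, windows of 10 characters sharing 4 characters.
sample = "".join(str(i % 10) for i in range(25))
for start, chunk in chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. The start offset plays the same role as the `start_index` metadata field shown in the notebook.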

In [None]:
len(all_splits)




160

In [None]:
all_splits[159].metadata


{'source': 'https://www.paulgraham.com/founders.html',
 'title': 'What We Look for in Founders',
 'language': 'No language found.',
 'start_index': 4158}

### Create embeddings and store the docs.

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings(api_key ="enter your openai api key"))
retriever = vectorstore.as_retriever()





The RAG Fusion pipeline: multiple query generation -> retrieval -> reranking -> generation.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


# Prompt for generating multiple search queries from a single input question.
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""

# TODO: experiment with the wording of this prompt template.
prompt_rag_fusion = ChatPromptTemplate.from_template(template)

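# Query generation: prompt -> LLM -> plain string -> list of lines.
# Note: splitting on "\n" keeps any blank lines in the model output as empty strings.
generate_queries = (
    prompt_rag_fusion
    | ChatOpenAI(temperature=0, openai_api_key="enter your open ai api key")
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)

# retriever.map() runs the retriever once per generated line and returns one list of documents per line.
retrieval_chain_rag_fusion = generate_queries | retriever.map()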
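
Because the model's output is split on raw newlines, blank lines and list markers such as `1.` can end up being sent to the retriever as queries. A minimal sketch of a stricter parser follows. `parse_queries` and its `max_queries` parameter are hypothetical names introduced here for illustration, and the assumed output format (one query per line, optionally numbered or bulleted) is a guess about what the model returns.

```python
import re


def parse_queries(raw: str, max_queries: int = 4) -> list[str]:
    """Split raw LLM output into clean, de-duplicated query strings."""
    queries = []
    for line in raw.splitlines():
        # Drop a leading list marker such as "1.", "2)", "-", "*" or "•".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])(?:\s+|$)", "", line).strip()
        # Skip blank lines and exact repeats.
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[:max_queries]


raw_output = "1. traits of founders\n\n2. qualities of entrepreneurs\n- traits of founders\n"
print(parse_queries(raw_output))
```

To try it in the chain, `(lambda x: x.split("\n"))` could be swapped for `parse_queries`. Doing so changes how many result lists reach the reranking step, so the fused scores would differ from the outputs recorded in this notebook.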


In [None]:
# original question
original_question = 'traits of a good entrepreneur?'
# retrieve documents: `results` holds one list of documents per generated query line
results = retrieval_chain_rag_fusion.invoke({"question": original_question})

print(results)

# number of documents retrieved for the first generated query
print(len(results[0]))

[[Document(metadata={'language': 'No language found.', 'source': 'https://www.paulgraham.com/founders.html', 'start_index': 104, 'title': 'What We Look for in Founders'}, page_content="(I wrote this for Forbes, who asked me to write something\nabout the qualities we look for in founders.  In print they had to cut\nthe last item because they didn't have room.)1. DeterminationThis has turned out to be the most important quality in startup\nfounders.  We thought when we started Y Combinator that the most\nimportant quality would be intelligence.  That's the myth in the\nValley. And certainly you don't want founders to be stupid.  But\nas long as you're over a certain threshold of intelligence, what\nmatters most is determination.  You're going to hit a lot of\nobstacles.  You can't be the sort of person who gets demoralized\neasily.Bill Clerico and Rich Aberman of WePay \nare a good example.  They're\ndoing a finance startup, which means endless negotiations with big,\nbureaucratic compan

In [None]:
# Count the unique documents retrieved across all generated queries.
lst = []
for ddxs in results:
  for ddx in ddxs:
    if ddx.page_content not in lst:
      lst.append(ddx.page_content)
print(len(lst))
print(lst)

6
["(I wrote this for Forbes, who asked me to write something\nabout the qualities we look for in founders.  In print they had to cut\nthe last item because they didn't have room.)1. DeterminationThis has turned out to be the most important quality in startup\nfounders.  We thought when we started Y Combinator that the most\nimportant quality would be intelligence.  That's the myth in the\nValley. And certainly you don't want founders to be stupid.  But\nas long as you're over a certain threshold of intelligence, what\nmatters most is determination.  You're going to hit a lot of\nobstacles.  You can't be the sort of person who gets demoralized\neasily.Bill Clerico and Rich Aberman of WePay \nare a good example.  They're\ndoing a finance startup, which means endless negotiations with big,\nbureaucratic companies.  When you're starting a startup that depends\non deals with big companies to exist, it often feels like they're\ntrying to ignore you out of existence.  But when Bill Clerico s

In [None]:
from langchain.load import dumps, loads


fused_scores = {}
k = 60  # RRF smoothing constant
# The loop variable is named ranked_docs so it does not overwrite the `docs` list loaded earlier.
for ranked_docs in results:
  for rank, doc in enumerate(ranked_docs):
    # Serialize the document so it can be used as a dictionary key.
    doc_str = dumps(doc)
    # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0.
    if doc_str not in fused_scores:
      fused_scores[doc_str] = 0
    # Add this list's contribution using the RRF formula: 1 / (rank + k), with rank starting at 0.
    fused_scores[doc_str] += 1 / (rank + k)

# Final reranked result: (document, fused score) pairs, highest score first.
reranked_results = [
    (loads(doc), score)
    for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
]
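
To see how reciprocal rank fusion behaves, here is a small self-contained sketch on toy data. It uses plain string IDs in place of serialized documents, and `reciprocal_rank_fusion` is a hypothetical helper name. It follows the cell above in using 0-based ranks from `enumerate`; RRF is also commonly written with 1-based ranks, which shifts every score slightly but usually leaves the ordering unchanged.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists into one list of (doc_id, score), best first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):  # rank is 0-based here
            scores[doc_id] += 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three toy rankings, e.g. one per generated query.
toy_results = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(toy_results)
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Document `b` comes out on top because it is ranked first in two lists and second in the third, while `d` appears only once and lands last. A document that shows up in many lists accumulates score even if it is never ranked first, which is why fusing results from several query variants tends to favor consistently relevant chunks.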

In [None]:
for x in reranked_results:
  print(x[0].page_content, x[1])
  #print('\n')

(I wrote this for Forbes, who asked me to write something
about the qualities we look for in founders.  In print they had to cut
the last item because they didn't have room.)1. DeterminationThis has turned out to be the most important quality in startup
founders.  We thought when we started Y Combinator that the most
important quality would be intelligence.  That's the myth in the
Valley. And certainly you don't want founders to be stupid.  But
as long as you're over a certain threshold of intelligence, what
matters most is determination.  You're going to hit a lot of
obstacles.  You can't be the sort of person who gets demoralized
easily.Bill Clerico and Rich Aberman of WePay 
are a good example.  They're
doing a finance startup, which means endless negotiations with big,
bureaucratic companies.  When you're starting a startup that depends
on deals with big companies to exist, it often feels like they're
trying to ignore you out of existence.  But when Bill Clerico starts 0.1475674246

In [None]:
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

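prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(model="gpt-4o-mini", api_key="enter your openai api key.")
final_rag_chain = (prompt
    | llm
    | StrOutputParser()
)

# Note: reranked_results is a list of (Document, score) tuples; it is inserted
# into the prompt as its string representation, scores and metadata included.
final_rag_chain.invoke({"context": reranked_results, "question": original_question})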

'Based on the provided context, the traits of a good entrepreneur include:\n\n1. **Determination**: This is highlighted as the most important quality. Entrepreneurs must be resilient and able to persevere through obstacles without getting demoralized.\n\n2. **Imagination**: The ability to think creatively and find ways to "hack" systems to their advantage is crucial for success.\n\n3. **Strong Relationships**: Successful entrepreneurs often work in teams, and having a strong, friendly relationship with co-founders is important for collaboration and support during challenging times.\n\n4. **Commitment**: Fully dedicating oneself to the startup, which often means quitting day jobs, is essential to avoid the common pitfall of half-hearted efforts.\n\n5. **Location Awareness**: Being in a conducive environment for startups (like Silicon Valley) can significantly impact success.\n\n6. **Focus on Customer Needs**: Successful entrepreneurs understand the importance of making something that cu
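
The final chain receives the reranked list of (document, score) tuples as-is, so the prompt contains scores and metadata alongside the text. One possible refinement, sketched below, is to pass only the text of the top-ranked chunks. `Doc` is a hypothetical stand-in for a LangChain `Document`, and `format_context`, `top_n` and `separator` are names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document; only page_content is used."""
    page_content: str


def format_context(scored_docs, top_n: int = 3, separator: str = "\n\n---\n\n") -> str:
    """Join the text of the top_n highest-ranked documents, dropping the scores."""
    return separator.join(doc.page_content for doc, _score in scored_docs[:top_n])


# Toy reranked results, already sorted by score in descending order.
toy_reranked = [
    (Doc("Determination matters most."), 0.05),
    (Doc("Imagination helps."), 0.03),
    (Doc("Friendship between founders."), 0.02),
    (Doc("Unrelated chunk."), 0.01),
]
context = format_context(toy_reranked, top_n=2)
print(context)
```

With the notebook's objects, this would correspond to passing `format_context(reranked_results)` as the `context` value. Whether a shorter, cleaner context improves the answers on this dataset is something to check in the evaluation step.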

## Next Steps
Evaluate and compare the responses generated by RAG Fusion and simple RAG over this dataset.