# Overview

This is a self-practice project for building a hybrid Retrieval-Augmented Generation (RAG) pipeline, following the Haystack and OpenAI tutorials.

## Prepare Environment

In [1]:
# install dependencies
!pip install haystack-ai chroma-haystack

Collecting haystack-ai
  Downloading haystack_ai-2.2.1-py3-none-any.whl (345 kB)
Collecting chroma-haystack
  Downloading chroma_haystack-0.18.0-py3-none-any.whl (13 kB)
Collecting lazy-imports (from haystack-ai)
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting openai>=1.1.0 (from haystack-ai)
  Downloading openai-1.34.0-py3-none-any.whl (325 kB)
Collecting posthog (from haystack-ai)
  Downloading posthog-3.5.0-py2.py3-none-any.whl (41 kB)
Collecting chromadb>=0.5.0 (from chroma-haystack)
  Downloading chromadb-0.5.0-py3-none-any.whl (526 kB)

## Download Dataset

In [2]:
# Download "History of the Standard Oil Company" from Project Gutenberg
import urllib.request
urllib.request.urlretrieve("https://www.gutenberg.org/cache/epub/60692/pg60692.txt", "dataset.txt")

('dataset.txt', <http.client.HTTPMessage at 0x7a6c79a7ef50>)
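
Project Gutenberg plain-text files usually wrap the book in a license header and footer, delimited by `*** START OF ...` and `*** END OF ...` marker lines. Left in place, that text gets indexed alongside the book. The helper below is a minimal sketch of trimming it. `strip_gutenberg_boilerplate` is a hypothetical name, not part of Haystack, and the marker patterns are an assumption about the file's wording.

```python
import re

# Marker lines that Project Gutenberg plain-text files normally carry.
# The exact wording varies between files, so the patterns are kept loose.
START_RE = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers.

    Falls back to the unmodified text when either marker is missing.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    if not start or not end or end.start() <= start.end():
        return text
    return text[start.end():end.start()].strip()


# Tiny made-up sample that mimics the layout of a Gutenberg file
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Body of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer\n"
)
print(strip_gutenberg_boilerplate(sample))
```

To use it here, one option is to read `dataset.txt`, pass its contents through the helper, and write the result back before running the indexing pipeline.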

## Set Up OpenAI API Key

In [3]:
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key: ")

## Build Document Indexing Pipeline

In [4]:
from haystack import Pipeline
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.writers import DocumentWriter

# create document store
document_store = ChromaDocumentStore()

# add pipeline components
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", TextFileToDocument())
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component("splitter", DocumentSplitter())
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store))

# connect pipeline
indexing_pipeline.connect("converter.documents", "cleaner.documents")
indexing_pipeline.connect("cleaner.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# run the indexing pipeline on the dataset
indexing_pipeline.run(data={"sources": ["dataset.txt"]})

Calculating embeddings: 100%|██████████| 48/48 [00:20<00:00,  2.30it/s]


{'embedder': {'meta': {'model': 'text-embedding-ada-002',
   'usage': {'prompt_tokens': 392712, 'total_tokens': 392712}}},
 'writer': {'documents_written': 1512}}
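
The numbers in this output can be cross-checked with a little arithmetic. The sketch below is a plain-Python approximation of word-based splitting, not Haystack's own `DocumentSplitter` code. It assumes the splitter defaults are 200 words per chunk with no overlap, and that the embedder sends 32 documents per request. Both values are assumptions about the installed version and are worth confirming in the component documentation.

```python
import math


def split_by_words(text: str, split_length: int = 200, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `split_length` whitespace-separated words."""
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + split_length]))
        if i + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_by_words("one two three four five six seven", split_length=3)
print(chunks)  # ['one two three', 'four five six', 'seven']

# Figures reported by the indexing run
documents_written = 1512
prompt_tokens = 392712
batch_size = 32  # assumed embedder batch size

print(math.ceil(documents_written / batch_size))  # 48 embedding requests
print(round(prompt_tokens / documents_written))   # 260 tokens per chunk on average
```

Under those assumptions, 1512 chunks in batches of 32 gives 48 requests, which matches the `48/48` progress bar, and an average of about 260 tokens per chunk is plausible for chunks of roughly 200 words.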

## Build RAG Pipeline

In [5]:
# build RAG pipeline
from haystack.components.embedders import OpenAITextEmbedder
from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# template for prompt engineering
template = """Given these documents, answer the question.
              Documents:
              {% for doc in documents %}
                  {{ doc.content }}
              {% endfor %}
              Question: {{query}}
              Answer:"""

# add pipeline components
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())
rag_pipeline.add_component("retriever", ChromaEmbeddingRetriever(document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("llm", OpenAIGenerator())

# connect pipeline
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7a6c780faa70>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: ChromaEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
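
To make the data flow through these connections concrete, here is a toy, dependency-free sketch of the same three steps: score the stored chunks against a query embedding, keep the top matches, and paste them into the prompt. It is an illustration only. `cosine_similarity`, `retrieve_top_k` and `build_prompt` are hypothetical helpers, the 2-dimensional embeddings are made up, and Chroma's actual distance metric and index may differ from plain cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, indexed, top_k=2):
    """Return the contents of the `top_k` (content, embedding) pairs closest to the query."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [content for content, _ in ranked[:top_k]]


def build_prompt(query, contents):
    """Fill a prompt shaped like the notebook's template with the retrieved chunks."""
    docs = "\n".join(contents)
    return (
        "Given these documents, answer the question.\n"
        f"Documents:\n{docs}\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Made-up 2-dimensional embeddings; real embeddings have many more dimensions
indexed = [
    ("chunk A", [0.9, 0.2]),
    ("chunk B", [0.1, 1.0]),
    ("chunk C", [0.7, 0.7]),
]
toy_query = "toy question"
toy_query_embedding = [1.0, 0.1]

top = retrieve_top_k(toy_query_embedding, indexed, top_k=2)
print(top)  # ['chunk A', 'chunk C']
print(build_prompt(toy_query, top))
```

In the real pipeline, `text_embedder` produces the query embedding, `retriever` does the ranking inside Chroma, and `prompt_builder` renders the Jinja template before the prompt reaches `llm`.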

In [6]:
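def get_rag_reply(query):
    """Run the RAG pipeline on `query` and return the first generated reply."""
    # the query feeds both the text embedder (retrieval) and the prompt template
    result = rag_pipeline.run(data={"prompt_builder": {"query": query}, "text_embedder": {"text": query}})
    return result["llm"]["replies"][0]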

## Query And Get Reply

In [7]:
query = "Who is the founder of the standard oil company?"
reply = get_rag_reply(query)
print(reply)

John D. Rockefeller


In [8]:
query = "Why is standard oil company so successful?"
reply = get_rag_reply(query)
print(reply)

The Standard Oil Company is successful because it possesses qualities such as energy, intelligence, dauntlessness, ability, daring, and address. It has been strong in all great business qualities and has continuously adapted to new conditions as they arise. Additionally, it has secured special privileges and formed alliances with railroads to drive out rivals, control the output of oil, and regulate the price of oil. The company has also manipulated prices, stifled competition, and exercised power over prices with skill. It has a well-centralised authority, manages operations like partners in a business, collects valuable information, and ensures quick adaptability to new conditions. Furthermore, it has built a powerful trust that controls the oil industry almost absolutely and has expanded into various other interests such as railroads, shipping, and finance.
