## Code for Chapter 10 of the book LangChain for Life Science and Healthcare, by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j2VckKCbhNzp50nhjsUHyM4L1Oea-Gdn?usp=sharing)

## Haystack Tutorial - Building a RAG System

This tutorial demonstrates how to build a Retrieval-Augmented Generation (RAG) system with Haystack 2.x. We'll create one pipeline that converts PDF documents to text, splits them into chunks, embeds the chunks, and stores them in an in-memory document store, and a second pipeline that answers questions based on that content.

## What is Haystack?

Haystack is an open-source framework for building production-ready LLM applications. It provides modular components that can be combined into pipelines for tasks like document processing, retrieval, and generation. In this tutorial, we'll build a RAG system that can answer questions about scientific papers.

## Installing Required Dependencies

First, we need to install Haystack 2.x and its dependencies. The key packages we're installing are:
- `haystack-ai`: The main Haystack framework
- `datasets`: For data handling utilities
- `sentence-transformers`: For text embeddings
- `openai`: For OpenAI API integration
- `pypdf`: For PDF document processing

In [None]:
#!pip install haystack-ai "datasets>=2.6.1" "sentence-transformers>=3.0.0" openai pypdf
!pip install -q haystack-ai datasets sentence-transformers openai pypdf


In [None]:
!pip freeze | grep "haystack\|openai\|sentence-transformers\|datasets"

datasets==4.0.0
haystack-ai==2.15.2
haystack-experimental==0.12.0
openai==1.97.1
sentence-transformers==4.1.0
tensorflow-datasets==4.9.9
vega-datasets==0.9.0


## Setting Up OpenAI API Configuration

We need to configure the OpenAI API key for embedding generation and LLM queries. In Google Colab, we can securely store API keys using the userdata feature.

In [None]:
import os
import openai
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("LC4LS_OPENAI_API_KEY")

**Important**: Make sure you've added your OpenAI API key to Colab's secrets (🔑 icon in the left sidebar) before running this cell.
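
A missing or empty key usually surfaces only later, when the first embedding request fails. The sketch below shows one way to check the variable up front and to confirm it is set without printing it in full. `require_env` and `mask` are hypothetical helpers written for this illustration; they are not part of Haystack, Colab, or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to Colab secrets or export it first.")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]
```

After the configuration cell has run, `print(mask(require_env("OPENAI_API_KEY")))` would confirm the key is present while keeping it out of the notebook output.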


## Downloading Sample Document

We'll download a scientific paper to use as our knowledge base. This example uses a paper about protein generative models from a GitHub repository.

In [None]:
os.makedirs('./data', exist_ok=True)

In [None]:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'Referer': 'https://github.com/IvanReznikov/LangChain4LifeScience/blob/main/data/articles/2410.20354v4.pdf',
}

response = requests.get(
    'https://raw.githubusercontent.com/IvanReznikov/LangChain4LifeScience/refs/heads/main/data/articles/2410.20354v4.pdf',
    headers=headers,
    timeout=60,
)
# Fail early on an HTTP error instead of saving an error page as article.pdf
response.raise_for_status()

pdf_path = "./data/article.pdf"
with open(pdf_path, "wb") as f:
    f.write(response.content)
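
A download can succeed at the HTTP level and still save something other than a PDF, such as an HTML error page. As a minimal sanity check, the sketch below tests whether a file begins with the `%PDF-` signature that PDF files start with. `looks_like_pdf` is a hypothetical helper name introduced here, and the check does not validate the rest of the file.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(path) -> bool:
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE
```

Calling `looks_like_pdf(pdf_path)` right after the download gives a quick signal before any pipeline work starts.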

## Importing Haystack Components

Now we'll import all the necessary Haystack components for building our RAG pipeline. Each component serves a specific purpose in the document processing and retrieval workflow.

**Component Overview**:
- **DocumentWriter**: Stores processed documents in the document store
- **PyPDFToDocument/TextFileToDocument**: Convert different file formats to Haystack documents
- **DocumentSplitter**: Breaks documents into smaller chunks for better retrieval
- **DocumentCleaner**: Removes unwanted characters and normalizes text
- **FileTypeRouter**: Routes files to appropriate converters based on MIME type
- **DocumentJoiner**: Combines documents from multiple sources
- **OpenAIDocumentEmbedder**: Creates vector embeddings for documents
- **OpenAITextEmbedder**: Creates the vector embedding for a question at query time
- **InMemoryDocumentStore**: Stores documents and their embeddings in memory

In [None]:
from haystack.components.writers import DocumentWriter
from haystack.components.converters import PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Pipeline

## Building the Document Processing Pipeline

This section creates a comprehensive pipeline that handles the entire document processing workflow, from file input to storing embedded chunks in the document store.

In [None]:
document_store = InMemoryDocumentStore()
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
text_file_converter = TextFileToDocument()
pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()
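
To make the routing step concrete, here is a small sketch of classifying paths by MIME type with Python's standard `mimetypes` module. It illustrates the idea only and is not a copy of `FileTypeRouter`'s implementation. `group_by_mime_type` and the path `data/notes.txt` are hypothetical names used for the example.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path


def group_by_mime_type(paths, accepted):
    """Group paths by guessed MIME type; anything else goes to 'unclassified'."""
    groups = defaultdict(list)
    for path in paths:
        mime_type, _ = mimetypes.guess_type(str(path))
        key = mime_type if mime_type in accepted else "unclassified"
        groups[key].append(Path(path))
    return dict(groups)


groups = group_by_mime_type(
    ["data/article.pdf", "data/notes.txt", "data"],
    accepted={"text/plain", "application/pdf"},
)
for mime_type, paths in groups.items():
    print(mime_type, [str(p) for p in paths])
```

Note that the bare directory name `data` has no extension, so no MIME type can be guessed for it and it lands in the unclassified group. Routing by type works on individual files.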

**Key Configuration Choices**:
- **Split length (150 words)**: Balances context preservation with retrieval precision
- **Split overlap (50 words)**: Ensures important information isn't lost at chunk boundaries
- **text-embedding-3-large**: OpenAI's most capable embedding model for better semantic understanding

In [None]:
document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)
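
The interaction between split length and overlap is easier to see on a toy input. The following is a simplified sketch of word-based splitting with overlap, assuming each chunk starts `split_length - split_overlap` words after the previous one. `split_by_words` is a hypothetical function written for illustration; Haystack's `DocumentSplitter` may treat edge cases, such as the final chunk, differently.

```python
def split_by_words(text: str, split_length: int = 150, split_overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `split_length` words that share `split_overlap` words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # distance between chunk starts
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# A synthetic 400-word "document": w0 w1 ... w399
text = " ".join(f"w{i}" for i in range(400))
chunks = split_by_words(text)
print([len(chunk.split()) for chunk in chunks])  # [150, 150, 150, 100]
```

With these settings a new chunk starts every 100 words, so the last 50 words of one chunk are repeated as the first 50 words of the next.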

In [None]:
document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-large")
document_writer = DocumentWriter(document_store)

## Assembling the Pipeline

Now we'll create the pipeline and connect all components in the correct order:

In [None]:
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

**Pipeline Flow Explanation**:
1. **File Type Router** → Routes files to appropriate converters
2. **Converters** → Extract text from PDFs or text files
3. **Document Joiner** → Combines all documents into a single stream
4. **Document Cleaner** → Normalizes and cleans the text
5. **Document Splitter** → Creates overlapping chunks
6. **Document Embedder** → Generates vector embeddings
7. **Document Writer** → Stores everything in the document store

In [None]:
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("document_joiner", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7cc17c2147d0>
🚅 Components
  - file_type_router: FileTypeRouter
  - text_file_converter: TextFileToDocument
  - pypdf_converter: PyPDFToDocument
  - document_joiner: DocumentJoiner
  - document_cleaner: DocumentCleaner
  - document_splitter: DocumentSplitter
  - document_embedder: OpenAIDocumentEmbedder
  - document_writer: DocumentWriter
🛤️ Connections
  - file_type_router.text/plain -> text_file_converter.sources (List[Union[str, Path, ByteStream]])
  - file_type_router.application/pdf -> pypdf_converter.sources (List[Union[str, Path, ByteStream]])
  - text_file_converter.documents -> document_joiner.documents (List[Document])
  - pypdf_converter.documents -> document_joiner.documents (List[Document])
  - document_joiner.documents -> document_cleaner.documents (List[Document])
  - document_cleaner.documents -> document_splitter.documents (List[Document])
  - document_splitter.documents -> document_embedder.documents (List[Document
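
The connections above form a directed graph, and a component can only run once everything it depends on has produced output. As a rough sketch of that idea, the standard library's `graphlib` can derive one valid execution order from the same edges. This is an illustration of the graph structure, not a description of how Haystack schedules components internally.

```python
from graphlib import TopologicalSorter

# Edges copied from the connect() calls above: (sender, receiver).
edges = [
    ("file_type_router", "text_file_converter"),
    ("file_type_router", "pypdf_converter"),
    ("text_file_converter", "document_joiner"),
    ("pypdf_converter", "document_joiner"),
    ("document_joiner", "document_cleaner"),
    ("document_cleaner", "document_splitter"),
    ("document_splitter", "document_embedder"),
    ("document_embedder", "document_writer"),
]

# TopologicalSorter expects a mapping of node -> set of predecessors.
predecessors = {}
for sender, receiver in edges:
    predecessors.setdefault(sender, set())
    predecessors.setdefault(receiver, set()).add(sender)

run_order = list(TopologicalSorter(predecessors).static_order())
print(run_order)
```

The router always comes first and the writer last; the two converters sit on parallel branches, so their relative order is not fixed by the graph.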

## Processing the Documents

Let's run the preprocessing pipeline to process our downloaded PDF.

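**What Happens Here**:
- Every file under the `data` directory is collected and passed to the file type router
- The router identifies `article.pdf` as a PDF file
- The PDF converter extracts its text
- The text is cleaned and split into chunks
- An embedding is generated for each chunk
- The chunks and their embeddings are stored in the in-memory document store

**Expected Output**: The returned dictionary should include a `document_writer` entry whose `documents_written` count is the number of chunks stored. A result of `{'file_type_router': {'unclassified': [PosixPath('data')]}}` means the directory itself was passed instead of the files inside it, so nothing was indexed.

In [None]:
from pathlib import Path

# The router classifies individual files, so expand the directory into file paths.
sources = [path for path in Path("data").glob("**/*") if path.is_file()]
preprocessing_pipeline.run({"file_type_router": {"sources": sources}})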

## Building the Question-Answering Pipeline

Now we'll create a second pipeline for answering questions using the processed documents. This implements the retrieval and generation components of our RAG system.

**Component Choices**:
- **gpt-4o-mini**: Balanced performance and cost for question answering
- **ChatPromptBuilder**: Handles prompt templating with Jinja2 syntax
- **Template Design**: Clearly separates context from question for better LLM performance


In [None]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

llm = OpenAIChatGenerator(model="gpt-4o-mini")
template = [ChatMessage.from_user("""
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
""")]
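
To see roughly what the model receives, the sketch below rebuilds the prompt with plain string formatting instead of Jinja2. `render_prompt` is a hypothetical stand-in for the template above, the chunk texts are invented placeholders, and the exact whitespace of the real rendering may differ.

```python
def render_prompt(contents: list[str], question: str) -> str:
    """Approximate the text produced by the template (whitespace may differ)."""
    context = "\n".join(f"    {content}" for content in contents)
    return (
        "Answer the questions based on the given context.\n\n"
        "Context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk texts, invented for illustration only.
with_context = render_prompt(["<chunk 1 text>", "<chunk 2 text>"], "What is watermarking?")
without_context = render_prompt([], "What is watermarking?")
print(with_context)
```

When no documents are retrieved, the context section is empty, as in `without_context`, and the model can only fall back on its general knowledge. An answer produced that way is not grounded in the PDF.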

## Assembling the QA Pipeline

**QA Pipeline Flow**:
1. **Text Embedder** → Converts question to vector embedding
2. **Retriever** → Finds most relevant document chunks using similarity search
3. **Prompt Builder** → Creates formatted prompt with context and question
4. **LLM** → Generates answer based on retrieved context

In [None]:
pipe = Pipeline()
pipe.add_component("text_embedder", OpenAITextEmbedder(model="text-embedding-3-large"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
pipe.add_component("chat_prompt_builder", ChatPromptBuilder(template=template))
pipe.add_component("llm", llm)

pipe.connect("text_embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "chat_prompt_builder.documents")
pipe.connect("chat_prompt_builder.prompt", "llm.messages")



<haystack.core.pipeline.pipeline.Pipeline object at 0x7cc17be95190>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - chat_prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> chat_prompt_builder.documents (List[Document])
  - chat_prompt_builder.prompt -> llm.messages (List[ChatMessage])
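
The retrieval step boils down to scoring every stored chunk embedding against the question embedding and keeping the best matches. Here is a minimal sketch using cosine similarity on made-up three-dimensional vectors. Real embeddings have far more dimensions, the document store's actual scoring function may differ, and `cosine_similarity`, `top_k`, and the toy data are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in chunks.items()]
    return sorted(scored, reverse=True)[:k]


# Toy embeddings, invented for the example.
chunks = {
    "chunk about watermarking": [0.9, 0.1, 0.0],
    "chunk about licensing": [0.6, 0.8, 0.0],
    "chunk about folding": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

for score, text in top_k(query, chunks):
    print(f"{score:.3f}  {text}")
```

Only the texts of the top-scoring chunks are passed on to the prompt builder, which is why chunk size and embedding quality directly affect the answers.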

## Testing the RAG System

Let's test our complete RAG system with a question about the document content:

In [None]:
question = "What are the benefits of watermarking protein generative models?"
response = pipe.run({"text_embedder": {"text": question}, "chat_prompt_builder": {"question": question}})



**Expected Output**: The system should return an answer about the benefits of watermarking protein generative models, grounded in the chunks retrieved from the processed PDF. The wording varies between runs, but the answer may mention benefits such as:
- Intellectual property protection
- Model attribution
- Prevention of unauthorized use
- Traceability of generated proteins

In [None]:
print(response["llm"]["replies"][0].text)

The benefits of watermarking protein generative models include:

1. **Provenance Tracking**: Watermarking allows for the identification of the source of generated proteins, helping trace their development and ensuring proper attribution to creators.

2. **Ownership Protection**: It provides a means to protect intellectual property by distinguishing original work from unauthorized copies.

3. **Quality Control**: Watermarks can serve as indicators of the model's reliability and performance, enabling users to assess the credibility of the generated proteins.

4. **Detection of Misuse**: By embedding watermarks, developers can identify when their models are used without permission or inappropriately, allowing for accountability.

5. **Enhanced Collaboration**: Watermarked models can foster collaborative efforts by making it easier to credit contributions and share advancements while maintaining ownership rights.

6. **User Trust**: The presence of a watermark can increase user trust in th

## How the System Works

This RAG system demonstrates several key concepts:

1. **Document Processing**: Raw PDFs are converted into searchable, embedded chunks
2. **Semantic Search**: Questions are matched to relevant content using vector similarity
3. **Context-Aware Generation**: The LLM generates answers based on retrieved context
4. **Modular Architecture**: Each component can be modified or replaced independently

## Key Benefits of This Approach

- **Accuracy**: Answers are grounded in your specific documents
- **Transparency**: You can trace answers back to source chunks
- **Scalability**: Can handle multiple documents and file types
- **Flexibility**: Pipeline components can be easily modified or extended