## Code for Chapter 10 of the book LangChain for Life Science and Healthcare, by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NVOYHL7FBScEdej2SLbUJuzUgCJ62usO?usp=sharing)

## LlamaIndex Tutorial - Complete Guide

## Introduction to LlamaIndex

LlamaIndex is a powerful data framework designed to connect Large Language Models (LLMs) with external data sources. It provides tools for indexing, storing, and querying documents using vector embeddings, making it easy to build Retrieval-Augmented Generation (RAG) applications.

**Key Features:**
- **Document Loading**: Support for various file formats (PDF, text, etc.)
- **Vector Indexing**: Creates searchable embeddings from your documents
- **Query Engine**: Natural language querying of your data
- **Customizable**: Flexible LLM and prompt customization
- **Persistent Storage**: Save and load indexes for reuse

## 1. Installation and Setup

First, install the required packages. The outputs in this notebook were produced with `llama-index==0.12.10` and `openai==1.59.7`; uncomment the pinned install line to reproduce that environment.

In [None]:
#!pip install -qU llama-index==0.12.10 openai==1.59.7
!pip install -qU llama-index openai

In [None]:
!pip freeze | grep "llama\|openai"

llama-cloud==0.1.8
llama-index==0.12.10
llama-index-agent-openai==0.4.1
llama-index-cli==0.4.0
llama-index-core==0.12.10.post1
llama-index-embeddings-huggingface==0.5.0
llama-index-embeddings-openai==0.3.1
llama-index-indices-managed-llama-cloud==0.6.3
llama-index-llms-openai==0.3.13
llama-index-multi-modal-llms-openai==0.4.2
llama-index-program-openai==0.3.1
llama-index-question-gen-openai==0.3.0
llama-index-readers-file==0.4.3
llama-index-readers-llama-parse==0.4.0
llama-parse==0.5.19
openai==1.59.7


## 2. API Key Configuration

Setting up OpenAI API access for embeddings and LLM functionality.

**Important Notes:**
- Replace `userdata.get("LC4LS_OPENAI_API_KEY")` with your actual API key if not using Colab
- Keep your API key secure and never commit it to version control
- Alternative: Use `os.environ["OPENAI_API_KEY"] = "your-api-key-here"`

In [None]:
import os
import openai
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("LC4LS_OPENAI_API_KEY")
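Outside Colab, `google.colab` cannot be imported, so the cell above fails. The following is a minimal sketch of a lookup that prefers an existing environment variable and only falls back to Colab secrets when they are available. The helper name `get_openai_api_key` is hypothetical and is not part of LlamaIndex or the `openai` package.

```python
import os


def get_openai_api_key(secret_name: str = "LC4LS_OPENAI_API_KEY") -> str:
    """Return the OpenAI key from the environment, falling back to Colab secrets.

    Hypothetical helper for illustration only.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    try:
        # This module is only importable inside Google Colab
        from google.colab import userdata
    except ImportError as exc:
        raise RuntimeError(
            "Set the OPENAI_API_KEY environment variable before running the notebook."
        ) from exc
    key = userdata.get(secret_name)
    if not key:
        raise RuntimeError(f"Colab secret {secret_name!r} is empty.")
    return key
```

With this helper, the assignment becomes `os.environ["OPENAI_API_KEY"] = get_openai_api_key()`, and the key itself never appears in the notebook.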

## 3. Document Preparation

Creating a directory structure and downloading a sample PDF document.

**What's happening here:**
- We're downloading a research paper about watermarking protein generative models
- The headers help avoid blocking by the server
- The PDF is saved locally for processing

In [None]:
os.makedirs('./data', exist_ok=True)

In [None]:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'Referer': 'https://github.com/IvanReznikov/LangChain4LifeScience/blob/main/data/articles/2410.20354v4.pdf',
}

response = requests.get(
    'https://raw.githubusercontent.com/IvanReznikov/LangChain4LifeScience/refs/heads/main/data/articles/2410.20354v4.pdf',
    headers=headers,
    timeout=30,
)
response.raise_for_status()  # fail early if the download did not succeed

pdf_path = "./data/article.pdf"
with open(pdf_path, "wb") as f:
    f.write(response.content)
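A server can answer with an HTML error page instead of the file, and that page would be saved as `article.pdf` all the same. As a hedged sanity check, the sketch below tests whether the saved file starts with the `%PDF-` signature that a well-formed PDF begins with. The function name `looks_like_pdf` is an assumption made for this example.

```python
def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF signature bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

Calling `looks_like_pdf(pdf_path)` right after the download gives a quick signal before any parsing or embedding work starts.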

## 4. Document Loading with PDFReader

LlamaIndex provides specialized loaders for different file types.

**Key Points:**
- `PDFReader` extracts text content from PDF files
- Documents are converted into LlamaIndex's internal format
- Each page typically becomes a separate document object

In [None]:
from pathlib import Path

# PDFReader ships with the llama-index-readers-file package
from llama_index.readers.file import PDFReader

loader = PDFReader()
documents = loader.load_data(file=Path(pdf_path))


## 5. Creating Vector Index and Query Engine

This is the core of RAG - converting documents into searchable vectors.

**Technical Details:**
- **Vector Index**: Creates numerical representations (embeddings) of document chunks
- **text-embedding-3-large**: OpenAI's largest `text-embedding-3` model, which returns 3,072-dimensional vectors by default
- **Query Engine**: Handles similarity search and response generation
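To make the similarity search concrete, here is a toy sketch in plain Python. It assumes a word-count "embedding" and three made-up chunks (not quoted from the paper), and it ranks the chunks by cosine similarity to the query. This is not how LlamaIndex or OpenAI embeddings work internally: real embeddings are dense float vectors produced by a model. Only the ranking idea is the same.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up chunks for illustration only
chunks = [
    "watermarking embeds ownership information into generated protein structures",
    "the training set contains protein backbones from public databases",
    "diffusion models generate protein structures step by step",
]
top = retrieve("why use watermarking for protein structures", chunks, top_k=1)
print(top[0])
```

The query shares three words with the first chunk and fewer with the others, so the first chunk ranks highest. A query engine then passes the top-ranked chunks to the LLM as context.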

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

# Reload every file in ./data; this replaces the documents loaded with PDFReader
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
query_engine = index.as_query_engine()

**Expected Behavior:**
- The query engine retrieves the sections of the paper most relevant to protein watermarking
- The LLM generates an answer grounded in those sections
- The response should mention benefits such as intellectual property protection and model verification

In [None]:
response = query_engine.query("What are the benefits of watermarking protein generative models?")
print(response)

The benefits of watermarking protein generative models include copyright authentication, tracking of generated structures, protection against unauthorized use, and the ability to identify the rightful owner of the model or generated structures.


## 6. Saving and Loading Index

To avoid rebuilding the index every time, we can persist it to disk.

**Benefits of Persistence:**
- Saves time on subsequent runs
- Preserves expensive embedding computations
- Enables sharing indexes between sessions

In [None]:
index.storage_context.persist("_index")

### Loading a Saved Index

In [None]:
from llama_index.core import StorageContext, load_index_from_storage

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="_index")

# load index
new_index = load_index_from_storage(storage_context, embed_model=embed_model)

**Important Notes:**
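- Use the same embedding model when loading; queries embedded with a different model are not comparable to the stored vectors
- The `persist_dir` must match the directory the index was saved to
- The loaded index retrieves the same chunks as the original; the generated wording can still differ slightly between runs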

In [None]:
new_query_engine = new_index.as_query_engine()
response = new_query_engine.query("What are the benefits of watermarking protein generative models?")
print(response)

The benefits of watermarking protein generative models include copyright authentication, tracking of generated structures, protection against unauthorized use, and the ability to prove ownership of artificially generated structures.
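The save and load steps in this section can be combined into a "load if present, otherwise build" pattern. The sketch below is library-agnostic: `load_or_build` and its `load` and `build` callables are hypothetical names introduced here, and `build` is assumed to persist its result into `persist_dir`.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_dir: str,
    load: Callable[[str], T],
    build: Callable[[str], T],
) -> T:
    """Reuse a persisted index if the directory has files, otherwise build one.

    `load` and `build` are supplied by the caller; `build` is expected to
    write its result into `persist_dir` so the next call can load it.
    """
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)
```

In this notebook, `load` could wrap `StorageContext.from_defaults(persist_dir=...)` followed by `load_index_from_storage(...)`, and `build` could call `VectorStoreIndex.from_documents(...)` and then `index.storage_context.persist(persist_dir)`.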


## 7. Customizing LLMs

LlamaIndex allows you to customize the language model used for generating responses.

**Configuration Options:**
- **temperature=0**: Makes responses more deterministic and consistent
- **model="gpt-4o-mini"**: Selects the OpenAI model that generates the responses
- **chat_mode="context"**: Retrieves relevant chunks from the index for each message and passes them to the LLM together with the chat history

**Chat Engine vs Query Engine:**
- **Query Engine**: Stateless, each query is independent
- **Chat Engine**: Maintains conversation history and context

In [None]:
from llama_index.llms.openai import OpenAI

# The OpenAI wrapper takes the model identifier via `model`
llm = OpenAI(temperature=0, model="gpt-4o-mini")
chat_engine = index.as_chat_engine(chat_mode="context", llm=llm)

In [None]:
response = chat_engine.chat("What are the benefits of watermarking protein generative models?")
print(response)

Watermarking protein generative models offers several benefits, including:

1. **Copyright Authentication**: Watermarking allows for the authentication of protein structures generated by the model, helping to protect the intellectual property rights of the original creators.

2. **Tracking Generated Structures**: Watermarking enables the tracking of generated protein structures, which can be useful for auditing the use of protein generative models and identifying the original source of the structures.

3. **User-Specific Information Embedding**: Watermarking can embed user-specific information into protein structures, allowing for personalized identification and ownership verification.

4. **Negligible Impact on Structure Quality**: The watermarking framework has been designed to have a negligible impact on the original protein structure quality, ensuring that the integrity and accuracy of the generated structures are maintained.

5. **Robustness**: The watermarking method is robust ag
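The difference between a stateless query engine and a stateful chat engine, described earlier in this section, can be illustrated without any LLM call. The sketch below is a toy model: `build_query_prompt` and `ToyChatSession` are hypothetical names, and the prompt layout is an assumption, not LlamaIndex's actual message format.

```python
from dataclasses import dataclass, field


def build_query_prompt(question: str) -> str:
    """Stateless: the prompt depends only on the current question."""
    return f"user: {question}"


@dataclass
class ToyChatSession:
    """Stateful: each turn is appended to a history sent with the next message."""

    history: list[str] = field(default_factory=list)

    def build_prompt(self, message: str) -> str:
        self.history.append(f"user: {message}")
        return "\n".join(self.history)

    def record_reply(self, reply: str) -> None:
        self.history.append(f"assistant: {reply}")


session = ToyChatSession()
session.build_prompt("What is watermarking?")
session.record_reply("(model reply)")
second_prompt = session.build_prompt("Is it robust?")
print(second_prompt)
```

The second chat prompt still contains the first question and its reply, which is what lets a follow-up such as "Is it robust?" be understood. The stateless prompt for the same follow-up would contain only that one line.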

## 8. Custom Prompt Engineering

Fine-tune how the system responds by customizing prompts.

**Prompt Template Variables:**
- `{context_str}`: Replaced with relevant document chunks
- `{query_str}`: Replaced with the user's question
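To see what the two placeholders turn into, the sketch below fills a template of the same shape with plain `str.format`. This is a stand-in for the formatting LlamaIndex performs internally. The chunk texts are placeholders, and joining them with blank lines is an assumption made for illustration.

```python
template = (
    "We have provided context information below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given this information, please answer the question and each answer "
    "should start with 'According to the article': {query_str}\n"
)

# Placeholder chunk text; in the notebook this comes from the retriever
retrieved_chunks = ["<text of retrieved chunk 1>", "<text of retrieved chunk 2>"]

filled = template.format(
    context_str="\n\n".join(retrieved_chunks),
    query_str="What are the benefits of watermarking protein generative models?",
)
print(filled)
```

The printed text is what the LLM receives as its prompt: the retrieved context between the two separator lines, followed by the instruction and the question.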

**Expected Result:**
- Every response should now start with "According to the article"
- This ensures clear attribution to the source document

In [None]:
from llama_index.core import PromptTemplate

template = (
    "We have provided context information below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given this information, please answer the question and each answer should start with 'According to the article': {query_str}\n"
)
qa_template = PromptTemplate(template)

In [None]:
query_engine = index.as_query_engine(text_qa_template=qa_template)
response = query_engine.query("What are the benefits of watermarking protein generative models?")
print(response)

According to the article, the benefits of watermarking protein generative models include:
1. Providing copyright authentication for generated protein structures.
2. Allowing tracking of generated structures to prevent unauthorized use.
3. Ensuring user-specific information can be embedded in protein structures.
4. Preserving generation quality while learning to generate watermarked structures.
5. Exerting a negligible impact on the original protein structure quality.
6. Being robust under potential post-processing and adaptive attacks.
