# Chat With Multiple Documents using AstraDB & LangChain

This notebook demonstrates how to build a Retrieval‑Augmented Generation (RAG) workflow over documents. The cells below work with PDFs; the same pattern extends to TXT, PPTX, and DOC files with the matching loaders. It uses:

- LangChain for orchestration (loading, splitting, chaining)
- OpenAI embeddings (can be swapped for any embedding model)
- AstraDB (DataStax) as a managed vector store
- Unstructured & PyPDF loaders for robust document parsing

## What This Notebook Covers
1. Environment & dependency setup
2. Secure API key handling (placeholders only – add your own locally)
3. Loading single & multiple PDFs (Unstructured + PyPDF)
4. Chunking text for vectorization
5. Initializing embeddings & AstraDB vector store
6. Creating a retriever and RAG chain with a prompt template
7. Running a sample philosophical Q&A query

## How To Use
1. Install dependencies (first code cells)
2. Replace all placeholder strings "Write your own password" with your actual keys/tokens (NEVER commit real secrets)
3. Place your documents in the configured folder path
4. Run cells sequentially to build the vector store & query it

## Security & Open Source Notes
- No real credentials are stored here
- Use environment variables or secret managers in production
- You can swap OpenAI for other providers by changing the embedding & Chat model imports

---

In [1]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install Cython
!pip install tiktoken



**Explanation:** Installs the core libraries: LangChain (framework), Unstructured (document parsing), the OpenAI client (embeddings and chat), Cython, and tiktoken (tokenization). `pypdf` is installed in a later cell. Run once per environment.

In [2]:
!pip install --upgrade langchain-astradb

Collecting numpy<2.0.0,>=1.26.0 (from langchain-astradb)
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.6
    Uninstalling numpy-2.2.6:
      Successfully uninstalled numpy-2.2.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
Successfully installed numpy-1.26.4


**Explanation:** Upgrades the AstraDB integration package (`langchain-astradb`). As the log shows, this install downgrades `numpy` to 1.26.4, which `opencv-python` and `thinc` report as incompatible in this environment.

In [3]:
!pip install langchain langchain-openai datasets pypdf



**Explanation:** Installs core LangChain packages plus `datasets` (optional sample corpora) and `pypdf` for PDF parsing fallback.

In [4]:
!pip install pdf2image



**Explanation:** Installs `pdf2image` enabling image-based PDF page conversion (useful for scanned PDFs or future OCR integration).

In [5]:
!pip install pdfminer.six



**Explanation:** Installs `pdfminer.six` for text extraction from PDFs (alternative backend supporting layout-aware parsing).

In [6]:
!pip install unstructured[pdf]

Collecting numpy (from unstructured[pdf])
  Using cached numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Using cached numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-astradb 0.6.0 requires numpy<2.0.0,>=1.26.0, but you have numpy 2.2.6 which is incompatible.
Successfully installed numpy-2.2.6


**Explanation:** Installs Unstructured with PDF extras for advanced, element-wise document parsing (titles, narrative text, tables, etc.).
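The pip logs above show `numpy` moving between 1.26.4 and 2.2.6: `langchain-astradb 0.6.0` requires `numpy<2.0.0,>=1.26.0`, while the `unstructured[pdf]` install pulls in 2.2.6. A minimal sketch for checking which side of that constraint the environment ended on is given here. The helper names `parse_version`, `satisfies_astradb_pin`, and `installed_numpy_ok` are hypothetical, and the bounds are copied from the log, so they may differ for other package versions.

```python
from importlib import metadata


def parse_version(version):
    """Turn '1.26.4' into (1, 26, 4); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad '1.26' to (1, 26, 0)
        parts.append(0)
    return tuple(parts)


def satisfies_astradb_pin(version):
    """True if the version lies in the range >=1.26.0,<2.0.0 from the pip log."""
    return (1, 26, 0) <= parse_version(version) < (2, 0, 0)


def installed_numpy_ok():
    """Check the numpy that is actually installed; False if numpy is missing."""
    try:
        return satisfies_astradb_pin(metadata.version("numpy"))
    except metadata.PackageNotFoundError:
        return False
```

Calling `installed_numpy_ok()` after the install cells gives a quick signal before the vector store is created.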

In [7]:
import os
from getpass import getpass

from datasets import (
    load_dataset,
)
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

**Explanation:** Imports core libraries for dataset handling, PDF loading (Unstructured & PyPDF), document abstraction, prompt templating, runnable chaining, embeddings, and text splitting utilities used across the RAG pipeline.

In [None]:
import os
# Please provide your OpenAI API key here
OPENAI_API_KEY = "Write your own password"  # <-- Replace with your OpenAI API key
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

embedding = OpenAIEmbeddings()

**Explanation:** Sets the OpenAI API key placeholder (replace locally) and initializes the embedding model used later for vectorization of document chunks.
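As an alternative to editing the placeholder string in the cell, the key can be read from the environment and only prompted for when it is missing. This is a minimal sketch, not part of the original notebook; the helper name `read_secret` is hypothetical.

```python
import os
from getpass import getpass


def read_secret(name, prompt=getpass):
    """Return the secret from the environment, prompting only if it is unset."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value, so it does not end up in cell output
        value = prompt(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

With this in place, calling `read_secret("OPENAI_API_KEY")` before `OpenAIEmbeddings()` avoids writing the key into the notebook file.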

# Using Unstructured to Load Multiple PDFs

In [10]:
root_dir="/content/"

**Explanation:** Defines the root directory path (Colab-style). Adjust to your local/project path when running outside Colab.

In [16]:
# Despite its name, this variable points at a single PDF file here.
pdf_folder_path = os.path.join(root_dir, "Pakistan_Overview.pdf")

In [22]:
# location of the pdf file/files.
loader = UnstructuredPDFLoader(pdf_folder_path)
loaders = [loader]

**Explanation:** Creates an Unstructured PDF loader for a single PDF and stores it in a list for uniform batch processing.

In [23]:
loaders

[<langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7d0b20e26250>]

In [26]:
documents = []
for loader in loaders:
    documents.extend(loader.load())



**Explanation:** Iterates over each loader, extracts raw document elements, and aggregates them into a single document list.

In [27]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

**Explanation:** Uses a recursive character splitter to chunk documents into manageable pieces (1000 chars, no overlap) for efficient embedding & retrieval.
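To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); the function name `chunk_text` is hypothetical and only illustrates how the two numbers interact.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With overlap 0, as in the cell above, a sentence that straddles a boundary is cut in two; a small overlap keeps some shared text on both sides at the cost of storing more characters.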

# PyPDF Loader with Multiple PDFs

In [None]:
from langchain_astradb import AstraDBVectorStore
ASTRA_DB_API_ENDPOINT = "Write your own password"  # Your Astra DB API endpoint
ASTRA_DB_APPLICATION_TOKEN = "Write your own password"  # Your Astra DB application token
ASTRA_DB_KEYSPACE = "default_keyspace"  # Change if you use a different keyspace

In [None]:
from astrapy import DataAPIClient

# Initialize the client with the token and endpoint defined in the previous cell
client = DataAPIClient(ASTRA_DB_APPLICATION_TOKEN)
db = client.get_database_by_api_endpoint(ASTRA_DB_API_ENDPOINT)

print(f"Connected to Astra DB: {db.list_collection_names()}")

Connected to Astra DB: []


In [33]:
import os
root_dir="/content/"
pdf_folder_path = os.path.join(root_dir, "data")

# Create the directory if it doesn't exist
os.makedirs(pdf_folder_path, exist_ok=True)

pdfs=os.listdir(pdf_folder_path)

In [34]:
pdfs

[]
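The empty list above means the folder contained no files when the cell ran, so every later step works on empty input. A small sketch that keeps only `.pdf` files and fails early on an empty folder is shown here; the helper names `list_pdf_paths` and `require_pdfs` are hypothetical and not part of the original notebook.

```python
import os


def list_pdf_paths(folder):
    """Return sorted full paths of the .pdf files directly inside folder."""
    names = sorted(os.listdir(folder))
    return [
        os.path.join(folder, name)
        for name in names
        if name.lower().endswith(".pdf")  # skip non-PDF files in the folder
    ]


def require_pdfs(folder):
    """Like list_pdf_paths, but raise instead of returning an empty list."""
    paths = list_pdf_paths(folder)
    if not paths:
        raise FileNotFoundError(f"No PDF files found in {folder!r}")
    return paths
```

Each returned path can be passed directly to `PyPDFLoader`.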

In [35]:
data = []
for pdf in pdfs:
    loader = PyPDFLoader(os.path.join(pdf_folder_path, pdf))
    data.extend(loader.load())

In [36]:
data

[]

In [37]:
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

In [39]:
texts = splitter.split_documents(data)

In [44]:
texts

[]

In [46]:
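# Keep one loader per PDF; call .load() on a loader to read its pages.
docs = []
for pdf in pdfs:
    pdf_loader = PyPDFLoader(os.path.join(pdf_folder_path, pdf))  # separate name so the `data` list is not overwritten
    docs.append(pdf_loader)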

In [78]:
!pip install astrapy



In [None]:
from astrapy import DataAPIClient

# Initialize the client (second instance)
client = DataAPIClient("Write your own password")  # Token placeholder
db = client.get_database_by_api_endpoint(
  "Write your own password"  # API endpoint placeholder
)

print(f"Connected to Astra DB: {db.list_collection_names()}")

Connected to Astra DB: []


In [82]:
vstore = AstraDBVectorStore(
    embedding=embedding,
    collection_name="astra_vector_demo",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace=ASTRA_DB_KEYSPACE,
)

**Explanation:** Initializes an AstraDB-backed vector store collection to persist embeddings and enable semantic similarity search via the retriever interface.
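The cells in this notebook create the collection but do not show a step that writes chunks into it. LangChain vector stores expose an `add_documents` method for that. The sketch below wraps it in a batching helper; the name `index_in_batches` and the default batch size of 20 are assumptions for illustration, not values from the notebook.

```python
def index_in_batches(store, chunks, batch_size=20):
    """Send chunks to store.add_documents in slices; return how many were sent.

    `store` is any object with an add_documents(list) method, for example
    the vector store created above.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not chunks:
        # An empty list usually means the document folder had no files.
        raise ValueError("No chunks to index; check the document folder.")

    total = 0
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        store.add_documents(batch)
        total += len(batch)
    return total
```

In this notebook the call would look like `index_in_batches(vstore, texts)`, run after `texts` has been produced from a non-empty set of documents.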

In [83]:
retriever = vstore.as_retriever(search_kwargs={"k": 3})

**Explanation:** Converts the vector store into a retriever object retrieving top-k (k=3) semantically similar chunks per query.
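Conceptually, top-k retrieval ranks stored vectors by similarity to the query vector and keeps the best k. The sketch below does this in plain Python, assuming cosine similarity; the metric actually used depends on how the collection is configured, and a real vector store uses an index rather than a full scan. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries most similar to query_vec.

    `indexed` is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

If fewer than k entries are stored, the function returns whatever is available, including an empty list for an empty index.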

In [84]:
prompt_template = """
You are a philosopher that draws inspiration from great thinkers of the past
to craft well-thought answers to user questions. Use the provided context as the basis
for your answers and do not make up new reasoning paths - just mix-and-match what you are given.
Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.

CONTEXT:
{context}

QUESTION: {question}

YOUR ANSWER:"""

**Explanation:** Defines a domain-focused prompt template constraining responses to philosophical reasoning grounded strictly in retrieved context.

In [85]:
prompt_template = ChatPromptTemplate.from_template(prompt_template)

**Explanation:** Wraps the raw template with LangChain's `ChatPromptTemplate` enabling variable injection (`{context}`, `{question}`) during chain execution.
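The variable injection can be pictured with ordinary string formatting. This is a simplified sketch, not how `ChatPromptTemplate` is implemented (it produces chat messages rather than a single string); `TEMPLATE` is a shortened copy of the prompt above and `render_prompt` is a hypothetical helper.

```python
TEMPLATE = """CONTEXT:
{context}

QUESTION: {question}

YOUR ANSWER:"""


def render_prompt(template, context_chunks, question):
    """Join the retrieved chunks and fill both placeholders in the template."""
    context = "\n\n".join(context_chunks)
    return template.format(context=context, question=question)
```

Printing the rendered string for a test question is a quick way to see exactly what text the model receives, including the case where `context_chunks` is empty.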

In [87]:
llm = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

**Explanation:** Composes the RAG chain: maps query to retriever (context) + passthrough (question), feeds prompt -> LLM -> parses text output for a clean string response.
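The `|` composition can be illustrated with a toy version built from plain functions. This is a sketch of the idea only, not LangChain's implementation: `Step`, `fan_out`, `fake_retriever`, `fake_llm`, and `toy_chain` are all hypothetical stand-ins, and `fan_out` plays the role of the dictionary at the start of the chain above.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # (a | b).invoke(x) runs a first, then feeds its result to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: fn(value) for name, fn in branches.items()})


def fake_retriever(question):
    return ["chunk about " + question]


def fake_llm(prompt):
    return "ANSWER<" + prompt + ">"


toy_chain = (
    fan_out(context=fake_retriever, question=lambda q: q)  # same input, two keys
    | Step(lambda d: f"{d['context'][0]} | {d['question']}")  # stand-in for the prompt
    | Step(fake_llm)  # stand-in for the model and output parser
)
```

Calling `toy_chain.invoke("virtue")` passes the same question to both branches, which mirrors how the real chain sends the query to the retriever and, unchanged, to the `question` slot.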

In [89]:
chain.invoke("How does Russel elaborate on Peirce's idea of the security blanket?")

"I'm sorry, I am a philosopher and my expertise lies in discussing philosophical topics. If you have any questions related to philosophy, I would be happy to help."

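**Explanation:** Executes the end-to-end RAG pipeline for a sample philosophical query. In the run shown above the model declines to answer, most likely because the `data` folder was empty and no chunks had been added to the `astra_vector_demo` collection, so the retriever had no context to return. Once documents are loaded and indexed, the same call returns an answer grounded in the retrieved chunks.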

# About the Author

<div style="background-color: #f8f9fa; border-left: 5px solid #28a745; padding: 20px; margin-bottom: 20px; border-radius: 5px;">
  <h2 style="color: #28a745; margin-top: 0; font-family: 'Poppins', sans-serif;">Muhammad Atif Latif</h2>
  <p style="font-size: 16px; color: #495057;">Data Scientist & Machine Learning Engineer</p>
  
  <p style="font-size: 15px; color: #6c757d; margin-top: 15px;">
    Passionate about building AI solutions that solve real-world problems. Specialized in machine learning,
    deep learning, and data analytics with experience implementing production-ready models.
  </p>
</div>

## Connect With Me

<div style="display: flex; flex-wrap: wrap; gap: 10px; margin-top: 15px;">
  <a href="https://github.com/m-Atif-Latif" target="_blank">
    <img src="https://img.shields.io/badge/GitHub-Follow-212121?style=for-the-badge&logo=github" alt="GitHub">
  </a>
  <a href="https://www.kaggle.com/matiflatif" target="_blank">
    <img src="https://img.shields.io/badge/Kaggle-Profile-20BEFF?style=for-the-badge&logo=kaggle" alt="Kaggle">
  </a>
  <a href="https://www.linkedin.com/in/muhammad-atif-latif-13a171318" target="_blank">
    <img src="https://img.shields.io/badge/LinkedIn-Connect-0077B5?style=for-the-badge&logo=linkedin" alt="LinkedIn">
  </a>
  <a href="https://x.com/mianatif5867" target="_blank">
    <img src="https://img.shields.io/badge/Twitter-Follow-1DA1F2?style=for-the-badge&logo=twitter" alt="Twitter">
  </a>
  <a href="https://www.instagram.com/its_atif_ai/" target="_blank">
    <img src="https://img.shields.io/badge/Instagram-Follow-E4405F?style=for-the-badge&logo=instagram" alt="Instagram">
  </a>
  <a href="mailto:muhammadatiflatif67@gmail.com">
    <img src="https://img.shields.io/badge/Email-Contact-D14836?style=for-the-badge&logo=gmail" alt="Email">
  </a>
</div>

---

# Project Summary & Open Source Promotion

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline over documents. It ingests PDFs (the approach extends to DOC, TXT, and PPTX), parses them with the Unstructured and PyPDF loaders, chunks them with a recursive splitter, embeds the chunks with OpenAI (pluggable), and stores the vectors in AstraDB for semantic retrieval. A LangChain retrieval chain then combines the relevant chunks with a concise, instruction‑driven prompt to answer user questions grounded in the provided context.

## Pipeline Flow
1. Dependency setup & imports
2. Secure placeholder credentials (replace locally – never commit secrets)
3. Single & multi‑PDF ingestion examples
4. Text chunking (size & overlap tunable)
5. Embedding generation
6. AstraDB vector store creation & retriever
7. Prompt engineering for focused Q&A
8. Execution of a sample query

## Customization Ideas
- Swap embeddings (e.g., Sentence Transformers, Mistral, Cohere)
- Add metadata filters (dates, doc source)
- Implement caching & batching for large corpora
- Add conversation memory for follow‑up questions

## Why It Matters
RAG can reduce hallucinations by grounding answers in retrieved text, keeps answers current as the indexed documents are updated, and adapts a model to a domain without full fine‑tuning.

---