## Data Ingestion for Deep RAG

In this notebook, we load the extracted data into a Qdrant vector database:

- **Markdown**: Page-level chunks with metadata
- **Tables**: Separate documents with context and page numbers
- **Images**: Text descriptions embedded (generated in notebook 06-01b)
- **Hybrid Search**: Dense (semantic) + Sparse (keyword) embeddings

**Prerequisites:**
- Run notebook 06-01 first to extract PDFs
- Run notebook 06-01b to generate image descriptions
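- A reachable Qdrant server: either a local instance on localhost:6333 or a remote instance whose URL and API key are stored as `QDRANT_URL` and `QDRANT_API_KEY`
- Google API key set in .env file (needed only if you switch to the Gemini embedding model)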

**Output:**
- Single Qdrant collection with all content types
- Rich metadata for filtering (company, year, quarter, doc_type, page)
- Deduplication using file hashes

**Make sure your Qdrant vector database (local Docker container or remote instance) is running**

https://qdrant.tech/

| Feature          | Qdrant     | Chroma           | FAISS     | Weaviate     | Milvus | Pinecone |
| ---------------- | ---------- | ---------------- | --------- | ------------ | ------ | -------- |
| Open Source      | ✅ Yes      | ✅ Yes            | ✅ Yes     | ⚠️ Open-core | ✅ Yes  | ❌ No     |
| DB vs Library    | DB         | DB (dev-focused) | Library   | DB           | DB     | Managed  |
| Hybrid Search    | ✅ Native   | ❌                | ❌         | ✅            | ⚠️     | ✅        |
| Metadata Filter  | ✅ Strong   | ⚠️ Basic         | ❌         | ✅            | ✅      | ✅        |
| Production Ready | ✅ Yes      | ❌ (POC)          | ❌         | ✅            | ✅      | ✅        |
| Local / Offline  | ✅ Yes      | ✅ Yes            | ⚠️        | ⚠️           | ⚠️     | ❌        |


### 0. Qdrant API Setup

In [None]:
!pip install qdrant_client

In [None]:
!pip install -U langchain langchain-community sentence-transformers qdrant-client

In [None]:
!pip install -qU langchain-google-genai langchain-qdrant fastembed fastembed-gpu

In [4]:
import os

from dotenv import load_dotenv
from google.colab import userdata
from qdrant_client import QdrantClient

# Load variables from a local .env file, if one exists
load_dotenv()

# Connect to the remote Qdrant instance using credentials stored in Colab secrets
qdrant_client = QdrantClient(
    url=userdata.get('QDRANT_URL'),
    api_key=userdata.get('QDRANT_API_KEY'),
)

print(qdrant_client.get_collections())

collections=[CollectionDescription(name='financial_docs')]


In [5]:
# qdrant_client = QdrantClient(
#     url="http://localhost:6333"
# )

# print(qdrant_client.get_collections())
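
The two cells above connect either to a remote instance or to localhost. One way to avoid commenting cells in and out is to resolve the connection settings from environment variables. The following is a minimal sketch, not part of the original notebook: `resolve_qdrant_config` is a hypothetical helper, and it assumes `QDRANT_URL` and `QDRANT_API_KEY` are exported under the same names used for the Colab secrets.

```python
import os


def resolve_qdrant_config(env=None):
    """Return keyword arguments for QdrantClient (hypothetical helper).

    Falls back to the local Docker address when QDRANT_URL is not set.
    """
    env = os.environ if env is None else env
    config = {"url": env.get("QDRANT_URL") or "http://localhost:6333"}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        # Only pass an API key when one is configured.
        config["api_key"] = api_key
    return config


# With no variables set, the local instance is used.
print(resolve_qdrant_config({}))
```

The result could then be passed along as `QdrantClient(**resolve_qdrant_config())`.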

### 1. Setup and Imports

In [6]:
import hashlib
from pathlib import Path

from langchain_google_genai import GoogleGenerativeAIEmbeddings

from langchain_qdrant import QdrantVectorStore, RetrievalMode, FastEmbedSparse

from langchain_core.documents import Document
from qdrant_client import QdrantClient

### 2. Configuration

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
# Paths
MARKDOWN_DIR = "/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown"
TABLES_DIR = "/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/tables"
IMAGES_DESC_DIR = "/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/images_desc"

# Qdrant Configuration
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = "models/gemini-embedding-001"

### 3. Initialize Embeddings and Client

In [9]:
from google.colab import userdata

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/e5-large-v2")

texts = ["query: What is dense embedding?", "passage: Dense embeddings represent semantic meaning"]
embeddings = model.encode(texts, normalize_embeddings=True)

In [11]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    encode_kwargs={"normalize_embeddings": True}
)

In [None]:
# Embeddings
# embeddings = GoogleGenerativeAIEmbeddings(model=EMBEDDING_MODEL, api_key=userdata.get('GOOGLE_API_KEY'))
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

In [13]:
result = embeddings.embed_query('anything')
len(result)

1024

In [14]:
result = sparse_embeddings.embed_query('hi hello')
result

SparseVector(indices=[948991206, 613153351], values=[1.0, 1.0])

In [15]:
result = sparse_embeddings.embed_documents(['hi', 'hello'])
result

[SparseVector(indices=[948991206], values=[1.6877434821696136]),
 SparseVector(indices=[613153351], values=[1.6877434821696136])]
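
The sparse vectors above store only the non-zero token weights as parallel `indices` and `values` lists. To build intuition for how a query matches a document, here is a small sketch that scores them with a dot product over shared indices. That scoring rule is an assumption made for illustration, and `sparse_dot` is a hypothetical helper, not a library function. The numbers are copied from the outputs above.

```python
def sparse_dot(query, doc):
    """Dot product of two sparse vectors given as (indices, values) pairs."""
    doc_weights = dict(zip(*doc))
    return sum(value * doc_weights.get(index, 0.0) for index, value in zip(*query))


query = ([948991206, 613153351], [1.0, 1.0])   # 'hi hello'
doc_hi = ([948991206], [1.6877434821696136])    # 'hi'
doc_other = ([123], [2.0])                      # made-up document with no shared token

print(sparse_dot(query, doc_hi))     # only the shared index contributes
print(sparse_dot(query, doc_other))  # no shared index, so the score is zero
```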

### 4. Create or Recreate Collection

In [16]:
COLLECTION_NAME

'financial_docs'

In [17]:
# Create the vector store on the remote Qdrant instance
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    embedding=embeddings,
    sparse_embedding=sparse_embeddings,
    url=userdata.get("QDRANT_URL"),
    api_key = userdata.get("QDRANT_API_KEY"),
    collection_name = COLLECTION_NAME,
    retrieval_mode=RetrievalMode.HYBRID,
    force_recreate=False
)
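
With `RetrievalMode.HYBRID`, dense and sparse results have to be merged into a single ranking. A common way to do this is Reciprocal Rank Fusion (RRF). The sketch below shows the idea in plain Python; whether your installed versions fuse results exactly this way is an assumption worth checking in the library documentation. The constant `k=60` is the conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]    # hypothetical dense (semantic) ranking
sparse_hits = ["c", "a", "d"]   # hypothetical sparse (keyword) ranking

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that rank well in both lists ("a" and "c" here) rise above those found by only one retriever.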

In [18]:
# Create the vector store on a local Qdrant instance
# vector_store = QdrantVectorStore.from_documents(
#     documents=[],
#     embedding=embeddings,
#     sparse_embedding=sparse_embeddings,
#     url="http://localhost:6333",
#     collection_name = COLLECTION_NAME,
#     retrieval_mode=RetrievalMode.HYBRID,
#     force_recreate=False
# )

In [42]:
vector_store.client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='financial_docs')])

### 5. Helper Functions

In [20]:
def extract_metadata_from_filename(filename: str):
    """
    Extract metadata from filename.

    Expected format: CompanyName DocType [Quarter] Year.pdf
    Examples:
        - Amazon 10-Q Q1 2024.pdf
        - Microsoft 10-K 2023.pdf
    """

    filename = filename.replace('.pdf', '').replace('.md', '')
    parts = filename.split()

    return {
        'company_name': parts[0],
        'doc_type': parts[1],
        'fiscal_quarter': parts[2] if len(parts)==4 else None,
        'fiscal_year': parts[-1]
    }

extract_metadata_from_filename('apple 10-k 2023.md')

{'company_name': 'apple',
 'doc_type': '10-k',
 'fiscal_quarter': None,
 'fiscal_year': '2023'}
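
The split-based parser above relies on the company name being a single word. If a multi-word name ever appears, a regular expression is one way to keep the fields aligned. This is a hedged sketch: `parse_filename` is a hypothetical alternative, and it assumes the only document types are 10-K and 10-Q.

```python
import re

# Assumption: doc types are limited to 10-K / 10-Q, quarters to Q1-Q4.
FILENAME_PATTERN = re.compile(
    r"^(?P<company>.+?)\s+(?P<doc_type>10-[kq])"
    r"(?:\s+(?P<quarter>q[1-4]))?\s+(?P<year>\d{4})$",
    re.IGNORECASE,
)


def parse_filename(filename):
    """Return the metadata dict, or None if the name does not match."""
    stem = re.sub(r"\.(pdf|md)$", "", filename, flags=re.IGNORECASE).strip()
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        return None
    return {
        "company_name": match.group("company"),
        "doc_type": match.group("doc_type"),
        "fiscal_quarter": match.group("quarter"),
        "fiscal_year": match.group("year"),
    }


print(parse_filename("Amazon 10-Q Q1 2024.pdf"))
```

Returning `None` for unexpected names makes it easy to log and skip such files instead of storing misaligned metadata.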

In [21]:
def compute_file_hash(file_path: Path):
    """Return the SHA-256 hex digest of a file, reading it in 4 KB blocks."""

    sha256_hash = hashlib.sha256()

    with open(file_path, 'rb') as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)

    return sha256_hash.hexdigest()


In [22]:
compute_file_hash(Path(r'/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown/amazon/amazon 10-k 2023.md'))

'fc7817c1b8473b2619bedf24fd8a094d9dbd638ee28546bc9f779937efbfcd1a'

In [23]:
# import shutil
# from pathlib import Path

# # Define the source file path
# source_file_path = Path('/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown/amazon/amazon 10-k 2023.md')

# # Define the destination file path (new name in the same directory)
# destination_file_path = source_file_path.parent / 'amazon 10-k 2023_copy.md'

# try:
#     shutil.copyfile(source_file_path, destination_file_path)
#     print(f"File '{source_file_path.name}' copied to '{destination_file_path.name}' successfully.")
# except FileNotFoundError:
#     print(f"Error: Source file '{source_file_path}' not found.")
# except Exception as e:
#     print(f"An error occurred: {e}")


In [24]:
compute_file_hash(Path(r'/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown/amazon/amazon 10-k 2023_copy.md'))

'fc7817c1b8473b2619bedf24fd8a094d9dbd638ee28546bc9f779937efbfcd1a'
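
The original and its copy produce the same digest because the hash depends only on the file's bytes, not on its name. The self-contained sketch below reproduces that behaviour with temporary files (the file names and contents are made up) and also shows that a one-character change yields a different digest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=4096):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.md"
    copy = Path(tmp) / "report_copy.md"
    edited = Path(tmp) / "report_edited.md"
    original.write_bytes(b"# Revenue\n")
    copy.write_bytes(b"# Revenue\n")      # same bytes, different name
    edited.write_bytes(b"# Revenue \n")   # one extra space
    same_hash = sha256_of_file(original) == sha256_of_file(copy)
    changed_hash = sha256_of_file(original) != sha256_of_file(edited)

print(same_hash, changed_hash)  # True True
```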

In [43]:
# Fetch the first page (up to 1,000 points) of already ingested points
all_points = vector_store.client.scroll(
    collection_name=COLLECTION_NAME,
    limit=1_000,
    with_payload=True,
    offset=None
)

In [45]:
len(all_points[0])

1000

In [46]:
all_points[0][0].payload['metadata']['file_hash']

'0d930b115fda8e8a3560a3a0edfffcd448538bc8f7523f8e88f5ffc7d2927183'

In [47]:
def get_processed_hashes():
    """Scroll through the whole collection and return the set of stored file hashes."""

    processed_hashes = set()
    offset = None

    while True:
        points, offset = vector_store.client.scroll(
                            collection_name=COLLECTION_NAME,
                            limit=10_000,
                            with_payload=True,
                            offset=offset
                        )

        if not points:
            break

        processed_hashes.update(point.payload['metadata']['file_hash'] for point in points)

        if offset is None:
            break

    return processed_hashes
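
The function above pages through the collection until the returned offset is `None`. To see the pagination logic in isolation, here is a sketch that runs the same loop against a hypothetical in-memory stand-in. `FakeScrollClient` is not a Qdrant class, and it uses integer offsets purely for illustration (a real server returns a point ID). The sketch also skips points that carry no `file_hash` rather than raising a `KeyError`.

```python
from types import SimpleNamespace


class FakeScrollClient:
    """Hypothetical stand-in that mimics the (points, next_offset) return shape."""

    def __init__(self, hashes):
        self._points = [
            SimpleNamespace(payload={"metadata": {"file_hash": h}}) for h in hashes
        ]

    def scroll(self, collection_name, limit, with_payload=True, offset=None):
        start = offset or 0
        end = start + limit
        next_offset = end if end < len(self._points) else None
        return self._points[start:end], next_offset


def collect_hashes(client, collection_name, page_size):
    """Collect unique file hashes page by page until no offset is returned."""
    hashes, offset = set(), None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            with_payload=True,
            offset=offset,
        )
        for point in points:
            file_hash = (point.payload or {}).get("metadata", {}).get("file_hash")
            if file_hash:
                hashes.add(file_hash)
        if offset is None:
            break
    return hashes


client = FakeScrollClient(["a", "b", "a", "c", "d"])
print(sorted(collect_hashes(client, "financial_docs", page_size=2)))
```

Because every page of a multi-page document shares one file hash, the set is much smaller than the number of points.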

In [48]:
processed_hashes = get_processed_hashes()

In [49]:
len(processed_hashes)

1115

In [34]:
# extract the page number from the file path
import re

def extract_page_number(file_path: Path):
    pattern = r'page_(\d+)'
    match = re.search(pattern=pattern, string=file_path.stem)
    return int(match.group(1)) if match else None

In [35]:
file_path = Path(r'/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/images_desc/google/google 10-k 2023/page_28.md')
extract_page_number(file_path)

28
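
Table files appear to follow a `table_<n>_page_<m>` naming pattern (for example `table_23_page_20.md`), so the table index could be captured alongside the page number. This is an optional sketch: `parse_table_stem` and the `table_index` key are hypothetical and are not used elsewhere in this notebook.

```python
import re
from pathlib import Path


def parse_table_stem(file_path):
    """Return the table index and page number, or None if the name does not match."""
    match = re.search(r"table_(\d+)_page_(\d+)", Path(file_path).stem)
    if match is None:
        return None
    return {"table_index": int(match.group(1)), "page": int(match.group(2))}


print(parse_table_stem("tables/google/google 10-q q3 2024/table_23_page_20.md"))
```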

### 6. Ingestion Function

In [36]:
def ingest_file_in_db(file_path, processed_hashes):

    file_hash = compute_file_hash(file_path)
    if file_hash in processed_hashes:
        print(f"Following file has been already uploaded: {file_path}")
        return  # skip files whose content is already in the collection

    path_str = str(file_path)
    if 'markdown' in path_str:
        content_type = 'text'
        doc_name = file_path.name
    elif 'tables' in path_str:
        content_type = 'tables'
        doc_name = file_path.parent.name
    elif 'images_desc' in path_str:
        content_type = 'image'
        doc_name = file_path.parent.name
    else:
        content_type = 'unknown'
        doc_name = file_path.name

    content = file_path.read_text(encoding='utf-8')

    base_metadata = extract_metadata_from_filename(doc_name)

    base_metadata.update({
        'content_type': content_type,
        'file_hash': file_hash,
        'source_file': doc_name
    })

    if content_type == 'text':
        # Markdown: split on the page-break marker and create one Document per page
        pages = content.split('<!-- page break -->')
        documents = []
        for idx, page in enumerate(pages, start=1):
            metadata = base_metadata.copy()
            metadata.update({'page': idx})
            documents.append(Document(page_content=page, metadata=metadata))

        vector_store.add_documents(documents)

    else:
        # Tables and image descriptions: one Document per file, page number taken from the file name
        page_num = extract_page_number(file_path)
        metadata = base_metadata.copy()
        metadata.update({'page': page_num})
        documents = [Document(page_content=content, metadata=metadata)]

        vector_store.add_documents(documents)


    processed_hashes.add(file_hash)
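
The markdown branch above turns every segment between page-break markers into a document, including segments that contain only whitespace. If blank pages turn out to add noise, one option is to skip them while keeping the original page numbers. The sketch below is an assumption about a possible refinement, not what the function above does, and it uses plain dictionaries in place of `Document` so that it runs on its own.

```python
PAGE_BREAK = "<!-- page break -->"


def split_pages(content, base_metadata):
    """Split markdown into page records, skipping blank pages but keeping numbering."""
    records = []
    for page_number, page in enumerate(content.split(PAGE_BREAK), start=1):
        text = page.strip()
        if not text:
            continue  # blank page: nothing useful to embed
        records.append(
            {"page_content": text, "metadata": {**base_metadata, "page": page_number}}
        )
    return records


sample = "Intro" + PAGE_BREAK + "  \n" + PAGE_BREAK + "Revenue table"
for record in split_pages(sample, {"company_name": "amazon"}):
    print(record["metadata"]["page"], record["page_content"])
```

Each record has the same two fields that `Document` takes, so it could be converted with `Document(**record)`.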


In [37]:
file_path = Path(r'/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown/amazon/amazon 10-k 2023.md')
processed_hashes = get_processed_hashes()

ingest_file_in_db(file_path, processed_hashes)

Following file has been already uploaded: /content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown/amazon/amazon 10-k 2023.md


### 7. Ingest All Files

In [38]:
from tqdm import tqdm

base_path = Path('/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data')
all_md_files = list(base_path.rglob("*.md"))

for md_file in tqdm(all_md_files):
    ingest_file_in_db(md_file, processed_hashes)

  1%|          | 9/1117 [00:53<1:56:13,  6.29s/it]

Following file has been already uploaded: /content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown/amazon/amazon 10-k 2023.md


  1%|          | 13/1117 [01:30<2:52:36,  9.38s/it]

Following file has been already uploaded: /content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/markdown/amazon/amazon 10-k 2023_copy.md


 65%|██████▍   | 721/1117 [09:49<03:29,  1.89it/s]

Following file has been already uploaded: /content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/tables/google/google 10-q q3 2024/table_23_page_20.md


100%|██████████| 1117/1117 [13:27<00:00,  1.38it/s]
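
A run over more than a thousand files takes several minutes, so a single unreadable file or network error should ideally not abort the whole loop. One possible pattern is to catch errors per file and report them at the end. In this sketch, `ingest_all` and `flaky_ingest` are hypothetical names, and the stub stands in for the real ingestion function.

```python
def ingest_all(files, ingest_one):
    """Call ingest_one on each file; collect failures instead of stopping."""
    succeeded, failures = 0, []
    for file_path in files:
        try:
            ingest_one(file_path)
            succeeded += 1
        except Exception as exc:  # broad on purpose: record the error and continue
            failures.append((file_path, repr(exc)))
    return succeeded, failures


def flaky_ingest(path):
    """Stub that fails for one made-up file name."""
    if "broken" in path:
        raise ValueError("unreadable file")


succeeded, failures = ingest_all(["a.md", "broken.md", "c.md"], flaky_ingest)
print(succeeded, [path for path, _ in failures])  # 2 ['broken.md']
```

With the real function, the call might look like `ingest_all(all_md_files, lambda p: ingest_file_in_db(p, processed_hashes))`.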


### 8. Verify Ingestion

In [39]:
collection_info = vector_store.client.get_collection(COLLECTION_NAME)
collection_info



### 9. Test Search

In [40]:
query = "what is the tesla's revenue"
results = vector_store.similarity_search(query)
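
Because every point carries metadata such as `company_name` and `fiscal_year`, a search could be restricted to matching documents. The sketch below builds a filter in the dictionary form of Qdrant's filtering syntax. `build_metadata_filter` is a hypothetical helper, and the `metadata.` key prefix assumes the payload layout seen earlier (`payload['metadata']['file_hash']`).

```python
def build_metadata_filter(**conditions):
    """Build a Qdrant-style 'must' filter from exact-match metadata conditions."""
    must = [
        {"key": f"metadata.{field}", "match": {"value": value}}
        for field, value in conditions.items()
        if value is not None  # ignore fields that were not specified
    ]
    return {"must": must}


filter_dict = build_metadata_filter(company_name="amazon", fiscal_year="2023")
print(filter_dict)

# Assumption to verify against your installed versions: the dictionary can be
# converted and passed to the vector store, for example
#   from qdrant_client import models
#   vector_store.similarity_search(query, filter=models.Filter(**filter_dict))
```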

In [41]:
results

[Document(metadata={'company_name': 'meta', 'doc_type': '10-k', 'fiscal_quarter': None, 'fiscal_year': '2024', 'content_type': 'image', 'file_hash': '948e25d6da5427cb285071545d4ed7ef3bed2772bd9f93583609ab88b20c00b4', 'source_file': 'meta 10-k 2024', 'page': 64, '_id': '82839871-da50-439b-b35a-cc2ef1610d91', '_collection_name': 'financial_docs'}, page_content='**Summary of Key Facts and Numbers:**\n\nRevenue is calculated based on the geography where ad impressions are delivered, virtual goods are purchased, or consumer hardware products are shipped. Regions like Asia-Pacific and Rest of World monetize at lower rates. In 2024, revenue increased by 18% in United States & Canada, 26% in Europe, 22% in Asia-Pacific, and 31% in Rest of World, relative to 2023. Non-advertising revenue includes consumer hardware, WhatsApp Business Platform, Meta Verified subscriptions, and developer fees.\n\n---\n**Charts and Graphs Data Extraction:**\n\n**1. Revenue Worldwide (in $ millions)**\n*   **Metric: