# STEP-1
This code extracts text and metadata from a PDF file using the `unstructured` library.

1. **PDF Partitioning**: It uses `partition_pdf` to read and parse the provided PDF file (`"Physics 9.pdf"`).
2. **Extract Elements**: It extracts the elements from the PDF into a list of objects, each representing a distinct part of the PDF (such as text or images).
3. **Chunk Creation**: For each element, a dictionary is created containing:
   - `type`: The class name of the element (e.g., `Title`, `NarrativeText`).
   - `text`: The actual text content of the element.
   - `page_number`: The page number where the element is located (if available).
4. **Output**: Finally, it prints the first 100 chunks from the list to verify the extracted data.
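
The chunk-building step can be sketched without the PDF itself. The sketch below uses hypothetical stand-in classes (`FakeMetadata`, `Title`, `NarrativeText`) in place of real `unstructured` elements, and assumes only that each element exposes `.text` and a `.metadata` object that may carry a `page_number`. The helper name `to_chunk` is also an illustration, not part of the library.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins for unstructured elements (not the library's classes)
@dataclass
class FakeMetadata:
    page_number: Optional[int] = None

@dataclass
class Title:
    text: str
    metadata: FakeMetadata = field(default_factory=FakeMetadata)

@dataclass
class NarrativeText:
    text: str
    metadata: FakeMetadata = field(default_factory=FakeMetadata)

def to_chunk(element):
    """Build the type/text/page_number dictionary for one element."""
    return {
        "type": element.__class__.__name__,
        "text": element.text,
        # Fall back to None when no page number is available
        "page_number": getattr(element.metadata, "page_number", None),
    }

demo_elements = [
    Title("Chapter 1", FakeMetadata(page_number=1)),
    NarrativeText("Force is a push or a pull."),  # no page number recorded
]
demo_chunks = [to_chunk(e) for e in demo_elements]
print(demo_chunks[0])
```

Reading the page number from the element's metadata avoids having to guess page boundaries from the text.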

In [10]:
from unstructured.partition.pdf import partition_pdf

# Partition the PDF into elements (titles, narrative text, list items, ...)
elements = partition_pdf(filename="Physics 9.pdf")

# Build one dictionary per element
chunks = []

for element in elements:
    chunk_data = {
        'type': element.__class__.__name__,
        'text': element.text,
        # Page number comes from the element's metadata; None if unavailable
        'page_number': getattr(element.metadata, "page_number", None)
    }
    chunks.append(chunk_data)

In [None]:
# Print the first 100 chunks to inspect the extracted data
for chunk in chunks[0:100]:
    print(chunk)

# STEP-2
This code generates embeddings for each chunk of text using a pre-trained Sentence-BERT model.

1. **Model Loading**: The `SentenceTransformer` class is used to load the pre-trained model `"all-MiniLM-L6-v2"`, which is a lightweight model for generating sentence embeddings.
2. **Embedding Generation**: For each chunk in the `chunks` list, the text is passed through the model to generate a vector (embedding) that represents the semantic meaning of the text.
3. **Storing Embeddings**: The resulting embeddings, which are in the form of a NumPy array, are converted into a list (`tolist()`) to make them suitable for storage or further processing.

In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained Sentence-BERT model
model = SentenceTransformer("all-MiniLM-L6-v2")

print("Number of chunks ",len(chunks))
# Generate embeddings for each chunk
for chunk in chunks:
    chunk["embedding"] = model.encode(chunk["text"]).tolist()  # Convert numpy array to list for storage

# STEP-3
This code connects to an SQLite database and stores chunk data, including text, type, page number, and embeddings, into a table.

1. **Database Connection**: The `sqlite3.connect()` method is used to connect to the SQLite database named `"BOOK_VISION_CHUNKS.db"`. If the database does not exist, it will be created.
2. **Table Creation**: A `chunks` table is created if it does not already exist. The table has columns for `id`, `type`, `text`, `page_number`, and `embedding`. The `embedding` column stores the embedding as a JSON string.
3. **Data Insertion**: For each chunk in the `chunks` list, the `INSERT INTO` SQL statement is executed to store the chunk's data (type, text, page number, and embedding) into the table. The `embedding` is stored as a JSON string using `json.dumps()`.
4. **Commit and Close**: After inserting the data, the changes are committed to the database, and the connection is closed.
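
The JSON round trip described above can be checked in isolation. This is a minimal sketch using an in-memory database and a made-up three-dimensional embedding; the names `demo_conn` and `demo_chunk` are illustrative only.

```python
import json
import sqlite3

# In-memory database so the sketch leaves no file behind
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    type TEXT,
    text TEXT,
    page_number INTEGER,
    embedding TEXT
)
""")

# Made-up chunk with a tiny embedding (real embeddings have hundreds of values)
demo_chunk = {
    "type": "NarrativeText",
    "text": "Force is a push or a pull.",
    "page_number": 1,
    "embedding": [0.25, -0.5, 0.75],
}

demo_conn.execute(
    "INSERT INTO chunks (type, text, page_number, embedding) VALUES (?, ?, ?, ?)",
    (demo_chunk["type"], demo_chunk["text"], demo_chunk["page_number"],
     json.dumps(demo_chunk["embedding"])),
)
demo_conn.commit()

# Read the row back and decode the JSON string into a list of floats
row_id, stored = demo_conn.execute("SELECT id, embedding FROM chunks").fetchone()
restored = json.loads(stored)
demo_conn.close()
print(row_id, restored)
```

Note that `CREATE TABLE IF NOT EXISTS` followed by unconditional inserts appends a second copy of every chunk if the cell is run twice against the same file.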

In [None]:
import sqlite3
import json

# Connect to SQLite database (or create one if it doesn't exist)
conn = sqlite3.connect("BOOK_VISION_CHUNKS.db")
cursor = conn.cursor()

# Create table to store chunks and embeddings
cursor.execute("""
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    type TEXT,
    text TEXT,
    page_number INTEGER,
    embedding TEXT  -- Store as JSON string
)
""")

# Insert chunk data into database
for chunk in chunks:
    cursor.execute("""
    INSERT INTO chunks (type, text, page_number, embedding) VALUES (?, ?, ?, ?)
    """, (chunk["type"], chunk["text"], chunk["page_number"], json.dumps(chunk["embedding"])))  # Store embedding as JSON

# Commit and close
conn.commit()
conn.close()

In [None]:
import sqlite3

# Connect to SQLite database
connection = sqlite3.connect('BOOK_VISION_CHUNKS.db')
cursor = connection.cursor()

# Fetch all chunks from the database
cursor.execute("SELECT id, type, text, page_number FROM chunks")
rows = cursor.fetchall()

# Print the chunks to the console
print("\n🔍 **All Chunks in Database:**")
for row in rows:
    print(f"📄 Chunk ID {row[0]} - Page {row[3]} - Type: {row[1]}")
    print(f"Text: {row[2][:200]}")  # Display first 200 characters of text for brevity
    print()
# Close the cursor and connection
cursor.close()
connection.close()

# STEP-4
This code demonstrates how to use FAISS (Facebook AI Similarity Search) to index and store text embeddings for efficient similarity search.
### 📌 FAISS Indexing for Text Chunk Retrieval

This script loads text chunk embeddings from an SQLite database (`BOOK_VISION_CHUNKS.db`), converts them into a NumPy array, and indexes them using **FAISS** for efficient similarity search.
It L2-normalizes the embeddings and stores them in an **inner-product** index (`IndexFlatIP`), so search scores are cosine similarities, and it maps each embedding to its chunk ID from the database.
The FAISS index is then saved to a file (`BOOK_VISION_FAISS_INDEX.bin`) and reloaded for future queries.
Finally, it prints the number of indexed embeddings and the shape of the embedding array.
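
To see why normalizing first matters, here is a minimal pure-Python sketch (no FAISS involved) showing that the inner product of L2-normalized vectors equals their cosine similarity. The vectors and chunk IDs are made up, and the helper names are illustrative.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Made-up 2-dimensional "embeddings" keyed by chunk ID
stored_vectors = {1: [3.0, 4.0], 2: [4.0, 3.0], 3: [-3.0, 4.0]}
query_vector = [1.0, 0.0]

# Score = inner product of the normalized vectors
normalized_query = l2_normalize(query_vector)
scores = {
    chunk_id: inner_product(l2_normalize(vec), normalized_query)
    for chunk_id, vec in stored_vectors.items()
}

# Higher score means more similar, so sort in descending order
ranked_ids = sorted(scores, key=scores.get, reverse=True)
print(ranked_ids)
```

With an inner-product index, larger scores mean closer matches, the opposite of an L2-distance index where smaller values are better.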


In [None]:
import faiss
import numpy as np

embedding_list = []

ids = []

connection = sqlite3.connect('BOOK_VISION_CHUNKS.db')  # Reopen the connection if it was closed
cursor = connection.cursor()

cursor.execute("SELECT id, embedding FROM chunks")
rows = cursor.fetchall()

for row in rows:
    chunk_id = row[0]
    embedding = json.loads(row[1])  # Convert JSON string back to a list
    embedding_list.append(embedding)
    ids.append(chunk_id)

# Convert to numpy arrays
embedding_array = np.array(embedding_list, dtype=np.float32)
ids_array = np.array(ids, dtype=np.int64)  # Store actual chunk IDs

# Normalize embeddings before indexing
faiss.normalize_L2(embedding_array)  # Normalize the embeddings
# Initialize FAISS index
embedding_dimension = embedding_array.shape[1]
index = faiss.IndexFlatIP(embedding_dimension)

# Create ID-based FAISS index
index_with_ids = faiss.IndexIDMap(index)
index_with_ids.add_with_ids(embedding_array, ids_array)  # Store IDs inside FAISS

# Save FAISS index
faiss.write_index(index_with_ids, "BOOK_VISION_FAISS_INDEX.bin")

# Reload the saved index from disk to confirm it round-trips
index = faiss.read_index("BOOK_VISION_FAISS_INDEX.bin")

print(index)

print(f"FAISS Index Size: {index.ntotal}")

print("Database embedding shape:", embedding_array.shape)

# STEP-5
Add functions to filter chunks by word count and type. These functions will help you refine the chunks before and after retrieval.
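
As a quick illustration on made-up data, the sketch below applies a word-count filter and a type filter to three toy chunks. It uses compact stand-in helpers (`keep_min_words` and `drop_types`, both hypothetical names) rather than the notebook's functions. One assumption worth stating: the `type` field holds class names such as `Title`, so the comparison here is made case-insensitive.

```python
toy_chunks = [
    {"type": "Title", "text": "Chapter 3 Dynamics"},
    {"type": "NarrativeText", "text": "Force"},
    {"type": "NarrativeText", "text": "Force is a push or a pull acting on a body."},
]

def keep_min_words(chunks, min_words=5):
    """Keep chunks whose text has at least `min_words` whitespace-separated words."""
    return [c for c in chunks if len(c["text"].split()) >= min_words]

def drop_types(chunks, exclude_types=("title",)):
    """Drop chunks whose type matches an excluded type, ignoring case."""
    excluded = {t.lower() for t in exclude_types}
    return [c for c in chunks if c["type"].lower() not in excluded]

# The one-word chunk fails the word count; the title is removed by type
kept = drop_types(keep_min_words(toy_chunks, min_words=3))
print(kept)
```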

In [None]:
def filter_chunks_by_word_count(chunks, min_words=5):
    """Filter out chunks that have fewer than `min_words` words."""
    filtered_chunks = []
    for chunk in chunks:
        word_count = len(chunk["text"].split())  # Count words in the chunk
        if word_count >= min_words:
            filtered_chunks.append(chunk)
    return filtered_chunks

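def filter_chunks_by_type(chunks, exclude_types=("title", "heading")):
    """Filter out chunks of specific types (case-insensitive)."""
    # Chunk types are class names such as "Title", so compare in lowercase
    excluded = {t.lower() for t in exclude_types}
    filtered_chunks = []
    for chunk in chunks:
        if chunk["type"].lower() not in excluded:
            filtered_chunks.append(chunk)
    return filtered_chunks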

def sort_chunks_by_length(chunks):
    """Sort chunks by word count in descending order."""
    return sorted(chunks, key=lambda x: len(x["text"].split()), reverse=True)

# STEP-6
This code defines a function `search_similar_chunks` to retrieve the most similar chunks of text based on a query by utilizing both FAISS and SQLite.

1. **Convert Query to Embedding**: The input `query` is encoded into a vector using the pre-trained `SentenceTransformer` model. This transforms the query into the same format as the stored embeddings for comparison.
2. **FAISS Search**: FAISS finds the `top_k` most similar stored embeddings. Because the query and the indexed vectors are both L2-normalized and the index uses inner product, the returned scores are cosine similarities.
3. **SQLite Retrieval**: FAISS returns the chunk IDs that were stored in the index with `add_with_ids`. These IDs are used to look up the corresponding chunk data (type, text, and page number) in the SQLite database.
4. **Filter and Return**: The results are filtered by minimum word count and excluded types, and the function returns the remaining chunks with their ID, type, text, and page number.
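
The lookup from search output to database rows can be sketched on its own. The example below assumes the search returned a list of chunk IDs in ranked order, padded with `-1` where fewer than `top_k` matches exist (which is how FAISS pads its output). The table contents, `returned_ids`, and the helper `fetch_chunks` are all made up for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, type TEXT, text TEXT, page_number INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO chunks (id, type, text, page_number) VALUES (?, ?, ?, ?)",
    [
        (1, "Title", "Dynamics", 10),
        (2, "NarrativeText", "Mass measures inertia.", 10),
        (3, "NarrativeText", "Force is a push or a pull.", 11),
    ],
)

# Made-up search output: best match first, -1 marks an empty slot
returned_ids = [3, 1, -1]

def fetch_chunks(conn, ids):
    """Look up each returned ID directly, keeping the ranked order."""
    results = []
    for chunk_id in ids:
        if chunk_id == -1:
            continue  # skip padding
        row = conn.execute(
            "SELECT id, type, text, page_number FROM chunks WHERE id = ?",
            (int(chunk_id),),
        ).fetchone()
        if row:
            results.append(dict(zip(("id", "type", "text", "page_number"), row)))
    return results

hits = fetch_chunks(demo_conn, returned_ids)
demo_conn.close()
print([h["id"] for h in hits])
```

Looking rows up by `id` rather than by row offset keeps the mapping correct even when IDs do not start at zero or have gaps.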

In [None]:
def search_similar_chunks(query, top_k=50, min_words=5, exclude_types=("title", "heading")):
    """Retrieve top_k most similar chunks from SQLite using FAISS and filter by word count and type."""
    
    # Step 1: Convert query to a float32 embedding of shape (1, dim)
    query_vector = model.encode(query).reshape(1, -1).astype(np.float32)
    
    # Normalize the query embedding so inner product equals cosine similarity
    faiss.normalize_L2(query_vector)
    
    print("Query embedding shape:", query_vector.shape)
    
    # Step 2: Use FAISS to find nearest embeddings
    scores, chunk_ids = index.search(query_vector, top_k)  # Scores are cosine similarities

    # Step 3: Retrieve corresponding chunks from SQLite
    results = []
    for chunk_id in chunk_ids[0]:  # The ID-mapped index returns the stored chunk IDs
        if chunk_id == -1:
            continue  # FAISS pads with -1 when fewer than top_k results exist
        
        # Look the chunk up directly by its database ID
        cursor.execute("SELECT id, type, text, page_number FROM chunks WHERE id=?", (int(chunk_id),))
        row = cursor.fetchone()
        if row:
            results.append({"id": row[0], "type": row[1], "text": row[2], "page_number": row[3]})

    # Step 4: Filter chunks by word count and type
    filtered_results = filter_chunks_by_word_count(results, min_words=min_words)
    filtered_results = filter_chunks_by_type(filtered_results, exclude_types=exclude_types)
    
    return filtered_results

### Hybrid Search for keywords
To further improve retrieval accuracy, consider combining semantic search (using FAISS) with keyword-based search (e.g., BM25). This hybrid approach can help capture both semantic and lexical relevance.
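
The simplest way to combine the two result lists is to concatenate them and drop repeated IDs. The sketch below shows that merge on made-up hits; `merge_ranked` is a hypothetical helper name. It only works if both searches report the same kind of ID for a chunk (for example, the SQLite row ID), which is an assumption to verify in your own pipeline.

```python
def merge_ranked(semantic_hits, keyword_hits):
    """Concatenate two hit lists, keeping the first occurrence of each ID."""
    merged, seen_ids = [], set()
    for hit in semantic_hits + keyword_hits:
        if hit["id"] not in seen_ids:
            seen_ids.add(hit["id"])
            merged.append(hit)
    return merged

# Made-up results; both lists use the same ID space
semantic_hits = [
    {"id": 7, "text": "Force is a push or a pull."},
    {"id": 2, "text": "Acceleration is proportional to net force."},
]
keyword_hits = [
    {"id": 2, "text": "Acceleration is proportional to net force."},
    {"id": 9, "text": "The unit of force is the newton."},
]

merged = merge_ranked(semantic_hits, keyword_hits)
print([hit["id"] for hit in merged])
```

Because semantic hits come first, a chunk found by both searches keeps its semantic-search position in the merged list.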

In [None]:
from rank_bm25 import BM25Okapi

# Create a BM25 index for keyword-based search.
# BM25Okapi expects a tokenized corpus (one list of tokens per document).
tokenized_corpus = [chunk["text"].lower().split() for chunk in chunks]
bm25 = BM25Okapi(tokenized_corpus)

# Perform hybrid search
def hybrid_search(query, top_k=100, min_words=5, exclude_types=()):
    """Retrieve top_k most similar chunks using hybrid search and filter by word count and type."""
    
    # Step 1: Semantic search with FAISS
    faiss_results = search_similar_chunks(query, top_k, min_words=min_words, exclude_types=exclude_types)
    
    # Step 2: Keyword search with BM25 (tokenize the query the same way as the corpus)
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_indices = np.argsort(bm25_scores)[-top_k:][::-1]  # Get top_k indices, best first
    
    # Step 3: Fetch BM25 results with the same structure as FAISS results
    bm25_results = []
    for idx in bm25_indices:
        if idx < len(chunks):  # Ensure the index is within bounds
            chunk = chunks[idx]
            bm25_results.append({
                # Assumes the table was populated once, in list order, so the SQLite id is position + 1
                "id": int(idx) + 1,
                "type": chunk["type"],
                "text": chunk["text"],
                "page_number": chunk["page_number"]
            })
    
    # Step 4: Combine results
    combined_results = faiss_results + bm25_results
    
    # Step 5: Remove duplicates (if any)
    unique_results = []
    seen_ids = set()
    for result in combined_results:
        if result["id"] not in seen_ids:
            unique_results.append(result)
            seen_ids.add(result["id"])
    
    # Step 6: Filter chunks by word count and type
    filtered_results = filter_chunks_by_word_count(unique_results, min_words=min_words)
    filtered_results = filter_chunks_by_type(filtered_results, exclude_types=exclude_types)
    
    # Step 7: Sort chunks by length and return top_k
    sorted_results = sort_chunks_by_length(filtered_results)
    return sorted_results[:top_k]  # Return exactly top_k chunks

In [None]:
connection = sqlite3.connect('BOOK_VISION_CHUNKS.db')  # Reopen the connection if it was closed
cursor = connection.cursor()
query = "force"

similar_chunks = hybrid_search(query)

print("\n🔍 **Top Relevant Chunks:**")
print(len(similar_chunks))
for chunk in similar_chunks:
    print(f"📄 ID {chunk['id']} - Page {chunk['page_number']} - {chunk['type']}: {chunk['text']}")
cursor.close()
connection.close()