# 🚀 Step 1: Install Dependencies

This cell installs all the necessary Python packages for our indexing process.
- We upgrade `pip` and `numpy` first to ensure compatibility and avoid dependency conflicts within the Colab environment.
- Then we install the required `langchain` packages, `langchain-chroma` (which brings in `chromadb`) for the vector store, `sentence-transformers` for the embedding model, and utilities such as `tqdm`.

After installation, the last two lines of the cell kill the current Python process, which forces Colab to restart the session (runtime). The restart ensures that the newly installed library versions are the ones loaded into the environment. Because the process is killed deliberately, Colab may show a "Session crashed" warning; this is expected and normal behavior.
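
Once the runtime has restarted, it can be useful to confirm that the packages are actually visible to the new session. The following is a minimal sketch, not part of the original notebook: the helper name `installed_versions` is hypothetical, and the package list is an assumption copied from the `pip` command in this step.

```python
from importlib import metadata

# Distribution names taken from the pip command in this step (assumed list).
REQUIRED_PACKAGES = [
    "numpy",
    "langchain",
    "langchain-chroma",
    "langchain-huggingface",
    "sentence-transformers",
    "tqdm",
]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Print one line per package so a missing install is easy to spot.
for name, version in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, re-running the install cell is usually the first thing to try.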

In [None]:
# Update pip and install all required packages.
# The `-q` flag stands for "quiet" to reduce installation log verbosity.
!pip install --upgrade -q pip
!pip install --upgrade -q numpy
!pip install -q langchain langchain-core langchain-community langchain-chroma langchain-huggingface langchain-openai sentence-transformers tqdm

# Restart the runtime automatically to apply changes.
# This is a standard procedure in Colab after installing major packages.
import os
os.kill(os.getpid(), 9)

# 📂 Step 2: Connect to Google Drive

This cell mounts your personal Google Drive to the Colab environment, allowing us to access files stored there.

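When you run this cell, Colab asks you to authorize access to your Drive. The exact flow depends on the Colab version:
1. Confirm the prompt (or click the authorization link).
2. Sign in to your Google account.
3. Grant permission to Google Colab.
4. If you are shown an authorization code, copy it and paste it into the input box in this cell. Current Colab versions usually complete the authorization in the pop-up window without this step.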

Once mounted, your entire Google Drive will be accessible under the directory `/content/drive/MyDrive/`. This allows us to read our source data (`content.jsonl`) and, more importantly, to save the persistent vector database directly to your Drive.
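
After mounting, a quick check that the project folder contains the source file can save a failed run later. This is a small sketch under assumptions: `check_project_files` is a hypothetical helper, and the folder name `AkademikAI_Colab` is the default used in the Step 3 configuration, so adjust it if your folder is named differently.

```python
from pathlib import Path

def check_project_files(project_dir, required=("content.jsonl",)):
    """Return the names of required files that are missing from project_dir."""
    project_dir = Path(project_dir)
    return [name for name in required if not (project_dir / name).is_file()]

# Folder name assumed from the Step 3 configuration.
missing = check_project_files("/content/drive/MyDrive/AkademikAI_Colab")
if missing:
    print(f"Missing in project folder: {missing}")
else:
    print("All required files found.")
```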

In [None]:
try:
    from google.colab import drive
except ImportError:
    # This library is only available in the Google Colab environment.
    # We can pass silently when running locally.
    drive = None

# We only attempt to mount if the 'drive' object was successfully imported.
if drive:
    # Mount Google Drive to the Colab virtual machine.
    # The '/content/drive' directory will serve as the mounting point.
    drive.mount('/content/drive')

# ⚙️ Step 3: Configure, Process, and Index the Data

This is the main part of our notebook. It performs the entire RAG indexing pipeline from start to finish.

### The process is as follows:
1.  **Configuration**: We define all necessary parameters, such as file paths on Google Drive, the name of the embedding model, and text splitting settings. **Please verify that `PROJECT_DIR_COLAB` matches the name of your folder on Google Drive.**
2.  **Load & Prepare Data**: The script reads the `content.jsonl` file line by line, combines each record's `title`, `h1`, and `text` fields into the page content, keeps `url` and `title` as metadata, and wraps the result in a LangChain `Document` object.
3.  **Split Documents**: Long documents are split into smaller, overlapping chunks. This is crucial for providing focused context to the language model later on.
4.  **Check Existing Index**: The script opens the vector database and looks up which chunk IDs are already stored. This makes the process resumable: if a run is interrupted, the next run skips the chunks that were already indexed.
5.  **Generate Embeddings & Index**: For the remaining chunks, the script uses the multilingual `intfloat/multilingual-e5-large` model to generate vector embeddings. These embeddings, along with the document chunks and metadata, are then stored in a ChromaDB vector database on your Google Drive. This is the most computationally intensive step, which is why we are using a GPU.
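
To make the splitting and ID steps more concrete, here is a simplified sketch in plain Python. It is an illustration only: `split_with_overlap` is a hypothetical fixed-width sliding window, not the algorithm used by `RecursiveCharacterTextSplitter` (which prefers to break on separators such as paragraphs and newlines), and `chunk_id` only mirrors the idea of hashing source, content, and index.

```python
import hashlib

def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width sliding window: consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

def chunk_id(source, content, index):
    """Deterministic ID: the same source, content, and index always give the same hash."""
    return hashlib.sha256(f"{source}{content}{index}".encode()).hexdigest()

# Hypothetical input, chosen only to keep the output readable.
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
ids = [chunk_id("https://example.com/page", c, i) for i, c in enumerate(chunks)]
```

Because the IDs depend only on the inputs, re-running the pipeline on the same data produces the same IDs, which is what allows already indexed chunks to be recognized.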

When you run this cell, you will see a progress bar (`tqdm`) showing the indexing process.
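
Note that the progress bar advances once per batch, not once per chunk. The sketch below shows the arithmetic with an in-memory stand-in for the database; `pending_batches` and the sample IDs are hypothetical, and the batch size of 128 matches the value used in the script.

```python
import math

def pending_batches(chunk_ids, existing_ids, batch_size=128):
    """Drop IDs that are already stored, then group the rest into batches."""
    pending = [cid for cid in chunk_ids if cid not in existing_ids]
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]

all_ids = [f"chunk-{n}" for n in range(1000)]
already_indexed = set(all_ids[:300])  # pretend an earlier run stopped after 300 chunks

batches = pending_batches(all_ids, already_indexed)
print(len(batches))        # 6, i.e. math.ceil(700 / 128)
print(len(batches[-1]))    # 60, the final partial batch
```

In this example a resumed run would show 6 progress steps instead of the 8 needed for a full run over 1000 chunks.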

In [None]:
# ==============================================================================
# CELL 3: MAIN INDEXING SCRIPT FOR GOOGLE COLAB
# ==============================================================================
import json
import logging
import hashlib
from pathlib import Path
from typing import List

# --- Part 1: Imports ---
# Import all libraries installed in Cell 1
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter  # current import path for the splitter
from langchain_core.documents import Document
from tqdm import tqdm

# Configure logging to display informative messages
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


# --- Part 2: Configuration ---
# PLEASE VERIFY THIS PATH MATCHES YOUR GOOGLE DRIVE FOLDER
PROJECT_DIR_COLAB = Path("/content/drive/MyDrive/AkademikAI_Colab/")

class IndexingSettings:
    """A simple class to hold configuration parameters."""
    DATA_PATH: Path = PROJECT_DIR_COLAB / "content.jsonl"
    DB_PATH: Path = PROJECT_DIR_COLAB / "vector_db"
    EMBEDDING_MODEL_NAME: str = 'intfloat/multilingual-e5-large'
    CHUNK_SIZE: int = 1500  # Larger chunk size can be effective for overview context
    CHUNK_OVERLAP: int = 200

settings = IndexingSettings()


# --- Part 3: Helper Functions ---

def load_and_prepare_documents(file_path: Path) -> List[Document]:
    """
    Loads data from a .jsonl file and transforms it into a list of
    LangChain Document objects.
    """
    documents = []
    with open(file_path, 'r', encoding='utf-8') as f:
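        for line in f:
            # Skip blank lines (e.g. a trailing newline), which json.loads would reject
            if not line.strip():
                continue
            data = json.loads(line)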
            # We enrich the main text with title and H1 for better semantic context
            enriched_content = f"Page Title: {data.get('title', '')}\n" \
                               f"H1 Header: {data.get('h1', '')}\n\n" \
                               f"{data.get('text', '')}"
            metadata = {
                "source": data.get('url', ''),
                "title": data.get('title', ''),
            }
            doc = Document(page_content=enriched_content, metadata=metadata)
            documents.append(doc)
    return documents

def create_chunk_id(chunk: Document, chunk_index: int) -> str:
    """
    Creates a deterministic ID for a document chunk by hashing its
    source URL, its content, and its index.
    """
    unique_string = f"{chunk.metadata.get('source', '')}{chunk.page_content}{chunk_index}"
    return hashlib.sha256(unique_string.encode()).hexdigest()

# --- Part 4: Main Indexing Logic ---

def run_indexing():
    """Main function to create and populate the vector database."""
    logging.info("🚀 Starting the indexing process in Google Colab...")

    logging.info(f"📑 Loading documents from: {settings.DATA_PATH}")
    if not settings.DATA_PATH.exists():
        logging.error(f"Data file not found! Please check the path: {settings.DATA_PATH}")
        return
    documents = load_and_prepare_documents(settings.DATA_PATH)

    logging.info(f"🔪 Splitting {len(documents)} documents into chunks (chunk_size={settings.CHUNK_SIZE})...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=settings.CHUNK_SIZE,
        chunk_overlap=settings.CHUNK_OVERLAP
    )
    chunks = text_splitter.split_documents(documents)

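    # Assign a deterministic ID to each chunk to enable resumability and avoid duplicates.
    # The index is the chunk's position in the full list, so IDs stay stable only while
    # the input file and the splitter settings are unchanged.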
    for i, chunk in enumerate(chunks):
        chunk.metadata["id"] = create_chunk_id(chunk, i)

    logging.info(f"✅ Prepared a total of {len(chunks)} chunks to be processed.")

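    # Use the GPU when one is available; otherwise fall back to the (much slower) CPU.
    import torch  # installed as a dependency of sentence-transformers
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logging.info(f"🧠 Initializing embedding model: {settings.EMBEDDING_MODEL_NAME} (device: {device})")
    embedding_model = HuggingFaceEmbeddings(
        model_name=settings.EMBEDDING_MODEL_NAME,
        model_kwargs={'device': device},
        encode_kwargs={'normalize_embeddings': True}
    )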

    db = Chroma(
        persist_directory=str(settings.DB_PATH),
        embedding_function=embedding_model
    )

    logging.info("🔍 Checking for already indexed documents in the database...")
    existing_ids = set(db.get(include=[])['ids'])
    logging.info(f"✅ Found {len(existing_ids)} existing chunks in the database.")

    chunks_to_index = [chunk for chunk in chunks if chunk.metadata["id"] not in existing_ids]

    if not chunks_to_index:
        logging.info("🎉 All documents are already indexed! Nothing to do.")
        return

    logging.info(f"⏳ {len(chunks_to_index)} chunks remaining to be indexed.")

    logging.info("💾 Adding new chunks to the vector database...")
    batch_size = 128 # A larger batch size is efficient on GPUs

    for i in tqdm(range(0, len(chunks_to_index), batch_size), desc="Indexing (GPU)"):
        batch = chunks_to_index[i:i + batch_size]
        batch_ids = [chunk.metadata["id"] for chunk in batch]
        # Using add_documents is an efficient way to add a batch with IDs
        db.add_documents(documents=batch, ids=batch_ids)

    # .persist() is no longer needed with the new langchain-chroma package.
    # The database saves automatically when initialized with a persist_directory.
    logging.info("🎉 Indexing process completed successfully!")


# --- Part 5: Run the Main Function ---
run_indexing()