<a href="https://colab.research.google.com/github/Storm00212/JARVIS/blob/main/colab_ingestion_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# JARVIS RAG Ingestion Notebook (Colab-ready)

**Purpose:** This notebook walks you through an end-to-end prototype ingestion pipeline that:
- Accepts PDF / DOCX / PPTX documents
- Extracts clean text (with optional OCR)
- Splits documents into overlapping, character-based chunks
- Generates embeddings for chunks
- Stores chunks + embeddings into a local Chroma vector store
- Exposes a simple `ask(question)` function that uses retrieval + prompt assembly (RAG)

**Notes & assumptions**
- Designed for Google Colab interactive use.
- Includes a sample path from this session: `/mnt/data/jarvis-ai.zip` which you can inspect or replace with your own uploads.
- Each code cell includes detailed comments to help you follow along.


In [None]:

# SECTION 1: Install required packages
# Run this cell in Google Colab to install dependencies. It may take 1-2 minutes.
!pip install --quiet pypdf python-docx python-pptx sentence-transformers chromadb langchain tiktoken
print('Dependencies installed (or already present).')
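
If a later cell fails with `ModuleNotFoundError`, the install cell probably did not run in the current runtime. A minimal sketch of a check for that, assuming the import names listed in `REQUIRED_MODULES` (the helper `missing_modules` is hypothetical and not part of any library; note that import names differ from pip names, e.g. `python-docx` is imported as `docx`):

```python
import importlib.util

# Import names (not pip package names) used by the later cells of this notebook.
REQUIRED_MODULES = ["pypdf", "docx", "pptx", "sentence_transformers", "chromadb"]


def missing_modules(module_names):
    """Return the subset of top-level module names that cannot be found in this runtime."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list means every dependency is importable; otherwise re-run the install cell above.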




## SECTION 2: Upload files (use UI) or use sample path

Upload files interactively with the cells below, which mount Google Drive, create a raw-data folder, and move your uploads into it. Alternatively, skip the upload and use the sample file `/mnt/data/jarvis-ai.zip` if it is present.
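
None of the cells in this section actually open the sample archive, so here is a minimal sketch of how it could be unpacked when it exists. The helper name `extract_zip_if_present` and the destination folder `/content/sample_docs` are assumptions for illustration; the contents of the archive are not known here.

```python
import os
import zipfile


def extract_zip_if_present(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the extracted file paths.

    Returns an empty list (and creates nothing) when the archive does not exist.
    """
    if not os.path.isfile(zip_path):
        return []
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Directory entries end with "/"; keep only real files.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return [os.path.join(dest_dir, n) for n in names]


# Hypothetical destination folder; the sample path is the one mentioned above.
extracted = extract_zip_if_present("/mnt/data/jarvis-ai.zip", "/content/sample_docs")
print(f"Extracted {len(extracted)} file(s).")
```

If the archive is missing, the call prints `Extracted 0 file(s).` and you can fall back to the interactive upload.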


In [None]:
# mounting google drive
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [2]:
# setting up the directory to upload the files
import os

BASE_DIR = "/content/drive/MyDrive/jarvis-ai"
RAW_DATA_DIR = f"{BASE_DIR}/data/raw"

# Create folders if they don't exist
os.makedirs(RAW_DATA_DIR, exist_ok=True)

print("Base project folder:", BASE_DIR)
print("Raw data folder:", RAW_DATA_DIR)


Base project folder: /content/drive/MyDrive/jarvis-ai
Raw data folder: /content/drive/MyDrive/jarvis-ai/data/raw


In [1]:
# uploading files to directory
# NOTE: run the previous cell first; it defines RAW_DATA_DIR and creates the folder.
# Running this cell in a fresh runtime without it raises NameError.
import os
import shutil  # shutil.move handles cross-device moves (copy, then delete)
from google.colab import files

uploaded_files = files.upload()  # choose multiple files

# Move uploaded files from /content into the Drive folder
for filename in uploaded_files.keys():
    src = f"/content/{filename}"
    dst = f"{RAW_DATA_DIR}/{filename}"
    print(f"Moving {src} → {dst}")
    shutil.move(src, dst)

print("\nUpload complete!")

print("Files in your study notes folder:")
print(os.listdir(RAW_DATA_DIR))

Saving 1. Amplifiers with Negative Feedback.pdf to 1. Amplifiers with Negative Feedback.pdf
Saving 3.1 Resources .pdf to 3.1 Resources .pdf
Saving 3.2 Past Papers  .pdf to 3.2 Past Papers  .pdf
Saving A textbook of Electrical Technology B. L. Thereja All Volumes ( PDFDrive.pdf to A textbook of Electrical Technology B. L. Thereja All Volumes ( PDFDrive.pdf
Saving applied-numerical-methods-with-matlab-for-engineers-and-scientists-4nbsped-0073397962-9780073397962_compress.pdf to applied-numerical-methods-with-matlab-for-engineers-and-scientists-4nbsped-0073397962-9780073397962_compress.pdf
Saving assignment_1.pdf to assignment_1.pdf
Saving cat.ii.q5.revised.solution.png to cat.ii.q5.revised.solution.png
Saving churchillbrown.pdf to churchillbrown.pdf
Saving Complex analysis Q&A.pdf to Complex analysis Q&A.pdf
Saving Complex analysis Q&A2.pdf to Complex analysis Q&A2.pdf
Saving Design_of_Analog_Filters_Rolf_Schaumann.pdf to Design_of_Analog_Filters_Rolf_Schaumann.pdf
Saving digielec.pdf to



## SECTION 3: Extraction utilities

Below we define helper functions for PDF, DOCX and PPTX text extraction. These are intentionally simple and commented.
Scanned documents need an OCR pipeline (for example Tesseract or PaddleOCR), which is optional and not included here.
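
To decide which pages might need OCR, one rough heuristic is to flag pages whose extracted text is empty or nearly empty. The sketch below assumes you already have one extracted string per page; the helper `pages_needing_ocr` and the `min_chars` threshold are illustrative assumptions, not tuned values.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    A very short result usually means the page is an image (scanned) rather
    than native text, but this is only a heuristic.
    """
    return [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


sample_pages = ["Negative feedback reduces gain but improves stability.", "", "   \n"]
print(pages_needing_ocr(sample_pages))  # → [2, 3]
```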


In [None]:

# Extraction helpers
from pypdf import PdfReader
from docx import Document as DocxDocument
from pptx import Presentation as PptxPresentation
import os


def extract_text_from_pdf(path):
    """Extract text from a text-based PDF using pypdf (fast for native PDFs).
    If the PDF is scanned, you'll need OCR (not included here).
    """
    text_parts = []
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        try:
            page_text = page.extract_text() or ""
        except Exception:
            page_text = ""
        text_parts.append(f"\n--- PAGE {i+1} ---\n" + page_text)
    return "\n".join(text_parts)


def extract_text_from_docx(path):
    """Extract paragraph text from a DOCX file (text inside tables is not included)."""
    doc = DocxDocument(path)
    paragraphs = [p.text for p in doc.paragraphs]
    return "\n".join(paragraphs)


def extract_text_from_pptx(path):
    """Extract text from every text-bearing shape, slide by slide, with slide markers."""
    prs = PptxPresentation(path)
    slides_text = []
    for si, slide in enumerate(prs.slides):
        parts = []
        for shape in slide.shapes:
            # Pictures and other non-text shapes have no usable .text
            if hasattr(shape, 'text') and shape.text:
                parts.append(shape.text)
        slide_text = "\n".join(parts)
        slides_text.append(f"\n--- SLIDE {si+1} ---\n" + slide_text)
    return "\n".join(slides_text)

print('Extraction helpers defined.')




## SECTION 4: Cleaning and chunking utilities

We perform simple cleaning and chunking. The chunker below is character-based and suitable for prototyping.


In [None]:

import re

def clean_text(text):
    # Normalize Windows line endings and collapse runs of 3+ newlines into one blank line
    text = text.replace('\r\n', '\n')
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Return list of (chunk_id, chunk_text). Character-based overlapping chunks."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    chunks = []
    start = 0
    idx = 0
    L = len(text)
    while start < L:
        end = min(start + chunk_size, L)
        chunk = text[start:end]
        chunks.append((f'chunk_{idx}', chunk))
        idx += 1
        if end == L:
            # Last chunk reached. Without this break, start would be reset to
            # L - chunk_overlap on every pass and the loop would never end.
            break
        start = end - chunk_overlap
    return chunks

print('Cleaning & chunking utilities ready.')
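
To see the windowing arithmetic in isolation, the sketch below computes only the `(start, end)` offsets that a character-based chunker with these defaults is meant to produce: each window starts `chunk_size - chunk_overlap` characters after the previous one. The helper name `chunk_spans` is hypothetical and introduced only for this illustration.

```python
def chunk_spans(text_length, chunk_size=1000, chunk_overlap=200):
    """Return the (start, end) character offsets of each overlapping chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # final chunk reached
        start = end - chunk_overlap  # step forward by chunk_size - chunk_overlap
    return spans


print(chunk_spans(2500))  # → [(0, 1000), (800, 1800), (1600, 2500)]
```

With the defaults, a 2,500-character document yields three chunks, and each consecutive pair shares 200 characters.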



## SECTION 5: Embeddings + Chroma setup

We use `sentence-transformers` for embeddings and Chroma, with a local on-disk store, for indexing.
For higher-quality embeddings, swap in a larger model such as `instructor-xl` or `bge-large` if your runtime has the memory and compute for it.


In [None]:

from sentence_transformers import SentenceTransformer
import chromadb

EMBED_MODEL = 'all-MiniLM-L6-v2'  # small & fast for prototype
embedder = SentenceTransformer(EMBED_MODEL)

persist_dir = 'chroma_db'
# chromadb >= 0.4 no longer accepts Settings(chroma_db_impl='duckdb+parquet', ...).
# PersistentClient stores the index under persist_dir and writes changes automatically.
client = chromadb.PersistentClient(path=persist_dir)
collection_name = 'jarvis_notes'
# Reuse the collection if it already exists, otherwise create it
collection = client.get_or_create_collection(name=collection_name)

print('Embedding model and Chroma collection ready:', collection_name)
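
Retrieval over embeddings comes down to ranking stored vectors by how close they are to the query vector. The toy sketch below uses cosine similarity on hand-written 2-D vectors to show the idea; it is an illustration only, the function names are hypothetical, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to query_vec."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(rank_by_similarity([1.0, 0.1], toy_docs))  # → [0, 2, 1]
```

Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, but the ranking principle is the same.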



## SECTION 6: Ingest function

This function implements: extraction -> cleaning -> chunking -> embedding -> index (Chroma).
It returns an ingestion summary.


In [None]:

import os, time, uuid

def ingest_file(path, filename=None, course=None):
    """Extract, clean, chunk, embed and index one document; return a summary dict."""
    if filename is None:
        filename = os.path.basename(path)
    ext = os.path.splitext(filename)[1].lower()
    doc_id = str(uuid.uuid4())

    # Extract text based on extension
    if ext == '.pdf':
        raw = extract_text_from_pdf(path)
    elif ext == '.docx':
        raw = extract_text_from_docx(path)
    elif ext == '.pptx':
        raw = extract_text_from_pptx(path)
    else:
        raise ValueError('Unsupported extension: ' + ext)

    cleaned = clean_text(raw)
    chunks = chunk_text(cleaned, chunk_size=1000, chunk_overlap=200)
    if not chunks:
        raise ValueError('No text extracted from ' + filename)

    ids, docs, metas = [], [], []
    t0 = time.time()
    # The loop variable is named `chunk`: calling it `chunk_text` would shadow the
    # chunk_text() function inside this scope and raise UnboundLocalError above.
    for idx, (_, chunk) in enumerate(chunks):
        ids.append(f"{doc_id}_chunk_{idx}")
        docs.append(chunk)
        metas.append({'document_id': doc_id, 'source_filename': filename, 'chunk_index': idx, 'course': course or ''})
    embs = embedder.encode(docs).tolist()  # one batched call instead of one call per chunk

    collection.add(ids=ids, documents=docs, metadatas=metas, embeddings=embs)
    if hasattr(client, 'persist'):
        client.persist()  # only exists on chromadb < 0.4; newer persistent clients save automatically
    t1 = time.time()
    return {'document_id': doc_id, 'filename': filename, 'num_chunks': len(chunks), 'time_seconds': t1-t0}

print('Ingest function ready. Example: ingest_file("/path/to/file.pdf")')
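
An upload folder can contain files the pipeline does not handle (the upload log above includes a `.png`, for example). A minimal sketch of a batch wrapper that skips such files and keeps going when one document fails is shown below. The name `ingest_folder` and its `ingest_fn` parameter are assumptions for illustration; passing the ingest function in as an argument keeps the sketch independent of the embedding and Chroma setup.

```python
import os

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def ingest_folder(folder, ingest_fn):
    """Call ingest_fn(path) for each supported file directly inside folder.

    Returns (summaries, skipped): the results for ingested files, and the
    names of files that were unsupported or raised an error.
    """
    summaries, skipped = [], []
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        ext = os.path.splitext(filename)[1].lower()
        if not os.path.isfile(path) or ext not in SUPPORTED_EXTENSIONS:
            skipped.append(filename)
            continue
        try:
            summaries.append(ingest_fn(path))
        except Exception as exc:  # keep going if one document fails
            print(f"Failed to ingest {filename}: {exc}")
            skipped.append(filename)
    return summaries, skipped
```

In this notebook the call would look like `summaries, skipped = ingest_folder(RAW_DATA_DIR, ingest_file)`, run after the earlier cells have defined both names.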



## SECTION 7: Retrieval + simple RAG assembly

`ask(question)` retrieves the top-k chunks and assembles a prompt. Replace the `call_model` placeholder with your preferred model call
(e.g., the Hugging Face Inference API or a local quantized model).


In [None]:

def retrieve(query, n_results=3):
    q_emb = embedder.encode(query).tolist()
    results = collection.query(query_embeddings=[q_emb], n_results=n_results)
    docs = results['documents'][0]
    metas = results['metadatas'][0]
    return list(zip(docs, metas))

def assemble_prompt(question, retrieved):
    prompt = 'You are JARVIS, a helpful assistant. Use the context below to answer the question.\n\n'
    for i, (doc_text, meta) in enumerate(retrieved):
        prompt += f"[Context {i+1}] (source: {meta.get('source_filename')}, chunk: {meta.get('chunk_index')})\n"
        prompt += doc_text[:800] + '\n\n'
    prompt += '\nQuestion: ' + question + '\nAnswer:'
    return prompt

# Placeholder model call - replace this function with a call to a model (HF, OpenAI, local runner)
def call_model(prompt):
    # Returns a fixed placeholder string and ignores the prompt.
    # To inspect the assembled prompt, read the 'prompt' key of the dict returned by ask().
    return 'MODEL_OUTPUT_PLACEHOLDER - replace call_model with actual model invocation.'


def ask(question, n_results=3):
    retrieved = retrieve(question, n_results=n_results)
    prompt = assemble_prompt(question, retrieved)
    answer = call_model(prompt)
    return {'answer': answer, 'prompt': prompt, 'retrieved': retrieved}

print('ask(question) ready. Try ask("What is X?") after ingesting documents.')
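
Since `ask` returns the retrieved `(text, metadata)` pairs alongside the answer, it is easy to show which files an answer drew on. The sketch below groups chunk indices by source file; `summarize_sources` is a hypothetical helper, and the example metadata is made up for illustration using the metadata keys from the ingest step.

```python
def summarize_sources(retrieved):
    """Group retrieved (text, metadata) pairs by source filename.

    Returns a list of (filename, sorted_chunk_indices) in first-seen order.
    """
    by_file = {}
    for _text, meta in retrieved:
        name = meta.get("source_filename", "unknown")
        by_file.setdefault(name, set()).add(meta.get("chunk_index", -1))
    return [(name, sorted(indices)) for name, indices in by_file.items()]


# Illustrative data only; real entries come from ask(...)['retrieved'].
example = [
    ("text a", {"source_filename": "digielec.pdf", "chunk_index": 4}),
    ("text b", {"source_filename": "churchillbrown.pdf", "chunk_index": 1}),
    ("text c", {"source_filename": "digielec.pdf", "chunk_index": 2}),
]
print(summarize_sources(example))
# → [('digielec.pdf', [2, 4]), ('churchillbrown.pdf', [1])]
```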
