<a href="https://colab.research.google.com/github/Storm00212/JARVIS/blob/main/colab_ingestion_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# JARVIS RAG Ingestion Notebook (Colab-ready)

**Purpose:** This notebook walks you through an end-to-end prototype ingestion pipeline that:
- Accepts PDF / DOCX / PPTX documents
- Extracts clean text (with optional OCR)
- Splits documents into semantic chunks
- Generates embeddings for chunks
- Stores chunks + embeddings into a local Chroma vector store
- Exposes a simple `ask(question)` function that uses retrieval + prompt assembly (RAG)
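
The prompt-assembly half of `ask(question)` can be sketched in plain Python. This is a minimal illustration, not the notebook's actual implementation: the helper name `build_prompt`, the `max_chars` budget, the instruction wording, and the two sample chunks are all assumptions made for the example.

```python
def build_prompt(question, retrieved_chunks, max_chars=4000):
    """Assemble a RAG prompt from a question and retrieved (source, text) pairs."""
    context_parts = []
    used = 0
    for i, (source, text) in enumerate(retrieved_chunks, start=1):
        block = f"[{i}] (source: {source})\n{text.strip()}"
        # Stop adding context once the character budget would be exceeded
        if used + len(block) > max_chars:
            break
        context_parts.append(block)
        used += len(block)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks, made up for this illustration
example_chunks = [
    ("notes_a.pdf", "Negative feedback reduces gain but improves bandwidth."),
    ("notes_b.pdf", "A filter's cutoff frequency marks its -3 dB point."),
]
prompt = build_prompt("What does negative feedback do to gain?", example_chunks)
print(prompt)
```

Numbering each context block and naming its source file makes it easier to trace an answer back to the document it came from.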

**Notes & assumptions**
- Designed for Google Colab interactive use.
- References a sample archive from the original session, `/mnt/data/jarvis-ai.zip`; inspect it or replace it with your own uploads.
- Each code cell includes detailed comments to help you follow along.


In [9]:

# SECTION 1: Install required packages
# Run this cell in Google Colab to install dependencies. It may take 1-2 minutes.
!pip install --quiet pypdf python-docx python-pptx sentence-transformers chromadb langchain tiktoken PyMuPDF langchain_text_splitters faiss-cpu llama-cpp-python
print('Dependencies installed (or already present).')


Dependencies installed (or already present).



## SECTION 2: Upload files (use UI) or use sample path

You can upload files interactively using the cells below, which mount Google Drive and move your uploads into it, or skip the upload and use the sample file `/mnt/data/jarvis-ai.zip` if it is present.
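
If you go the sample-file route, a quick way to check whether the archive exists and see what it contains is sketched below. The helper `list_archive` is a hypothetical name introduced for this example, and the sample path may not exist in your session, in which case the function returns an empty list.

```python
import os
import zipfile


def list_archive(path):
    """Return the file names inside a ZIP archive, or [] if the path is missing or not a ZIP."""
    if not os.path.isfile(path) or not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as archive:
        # Skip directory entries, which end with "/"
        return [name for name in archive.namelist() if not name.endswith("/")]


SAMPLE_ZIP = "/mnt/data/jarvis-ai.zip"  # sample path mentioned above; may not exist
names = list_archive(SAMPLE_ZIP)
print(f"{len(names)} file(s) found in {SAMPLE_ZIP}")
```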


In [None]:
# mounting google drive
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
# setting up the directory to upload the files
import os

BASE_DIR = "/content/drive/MyDrive/jarvis-ai"
RAW_DATA_DIR = f"{BASE_DIR}/data/raw"

# Create folders if they don't exist
os.makedirs(RAW_DATA_DIR, exist_ok=True)

print("Base project folder:", BASE_DIR)
print("Raw data folder:", RAW_DATA_DIR)


Base project folder: /content/drive/MyDrive/jarvis-ai
Raw data folder: /content/drive/MyDrive/jarvis-ai/data/raw


In [None]:
# uploading files to directory
from google.colab import files
import shutil # Import shutil for cross-device moves

uploaded_files = files.upload()  # choose multiple files


# Move uploaded files into the Drive folder
for filename in uploaded_files.keys():
    src = f"/content/{filename}"
    dst = f"{RAW_DATA_DIR}/{filename}"
    print(f"Moving {src} → {dst}")
    # Use shutil.move to handle cross-device links (copy then delete)
    shutil.move(src, dst)

print("\nUpload complete!")

print("Files in your study notes folder:")
print(os.listdir(RAW_DATA_DIR))

Saving 1. Amplifiers with Negative Feedback.pdf to 1. Amplifiers with Negative Feedback (2).pdf
Saving 3.1 Resources .pdf to 3.1 Resources  (2).pdf
Saving 3.2 Past Papers  .pdf to 3.2 Past Papers   (2).pdf
Saving A textbook of Electrical Technology B. L. Thereja All Volumes ( PDFDrive.pdf to A textbook of Electrical Technology B. L. Thereja All Volumes ( PDFDrive (2).pdf
Saving applied-numerical-methods-with-matlab-for-engineers-and-scientists-4nbsped-0073397962-9780073397962_compress.pdf to applied-numerical-methods-with-matlab-for-engineers-and-scientists-4nbsped-0073397962-9780073397962_compress (2).pdf
Saving assignment_1.pdf to assignment_1 (2).pdf
Saving churchillbrown.pdf to churchillbrown (2).pdf
Saving Complex analysis Q&A.pdf to Complex analysis Q&A (2).pdf
Saving Complex analysis Q&A2.pdf to Complex analysis Q&A2 (2).pdf
Saving Design_of_Analog_Filters_Rolf_Schaumann.pdf to Design_of_Analog_Filters_Rolf_Schaumann (2).pdf
Saving digielec.pdf to digielec (2).pdf
Saving DOC-202

# Reading the documents from Google Drive

In [5]:
# STEP 1 — Load PDFs from Google Drive

from google.colab import drive
drive.mount('/content/drive')

import os
import fitz  # PyMuPDF for PDFs
import docx  # DOCX reader
from pptx import Presentation  # PPTX reader

# CHANGE THIS to your folder
DATA_FOLDER = "/content/drive/MyDrive/jarvis-ai/data/raw"

documents = {}  # filename → extracted text


# function to read docx
def extract_docx(path):
    doc = docx.Document(path)
    text = "\n".join([para.text for para in doc.paragraphs])
    return text


# function to read pptx
def extract_pptx(path):
    prs = Presentation(path)
    text = []

    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                text.append(shape.text)

    return "\n".join(text)


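# function to read pdf
def extract_pdf(path):
    # Open with a context manager so the file handle is closed after reading
    with fitz.open(path) as doc:
        # Join per-page text in one pass instead of repeated string concatenation
        return "".join(page.get_text("text") for page in doc)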


# iterate through the study notes folder
for filename in os.listdir(DATA_FOLDER):
    path = os.path.join(DATA_FOLDER, filename)

    if filename.lower().endswith(".pdf"):
        print(f"Extracting PDF: {filename}")
        documents[filename] = extract_pdf(path)

    elif filename.lower().endswith(".docx"):
        print(f"Extracting DOCX: {filename}")
        documents[filename] = extract_docx(path)

    elif filename.lower().endswith(".pptx"):
        print(f"Extracting PPTX: {filename}")
        documents[filename] = extract_pptx(path)

    else:
        print(f"Skipping unsupported file: {filename}")


print("\n✔ Extraction complete!")
print(f"Total loaded documents: {len(documents)}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Extracting PDF: 1. Amplifiers with Negative Feedback (2).pdf
Extracting PDF: 3.1 Resources  (2).pdf
Extracting PDF: churchillbrown (2).pdf
Extracting PDF: EEE 3208 ELECTROMAGNETICS III lec1 notes (2).pdf
Extracting PDF: eee.eti.3104.cat.ii.make_up.ms (2).pdf
Extracting PDF: eee3102 [1-20] (2).pdf
Extracting PDF: EEE 2206_EET 2204_Electromagnetics I_Exam (2).pdf
Extracting PDF: eee3102 [21-33] (2).pdf
Extracting PDF: EEE_ETI3105_Assignment ONE (2).pdf
Extracting PDF: Design_of_Analog_Filters_Rolf_Schaumann (2).pdf
Extracting PDF: digielec (2).pdf
Extracting PDF: EEE 3208 ELECTROMAGNETICS IIILecture 2 3 and4 notes (3).pdf
Extracting PDF: eee3104eti3104 [1-68] (2).pdf
Extracting PDF: Electromagnetics (2).pdf
Extracting PDF: A textbook of Electrical Technology B. L. Thereja All Volumes ( PDFDrive (2).pdf
Extracting PDF: EEE_ETI 3101_SUP_EXAM_ANALOGUE ELECTRONICS 
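
Raw text from PDF extraction often carries irregular spacing and words hyphenated across line breaks. A light normalization pass before chunking could look like the sketch below. The helper `clean_text` and its rules are assumptions for illustration and are not part of the cells above. The hyphen rule will also join genuine compounds that happen to break at a line end, so treat it as a heuristic.

```python
import re


def clean_text(raw):
    """Normalize whitespace in extracted text while keeping paragraph breaks."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words hyphenated across a line break: "ampli-\nfier" -> "amplifier"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces around newlines
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Made-up sample imitating messy extraction output
sample = "Negative   feed-\nback  ampli-\nfiers\n\n\n\nPage  2 \n"
print(repr(clean_text(sample)))
```

One way to use it would be to wrap each extractor call, for example `documents[filename] = clean_text(extract_pdf(path))`.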

# Next, split the documents into chunks before building the embeddings and the FAISS index

In [7]:
# CREATING THE SPLITTER

from langchain_text_splitters import RecursiveCharacterTextSplitter

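# Recursive splitter: tries the separators below in order, so it copes well with mixed text types
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # maximum chunk length, in characters
    chunk_overlap=200,   # up to this many characters are shared between consecutive chunks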
    separators=[
        "\n\n",  # prefer splitting at paragraphs
        "\n",
        ". ",
        "! ",
        "? ",
        "; ",
        ", ",
        " ",    # fallback: whitespace
        ""      # absolute fallback
    ]
)

all_chunks = {}  # filename → list of text chunks

for filename, text in documents.items():
    print(f"Chunking: {filename}")

    chunks = splitter.split_text(text)
    all_chunks[filename] = chunks

print("\n✔ Chunking complete!")

Chunking: 1. Amplifiers with Negative Feedback (2).pdf
Chunking: 3.1 Resources  (2).pdf
Chunking: churchillbrown (2).pdf
Chunking: EEE 3208 ELECTROMAGNETICS III lec1 notes (2).pdf
Chunking: eee.eti.3104.cat.ii.make_up.ms (2).pdf
Chunking: eee3102 [1-20] (2).pdf
Chunking: EEE 2206_EET 2204_Electromagnetics I_Exam (2).pdf
Chunking: eee3102 [21-33] (2).pdf
Chunking: EEE_ETI3105_Assignment ONE (2).pdf
Chunking: Design_of_Analog_Filters_Rolf_Schaumann (2).pdf
Chunking: digielec (2).pdf
Chunking: EEE 3208 ELECTROMAGNETICS IIILecture 2 3 and4 notes (3).pdf
Chunking: eee3104eti3104 [1-68] (2).pdf
Chunking: Electromagnetics (2).pdf
Chunking: A textbook of Electrical Technology B. L. Thereja All Volumes ( PDFDrive (2).pdf
Chunking: EEE_ETI 3101_SUP_EXAM_ANALOGUE ELECTRONICS 1 (2).pdf
Chunking: EEE 3208 ELECTROMAGNETICS IIILecture 2 3 and4 notes (1) (2).pdf
Chunking: EEE2205 Electromagnetics I (2).pdf
Chunking: 3.2 Past Papers   (2).pdf
Chunking: EEE 3207 ELECTRICAL MACHINES 2 (2).pptx
Chunking: 
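
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break at the listed separators); the function name `sliding_chunks` is hypothetical and only illustrates the overlap arithmetic: each window starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With the settings used above (1000 and 200), each new window would start 800 characters after the previous one, so a sentence cut at one chunk boundary is likely to appear whole in the neighbouring chunk.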

# Embedding and FAISS

In [11]:
from sentence_transformers import SentenceTransformer

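# Free, open-source embedding model that runs locally (produces 384-dimensional vectors)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_text(texts):
    """Encode a list of strings into a NumPy array with one embedding row per string."""
    return embedding_model.encode(texts, convert_to_numpy=True)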
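
Once chunks are embedded, retrieval comes down to finding the vectors closest to the query vector. The sketch below shows that idea with brute-force cosine similarity in plain Python. It stands in for what a FAISS index would do at scale and is not the notebook's indexing code. The names `cosine_similarity` and `top_k` and the three-dimensional toy vectors are assumptions for illustration; real vectors would come from `embed_text`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, score) pairs for the k chunk vectors most similar to query_vec."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real chunk and query embeddings
toy_chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]
print(top_k(toy_query, toy_chunk_vecs, k=2))
```

The returned indices point back into the chunk list, which is how the matching text would be looked up for prompt assembly.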
