# AI Tutor - Textbook Ingestion (Google Colab)

This notebook runs the AI Tutor ingestion pipeline on textbook PDFs: it parses each PDF, chunks the text, generates embeddings, and saves them to a vector store.

**Features:**
- ✅ Uses TutorSystem ingestion pipeline
- ✅ Ingests your own PDFs, copied from a Google Drive folder
- ✅ Creates embeddings with sentence-transformers

In [1]:
import os
from pathlib import Path


print("🔄 Cloning AI Tutor repository...")
!git clone https://github.com/HenryNVP/ai-tutor.git
%cd ai-tutor

PROJECT_ROOT = Path.cwd()
print(f"Project root: {PROJECT_ROOT}")

🔄 Cloning AI Tutor repository...
Cloning into 'ai-tutor'...
remote: Enumerating objects: 434, done.
remote: Counting objects: 100% (434/434), done.
remote: Compressing objects: 100% (202/202), done.
remote: Total 434 (delta 221), reused 406 (delta 195), pack-reused 0 (from 0)
Receiving objects: 100% (434/434), 202.27 KiB | 1.53 MiB/s, done.
Resolving deltas: 100% (221/221), done.
/content/ai-tutor
Project root: /content/ai-tutor


In [2]:
!pip install -r requirements.txt

# Add src to Python path
import sys
SRC_ROOT = PROJECT_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

Obtaining file:///content/ai-tutor (from -r requirements.txt (line 1))
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting chromadb>=0.4.24 (from personal-stem-instructor==0.1.0->-r requirements.txt (line 1))
  Downloading chromadb-1.2.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting faiss-cpu>=1.7.4 (from personal-stem-instructor==0.1.0->-r requirements.txt (line 1))
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting pymupdf>=1.22.0 (from personal-stem-instructor==0.1.0->-r requirements.txt (line 1))
  Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting structlog>=24.1.0 (from personal-stem-instructor==0.1.0->-r requirements.txt

In [7]:
# Initialize TutorSystem
import os

from google.colab import userdata

from ai_tutor.system import TutorSystem

# Set the default model and read the API key from Colab secrets
# (requires a Colab secret named OPENAI_API_KEY)
os.environ["OPENAI_DEFAULT_MODEL"] = "gpt-4o-mini"
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

print("🚀 Initializing TutorSystem...")
system = TutorSystem.from_config()

print("✓ TutorSystem initialized")
print(f"  Embedding model: {system.embedder.config.model}")
print(f"  Chunk size: {system.settings.chunking.chunk_size}")
print(f"  Vector store: {system.settings.paths.vector_store_dir}")

🚀 Initializing TutorSystem...
✓ TutorSystem initialized
  Embedding model: BAAI/bge-base-en
  Chunk size: 500
  Vector store: data/vector_store
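
The settings above report a chunk size of 500. The notebook does not show how `TutorSystem` splits text, or whether the size is counted in characters, words, or tokens. As a rough illustration only, here is a minimal sketch of fixed-size chunking with overlap, assuming word-based counting. The function name `chunk_text` and the `overlap` parameter are hypothetical and not part of the AI Tutor API.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words, so a sentence cut at one
    boundary still appears whole in the neighbouring chunk.
    NOTE: hypothetical helper; the real pipeline may count tokens instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Synthetic 1200-word document, used only to show the window arithmetic.
sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

With a step of 450 words, the 1200-word sample yields windows starting at words 0, 450, and 900. The last window is shorter because it runs into the end of the text.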


In [13]:
from google.colab import drive

drive.mount('/content/drive')
# Data folder in Drive
drive_folder_path = '/content/drive/MyDrive/ai-tutor/raw'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [17]:
# Copy PDFs from Google Drive into the project
from pathlib import Path
from google.colab import files
import shutil

raw_data_dir = PROJECT_ROOT / "data" / "raw"
raw_data_dir.mkdir(parents=True, exist_ok=True)

# Copy PDF files from Google Drive to data/raw
print(f"Copying PDF files from '{drive_folder_path}' to '{raw_data_dir}'...")
for pdf_file in Path(drive_folder_path).glob("*.pdf"):
    shutil.copy(pdf_file, raw_data_dir / pdf_file.name)
    print(f"✅ {pdf_file.name}")

# List all PDF files
pdf_files = sorted(raw_data_dir.glob("*.pdf"))
print(f"\n📚 Found {len(pdf_files)} PDF file(s) ready for ingestion:\n")
for i, pdf in enumerate(pdf_files, 1):
    size_mb = pdf.stat().st_size / (1024 * 1024)
    print(f"  {i}. {pdf.name} ({size_mb:.1f} MB)")

Copying PDF files from '/content/drive/MyDrive/ai-tutor/raw' to '/content/ai-tutor/data/raw'...
✅ collegephysicsvol1.pdf
✅ collegephysicsvol3.pdf
✅ collegephysicsvol2.pdf
✅ Algebra-and-Trigonometry-2e-WEB.pdf
✅ Calculus_Volume_1_-_WEB_68M1Z5W.pdf
✅ Principles-of-Data-Science-WEB.pdf
✅ Introduction_To_Computer_Science_-_WEB.pdf

📚 Found 7 PDF file(s) ready for ingestion:

  1. Algebra-and-Trigonometry-2e-WEB.pdf (109.6 MB)
  2. Calculus_Volume_1_-_WEB_68M1Z5W.pdf (44.5 MB)
  3. Introduction_To_Computer_Science_-_WEB.pdf (50.9 MB)
  4. Principles-of-Data-Science-WEB.pdf (32.4 MB)
  5. collegephysicsvol1.pdf (40.7 MB)
  6. collegephysicsvol2.pdf (35.2 MB)
  7. collegephysicsvol3.pdf (29.5 MB)


In [18]:
# Ingest all textbooks
# This will:
# 1. Parse PDFs
# 2. Chunk the text
# 3. Generate embeddings
# 4. Store in vector store

print("🔄 Starting ingestion (this may take a few minutes)...\n")

result = system.ingest_directory(raw_data_dir)

print("\n" + "="*60)
print("✅ INGESTION COMPLETE")
print("="*60)
print(f"📄 Documents processed: {len(result.documents)}")
print(f"📝 Chunks created: {len(result.chunks)}")
print(f"⏭️  Files skipped: {len(result.skipped)}")

if result.skipped:
    print("\nSkipped files:")
    for skip in result.skipped:
        print(f"  - {skip}")


🔄 Starting ingestion (this may take a few minutes)...



Ingesting documents: 100%|██████████| 7/7 [00:13<00:00,  1.94s/it]
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



✅ INGESTION COMPLETE
📄 Documents processed: 7
📝 Chunks created: 4569
⏭️  Files skipped: 0


In [19]:
# Verify ingestion
print("📊 Verification:\n")

# Check chunks file
chunks_file = PROJECT_ROOT / "data" / "processed" / "chunks.jsonl"
if chunks_file.exists():
    with open(chunks_file, 'r') as f:
        chunk_count = sum(1 for _ in f)
    print(f"✓ Chunks stored: {chunk_count}")
else:
    print("✗ No chunks file found")

# Check vector store
vector_dir = PROJECT_ROOT / "data" / "vector_store"
embeddings_file = vector_dir / "embeddings.npy"
metadata_file = vector_dir / "metadata.json"

if embeddings_file.exists():
    import numpy as np
    embeddings = np.load(embeddings_file)
    print(f"✓ Embeddings: {embeddings.shape[0]} vectors of dim {embeddings.shape[1]}")
else:
    print("✗ No embeddings file found")

if metadata_file.exists():
    print("✓ Metadata file exists")
else:
    print("✗ No metadata file found")

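# Report success only if every expected artifact is present
if chunks_file.exists() and embeddings_file.exists() and metadata_file.exists():
    print("\n✅ Ingestion successful!")
else:
    print("\n❌ Ingestion incomplete: see the failed checks above")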


📊 Verification:

✓ Chunks stored: 4569
✓ Embeddings: 4569 vectors of dim 768
✓ Metadata file exists

✅ Ingestion successful!
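
The check above prints the chunk count and the vector count separately and leaves the comparison to the reader. Here is a minimal sketch of an explicit consistency check, assuming `chunks.jsonl` holds one JSON record per line and that the store keeps one embedding row per chunk (the matching 4569 counts suggest this, but the notebook does not state it). The helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number} is not valid JSON") from exc
            count += 1
    return count


def check_alignment(chunk_count: int, vector_count: int) -> bool:
    """Return True when every chunk has exactly one embedding vector."""
    return chunk_count == vector_count


# Demo on a throwaway file; the record layout here is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "chunks.jsonl"
    demo.write_text('{"text": "a"}\n{"text": "b"}\n\n', encoding="utf-8")
    n = count_jsonl_records(demo)

print(n, check_alignment(n, 2))
```

In the notebook this could be called as `check_alignment(count_jsonl_records(chunks_file), embeddings.shape[0])` after the embeddings have been loaded.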


In [20]:
folders_to_download = [
    PROJECT_ROOT / "data" / "processed",
    PROJECT_ROOT / "data" / "vector_store"
]

for folder_to_download in folders_to_download:
    if folder_to_download.exists():
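        # The "_data" suffix matches the archive names shown in the output below
        zip_filename = f"{folder_to_download.name}_data.zip"
        # Create a zip archive of the folder (make_archive appends ".zip" to the base name)
        print(f"Compressing '{folder_to_download}' into '{zip_filename}'...")
        shutil.make_archive(zip_filename.removesuffix(".zip"), 'zip', folder_to_download)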

        # Download the zip file
        print(f"Downloading '{zip_filename}'...")
        files.download(zip_filename)
        print(f"✅ Download initiated for {zip_filename}")
    else:
        print(f"❗️ Folder not found: {folder_to_download}. Skipping download for this folder.")

print("\nFinished processing download requests.")

Compressing '/content/ai-tutor/data/processed' into 'processed_data.zip'...
Downloading 'processed_data.zip'...



✅ Download initiated for processed_data.zip
Compressing '/content/ai-tutor/data/vector_store' into 'vector_store_data.zip'...
Downloading 'vector_store_data.zip'...



✅ Download initiated for vector_store_data.zip

Finished processing download requests.
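
Once downloaded, the archives have to be unpacked on the local machine before the processed data and vector store can be reused. Here is a minimal sketch of that round trip using only the standard library. The helper name `restore_archive` and the target paths are hypothetical; where the AI Tutor project expects these folders locally is an assumption based on the `data/processed` and `data/vector_store` paths used above.

```python
import shutil
import tempfile
from pathlib import Path


def restore_archive(archive: Path, target_dir: Path) -> Path:
    """Unpack a zip archive into target_dir and return target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target_dir), format="zip")
    return target_dir


# Demo with a throwaway folder standing in for data/vector_store.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "vector_store"
    source.mkdir()
    (source / "metadata.json").write_text("{}", encoding="utf-8")

    # make_archive takes the base name without ".zip" and returns the full path.
    archive = Path(shutil.make_archive(str(tmp_path / "vector_store_data"), "zip", source))
    restored = restore_archive(archive, tmp_path / "restored" / "vector_store")

    archive_name = archive.name
    restored_files = sorted(p.name for p in restored.iterdir())

print(archive_name, restored_files)
```

Because the archive is built with the folder itself as the root, its contents unpack directly into the target directory without an extra nesting level.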
