# Hybrid RAG — Colab Notebook

This notebook runs the heavy indexing and evaluation steps on Google Colab (GPU/large disk).

Overview:
- Install dependencies
- Mount Google Drive (save inputs/outputs there)
- Upload your local project files (scripts, `fixed_urls.json`) or clone a GitHub repo
- Run data collection, preprocessing, indexing, retrieval, question generation, and evaluation
- Save results and a submission ZIP to Drive

Notes:
- If you haven't pushed your repo to GitHub, use the file upload cell below to upload the contents of `scripts/` and any data files you already have (`fixed_urls.json`, `corpus.json`, etc.).
- For large models (sentence-transformers, flan-t5), Colab Pro/Pro+ or a GPU runtime is recommended.
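
Before running the heavy steps, it can help to confirm that the runtime actually has a GPU attached. The following is a minimal sketch that only looks for the `nvidia-smi` tool; the helper name `gpu_available` is hypothetical and not part of the project.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU runtime detected:", gpu_available())
```

If this prints `False` on Colab, the runtime type can be changed from the Runtime menu before re-running the notebook.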

In [None]:
# 1) Install dependencies (may take several minutes)
!pip install -U pip setuptools wheel || true
# core libs used by the project
!pip install sentence-transformers faiss-cpu rank-bm25 transformers wikipedia-api beautifulsoup4 nltk rouge-score bert-score scikit-learn tqdm joblib pandas numpy matplotlib seaborn streamlit || true

# Models are downloaded on demand during the pipeline, so nothing is pre-fetched here.
print('Dependency install step finished (scroll up to check for pip errors).')

## 2) Mount Google Drive and prepare workspace
We'll mount Drive to save large artifacts (indices, vectors, datasets). The cell below creates the folder `/content/drive/MyDrive/HybridRAG` if it does not exist; copy your project files there.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import os
WORKDIR = '/content/drive/MyDrive/HybridRAG'
os.makedirs(WORKDIR, exist_ok=True)
print('Workdir:', WORKDIR)

## 3) Upload project files (if not using GitHub)
If you haven't pushed the repository to GitHub, use the cell below to upload `scripts/` and required files (`fixed_urls.json`, etc.). You can also `!git clone <url>` if you pushed the repo.
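
`files.upload()` returns a dictionary that maps each uploaded filename to its contents as bytes. A minimal sketch of writing such a dictionary into the workspace follows; `save_uploaded` is a hypothetical helper, and the commented-out lines at the bottom show how it might be wired to the upload widget in Colab.

```python
import os


def save_uploaded(uploaded: dict, dest_dir: str) -> list:
    """Write {filename: bytes} entries into dest_dir and return the paths written."""
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    for name, data in uploaded.items():
        # Keep only the base name so an odd filename cannot escape dest_dir.
        path = os.path.join(dest_dir, os.path.basename(name))
        with open(path, "wb") as fh:
            fh.write(data)
        written.append(path)
    return written


# In Colab (assumption: WORKDIR is defined as in the Drive mount cell):
# from google.colab import files
# uploaded = files.upload()
# print(save_uploaded(uploaded, WORKDIR))
```

The upload widget handles individual files, so a `scripts/` folder has to be uploaded file by file (or as an archive that you unpack afterwards).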

In [None]:
# Option A: Clone from GitHub (uncomment and set REPO).
# REPO = 'https://github.com/yourusername/yourrepo.git'
# !git clone $REPO /content/HybridRAG_repo
# Option B: Upload local files directly (run and use the upload widget)
from google.colab import files
print('If you need to upload local files, use files.upload() and then move them into the workspace.')
# uploaded = files.upload()  # uncomment to open upload widget

## 4) Example: Run the full pipeline (assumes project files are in WORKDIR)
The commands below assume the repository files (scripts/) and `fixed_urls.json` exist in `WORKDIR`. Adjust paths if you cloned to a different directory. Each step writes outputs into `WORKDIR` so they persist to Drive.
Run cells one by one and monitor outputs.
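
A failing `!` command does not stop a notebook cell, so a later step can end up running against missing or stale inputs. As an alternative sketch (not the project's own runner), the steps could be launched from Python so that the first failure stops the run. `run_steps` is a hypothetical helper; the script paths and flags in the commented-out usage are copied from the commands in this section.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (label, argv) step in order and stop at the first non-zero exit.

    Returns the list of labels that completed successfully.
    """
    done = []
    for label, argv in steps:
        print(f"[run] {label}: {' '.join(argv)}")
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"[fail] {label} exited with code {result.returncode}; stopping.")
            break
        done.append(label)
    return done


# Usage (assumes the current directory is WORKDIR):
# run_steps([
#     ("collect", [sys.executable, "scripts/data_collection.py",
#                  "--fixed", "fixed_urls.json", "--out", "corpus.json", "--random", "300"]),
#     ("preprocess", [sys.executable, "scripts/preprocess.py",
#                     "--in", "corpus.json", "--out", "chunks.json"]),
# ])
```

Returning the completed labels makes it easy to see where a run stopped and to resume from that step.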

In [None]:
import os
WORKDIR = '/content/drive/MyDrive/HybridRAG'
os.chdir(WORKDIR)
print('CWD:', os.getcwd())

# 4.1 Check that fixed_urls.json is present (generate or upload it if not)
if not os.path.exists('fixed_urls.json'):
  print('fixed_urls.json not found in WORKDIR — you can generate it here or upload a prepared one.')
else:
  print('fixed_urls.json found.')

# 4.2 Data collection (fixed + random sample)
# This will create corpus.json in WORKDIR
!python3 scripts/data_collection.py --fixed fixed_urls.json --out corpus.json --random 300 || true

# 4.3 Preprocess / chunk
!python3 scripts/preprocess.py --in corpus.json --out chunks.json || true

# 4.4 Build indices (downloads models and embeds chunks; may take a long time)
!python3 scripts/build_index.py --chunks chunks.json --out_dir indices || true

# 4.5 Generate questions (100 Qs)
!python3 scripts/generate_questions.py --chunks chunks.json --out questions.json --num_questions 100 || true

# 4.6 Run evaluation (assumes indices exist)
!python3 scripts/evaluate.py --indices indices --chunks chunks.json --questions_in questions.json --report_out report.json || true

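# A failing step does not stop this cell, so review each step's output above.
print('Pipeline commands finished (check each step\'s output and the files in WORKDIR).')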

## 5) Save outputs and create submission ZIP
After the pipeline completes, collect the required submission artifacts into a ZIP (fixed URLs JSON, preprocessed corpus/chunks, indices, questions, report, code).

In [None]:
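import os, zipfile
WORKDIR = '/content/drive/MyDrive/HybridRAG'
os.chdir(WORKDIR)
ZIP_NAME = 'Group_XX_Hybrid_RAG.zip'
# Include important files and directories if they exist
candidates = ['fixed_urls.json','corpus.json','chunks.json','indices','questions.json','report.json','scripts','app']
items = [c for c in candidates if os.path.exists(c)]
print('Zipping:', items)
# Add only the listed items. Archiving all of WORKDIR would also pull in
# unrelated files and the ZIP that is being written.
with zipfile.ZipFile(ZIP_NAME, 'w', zipfile.ZIP_DEFLATED) as zf:
  for item in items:
    if os.path.isdir(item):
      for root, _, names in os.walk(item):
        for name in names:
          zf.write(os.path.join(root, name))
    else:
      zf.write(item)
print('ZIP created at', os.path.join(WORKDIR, ZIP_NAME))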

## Notes and troubleshooting
- If you run out of RAM while embedding, try embedding in smaller batches or use Colab Pro.
- FAISS on GPU is possible but requires a different installation (`faiss-gpu`) and a GPU runtime.
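
The first note above suggests embedding in smaller batches. A minimal sketch of that idea, independent of any particular model, follows. `iter_batches`, `embed_in_batches`, and `toy_embed` are hypothetical names; `embed_fn` stands in for whatever function turns a list of texts into a list of vectors (for example, a sentence-transformers model's `encode` method, which also accepts its own `batch_size` argument).

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], Sequence],
    batch_size: int = 32,
) -> list:
    """Call embed_fn on one batch at a time and collect the vectors in order."""
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(list(batch)))
    return vectors


def toy_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder for illustration only: one number per text."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], toy_embed, batch_size=2))
# prints [[1.0], [2.0], [3.0]]
```

Only one batch is passed to the embedder at a time, so lowering `batch_size` reduces the peak memory used per call. The collected vectors still grow with the corpus, so for very large corpora each batch could be written to disk instead of kept in a list.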