# Hybrid RAG — Colab Notebook

This notebook runs the heavy indexing and evaluation steps on Google Colab (GPU/large disk).

Overview:
- Install dependencies
- Mount Google Drive (save inputs/outputs there)
- Upload your local project files (scripts, `fixed_urls.json`) or clone a GitHub repo
- Run data collection, preprocessing, indexing, retrieval, question generation, and evaluation
- Save results and a submission ZIP to Drive

Notes:
- If you haven't pushed your repo to GitHub, upload the `scripts/` folder and the data files (`fixed_urls.json`, `corpus.json`, etc.) to the Drive workspace instead of cloning.
- For large models (sentence-transformers, flan-t5), Colab Pro/Pro+ or a GPU runtime is recommended.
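
Before starting the heavy steps, it can help to confirm that a GPU runtime is actually attached. The following is a minimal sketch, not part of the project: `gpu_available` is a hypothetical helper, and it assumes an NVIDIA GPU exposed through the `nvidia-smi` tool.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tool found: most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU runtime detected:", gpu_available())
```

If this prints `False`, the pipeline can still run on CPU, but embedding and generation will likely be much slower.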

In [1]:
# 1) Install dependencies (may take several minutes)
!pip install -U pip setuptools wheel || true
# core libs used by the project
!pip install sentence-transformers faiss-cpu rank-bm25 transformers wikipedia-api beautifulsoup4 nltk rouge-score bert-score scikit-learn tqdm joblib pandas numpy matplotlib seaborn streamlit || true

# Models are downloaded on demand during the pipeline, not here.
# `|| true` hides pip failures, so this message does not prove the install succeeded.
print('Dependency install step finished; check the log above for errors.')

Collecting pip
  Using cached pip-26.0.1-py3-none-any.whl.metadata (4.7 kB)
Collecting setuptools
  Using cached setuptools-81.0.0-py3-none-any.whl.metadata (6.6 kB)
Collecting wheel
  Using cached wheel-0.46.3-py3-none-any.whl.metadata (2.4 kB)
Using cached pip-26.0.1-py3-none-any.whl (1.8 MB)
Using cached setuptools-81.0.0-py3-none-any.whl (1.1 MB)
Using cached wheel-0.46.3-py3-none-any.whl (30 kB)
Installing collected packages: wheel, setuptools, pip
  Attempting uninstall: wheel
    Found existing installation: wheel 0.45.1
    Uninstalling wheel-0.45.1:
      Successfu

## 2) Mount Google Drive and prepare workspace
The cell below mounts Drive and creates the workspace folder `/content/drive/MyDrive/HybridRAG`, where large artifacts (indices, vectors, datasets) are saved. Copy your project files into that folder.

In [2]:
from google.colab import drive
drive.mount('/content/drive')
import os
WORKDIR = '/content/drive/MyDrive/HybridRAG'
os.makedirs(WORKDIR, exist_ok=True)
print('Workdir:', WORKDIR)

ModuleNotFoundError: No module named 'google'

## 3) Get project files into the workspace (clone or upload)
If you have pushed the repository to GitHub, the cell below clones it and copies its contents into `WORKDIR`. If you haven't, upload `scripts/` and the required files (`fixed_urls.json`, etc.) into `WORKDIR` yourself, for example through the Colab file browser or the Drive web interface.
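
For the upload route, one possible approach is to zip the project locally and extract it into the workspace. The sketch below is an illustration under assumptions: `extract_upload` is a hypothetical helper, the archive is assumed to contain `scripts/` and `fixed_urls.json` at its top level, and the upload step only runs inside Colab, where `google.colab.files` is importable.

```python
import io
import os
import zipfile


def extract_upload(archive, workdir: str) -> list:
    """Extract a zip (path or file-like object) into workdir; return member names."""
    os.makedirs(workdir, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir)
        return sorted(zf.namelist())


try:
    from google.colab import files  # only importable inside Colab
except ImportError:
    files = None

if files is not None:
    WORKDIR = '/content/drive/MyDrive/HybridRAG'
    uploaded = files.upload()  # opens a file picker; maps filename -> bytes
    for name, data in uploaded.items():
        if name.endswith('.zip'):
            print('Extracted:', extract_upload(io.BytesIO(data), WORKDIR))
```

Outside Colab the upload step is skipped, so the helper can also be used on its own with a path to a local archive.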

In [None]:
# Option A: Clone from GitHub (auto-clone your provided repo)
REPO = 'https://github.com/AshwaniJaiswalIt/CAI_RAG.git'
import os
WORKDIR = '/content/drive/MyDrive/HybridRAG'
os.makedirs(WORKDIR, exist_ok=True)

# If scripts/ is missing in the Drive workspace, attempt several robust fallbacks so the pipeline can run.
if not os.path.exists(os.path.join(WORKDIR, 'scripts')):
    print('\n`scripts/` not found in WORKDIR. Attempting to populate from the repository...')
    # First, try cloning to a temp location in local runtime and copy into Drive
    if not os.path.exists('/content/HybridRAG_repo'):
        print('Cloning repository to /content/HybridRAG_repo...')
        !git clone $REPO /content/HybridRAG_repo || true
    else:
        print('Repository already exists at /content/HybridRAG_repo, pulling latest...')
        !cd /content/HybridRAG_repo && git pull || true

    print('Copying repository files into WORKDIR...')
    !cp -r /content/HybridRAG_repo/* $WORKDIR || true

    # If copy didn't produce scripts/, try a fresh shallow clone into /content/tmp_repo and copy only the scripts folder
    if not os.path.exists(os.path.join(WORKDIR, 'scripts')):
        print('Copy failed or `scripts/` still missing; trying a fallback shallow clone into /content/tmp_repo...')
        !rm -rf /content/tmp_repo || true
        !git clone --depth 1 $REPO /content/tmp_repo || true
        print('Copying `scripts/` from /content/tmp_repo to WORKDIR...')
        !cp -r /content/tmp_repo/scripts $WORKDIR || true

    if not os.path.exists(os.path.join(WORKDIR, 'scripts')):
        print('\nWARNING: `scripts/` still not found in WORKDIR after fallback attempts.')
        print('Please either:')
        print('  1) Upload the `scripts/` folder and `fixed_urls.json` manually via the Colab Upload widget,')
        print('  2) Ensure the GitHub repo URL is correct and accessible,')
        print('  3) Or run the notebook from a runtime that already has the project copied into Drive.')
    else:
        print('\n`scripts/` successfully copied to WORKDIR.')
else:
    print('`scripts/` already present in WORKDIR — no action required.')

print('\nListing WORKDIR contents (top-level):')
!ls -la $WORKDIR || true
print('\nListing WORKDIR/scripts contents (if present):')
!ls -la $WORKDIR/scripts || true

In [None]:
# Verification cell: check WORKDIR, fixed_urls.json format, and required scripts
import os, json
WORKDIR = '/content/drive/MyDrive/HybridRAG'
print('Checking WORKDIR:', WORKDIR)
if not os.path.exists(WORKDIR):
    print('WORKDIR does not exist. Make sure Drive is mounted and the path is correct.\nCall: from google.colab import drive; drive.mount("/content/drive")')
else:
    os.chdir(WORKDIR)
    print('CWD:', os.getcwd())

# Check scripts folder and required script files
required_scripts = [
    'scripts/data_collection.py',
    'scripts/preprocess.py',
    'scripts/build_index.py',
    'scripts/generate_questions.py',
    'scripts/evaluate.py'
]
missing = [p for p in required_scripts if not os.path.exists(p)]
if missing:
    print('\nMissing required script files:')
    for m in missing:
        print(' -', m)
    print('\nIf these are missing, either:')
    print('  * Upload the `scripts/` folder via the Colab file upload widget,')
    print('  * Or ensure the GitHub repo was cloned successfully into the WORKDIR (re-run the clone cell).')
else:
    print('\nAll required script files are present.')

# Check fixed_urls.json
fixed_path = 'fixed_urls.json'
ready = not missing  # missing scripts also mean the workspace is not ready
if not os.path.exists(fixed_path):
    print('\nfixed_urls.json not found in WORKDIR. Please upload it or generate it with scripts/fixed_urls_generator.py and copy here.')
    ready = False
else:
    try:
        with open(fixed_path, 'r') as f:
            fj = json.load(f)
        if isinstance(fj, dict) and 'fixed_urls' in fj:
            fixed_list = fj['fixed_urls']
        elif isinstance(fj, list):
            fixed_list = fj
        else:
            print('\nfixed_urls.json has an unexpected format. Should be a list or {"fixed_urls": [...]}.')
            ready = False
            fixed_list = []
        # de-duplicate while keeping order
        fixed_unique = list(dict.fromkeys(fixed_list))
        print(f'\nfixed_urls.json contains {len(fixed_list)} entries, {len(fixed_unique)} unique.')
        if len(fixed_unique) != 200:
            print('ERROR: The fixed set must contain exactly 200 unique URLs.')
            ready = False
        else:
            # basic sanity checks for URL shape
            bad_urls = [u for u in fixed_unique if not (isinstance(u, str) and ('/wiki/' in u or u.startswith('http')))]
            if bad_urls:
                print('Warning: Some fixed URLs look malformed (not a string, or neither contains /wiki/ nor starts with http):')
                for b in bad_urls[:10]:
                    print(' -', b)
                ready = False
    except Exception as e:
        print('\nFailed to read fixed_urls.json:', e)
        ready = False

print('\nReadiness check result:', 'OK' if ready else 'NOT READY')
if not ready:
    print('\nFix the issues above, then re-run this verification cell before running the pipeline cell.')
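
The URL check in the verification cell is a loose substring test. A stricter variant could parse each entry instead. This is a sketch only: `looks_like_wiki_url` is a hypothetical helper, and it assumes the fixed set consists of Wikipedia-style article URLs whose path starts with `/wiki/`.

```python
from urllib.parse import urlparse


def looks_like_wiki_url(url) -> bool:
    """True for an http(s) URL with a host and a non-empty /wiki/<title> path."""
    if not isinstance(url, str):
        return False
    parts = urlparse(url)
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and parts.path.startswith("/wiki/")
        and len(parts.path) > len("/wiki/")
    )


sample = [
    "https://en.wikipedia.org/wiki/Information_retrieval",
    "http://example.com/page",
    "/wiki/Missing_scheme",
    42,
]
print([looks_like_wiki_url(u) for u in sample])
# → [True, False, False, False]
```

If the fixed set may include non-Wikipedia sources, the `/wiki/` condition would need to be relaxed.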

## 4) Example: Run the full pipeline (assumes project files are in WORKDIR)
The commands below assume that the repository's `scripts/` folder and `fixed_urls.json` exist in `WORKDIR`. Adjust the paths if you cloned to a different directory. Each step writes its outputs into `WORKDIR`, so they persist on Drive. Because the cell uses `set -euo pipefail`, it stops at the first failing step.
Run the cells one at a time and monitor the output.

In [None]:
%%bash
set -euo pipefail
WORKDIR='/content/drive/MyDrive/HybridRAG'
cd "$WORKDIR"

echo "CWD: $(pwd)"

# 4.1 Check that fixed_urls.json exists (this cell does not generate it)
if [ ! -f fixed_urls.json ]; then
  echo "fixed_urls.json not found in WORKDIR — please upload or generate it first."
  echo "You can run: python3 scripts/fixed_urls_generator.py --n 200 --out fixed_urls.json"
  exit 1
else
  echo "fixed_urls.json found."
fi

# Ensure wikipedia-api is installed, as it's required by data_collection.py
python3 -m pip install wikipedia-api || true

# 4.2 Data collection (fixed + random sample)
# This will create corpus.json in WORKDIR
echo "Running data collection..."
python3 scripts/data_collection.py --fixed fixed_urls.json --out corpus.json --random 300

# 4.3 Preprocess / chunk
echo "Running preprocessing/chunking..."
python3 scripts/preprocess.py --in corpus.json --out chunks.json

# Ensure sentence-transformers is installed, as it's required by build_index.py
python3 -m pip install sentence-transformers || true

# 4.4 Build indices (this will download models and embed chunks — may take long)
echo "Building indices (smoke-test with --max_chunks 500)..."
python3 scripts/build_index.py --chunks chunks.json --out_dir indices --max_chunks 500

# 4.5 Generate questions (100 Qs)
echo "Generating questions..."
# Fix: Remove the problematic import from generate_questions.py
sed -i '/from generate import generate_answer_if_needed/d' scripts/generate_questions.py
python3 scripts/generate_questions.py --chunks chunks.json --out questions.json --num_questions 100

# 4.6 Run evaluation (assumes indices exist)
echo "Running evaluation..."
# Fix: Remove the problematic import from evaluate.py
sed -i '/from generate import generate_answer_if_needed/d' scripts/evaluate.py
python3 scripts/evaluate.py --indices indices --chunks chunks.json --questions_in questions.json --report_out report.json

echo 'Pipeline completed successfully (check outputs in WORKDIR).'
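
After the pipeline cell finishes, a quick check that each step produced its artifact can catch silent failures. The sketch below is illustrative: `missing_outputs` is a hypothetical helper, and the expected names are taken from the `--out`, `--out_dir`, and `--report_out` arguments used in the cell above.

```python
import os

# Artifact names taken from the pipeline commands above.
EXPECTED_OUTPUTS = ["corpus.json", "chunks.json", "indices", "questions.json", "report.json"]


def missing_outputs(workdir: str, expected=EXPECTED_OUTPUTS) -> list:
    """Return the expected artifacts that are absent or empty in workdir."""
    missing = []
    for name in expected:
        path = os.path.join(workdir, name)
        if os.path.isdir(path):
            if not os.listdir(path):  # an empty directory counts as missing
                missing.append(name)
        elif not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


WORKDIR = '/content/drive/MyDrive/HybridRAG'
print('Missing or empty outputs:', missing_outputs(WORKDIR))
```

An empty list suggests every step wrote something. It does not validate the contents of the files.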

## 5) Save outputs and create submission ZIP
After the pipeline completes, collect the required submission artifacts into a ZIP.

In [None]:
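import os, zipfile
WORKDIR = '/content/drive/MyDrive/HybridRAG'
os.chdir(WORKDIR)
ZIP_NAME = 'Group_149_Hybrid_RAG.zip'
# Include important files and directories if they exist
candidates = ['fixed_urls.json','corpus.json','chunks.json','indices','questions.json','report.json','scripts','app']
items = [c for c in candidates if os.path.exists(c)]
print('Zipping:', items)
# Write only the listed items. Archiving all of WORKDIR would ignore `items`
# and could pull in unrelated files, including the ZIP being written.
with zipfile.ZipFile(ZIP_NAME, 'w', zipfile.ZIP_DEFLATED) as zf:
  for item in items:
    if os.path.isdir(item):
      # Add every file under the directory, keeping relative paths
      for root, _, filenames in os.walk(item):
        for name in filenames:
          zf.write(os.path.join(root, name))
    else:
      zf.write(item)
print('ZIP created at', os.path.join(WORKDIR, ZIP_NAME))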

## Notes and troubleshooting
- If you run out of RAM while embedding, embed in smaller batches or switch to Colab Pro.
- FAISS on GPU is possible but requires a different package (`faiss-gpu` instead of `faiss-cpu`) and a GPU runtime.
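
The batching suggestion above can be sketched as follows. This is a generic illustration, not the logic in `scripts/build_index.py`: `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real model call so the example runs without downloading anything.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts one batch at a time so each embed_fn call stays small."""
    vectors: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # Stand-in embedder (assumption): one 2-d vector per text.
    return [[float(len(text)), 1.0] for text in batch]


chunks = [f"chunk {i}" for i in range(10)]
vectors = embed_in_batches(chunks, fake_embed, batch_size=4)
print(len(vectors))  # → 10
```

With sentence-transformers, `embed_fn` could wrap `model.encode(...)`, which also accepts its own `batch_size` argument. Lowering the batch size reduces peak memory per call at the cost of more calls.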
