# 01 - Data Preparation (Local Database)

This notebook guides you through creating and populating your local PostgreSQL database using the project scripts:

- `src/database/connection.py` (create/init DB + schema)
- `src/database/populate.py` (load MS MARCO and insert data)

At the end, you will verify that the tables were populated.

## Prerequisites

Before running, ensure:

1. PostgreSQL is running locally.
2. You have database credentials ready.
3. (Optional) You have created a `.env` file at the project root. If it is missing, the defaults listed below are used.

Expected environment variables:

- `DB_HOST` (default: `localhost`)
- `DB_PORT` (default: `5432`)
- `DB_NAME` (default: `msmarco_db`)
- `DB_USER` (default: `postgres`)
- `DB_PASSWORD` (default: `postgres`)
- `HF_TOKEN` (optional, for Hugging Face access)
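
The variables above can be resolved with a small helper. This is a minimal sketch, not the project's actual `config.py`: the names `DEFAULTS` and `load_db_settings` are hypothetical, and the real module may load `.env` differently.

```python
import os

# Defaults mirror the values listed above. DEFAULTS and load_db_settings are
# hypothetical names used only for this illustration.
DEFAULTS = {
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "msmarco_db",
    "DB_USER": "postgres",
    "DB_PASSWORD": "postgres",
}


def load_db_settings(environ=None):
    """Return connection settings, falling back to the documented defaults."""
    environ = os.environ if environ is None else environ
    settings = {key: environ.get(key, default) for key, default in DEFAULTS.items()}
    # Environment values are strings; convert the port for client libraries.
    settings["DB_PORT"] = int(settings["DB_PORT"])
    return settings


if __name__ == "__main__":
    settings = load_db_settings()
    # Leave the password out when printing the effective settings.
    print({k: v for k, v in settings.items() if k != "DB_PASSWORD"})
```

Passing a plain dictionary instead of `os.environ` makes it easy to check which value wins without touching the real environment.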

In [None]:
import sys
from pathlib import Path

# Resolve the project root, assuming the notebook runs from the notebooks/ directory
project_root = Path.cwd().resolve().parent
print(f"Project root: {project_root}")

# Make the project root importable so the `src` package resolves from notebooks/
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Check for .env file at project root
env_path = project_root / ".env"
if env_path.exists():
    print(f"Found .env file: {env_path}")
else:
    print("No .env found at project root. Default values from config.py will be used.")
    print("You can create one manually if needed.")

## 1) Install Python dependencies

Run this once per environment.

In [None]:
import sys
import subprocess

requirements_file = project_root / "requirements.txt"
print(f"Installing dependencies from: {requirements_file}")

subprocess.run(
    [sys.executable, "-m", "pip", "install", "-r", str(requirements_file)],
    check=True,
)

## 2) Create / initialize database schema

This step uses `connection.py` to:

- create the target database if it does not exist
- execute `schema.sql`
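
The two actions above might look roughly like the following. This is a sketch under assumptions, not the contents of `connection.py`: the helper names `quote_identifier`, `ensure_database`, and `apply_schema` are hypothetical, and the functions accept any DB-API cursor so the logic can be read without a running server.

```python
def quote_identifier(name):
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def ensure_database(cursor, db_name):
    """Create db_name if it is missing; return True when it was created.

    `cursor` must belong to a connection to a maintenance database such as
    `postgres`. With psycopg2 that connection needs autocommit enabled,
    because CREATE DATABASE cannot run inside a transaction block.
    """
    cursor.execute("SELECT 1 FROM pg_database WHERE datname = %s", (db_name,))
    if cursor.fetchone() is not None:
        return False
    # Identifiers cannot be sent as query parameters, so quote explicitly.
    cursor.execute(f"CREATE DATABASE {quote_identifier(db_name)}")
    return True


def apply_schema(cursor, schema_path):
    """Execute the SQL file at schema_path against the target database."""
    with open(schema_path, encoding="utf-8") as handle:
        cursor.execute(handle.read())
```

With psycopg2, setting `conn.autocommit = True` on the maintenance connection before calling `ensure_database` avoids the "cannot run inside a transaction block" error.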

In [None]:
import sys
import subprocess

print("Running: python -m src.database.connection")

subprocess.run([sys.executable, "-m", "src.database.connection"], check=True, cwd=str(project_root))

## 3) Populate database from MS MARCO

This step can take some time because it inserts a large volume of rows.

The script will:

- download/load the dataset (`microsoft/ms_marco`, `v1.1`)
- extract rows
- insert `queries`, `passages`, and `qrels`
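
The extraction step can be sketched as follows. This is an illustration, not the logic in `populate.py`: it assumes each record has the layout of the Hugging Face `microsoft/ms_marco` `v1.1` configuration (a `query_id`, a `query`, and a `passages` dictionary with parallel `passage_text` and `is_selected` lists). The function name `flatten_record` and the passage ID scheme are hypothetical.

```python
def flatten_record(record):
    """Split one MS MARCO record into query, passage, and qrel rows.

    Passage IDs are synthesized as "<query_id>_<position>". This scheme is
    hypothetical and may not match the IDs used by the project's schema.
    """
    query_id = record["query_id"]
    query_row = (query_id, record["query"])
    passage_rows = []
    qrel_rows = []
    passages = record["passages"]
    for position, (text, selected) in enumerate(
        zip(passages["passage_text"], passages["is_selected"])
    ):
        passage_id = f"{query_id}_{position}"
        passage_rows.append((passage_id, text))
        # is_selected is 1 for passages the annotator selected for the query.
        qrel_rows.append((query_id, passage_id, int(selected)))
    return query_row, passage_rows, qrel_rows
```

In a real run the records would come from `datasets.load_dataset("microsoft/ms_marco", "v1.1")`, and the resulting tuples would typically be inserted in batches rather than one row at a time.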

In [None]:
import sys
import subprocess

module_name = "src.database.populate"
print(f"Running: python -m {module_name}")

subprocess.run([sys.executable, "-m", module_name], check=True, cwd=str(project_root))

## 4) Verify database population

This step displays the first 10 rows of each of the three core tables to confirm that the data was inserted.

In [None]:
import pandas as pd
import warnings
from src.database.connection import get_connection

# Suppress pandas DBAPI warning for psycopg2 connections
warnings.filterwarnings(
    "ignore",
    message="pandas only supports SQLAlchemy connectable.*",
    category=UserWarning,
)

conn = get_connection()

try:
    print("\n=== First 10 rows from queries ===")
    display(pd.read_sql_query("SELECT * FROM queries ORDER BY id LIMIT 10", conn))

    print("\n=== First 10 rows from passages ===")
    display(pd.read_sql_query("SELECT * FROM passages ORDER BY id LIMIT 10", conn))

    print("\n=== First 10 rows from qrels ===")
    display(pd.read_sql_query("SELECT * FROM qrels ORDER BY id LIMIT 10", conn))
finally:
    conn.close()
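
Displaying the first rows shows that the tables are not empty, and a per-table row count is a useful complementary check. The helper below is a minimal sketch, assuming only that the three tables exist. `count_rows` is a hypothetical name, and it works with any DB-API connection.

```python
def count_rows(conn, tables=("queries", "passages", "qrels")):
    """Return {table: row_count} for each table using a DB-API connection."""
    counts = {}
    cursor = conn.cursor()
    try:
        for table in tables:
            # Table names come from a fixed tuple, not user input, so
            # interpolating them into the statement is acceptable here.
            cursor.execute(f"SELECT COUNT(*) FROM {table}")
            counts[table] = cursor.fetchone()[0]
    finally:
        cursor.close()
    return counts
```

In this notebook it could be called as `count_rows(get_connection())`. Remember to close the connection afterwards.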

## Notes

- If population fails due to connectivity/authentication, re-check your PostgreSQL credentials in `.env`.
- If Hugging Face access is restricted in your environment, set `HF_TOKEN` in `.env`.
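- Re-running the initialization step is safe as long as `schema.sql` is idempotent (for example, it uses `CREATE TABLE IF NOT EXISTS`). Check the file before re-running against a populated database.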
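
Re-running a schema script is harmless only when its statements are idempotent. The sketch below illustrates the idea with an in-memory SQLite database standing in for PostgreSQL. The `IF NOT EXISTS` clause behaves the same way in both, but the table definition is a simplified, hypothetical one and not the project's `schema.sql`.

```python
import sqlite3

# Hypothetical, simplified table definition; the real schema.sql may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS queries (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
"""


def init_schema(conn):
    """Apply the schema; repeat calls are no-ops thanks to IF NOT EXISTS."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
init_schema(conn)
conn.execute("INSERT INTO queries (id, text) VALUES (1, 'example query')")
init_schema(conn)  # the second run leaves the existing table and row in place
remaining = conn.execute("SELECT COUNT(*) FROM queries").fetchone()[0]
print(remaining)
conn.close()
```

A schema that uses plain `CREATE TABLE` would instead raise an error on the second run, and one that starts with `DROP TABLE` would discard the populated data.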