A Retrieval-Augmented Generation (RAG) System for Canadian Banking FAQs, starting with RBC

- RAG Project
- Project Overview
- Why This RAG System Matters
- Current Sprint Status
- Project Goals
- Model
- Architecture
- Tech Stack
- Project Structure
- Sprint Tracker
- Running the Project (Colab + Cloudflare)
- Quick Demo (Colab)
- Version Control & GitHub Workflow
- Monitoring & Observability
- Deployment
- Future Enhancements
- Author
- License
This project implements a fully modular, production-ready Retrieval-Augmented Generation (RAG) system focused on Canadian banking FAQs, starting with Royal Bank of Canada (RBC).
The system is designed with accuracy, grounding, reproducibility, and cloud readiness as core principles. It performs end-to-end retrieval and generation using:
- Embeddings from all-MiniLM-L6-v2 (Sentence Transformers)
- Normalized vector similarity using FAISS (GPU-accelerated)
- Clean metadata mapped to the original scraped FAQs
- Model: microsoft/Phi-3-mini-4k-instruct
- Loaded efficiently on Colab GPU (float16, device_map="auto")
- Custom RAG prompt ensures:
  - factuality
  - context-grounded answers
  - no hallucinations

If the answer is not in the retrieved context, the model responds with:

"I don't know."
Handles:
- Top-k semantic retrieval
- Construction of the grounding prompt
- Calling Phi-3 Mini for final generation
- /health and /ask endpoints
- Chat experience similar to ChatGPT
- Conversations preserved per session
- Sidebar displays retrieved FAQ evidence
- Automatically reads the backend URL from rag_llm_url.txt
Both backend and frontend are exposed publicly via Cloudflare quick tunnels, providing:
- Stable URLs (https://*.trycloudflare.com)
- No account required
- Fully compatible with Colab
The project provides a single notebook (rag_full_pipeline_colab.ipynb) that:
- Installs all dependencies
- Scrapes and preprocesses data
- Builds embeddings + FAISS index
- Launches the FastAPI backend
- Creates a Cloudflare tunnel for the backend
- Starts the Streamlit UI
- Creates a Cloudflare tunnel for the UI
- Returns two URLs:
  - Backend API URL
  - Public Streamlit Chat Interface
- Demonstrates a clean, explainable retrieval process
- Enforces hallucination-safe constraints
- Provides a complete MLOps-ready pipeline
- Designed for real banking FAQ use cases
- Can be extended to other banks (TD, CIBC, BMO, Scotiabank)
This architecture ensures consistent, auditable, and accurate answers from the LLM by grounding every response in validated FAQ data.
This README reflects the modernized architecture:
- Data ingestion, cleaning, and preprocessing
- Embeddings + FAISS
- Backend (FastAPI)
- Frontend (Streamlit)
- Public access (Cloudflare Tunnels)
Monitoring (Sprint 6) and Cloud deployment (Sprint 7) are planned next.
- Develop a modular, explainable RAG system that can expand to other banks.
- Ensure accuracy and prevent hallucinations via contextual grounding.
- Demonstrate practical MLOps: experiment tracking, CI/CD, monitoring, and cloud readiness.
- Support both GPU (T4) and CPU inference modes in Colab.
microsoft/Phi-3-mini-4k-instruct
A compact, instruction-tuned model optimized for fast, high-quality inference in Google Colab GPU environments.
- Works reliably on Colab GPUs (T4, L4); the model fits in roughly 6-8 GB of VRAM in float16
- Very fast generation (low latency)
- Instruction-tuned for Q&A and reasoning tasks
- Ideal for RAG pipelines that require concise, grounded answers
The model is loaded with:

- float16 precision (GPU)
- device_map="auto" for efficient VRAM usage
- No 8-bit loading, no quantization
- Tokenizer and model loaded from Hugging Face using HUGGINGFACEHUB_API_TOKEN
Implemented using the Hugging Face Transformers pipeline("text-generation") API:
- max_new_tokens=256
- temperature=0.2
- repetition_penalty=1.1
- top_p=0.9
- low-temperature, near-deterministic generation
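As a hedged illustration, the setup above can be reproduced with a few lines of Transformers code. This is a minimal sketch; the project's actual loading code lives in src/generation/generator.py, and depending on the transformers version Phi-3 may additionally need trust_remote_code=True.

```python
# Minimal sketch of the Phi-3 Mini generation setup described above.
# Assumes a Colab GPU runtime and a valid Hugging Face token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,   # float16 precision on GPU
    device_map="auto",           # spread layers across available VRAM
)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.2,
    repetition_penalty=1.1,
    top_p=0.9,
)
```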
Every response is generated using a strict RAG prompt:
Use ONLY the provided context to answer the question accurately.
If the answer is not in the context, say exactly: "I don't know."
This enforces:
- No hallucinations
- No invented banking policies
- No fabricated contact numbers
- No unauthorized assumptions
You are an expert assistant specializing in Canadian banking FAQs.
Use ONLY the provided context to answer the question accurately.
If the answer is not in the context, say exactly: "I don't know."
Context:
{retrieved_docs}
Question: {question}
Answer:
The model produces exactly this string when:
- The retrieval step returns empty context
- None of the top-k retrieved answers include relevant information
- Generation pipeline fails or returns malformed output
This behavior ensures:
- High integrity of answers
- Compliance-friendly outputs
- Zero hallucination tolerance
- Auditability for regulated use cases (e.g., banking, finance)
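To make the fallback concrete, here is a minimal sketch of how prompt assembly and the fallback can be wired together. The helper names (build_prompt, answer_or_fallback) are illustrative, not the project's exact API; the real logic lives in src/generation/generator.py.

```python
# Illustrative sketch of prompt assembly plus the "I don't know." fallback.
# Function names here are hypothetical, not the project's exact API.
FALLBACK = "I don't know."

PROMPT_TEMPLATE = """You are an expert assistant specializing in Canadian banking FAQs.
Use ONLY the provided context to answer the question accurately.
If the answer is not in the context, say exactly: "{fallback}"

Context:
{context}

Question: {question}

Answer:"""


def build_prompt(question: str, retrieved_docs: list[str]) -> str | None:
    """Return a grounded prompt, or None when there is no usable context."""
    context = "\n\n".join(doc.strip() for doc in retrieved_docs if doc.strip())
    if not context:
        return None  # empty retrieval -> caller should emit the fallback
    return PROMPT_TEMPLATE.format(fallback=FALLBACK, context=context, question=question)


def answer_or_fallback(question: str, retrieved_docs: list[str], generate) -> str:
    prompt = build_prompt(question, retrieved_docs)
    if prompt is None:
        return FALLBACK
    try:
        text = generate(prompt)
    except Exception:
        return FALLBACK  # pipeline failure or malformed output
    return text.strip() or FALLBACK
```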
flowchart TD
%% STYLE DEFINITIONS
classDef phase fill:#f3f2ff,stroke:#4b4bff,stroke-width:1px,color:#000,border-radius:6px
classDef component fill:#ffffff,stroke:#6b7280,stroke-width:1px,color:#000,border-radius:6px
classDef cloud fill:#e0f7ff,stroke:#0ea5e9,stroke-width:1px,color:#000,border-radius:6px
classDef db fill:#fef9c3,stroke:#facc15,stroke-width:1px,color:#000,border-radius:6px
%% PHASE 1: SCRAPER
A1([Playwright Scraper]):::component
A2([Raw RBC FAQ HTML]):::db
subgraph P1[PHASE 1 - SCRAPING AND VALIDATION]
A1 --> A2
end
class P1 phase
%% PHASE 2: PREPROCESSING
B1([clean_rbc_faqs.py]):::component
B2([normalize_faqs.py]):::component
B3([split_compound_faqs.py]):::component
B4([chunk_text.py]):::component
B5([rbc_faq_chunks.parquet]):::db
subgraph P2[PHASE 2 - TEXT PREPROCESSING]
A2 --> B1 --> B2 --> B3 --> B4 --> B5
end
class P2 phase
%% PHASE 3: EMBEDDINGS + FAISS
C1([MPNet Encoder]):::component
C2([generate_embeddings.py]):::component
C3([build_faiss_index.py]):::component
C4([rbc_embeddings.npy]):::db
C5([rbc_faiss.index]):::db
C6([rbc_metadata.parquet]):::db
subgraph P3[PHASE 3 - EMBEDDINGS AND FAISS]
B5 --> C2
C2 --> C4
C2 --> C6
C4 --> C3
C3 --> C5
end
class P3 phase
%% PHASE 3.5 - ONNX EXPORT
D1([export_mpnet_onnx.py]):::component
D2([mpnet.onnx]):::db
D3([tokenizer and config files]):::db
subgraph P35[PHASE 3.5 - MPNet TO ONNX]
C1 --> D1 --> D2
D1 --> D3
end
class P35 phase
%% PHASE 4: HYBRID RETRIEVER
E1([Hybrid RbcRetriever]):::component
E2([Local MPNet Encoder]):::component
E3([ONNXRuntime MPNet Encoder]):::component
E4([FAISS High Recall Search]):::component
subgraph P4[PHASE 4 - HYBRID RETRIEVER]
C5 --> E4
C6 --> E1
E1 --> E4
E1 -->|DEPLOY_ENV local| E2
E1 -->|DEPLOY_ENV cloud| E3
D2 --> E3
D3 --> E3
end
class P4 phase
%% PHASE 5-6: GENERATOR + BACKEND
F1([Strict Literal Generator]):::component
F2([FastAPI Backend]):::component
F3([Cloudflare Tunnel]):::cloud
F4([Cloud Run Service]):::cloud
subgraph P56[PHASE 5 AND 6 - GENERATION AND SERVING]
E1 --> F1
F1 --> F2
F2 --> F3 --> F4
end
class P56 phase
%% PHASE 7: MONITORING
G1([Streamlit Dashboard]):::component
G2([RAG Logs and Metrics]):::db
subgraph P7[PHASE 7 - MONITORING AND ANALYTICS]
F2 --> G2
G2 --> G1
end
class P7 phase
%% PHASE 8: CLOUD DEPLOYMENT
H1([Dockerfile]):::component
H2([cloudbuild.yaml]):::component
H3([Artifact Registry]):::cloud
H4([Cloud Run Deploy Script]):::component
subgraph P8[PHASE 8 - DOCKER BUILD AND CLOUD RUN DEPLOYMENT]
F2 --> H1
H1 --> H2 --> H3 --> H4 --> F4
end
class P8 phase
%% FINAL USER FLOW
User([User Query]):::component --> F4 --> F2 --> E1 --> F1 --> F2 --> UserResponse([Final Answer]):::component
The backend handles three responsibilities:

Retriever:
- Loads rbc_faiss.index + rbc_metadata.parquet
- Encodes the user query with the same MiniLM model
- Performs normalized inner-product search
- Returns top-k FAQs with metadata + scores

Generator:
- Lazy-loaded on the first request for performance
- Uses a strict grounded prompt template
- Returns concise, context-bound answers only
- Responds with "I don't know." when retrieval is insufficient

Endpoints:
- GET /health → model + index status
- GET /ask → retrieval + generation
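A simplified sketch of what these endpoints can look like; the real backend is src/api/main.py, and details such as lazy loading and the exact response shape may differ:

```python
# Simplified sketch of the /health and /ask endpoints (stand-in for src/api/main.py).
from fastapi import FastAPI

from retrieval.search_engine import RbcRetriever
from generation.generator import generate_answer

app = FastAPI()
retriever = RbcRetriever()  # loads rbc_faiss.index + rbc_metadata.parquet

@app.get("/health")
def health():
    # Report readiness of the index, metadata, and generator.
    return {"status": "ok"}

@app.get("/ask")
def ask(query: str, top_k: int = 3):
    # 1) top-k semantic retrieval, 2) grounded prompt, 3) Phi-3 Mini generation
    docs = retriever.search(query, top_k=top_k)
    answer = generate_answer(query, docs["answer"].tolist())
    return {"answer": answer, "evidence": docs.to_dict(orient="records")}
```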
Cloudflare tunnel exposes this backend at:
https://<something>.trycloudflare.com
Saved automatically to:
/content/rag-project/rag_llm_url.txt
Streamlit provides a clean conversational UI:
- Chat interface with markdown rendering
- Persisted session history
- Automatic backend URL loading from rag_llm_url.txt
- FAQ evidence viewer in sidebar
- Fully compatible with Cloudflare tunnels
Exposed at:
https://<something>.trycloudflare.com
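A condensed sketch of the chat loop (a stand-in for src/frontend/chat_ui.py; the /ask response shape assumed here is {"answer": ..., "evidence": [...]}):

```python
# Condensed sketch of the Streamlit chat flow; the real UI is src/frontend/chat_ui.py.
from pathlib import Path
import requests
import streamlit as st

backend_url = Path("rag_llm_url.txt").read_text().strip()

st.title("RBC FAQ Assistant")
if "history" not in st.session_state:
    st.session_state.history = []  # conversation preserved per session

question = st.chat_input("Ask a banking question")
if question:
    resp = requests.get(f"{backend_url}/ask", params={"query": question}, timeout=120).json()
    st.session_state.history.append((question, resp.get("answer", "I don't know.")))
    with st.sidebar:
        st.subheader("Retrieved FAQ evidence")
        for doc in resp.get("evidence", []):
            st.markdown(f"**{doc.get('question', '')}**\n\n{doc.get('answer', '')}")

for q, a in st.session_state.history:
    st.chat_message("user").markdown(q)
    st.chat_message("assistant").markdown(a)
```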
The system uses two separate tunnels:
- Backend Tunnel → FastAPI (port 8000): `cloudflared tunnel --url http://localhost:8000`
- Frontend Tunnel → Streamlit (port 8501): `cloudflared tunnel --url http://localhost:8501`
The RAG system architecture emphasizes:

- Answers always derived from real FAQ context
- All steps scripted in rag_full_pipeline_colab.ipynb
- Clear separation of ingestion, preprocessing, embeddings, backend, and frontend
- Backend and frontend already compatible with Cloud Run, Streamlit Cloud, and Terraform
- A strict anti-hallucination prompt + explicit fallback
This RAG system is built using a clean, modular, and cloud-ready stack that supports end-to-end retrieval, grounding, generation, and deployment.
| Component | Technology |
|---|---|
| Programming Language | Python 3.12 |
| Primary Runtime | Google Colab (GPU-enabled) |
| Cloud Tunneling | Cloudflare Quick Tunnels (cloudflared) |
| Layer | Tools & Libraries |
|---|---|
| Framework | FastAPI |
| Server | Uvicorn |
| RAG Pipeline | Custom retrieval → prompt building → Phi-3 inference |
| HTTP Endpoints | /health, /ask |
| Cloud Exposure | Cloudflare Tunnel (*.trycloudflare.com) |
| Component | Technology |
|---|---|
| UI Framework | Streamlit |
| Features | Chat interface, sidebar evidence viewer, session history |
| Backend URL Autoload | Reads from rag_llm_url.txt |
| Public URL | Cloudflare Streamlit tunnel (port 8501) |
| Layer | Tools |
|---|---|
| Embeddings Model | sentence-transformers/all-MiniLM-L6-v2 |
| Vector Database | FAISS (GPU-accelerated, cosine similarity) |
| Metadata Storage | Parquet files (rbc_metadata.parquet) |
| Embeddings Format | rbc_embeddings.npy |
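A minimal sketch of how normalized MiniLM embeddings and a cosine-style FAISS index fit together; the real scripts are src/embeddings/generate_embeddings.py and src/embeddings/build_faiss_index.py, and the input path and column name used here are illustrative:

```python
# Minimal sketch: MiniLM embeddings + inner-product FAISS index over normalized vectors.
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

faqs = pd.read_parquet("data/processed/rbc_faqs_refined.parquet")  # illustrative path
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Normalizing embeddings makes inner-product search equivalent to cosine similarity.
embeddings = model.encode(
    faqs["question"].tolist(),        # column name assumed for illustration
    normalize_embeddings=True,
    show_progress_bar=True,
).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

np.save("data/index/rbc_embeddings.npy", embeddings)
faiss.write_index(index, "data/index/rbc_faiss.index")
faqs.to_parquet("data/index/rbc_metadata.parquet")
```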
| Layer | Technology |
|---|---|
| Model | microsoft/Phi-3-mini-4k-instruct |
| Precision | float16 on GPU |
| Inference | HuggingFace Transformers pipeline (text-generation) |
| Grounding Strategy | Strict context-only answers + "I don't know" fallback |
| Step | Tools |
|---|---|
| Scraping | Playwright + BeautifulSoup4 |
| Cleaning | clean_rbc_faqs.py |
| Splitting | split_compound_faqs.py |
| Inspection | inspect_dataset.py |
| File Formats | JSON, Markdown, Parquet, NumPy |
| Purpose | Tools |
|---|---|
| Progress bars | tqdm |
| Cloud token management | Colab Secrets Storage |
| Logging | runtime logs stored under logs/ |
| Layer | Tools |
|---|---|
| Repository | Git + GitHub |
| Authentication | GitHub PAT (stored in Colab Secrets) |
| MLOps Readiness | Modular structure ready for CI/CD + Cloud Run |
The system is structured for upgrades to:
- GCP Cloud Run (FastAPI container)
- GCS (vector store + dataset hosting)
- Streamlit Cloud (UI hosting)
- Terraform (infrastructure-as-code)
- GitHub Actions (CI/CD)
Planned for Sprints 6 and 7.
The project follows a clean, modular layout designed for RAG systems, cloud deployment, and Colab execution.
    rag-project/
    ├── rag_llm_url.txt                 # Auto-saved backend URL (Cloudflare)
    ├── backend.log                     # FastAPI runtime logs
    ├── frontend.log                    # Streamlit runtime logs
    ├── requirements.txt                # Core dependencies
    ├── README.md                       # Project documentation
    ├── logs/
    │   └── scrape_rbc.log              # Ingestion/scraper logs
    │
    ├── data/                           # (Drive-linked) persistent dataset storage
    │   ├── raw/                        # Raw scraped RBC FAQ markdown/HTML
    │   ├── processed/                  # Cleaned + split FAQ parquet files
    │   └── index/                      # FAISS index + embeddings + metadata
    │
    ├── src/
    │   ├── api/
    │   │   └── main.py                 # FastAPI RAG backend (/ask, /health)
    │   │
    │   ├── frontend/
    │   │   ├── chat_ui.py              # Streamlit chat interface (loads rag_llm_url.txt)
    │   │   ├── static/
    │   │   │   └── style.css           # UI styles (custom)
    │   │   └── templates/
    │   │       └── index.html          # Reserved (rarely used)
    │   │
    │   ├── ingestion/
    │   │   ├── scrape_rbc_faqs.py      # Playwright-based scraper
    │   │   ├── rbc_urls.txt            # Source URLs (RBC FAQ pages)
    │   │   └── test_playwright_visual.py  # Visual debugging for Playwright
    │   │
    │   ├── preprocess/
    │   │   ├── clean_rbc_faqs.py       # HTML → Markdown → clean text
    │   │   ├── split_compound_faqs.py  # Split FAQs with multiple Q/A pairs
    │   │   └── inspect_dataset.py      # Dataset inspection + reporting
    │   │
    │   ├── embeddings/
    │   │   ├── generate_embeddings.py  # MiniLM embedding generation
    │   │   └── build_faiss_index.py    # FAISS index creation & validation
    │   │
    │   ├── retrieval/
    │   │   └── search_engine.py        # RbcRetriever (FAISS search logic)
    │   │
    │   ├── generation/
    │   │   └── generator.py            # Phi-3 Mini generation (prompt + pipeline)
    │   │
    │   ├── tests/
    │   │   └── test_scraper_integrity.py  # Integrity checks for ingestion
    │   │
    │   └── utils/
    │       └── (empty or helper files) # Reserved for future utilities
    │
    └── .gitignore                      # Optimized git ignores

Backend (src/api/main.py) contains:
- /health
- /ask (retrieval → RAG answer)
Reads FAISS index + MiniLM embeddings + Phi-3 generator.
Frontend (src/frontend/chat_ui.py):

- Chat interface
- Sidebar evidence viewer
- Automatically loads the backend URL from rag_llm_url.txt
- Runs on port 8501 in Colab
Embedding scripts (src/embeddings/):

- Generate MiniLM embeddings
- Store .npy files
- Create the FAISS index + metadata parquet
The data/ directory is persisted inside Google Drive:

- raw/ → scraped source files
- processed/ → cleaned and split parquet files
- index/ → FAISS index, metadata, embeddings
Ensures data is saved across Colab runtimes.
Generator (src/generation/generator.py) handles Phi-3 Mini loading + the prompt template:
- Grounded answers only
- Strict anti-hallucination rule
- Deterministic generation
Retriever (src/retrieval/search_engine.py):

- Normalized cosine similarity
- Top-k document retrieval
- Returns score + question + answer + url
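A simplified sketch of the search step (the real logic lives in RbcRetriever inside src/retrieval/search_engine.py; the metadata column names are assumptions):

```python
# Simplified sketch of the retriever's search logic; column names are illustrative.
import faiss
import pandas as pd
from sentence_transformers import SentenceTransformer


class RbcRetrieverSketch:
    def __init__(self, index_dir: str = "data/index"):
        self.index = faiss.read_index(f"{index_dir}/rbc_faiss.index")
        self.metadata = pd.read_parquet(f"{index_dir}/rbc_metadata.parquet")
        self.encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def search(self, query: str, top_k: int = 3) -> pd.DataFrame:
        # Normalized query vector + inner-product index == cosine similarity ranking.
        vec = self.encoder.encode([query], normalize_embeddings=True).astype("float32")
        scores, ids = self.index.search(vec, top_k)
        hits = self.metadata.iloc[ids[0]].copy()
        hits["score"] = scores[0]
        return hits[["score", "question", "answer", "url"]]
```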
rag_llm_url.txt is an auto-created file storing:
https://<backend>.trycloudflare.com
The Streamlit UI reads this to know where the backend is hosted.
You may later add:
- /monitoring/ (WhyLogs/Evidently)
- /deploy/ (Terraform, Cloud Run configs)
- /notebooks/ (colab_demo, EDA notebooks)

These directories do not exist yet; the structure above reflects the current repository.
| Sprint | Description | Key Deliverables | Status | Progress |
|---|---|---|---|---|
| Sprint 1 - Ingestion | Scrape RBC FAQ webpages using Playwright | scrape_rbc_faqs.py, raw .md/.html, logs | ✅ Completed | 🟩 100% |
| Sprint 2 - Preprocessing | Clean text, normalize formatting, split compound FAQs | Clean parquet files, dataset report | ✅ Completed | 🟩 100% |
| Sprint 3 - Embeddings + FAISS | Generate MiniLM embeddings & build FAISS index | faiss.index, metadata parquet, semantic search | ✅ Completed | 🟩 100% |
| Sprint 4 - Backend (RAG API) | FastAPI server with Retriever + Phi-3 Generator + Cloudflare tunnel | /ask & /health, Cloudflare public URL, rag_llm_url.txt | ✅ Completed | 🟩 100% |
| Sprint 5 - Frontend (Streamlit UI) | Chat app interacting with backend via rag_llm_url.txt | chat_ui.py + local test | ✅ Completed | 🟩 100% |
| Sprint 5.1 - Public Streamlit URL (Cloudflare) | Cloudflare tunnel for UI (8501) | Public Streamlit link | ✅ Completed | 🟩 100% |
| Sprint 6 - Monitoring & Observability | WhyLogs + Evidently dashboards; usage logging; latency stats | Coming soon | 🚧 In Progress | 🟨 40% |
| Sprint 7 - Cloud Deployment (GCP) | Cloud Run (Backend & UI), Cloud Storage, Terraform + CI/CD | Planned | ⏳ Not Started | ⬜ 20% |
| Sprint 8 - Optimization | Model quantization, retrieval performance tuning | Planned | ⏳ Not Started | ⬜ |
| Sprint 9 - Multi-Bank Expansion | Add TD, CIBC, BMO, Scotiabank pipelines | Planned | ⏳ Not Started | ⬜ |

🟩🟩🟩🟩🟩🟩🟩⬜⬜⬜ 75% Complete
This guide shows exactly how to run the full RAG system (backend, RAG pipeline, and Streamlit UI) inside Google Colab, with public URLs served through Cloudflare Tunnels (no domain required).
This is the official, stable flow for this project.
Clone the repository:

    !git clone https://github.com/JDede1/rag-project.git
    %cd /content/rag-project

Install dependencies:

    !pip install -U pip --quiet
    !pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --quiet
    !pip install -U transformers sentence-transformers accelerate safetensors --quiet
    !pip install -U faiss-gpu-cu12 pandas numpy pyarrow fastparquet --quiet
    !pip install -U beautifulsoup4 markdownify tqdm playwright --quiet
    !pip install -U fastapi uvicorn httpx --quiet
    !pip install -U bitsandbytes --quiet
    !playwright install-deps > /dev/null
    !playwright install chromium > /dev/null

Mount Google Drive for persistent storage:

    from google.colab import drive
    drive.mount("/content/drive")

Run the full end-to-end data processing pipeline:
    python src/ingestion/scrape_rbc_faqs.py
    python src/preprocess/clean_rbc_faqs.py
    python src/preprocess/split_compound_faqs.py
    python src/preprocess/inspect_dataset.py data/processed/rbc_faqs_refined.parquet --report
    python src/embeddings/generate_embeddings.py
    python src/embeddings/build_faiss_index.py

Artifacts will be written to:

    /content/rag-project/data/processed/
    /content/rag-project/data/index/
Confirm the pipeline works before exposing endpoints:
    from retrieval.search_engine import RbcRetriever
    from generation.generator import generate_answer

    retriever = RbcRetriever()
    query = "How do I report a lost credit card?"
    docs = retriever.search(query, top_k=3)
    context = docs["answer"].tolist()
    print(generate_answer(query, context))

Start the FastAPI backend:

    !pkill -f uvicorn || true
    !nohup uvicorn src.api.main:app --host 0.0.0.0 --port 8000 \
        > backend.log 2>&1 &
    !ps -ef | grep uvicorn

Install cloudflared and open the backend tunnel:

    !wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
    !sudo dpkg -i cloudflared-linux-amd64.deb
    !cloudflared tunnel --url http://localhost:8000 > backend_tunnel.log 2>&1 &
    import time; time.sleep(7)

Extract the backend URL and save it for the frontend:

    import re
    logs = open("backend_tunnel.log").read()
    backend_url = re.findall(r"https://[-a-zA-Z0-9.]+trycloudflare.com", logs)[0]
    with open("rag_llm_url.txt", "w") as f:
        f.write(backend_url)
    print("Backend URL:", backend_url)

Check backend health:

    import requests
    requests.get(f"{backend_url}/health").json()

Start the Streamlit UI:

    !pkill -f streamlit || true
    !nohup streamlit run /content/rag-project/src/frontend/chat_ui.py \
        --server.port 8501 \
        --server.address 0.0.0.0 \
        > frontend.log 2>&1 &

Open the frontend tunnel and grab its public URL:

    !cloudflared tunnel --url http://localhost:8501 > frontend_tunnel.log 2>&1 &
    import time; time.sleep(7)

    logs = open("frontend_tunnel.log").read()
    frontend_url = re.findall(r"https://[-a-zA-Z0-9.]+trycloudflare.com", logs)[0]
    print("Streamlit URL:", frontend_url)

Open the printed Streamlit URL in your browser:

    https://<random-hash>.trycloudflare.com
The pipeline will:
- Send the query → FastAPI backend
- Retrieve top-k relevant FAQ chunks (FAISS)
- Pass the context → Phi-3 Mini
- Generate a grounded answer
- Return both:
  - the answer
  - the retrieved evidence (sidebar)

If no relevant context is found, the model answers:

"I don't know."
- Git configured with PAT token via Colab Secrets
- Repo: https://github.com/JDede1/rag-project
- Push workflow:
    from google.colab import userdata
    token = userdata.get("PAT_TOKEN")

    !git config --global user.name "JDede1"
    !git config --global user.email "dev@users.noreply.github.com"

    %cd /content/rag-project
    !git add .
    !git commit -m "Update project"
    !git push https://{token}@github.com/JDede1/rag-project.git main

You can instantly try the full RAG system (retrieval, Phi-3 Mini generation, FastAPI backend, and Streamlit chat UI) directly inside Google Colab, with public URLs provided automatically via Cloudflare Tunnels.
Click below to launch the live demo notebook:
The notebook executes the entire pipeline in one place:
- Clones your GitHub repo
- Installs all dependencies
- Mounts Google Drive for persistent storage
- Scrapes RBC FAQs
- Cleans & preprocesses the data
- Generates embeddings
- Builds a FAISS vector index
- Loads Phi-3 Mini 4K Instruct
- Performs grounded generation with retrieved context
- Enforces "I don't know" when no relevant answer exists
- Starts the RAG backend on port 8000
- Exposes the API using Cloudflare Tunnel
- Auto-saves the backend URL for Streamlit
- A clean chat interface with sidebar evidence viewer
- Also exposed publicly via Cloudflare Tunnel
You get two live URLs:
- Backend API URL (FastAPI)
- Frontend Chat URL (Streamlit UI)
Both remain active as long as the Colab notebook is running.
Once the UI loads, try:
"How do I report a lost credit card?"
The model retrieves matching FAQs and generates a grounded, factual answer using Phi-3 Mini.
The notebook uses Colab Secrets Manager to load:
- Hugging Face token (HF_TOKEN)
- GitHub PAT (PAT_TOKEN), if pushing updates
Nothing is hardcoded.
For best performance:
Runtime → Change Runtime Type → T4 GPU
This demo is ideal if you want to:
- Run the full RAG workflow without local setup
- Test the system publicly from mobile/desktop
- Prototype improvements before deploying to GCP
- Share the live chatbot with others instantly
This project uses Git + GitHub for source control, with development performed inside Google Colab and synchronized with the repository through a secure Personal Access Token (PAT) stored in Colab Secrets.
This workflow ensures:
- Secure, authenticated pushes
- Clean repository (no local logs or Colab artifacts)
- Consistent structure between Colab and GitHub
- Reproducible development after each Colab restart
Before pushing changes, configure Git identity (anonymous/default-friendly):
    !git config --global user.name "rag-project-dev"
    !git config --global user.email "dev@users.noreply.github.com"

This uses a non-identifying username and GitHub's privacy-safe no-reply email.
Your GitHub Personal Access Token (PAT) is safely stored in:
Colab → Settings → Secrets → PAT_TOKEN
Load it securely without exposing it:
    from google.colab import userdata
    token = userdata.get("PAT_TOKEN")

Then commit and push:

    %cd /content/rag-project
    !git add .
    !git commit -m "Update: latest development changes"
    !git push https://{token}@github.com/<YOUR_GITHUB_USERNAME>/rag-project.git main

Replace <YOUR_GITHUB_USERNAME> with your GitHub handle.
- main → stable, production-ready version
- dev (optional) → experimental work
- feat/* → new features
- fix/* → hotfix patches
If GitHub has newer changes:
    %cd /content/rag-project
    !git pull origin main

When Colab restarts:
- Notebook reclones the repo
- All code is restored
- Development continues normally
Thanks to GitHub, no project state is lost.
Monitoring is a critical component of a reliable RAG system. This project adds a structured observability layer to track:
- Data drift (changes in scraped FAQ distributions)
- Embedding drift (FAISS vector distribution shifts)
- Model quality (answer accuracy, grounding quality)
- Latency & throughput (API performance)
- User behavior analytics (optional: anonymized logs)
This ensures that the RAG system remains accurate, stable, and safe even as data evolves.
WhyLogs enables automated logging of:
- Text statistics (length, token counts, entities)
- Input distribution changes
- Drift in the processed FAQ dataset over time
- Embedding vector statistics before FAISS indexing
- Request/response metadata such as:
  - Query length
  - Context length
  - Retrieved FAQ count
Logs are stored in:
data/logs/whylogs/
and can be visualized later using WhyLabs or WhyLogs Python APIs.
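As a hedged example of what that logging could look like (whylogs v1-style API; this hook is Sprint 6 work and is not yet in the repository):

```python
# Minimal sketch of dataset profiling with whylogs (v1 API); planned Sprint 6 work.
import pandas as pd
import whylogs as why

faqs = pd.read_parquet("data/processed/rbc_faqs_refined.parquet")

# Profile the processed FAQ dataset and persist the profile for later drift checks.
results = why.log(pandas=faqs)
results.writer("local").option(base_dir="data/logs/whylogs").write()
```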
Evidently supports interactive dashboards for:
- Detecting changes between:
  - Newly scraped RBC FAQs
  - Previous versions of the dataset
- Comparing embedding distributions across time windows
- Highlighting semantic shifts that may require reindexing FAISS
Even though this is not a classifier, Evidently still tracks:
- LLM answer length
- Similarity between answer and retrieved context
- Hallucination rate (via "context adherence" heuristics)
Dashboards are exported to:
data/reports/evidently/
and rendered in notebooks or Streamlit.
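A minimal sketch of a dataset drift report using Evidently's 0.4-style Report API (also Sprint 6 work; file paths are illustrative):

```python
# Minimal sketch of an Evidently data drift report comparing two FAQ snapshots.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("data/processed/rbc_faqs_refined.parquet")    # previous snapshot
current = pd.read_parquet("data/processed/rbc_faqs_refined_new.parquet")  # newly scraped (illustrative)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data/reports/evidently/faq_drift_report.html")
```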
The FastAPI backend automatically provides:
- /health endpoint
- Latency stats from Cloudflare tunnel logs
- Uvicorn runtime logs
Additionally, Sprint 6 introduces custom middleware to track:
- Request timestamps
- Answer generation time
- Retrieval time (FAISS search latency)
- Total RAG response latency
These metrics can later feed into:
- Prometheus (optional)
- Cloud Run built-in metrics (in Sprint 7)
- Cloud Logging (in Sprint 7)
Each request logs:
    {
      "timestamp": "...",
      "query": "How do I report a lost credit card?",
      "retrieval_time_ms": 12,
      "generation_time_ms": 580,
      "total_time_ms": 612,
      "context_items": 3,
      "model": "Phi-3-mini-4k-instruct"
    }

Logs are saved to:
logs/rag_requests.log
This supports long-term monitoring, debugging, and drift analysis.
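A minimal sketch of the timing middleware described above (planned Sprint 6 work; per-stage retrieval and generation timings would still be recorded inside the /ask handler):

```python
# Minimal sketch of request/latency logging middleware for the FastAPI backend.
# Planned Sprint 6 work; field names mirror the log record shown above.
import json
import time
from datetime import datetime, timezone
from pathlib import Path

from fastapi import FastAPI, Request

app = FastAPI()
LOG_PATH = Path("logs/rag_requests.log")
LOG_PATH.parent.mkdir(parents=True, exist_ok=True)

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    total_ms = int((time.perf_counter() - start) * 1000)

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "path": request.url.path,
        "query": request.query_params.get("query", ""),
        "total_time_ms": total_ms,
        "model": "Phi-3-mini-4k-instruct",
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```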
    User Query
        ↓
    FastAPI Middleware
        ↓
    WhyLogs Data Logging
        ↓
    Evidently Drift Checks
        ↓
    Log Artifacts Stored
        ↓
    Visual Dashboards (Notebook / Streamlit / Cloud)
To complete Sprint 6, the following must be implemented:
| Task | Status |
|---|---|
| WhyLogs logging for dataset and embeddings | ⬜ Pending |
| Request/response logging middleware | ⬜ Pending |
| Evidently dataset drift report | ⬜ Pending |
| Evidently embedding drift report | ⬜ Pending |
| Streamlit monitoring dashboard (optional) | ⬜ Pending |
| Cloud-ready logging structure | ⬜ Pending |
This project is designed for local development in Google Colab, with optional deployment to Google Cloud Run for fully managed, scalable hosting. The deployment path is modular, following the Sprint roadmap:
- Sprint 4-5: Local backend + Streamlit in Colab
- Sprint 5.1: Public URLs using Cloudflare Tunnels
- Sprint 7: Full cloud deployment (FastAPI → Cloud Run, UI → Streamlit Cloud or Cloud Run)
Below are the official deployment options.
Local development happens inside Google Colab and uses:
- FastAPI (Uvicorn) → backend
- Streamlit → frontend
- Cloudflare Tunnels → public URLs
This enables full end-to-end RAG testing entirely in the browser, with no local installations.
    !nohup uvicorn src.api.main:app --host 0.0.0.0 --port 8000 > backend.log 2>&1 &
    !streamlit run src/frontend/chat_ui.py --server.port 8501 --server.address 0.0.0.0
    !cloudflared tunnel --url http://localhost:8000
    !cloudflared tunnel --url http://localhost:8501

This creates public URLs for:
- Backend β FastAPI
- Frontend β Streamlit Chat UI
Both run entirely inside Colab.
Recommended for production.
Cloud Run provides:
- Autoscaling
- Zero-downtime deployments
- HTTPS endpoints
- Built-in logs + metrics
- GPU optional (for model inference)
Build and push the backend image:

    gcloud builds submit --tag gcr.io/PROJECT_ID/rag-backend

Deploy to Cloud Run:

    gcloud run deploy rag-backend \
      --image gcr.io/PROJECT_ID/rag-backend \
      --platform managed \
      --region us-central1 \
      --allow-unauthenticated

You will receive a public HTTPS URL:
https://rag-backend-xxxxx.a.run.app
Set this as your backend URL in Streamlit Cloud or Colab:
rag_llm_url.txt
Deploy the Streamlit UI separately on Streamlit Cloud.

Your repo must contain:

- src/frontend/chat_ui.py
- requirements.txt
- rag_llm_url.txt or a Streamlit secret:

      RAG_BACKEND_URL = "https://rag-backend-xxxxx.a.run.app"

Submit the GitHub repo to Streamlit Cloud, and it becomes publicly accessible.
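One way the UI could resolve the backend URL in both environments is sketched below; the secret name RAG_BACKEND_URL comes from the snippet above, and the file fallback mirrors the Colab flow:

```python
# Sketch: resolve the backend URL from Streamlit secrets (cloud) or rag_llm_url.txt (Colab).
from pathlib import Path
import streamlit as st

def get_backend_url() -> str:
    # Prefer the Streamlit Cloud secret when it is configured...
    if "RAG_BACKEND_URL" in st.secrets:
        return st.secrets["RAG_BACKEND_URL"]
    # ...otherwise fall back to the file written by the Colab notebook.
    return Path("rag_llm_url.txt").read_text().strip()
```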
Production-grade architecture (Sprint 7)
    User Browser
        ↓
    Streamlit Frontend (Cloud Run)
        ↓
    FastAPI Backend (Cloud Run)
        ↓
    FAISS Index + Model (built into backend image)
| Component | Platform |
|---|---|
| FastAPI RAG Backend | Cloud Run |
| Streamlit UI | Cloud Run or Streamlit Cloud |
| CI/CD | GitHub Actions |
| Infrastructure | Terraform / IaC |
This architecture supports:
- Automated builds
- Scalable inference
- Persistent FAISS index
- Secure environment variables
- Managed HTTPS
| Sprint | Deployment Focus |
|---|---|
| Sprint 4-5 | Local development in Colab |
| Sprint 5.1 | Public access via Cloudflare |
| Sprint 6 | Logging and monitoring preparation |
| Sprint 7 | Full cloud deployment (Cloud Run + CI/CD) |
- All embeddings, metadata, and FAISS indexes are bundled into the Docker image during deployment.
- FAQ scraping is performed offline, not during cloud runtime.
- Phi-3 Mini runs inside the backend container using PyTorch CPU or CUDA.
- Cloud Run GPU deployment is supported if needed.
- Add LangChain/LlamaIndex retrieval chains
- Expand FAQ coverage to TD, CIBC, BMO, Scotiabank
Ajibola Dedenuola
Data Scientist · Machine Learning Engineer · MLOps Specialist

M.Sc. Information Science & Machine Learning, University of Arizona
GitHub
This project uses publicly available RBC FAQ content for educational and research purposes. All trademarks and materials belong to RBC Royal Bank.