🧠 RAG Project

A Retrieval-Augmented Generation (RAG) System for Canadian Banking FAQs – starting with RBC

Python · FastAPI · Streamlit · SentenceTransformers · FAISS · Phi-3 · Cloudflare Tunnel · GitHub




📘 Project Overview

This project implements a fully modular, production-ready Retrieval-Augmented Generation (RAG) system focused on Canadian banking FAQs, starting with Royal Bank of Canada (RBC).

The system is designed with accuracy, grounding, reproducibility, and cloud readiness as core principles. It performs end-to-end retrieval and generation using:

πŸ” 1. Semantic Retrieval

  • Embeddings from all-MiniLM-L6-v2 (Sentence Transformers)
  • Normalized vector similarity using FAISS (GPU-accelerated)
  • Clean metadata mapped to the original scraped FAQs
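A minimal sketch of this retrieval scheme (illustrative only – not the project's exact code; the FAQ strings are placeholders). Unit-normalized embeddings searched with FAISS inner product give cosine-similarity retrieval:

import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
faqs = ["How do I report a lost credit card?", "How do I order foreign cash?"]

# Unit-normalized embeddings make inner product equal to cosine similarity
vectors = encoder.encode(faqs, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors.astype("float32"))

query = encoder.encode(["I lost my credit card"], normalize_embeddings=True)
scores, ids = index.search(query.astype("float32"), 2)  # top-k ids + cosine scores
print(ids[0], scores[0])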

🧠 2. Grounded LLM Answer Generation

  • Model: microsoft/Phi-3-mini-4k-instruct

  • Loaded efficiently on Colab GPU (float16, device_map="auto")

  • Custom RAG prompt ensures:

    • factuality
    • context-grounded answers
    • no hallucinations

If the answer is not in the retrieved context, the model responds with:

β€œI don’t know.”

βš™οΈ 3. FastAPI Backend (RAG Pipeline)

Handles:

  • Top-k semantic retrieval
  • Construction of the grounding prompt
  • Calling Phi-3 Mini for final generation
  • /health and /ask endpoints

💬 4. Streamlit Chat UI

  • Chat experience similar to ChatGPT
  • Conversations preserved per session
  • Sidebar displays retrieved FAQ evidence
  • Automatically reads backend URL from rag_llm_url.txt

🌐 5. Cloudflare Tunnels (No Domain Needed)

Both backend and frontend are exposed publicly via Cloudflare quick tunnels, providing:

  • Stable URLs (https://*.trycloudflare.com)
  • No account required
  • Fully compatible with Colab

πŸ“ 6. Google Colab End-to-End Pipeline

The project provides a single notebook (rag_full_pipeline_colab.ipynb) that:

  1. Installs all dependencies

  2. Scrapes and preprocesses data

  3. Builds embeddings + FAISS index

  4. Launches FastAPI backend

  5. Creates a Cloudflare tunnel for backend

  6. Starts Streamlit UI

  7. Creates a Cloudflare tunnel for the UI

  8. Returns two URLs:

    • Backend API URL
    • Public Streamlit Chat Interface

🎯 Why This RAG System Matters

  • Demonstrates a clean, explainable retrieval process
  • Enforces hallucination-safe constraints
  • Provides a complete MLOps-ready pipeline
  • Designed for real banking FAQ use cases
  • Can be extended to other banks (TD, CIBC, BMO, Scotiabank)

This architecture ensures consistent, auditable, and accurate answers from the LLM by grounding every response in validated FAQ data.


🚀 Current Sprint Status

This README reflects the modernized architecture:

  • Data ingestion, cleaning, and preprocessing
  • Embeddings + FAISS
  • Backend (FastAPI)
  • Frontend (Streamlit)
  • Public access (Cloudflare Tunnels)

Monitoring (Sprint 6) and Cloud deployment (Sprint 7) are planned next.


🎯 Project Goals

  • Develop a modular, explainable RAG system that can expand to other banks.
  • Ensure accuracy and prevent hallucinations via contextual grounding.
  • Demonstrate practical MLOps: experiment tracking, CI/CD, monitoring, and cloud readiness.
  • Support both GPU (T4) and CPU inference modes in Colab.

🧠 Model

Primary LLM:

microsoft/Phi-3-mini-4k-instruct

A compact, instruction-tuned model optimized for fast, high-quality inference in Google Colab GPU environments.

Why Phi-3 Mini?

  • Works reliably on Colab GPUs (T4, L4 – 6–8 GB VRAM)
  • Very fast generation (low latency)
  • Instruction-tuned for Q&A and reasoning tasks
  • Ideal for RAG pipelines that require concise, grounded answers

Model Loading (Real Behavior)

The project loads the model with:

  • float16 precision (GPU)
  • device_map="auto" for efficient VRAM usage
  • No quantization – full float16, not 8-bit
  • Tokenizer and model loaded from Hugging Face with HUGGINGFACEHUB_API_TOKEN

Generation Pipeline

Implemented using the Hugging Face Transformers pipeline("text-generation") API:

  • max_new_tokens=256
  • temperature=0.2
  • repetition_penalty=1.1
  • top_p=0.9
  • deterministic generation (no sampling; the sampling parameters above only take effect if sampling is enabled)
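A minimal sketch of this loading and generation setup (illustrative – the project's actual code lives in src/generation/generator.py and may differ in detail):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
# Note: older transformers versions may require trust_remote_code=True for Phi-3.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # float16 on GPU, as described above
    device_map="auto",          # let accelerate place weights on the GPU
)
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.2,        # these sampling knobs are inert unless
    top_p=0.9,              # do_sample=True is set; generation stays
    repetition_penalty=1.1, # effectively deterministic by default
)
prompt = "Question: How do I reset my online banking password?\nAnswer:"
print(generator(prompt)[0]["generated_text"])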

Grounding Logic (Key Safety Feature)

Every response is generated using a strict RAG prompt:

Use ONLY the provided context to answer the question accurately.
If the answer is not in the context, say exactly: "I don’t know."

This enforces:

  • No hallucinations
  • No invented banking policies
  • No fabricated contact numbers
  • No unauthorized assumptions

Prompt Template (Exact Behavior)

You are an expert assistant specializing in Canadian banking FAQs.
Use ONLY the provided context to answer the question accurately.
If the answer is not in the context, say exactly: "I don’t know."

Context:
{retrieved_docs}

Question: {question}
Answer:
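A hypothetical helper showing how such a prompt can be assembled before generation (the real template lives in generator.py):

PROMPT_TEMPLATE = """You are an expert assistant specializing in Canadian banking FAQs.
Use ONLY the provided context to answer the question accurately.
If the answer is not in the context, say exactly: "I don't know."

Context:
{retrieved_docs}

Question: {question}
Answer:"""

def build_prompt(question: str, docs: list[str]) -> str:
    # With empty retrieval, the instructions above force the "I don't know." fallback.
    context = "\n\n".join(docs) if docs else "(no relevant context retrieved)"
    return PROMPT_TEMPLATE.format(retrieved_docs=context, question=question)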

When the model says “I don’t know”

The model produces exactly this string when:

  • The retrieval step returns empty context
  • None of the top-k retrieved answers include relevant information
  • Generation pipeline fails or returns malformed output

Why this matters

This behavior ensures:

  • High integrity of answers
  • Compliance-friendly outputs
  • Zero hallucination tolerance
  • Auditability for regulated use cases (e.g., banking, finance)

🧭 Architecture

flowchart TD

    %% STYLE DEFINITIONS
    classDef phase fill:#f3f2ff,stroke:#4b4bff,stroke-width:1px,color:#000,border-radius:6px
    classDef component fill:#ffffff,stroke:#6b7280,stroke-width:1px,color:#000,border-radius:6px
    classDef cloud fill:#e0f7ff,stroke:#0ea5e9,stroke-width:1px,color:#000,border-radius:6px
    classDef db fill:#fef9c3,stroke:#facc15,stroke-width:1px,color:#000,border-radius:6px

    %% PHASE 1: SCRAPER
    A1([Playwright Scraper]):::component
    A2([Raw RBC FAQ HTML]):::db
    subgraph P1[PHASE 1 – SCRAPING AND VALIDATION]
        A1 --> A2
    end
    class P1 phase

    %% PHASE 2: PREPROCESSING
    B1([clean_rbc_faqs.py]):::component
    B2([normalize_faqs.py]):::component
    B3([split_compound_faqs.py]):::component
    B4([chunk_text.py]):::component
    B5([rbc_faq_chunks.parquet]):::db

    subgraph P2[PHASE 2 – TEXT PREPROCESSING]
        A2 --> B1 --> B2 --> B3 --> B4 --> B5
    end
    class P2 phase

    %% PHASE 3: EMBEDDINGS + FAISS
    C1([MPNet Encoder]):::component
    C2([generate_embeddings.py]):::component
    C3([build_faiss_index.py]):::component
    C4([rbc_embeddings.npy]):::db
    C5([rbc_faiss.index]):::db
    C6([rbc_metadata.parquet]):::db

    subgraph P3[PHASE 3 – EMBEDDINGS AND FAISS]
        B5 --> C2
        C2 --> C4
        C2 --> C6
        C4 --> C3
        C3 --> C5
    end
    class P3 phase

    %% PHASE 3.5 – ONNX EXPORT
    D1([export_mpnet_onnx.py]):::component
    D2([mpnet.onnx]):::db
    D3([tokenizer and config files]):::db

    subgraph P35[PHASE 3.5 – MPNet TO ONNX]
        C1 --> D1 --> D2
        D1 --> D3
    end
    class P35 phase

    %% PHASE 4: HYBRID RETRIEVER
    E1([Hybrid RbcRetriever]):::component
    E2([Local MPNet Encoder]):::component
    E3([ONNXRuntime MPNet Encoder]):::component
    E4([FAISS High Recall Search]):::component

    subgraph P4[PHASE 4 – HYBRID RETRIEVER]
        C5 --> E4
        C6 --> E1
        E1 --> E4

        E1 -->|DEPLOY_ENV local| E2
        E1 -->|DEPLOY_ENV cloud| E3

        D2 --> E3
        D3 --> E3
    end
    class P4 phase

    %% PHASE 5–6: GENERATOR + BACKEND
    F1([Strict Literal Generator]):::component
    F2([FastAPI Backend]):::component
    F3([Cloudflare Tunnel]):::cloud
    F4([Cloud Run Service]):::cloud

    subgraph P56[PHASE 5 AND 6 – GENERATION AND SERVING]
        E1 --> F1
        F1 --> F2
        F2 --> F3 --> F4
    end
    class P56 phase

    %% PHASE 7: MONITORING
    G1([Streamlit Dashboard]):::component
    G2([RAG Logs and Metrics]):::db

    subgraph P7[PHASE 7 – MONITORING AND ANALYTICS]
        F2 --> G2
        G2 --> G1
    end
    class P7 phase

    %% PHASE 8: CLOUD DEPLOYMENT
    H1([Dockerfile]):::component
    H2([cloudbuild.yaml]):::component
    H3([Artifact Registry]):::cloud
    H4([Cloud Run Deploy Script]):::component

    subgraph P8[PHASE 8 – DOCKER BUILD AND CLOUD RUN DEPLOYMENT]
        F2 --> H1
        H1 --> H2 --> H3 --> H4 --> F4
    end
    class P8 phase

    %% FINAL USER FLOW
    User([User Query]):::component --> F4 --> F2 --> E1 --> F1 --> F2 --> UserResponse([Final Answer]):::component

🔹 Backend (FastAPI RAG Pipeline)

The backend handles 3 responsibilities:

1. Retrieval (FAISS)

  • Loads rbc_faiss.index + rbc_metadata.parquet
  • Encodes user query using the same MiniLM model
  • Performs normalized inner product search
  • Returns top-k FAQs with metadata + scores

2. Generation (Phi-3 Mini)

  • Lazy loaded on first request for performance
  • Uses a strict grounded prompt template
  • Returns concise, context-bound answers only
  • Responds with “I don’t know.” when retrieval is insufficient

3. API Endpoints

  • GET /health → model + index status
  • GET /ask → retrieval + generation
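In outline, the two handlers look roughly like this (a standalone sketch – retrieval and generation are stubbed here, and the q/top_k parameter names are assumptions rather than the exact contract of src/api/main.py):

from fastapi import FastAPI

app = FastAPI()

# Stubs standing in for src.retrieval.search_engine and src.generation.generator
def retrieve(query: str, top_k: int) -> list[dict]:
    return [{"question": "...", "answer": "...", "score": 0.93}]

def generate_answer(query: str, context: list[str]) -> str:
    return "I don't know."

@app.get("/health")
def health():
    return {"status": "ok", "index_loaded": True, "model_loaded": True}

@app.get("/ask")
def ask(q: str, top_k: int = 3):
    docs = retrieve(q, top_k)                                 # FAISS top-k retrieval
    answer = generate_answer(q, [d["answer"] for d in docs])  # grounded Phi-3 call
    return {"answer": answer, "evidence": docs}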

Cloudflare tunnel exposes this backend at:

https://<something>.trycloudflare.com

Saved automatically to:

/content/rag-project/rag_llm_url.txt

🔹 Frontend (Streamlit UI)

Streamlit provides a clean conversational UI:

  • Chat interface with markdown rendering
  • Persisted session history
  • Automatic backend URL loading from rag_llm_url.txt
  • FAQ evidence viewer in sidebar
  • Fully compatible with Cloudflare tunnels

Exposed at:

https://<something>.trycloudflare.com
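The core of such a UI can be sketched in a few lines (illustrative only – chat_ui.py's real rendering and the backend's exact query parameters may differ):

import requests
import streamlit as st

# Backend URL auto-saved by the Cloudflare tunnel step
backend_url = open("rag_llm_url.txt").read().strip()

if question := st.chat_input("Ask a banking question"):
    st.chat_message("user").write(question)
    resp = requests.get(f"{backend_url}/ask", params={"q": question}).json()
    st.chat_message("assistant").write(resp["answer"])
    with st.sidebar:
        st.subheader("Retrieved FAQ evidence")
        st.json(resp.get("evidence", []))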

🔹 Cloudflare Tunnel Layer

The system uses two separate tunnels:

  1. Backend Tunnel → FastAPI (port 8000)

    cloudflared tunnel --url http://localhost:8000
    
  2. Frontend Tunnel → Streamlit (port 8501)

    cloudflared tunnel --url http://localhost:8501
    

🔹 Design Principles

The RAG system architecture emphasizes:

✓ Grounding

Answers always derived from real FAQ context.

✓ Reproducibility

All steps scripted in rag_full_pipeline_colab.ipynb.

✓ Modularity

Clear separation of ingestion, preprocessing, embeddings, backend, and frontend.

✓ Cloud Readiness

Backend/frontend already compatible with Cloud Run, Streamlit Cloud, and Terraform.

✓ Safety

Strict anti-hallucination prompt + explicit fallback.



βš™οΈ Tech Stack

This RAG system is built using a clean, modular, and cloud-ready stack that supports end-to-end retrieval, grounding, generation, and deployment.


🔹 Language & Runtime

| Component | Technology |
| --- | --- |
| Programming Language | Python 3.12 |
| Primary Runtime | Google Colab (GPU-enabled) |
| Cloud Tunneling | Cloudflare Quick Tunnels (cloudflared) |

🔹 Backend (RAG API)

| Layer | Tools & Libraries |
| --- | --- |
| Framework | FastAPI |
| Server | Uvicorn |
| RAG Pipeline | Custom retrieval → prompt building → Phi-3 inference |
| HTTP Endpoints | /health, /ask |
| Cloud Exposure | Cloudflare Tunnel (*.trycloudflare.com) |

🔹 Frontend (Chat Interface)

| Component | Technology |
| --- | --- |
| UI Framework | Streamlit |
| Features | Chat interface, sidebar evidence viewer, session history |
| Backend URL Autoload | Reads from rag_llm_url.txt |
| Public URL | Cloudflare Streamlit tunnel (port 8501) |

🔹 Embeddings & Vector Search

| Layer | Tools |
| --- | --- |
| Embeddings Model | sentence-transformers/all-MiniLM-L6-v2 |
| Vector Database | FAISS (GPU-accelerated, cosine similarity) |
| Metadata Storage | Parquet files (rbc_metadata.parquet) |
| Embeddings Format | rbc_embeddings.npy |

🔹 LLM (Generation Engine)

| Layer | Technology |
| --- | --- |
| Model | microsoft/Phi-3-mini-4k-instruct |
| Precision | float16 on GPU |
| Inference | Hugging Face Transformers pipeline (text-generation) |
| Grounding Strategy | Strict context-only answers + "I don’t know" fallback |

🔹 Data Pipeline

| Step | Tools |
| --- | --- |
| Scraping | Playwright + BeautifulSoup4 |
| Cleaning | clean_rbc_faqs.py |
| Splitting | split_compound_faqs.py |
| Inspection | inspect_dataset.py |
| File Formats | JSON, Markdown, Parquet, NumPy |

🔹 Utilities

| Purpose | Tools |
| --- | --- |
| Progress bars | tqdm |
| Cloud token management | Colab Secrets Storage |
| Logging | Runtime logs stored under logs/ |

🔹 Version Control

| Layer | Tools |
| --- | --- |
| Repository | Git + GitHub |
| Authentication | GitHub PAT (stored in Colab Secrets) |
| MLOps Readiness | Modular structure ready for CI/CD + Cloud Run |

🔹 Deployment-Ready

The system is structured for upgrades to:

  • GCP Cloud Run (FastAPI container)
  • GCS (vector store + dataset hosting)
  • Streamlit Cloud (UI hosting)
  • Terraform (infrastructure-as-code)
  • GitHub Actions (CI/CD)

Planned for Sprints 6 and 7.


πŸ—‚οΈ Project Structure

The project follows a clean, modular layout designed for RAG systems, cloud deployment, and Colab execution.

rag-project/
├── rag_llm_url.txt                 # Auto-saved backend URL (Cloudflare)
├── backend.log                     # FastAPI runtime logs
├── frontend.log                    # Streamlit runtime logs
├── requirements.txt                # Core dependencies
├── README.md                       # Project documentation
├── logs/
│   └── scrape_rbc.log              # Ingestion/scraper logs
│
├── data/                           # (Drive-linked) persistent dataset storage
│   ├── raw/                        # Raw scraped RBC FAQ markdown/HTML
│   ├── processed/                  # Cleaned + split FAQ parquet files
│   └── index/                      # FAISS index + embeddings + metadata
│
├── src/
│   ├── api/
│   │   └── main.py                 # FastAPI RAG backend (/ask, /health)
│   │
│   ├── frontend/
│   │   ├── chat_ui.py              # Streamlit chat interface (loads rag_llm_url.txt)
│   │   ├── static/
│   │   │   └── style.css           # UI styles (custom)
│   │   └── templates/
│   │       └── index.html          # Reserved (rarely used)
│   │
│   ├── ingestion/
│   │   ├── scrape_rbc_faqs.py      # Playwright-based scraper
│   │   ├── rbc_urls.txt            # Source URLs (RBC FAQ pages)
│   │   └── test_playwright_visual.py # Visual debugging for Playwright
│   │
│   ├── preprocess/
│   │   ├── clean_rbc_faqs.py       # HTML → Markdown → clean text
│   │   ├── split_compound_faqs.py  # Split FAQs with multiple Q/A pairs
│   │   └── inspect_dataset.py      # Dataset inspection + reporting
│   │
│   ├── embeddings/
│   │   ├── generate_embeddings.py  # MiniLM embedding generation
│   │   └── build_faiss_index.py    # FAISS index creation & validation
│   │
│   ├── retrieval/
│   │   └── search_engine.py        # RbcRetriever (FAISS search logic)
│   │
│   ├── generation/
│   │   └── generator.py            # Phi-3 Mini generation (prompt + pipeline)
│   │
│   ├── tests/
│   │   └── test_scraper_integrity.py # Integrity checks for ingestion
│   │
│   └── utils/
│       └── (empty or helper files) # Reserved for future utilities
│
└── .gitignore                      # Optimized git ignores

πŸ” Key Folder Highlights

1. /src/api – FastAPI Backend

Contains:

  • /health
  • /ask (retrieval → RAG answer)

Reads FAISS index + MiniLM embeddings + Phi-3 generator.


2. /src/frontend – Streamlit Chat UI

  • Chat interface
  • Sidebar evidence viewer
  • Automatically loads backend from rag_llm_url.txt
  • Runs on port 8501 in Colab

3. /src/embeddings – Dense Vector Generation

  • Generates MiniLM embeddings
  • Stores .npy files
  • Creates FAISS index + metadata parquet

4. /data – Persistent Workspace

Persisted inside Google Drive:

  • raw/ → scraped source files
  • processed/ → cleaned and split parquet files
  • index/ → FAISS index, metadata, embeddings

Ensures data is saved across Colab runtimes.


5. /src/generation – LLM Answer Engine

Phi-3 Mini loading + prompt template:

  • Grounded answers only
  • Strict anti-hallucination rule
  • Deterministic generation

6. /src/retrieval – Semantic Search

  • Normalized cosine similarity
  • Top-k document retrieval
  • Returns score + question + answer + url

7. rag_llm_url.txt

Auto-created file storing:

https://<backend>.trycloudflare.com

The Streamlit UI reads this to know where the backend is hosted.


🟦 Optional Future Additions (Sprints 6–7)

You may later add:

  • /monitoring/ (WhyLogs/Evidently)
  • /deploy/ (Terraform, Cloud Run configs)
  • /notebooks/ (colab_demo, EDA notebooks)

These folders do not exist yet; the README lists only what is implemented.


| Sprint | Description | Key Deliverables | Status | Progress |
| --- | --- | --- | --- | --- |
| Sprint 1 – Ingestion | Scrape RBC FAQ webpages using Playwright | scrape_rbc_faqs.py, raw .md/.html, logs | ✅ Completed | 🟩 100% |
| Sprint 2 – Preprocessing | Clean text, normalize formatting, split compound FAQs | Clean parquet files, dataset report | ✅ Completed | 🟩 100% |
| Sprint 3 – Embeddings + FAISS | Generate MiniLM embeddings & build FAISS index | faiss.index, metadata parquet, semantic search | ✅ Completed | 🟩 100% |
| Sprint 4 – Backend (RAG API) | FastAPI server with Retriever + Phi-3 Generator + Cloudflare tunnel | /ask & /health, Cloudflare public URL, rag_llm_url.txt | ✅ Completed | 🟩 100% |
| Sprint 5 – Frontend (Streamlit UI) | Chat app interacting with backend via rag_llm_url.txt | chat_ui.py + local test | ✅ Completed | 🟩 100% |
| Sprint 5.1 – Public Streamlit URL (Cloudflare) | Cloudflare tunnel for UI (8501) | Public Streamlit link | ✅ Completed | 🟩 100% |
| Sprint 6 – Monitoring & Observability | WhyLogs + Evidently dashboards; usage logging; latency stats | Coming soon | 🚧 In Progress | 🟨 40% |
| Sprint 7 – Cloud Deployment (GCP) | Cloud Run (Backend & UI), Cloud Storage, Terraform + CI/CD | Planned | ⏳ Not Started | ⬜ 20% |
| Sprint 8 – Optimization | Model quantization, retrieval performance tuning | Planned | ⏳ Not Started | ⬜ |
| Sprint 9 – Multi-Bank Expansion | Add TD, CIBC, BMO, Scotiabank pipelines | Planned | ⏳ Not Started | ⬜ |

🧩 Overall Project Completion:

🟩🟩🟩🟩🟩🟩🟩⬜⬜⬜ 75% Complete


🚀 Running the Project (Colab + Cloudflare)

This guide shows exactly how to run the full RAG system – backend + RAG pipeline + Streamlit UI – inside Google Colab, with public URLs served through Cloudflare Tunnels (no domain required).

This is the official, stable flow for this project.


1️⃣ Environment Setup

Clone the Repository

!git clone https://github.com/JDede1/rag-project.git
%cd /content/rag-project

Install All Dependencies

!pip install -U pip --quiet
!pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --quiet
!pip install -U transformers sentence-transformers accelerate safetensors --quiet
!pip install -U faiss-gpu-cu12 pandas numpy pyarrow fastparquet --quiet
!pip install -U beautifulsoup4 markdownify tqdm playwright --quiet
!pip install -U fastapi uvicorn httpx --quiet
!pip install -U bitsandbytes --quiet
!playwright install-deps > /dev/null
!playwright install chromium > /dev/null

Mount Drive (for persistent storage)

from google.colab import drive
drive.mount("/content/drive")

2️⃣ Data Pipeline (Ingestion → Preprocessing → Embeddings → FAISS)

Run the full end-to-end data processing pipeline:

!python src/ingestion/scrape_rbc_faqs.py
!python src/preprocess/clean_rbc_faqs.py
!python src/preprocess/split_compound_faqs.py
!python src/preprocess/inspect_dataset.py data/processed/rbc_faqs_refined.parquet --report
!python src/embeddings/generate_embeddings.py
!python src/embeddings/build_faiss_index.py

Artifacts will be written to:

/content/rag-project/data/processed/
/content/rag-project/data/index/

3️⃣ Optional: Local LLM Test (Retrieval + Phi-3)

Confirm the pipeline works before exposing endpoints:

import sys
sys.path.append("/content/rag-project/src")  # make the src modules importable

from retrieval.search_engine import RbcRetriever
from generation.generator import generate_answer

retriever = RbcRetriever()
query = "How do I report a lost credit card?"

docs = retriever.search(query, top_k=3)
context = docs["answer"].tolist()

print(generate_answer(query, context))

4️⃣ Start the Backend (FastAPI)

Stop previous servers

!pkill -f uvicorn || true

Start FastAPI on port 8000

!nohup uvicorn src.api.main:app --host 0.0.0.0 --port 8000 \
    > backend.log 2>&1 &

Verify it is running

!ps -ef | grep uvicorn

5️⃣ Expose Backend via Cloudflare (Public API URL)

Install cloudflared

!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!sudo dpkg -i cloudflared-linux-amd64.deb

Start the tunnel

!cloudflared tunnel --url http://localhost:8000 > backend_tunnel.log 2>&1 &
import time; time.sleep(7)

Extract + Save the public backend URL

import re

logs = open("backend_tunnel.log").read()
backend_url = re.findall(r"https://[-a-zA-Z0-9.]+trycloudflare.com", logs)[0]

with open("rag_llm_url.txt", "w") as f:
    f.write(backend_url)

print("Backend URL:", backend_url)

Health check

import requests
requests.get(f"{backend_url}/health").json()

6️⃣ Start Streamlit UI (Local Runtime)

Stop previous instances

!pkill -f streamlit || true

Launch Streamlit

!nohup streamlit run /content/rag-project/src/frontend/chat_ui.py \
    --server.port 8501 \
    --server.address 0.0.0.0 \
    > frontend.log 2>&1 &

7️⃣ Expose Streamlit UI via Cloudflare (Public Chatbot URL)

Create the tunnel

!cloudflared tunnel --url http://localhost:8501 > frontend_tunnel.log 2>&1 &
import time; time.sleep(7)

Extract public frontend URL

logs = open("frontend_tunnel.log").read()
frontend_url = re.findall(r"https://[-a-zA-Z0-9.]+trycloudflare.com", logs)[0]
print("Streamlit URL:", frontend_url)

8️⃣ Full System Test (End-to-End)

✔ Visit your public Streamlit Chatbot URL

https://<random-hash>.trycloudflare.com

✔ Ask any question

The pipeline will:

  1. Send query → FastAPI backend

  2. Retrieve top-k relevant FAQ chunks (FAISS)

  3. Pass context → Phi-3 Mini

  4. Generate grounded answer

  5. Return both:

    • answer
    • retrieved evidence (sidebar)

✔ If no grounding → the answer becomes:

“I don’t know.”


☁️ Deployment & Version Control

from google.colab import userdata
token = userdata.get("PAT_TOKEN")

!git config --global user.name "JDede1"
!git config --global user.email "dev@users.noreply.github.com"
%cd /content/rag-project
!git add .
!git commit -m "Update project"
!git push https://{token}@github.com/JDede1/rag-project.git main

🚀 Quick Demo (Colab)

You can instantly try the full RAG system – including retrieval, Phi-3 Mini generation, FastAPI backend, and Streamlit chat UI – directly inside Google Colab, with public URLs provided automatically via Cloudflare Tunnels.

Click below to launch the live demo notebook:

Open in Colab


🎯 What the Demo Runs Automatically

The notebook executes the entire pipeline in one place:

1. Environment Setup

  • Clones the GitHub repo
  • Installs all dependencies
  • Mounts Google Drive for persistent storage

2. Data Pipeline

  • Scrapes RBC FAQs
  • Cleans & preprocesses the data
  • Generates embeddings
  • Builds a FAISS vector index

3. RAG Engine

  • Loads Phi-3 Mini 4K Instruct
  • Performs grounded generation with retrieved context
  • Enforces “I don’t know” when no relevant answer exists

4. FastAPI Backend

  • Starts the RAG backend on port 8000
  • Exposes the API using Cloudflare Tunnel
  • Auto-saves the backend URL for Streamlit

5. Streamlit Chat UI

  • A clean chat interface with sidebar evidence viewer
  • Also exposed publicly via Cloudflare Tunnel

6. End-to-End Demo

You get two live URLs:

✔ Backend API URL (FastAPI)
✔ Frontend Chat URL (Streamlit UI)

Both remain active as long as the Colab notebook is running.


🧪 Example Usage

Once the UI loads, try:

“How do I report a lost credit card?”

The model retrieves matching FAQs and generates a grounded, factual answer using Phi-3 Mini.


🔒 No Secrets Stored

The notebook uses Colab Secrets Manager to load:

  • Hugging Face token (HF_TOKEN)
  • GitHub PAT (PAT_TOKEN) if pushing updates

Nothing is hardcoded.


🟦 Recommended: GPU Runtime

For best performance:

Runtime → Change Runtime Type → T4 GPU

🧩 Best For

This demo is ideal if you want to:

  • Run the full RAG workflow without local setup
  • Test the system publicly from mobile/desktop
  • Prototype improvements before deploying to GCP
  • Share the live chatbot with others instantly

🔄 Version Control & GitHub Workflow

This project uses Git + GitHub for source control, with development performed inside Google Colab and synchronized with the repository through a secure Personal Access Token (PAT) stored in Colab Secrets.

This workflow ensures:

  • ✔ Secure, authenticated pushes
  • ✔ Clean repository (no local logs or Colab artifacts)
  • ✔ Consistent structure between Colab and GitHub
  • ✔ Reproducible development after each Colab restart

🧭 Git Workflow (Colab → GitHub)

Before pushing changes, configure Git identity (anonymous/default-friendly):

!git config --global user.name "rag-project-dev"
!git config --global user.email "dev@users.noreply.github.com"

This uses a non-identifying username and GitHub’s privacy-safe no-reply email.


πŸ” Secure GitHub Authentication (PAT)

Your GitHub Personal Access Token (PAT) is safely stored in:

Colab β†’ Settings β†’ Secrets β†’ PAT_TOKEN

Load it securely without exposing it:

from google.colab import userdata
token = userdata.get("PAT_TOKEN")

🚀 Commit & Push Workflow

from google.colab import userdata
token = userdata.get("PAT_TOKEN")

%cd /content/rag-project

!git add .
!git commit -m "Update: latest development changes"

!git push https://{token}@github.com/<YOUR_GITHUB_USERNAME>/rag-project.git main

Replace <YOUR_GITHUB_USERNAME> with your GitHub handle.


🧬 Branch Strategy

  • main – Stable, production-ready version
  • dev (optional) – Experimental work
  • feat/* – New features
  • fix/* – Hotfix patches

🔄 Syncing with Remote

If GitHub has newer changes:

%cd /content/rag-project
!git pull origin main

πŸ›‘οΈ After Colab Reset

When Colab restarts:

  1. Notebook reclones the repo
  2. All code is restored
  3. Development continues normally

Thanks to GitHub, no project state is lost.


📈 Monitoring & Observability (Sprint 6)

Monitoring is a critical component of a reliable RAG system. Sprint 6 introduces a structured observability layer to track:

  • Data drift (changes in scraped FAQ distributions)
  • Embedding drift (FAISS vector distribution shifts)
  • Model quality (answer accuracy, grounding quality)
  • Latency & throughput (API performance)
  • User behavior analytics (optional: anonymized logs)

This ensures that the RAG system remains accurate, stable, and safe even as data evolves.


πŸ” Key Monitoring Components

1. WhyLogs (Data & Statistical Monitoring)

WhyLogs enables automated logging of:

  • Text statistics (length, token counts, entities)

  • Input distribution changes

  • Drift in processed FAQ dataset over time

  • Embedding vector statistics before FAISS indexing

  • Request/response metadata such as:

    • Query length
    • Context length
    • Retrieved FAQ count

Logs are stored in:

data/logs/whylogs/

and can be visualized later using WhyLabs or WhyLogs Python APIs.
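A minimal sketch of such profiling (assuming the whylogs v1 Python API; the parquet path and column name are taken from this project's layout but may differ):

import pandas as pd
import whylogs as why

df = pd.read_parquet("data/processed/rbc_faqs_refined.parquet")
df["question_len"] = df["question"].str.len()  # simple text statistic to track

results = why.log(df)  # profile the dataset
results.writer("local").option(base_dir="data/logs/whylogs").write()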


2. Evidently AI (Drift Dashboards)

Evidently supports interactive dashboards for:

Dataset Drift

  • Detecting changes between:

    • Newly scraped RBC FAQs
    • Previous versions of the dataset

Embedding Drift

  • Comparing embedding distributions across time windows
  • Highlighting semantic shifts that may require reindexing FAISS

Prediction Quality Tracking

Even though this is not a classifier, Evidently still tracks:

  • LLM answer length
  • Similarity between answer and retrieved context
  • Hallucination rate (via “context adherence” heuristics – see the sketch below)
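One simple context-adherence heuristic is embedding similarity between the generated answer and its retrieved context, reusing the same MiniLM encoder (a sketch of the idea, not an implemented module):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def context_adherence(answer: str, context: str) -> float:
    # Cosine similarity in embedding space; low values flag possible hallucination
    vecs = encoder.encode([answer, context], normalize_embeddings=True)
    return float(util.cos_sim(vecs[0], vecs[1]))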

Dashboards are exported to:

data/reports/evidently/

and rendered in notebooks or Streamlit.
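A dataset drift report in this layout could be generated like so (a sketch assuming Evidently's 0.4-style Report API; the snapshot file names are hypothetical):

import os
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("data/processed/rbc_faqs_previous.parquet")  # earlier snapshot
current = pd.read_parquet("data/processed/rbc_faqs_refined.parquet")     # latest scrape

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

os.makedirs("data/reports/evidently", exist_ok=True)
report.save_html("data/reports/evidently/dataset_drift.html")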


3. Backend Health & Performance Metrics

The FastAPI backend automatically provides:

  • /health endpoint
  • Latency stats from Cloudflare tunnel logs
  • Uvicorn runtime logs

Additionally, Sprint 6 introduces custom middleware to track:

  • Request timestamps
  • Answer generation time
  • Retrieval time (FAISS search latency)
  • Total RAG response latency
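A sketch of what that middleware could look like (hypothetical – Sprint 6 is still in progress; in the real backend, app would be the existing FastAPI instance from src/api/main.py):

import json
import logging
import os
import time

from fastapi import FastAPI, Request

app = FastAPI()
os.makedirs("logs", exist_ok=True)
logging.basicConfig(filename="logs/rag_requests.log",
                    level=logging.INFO, format="%(message)s")

@app.middleware("http")
async def log_request_latency(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    total_ms = round((time.perf_counter() - start) * 1000, 1)
    logging.info(json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "path": request.url.path,
        "total_time_ms": total_ms,
    }))
    return response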

These metrics can later feed into:

  • Prometheus (optional)
  • Cloud Run built-in metrics (in Sprint 7)
  • Cloud Logging (in Sprint 7)

4. Logging Architecture

Each request logs:

{
  "timestamp": "...",
  "query": "How do I report a lost credit card?",
  "retrieval_time_ms": 12,
  "generation_time_ms": 580,
  "total_time_ms": 612,
  "context_items": 3,
  "model": "Phi-3-mini-4k-instruct"
}

Logs are saved to:

logs/rag_requests.log

This supports long-term monitoring, debugging, and drift analysis.


📊 Monitoring Workflow (Summary)

User Query
   ↓
FastAPI Middleware
   ↓
WhyLogs Data Logging
   ↓
Evidently Drift Checks
   ↓
Log Artifacts Stored
   ↓
Visual Dashboards (Notebook / Streamlit / Cloud)

🧩 Sprint 6 Completion Criteria

To complete Sprint 6, the following must be implemented:

| Task | Status |
| --- | --- |
| WhyLogs logging for dataset and embeddings | ⬜ Pending |
| Request/response logging middleware | ⬜ Pending |
| Evidently dataset drift report | ⬜ Pending |
| Evidently embedding drift report | ⬜ Pending |
| Streamlit monitoring dashboard (optional) | ⬜ Pending |
| Cloud-ready logging structure | ⬜ Pending |

☁️ Deployment

This project is designed for local development in Google Colab, with optional deployment to Google Cloud Run for fully managed, scalable hosting. The deployment path is modular, following the Sprint roadmap:

  • Sprint 4–5: Local backend + Streamlit in Colab
  • Sprint 5.1: Public URLs using Cloudflare Tunnels
  • Sprint 7: Full cloud deployment (FastAPI → Cloud Run, UI → Streamlit Cloud or Cloud Run)

Below are the official deployment options.


🔧 Local Development (Colab)

Local development happens inside Google Colab and uses:

  • FastAPI (Uvicorn) → backend
  • Streamlit → frontend
  • Cloudflare Tunnels → public URLs

This enables full end-to-end RAG testing entirely in the browser, with no local installations.

Run backend locally (Colab)

!nohup uvicorn src.api.main:app --host 0.0.0.0 --port 8000 > backend.log 2>&1 &

Run UI locally (Colab)

!streamlit run src/frontend/chat_ui.py --server.port 8501 --server.address 0.0.0.0

Expose both using Cloudflare Tunnel

!cloudflared tunnel --url http://localhost:8000
!cloudflared tunnel --url http://localhost:8501

This creates public URLs for:

  • Backend → FastAPI
  • Frontend → Streamlit Chat UI

Both run entirely inside Colab.


🔥 Option 1 – Cloud Run (Backend Only)

Recommended for production.

Cloud Run provides:

  • Autoscaling
  • Zero-downtime deployments
  • HTTPS endpoints
  • Built-in logs + metrics
  • GPU optional (for model inference)

Deployment Steps

1️⃣ Build backend Docker image

gcloud builds submit --tag gcr.io/PROJECT_ID/rag-backend

2️⃣ Deploy to Cloud Run

gcloud run deploy rag-backend \
    --image gcr.io/PROJECT_ID/rag-backend \
    --platform managed \
    --region us-central1 \
    --allow-unauthenticated

You will receive a public HTTPS URL:

https://rag-backend-xxxxx.a.run.app

Set this as your backend URL in Streamlit Cloud or Colab:

rag_llm_url.txt

🎨 Option 2 – Streamlit Cloud (Frontend Only)

Deploy the Streamlit UI separately at:

https://share.streamlit.io

Requirements

Your repo must contain:

src/frontend/chat_ui.py
requirements.txt
rag_llm_url.txt or Streamlit secrets

Streamlit Secrets Example

RAG_BACKEND_URL = "https://rag-backend-xxxxx.a.run.app"
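The UI can then resolve its backend URL from either source (an illustrative fallback pattern, not necessarily chat_ui.py's current behavior):

import streamlit as st

# Prefer Streamlit Cloud secrets; fall back to the Colab-era URL file.
backend_url = st.secrets.get("RAG_BACKEND_URL") or open("rag_llm_url.txt").read().strip()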

Submit the GitHub repo to Streamlit Cloud, and it becomes publicly accessible.


🚀 Option 3 – Full Cloud Deployment (Backend + UI)

Production-grade architecture (Sprint 7)

User Browser
   ↓
Streamlit Frontend (Cloud Run)
   ↓
FastAPI Backend (Cloud Run)
   ↓
FAISS Index + Model (built into backend image)

Deployment Summary

| Component | Platform |
| --- | --- |
| FastAPI RAG Backend | Cloud Run |
| Streamlit UI | Cloud Run or Streamlit Cloud |
| CI/CD | GitHub Actions |
| Infrastructure | Terraform / IaC |

This architecture supports:

  • Automated builds
  • Scalable inference
  • Persistent FAISS index
  • Secure environment variables
  • Managed HTTPS

🧩 Deployment Guidance by Sprint

| Sprint | Deployment Focus |
| --- | --- |
| Sprint 4–5 | Local development in Colab |
| Sprint 5.1 | Public access via Cloudflare |
| Sprint 6 | Logging and monitoring preparation |
| Sprint 7 | Full cloud deployment (Cloud Run + CI/CD) |

📌 Notes on Data & Model

  • All embeddings, metadata, and FAISS indexes are bundled into the Docker image during deployment.
  • FAQ scraping is performed offline – not during cloud runtime.
  • Phi-3 Mini runs inside the backend container using PyTorch CPU or CUDA.
  • Cloud Run GPU deployment is supported if needed.

💡 Future Enhancements

  • Add LangChain/LlamaIndex retrieval chains
  • Expand FAQ coverage to TD, CIBC, BMO, Scotiabank

πŸ‘¨β€πŸ’» Author

Ajibola Dedenuola Data Scientist Β· Machine Learning Engineer Β· MLOps Specialist

πŸŽ“ M.Sc. Information Science & Machine Learning β€” University of Arizona πŸ”— GitHub


🪪 License

This project uses publicly available RBC FAQ content for educational and research purposes. All trademarks and materials belong to RBC Royal Bank.

