# 📘 Project Overview: FDIC Regulatory Assistant

## 🔍 The Problem
Banks must process a large number of loan applications while ensuring strict
adherence to complex federal regulations. Manually reviewing each loan against
the FDIC Risk Management Supervision (RMS) Manual is time-consuming and prone
to human error.

---

## 💡 The Solution
This project builds an AI-powered regulatory assistant that:
- Reads loan documents
- Extracts key loan information
- Searches only the **FDIC RMS Manual – Section 3.2 (Loans)**
- Provides document-grounded regulatory risk considerations
- Does **not** approve, reject, or determine compliance for loans

The system produces **audit-ready, consistent, and regulator-aligned outputs**
suitable for senior banking officials.

---

# ⚙️ Step-by-Step Notebook Explanation

## 🛠️ Step 1: Environment & Tool Setup
This cell installs the dependencies and patches the runtime so the notebook works in Google Colab.

- Installs the required libraries for AI, PDF processing, OCR, and the web interface
- Pins `uvicorn` to 0.25.0 and applies the `nest_asyncio` patch to avoid event-loop runtime issues in Google Colab

In [None]:
!pip install openai pypdf gradio numpy pytesseract pdf2image pillow nest_asyncio
!apt-get install -y poppler-utils tesseract-ocr
!pip install "uvicorn==0.25.0"

# Apply the Asyncio Patch
import nest_asyncio
nest_asyncio.apply()
print("✅ Environment patched with compatible versions.")

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
poppler-utils is already the newest version (22.02.0-2ubuntu0.12).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
✅ Environment patched with compatible versions.


## 📦 Step 2: Core Library Imports
This cell loads all Python libraries needed for the project.

- Data and AI libraries for numerical processing and language model interaction
- File-handling tools for reading PDFs and images

In [None]:
import os
import pickle
import json
import numpy as np
import gradio as gr
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
from pypdf import PdfReader
from openai import OpenAI
from google.colab import files
from google.colab import userdata

## 🔐 Step 3: Secure Connection & Storage
This cell connects the project to external services securely.

- Mounts Google Drive to store processed data and avoid repeated computation

In [None]:
# --- FIX: MOUNT GOOGLE DRIVE ---
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive Mounted!")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive Mounted!


## 🔑 Step 4: Initialize OpenAI Client

Initializes the OpenAI client using credentials stored securely
in Google Colab's user data (secrets).

Why this is important:
- Keeps API keys out of the code
- Supports clean, secure authentication
- Allows switching base URLs if required


In [None]:
# Setup Client using Google Colab User Data (Secrets)
api_key = userdata.get('API_KEY')
base_url = userdata.get('BASE_URL')

client = OpenAI(
    api_key=api_key,
    base_url=base_url
)

print("✅ Client initialized with custom Base URL.")

✅ Client initialized with custom Base URL.


## 📄 Step 5: Manual Upload & Verification
This cell ensures the regulatory source document is available.

- Checks for the presence of `section3-2.pdf`
- Prompts the user to upload the document if it is missing

This enforces a **single source of truth** for all responses.

In [None]:
import os
import shutil
from google.colab import files

# Expected location of the regulatory document on Google Drive
pdf_filename = "/content/drive/My Drive/section3-2.pdf" # <-- MATCH THIS NAME

if not os.path.exists(pdf_filename):
    print(f"Please upload the Regulatory Document: '{pdf_filename}'")
    uploaded = files.upload()
    # Move the first uploaded file to the expected Drive path.
    # shutil.move is used because os.rename fails across filesystems
    # (the upload lands on local disk, the target is on the Drive mount).
    if uploaded:
        filename = next(iter(uploaded))
        shutil.move(filename, pdf_filename)
        print(f"File saved as {pdf_filename}")
else:
    print(f"✅ '{pdf_filename}' found. Skipping upload.")

✅ '/content/drive/My Drive/section3-2.pdf' found. Skipping upload.


## 🧠 Step 6: Smart Search Memory (Embedding Cache)
This cell builds a fast semantic search system for the FDIC manual.

- Splits the PDF text into 1,000-character chunks, with a 100-character overlap between consecutive chunks
- Converts text chunks into embeddings for semantic search
- Stores the embeddings so they can be reused across sessions

This improves performance and reduces API cost.
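
The overlapping split described above can be sketched in isolation. The helper name `chunk_text` and its parameters are hypothetical, chosen for illustration; the default sizes mirror the 1,000-character window and 900-character stride used in this step's code.

```python
def chunk_text(text, size=1000, stride=900):
    """Split text into windows of `size` characters, advancing by `stride`.

    With size > stride, consecutive chunks share (size - stride) characters,
    so text cut at one boundary is more likely to appear intact in a neighbour.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [text[i:i + size] for i in range(0, len(text), stride)]


# Toy input: 2,500 characters give windows starting at 0, 900, and 1800
sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

The last 100 characters of each chunk repeat as the first 100 characters of the next one, which is the overlap that keeps boundary text searchable.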

In [None]:
# --- FIX: SAVE TO DRIVE INSTEAD OF LOCAL DISK ---
# This path is inside your actual Google Drive
EMBEDDING_FILE = "/content/drive/My Drive/fdic_embeddings_cache.pkl"
EMBEDDING_MODEL = "text-embedding-3-small"

def load_and_chunk_pdf(pdf_path):
    print(f"Reading PDF: {pdf_path}...")
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        t = page.extract_text()
        if t: text += t + "\n"
    return [text[i:i+1000] for i in range(0, len(text), 900)]

def get_embedding_batch(texts):
    res = client.embeddings.create(input=texts, model=EMBEDDING_MODEL)
    return [d.embedding for d in res.data]

# Check Google Drive for the file
if os.path.exists(EMBEDDING_FILE):
    print(f"✅ Cache found in Google Drive! Loading...")
    with open(EMBEDDING_FILE, 'rb') as f:
        data = pickle.load(f)
        chunks = data['chunks']
        chunk_embeddings_np = data['embeddings']
    print("Knowledge base loaded without spending API credits.")
else:
    print("⚠️ Cache not found in Drive. Generating embeddings...")

    # --- FIX: CHECK FOR THE CORRECT FILENAME FROM STEP 5 ---
    if os.path.exists(pdf_filename):  # Uses the "section3-2.pdf" path defined in Step 5
        chunks = load_and_chunk_pdf(pdf_filename)
        chunk_embeddings = []

        batch_size = 50
        for i in range(0, len(chunks), batch_size):
            print(f"Processing batch {i}...")
            batch_embeddings = get_embedding_batch(chunks[i:i+batch_size])
            chunk_embeddings.extend(batch_embeddings)

        chunk_embeddings_np = np.array(chunk_embeddings)

        # Save to Google Drive
        with open(EMBEDDING_FILE, 'wb') as f:
            pickle.dump({'chunks': chunks, 'embeddings': chunk_embeddings_np}, f)
        print(f"✅ Saved to Google Drive: {EMBEDDING_FILE}")
    else:
        print(f"❌ Error: {pdf_filename} missing. Please run the Upload cell (Step 5) again.")

⚠️ Cache not found in Drive. Generating embeddings...
Reading PDF: /content/drive/My Drive/section3-2.pdf...
Processing batch 0...
Processing batch 50...
Processing batch 100...
Processing batch 150...
Processing batch 200...
Processing batch 250...
Processing batch 300...
Processing batch 350...
Processing batch 400...
Processing batch 450...
Processing batch 500...
✅ Saved to Google Drive: /content/drive/My Drive/fdic_embeddings_cache.pkl


## 🖼️ Step 7: Extract Raw Text from Uploaded Loan Documents

Handles user-uploaded loan documents.

Supported formats:
- Image files (PNG, JPG)
- PDF files (scanned or digital)

How it works:
- Images → OCR using Tesseract
- PDFs → Converted to images, then OCR applied

The output is **raw, unstructured text** from the loan application.
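
The routing by file type can be made explicit with a small helper. This is a sketch rather than part of the notebook: `classify_upload` is a hypothetical name, and returning `"unsupported"` for any other extension is an assumption about the desired behaviour.

```python
from pathlib import Path

# Extensions treated as images, matching the formats listed above
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def classify_upload(filepath):
    """Return 'image', 'pdf', or 'unsupported' based on the file extension."""
    suffix = Path(filepath).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix == ".pdf":
        return "pdf"
    return "unsupported"


for name in ["loan.PNG", "application.pdf", "notes.docx"]:
    print(name, "->", classify_upload(name))
```

Classifying first makes it possible to tell the user that a format is unsupported instead of silently returning empty text.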




In [None]:
def extract_text_from_file(filepath):
    """OCR Logic: Convert Image/PDF to Raw Text"""
    text = ""
    try:
        if filepath.lower().endswith(('.png', '.jpg', '.jpeg')):
            image = Image.open(filepath)
            text = pytesseract.image_to_string(image)
        elif filepath.lower().endswith('.pdf'):
            images = convert_from_path(filepath)
            for img in images:
                text += pytesseract.image_to_string(img) + "\n"
    except Exception as e:
        return f"Error reading file: {str(e)}"
    return text

## 🧾 Step 8: Convert Raw Loan Text into Structured Data

This cell uses the language model to extract key loan fields
from the raw OCR text and convert them into clean JSON.

Extracted fields include:
- Borrower Name
- Loan Amount
- Interest Rate
- Purpose
- Address
- Income
- Credit Score (if available)
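
Because the model's reply arrives as a string, it can help to validate it before passing it on. The sketch below strips Markdown code fences and falls back to an error object when parsing fails. `parse_loan_json`, the shape of the fallback, and the sample reply are assumptions made for illustration, not part of the notebook.

```python
import json

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def parse_loan_json(content):
    """Strip Markdown fences from a model reply and parse it as a JSON object."""
    cleaned = content.replace(FENCE + "json", "").replace(FENCE, "").strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        return {"error": f"Model did not return valid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"error": "Model returned JSON that is not an object"}
    return data


# Made-up sample reply wrapped in a Markdown fence
reply = FENCE + 'json\n{"Borrower Name": "Jane Doe", "Loan Amount": 250000}\n' + FENCE
print(parse_loan_json(reply))
print(parse_loan_json("Sorry, I could not read the document."))
```

Returning a dictionary in both cases means downstream code always receives the same type, whether or not extraction succeeded.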

In [None]:
def structure_loan_data(raw_text):
    """LLM Logic: Convert Raw Text to Clean JSON"""
    system_prompt = """
    You are a Data Extraction Specialist.
    Task: Extract key loan application details from the OCR text.
    Output: Return ONLY a valid JSON object. No markdown, no commentary.
    Fields: Borrower Name, Loan Amount, Interest Rate, Purpose, Address, Income, Credit Score (if available).
    """

    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": raw_text}
        ],
        temperature=0,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Clean up response to ensure valid JSON
    content = response.choices[0].message.content
    content = content.replace("```json", "").replace("```", "").strip()
    return content

## 🔍 Step 9: Retrieve Relevant Regulatory Context

This cell performs **semantic retrieval** using embeddings.

Process:
1. The user query is converted into an embedding
2. It is compared against the stored chunk embeddings using a dot product
3. The 10 highest-scoring regulatory chunks are selected

This ensures:
- Only relevant portions of Section 3.2 are used
- The language model never sees the full document
- Hallucination risk is reduced, because answers are grounded in retrieved text
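
The ranking step can be illustrated on toy vectors. The sketch below uses plain Python lists instead of NumPy so that it runs without dependencies, and it normalises explicitly to cosine similarity. The names `cosine` and `top_k` and the three-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


chunk_vecs = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.9, 0.1, 0.0],  # chunk 2
]
query = [1.0, 0.2, 0.0]
print(top_k(query, chunk_vecs))  # → [2, 0]
```

Chunk 2 ranks first because it points in nearly the same direction as the query, even though its raw dot product is smaller than that of chunk 0. A plain dot product gives the same ranking as cosine similarity only when all vectors have unit length.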


In [None]:
import json
import asyncio

# --- 1. RETRIEVAL LOGIC (Fixed for Broader Context) ---
def find_relevant_context(query):
    # Retrieve top 10 chunks to ensure we catch the "General Policy" intro sections
    q_vec = client.embeddings.create(input=[query], model=EMBEDDING_MODEL).data[0].embedding
    sims = np.dot(chunk_embeddings_np, np.array(q_vec))

    # CHANGE 1: Increased from 5 to 10 to catch broad regulatory pillars
    top_idxs = np.argsort(sims)[-10:][::-1]

    return "\n\n---------------------\n".join([chunks[i] for i in top_idxs])


## 🧠 Step 10: Generate Regulatory Answer Using Prompt Engineering

This is the core reasoning step.

The system prompt:
- Defines the role (senior bank manager)
- Restricts answers to FDIC Section 3.2
- Prohibits approval, rejection, or compliance decisions
- Enforces refusal if information is missing
- Prevents showing internal reasoning

The model receives:
- Retrieved regulatory context
- Structured loan data
- The user’s regulatory question

The output is a **formal, document-grounded regulatory response**.
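
How the three inputs come together can be sketched as a pure function that builds the chat messages. `build_messages`, the placeholder context, and the abbreviated prompt text are hypothetical; the notebook's actual prompt is longer.

```python
import json


def build_messages(context, loan_data, question):
    """Assemble system and user messages from the three pipeline inputs."""
    system_prompt = (
        "Regulatory Context (Section 3.2 excerpts):\n"
        f"{context}\n\n"
        "Loan Data:\n"
        f"{json.dumps(loan_data, indent=2)}\n\n"
        "Answer using only the regulatory context above."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    context="<retrieved chunks go here>",
    loan_data={"Loan Amount": 250000},
    question="What are the appraisal requirements?",
)
print(messages[0]["content"])
```

Keeping the retrieved context and loan data in the system message and the question in the user message separates the grounding material from what the user actually asked.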


In [None]:
import json
import asyncio


# --- PIPELINE LOGIC ---
async def process_pipeline(file_obj, user_query):
    structured_json = "{}"

    # A. HANDLE FILE UPLOAD
    if file_obj is not None:
        print(f"Processing file: {file_obj.name}")
        raw_text = await asyncio.to_thread(extract_text_from_file, file_obj.name)
        print("Structuring data (LLM)...")
        structured_json = await asyncio.to_thread(structure_loan_data, raw_text)
    else:
        print("No file uploaded. Skipping OCR.")
        structured_json = json.dumps({"Info": "No loan provided. Answering based on regulations only."})

    # B. RETRIEVE CONTEXT
    print("Retrieving context...")
    context = await asyncio.to_thread(find_relevant_context, user_query)

    # C. GENERATE ANSWER
    system_prompt = f"""
Role:
You are a senior bank manager responsible for reviewing loan-related matters, acting in the role of a strict FDIC regulatory compliance officer.

Regulatory Context (FDIC Risk Management Supervision Manual – Section 3.2):
{context}

Loan Data:
{structured_json}

Instructions:

Use only the document titled “FDIC Risk Management Supervision Manual – Section 3.2 (Loans)” as your source.

Carefully review the entire provided text before answering, and identify relevant requirements even if the wording in the question differs from the wording used in the document.

Answer whenever Section 3.2 addresses the regulatory requirement or expectation in substance, including cases where the answer must be derived by combining guidance from multiple parts of the text.

Use the same terminology as the document. Do not add assumptions, interpretations, or outside knowledge.

Output response:

You may think internally, but do not show your reasoning.

Reply with “The provided document does not contain information to answer this question.” only if Section 3.2 does not address the subject in any form.

Provide only the final answer.

The final answer must be concise and precise, like a summarized version.
"""

    response = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query}
        ],
        temperature=0.3 # Slightly increased to allow synthesis of multiple chunks
    )

    return structured_json, response.choices[0].message.content, context

## 🖥️ Step 11: Interactive Chat Interface

This cell builds a Gradio-based UI that allows users to:
- Upload a loan document (optional)
- Ask a regulatory question
- View structured loan data
- View the regulatory answer
- View retrieved regulatory context for verification

This improves transparency and auditability.



In [None]:
import nest_asyncio
import gradio as gr

# Apply patch immediately before launch
nest_asyncio.apply()

# --- GRADIO INTERFACE SETUP ---
with gr.Blocks(title="Loan Validator") as demo:
    gr.Markdown("# 🏦 FDIC Loan Validator")
    gr.Markdown("Upload a loan application (image or PDF, optional) and ask a regulatory question.")

    with gr.Row():
        with gr.Column():
            file_input = gr.File(label="Upload Loan (Optional)")
            query_input = gr.Textbox(
                label="Regulatory Question",
                value="What are the appraisal requirements?",
                lines=2
            )
            btn = gr.Button("Analyze", variant="primary")

        with gr.Column():
            json_output = gr.JSON(label="Extracted Data")
            answer_output = gr.Textbox(
                label="Regulatory Analysis",
                lines=8,
                show_copy_button=True
            )
            context_output = gr.TextArea(
                label="🔍 Verification: Source Context",
                lines=10,
                interactive=False
            )

    # Note: process_pipeline is now async, which Gradio handles natively
    btn.click(
        process_pipeline,
        inputs=[file_input, query_input],
        outputs=[json_output, answer_output, context_output]
    )

print("🚀 Launching Loan Validator...")
demo.launch(debug=True, share=True)

🚀 Launching Loan Validator...
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://9f59499b20dc74dead.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


No file uploaded. Skipping OCR.
Retrieving context...
No file uploaded. Skipping OCR.
Retrieving context...
No file uploaded. Skipping OCR.
Retrieving context...
Processing file: /tmp/gradio/05d27fb82ea73fc995620986cc020cff3ff942a126999b0c2e0545538745a90a/WhatsApp Image 2026-01-07 at 12.19.40 PM 1.jpeg
Structuring data (LLM)...
Retrieving context...


## ✅ Summary
This notebook demonstrates how prompt engineering and semantic retrieval
can be used to safely apply large language models in a regulated banking
environment while maintaining accuracy, transparency, and audit readiness.