Big-Data-Team-4/Assignment_3

damg7245_assignment3

Project AURELIA: Automated Financial Concept Note Generator

Live Application Links

https://codelabs-preview.appspot.com/?file_id=1dERIVAa1BNXEoY00KKMGaGv1l_Kj28kQ6HnKmfkcpAM#0

Video Recording

https://teams.microsoft.com/l/meetingrecap?driveId=b%21EoS7pVU_8kqkOUphvPtDSfkAdNlsJmVKlaqSKn_q2S2GYvdth_CcTqlco0hHW2Zb&driveItemId=016WYFLZRTQTFBYWQVJVGKQXGGTF3MEQJE&sitePath=https%3A%2F%2Fnortheastern-my.sharepoint.com%2F%3Av%3A%2Fg%2Fpersonal%2Fprakash_anush_northeastern_edu%2FETOEyhxaFU1MqFzGmXbCQSQBk2SaSXI80m7zBlVx7x8hDQ&fileUrl=https%3A%2F%2Fnortheastern-my.sharepoint.com%2Fpersonal%2Fprakash_anush_northeastern_edu%2FDocuments%2FRecordings%2FCall%2520with%2520Big%2520Data%2520-%2520Fall%25202025-20251024_153649-Meeting%2520Recording.mp4%3Fweb%3D1&threadId=19%3A5686ab43082647cc892195a2be5731b9%40thread.v2&organizerId=38442747-3b78-46c9-ba07-9ab2f0121399&tenantId=a8eec281-aaa3-4dae-ac9b-9a398b9215e7&callId=528fa1e3-638a-4c7b-a4e2-f1555d258f85&threadType=GroupChat&meetingType=Adhoc&subType=RecapSharingLink_RecapCore

Problem Statement

Financial analysts and quantitative researchers frequently need to reference complex formulas, MATLAB code examples, and technical concepts from the MATLAB Financial Toolbox documentation (3000+ pages). Traditional keyword search is inefficient for:

  • Finding exact formulas with proper mathematical notation across thousands of pages
  • Locating relevant code examples and understanding relationships between financial concepts
  • Accessing information quickly during time-sensitive analysis without manual PDF navigation
  • Understanding complex financial models with proper context and citations

The lack of an intelligent, conversational interface for financial documentation results in wasted time, increased errors, and reduced productivity for financial professionals.

Project Goals

  1. Develop an intelligent RAG system that transforms static MATLAB Financial Toolbox documentation into a conversational knowledge base
  2. Implement automated data pipelines using Apache Airflow for weekly PDF processing, parsing, chunking, and vector embedding
  3. Provide semantic search across 10,000+ document chunks with accurate formula preservation and source citations
  4. Implement smart caching with PostgreSQL to reduce query latency by 95% and API costs by 80%
  5. Create structured LLM outputs using GPT-4o with the Instructor framework to ensure consistent formula extraction and code examples
  6. Deploy a cloud-native, production-ready system with comprehensive evaluation and monitoring
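Goal 5 depends on every answer having the same shape. The sketch below shows one way to validate such a shape with the standard library only, as a stand-in for the Pydantic models that Instructor would normally enforce. The class name `ConceptNote` and all field names are hypothetical and are not taken from `instructor_models.py`.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ConceptNote:
    """Hypothetical response schema; field names are illustrative only."""
    concept: str
    definition: str
    formulas: list[str] = field(default_factory=list)
    code_examples: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)


def parse_concept_note(raw_json: str) -> ConceptNote:
    """Validate a JSON string and build a ConceptNote.

    Raises ValueError if the JSON is malformed, a required field is
    missing or empty, or a list field is not a list of strings.
    """
    data = json.loads(raw_json)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    # Required text fields must be non-empty strings.
    for key in ("concept", "definition"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty required field: {key}")

    # Optional list fields default to an empty list.
    lists = {}
    for key in ("formulas", "code_examples", "sources"):
        value = data.get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"field {key} must be a list of strings")
        lists[key] = value

    return ConceptNote(concept=data["concept"], definition=data["definition"], **lists)
```

Rejecting a malformed response at this boundary keeps incomplete notes out of the cache and the UI; a retry policy on failure is left out of the sketch.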

Architecture Diagram

image

System Flow

Phase 1: Data Preparation (Weekly/Scheduled)

DAG 1: S3 → Azure DI (Parse PDF) → Chunk → OpenAI (Embeddings) → Pinecone (Upload)

DAG 2: Create PostgreSQL Schema → Seed Cache with Pre-loaded Concepts
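The "Chunk" step of DAG 1 can be illustrated with a minimal sketch. It assumes fixed-size, character-based chunks with overlap; the strategy in `chunking_pipeline.py` may differ, and the function name `chunk_text` and the default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap.

    Consecutive chunks share `overlap` characters so that a formula or
    sentence cut at a boundary still appears whole in one of them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between chunk start offsets
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text, so no
        # trailing chunk made only of already-covered characters is added.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`. Each chunk would then be embedded and uploaded to Pinecone in the later steps of the DAG.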

Phase 2: User Query (Real-Time)

User → Streamlit → FastAPI → RAG Service

Check PostgreSQL Cache → Query Pinecone → Send to GPT-4o → Return Answer
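The cache-first order of Phase 2 can be sketched as follows. This is an illustration under stated assumptions, not the logic in `rag_service.py`: an in-memory dict with exact matching on a normalized query stands in for the PostgreSQL cache, and the `retrieve` and `generate` callables are hypothetical stand-ins for the Pinecone query and the GPT-4o call.

```python
from typing import Callable


def answer_query(
    query: str,
    cache: dict[str, str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> dict[str, str]:
    """Answer a query, consulting the cache before retrieval and generation."""
    # Normalize case and whitespace so trivially different phrasings share a key.
    key = " ".join(query.lower().split())

    # Step 1: a cache hit skips both the vector search and the LLM call.
    if key in cache:
        return {"answer": cache[key], "source": "cache"}

    # Step 2: retrieve supporting chunks from the vector store.
    chunks = retrieve(query)

    # Step 3: generate an answer grounded in the retrieved chunks.
    answer = generate(query, chunks)

    # Step 4: store the answer so that a repeat query becomes a cache hit.
    cache[key] = answer
    return {"answer": answer, "source": "rag"}
```

Passing the retrieval and generation steps in as callables keeps the control flow testable without network access.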

Technology Stack

image

Installation

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • Git
  • Pinecone Account (API Key required)
  • OpenAI API Key
  • Azure Document Intelligence Credentials
  • AWS Account (for S3 storage)

Steps

  1. Clone the Repository

    git clone https://github.com/Big-Data-Team-4/Assignment_3.git PROJECT_AURELIA
    cd PROJECT_AURELIA
  2. Copy environment template and update credentials

    cp .env.example .env
    nano .env
  3. Required Environment Variables

    #PostgreSQL
    POSTGRES_HOST=localhost
    POSTGRES_PORT=5432
    POSTGRES_DB=aurelia_rag
    POSTGRES_USER=postgres
    POSTGRES_PASSWORD=your_password
    
    #Pinecone
    PINECONE_API_KEY=your_pinecone_api_key
    PINECONE_INDEX_NAME=aurelia-fintbx
    PINECONE_NAMESPACE=financial-toolbox
    
    #OpenAI
    OPENAI_API_KEY=your_openai_api_key
    EMBEDDING_MODEL=text-embedding-3-large
    LLM_MODEL=gpt-4o
    LLM_TEMPERATURE=0
    
    #Azure Document Intelligence
    AZURE_DI_ENDPOINT=your_azure_endpoint
    AZURE_DI_KEY=your_azure_key
    
    #AWS S3
    AWS_ACCESS_KEY_ID=your_aws_access_key
    AWS_SECRET_ACCESS_KEY=your_aws_secret_key
    AWS_REGION=us-east-1
    S3_BUCKET_NAME=aurelia-pdf-storage
    
    #Cache Configuration
    CACHE_SIMILARITY_THRESHOLD=0.70
    MIN_SIMILARITY_THRESHOLD=0.5
    BORDERLINE_SIMILARITY_THRESHOLD=0.30
    ENABLE_WIKIPEDIA_FALLBACK=true
  4. Run Airflow (Data Pipeline)

    cd airflow_docker
    docker-compose up --build

Project Directories

PROJECT_AURELIA/
├── airflow_docker/              # Airflow orchestration
│   ├── dags/
│   │   ├── azure_parser.py      # Azure Document Intelligence PDF parsing
│   │   ├── chunking_pipeline.py 
│   │   ├── concept_seed_dag.py  # DAG 2: Cache seeding
│   │   └── fintbx_ingest_dag.py # DAG 1: PDF processing
│   ├── docker-compose.yaml
│   ├── Dockerfile
│   └── requirements.txt
│
├── fastapi/                     # FastAPI RAG service
│   ├── main.py                 # FastAPI application
│   ├── rag_service.py          # Core RAG logic
│   ├── models.py               # Request/response models
│   ├── instructor_models.py    # Structured output schemas
│   ├── init_db.sql             # PostgreSQL schema
│   ├── Dockerfile
│   └── requirements.txt
│
├── streamlit/                    # Streamlit UI
│   ├── app.py                  # Streamlit application
│   ├── Dockerfile
│   └── requirements.txt
│
├── data/                        # Raw data
│   └── fintbx.pdf               # Financial Toolbox PDF
│
├── outputs/                     # Parsed outputs
│   ├── figures
│   ├── tables
│   ├── formulas
│   └── json
│
├── dockerfile
├── docker-compose.app.yaml      # App orchestration
├── .env                         # Environment template
├── requirements.txt            # Root dependencies
└── README.md                   # This file

Attestation Statement

Team Members

Name             Student ID   Contribution
Anusha Prakash   002306070    33.33%
Komal Khairnar   002472617    33.33%
Shriya Pekamwar  002059178    33.33%

Team: Team 4
Course: INFO 7245 - Big Data Systems & Intelligent Analytics
Institution: Northeastern University
Semester: Fall 2025

Project Title: AURELIA - Advanced Unified Retrieval and Embedding Layer for Intelligent Analysis

We, the undersigned members of Team 4, hereby attest that:

  1. Original Work: This project represents our collective original work, completed collaboratively for INFO 7245 - Big Data Systems & Intelligent Analytics at Northeastern University.

  2. Equal Contribution: All team members contributed substantially and equitably to this project.

  3. Proper Attribution: All external resources, libraries, frameworks, APIs, and code snippets have been properly cited and attributed in the documentation.

  4. Academic Integrity: This work has not been submitted for any other course or academic program and complies with Northeastern University's academic integrity policy.

  5. Collaborative Work: All work was completed through legitimate collaboration among team members. No unauthorized assistance was received.
