- Codelabs: https://codelabs-preview.appspot.com/?file_id=1dERIVAa1BNXEoY00KKMGaGv1l_Kj28kQ6HnKmfkcpAM#0
- FastAPI: http://3.221.13.158:8000/docs
- Streamlit Application: http://3.221.13.158:8501
- Airflow: http://35.202.14.13:8081
Financial analysts and quantitative researchers frequently need to reference complex formulas, MATLAB code examples, and technical concepts from the MATLAB Financial Toolbox documentation (3000+ pages). Traditional keyword search is inefficient for:
- Finding exact formulas with proper mathematical notation across thousands of pages
- Locating relevant code examples and understanding relationships between financial concepts
- Accessing information quickly during time-sensitive analysis without manual PDF navigation
- Understanding complex financial models with proper context and citations
The lack of an intelligent, conversational interface for financial documentation results in wasted time, increased errors, and reduced productivity for financial professionals.
- Develop an intelligent RAG system that transforms static MATLAB Financial Toolbox documentation into a conversational knowledge base
- Implement automated data pipelines using Apache Airflow for weekly PDF processing, parsing, chunking, and vector embedding
- Provide semantic search across 10,000+ document chunks with accurate formula preservation and source citations
- Implement smart caching with PostgreSQL to reduce query latency by 95% and API costs by 80%
- Create structured LLM outputs using GPT-4o with the Instructor framework to ensure consistent formula extraction and code examples (a sketch follows this list)
- Deploy a cloud-native, production-ready system with comprehensive evaluation and monitoring
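As a minimal sketch of the structured-output objective, the snippet below shows how Instructor can validate a GPT-4o response against a Pydantic schema. The model name `FormulaAnswer`, its fields, and the sample question are illustrative assumptions, not the project's actual `instructor_models.py` definitions:

```python
# Hypothetical response schema; the field names are assumptions,
# not the project's actual instructor_models.py definitions.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class FormulaAnswer(BaseModel):
    answer: str = Field(description="Plain-language explanation")
    formula: str | None = Field(default=None, description="LaTeX formula, if any")
    matlab_example: str | None = Field(default=None, description="MATLAB snippet, if any")
    citations: list[str] = Field(default_factory=list, description="Source page references")

# Patch the OpenAI client so completions are parsed and validated
# against the Pydantic model, with automatic retries on failure.
client = instructor.from_openai(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    response_model=FormulaAnswer,
    messages=[{"role": "user", "content": "How is Macaulay duration computed?"}],
)
print(resp.formula)
```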
Phase 1: Data Preparation (Weekly/Scheduled)
DAG 1: S3 → Azure DI (Parse PDF) → Chunk → OpenAI (Embeddings) → Pinecone (Upload)
DAG 2: Create PostgreSQL Schema → Seed Cache with Pre-loaded Concepts
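A condensed sketch of what DAG 1 could look like in Airflow; every callable is a stub and the task names are assumptions based on the pipeline description above (the real implementation lives in `fintbx_ingest_dag.py`):

```python
# Skeleton of the weekly ingestion DAG; all callables are stubs and the
# names are assumptions, not the actual fintbx_ingest_dag.py code.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def download_pdf(): ...          # fetch fintbx.pdf from S3
def parse_with_azure_di(): ...   # Azure Document Intelligence: text, tables, formulas
def chunk_documents(): ...       # split parsed output into retrieval-sized chunks
def embed_and_upload(): ...      # OpenAI embeddings -> Pinecone upsert

with DAG(
    dag_id="fintbx_ingest",
    start_date=datetime(2025, 1, 1),
    schedule="@weekly",          # matches the weekly refresh cadence
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_pdf", python_callable=download_pdf)
    parse = PythonOperator(task_id="parse_pdf", python_callable=parse_with_azure_di)
    chunk = PythonOperator(task_id="chunk", python_callable=chunk_documents)
    upload = PythonOperator(task_id="embed_and_upload", python_callable=embed_and_upload)

    download >> parse >> chunk >> upload
```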
Phase 2: User Query (Real-Time)
User → Streamlit → FastAPI → RAG Service
RAG Service: Check PostgreSQL Cache → Query Pinecone → Send to GPT-4o → Return Answer
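The same flow in code form, with every helper stubbed out; the names and signatures are illustrative assumptions, not the actual `fastapi/rag_service.py` API:

```python
# Sketch of the real-time query path; all helpers are stubs and their
# names/signatures are assumptions, not the actual rag_service.py API.
from typing import Optional

def embed(text: str) -> list[float]: ...                    # text-embedding-3-large
def cache_lookup(vec: list[float], threshold: float) -> Optional[dict]: ...
def pinecone_search(vec: list[float], top_k: int) -> list[str]: ...
def generate_answer(question: str, chunks: list[str]) -> dict: ...  # GPT-4o
def cache_store(question: str, vec: list[float], answer: dict) -> None: ...

def answer_query(question: str) -> dict:
    q_vec = embed(question)

    # 1. Cache first: a near-duplicate past question skips retrieval and the LLM.
    if (hit := cache_lookup(q_vec, threshold=0.70)) is not None:
        return hit

    # 2. Semantic retrieval over the Pinecone index.
    chunks = pinecone_search(q_vec, top_k=5)

    # 3. Grounded generation, then store the answer for future cache hits.
    answer = generate_answer(question, chunks)
    cache_store(question, q_vec, answer)
    return answer
```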
- Python 3.11+
- Docker & Docker Compose
- Git
- Pinecone Account (API Key required)
- OpenAI API Key
- Azure Document Intelligence Credentials
- AWS Account (for S3 storage)
- Clone the repository:
  git clone https://github.com/Big-Data-Team-4/Assignment_3.git
  cd PROJECT_AURELIA
- Copy the environment template and update credentials:
  cp .env.example .env
  nano .env
- Set the required environment variables in .env (one plausible reading of the cache thresholds is sketched after these steps):
  # PostgreSQL
  POSTGRES_HOST=localhost
  POSTGRES_PORT=5432
  POSTGRES_DB=aurelia_rag
  POSTGRES_USER=postgres
  POSTGRES_PASSWORD=your_password
  # Pinecone
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_INDEX_NAME=aurelia-fintbx
  PINECONE_NAMESPACE=financial-toolbox
  # OpenAI
  OPENAI_API_KEY=your_openai_api_key
  EMBEDDING_MODEL=text-embedding-3-large
  LLM_MODEL=gpt-4o
  LLM_TEMPERATURE=0
  # Azure Document Intelligence
  AZURE_DI_ENDPOINT=your_azure_endpoint
  AZURE_DI_KEY=your_azure_key
  # AWS S3
  AWS_ACCESS_KEY_ID=your_aws_access_key
  AWS_SECRET_ACCESS_KEY=your_aws_secret_key
  AWS_REGION=us-east-1
  S3_BUCKET_NAME=aurelia-pdf-storage
  # Cache configuration
  CACHE_SIMILARITY_THRESHOLD=0.70
  MIN_SIMILARITY_THRESHOLD=0.5
  BORDERLINE_SIMILARITY_THRESHOLD=0.30
  ENABLE_WIKIPEDIA_FALLBACK=true
- Run Airflow (data pipeline):
  cd airflow_docker
  docker-compose up --build
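The cache thresholds set above control how each query is routed. One plausible interpretation is sketched below; the tier names are invented for illustration, and the authoritative logic lives in `rag_service.py`:

```python
# One plausible interpretation of the cache thresholds; the tier names
# are invented for illustration and the real routing is in rag_service.py.
import os

CACHE_T = float(os.getenv("CACHE_SIMILARITY_THRESHOLD", "0.70"))
MIN_T = float(os.getenv("MIN_SIMILARITY_THRESHOLD", "0.5"))
BORDER_T = float(os.getenv("BORDERLINE_SIMILARITY_THRESHOLD", "0.30"))
WIKI_FALLBACK = os.getenv("ENABLE_WIKIPEDIA_FALLBACK", "true").lower() == "true"

def route(best_similarity: float) -> str:
    """Map the best cached-question similarity to a handling tier."""
    if best_similarity >= CACHE_T:    # strong match: return the cached answer
        return "cache_hit"
    if best_similarity >= MIN_T:      # in-corpus question: run the full RAG pipeline
        return "full_rag"
    if best_similarity >= BORDER_T:   # borderline: still try the corpus
        return "full_rag_borderline"
    # likely out-of-corpus: optionally fall back to Wikipedia
    return "wikipedia_fallback" if WIKI_FALLBACK else "full_rag"
```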
PROJECT_AURELIA/
├── airflow_docker/ # Airflow orchestration
│ ├── dags/
│   │   ├── azure_parser.py # Azure DI PDF parsing
│ │ ├── chunking_pipeline.py
│   │   ├── concept_seed_dag.py # DAG 2: Cache seeding
│   │   └── fintbx_ingest_dag.py # DAG 1: PDF processing
│ ├── docker-compose.yaml
│ ├── Dockerfile
│ └── requirements.txt
│
├── fastapi/ # FastAPI RAG service
│ ├── main.py # FastAPI application
│ ├── rag_service.py # Core RAG logic
│ ├── models.py # Request/response models
│ ├── instructor_models.py # Structured output schemas
│ ├── init_db.sql # PostgreSQL schema
│ ├── Dockerfile
│ └── requirements.txt
│
├── streamlit/ # Streamlit UI
│ ├── app.py # Streamlit application
│ ├── Dockerfile
│ └── requirements.txt
│
├── data/ # Raw data
│ └── fintbx.pdf # Financial Toolbox PDF
│
├── outputs/ # Parsed Outputs
│   ├── figures
│   ├── tables
│   ├── formulas
│   └── json
│
├── dockerfile
├── docker-compose.app.yaml # App orchestration
├── .env # Environment variables (created from .env.example)
├── requirements.txt # Root dependencies
└── README.md # This file
- OpenAI API Documentation: https://platform.openai.com/docs
- Pinecone Documentation: https://docs.pinecone.io
- Apache Airflow: https://airflow.apache.org/docs
- LangChain: https://python.langchain.com/docs
- Azure Document Intelligence: https://learn.microsoft.com/azure/ai-services/document-intelligence
- FastAPI: https://fastapi.tiangolo.com
- Streamlit: https://docs.streamlit.io
- MATLAB Financial Toolbox Documentation
- RAG Papers: https://arxiv.org/abs/2005.11401
Team Members
| Name            | Student ID | Contribution |
|-----------------|------------|--------------|
| Anusha Prakash  | 002306070  | 33.33%       |
| Komal Khairnar  | 002472617  | 33.33%       |
| Shriya Pekamwar | 002059178  | 33.33%       |
Team: Team 4
Course: INFO 7245 - Big Data Systems & Intelligent Analytics
Institution: Northeastern University
Semester: Fall 2025
Project Title: AURELIA - Advanced Unified Retrieval and Embedding Layer for Intelligent Analysis
We, the undersigned members of Team 4, hereby attest that:
- Original Work: This project represents our collective original work, completed collaboratively for INFO 7245 - Big Data Systems & Intelligent Analytics at Northeastern University.
- Equal Contribution: All team members contributed substantially and equitably to this project.
- Proper Attribution: All external resources, libraries, frameworks, APIs, and code snippets have been properly cited and attributed in the documentation.
- Academic Integrity: This work has not been submitted for any other course or academic program and complies with Northeastern University's academic integrity policy.
- Collaborative Work: All work was completed through legitimate collaboration among team members. No unauthorized assistance was received.