- Codelabs: https://codelabs-preview.appspot.com/?file_id=1dERIVAa1BNXEoY00KKMGaGv1l_Kj28kQ6HnKmfkcpAM#0
- FastAPI: http://3.221.13.158:8000/docs
- Streamlit Application: http://3.221.13.158:8501
- Airflow: http://35.202.14.13:8081
Financial analysts and quantitative researchers frequently need to reference complex formulas, MATLAB code examples, and technical concepts from the MATLAB Financial Toolbox documentation (3000+ pages). Traditional keyword search is inefficient for:
- Finding exact formulas with proper mathematical notation across thousands of pages
- Locating relevant code examples and understanding relationships between financial concepts
- Accessing information quickly during time-sensitive analysis without manual PDF navigation
- Understanding complex financial models with proper context and citations
The lack of an intelligent, conversational interface for financial documentation results in wasted time, increased errors, and reduced productivity for financial professionals.
- Develop an intelligent RAG system that transforms static MATLAB Financial Toolbox documentation into a conversational knowledge base
- Implement automated data pipelines using Apache Airflow for weekly PDF processing, parsing, chunking, and vector embedding
- Provide semantic search across 10,000+ document chunks with accurate formula preservation and source citations
- Implement smart caching with PostgreSQL to reduce query latency by 95% and API costs by 80%
- Create structured LLM outputs using GPT-4o with the Instructor framework to ensure consistent formula extraction and code examples (a sketch follows this list)
- Deploy a cloud-native, production-ready system with comprehensive evaluation and monitoring
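As a minimal sketch of the structured-output objective, the snippet below shows how Instructor can validate a GPT-4o response against a Pydantic schema. The model name `FormulaAnswer`, its fields, and the sample question are illustrative assumptions, not the project's actual `instructor_models.py` definitions:

```python
# Hypothetical response schema; the field names are assumptions,
# not the project's actual instructor_models.py definitions.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class FormulaAnswer(BaseModel):
    answer: str = Field(description="Plain-language explanation")
    formula: str | None = Field(default=None, description="LaTeX formula, if any")
    matlab_example: str | None = Field(default=None, description="MATLAB snippet, if any")
    citations: list[str] = Field(default_factory=list, description="Source page references")

# Patch the OpenAI client so completions are parsed and validated
# against the Pydantic model, with automatic retries on failure.
client = instructor.from_openai(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    response_model=FormulaAnswer,
    messages=[{"role": "user", "content": "How is Macaulay duration computed?"}],
)
print(resp.formula)
```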
Phase 1: Data Preparation (Weekly/Scheduled)
DAG 1: S3 → Azure DI (Parse PDF) → Chunk → OpenAI (Embeddings) → Pinecone (Upload)
DAG 2: Create PostgreSQL Schema → Seed Cache with Pre-loaded Concepts
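A condensed sketch of what DAG 1 could look like in Airflow; every callable is a stub and the task names are assumptions based on the pipeline description above (the real implementation lives in `fintbx_ingest_dag.py`):

```python
# Skeleton of the weekly ingestion DAG; all callables are stubs and the
# names are assumptions, not the actual fintbx_ingest_dag.py code.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def download_pdf(): ...          # fetch fintbx.pdf from S3
def parse_with_azure_di(): ...   # Azure Document Intelligence: text, tables, formulas
def chunk_documents(): ...       # split parsed output into retrieval-sized chunks
def embed_and_upload(): ...      # OpenAI embeddings -> Pinecone upsert

with DAG(
    dag_id="fintbx_ingest",
    start_date=datetime(2025, 1, 1),
    schedule="@weekly",          # matches the weekly refresh cadence
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_pdf", python_callable=download_pdf)
    parse = PythonOperator(task_id="parse_pdf", python_callable=parse_with_azure_di)
    chunk = PythonOperator(task_id="chunk", python_callable=chunk_documents)
    upload = PythonOperator(task_id="embed_and_upload", python_callable=embed_and_upload)

    download >> parse >> chunk >> upload
```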
Phase 2: User Query (Real-Time)
User → Streamlit → FastAPI → RAG Service
RAG Service: Check PostgreSQL Cache → Query Pinecone → Send to GPT-4o → Return Answer
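The same flow in code form, with every helper stubbed out; the names and signatures are illustrative assumptions, not the actual `fastapi/rag_service.py` API:

```python
# Sketch of the real-time query path; all helpers are stubs and their
# names/signatures are assumptions, not the actual rag_service.py API.
from typing import Optional

def embed(text: str) -> list[float]: ...                    # text-embedding-3-large
def cache_lookup(vec: list[float], threshold: float) -> Optional[dict]: ...
def pinecone_search(vec: list[float], top_k: int) -> list[str]: ...
def generate_answer(question: str, chunks: list[str]) -> dict: ...  # GPT-4o
def cache_store(question: str, vec: list[float], answer: dict) -> None: ...

def answer_query(question: str) -> dict:
    q_vec = embed(question)

    # 1. Cache first: a near-duplicate past question skips retrieval and the LLM.
    if (hit := cache_lookup(q_vec, threshold=0.70)) is not None:
        return hit

    # 2. Semantic retrieval over the Pinecone index.
    chunks = pinecone_search(q_vec, top_k=5)

    # 3. Grounded generation, then store the answer for future cache hits.
    answer = generate_answer(question, chunks)
    cache_store(question, q_vec, answer)
    return answer
```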
- Python 3.11+
- Docker & Docker Compose
- Git
- Pinecone Account (API Key required)
- OpenAI API Key
- Azure Document Intelligence Credentials
- AWS Account (for S3 storage)
- Clone the repository:
  git clone https://github.com/Big-Data-Team-4/Assignment_3.git
  cd PROJECT_AURELIA
- Copy the environment template and update credentials:
  cp .env.example .env
  nano .env
- Set the required environment variables in .env (one plausible reading of the cache thresholds is sketched after these steps):
  # PostgreSQL
  POSTGRES_HOST=localhost
  POSTGRES_PORT=5432
  POSTGRES_DB=aurelia_rag
  POSTGRES_USER=postgres
  POSTGRES_PASSWORD=your_password
  # Pinecone
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_INDEX_NAME=aurelia-fintbx
  PINECONE_NAMESPACE=financial-toolbox
  # OpenAI
  OPENAI_API_KEY=your_openai_api_key
  EMBEDDING_MODEL=text-embedding-3-large
  LLM_MODEL=gpt-4o
  LLM_TEMPERATURE=0
  # Azure Document Intelligence
  AZURE_DI_ENDPOINT=your_azure_endpoint
  AZURE_DI_KEY=your_azure_key
  # AWS S3
  AWS_ACCESS_KEY_ID=your_aws_access_key
  AWS_SECRET_ACCESS_KEY=your_aws_secret_key
  AWS_REGION=us-east-1
  S3_BUCKET_NAME=aurelia-pdf-storage
  # Cache configuration
  CACHE_SIMILARITY_THRESHOLD=0.70
  MIN_SIMILARITY_THRESHOLD=0.5
  BORDERLINE_SIMILARITY_THRESHOLD=0.30
  ENABLE_WIKIPEDIA_FALLBACK=true
- Run Airflow (data pipeline):
  cd airflow_docker
  docker-compose up --build
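The cache thresholds set above control how each query is routed. One plausible interpretation is sketched below; the tier names are invented for illustration, and the authoritative logic lives in `rag_service.py`:

```python
# One plausible interpretation of the cache thresholds; the tier names
# are invented for illustration and the real routing is in rag_service.py.
import os

CACHE_T = float(os.getenv("CACHE_SIMILARITY_THRESHOLD", "0.70"))
MIN_T = float(os.getenv("MIN_SIMILARITY_THRESHOLD", "0.5"))
BORDER_T = float(os.getenv("BORDERLINE_SIMILARITY_THRESHOLD", "0.30"))
WIKI_FALLBACK = os.getenv("ENABLE_WIKIPEDIA_FALLBACK", "true").lower() == "true"

def route(best_similarity: float) -> str:
    """Map the best cached-question similarity to a handling tier."""
    if best_similarity >= CACHE_T:    # strong match: return the cached answer
        return "cache_hit"
    if best_similarity >= MIN_T:      # in-corpus question: run the full RAG pipeline
        return "full_rag"
    if best_similarity >= BORDER_T:   # borderline: still try the corpus
        return "full_rag_borderline"
    # likely out-of-corpus: optionally fall back to Wikipedia
    return "wikipedia_fallback" if WIKI_FALLBACK else "full_rag"
```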
PROJECT_AURELIA/
├── airflow_docker/ # Airflow orchestration
│ ├── dags/
│   │   ├── azure_parser.py # Azure DI PDF parsing
│ │ ├── chunking_pipeline.py
│   │   ├── concept_seed_dag.py # DAG 2: Cache seeding
│   │   └── fintbx_ingest_dag.py # DAG 1: PDF processing
│ ├── docker-compose.yaml
│ ├── Dockerfile
│ └── requirements.txt
│
├── fastapi/ # FastAPI RAG service
│ ├── main.py # FastAPI application
│ ├── rag_service.py # Core RAG logic
│ ├── models.py # Request/response models
│ ├── instructor_models.py # Structured output schemas
│ ├── init_db.sql # PostgreSQL schema
│ ├── Dockerfile
│ └── requirements.txt
│
├── streamlit/ # Streamlit UI
│ ├── app.py # Streamlit application
│ ├── Dockerfile
│ └── requirements.txt
│
├── data/ # Raw data
│ └── fintbx.pdf # Financial Toolbox PDF
│
├── outputs/ # Parsed Outputs
│   ├── figures
│   ├── tables
│   ├── formulas
│   └── json
│
├── dockerfile
├── docker-compose.app.yaml # App orchestration
├── .env # Environment variables (created from .env.example)
├── requirements.txt # Root dependencies
└── README.md # This file
- OpenAI API Documentation: https://platform.openai.com/docs
- Pinecone Documentation: https://docs.pinecone.io
- Apache Airflow: https://airflow.apache.org/docs
- LangChain: https://python.langchain.com/docs
- Azure Document Intelligence: https://learn.microsoft.com/azure/ai-services/document-intelligence
- FastAPI: https://fastapi.tiangolo.com
- Streamlit: https://docs.streamlit.io
- MATLAB Financial Toolbox Documentation
- RAG Papers: https://arxiv.org/abs/2005.11401
Team Members
| Name            | Student ID | Contribution |
|-----------------|------------|--------------|
| Anusha Prakash  | 002306070  | 33.33%       |
| Komal Khairnar  | 002472617  | 33.33%       |
| Shriya Pekamwar | 002059178  | 33.33%       |
Team: Team 4
Course: INFO 7245 - Big Data Systems & Intelligent Analytics
Institution: Northeastern University
Semester: Fall 2025
Project Title: AURELIA - Advanced Unified Retrieval and Embedding Layer for Intelligent Analysis
We, the undersigned members of Team 4, hereby attest that:
- Original Work: This project represents our collective original work, completed collaboratively for INFO 7245 - Big Data Systems & Intelligent Analytics at Northeastern University.
- Equal Contribution: All team members contributed substantially and equitably to this project.
- Proper Attribution: All external resources, libraries, frameworks, APIs, and code snippets have been properly cited and attributed in the documentation.
- Academic Integrity: This work has not been submitted for any other course or academic program and complies with Northeastern University's academic integrity policy.
- Collaborative Work: All work was completed through legitimate collaboration among team members. No unauthorized assistance was received.