FinFlow - Automated Dow 30 Earnings Pipeline

**Case Study 2: Project LANTERN**
Automating quarterly earnings report collection and analysis for Dow Jones 30 companies.

📋 Project Overview

FinFlow is an automated data pipeline that streamlines the collection of quarterly earnings reports from Dow 30 companies. The system programmatically discovers investor relations pages, identifies the latest earnings reports, downloads and parses them, extracts metadata, and stores the results in cloud storage.

Key Features

  • Automated IR Page Discovery: Programmatically finds investor relations pages for all Dow 30 companies
  • Smart Report Detection: Identifies the latest quarterly earnings reports using keywords
  • Multi-format Parsing: Extracts text, tables, and charts from PDFs, HTML, and other formats
  • Cloud Storage Integration: Stores raw and parsed data in Google Cloud Storage
  • Airflow Orchestration: Manages the entire workflow with Apache Airflow for reliability and scheduling

πŸ—οΈ Architecture

┌─────────────────┐
│  Dow 30 List    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│ Find IR Pages   │─────▶│  IR Page URLs    │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│ Find Reports    │─────▶│  Report URLs     │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│ Download & Parse│─────▶│  Parsed Data     │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│Extract Metadata │─────▶│  Structured Data │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐
│ Upload to GCS   │
└─────────────────┘
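
This flow maps onto the tasks in dags/dow30_pipeline.py. The sketch below is a minimal illustration of how those dependencies could be wired in Airflow; the task IDs, placeholder callables, and manual-trigger schedule are assumptions rather than the repository's exact code.

```python
# Illustrative sketch of dags/dow30_pipeline.py — not the exact implementation.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _noop(**_):
    """Placeholder; the real tasks call into the modules under src/."""


with DAG(
    dag_id="dow30_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule=None,   # triggered manually from the UI, per the Quick Start
    catchup=False,
) as dag:
    find_ir_pages = PythonOperator(task_id="find_ir_pages", python_callable=_noop)
    find_latest_reports = PythonOperator(task_id="find_latest_reports", python_callable=_noop)
    download_and_parse = PythonOperator(task_id="download_and_parse", python_callable=_noop)
    extract_metadata = PythonOperator(task_id="extract_metadata", python_callable=_noop)
    upload_to_gcs = PythonOperator(task_id="upload_to_gcs", python_callable=_noop)

    # Mirror the architecture diagram: each stage feeds the next.
    find_ir_pages >> find_latest_reports >> download_and_parse >> extract_metadata >> upload_to_gcs
```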

🚀 Quick Start

Prerequisites

  • Docker Desktop
  • Docker Compose
  • Python 3.12+
  • Google Cloud Platform account (for GCS)
  • Git

One-Command Setup

  1. Clone the repository

     git clone https://github.com/yourusername/finflow.git
     cd finflow

  2. Build and start services

     docker-compose build
     docker-compose up -d

  3. Access the Airflow UI

Running the Pipeline

  1. Navigate to http://localhost:8080
  2. Find the dow30_pipeline DAG
  3. Toggle it to "Active" (if paused)
  4. Click the "Play" button to trigger a manual run

πŸ“ Project Structure

finflow/
├── dags/                                    # Airflow DAG definitions
│   └── dow30_pipeline.py                    # Main pipeline orchestration
├── src/                                     # Core pipeline logic
│   ├── find_ir_pages.py                     # IR page discovery
│   ├── find_latest_reports.py               # Report identification
│   ├── upload_to_cloud.py                   # Cloud storage upload
│   ├── parsers/                             # Data parsing modules
│   │   └── docling_or_fallback_parser.py    # ⭐ Main parsing logic with Docling
│   └── representations/                     # Data structure and metadata
│       ├── metadata_builder.py              # ⭐ Metadata extraction and construction
│       └── metadata_storage_formats.py      # ⭐ Metadata schemas and formats
├── config/                                  # Configuration files
│   └── dow30_companies.json                 # Dow 30 reference list
├── data/                                    # Local data storage
│   ├── raw/                                 # Raw downloaded files
│   └── parsed/                              # Parsed outputs with metadata
├── logs/                                    # Airflow logs
├── docker-compose.yml                       # Docker services configuration
├── .airflow                                 # Airflow schedule
├── .env                                     # Environment variables
├── requirements.txt                         # Python dependencies
└── README.md                                # This file

🔧 Configuration

Environment Variables (.env)

# Database
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow

# Celery
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow
AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0

# Admin credentials
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=admin

GCP Setup

  1. Create a GCS bucket:

     gsutil mb gs://your-finflow-bucket

  2. Create a service account with the Storage Admin role
  3. Download the JSON key and save it as gcp-key.json
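
To confirm the key and bucket are wired up before running the pipeline, a quick check with the google-cloud-storage client can help; in the sketch below the bucket name and key path are placeholders.

```python
# Sanity check: can the service account in gcp-key.json see the bucket?
# "your-finflow-bucket" is a placeholder — substitute your own bucket name.
import os

from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "gcp-key.json"

client = storage.Client()
print("Bucket reachable:", client.bucket("your-finflow-bucket").exists())
```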

📦 Dependencies

Python Packages

  • apache-airflow==2.10.2 - Workflow orchestration
  • requests - HTTP requests
  • beautifulsoup4 - HTML parsing
  • pandas - Data manipulation
  • python-dotenv - Environment management
  • google-cloud-storage - GCS integration

Docker Services

  • PostgreSQL 15 - Airflow metadata database
  • Redis 7 - Celery message broker
  • Apache Airflow 2.10.2 - Workflow orchestration

🔑 Key Components

Parsing Engine: docling_or_fallback_parser.py

The core parsing module that handles document processing:

  • Primary Method: Uses Docling for advanced document understanding
  • Fallback Method: Alternative extraction when Docling cannot handle a document
  • Output: Structured data with confidence scores and parsing metadata
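
As a rough illustration of the primary path, Docling's converter API can be used as in the sketch below; the fallback branches, confidence scoring, and exact file paths in the repository's module are omitted, and the input path is hypothetical.

```python
# Minimal sketch of the Docling path (fallbacks and scoring omitted).
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("data/raw/AAPL_Q2_2025_earnings.pdf")  # hypothetical input

# Export the parsed document; Markdown keeps headings and tables readable.
print(result.document.export_to_markdown()[:500])
```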

πŸ” Pipeline Details

Step 1: Find IR Pages

Programmatically discovers investor relations pages by:

  • Analyzing company websites
  • Looking for common IR page patterns
  • Matching investor-relations keywords in page links and URLs
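
A minimal sketch of this kind of keyword-based discovery, using the requests and beautifulsoup4 dependencies listed above; the keyword list and helper name are illustrative, not the exact contents of src/find_ir_pages.py.

```python
# Illustrative keyword-based IR page discovery (simplified; not the exact
# logic in src/find_ir_pages.py).
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

IR_KEYWORDS = ("investor relations", "investors", "investor")  # assumed keywords


def find_ir_page(company_url: str) -> str | None:
    """Return the first homepage link whose text or URL looks like an IR page."""
    html = requests.get(company_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=True):
        haystack = f'{link.get_text(" ", strip=True)} {link["href"]}'.lower()
        if any(keyword in haystack for keyword in IR_KEYWORDS):
            return urljoin(company_url, link["href"])
    return None
```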

Step 2: Find Latest Reports

Identifies the most recent quarterly earnings reports by:

  • Scanning IR pages for report links
  • Filtering by publication date
  • Matching keywords like "quarterly results", "earnings release"
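
The sketch below shows one plausible shape for that matching logic; the keyword list and regular expression are assumptions, not the exact rules in src/find_latest_reports.py.

```python
# Illustrative report-link matching (simplified from src/find_latest_reports.py).
import re

REPORT_KEYWORDS = ("quarterly results", "earnings release")  # assumed keywords


def looks_like_earnings_report(link_text: str, href: str) -> bool:
    """True if a link's text or URL suggests a quarterly earnings report."""
    haystack = f"{link_text} {href}".lower()
    return any(keyword in haystack for keyword in REPORT_KEYWORDS)


def quarter_from_text(text: str) -> str | None:
    """Pull a '2025Q2'-style tag out of link text, if present."""
    match = re.search(r"q([1-4])\s*(20\d{2})", text.lower())
    return f"{match.group(2)}Q{match.group(1)}" if match else None


print(looks_like_earnings_report("Q2 2025 Earnings Release", "/news/q2-2025.pdf"))
print(quarter_from_text("Q2 2025 Earnings Release"))
```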

Step 3: Download and Parse

Downloads reports and extracts content using:

  • Docling Parser: Advanced document parsing with layout understanding
  • Fallback Parsers: Alternative extraction methods for edge cases

Extracts:

  • Text content
  • Financial tables
  • Charts and images
  • Document structure
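
Downloading itself is straightforward; a simplified sketch is below, with parsing delegated to the Docling-or-fallback module described under Key Components. The destination directory mirrors data/raw/ from the project structure, and the function name is illustrative.

```python
# Illustrative download step; parsing is handled separately by
# src/parsers/docling_or_fallback_parser.py (see Key Components above).
from pathlib import Path

import requests


def download_report(url: str, dest_dir: str = "data/raw") -> Path:
    """Download one report into data/raw/ and return the local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    local_path = dest / url.rstrip("/").split("/")[-1]
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    local_path.write_bytes(response.content)
    return local_path
```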

Step 4: Extract Metadata

Builds structured metadata using metadata_builder.py:

  • Company information (ticker, name, sector)
  • Report metadata (quarter, year, publication date)
  • Parsing metadata (method used, confidence score)

Stores the results in standardized formats defined by metadata_storage_formats.py:

  • JSON, nmd, txt
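
A hedged illustration of what such a record might look like: the field names follow the bullets above and the values are made up, so the real schema in metadata_storage_formats.py will differ in detail.

```python
# Illustrative metadata record — field names and values are assumptions,
# not the exact schema defined in metadata_storage_formats.py.
import json
from dataclasses import asdict, dataclass


@dataclass
class ReportMetadata:
    ticker: str            # company information
    company_name: str
    sector: str
    quarter: str           # report metadata, e.g. "2025Q2"
    publication_date: str
    parse_method: str      # parsing metadata
    confidence: float


record = ReportMetadata("AAPL", "Apple Inc.", "Technology",
                        "2025Q2", "2025-05-01", "docling", 0.97)
print(json.dumps(asdict(record), indent=2))
```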

Step 5: Upload to Cloud

Organizes and uploads to GCS:

gs://your-bucket/
├── AAPL/
│   ├── 2025Q2/
│   │   ├── raw data
│   │   ├── parsed data
│   │   └── metadata
│   └── ...
└── ...
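
A minimal sketch of that upload using the google-cloud-storage client; the bucket name, local file path, and function name are placeholders rather than the exact code in src/upload_to_cloud.py.

```python
# Illustrative upload following the gs://bucket/TICKER/QUARTER/ layout above.
from pathlib import Path

from google.cloud import storage


def upload_report(bucket_name: str, ticker: str, quarter: str, local_path: Path) -> None:
    """Upload one local file under <ticker>/<quarter>/ in the bucket."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(f"{ticker}/{quarter}/{local_path.name}")
    blob.upload_from_filename(str(local_path))


upload_report("your-finflow-bucket", "AAPL", "2025Q2",
              Path("data/parsed/AAPL_2025Q2_metadata.json"))
```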

📄 Final Report

The full reflection report for Team 2 – FinFlow Project is available here:

👉 Download Team_2_reflection.docx

This document includes the project reflection, challenges, and future extensions.
