MariaMahNoor/Cluster-Rank-Abstract-Financial-Summarization


Project Architecture

Overall Pipeline

[Figure: Pipeline]

Weighted Decay Extraction Strategy

[Figure: Weighted Decay]

Experimental Results

K-Means Elbow Analysis

[Figure: Elbow Plot]

Model Performance Comparison

[Figure: Model Comparison]

ROUGE Score Comparison

[Figure: Score Comparison]

LED Fine-Tuning Curve

[Figure: Training Curve]

Example Generated Summaries

[Figure: Examples]

# Hybrid Extractive-Abstractive Financial Summarization

## Overview

This repository contains the implementation of the research project:

**Hybrid Extractive-Abstractive Summarization of Financial Narratives via Unsupervised Semantic Clustering and DistilBERT Scoring**

The proposed framework addresses the challenge of summarizing extremely long financial reports (Form 10-K) by combining:

- Semantic Sentence Embeddings

- K-Means Clustering

- DistilBERT-based Sentence Ranking

- Longformer Encoder-Decoder (LED) Abstractive Summarization

The system follows a **Cluster-Rank-Abstract** pipeline designed for low-resource environments and Green AI research.

---

## Research Motivation

Financial reports often exceed 50,000 tokens, far beyond the 512 to 1,024 token input limits of standard Transformer models such as BERT and BART, so they cannot be processed in a single pass without heavy truncation.

This project proposes a hybrid solution that:

1. Maps document structure through semantic clustering.

2. Identifies important content using DistilBERT scoring.

3. Generates coherent summaries using LED.

4. Reduces computational requirements while maintaining competitive performance.

---

## System Architecture


    Financial Report (10-K)
              │
              ▼
    Sentence Segmentation
              │
              ▼
    Sentence Embeddings (MiniLM)
              │
              ▼
    K-Means Clustering
              │
              ▼
    Reference-Guided Density Ranking
              │
              ▼
    DistilBERT Sentence Scoring
              │
              ▼
    Weighted Decay Extraction
              │
              ▼
    LED Fine-Tuning
              │
              ▼
    Abstractive Summary Generation

---

## Models

### Fine-Tuned LED Model (This Research)

Kaggle Model:

https://www.kaggle.com/models/mariamahnoor34214/finetuned-led/Transformers/default/1

Description:

- Fine-tuned on a subset of FINDSum

- Trained using approximately 17% of available data

- Used for final abstractive summary generation

### Baseline Models

#### BART-Large-CNN

https://www.kaggle.com/refs/hf-model/facebook/bart-large-cnn

Used as:

- Zero-shot summarization baseline

#### LED-Base-16384

https://www.kaggle.com/refs/hf-model/allenai/led-base-16384

Used as:

- Long-document baseline

- Base model for fine-tuning

---

## Datasets

Due to GitHub storage limitations, datasets are hosted on Kaggle.

### Sentence Embeddings Dataset

https://www.kaggle.com/datasets/mariamahnoor34214/sentence-embeddings-dataset

Contents:

- MiniLM sentence embeddings

- 384-dimensional vector representations

Size:

- Approximately 7 GB

---

### Clustered Dataset

https://www.kaggle.com/datasets/mariamahnoor34214/clustered-dataset

Contents:

- K-Means clustering outputs

- Cluster assignments

- Topic representations

Size:

- Approximately 285 MB

---

### FINDSum Sentences Dataset

https://www.kaggle.com/datasets/mariamahnoor34214/findsum-sentences

Contents:

- Processed sentence-level financial narratives

- Training data for clustering and scoring

Size:

- Approximately 281 MB

---

## Methodology

### Stage 1: Semantic Clustering

Embedding Model:

- all-MiniLM-L6-v2

Clustering:

- K-Means

- K = 30 clusters

Purpose:

- Discover latent financial topics

- Reduce redundancy

- Improve content coverage
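
The clustering step and the elbow analysis can be illustrated with a minimal sketch. The project clusters 384-dimensional MiniLM sentence embeddings with K = 30; the toy 2-D points, the `kmeans` and `elbow_curve` helpers, and the seed below are illustrative assumptions rather than the notebook code. At full scale a library implementation such as scikit-learn's `KMeans` would normally replace the hand-written loop.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


def elbow_curve(points, k_values):
    """Inertia for each candidate K; the 'elbow' is where the drop flattens."""
    return {k: kmeans(points, k)[2] for k in k_values}


# Two well-separated toy "topics" standing in for sentence embeddings.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4), (10.0, 10.0), (10.3, 9.8), (9.9, 10.4)]
labels, _, _ = kmeans(points, k=2)
curve = elbow_curve(points, k_values=[1, 2, 3])
```

Plotting `curve` against K gives the kind of elbow plot used to choose the number of clusters: inertia falls sharply until K matches the number of real topics, then levels off.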

---

### Stage 2: Sentence Importance Ranking

Model:

- DistilBERT

Purpose:

- Rank extracted sentences

- Identify high-value financial information

Strategy:

- Reference-Guided Density Ranking

- Weighted Decay Extraction
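
This README names Weighted Decay Extraction but does not define its schedule, so the sketch below is only one plausible reading and should not be taken as the project's implementation. It assumes that clusters arrive already ranked by importance, that each sentence carries a relevance score (for example from DistilBERT), and that the per-cluster sentence budget shrinks geometrically with cluster rank. The function names, the `top_budget` and `decay` parameters, and the toy data are all hypothetical.

```python
def decay_budgets(num_clusters, top_budget, decay=0.5, minimum=1):
    """Sentence budget per cluster rank: top_budget * decay**rank, floored at `minimum`."""
    return [max(minimum, int(top_budget * decay ** rank)) for rank in range(num_clusters)]


def weighted_decay_extract(ranked_clusters, top_budget=4, decay=0.5):
    """Pick the best-scoring sentences from each cluster under a decaying budget.

    ranked_clusters: clusters ordered from most to least important; each cluster
    is a list of (position_in_document, score, sentence) tuples.
    """
    budgets = decay_budgets(len(ranked_clusters), top_budget, decay)
    selected = []
    for cluster, budget in zip(ranked_clusters, budgets):
        best = sorted(cluster, key=lambda item: item[1], reverse=True)[:budget]
        selected.extend(best)
    # Restore the original document order so the extract reads coherently.
    selected.sort(key=lambda item: item[0])
    return [sentence for _, _, sentence in selected]


# Toy input (hypothetical scores): the first cluster is the most important one.
clusters = [
    [(0, 0.9, "sentence A"), (3, 0.4, "sentence D"), (5, 0.7, "sentence F")],
    [(1, 0.8, "sentence B"), (4, 0.2, "sentence E")],
]
extract = weighted_decay_extract(clusters, top_budget=2, decay=0.5)
```

Under these assumptions the top cluster contributes two sentences and the second cluster one, so higher-ranked topics dominate the extract while lower-ranked topics still get some coverage.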

---

### Stage 3: Abstractive Generation

Model:

- Longformer Encoder-Decoder (LED)

Advantages:

- 16,384-token context window

- Long-document summarization

- Reduced truncation effects
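
As a rough illustration of the generation step, the sketch below runs beam-search summarization with an LED checkpoint through Hugging Face Transformers. It uses the `allenai/led-base-16384` baseline checkpoint and the beam size of 4 listed in this README. The helper names, the 256-token summary limit, and the choice to put global attention on the first token only (a common convention for LED summarization) are assumptions, and the notebooks may be set up differently.

```python
def global_attention_positions(sequence_length):
    """1 marks a token with global attention, 0 a token with local attention only."""
    if sequence_length < 1:
        raise ValueError("sequence_length must be positive")
    return [1] + [0] * (sequence_length - 1)


def summarize(text, model_name="allenai/led-base-16384", max_input_tokens=16384,
              max_summary_tokens=256, num_beams=4):
    """Generate an abstractive summary with an LED checkpoint (downloads the model)."""
    # Heavy dependencies are imported lazily so the helper above works without them.
    import torch
    from transformers import AutoTokenizer, LEDForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = LEDForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=max_input_tokens)
    # Global attention on the first token; every other token attends locally.
    global_mask = torch.tensor(
        [global_attention_positions(inputs["input_ids"].shape[1])]
    )
    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_mask,
            num_beams=num_beams,
            max_length=max_summary_tokens,
        )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `summarize(extracted_text)` on the output of the extraction stage would return a single summary string; pointing `model_name` at a local copy of the fine-tuned weights should work the same way, provided they are saved in the standard Transformers format.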

---

## Experimental Setup

### Hardware

- NVIDIA Tesla T4

- 16 GB VRAM

### DistilBERT Fine-Tuning

| Parameter | Value |
|------------|--------|
| Epochs | 1 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |

### LED Fine-Tuning

| Parameter | Value |
|------------|--------|
| Epochs | 1 |
| Learning Rate | 3e-5 |
| Beam Size | 4 |
| Training Data | 17% |
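
The LED settings above can be captured in a small configuration object, together with a reproducible way to draw the 17% training subset. This README does not say how the subset was chosen, so uniform random sampling with a fixed seed is an assumption, as are the names `LEDFineTuneConfig` and `sample_training_subset`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LEDFineTuneConfig:
    """Values taken from the LED fine-tuning table."""
    epochs: int = 1
    learning_rate: float = 3e-5
    beam_size: int = 4
    data_fraction: float = 0.17


def sample_training_subset(examples, fraction, seed=42):
    """Return a reproducible random subset containing `fraction` of the examples."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    subset_size = max(1, round(len(examples) * fraction))
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(examples, subset_size)


config = LEDFineTuneConfig()
# Integers stand in for training examples (hypothetical dataset of 1,000 items).
subset = sample_training_subset(list(range(1000)), config.data_fraction)
```

Fixing the seed means the same 170 of 1,000 examples are selected on every run, which keeps low-resource experiments comparable across restarts.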

---

## Results

### Model Performance Comparison

| Model | Configuration | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
|---------|-------------|---------|---------|---------|------|
| BART-Large | Zero-Shot | 20.10 | 5.11 | 12.11 | 0.66 |
| LED-Base | Zero-Shot | 33.65 | 10.12 | 17.41 | 6.35 |
| LED-FT (Ours) | 17% Data | 37.76 | 12.25 | 18.52 | 10.57 |
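
For readers unfamiliar with the metric, the sketch below shows how a ROUGE-N F1 score is computed from clipped n-gram overlap. It is a simplified illustration using lowercase whitespace tokens and no stemming, so it will not reproduce the table values exactly; the tooling behind those numbers is not named in this README. ROUGE scores are conventionally reported multiplied by 100, and the function name `rouge_n` is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 in [0, 1] using lowercase whitespace tokens (no stemming)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
rouge_1 = rouge_n(candidate, reference, n=1)  # 5 of 6 unigrams match
rouge_2 = rouge_n(candidate, reference, n=2)  # 3 of 5 bigrams match
```

In this toy pair, ROUGE-1 is 5/6 and ROUGE-2 is 3/5, which shows why ROUGE-2 is usually much lower than ROUGE-1: a single changed word breaks two bigrams.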

---

### Comparison Against FINDSum Baselines

| Method | Type | Liquidity | ROO | Average |
|----------|----------|----------|----------|----------|
| TextRank | Unsupervised | 41.71 | 35.93 | 38.82 |
| BART | Supervised (Full) | 52.37 | 49.00 | 50.69 |
| LED | Supervised (Full) | 53.52 | 53.06 | 53.29 |
| GCG | Hybrid (Full) | 54.55 | 54.32 | 54.44 |
| Ours | Low-Resource Hybrid | – | – | 37.76 |

---

## Repository Structure


    Hybrid-Extractive-Abstractive-Financial-Summarization/
    │
    ├── notebooks/
    │   ├── finetune_LED_ABS.ipynb
    │   ├── LED_finetuning.ipynb
    │   ├── kmeans_1250_abstraction.ipynb
    │   ├── kmeans_extracted_sentences.ipynb
    │   └── hdbs_clustered.ipynb
    │
    ├── papers/
    │   └── Research_Paper.pdf
    │
    ├── models/
    │   └── README.md
    │
    ├── results/
    ├── data/
    │
    ├── README.md
    └── .gitignore

---

## Key Contributions

- Hybrid Extractive-Abstractive Summarization

- Semantic Clustering for Long Financial Documents

- DistilBERT-Based Sentence Ranking

- LED Fine-Tuning Under Low-Resource Constraints

- Green AI Approach for Financial NLP

---

## Author

**Maria Mahnoor**

Research Areas:

- Natural Language Processing

- Financial Document Summarization

- Transformer Models

- Long Document Understanding

- Green AI
