In [13]:
import evaluate
import nltk
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Dataset Preparation

The [Indian_Financial_News](https://huggingface.co/datasets/kdave/Indian_Financial_News) dataset was chosen for fine-tuning the model. It contains English-language financial news articles paired with summaries, which makes it suitable for the summarization task. The dataset is already structured, with columns for content, summary, and link.

To focus the model on stock-related financial news, the dataset was filtered to keep only articles whose content contains the string "stock" (case-insensitive).

In [3]:
df = pd.read_csv("hf://datasets/kdave/Indian_Financial_News/training_data_26000.csv")
print("Dataset size before filtering:", len(df))

df = df[df["Content"].str.contains("stock", case=False, na=False)]
print("Dataset size after filtering:", len(df))

Dataset size before filtering: 26961
Dataset size after filtering: 8825
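
The filter above is a case-insensitive substring match, so it also keeps articles where "stock" appears only inside another word. The sketch below illustrates this on made-up rows and shows a stricter whole-word variant; the toy data and the `\bstocks?\b` pattern are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

# Toy rows (hypothetical), including a missing value
toy = pd.DataFrame({
    "Content": [
        "Stock prices rallied on Monday.",
        "The farmer sold his livestock.",
        "Bond yields fell sharply.",
        None,
    ]
})

# Substring match, as in the cell above: "livestock" also matches
substring_mask = toy["Content"].str.contains("stock", case=False, na=False)

# Stricter variant (assumption): match "stock" or "stocks" as a whole word
word_mask = toy["Content"].str.contains(r"\bstocks?\b", case=False, na=False, regex=True)

print(substring_mask.tolist())  # [True, True, False, False]
print(word_mask.tolist())       # [True, False, False, False]
```

`na=False` treats missing content as a non-match, which is why the last row is dropped in both cases.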


## Data parsing from Yahoo Finance

For additional data, Yahoo Finance news for a set of tickers was parsed and cleaned.

The tickers were taken from the list of S&P 500 companies, shuffled with a fixed seed.

In [9]:
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)

sp500_table = tables[0]
sp500_tickers = sp500_table[['Symbol', 'Security']].sample(frac=1, random_state=42).reset_index(drop=True)
print(sp500_tickers.head(10))

  Symbol              Security
0      K             Kellanova
1    BRO         Brown & Brown
2    LIN             Linde plc
3    DTE            DTE Energy
4   CINF  Cincinnati Financial
5    LHX              L3Harris
6    RTX       RTX Corporation
7    GLW          Corning Inc.
8   BKNG      Booking Holdings
9   IDXX    Idexx Laboratories


After parsing with `services/news_fetcher.py`, the data was cleaned to remove duplicates and non-English characters, then saved to CSV files in iterative steps (see `notebooks/news_parsing.ipynb`).
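
The cleaning code itself lives in the files referenced above and is not shown here. As a rough sketch of what such a step could look like, the hypothetical helper below (`clean_news` is not taken from the project) strips non-ASCII characters as a crude proxy for "non-English", collapses whitespace, and drops empty and duplicate rows.

```python
import pandas as pd

def clean_news(df: pd.DataFrame, text_col: str = "Content") -> pd.DataFrame:
    """Hypothetical cleaning step: strip non-ASCII text and remove duplicates."""
    out = df.copy()
    out[text_col] = (
        out[text_col]
        .fillna("")
        .str.replace(r"[^\x00-\x7F]+", "", regex=True)  # drop non-ASCII characters
        .str.replace(r"\s+", " ", regex=True)           # collapse runs of whitespace
        .str.strip()
    )
    out = out[out[text_col] != ""]                      # drop rows left empty
    return out.drop_duplicates(subset=text_col).reset_index(drop=True)

raw = pd.DataFrame({
    "Content": ["Apple shares rose.", "Apple shares rose.", "Акции выросли", "Tesla fell 3%."]
})
cleaned = clean_news(raw)
print(cleaned["Content"].tolist())
```

Dropping all non-ASCII characters is a blunt instrument: it also removes currency symbols such as "€" or "₹", so a real pipeline may prefer a narrower rule.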

In [11]:
df = pd.read_csv("dataset.csv")
print("Parsed news and the Indian Financial News dataset:", len(df))

Parsed news and the Indian Financial News dataset: 10605


In [17]:
# Create a Dataset object from the DataFrame
dataset = Dataset.from_pandas(df)

# Baseline Evaluation

As a baseline, the `facebook/bart-large-cnn` model was used, which is BART-large fine-tuned on the CNN/Daily Mail dataset for summarization. This model is widely used and serves as a reasonable reference point for the fine-tuned models, given that financial news is similar in style to the news articles in CNN/Daily Mail.

In [None]:
baseline_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
baseline_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [None]:
max_input_length = 512
max_target_length = 128

def preprocess(example):
    inputs = baseline_tokenizer(example["Content"], max_length=max_input_length, truncation=True, padding="max_length")
    targets = baseline_tokenizer(example["Summary"], max_length=max_target_length, truncation=True, padding="max_length")
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized_dataset = dataset.map(preprocess, batched=True)

In [None]:
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

In [None]:
# BERTScore is registered as "bertscore" in the evaluate library; the language is passed to compute()
metric = evaluate.load("bertscore")

def evaluate_baseline_in_batches(dataset, batch_size=16):
    baseline_model.eval()
    all_preds = []
    all_labels = []
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i+batch_size]
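        input_ids = torch.tensor(batch['input_ids']).to(baseline_model.device)
        # Pass the attention mask explicitly so padded positions are ignored during generation
        attention_mask = torch.tensor(batch['attention_mask']).to(baseline_model.device)
        with torch.no_grad():
            outputs = baseline_model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=max_target_length,
                num_beams=4
            )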
        preds = baseline_tokenizer.batch_decode(outputs, skip_special_tokens=True)
        labels = baseline_tokenizer.batch_decode(batch['labels'], skip_special_tokens=True)
        all_preds.extend(preds)
        all_labels.extend(labels)
    return all_preds, all_labels

eval_dataset = tokenized_dataset['test']
preds, labels = evaluate_baseline_in_batches(eval_dataset)

decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in preds]
decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in labels]
result = metric.compute(predictions=decoded_preds, references=decoded_labels, lang="en")
aggregated = {
    "precision": float(np.mean(result["precision"])),
    "recall":    float(np.mean(result["recall"])),
    "f1":        float(np.mean(result["f1"])),
}
print(aggregated)

The evaluation results for the baseline model were computed on a Google Colab T4 GPU:

```
{
 'eval_precision': 0.8868377528808735,
 'eval_recall': 0.8776564202926777,
 'eval_f1': 0.8821186708079444,
}
```

# Model Fine-Tuning

The following open, lightweight summarization models were selected for fine-tuning:
- [bert-small-finetuned-cnn](https://huggingface.co/mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization) (29M params): a warm-started BERT2BERT model fine-tuned on the CNN/Daily Mail summarization dataset.
- [Falconsai/text_summarization](https://huggingface.co/Falconsai/text_summarization) (60.5M params): a fine-tuned T5 Small model designed for text summarization.

In [None]:
metric = evaluate.load("bertscore")

# Function to use for evaluation in the Trainer
def get_metic_func(tokenizer):
    def compute_metrics(eval_preds):
        preds, labels = eval_preds

        # decode preds and labels
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # put each sentence on its own line (a convention from ROUGE-Lsum; BERTScore does not require it)
        decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
        decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

        result = metric.compute(predictions=decoded_preds, references=decoded_labels, lang="en")

        aggregated = {
            "precision": float(np.mean(result["precision"])),
            "recall":    float(np.mean(result["recall"])),
            "f1":        float(np.mean(result["f1"])),
        }

        return aggregated
    return compute_metrics
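
The `np.where` call in `compute_metrics` is needed because label positions set to `-100` (the index ignored by the loss) are not valid token ids and cannot be decoded. A minimal sketch of that replacement on a toy array, assuming a pad token id of `0` purely for illustration:

```python
import numpy as np

pad_token_id = 0  # assumption for this example; the real value comes from the tokenizer

# Toy label batch: -100 marks positions that the loss ignores
labels = np.array([
    [101, 2023, 102, -100, -100],
    [101,  102, -100, -100, -100],
])

# Swap -100 for the pad id so the tokenizer can decode the sequence;
# skip_special_tokens=True then removes the padding from the decoded text
restored = np.where(labels != -100, labels, pad_token_id)
print(restored.tolist())  # [[101, 2023, 102, 0, 0], [101, 102, 0, 0, 0]]
```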

### bert-small-finetuned-cnn

In [None]:
tokenizer = AutoTokenizer.from_pretrained("mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization")

In [None]:
max_input_length = 512
max_target_length = 128

def preprocess(example):
    inputs = tokenizer(example["Content"], max_length=max_input_length, truncation=True, padding="max_length")
    targets = tokenizer(example["Summary"], max_length=max_target_length, truncation=True, padding="max_length")
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized_dataset = dataset.map(preprocess, batched=True)
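
Because the targets are padded with `padding="max_length"`, the `labels` produced above contain pad token ids, and the training loss is computed on those positions too. A common variation is to replace pad ids in the labels with `-100`, which the cross-entropy loss ignores. The helper below is a hypothetical sketch of that idea (`mask_label_padding` is not part of the original notebook); the results reported here were obtained without it, and applying it would change the loss values.

```python
def mask_label_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace pad token ids in each label sequence with the ignore index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Toy padded label sequences, assuming a pad token id of 0 for illustration
padded = [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]]
masked = mask_label_padding(padded, pad_token_id=0)
print(masked)  # [[101, 7, 8, 102, -100, -100], [101, 9, 102, -100, -100, -100]]
```

Inside `preprocess`, this would be applied as `inputs["labels"] = mask_label_padding(targets["input_ids"], tokenizer.pad_token_id)`.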

In [None]:
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=5,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    logging_dir='./logs',
)
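
To get a feel for what these settings mean in practice, the number of optimizer steps can be estimated from the dataset size printed earlier. This is back-of-the-envelope arithmetic that assumes a single GPU, no gradient accumulation, a final partial batch that is kept, and a test split that rounds up.

```python
import math

n_examples = 10605        # rows in dataset.csv, as printed above
test_size = 0.1           # fraction passed to train_test_split
train_batch_size = 64     # per_device_train_batch_size
num_epochs = 5            # num_train_epochs

n_test = math.ceil(test_size * n_examples)  # assumption: the test split rounds up
n_train = n_examples - n_test
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_epochs

print(n_train, n_test, steps_per_epoch, total_steps)  # 9544 1061 150 750
```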

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=get_metic_func(tokenizer)
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

The training and evaluation results for the `bert-small-finetuned-cnn` model were computed on a Google Colab T4 GPU:

```
{
     'eval_loss': 0.1591629683971405,
     'eval_precision': 0.9290257127231079,
     'eval_recall': 0.9352621863981861,
     'eval_f1': 0.9320505570384101,
     'eval_runtime': 545.7026,
     'epoch': 5.0
 }
```

### Falconsai/text_summarization

In [None]:
tokenizer = AutoTokenizer.from_pretrained("Falconsai/text_summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("Falconsai/text_summarization")

In [None]:
max_input_length = 512
max_target_length = 128

def preprocess(example):
    inputs = tokenizer(example["Content"], max_length=max_input_length, truncation=True, padding="max_length")
    targets = tokenizer(example["Summary"], max_length=max_target_length, truncation=True, padding="max_length")
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized_dataset = dataset.map(preprocess, batched=True)

In [None]:
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    learning_rate=2e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=5,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    logging_dir='./logs',
)

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=get_metic_func(tokenizer)
)


In [None]:
trainer.train()

In [None]:
trainer.evaluate()

The training and evaluation results for the `falconsai/text_summarization` model were computed on a Google Colab T4 GPU:

```
{
 'eval_loss': 0.22356651723384857,
 'eval_precision': 0.9129834810936942,
 'eval_recall': 0.839822987892206,
 'eval_f1': 0.8745714848599347,
 'eval_runtime': 137.4924,
 'epoch': 5.0
}
```

# Conclusion

Fine-tuning the summarization models on the combined dataset gave mixed results. The `bert-small-finetuned-cnn` model reached a BERTScore F1 of approximately 0.932, clearly above the baseline's 0.882. The `falconsai/text_summarization` model reached an F1 of approximately 0.875, slightly below the baseline: its precision improved (0.913 vs. 0.887), but its recall dropped (0.840 vs. 0.878).

Only one of the two fine-tuned models outperformed the baseline on F1. This suggests that domain-specific fine-tuning can improve summarization in the financial news domain, but the gain depends on the model. Note that the baseline was evaluated without any fine-tuning on this dataset.


| Model                              | Precision | Recall | F1     | Eval loss | Epochs | Eval runtime (s) |
|------------------------------------|-----------|--------|--------|-----------|--------|------------------|
| facebook/bart-large-cnn (Baseline) | 0.8868    | 0.8777 | 0.8821 | -         | -      | -                |
| bert-small-finetuned-cnn           | 0.9290    | 0.9353 | 0.9321 | 0.1592    | 5      | 545.70           |
| falconsai/text_summarization       | 0.9130    | 0.8398 | 0.8746 | 0.2236    | 5      | 137.49           |
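
The differences from the baseline can be read off the table directly. As a small sketch, the snippet below recomputes them with pandas from the rounded values in the table; it is only a convenience for comparison and not part of the original evaluation.

```python
import pandas as pd

baseline = "facebook/bart-large-cnn"

# Rounded scores copied from the table above
scores = pd.DataFrame(
    {
        "precision": [0.8868, 0.9290, 0.9130],
        "recall": [0.8777, 0.9353, 0.8398],
        "f1": [0.8821, 0.9321, 0.8746],
    },
    index=[baseline, "bert-small-finetuned-cnn", "falconsai/text_summarization"],
)

# Subtract the baseline row from every row, then drop the baseline itself
delta = (scores - scores.loc[baseline]).drop(index=baseline).round(4)
print(delta)
```

A positive value means the fine-tuned model scored higher than the baseline on that metric.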
