\*Please run this notebook in Google Colab.

# Setup

### Environment Setup

In [23]:
!pip install instructor



In [24]:
# !cp /content/macro_financial_forecasting/applications/macro_financial_forecasting/data/danidanou_Bloomberg_Financial_News_train /content/
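# Use %cd rather than !cd: a !cd runs in a temporary subshell and does not
# change the notebook's working directory for later cells.
%cd /content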
!rm -rf macro_financial_forecasting

In [25]:
!git clone https://github.com/chuanbinp/macro_financial_forecasting.git

Cloning into 'macro_financial_forecasting'...
remote: Enumerating objects: 1931, done.
remote: Counting objects: 100% (652/652), done.
remote: Compressing objects: 100% (210/210), done.
remote: Total 1931 (delta 498), reused 444 (delta 442), pack-reused 1279 (from 2)
Receiving objects: 100% (1931/1931), 39.58 MiB | 16.21 MiB/s, done.
Resolving deltas: 100% (1231/1231), done.


In [26]:
# !cp /content/danidanou_Bloomberg_Financial_News_train /content/macro_financial_forecasting/applications/macro_financial_forecasting/data/danidanou_Bloomberg_Financial_News_train

In [27]:
%cd macro_financial_forecasting/applications/macro_financial_forecasting/src

/content/macro_financial_forecasting/applications/macro_financial_forecasting/src/macro_financial_forecasting/applications/macro_financial_forecasting/src


### Mount GDrive

In [28]:
from google.colab import drive

# This will prompt you to authorize Colab to access your Google Drive.
drive.mount('/content/gdrive')
GDRIVE_PATH = "/content/gdrive/MyDrive/macro_financial_forecasting_files/"

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Code Setup

In [None]:
from config import Config
from train_data_loader import TrainDataLoader
from data_model.bloomberg_news_entry import BloombergNewsEntry
from google.colab import userdata
import os

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
config = Config("config.env")

train_data_loader = TrainDataLoader(config)

# Train Data Loader

In [30]:
print("Starting loading pipeline ...")
print(f"Config: {config}")

train_ds = train_data_loader.load()
print("Loading pipeline completed.")

Starting loading pipeline ...
Config: Config(
  gemini_api_key: !secret!
  openai_api_key: !secret!
  llm_model: openai/gpt-5-nano-2025-08-07
  industries: ['Information Technology', 'Health Care', 'Financials', 'Consumer Discretionary', 'Communication Services', 'Industrials', 'Consumer Staples', 'Energy', 'Utilities', 'Real Estate', 'Materials', 'General Market', 'None']
  dataset_name: danidanou/Bloomberg_Financial_News
  dataset_dir: ../data/
  rss_feeds: ['https://feeds.bloomberg.com/news/news.rss', 'https://feeds.bloomberg.com/markets/news.rss', 'https://feeds.bloomberg.com/business/news.rss', 'https://feeds.bloomberg.com/technology/news.rss', 'https://feeds.bloomberg.com/politics/news.rss', 'https://feeds.bloomberg.com/wealth/news.rss', 'https://feeds.bloomberg.com/economics/news.rss', 'https://feeds.bloomberg.com/green/news.rss', 'https://feeds.bloomberg.com/pursuits/news.rss', 'https://feeds.bloomberg.com/opinion/news.rss', 'https://feeds.bloomberg.com/finance/news.rss', 'http

bloomberg_financial_data.parquet.gzip:   0%|          | 0.00/482M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/446762 [00:00<?, ? examples/s]


--- Download Successful! ---


Map:   0%|          | 0/446762 [00:00<?, ? examples/s]

Training dataset processed.

--- Starting validation of 446762 entries ---


Validating entries: 100%|██████████| 446762/446762 [00:28<00:00, 15910.17it/s]


--- Validation Complete! ---
Training dataset validated.
Saving processed dataset to local cache at '../data/danidanou_Bloomberg_Financial_News_train'...
Total number of rows: 446762
Loading pipeline completed.


# Data Processing Pipeline

Update these two variables to specify which slice of dataset indices to process.

In [31]:
DATA_START = 50000  # index of the first entry to process
DATA_END = None     # None means "through the last entry"
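
The two bounds are used as an ordinary Python slice (`train_ds[DATA_START:DATA_END]`), so an end of `None` leaves the slice open-ended. As a quick sanity check, the sketch below uses a `range` as a stand-in for the dataset (the stand-in and the name `TOTAL_ROWS` are illustrative assumptions, not part of the repository) and reproduces the entry count that the processing log reports further down:

```python
# Stand-in for train_ds: only the row count matters for this check.
TOTAL_ROWS = 446762   # "Total number of rows" reported by the loader
DATA_START = 50000
DATA_END = None       # None leaves the slice open-ended

indices = range(TOTAL_ROWS)[DATA_START:DATA_END]
selected = len(indices)
print(selected)       # 396762
```

Setting `DATA_END` to an integer instead selects the half-open range `[DATA_START, DATA_END)`, which is convenient for splitting a long run across several Colab sessions.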

In [32]:
from processor import NewsProcessor
import nest_asyncio

# Colab already runs an event loop; nest_asyncio lets async code run inside it.
nest_asyncio.apply()

processor = NewsProcessor(config)

# Process only the slice selected by DATA_START and DATA_END.
sample = processor.remove_redundant_info(train_ds[DATA_START:DATA_END])

# Each stage below is given a save_path under GDRIVE_PATH on Google Drive.
df = processor.enrich_news_entries_with_classifications(sample, save_path=f"{GDRIVE_PATH}processed_news")
df = processor.group_by_date_and_industry(df, save_path=f"{GDRIVE_PATH}grouped_news")
df = processor.filter_and_analyze_news(df)
df = processor.extract_impactful_news(df, top_n=3, save_path=f"{GDRIVE_PATH}impact_news")
df = processor.get_consolidated_sentiment(df, save_path=f"{GDRIVE_PATH}sentiment_news")

Device set to use cuda:0


Processing 396762 news entries...


FinBERT Sentiment: 100%|██████████| 1550/1550 [53:04<00:00,  2.05s/batch]
Industry Classification: 100%|██████████| 12399/12399 [6:27:58<00:00,  1.88s/batch]


Completed processing 396762 entries

Dropped 1141 (Industry, Date) pairs with Industry='None'
Remaining pairs: 13591

Summary Statistics:
Total unique (Industry, Date) pairs: 13591
Average articles per pair: 27.91
Max articles in a pair: 1262
Min articles in a pair: 1
25th percentile: 4.0
50th percentile: 13.0
75th percentile: 29.0
Number of pairs with at least 3 articles: 11284
Total articles: 379362


Extracting top 3 impactful news per (Industry, Date) pair...


Processing groups: 100%|██████████| 13591/13591 [00:00<00:00, 15013.76it/s]


Processing 13591 news entries...


FinBERT Sentiment: 100%|██████████| 54/54 [01:46<00:00,  1.96s/batch]


Completed processing 13591 entries
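
The summary statistics in the output above describe how many articles fall into each (Industry, Date) pair. The sketch below shows one way such figures could be computed; it is not the repository's implementation. The helper `summarize_pairs`, its `min_articles` parameter, and the toy rows are hypothetical, and plain Python is used in place of whatever the processor does internally. The last lines check that the reported average is consistent with the reported totals.

```python
from collections import Counter
from statistics import mean, quantiles

def summarize_pairs(rows, min_articles=3):
    """Summarize article counts per (industry, date) pair (hypothetical helper)."""
    # Drop articles whose industry label is the string 'None', as the log describes.
    kept = [pair for pair in rows if pair[0] != "None"]
    sizes = sorted(Counter(kept).values())
    # Quartiles with linear interpolation between data points.
    q25, q50, q75 = quantiles(sizes, n=4, method="inclusive")
    return {
        "pairs": len(sizes),
        "average": mean(sizes),
        "max": max(sizes),
        "min": min(sizes),
        "percentiles": (q25, q50, q75),
        "pairs_with_min_articles": sum(size >= min_articles for size in sizes),
        "total_articles": sum(sizes),
    }

# Toy data: one (industry, date) tuple per article.
toy_rows = (
    [("Energy", "2011-10-06")] * 3
    + [("Financials", "2011-10-06")] * 5
    + [("Energy", "2011-10-07")] * 1
    + [("None", "2011-10-06")] * 2
)
summary = summarize_pairs(toy_rows)
print(summary)

# Consistency check on the figures reported in the log above.
reported_average = round(379362 / 13591, 2)
print(reported_average)  # 27.91
```

The same check also shows that 396762 - 379362 = 17400 articles were classified with Industry='None' and excluded from the grouped data.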


In [None]:
final_df = await processor.get_explanation(df, save_path=f"{GDRIVE_PATH}explanation_news")

Explanation: 100%|██████████| 2207/2207 [11:07<00:00,  3.31it/s]
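
How `get_explanation` schedules its requests is not shown in this notebook. For readers unfamiliar with the pattern, the sketch below illustrates one common way to issue many awaitable calls with bounded concurrency using `asyncio.Semaphore`. The names `explain_one`, `explain_all`, and `max_concurrency` are hypothetical, and the sleep stands in for a real LLM request.

```python
import asyncio

async def explain_one(row_id: int) -> str:
    """Hypothetical stand-in for a single LLM request."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"explanation for row {row_id}"

async def explain_all(row_ids, max_concurrency: int = 8):
    """Run explain_one for every row, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(row_id: int) -> str:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await explain_one(row_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(row_id) for row_id in row_ids))

results = asyncio.run(explain_all(range(20)))
print(len(results))
```

Inside a notebook cell, where an event loop is already running, `results = await explain_all(range(20))` can be used in place of `asyncio.run(...)`.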


In [None]:
df

Unnamed: 0,Industry,Date,News,ArticleCount,ImpactfulNews,AvgSentimentScore,SentimentScore,SentimentExplanation
0,Communication Services,2011-10-06,[{'Headline': 'FCC to Revamp Phone Subsidy to ...,2,[{'Headline': 'Euro-Area Leaders to Hold Summi...,0.709191,-0.291917,Overall sentiment for the Communications Servi...
1,Consumer Discretionary,2011-10-06,[{'Headline': 'PepsiCo May Purchase Russian Dr...,1,[{'Headline': 'PepsiCo May Purchase Russian Dr...,0.88174,0.888237,The article’s sentiment is strongly positive f...
2,Consumer Staples,2011-10-06,[{'Headline': 'Ukraine’s Grain Harvest Advance...,1,[{'Headline': 'Ukraine’s Grain Harvest Advance...,-0.918589,-0.917441,Explanation: FinBERT indicates a strongly nega...
3,Energy,2011-10-06,[{'Headline': 'Clean-Tech Companies Should Get...,9,[{'Headline': 'Norway Boosts Mongstad Carbon-S...,0.252093,-0.25285,"The energy-angle sentiment is mildly negative,..."
4,Financials,2011-10-06,[{'Headline': 'Ivory Coast Keeps Cocoa Export ...,51,[{'Headline': 'Remittances to Vietnam Thru Jul...,-0.306989,-0.032282,Combined FinBERT signal for the Financials set...
5,General Market,2011-10-06,[{'Headline': 'Farmland Seen Returning Up to 1...,4,[{'Headline': 'GE Study Finds Recession’s Job ...,-0.256293,-0.873807,Overall sentiment is negative: the combined Fi...
6,Health Care,2011-10-06,[{'Headline': 'House Panel Seeks Details on IR...,2,[{'Headline': 'Emdeon Said to Set Rate on $1.2...,0.666714,0.682417,Score: +0.67. Explanation: The Health Care new...
7,Industrials,2011-10-06,[{'Headline': 'Airbus German Workers Plan Work...,5,"[{'Headline': 'Polish Stocks: Getin, KGHM, Lot...",-0.296801,-0.868415,Explanation: The Industrials sentiment is nega...
8,Information Technology,2011-10-06,[{'Headline': 'Fans Hold IPhone-Lit Vigils for...,2,[{'Headline': 'Fans Hold IPhone-Lit Vigils for...,0.412332,0.002113,Net sentiment for the Information Technology t...
9,Materials,2011-10-06,[{'Headline': 'USDA Boxed Beef Cutout Closing ...,5,[{'Headline': 'Ukraine September Consumer Pric...,0.853697,0.875244,Combined materials-focused sentiment is strong...


In [None]:
import pandas as pd
pd.read_parquet(f"{GDRIVE_PATH}explanation_news")

Unnamed: 0,Industry,Date,News,ArticleCount,ImpactfulNews,AvgSentimentScore,SentimentScore
0,Industrials,2006-10-20,"[{'Article': 'Inco Ltd., the Canadian nickel p...",1,"[{'Article': 'Inco Ltd., the Canadian nickel p...",-0.306854,-0.430676
1,Financials,2006-10-21,[{'Article': 'Jim Cramer recommended that view...,1,[{'Article': 'Jim Cramer recommended that view...,0.467539,0.463383
2,Financials,2007-01-03,"[{'Article': 'Kaye Scholer LLP, the 500-lawyer...",1,"[{'Article': 'Kaye Scholer LLP, the 500-lawyer...",0.878102,0.872732
3,Communication Services,2007-02-12,[{'Article': 'A federal judge declared Adelphi...,1,[{'Article': 'A federal judge declared Adelphi...,0.714189,0.751565
4,Financials,2007-02-12,[{'Article': 'Advanced Marketing Services Inc....,2,[{'Article': 'Advanced Marketing Services Inc....,0.185984,-0.779473
...,...,...,...,...,...,...,...
2202,Real Estate,2013-11-01,[{'Article': 'A three-way bidding war for Aust...,2,[{'Article': 'A three-way bidding war for Aust...,-0.180499,0.004677
2203,Utilities,2013-11-01,[{'Article': 'More than 30 months after an ear...,1,[{'Article': 'More than 30 months after an ear...,0.004326,0.001697
2204,Financials,2013-11-04,[{'Article': 'The U.S. jobs report in the comi...,1,[{'Article': 'The U.S. jobs report in the comi...,0.172938,0.230453
2205,Materials,2013-11-04,[{'Article': 'Grain exports from the French po...,1,[{'Article': 'Grain exports from the French po...,0.008498,0.009371
