______
# Natural Language Processing in Finance
______
When building an NLP pipeline for finance, high-quality textual data such as financial news, earnings call transcripts, and analyst commentary is required as model input. Below are some accessible sources:

## Free Earnings Call Transcripts
- MarketBeat: free transcripts for S&P 500 earnings calls, available publicly via their website
- Seeking Alpha: comprehensive transcripts including slides and audio (may require sign up)
- Quartr: mobile app offering live and recorded earnings call transcripts with global coverage

## Financial News and Commentary APIs
- MarketAux: free tier API for real-time financial news across 5,000 sources
- Finnhub: free real-time news, along with market data such as fundamentals and forex
- FinancialNewsAPI: aggregates millions of financial news articles with free access
- Financial Modeling Prep: free plan for its stock news API
- StockNewsAPI.com: free endpoint for aggregated stock market news

## Datasets for Research and Model Training
- NIFTY Financial News Headlines Dataset: contains 15.7M time-aligned headlines and stock prices for 4,775 S&P 500 firms from 1999-2023, available on Hugging Face
- FNSPID: dataset with 29.7M stock prices and 15.7M financial news records, ideal for sentiment and forecasting analysis
- ECTSum: earnings call transcripts with expert bullet-point summaries, suited to summarisation tasks
______
For sentiment analysis or finance-specific information extraction, we can use readily available NLP tools such as pre-trained domain models or enterprise-grade NLP pipelines.

## NLP Tools for Finance
- FinBERT: specialised BERT model fine-tuned for financial sentiment analysis, built on top of Hugging Face's Transformers library. It outputs positive, negative, and neutral labels with corresponding probabilities.
- Spark NLP & Spark OCR: enterprise-grade NLP library optimized for scalability on Spark. Includes pipelines for tokenization, NER, sentiment, document classification, plus OCR integration
- deepset Haystack: open-source framework for building search, QA, summarization, and RAG pipelines. Integrates with large language models via Hugging Face, OpenAI, Cohere
- Contextual AI: enterprise-focused platform for Retrieval-Augmented Generation (RAG), enabling specialised agents for banking and finance
- Cohere: large language model APIs for classification, summarization, and financial applications, now with a secure “North for Banking” platform
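
To make the FinBERT output format concrete, here is a minimal sketch of how three raw class scores (logits) become label probabilities via softmax. The logits are made-up values and the label order is an assumption for illustration only; with a real model, the order should be read from the model's configuration rather than hard-coded.

```python
import math

# Assumed label order, for illustration only; check the model config in practice.
LABELS = ["positive", "negative", "neutral"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def label_probabilities(logits):
    """Pair each label with its probability."""
    return dict(zip(LABELS, softmax(logits)))

# Hypothetical logits for a headline such as "Shares jump after strong earnings"
probs = label_probabilities([2.1, -1.3, 0.4])
predicted = max(probs, key=probs.get)
print(predicted, {label: round(p, 3) for label, p in probs.items()})
```

The predicted label is simply the one with the highest probability; keeping all three probabilities is useful when a downstream step needs a confidence threshold.
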
______
For experimentation, we first use **MarketAux** to pull fresh financial news via its API and **FinBERT** to classify the sentiment of that news.
______

## Importing Necessary Libraries

In [10]:
# import necessary libraries
import os
from dotenv import load_dotenv

# import custom helpers from the project
from src.utils import get_marketaux_news, get_raw_news_rss, TopicModeling

# load in environment variables
load_dotenv()

# autoreload
%load_ext autoreload
%autoreload 2


_____
# MarketAux
A free market and financial news API that gives access to global stock market and finance news, including funds and crypto.

Link to website and documentation: https://www.marketaux.com/
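
The `get_marketaux_news` helper in `src/utils.py` is not shown in this notebook, so the following is only a sketch of how a request URL for the API might be assembled. The endpoint path and parameter names reflect my reading of the public documentation and should be verified there; `build_marketaux_url` is a hypothetical name, not the project's actual function.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and parameter names are assumptions based on the public docs; verify before use.
BASE_URL = "https://api.marketaux.com/v1/news/all"

def build_marketaux_url(symbols, api_token, limit=3, language="en"):
    """Assemble a news query URL for a list of ticker symbols."""
    params = {
        "symbols": ",".join(symbols),   # comma-separated tickers
        "filter_entities": "true",
        "language": language,
        "limit": limit,
        "api_token": api_token,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_marketaux_url(["TSLA", "AAPL"], api_token="YOUR_TOKEN")
print(url)
```

The resulting URL can then be fetched with any HTTP client and the JSON body decoded into a dictionary.
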

In [3]:
# load in API key
api_key = os.environ.get("MARKET_AUX_API_TOKEN")

In [4]:
# get news
status, news = get_marketaux_news(["TSLA"], api_key)

In [5]:
news

{'meta': {'found': 74462, 'returned': 3, 'limit': 3, 'page': 1},
 'data': [{'uuid': '6b7a8c25-4164-4c26-bfe6-793500e361b7',
   'title': 'Stock Market Today: Futures Edge Higher Amid Trump Tariff Concerns and Premarket Movers',
   'description': "U.S. stock futures are pointing to a slightly higher open on Wednesday, July 9, 2025, as markets today continue to digest President Donald Trump's latest",
   'keywords': '',
   'snippet': 'Market Overview: Indexes Poised for Cautious Gains\n\nU.S. stock futures are pointing to a slightly higher open on Wednesday, July 9, 2025, as markets today conti...',
   'url': 'https://thestockmarketwatch.com/stock-market-news/stock-market-today-futures-edge-higher-amid-trump-tariff-concerns-and-premarket-movers/50311/',
   'image_url': 'https://thestockmarketwatch.com/stock-market-news/wp-content/uploads/2025/07/cropped-smw-icon-logo.png',
   'language': 'en',
   'published_at': '2025-07-09T13:07:50.000000Z',
   'source': 'thestockmarketwatch.com',
   're

_____
As the response above shows, most of these free news APIs return only a title, a short description, and a snippet rather than the full article text.

This does not suit our project, which aims to run FinBERT on full news articles. Hence, we turn to other methods of gathering news.
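
A quick way to confirm how little text each record carries is to measure the text fields directly. The sketch below assumes a response shaped like the one printed above (a `data` list of dictionaries with `title`, `description`, and `snippet` keys); the sample values are abbreviated stand-ins, not real API output.

```python
def text_field_lengths(response, fields=("title", "description", "snippet")):
    """Return, per article, the character length of each text field."""
    rows = []
    for article in response.get("data", []):
        rows.append({field: len(article.get(field) or "") for field in fields})
    return rows

# Abbreviated stand-in for an API response (hypothetical values)
sample = {
    "meta": {"found": 1, "returned": 1, "limit": 3, "page": 1},
    "data": [{
        "title": "Stock Market Today: Futures Edge Higher",
        "description": "U.S. stock futures are pointing to a slightly higher open",
        "snippet": "Market Overview: Indexes Poised for Cautious Gains...",
        "url": "https://example.com/article",
    }],
}

lengths = text_field_lengths(sample)
print(lengths)
```
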
_____

# Methods to Gather Comprehensive Unstructured Financial News

Public news APIs (e.g. MarketAux, NewsAPI) have limited coverage due to licensing, source restrictions, and API call caps. For full daily financial news, we can consider the following:

## 1. Paid News Aggregator APIs with Full Feeds

- **Examples**: Bloomberg Terminal API, Refinitiv (Reuters) Eikon API, FactSet, S&P Capital IQ
- **Advantages**: Near-complete coverage, structured metadata, company tagging
- **Disadvantages**: Very expensive (institution-level subscriptions, thousands per month)

## 2. Direct Web Scraping

- **Use case**: Scrape financial news websites (CNBC, Bloomberg, Reuters, WSJ) for headlines and full articles.
- **Tools**: BeautifulSoup, Scrapy, Selenium (for dynamic pages).
- **Considerations**:
  - Check legal and ethical compliance (Terms of Service).
  - Implement polite scraping (rate limits, rotating proxies).
  - Use `newspaper3k` for article text extraction from URLs.
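
As a rough illustration of the rate-limiting part of polite scraping, the sketch below enforces a minimum delay between consecutive requests. `RateLimiter` is a hypothetical helper written for this example; the clock and sleep functions are injectable only so the behaviour can be checked without waiting in real time.

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

limiter = RateLimiter(min_interval=0.05)
for url in ["https://example.com/a", "https://example.com/b"]:
    limiter.wait()  # a real scraper would fetch `url` right after this call
```
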

## 3. RSS Feeds Aggregation

- **Approach**:
  - Aggregate multiple RSS feeds from financial news sites.
  - Store and deduplicate by GUID or URL.
  - Fetch and parse article text daily.
- **Tools**: feedparser (Python), AWS Lambda or scheduled scripts.
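
The deduplication step can be sketched with the standard library alone. This example parses two small RSS 2.0 documents with `xml.etree.ElementTree` (a stand-in for feedparser, chosen only to keep the example dependency-free) and keeps the first item per GUID, falling back to the URL when no GUID is present. The feed contents are invented.

```python
import xml.etree.ElementTree as ET

def parse_rss_items(xml_text):
    """Extract title, link, and guid from each <item> of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "guid": (item.findtext("guid") or "").strip(),
        })
    return items

def deduplicate(items, seen=None):
    """Keep the first item per GUID, falling back to the URL when no GUID exists."""
    seen = set() if seen is None else seen
    unique = []
    for entry in items:
        key = entry["guid"] or entry["link"]
        if key and key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

FEED_A = """<rss version="2.0"><channel>
<item><title>Oil prices slip</title><link>https://example.com/oil</link><guid>oil-1</guid></item>
<item><title>S&amp;P 500 rebounds</title><link>https://example.com/sp500</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<item><title>Oil prices slip (syndicated)</title><link>https://example.org/oil</link><guid>oil-1</guid></item>
<item><title>S&amp;P 500 rebounds</title><link>https://example.com/sp500</link></item>
<item><title>Tesla earnings</title><link>https://example.com/tsla</link><guid>tsla-1</guid></item>
</channel></rss>"""

all_items = parse_rss_items(FEED_A) + parse_rss_items(FEED_B)
unique_items = deduplicate(all_items)
print([item["title"] for item in unique_items])
```

Passing a persistent `seen` set (for example, loaded from the database) would extend the same check across daily runs.
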

## 4. Licensed Data Vendors

- **Use case**: Institutional or professional projects.
- **Providers**: LexisNexis, Dow Jones Factiva.
- **Advantages**: Full news archives, searchable APIs, rich metadata (tickers, topics, timestamps).
- **Disadvantages**: Very high licensing costs.

## 5. Financial Social Media and Alternative Data

- **Include**:
  - Stocktwits API
  - Twitter API (financial influencers and company news)
  - Reddit scraping (e.g. r/investing, r/stocks)
- **Note**: Sentiment-rich but noisier data compared to professional news.

## 6. Partnerships with News Publishers

- Direct B2B licensing agreements for raw feeds (XML, JSON) with specific publishers.
- Feasible for fintech startups or institutional funds with budget.

## 7. Google News or Bing News Scraping

- **Use case**: Search queries (e.g. “site:bloomberg.com AAPL”) via Google News to aggregate links.
- Combine with article scrapers for full text extraction.
- **Disadvantages**: Potential Terms of Service violations if automated at scale; risk of IP bans.

---

### Practical Usage

For full daily coverage without Bloomberg or FactSet:

- Combine **RSS feeds + direct scraping** of priority sites (Reuters, Bloomberg, CNBC, WSJ, Yahoo Finance).
- Store all URLs and article texts with timestamps in a database (Postgres, MongoDB).
- Run FinBERT daily on new articles.
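
A minimal sketch of the storage step follows. It uses an in-memory SQLite database purely as a lightweight stand-in for Postgres or MongoDB, and the table layout and function names are assumptions made for this example. Using the URL as the primary key means re-fetching the same article on a later run does not create a duplicate row.

```python
import sqlite3
from datetime import datetime, timezone

def init_db(conn):
    """Create the articles table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               url TEXT PRIMARY KEY,
               title TEXT,
               body TEXT,
               fetched_at TEXT
           )"""
    )

def save_article(conn, url, title, body):
    """Insert an article; return True if it was new, False if the URL already existed."""
    fetched_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, body, fetched_at) VALUES (?, ?, ?, ?)",
        (url, title, body, fetched_at),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
init_db(conn)
first = save_article(conn, "https://example.com/tsla", "Tesla earnings", "Full text ...")
second = save_article(conn, "https://example.com/tsla", "Tesla earnings", "Full text ...")
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(first, second, count)
```
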

---




# RSS Feeds Aggregation
We will use RSS feed aggregation, as it is legally safer than most of the other methods and easy to automate.

We import a function that runs through a specified list of RSS feeds and saves the content of each linked news article into a text dump for further analysis.

In [1]:
# uncomment below to scrape the websites for data
# # specifying rss feeds
# rss_feeds = [
#     # "https://www.reddit.com/r/investing/.rss",
#     # "https://www.reddit.com/r/stocks/.rss",
#     "https://finance.yahoo.com/news/rssindex",
# ]

# # saving the texts into a text file
# get_raw_news_rss(rss_feeds, "financial_news.txt")

___
We now have proper long-form text for each source found. However, some articles are interactive and require a click to reveal the full text, so we will probably need a browser automation tool such as Selenium or Playwright.
____

# Playwright
We use Playwright instead of Selenium because, according to what I have read online, it is faster and more stable. Furthermore:
- Headless by default, but can run in headed mode for debugging
- Excellent stealth capabilities
- Supports Chromium, Firefox, and WebKit

We cannot run the Playwright function directly in the notebook: `asyncio.run()` tries to start a brand-new event loop, but Jupyter already has one running, so the call raises a `RuntimeError`. Instead, below is the content of the txt file produced by running the `scrape_with_playwright` function in `utils.py`.
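
The event-loop conflict can be reproduced without Playwright. In the sketch below, `scrape_stub` is a hypothetical stand-in for an async scraper, and `simulate_notebook_cell` runs inside an active event loop in the same way notebook code does. Calling `asyncio.run()` from there fails, while awaiting the coroutine directly works, which is also the usual pattern inside a notebook cell.

```python
import asyncio

async def scrape_stub(url):
    """Stand-in for an async scraper; a real one would drive a browser here."""
    await asyncio.sleep(0)
    return f"content of {url}"

async def simulate_notebook_cell():
    """Runs inside an active event loop, as notebook code does."""
    coro = scrape_stub("https://example.com")
    try:
        asyncio.run(coro)   # fails: a loop is already running in this thread
    except RuntimeError as exc:
        coro.close()        # avoid a "never awaited" warning for the unused coroutine
        error = str(exc)
    else:
        error = None
    # Awaiting directly reuses the running loop and works
    text = await scrape_stub("https://example.com")
    return error, text

error, text = asyncio.run(simulate_notebook_cell())
print(error)
print(text)
```
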

In [2]:
# uncomment below to see the outputs of the text file created
# file_path = "financial_news.txt"

# with open(file_path, 'r', encoding='utf-8') as f:
#     for line in f:
#         print(line, end='')

Using the same logic as the feedparser and Playwright functions, we can combine them into a new function that fetches and saves articles.

For every URL obtained from the RSS feeds with feedparser, a txt file named after the article's date and title is saved with the article's contents. The following tree is obtained:
```bash
data/
└── articles/
    ├── 2025-06-28_amd_vs_arista.txt
    ├── 2025-06-29_sp500_rebounds.txt
    ├── 2025-06-30_oil_prices_slip.txt
    └── 2025-07-01_tesla_earnings.txt
```
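
File names in the pattern above can be derived from a publication date and a title. The following is a minimal sketch; `article_filename` and its `max_words` cap are assumptions for illustration, not the project's actual naming code.

```python
import re
from datetime import date

def article_filename(published, title, max_words=6):
    """Build '<YYYY-MM-DD>_<slug>.txt' from a date and a title."""
    # Keep lowercase alphanumeric runs only, so punctuation and spaces disappear
    slug_words = re.findall(r"[a-z0-9]+", title.lower())[:max_words]
    slug = "_".join(slug_words) or "untitled"
    return f"{published.isoformat()}_{slug}.txt"

name = article_filename(date(2025, 6, 28), "AMD vs. Arista")
print(name)  # 2025-06-28_amd_vs_arista.txt
```

Capping the number of words keeps long headlines from producing unwieldy file names.
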

We are now free to apply NLP techniques to the texts obtained. In particular, we explore topic modeling to find the top topics in our daily corpus.
___

# Topic Modeling

[Topic modeling](https://www.ibm.com/think/topics/topic-modeling) is an **unsupervised natural language processing (NLP) technique** that uncovers hidden themes or "topics" within a collection of documents without requiring labeled data. It identifies clusters of words that frequently co-occur, revealing underlying themes in your text corpus.

---

## How it Works

- Each **topic** is represented as a distribution over words (e.g., "dog", "bone", "bark" for a dog topic).
- Each **document** is represented as a mixture of topics, each with a specific weight.
- **Latent Dirichlet Allocation (LDA)** is the most widely used topic modeling algorithm. It treats topics and their word distributions as hidden variables inferred from the data.
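
The "mixture of topics" idea can be made concrete with a toy calculation. In this sketch the topic names, vocabulary, and all probabilities are invented; it only shows how a document's word probabilities follow from its topic weights and each topic's word distribution.

```python
# Toy topic-word distributions: each topic's probabilities sum to 1 over a tiny vocabulary
topic_word = {
    "earnings": {"revenue": 0.5, "profit": 0.4, "rate": 0.1},
    "monetary_policy": {"revenue": 0.1, "profit": 0.1, "rate": 0.8},
}

# One document represented as a mixture of the two topics
doc_topic = {"earnings": 0.7, "monetary_policy": 0.3}

def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())

doc_word = {word: word_probability(word, doc_topic, topic_word)
            for word in ["revenue", "profit", "rate"]}
print({word: round(p, 2) for word, p in doc_word.items()})
```

A topic model such as LDA works in the opposite direction: it observes only the words in each document and infers both sets of distributions.
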

---

## Benefits

- Organizes **large unstructured text datasets** by surfacing dominant themes.
- Useful for tasks like **summarization**, **clustering**, **trend detection**, and **information retrieval**.

---

## Popular Algorithms and Tools

- **LDA (Latent Dirichlet Allocation)**
- **NMF (Non-Negative Matrix Factorization)**
- **HDP (Hierarchical Dirichlet Process)**: automatically infers the number of topics
- **PAM (Pachinko Allocation Model)**: models topic correlations
- **Dynamic Topic Models**: captures how topics evolve over time
- **BERTopic**: uses transformer embeddings and class-based TF-IDF for modern, coherent topic extraction

Popular libraries include **Gensim** for efficient LDA, NMF, and LSI, and **BERTopic** for transformer-based approaches.

---

## Practical Workflow for Topic Modeling on Financial News

1. Scrape and store daily articles (we’ve handled this).
2. Preprocess (tokenize, remove stop words, lemmatize).
3. Build a document-term matrix.
4. Train a topic model (e.g., LDA, BERTopic) to extract meaningful topics.
5. Review top keywords per topic and manually label them (e.g., "AI investing", "market analysis").
6. Track daily topic frequencies or their evolution over time using dynamic topic modeling.
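
Steps 2 and 3 of the workflow can be sketched with the standard library. The stop-word list here is deliberately tiny and lemmatization is omitted, so this is a simplified illustration rather than the preprocessing used by the `TopicModeling` class; the example headlines are invented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real pipelines use a fuller list plus lemmatization
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "as", "after", "for"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [preprocess(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = [
    "Tesla shares rise after earnings beat",
    "Oil prices slip as supply rises",
    "Tesla earnings lift the market",
]
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row of the matrix is one document and each column one vocabulary term, which is the input format that count-based topic models such as LDA expect.
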

---


In [None]:
# instantiate the topic modeling class
tm = TopicModeling(data_path="data/articles", n_topics=5, n_top_words=10, model_type="lda")

# Preprocess, build DTM, train, and inspect
tm.build_dtm()
tm.train_lda()
topics = tm.get_top_keywords()
for idx, words in topics.items():
    print(f"Topic {idx}: {', '.join(words)}")