______
# Natural Language Processing in Finance
______
When building an NLP pipeline for finance, high-quality textual data such as financial news, earnings call transcripts, and analyst commentary is required to feed into the models. Accessible sources are listed below:

## Free Earnings Call Transcripts
- MarketBeat: free transcripts for S&P 500 earnings calls, available publicly via their website
- Seeking Alpha: comprehensive transcripts including slides and audio (may require sign up)
- Quartr: mobile app offering live and recorded earnings call transcripts with global coverage

## Financial News and Commentary APIs
- MarketAux: free tier API for real-time financial news across 5,000 sources
- Finnhub: free real-time news, along with market data such as fundamentals and forex
- FinancialNewsAPI: aggregates millions of financial news articles with free access
- Financial Modeling Prep: free plan for stock news API
- StockNewsAPI.com: free endpoint for aggregated stock market news

## Datasets for Research and Model Training
- NIFTY Financial News Headlines Dataset: contains 15.7M time-aligned headlines and stock prices for 4,775 S&P 500 firms from 1999-2023, available on Hugging Face
- FNSPID: dataset with 29.7M stock prices and 15.7M financial news records, ideal for sentiment and forecasting analysis
- ECTSum: earnings call transcripts with expert bullet-point summaries, intended for summarisation tasks
______
For sentiment analysis or finance-specific information extraction, we can use readily available NLP tools: pre-trained domain models or enterprise-grade NLP pipelines.

## NLP tools for finance
- FinBERT: specialised BERT model fine-tuned for financial sentiment analysis, built on top of Hugging Face's Transformers library. It outputs positive, negative, and neutral labels with corresponding probabilities.
- Spark NLP & Spark OCR: enterprise-grade NLP library optimized for scalability on Spark. Includes pipelines for tokenization, NER, sentiment, document classification, plus OCR integration
- deepset Haystack: open-source framework for building search, QA, summarization, and RAG pipelines. Integrates with large language models via Hugging Face, OpenAI, Cohere
- Contextual AI: enterprise-focused platform for Retrieval-Augmented Generation (RAG), enabling specialised agents for banking and finance
- Cohere: large language model APIs for classification, summarization, and financial applications, now with a secure “North for Banking” platform
______
For experimentation, we first use **MarketAux** to fetch fresh financial news via its API, and **FinBERT** for sentiment classification of that news.
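
To make the FinBERT step concrete, here is a minimal sketch of sentiment classification with the Hugging Face `pipeline` API. It assumes the public `ProsusAI/finbert` checkpoint (the notebook does not say which FinBERT weights are used) and that `transformers` and `torch` are installed. The names `FINBERT_MODEL`, `classify_sentiment` and `top_label` are illustrative, not part of `utils.py`.

```python
from typing import Any, Dict, List

# Assumption: the public "ProsusAI/finbert" checkpoint on the Hugging Face Hub.
FINBERT_MODEL = "ProsusAI/finbert"


def top_label(scores: List[Dict[str, Any]]) -> str:
    """Return the label with the highest probability from one prediction."""
    return max(scores, key=lambda s: s["score"])["label"]


def classify_sentiment(texts: List[str]) -> List[List[Dict[str, Any]]]:
    """Score each text with FinBERT (the model is downloaded on first use)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    # top_k=None asks the pipeline to return a score for every label.
    classifier = pipeline("text-classification", model=FINBERT_MODEL, top_k=None)
    # truncation=True keeps long inputs within BERT's 512-token limit.
    return classifier(texts, truncation=True)
```

Calling `classify_sentiment(["Revenue beat expectations."])` should return, for each input text, a list of label/score dictionaries; `top_label` then picks the most probable label from one such list. Note that truncation means only the start of a long article is scored, so full articles may need to be split into chunks first.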
______

## Importing Necessary Libraries

In [40]:
# import necessary libraries
import os
from dotenv import load_dotenv

# import custom helper functions
from utils import get_marketaux_news, get_raw_news_rss

# load in environment variables
load_dotenv()

# autoreload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


_____
# MarketAux
A free market and financial news API. It gives access to global stock market and finance news, including funds and crypto.

Link to website and documentation: https://www.marketaux.com/
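
The `get_marketaux_news` helper used below lives in `utils.py` and is not shown in this notebook. As a rough idea of what such a helper might look like, here is a hedged sketch. The endpoint URL and the parameter names (`symbols`, `filter_entities`, `language`, `limit`, `api_token`) are assumptions based on the public MarketAux documentation and should be checked there; the function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Assumption: MarketAux's "all news" endpoint; verify against the official docs.
MARKETAUX_URL = "https://api.marketaux.com/v1/news/all"


def build_marketaux_params(symbols: List[str], api_token: str, limit: int = 3) -> Dict[str, Any]:
    """Build the query parameters for a news request (parameter names are assumed)."""
    return {
        "symbols": ",".join(symbols),   # e.g. "TSLA,AAPL"
        "filter_entities": "true",      # only return entities matching the symbols
        "language": "en",
        "limit": limit,
        "api_token": api_token,
    }


def get_marketaux_news_sketch(symbols: List[str], api_token: str) -> Tuple[int, Dict[str, Any]]:
    """Fetch news and return (HTTP status code, parsed JSON body)."""
    import requests  # imported lazily; requires the `requests` package

    response = requests.get(
        MARKETAUX_URL,
        params=build_marketaux_params(symbols, api_token),
        timeout=10,
    )
    return response.status_code, response.json()
```

Returning the status code alongside the body mirrors the `status, news = ...` call pattern used in this notebook.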

In [32]:
# load in API key
api_key = os.environ.get("MARKET_AUX_API_TOKEN")

In [33]:
# get news
status, news = get_marketaux_news(["TSLA"], api_key)

In [34]:
news

{'meta': {'found': 74098, 'returned': 3, 'limit': 3, 'page': 1},
 'data': [{'uuid': 'd9ee1c1d-6561-4fe8-8bd3-2d6f69de61b1',
   'title': 'Why Are Meta Platforms, Microsoft, and Nvidia Outperforming the "Magnificent Seven" and the S&P 500?',
   'description': 'Meta Platforms, Microsoft, and Nvidia are turning AI investments into concrete results.  Artificial intelligence has helped these companies grow their profit margins.  Meta Platforms (NASDAQ: META), Microsoft (NASDAQ: MSFT), and Nvidia (NASDAQ: NVDA) are knocking on the door of all-time highs, whereas the other four "Magnificent Seven" stocks -- Amazon, Alphabet, Tesla, and Apple -- are down year to date.',
   'keywords': 'Nvidia, Microsoft, Meta Platforms, Microsoft 365, profit margins, Magnificent Seven',
   'snippet': 'The market hates uncertainty, but it also loves companies with a clear vision for growing revenue and converting capital expenditures into free cash flow. Meta,...',
   'url': 'https://finance.yahoo.com/news/why-m

_____
As the output above shows, responses from most of these free news APIs contain only a title, a description, and a short snippet for each article, not the full article text.

This does not suit our project, which aims to analyse full news articles with FinBERT. Hence, we will use other methods to gather news.
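
A quick way to confirm how little text each article carries is to measure the text fields in the response. The sketch below relies only on the response shape visible in the output above (`data`, `title`, `description`, `snippet`, `url`); the function name `summarise_articles` is hypothetical.

```python
from typing import Any, Dict, List


def summarise_articles(news: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Report the title, URL and length of the text fields for each article."""
    rows = []
    for article in news.get("data", []):
        rows.append(
            {
                "title": article.get("title", ""),
                "url": article.get("url", ""),
                "description_chars": len(article.get("description") or ""),
                "snippet_chars": len(article.get("snippet") or ""),
            }
        )
    return rows
```

Running `summarise_articles(news)` on the response above would show text fields of a few hundred characters at most, which is far short of a full article.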
_____

# Methods to Gather Comprehensive Unstructured Financial News

Public news APIs (e.g. MarketAux, NewsAPI) have limited coverage due to licensing, source restrictions, and API call caps. For full daily financial news, we can consider the following:

## 1. Paid News Aggregator APIs with Full Feeds

- **Examples**: Bloomberg Terminal API, Refinitiv (Reuters) Eikon API, FactSet, S&P Capital IQ
- **Advantages**: Near-complete coverage, structured metadata, company tagging
- **Disadvantages**: Very expensive (institution-level subscriptions, thousands per month)

## 2. Direct Web Scraping

- **Use case**: Scrape financial news websites (CNBC, Bloomberg, Reuters, WSJ) for headlines and full articles.
- **Tools**: BeautifulSoup, Scrapy, Selenium (for dynamic pages).
- **Considerations**:
  - Check legal and ethical compliance (Terms of Service).
  - Implement polite scraping (rate limits, rotating proxies).
  - Use `newspaper3k` for article text extraction from URLs.
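
The compliance and politeness points above can be sketched with the standard library alone. The example below checks a site's `robots.txt` rules before fetching and enforces a minimum delay between requests. It is an illustration only: the helper names are hypothetical, and respecting `robots.txt` does not replace reading a site's Terms of Service.

```python
import time
from typing import Optional
from urllib.robotparser import RobotFileParser


def make_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Block until at least `min_interval_s` seconds have passed since the last call."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self._last: Optional[float] = None

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval_s - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

In a scraping loop, one would call `parser.can_fetch(user_agent, url)` and skip disallowed URLs, then call `limiter.wait()` before each request.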

## 3. RSS Feeds Aggregation

- **Approach**:
  - Aggregate multiple RSS feeds from financial news sites.
  - Store and deduplicate by GUID or URL.
  - Fetch and parse article text daily.
- **Tools**: feedparser (Python), AWS Lambda or scheduled scripts.
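
The aggregate-and-deduplicate approach above can be sketched as follows. This is not the `get_raw_news_rss` function from `utils.py`, which is not shown here; the function names are hypothetical. It assumes the `feedparser` package, whose entries expose the feed GUID under the `id` key and the article URL under `link`.

```python
from typing import Any, Iterable, List, Mapping, Optional, Set


def entry_key(entry: Mapping[str, Any]) -> Optional[str]:
    """Prefer the feed's GUID (`id`); fall back to the article URL (`link`)."""
    return entry.get("id") or entry.get("link")


def dedupe_entries(entries: Iterable[Mapping[str, Any]], seen: Set[str]) -> List[Mapping[str, Any]]:
    """Return entries whose key is not in `seen`, and record their keys in `seen`."""
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key is None or key in seen:
            continue
        seen.add(key)
        fresh.append(entry)
    return fresh


def collect_new_entries(feed_urls: Iterable[str], seen: Set[str]) -> List[Mapping[str, Any]]:
    """Parse each feed and keep only entries that have not been seen before."""
    import feedparser  # imported lazily; requires the `feedparser` package

    fresh: List[Mapping[str, Any]] = []
    for feed_url in feed_urls:
        parsed = feedparser.parse(feed_url)
        fresh.extend(dedupe_entries(parsed.entries, seen))
    return fresh
```

Persisting the `seen` set between runs (for example in a database table) is what lets a daily scheduled job fetch only new articles.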

## 4. Licensed Data Vendors

- **Use case**: Institutional or professional projects.
- **Providers**: LexisNexis, Dow Jones Factiva.
- **Advantages**: Full news archives, searchable APIs, rich metadata (tickers, topics, timestamps).
- **Disadvantages**: Very high licensing costs.

## 5. Financial Social Media and Alternative Data

- **Include**:
  - Stocktwits API
  - Twitter API (financial influencers and company news)
  - Reddit scraping (e.g. r/investing, r/stocks)
- **Note**: Sentiment-rich but noisier data compared to professional news.

## 6. Partnerships with News Publishers

- Direct B2B licensing agreements for raw feeds (XML, JSON) with specific publishers.
- Feasible for fintech startups or institutional funds with budget.

## 7. Google News or Bing News Scraping

- **Use case**: Search queries (e.g. “site:bloomberg.com AAPL”) via Google News to aggregate links.
- Combine with article scrapers for full text extraction.
- **Disadvantages**: Potential Terms of Service violations if automated at scale; risk of IP bans.

---

### Practical Usage

For full daily coverage without Bloomberg or FactSet:

- Combine **RSS feeds + direct scraping** of priority sites (Reuters, Bloomberg, CNBC, WSJ, Yahoo Finance).
- Store all URLs and article texts with timestamps in a database (Postgres, MongoDB).
- Run FinBERT daily on new articles.
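
As a sketch of the storage step, the example below keeps URLs, article text and fetch timestamps in a table keyed by URL, so repeated runs do not store duplicates. It uses the standard-library `sqlite3` module as a stand-in for Postgres or MongoDB; the table layout and function names are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Tuple


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the articles table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url        TEXT PRIMARY KEY,
            title      TEXT,
            body       TEXT NOT NULL,
            fetched_at TEXT NOT NULL,
            sentiment  TEXT
        )
        """
    )
    return conn


def save_article(conn: sqlite3.Connection, url: str, title: str, body: str) -> bool:
    """Insert an article; return False if the URL was already stored."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, body, fetched_at) VALUES (?, ?, ?, ?)",
        (url, title, body, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cursor.rowcount == 1


def unscored_articles(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return (url, body) pairs that have not been given a sentiment label yet."""
    return conn.execute("SELECT url, body FROM articles WHERE sentiment IS NULL").fetchall()
```

A daily job could call `unscored_articles`, run FinBERT on each body, and write the label back to the `sentiment` column.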

---




# RSS Feeds Aggregation
We will use RSS feed aggregation, as it is legally safer than most of the other methods and easy to automate.

We import a function that runs through a specified list of RSS feeds and saves the content of each linked article into a text dump for further analysis.

In [43]:
# specifying rss feeds
rss_feeds = [
    # "https://www.reddit.com/r/investing/.rss",
    # "https://www.reddit.com/r/stocks/.rss",
    "https://finance.yahoo.com/news/rssindex",
]

# saving the texts into a text file
get_raw_news_rss(rss_feeds, "financial_news.txt")

Processing: https://finance.yahoo.com/news/3-no-brainer-warren-buffett-141500024.html
Saved: 3 No-Brainer Warren Buffett Stocks to Buy Right Now
----------------------------------------------------------------------------------------------------
Processing: https://finance.yahoo.com/news/budget-weekly-pay-freight-inconsistent-110000088.html
Saved: How to Budget Weekly Pay When Freight is Inconsistent
----------------------------------------------------------------------------------------------------
Processing: https://finance.yahoo.com/news/even-markets-rally-trumps-policymaking-101453126.html
Saved: Even as markets rally, Trump's policymaking causes market angst
----------------------------------------------------------------------------------------------------
Processing: https://finance.yahoo.com/news/analysts-amd-stock-close-gap-120002714.html
Saved: Analysts: AMD Stock Will ‘Close the Gap’ With Nvidia by 2026. Should You Buy AMD Stock Here?
---------------------------------------

___
We finally have proper long-form text for each source found. However, some articles are interactive and require a click to reveal the full text, so we will probably need a browser automation tool such as Selenium or Playwright.
____

# Playwright
We use Playwright instead of Selenium because, from what I have read online, it is faster and more stable. Furthermore:
- Headless by default, but can run in headed mode for debugging
- Excellent stealth capabilities
- Supports Chromium, Firefox, and WebKit

We are unable to run the Playwright function directly in the notebook: `asyncio.run()` tries to start a brand-new event loop, which fails because Jupyter already has one running. However, below is the content of the text file produced by running the `scrape_with_playwright` function in `utils.py`.
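
One possible workaround for the event loop issue, sketched under assumptions, is to run the coroutine in a separate thread, where `asyncio.run()` is free to create its own loop. The code below is not the `scrape_with_playwright` function from `utils.py`, which is not shown here; `run_in_new_loop` and `fetch_page_text` are hypothetical names, and the sketch assumes Playwright and its Chromium browser are installed.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict


def run_in_new_loop(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion in a fresh event loop on a worker thread."""
    result: Dict[str, Any] = {}

    def runner() -> None:
        try:
            result["value"] = asyncio.run(coro)
        except Exception as exc:  # re-raised in the calling thread below
            result["error"] = exc

    thread = threading.Thread(target=runner)
    thread.start()
    thread.join()
    if "error" in result:
        raise result["error"]
    return result["value"]


async def fetch_page_text(url: str) -> str:
    """Open the page in headless Chromium and return the visible body text."""
    # Imported lazily; requires `pip install playwright` and `playwright install chromium`.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            return await page.inner_text("body")
        finally:
            await browser.close()
```

With these helpers, `run_in_new_loop(fetch_page_text(url))` can be called from a notebook cell or a plain script. Alternatively, since IPython supports top-level `await`, a notebook cell can simply run `await fetch_page_text(url)`.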

In [6]:
file_path = "financial_news.txt"

with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        print(line, end='')

AMD sees the market for its GPUs climbing more than 60% per year through 2028.

It's in position to take a growing share of that market as well as the server CPU market.

On the other hand, Arista should see a big sales boost from bigger AI data centers.

10 stocks we like better than Advanced Micro Devices ›

WhileNvidiagets all the attention among companies providing chips and equipment to AI data centers, there are dozens of others benefiting from the soaring spending from the industry's hyperscalers.Advanced Micro Devices(NASDAQ: AMD)andArista Networks(NYSE: ANET)have both seen their revenue climb thanks to ongoing AI spending. But if you can buy only one of them, AMD is the stock to own right now.

Both companies are executing well with strong demand for their products and a huge runway as AI spending takes off. But AMD's stock looks more attractive for multiple reasons. Here's what investors need to know.

AMD's management sees the AI accelerator market, which includes GPUs and c