## ⚠️ Data Sourcing Halted

> **Note:**  
> Sourcing of news data for sentiment analysis was halted because the data source (Finnhub) requires a paid subscription for historical news.

In [69]:
import os
import finnhub
import pandas as pd
from dotenv import load_dotenv
from datetime import datetime

## Top 10 Tickers in the Dataset

Below is a table showing the top 10 most frequently mentioned ticker symbols in the news dataset. These tickers represent some of the most actively discussed companies in the US capital markets between 2020 and 2024.

| Rank | Ticker Symbol |
|------|:-------------|
| 1    | AAPL         |
| 2    | MSFT         |
| 3    | GOOGL        |
| 4    | AMZN         |
| 5    | META         |
| 6    | TSLA         |
| 7    | NVDA         |
| 8    | JPM          |
| 9    | V            |
| 10   | UNH          |
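
A ranking like the one above can be produced by counting how often each symbol appears. The snippet below is a minimal sketch on a made-up list of symbols (the sample values and the helper name `top_tickers` are illustrative assumptions, not part of the dataset or the notebook).

```python
from collections import Counter

# Hypothetical sample of `symbol` values, for illustration only.
symbols = ["AAPL", "MSFT", "AAPL", "TSLA", "MSFT", "AAPL"]

def top_tickers(symbols, n=10):
    """Return the n most frequent tickers as (ticker, count) pairs."""
    return Counter(symbols).most_common(n)

top = top_tickers(symbols, n=2)
print(top)  # [('AAPL', 3), ('MSFT', 2)]
```

On the DataFrame loaded in this notebook, `news['symbol'].value_counts().head(10)` should give an equivalent ranking.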


## Step 1: Data Sourcing
- **Goal:** Obtain the raw financial news data.
- **Action:** Acquire news articles, headlines, and summaries along with their timestamps and associated ticker symbols. This can be done by:
    - Using financial data APIs (like Finnhub, accessing endpoints for news).
    - Downloading pre-compiled datasets (like the one you found on Kaggle, which already provides headlines, summaries, timestamps, and symbols).
    - (Less common for readily usable sentiment) Web scraping financial news websites.
- **Input:** Raw news data (text, timestamps, symbols).
- **Output:** A collection of raw news records, each containing (at a minimum): `datetime`, `headline`, `summary`, and `symbol`.
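
Whichever source is used, it helps to confirm that every record carries the minimum fields listed above before moving on. The following is a rough sketch with hypothetical records; `REQUIRED_FIELDS` and `missing_fields` are illustrative names, not part of any library.

```python
# Minimum fields each raw news record should contain (from the Output above).
REQUIRED_FIELDS = ("datetime", "headline", "summary", "symbol")

def missing_fields(record):
    """Return the required fields that are absent from one record (a dict)."""
    return [field for field in REQUIRED_FIELDS if field not in record]

# Hypothetical records, for illustration only.
records = [
    {"datetime": "2023-04-06 01:30:00", "headline": "h", "summary": "s", "symbol": "USO"},
    {"datetime": "2023-04-06 01:30:00", "headline": "h", "related": "AAPL"},
]

# Map record index -> list of missing fields, keeping only incomplete records.
problems = {}
for i, record in enumerate(records):
    missing = missing_fields(record)
    if missing:
        problems[i] = missing

print(problems)  # {1: ['summary', 'symbol']}
```

The second record mimics a source that names the ticker field `related` instead of `symbol`; such a field would need renaming before the two sources can be combined.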


In [68]:
load_dotenv()
FINNHUB_API_KEY = os.getenv('FINNHUB_API_KEY')
finnhun_client = finnhub.Client(api_key=FINNHUB_API_KEY)

In [167]:
# Source: https://www.kaggle.com/datasets/addarm/us-capital-markets-news-headlines-2020-to-2024?select=news_data.csv

os.chdir('f:\\Upwork Projects\\Trading Bot\\StockBot')
os.getcwd()

kaggle_columns = ['datetime', 'headline', 'summary', 'symbol']
news = pd.read_csv('data/raw/news_data.csv', usecols=kaggle_columns)
# news.to_csv('data/raw/fine_news_data.csv',index=False)
news['datetime'] = pd.to_datetime(news['datetime'])
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2529812 entries, 0 to 2529811
Data columns (total 4 columns):
 #   Column    Dtype         
---  ------    -----         
 0   datetime  datetime64[ns]
 1   headline  object        
 2   summary   object        
 3   symbol    object        
dtypes: datetime64[ns](1), object(3)
memory usage: 77.2+ MB


In [34]:
print('Number of Unique Tickers: ', news['symbol'].nunique())

Number of Unique Tickers:  14450


In [88]:
news.sort_values(by=['datetime'])

Unnamed: 0,datetime,headline,summary,symbol
2529811,2023-04-06 01:30:00,Glazers raise Manchester United price to £6bn,The Glazer family are planning to delve deeper...,MARKET
2529799,2023-04-06 01:30:00,OPEC’s oil price manipulation?,An explainer on why OPEC is fiddling with oil ...,USO
2529800,2023-04-06 01:30:00,OPEC’s oil price manipulation?,An explainer on why OPEC is fiddling with oil ...,OIL
2529801,2023-04-06 01:30:00,Swiss order rescued bank Credit Suisse to canc...,Switzerland has instructed Credit Suisse to ca...,CS
2529802,2023-04-06 01:30:00,Glazers raise Manchester United price to £6bn,The Glazer family are planning to delve deeper...,MANU
...,...,...,...,...
12533,2024-04-05 22:42:31,"Upstart NC State ''here to win,'' says coach K...",,ACC
12532,2024-04-05 22:58:00,OTL Electrokart Begins Sales for World''s Firs...,,MARKET
12531,2024-04-05 22:58:00,OTL Electrokart Begins Sales for World''s Firs...,,PSV
12530,2024-04-05 22:58:18,\n\t\t\t\t\tNCAA Tournament Top Plays: Women''...,,MARKET


In [165]:
print(f"Got Data From: {news['datetime'].min()} To: {news['datetime'].max()}")


Got Data From: 2023-04-06 01:30:00 To: 2024-04-05 22:58:18


In [None]:
# Run date: 2025-05-15
finn_columns = ['datetime', 'headline', 'summary', 'related']
today = datetime.now().strftime('%Y-%m-%d')
# company_news expects a single ticker string and YYYY-MM-DD date bounds
MSFT_news = finnhun_client.company_news('MSFT', _from='2024-04-06', to=today)
AAPL_news = finnhun_client.company_news('AAPL', _from='2024-04-06', to=today)
TSLA_news = finnhun_client.company_news('TSLA', _from='2024-04-06', to=today)


In [164]:
temp = pd.DataFrame(AAPL_news)
temp['datetime'] = pd.to_datetime(temp['datetime'].astype(int), unit='s')
print(f"Got Data From: {temp['datetime'].min()} To: {temp['datetime'].max()}")

Got Data From: 2025-05-07 21:25:00 To: 2025-05-15 02:08:17


The Kaggle dataset covers `2023-04-06 01:30:00` to `2024-04-05 22:58:18`. We then tried to extend it with Finnhub data, but without a paid subscription the API returned only about one week of news (`2025-05-07 21:25:00` to `2025-05-15 02:08:17`), even though we requested everything from `2024-04-06` onward. Historical data requires a `paid subscription`.
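
To put numbers on the coverage problem, the short sketch below computes the uncovered gap between the end of the Kaggle data and the start of the Finnhub data, using only the timestamps printed above. The variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the outputs above.
kaggle_end = datetime.strptime("2024-04-05 22:58:18", FMT)
finnhub_start = datetime.strptime("2025-05-07 21:25:00", FMT)
finnhub_end = datetime.strptime("2025-05-15 02:08:17", FMT)

gap = finnhub_start - kaggle_end      # period with no news coverage at all
window = finnhub_end - finnhub_start  # period actually returned by Finnhub

print(f"Uncovered gap: {gap.days} days")     # Uncovered gap: 396 days
print(f"Finnhub window: {window.days} days") # Finnhub window: 7 days
```

A gap of more than a year between the two sources is why the two datasets cannot simply be concatenated into one continuous series.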

## Step 2: Data Cleaning and Preprocessing
- **Goal:** Prepare the raw text data (`headline` and `summary`) for sentiment analysis.
- **Action:** Apply text preprocessing techniques. Since you're planning to use a pre-trained BERT model, your preprocessing should be tailored to the requirements of that model. This typically involves:
    - Handling missing or malformed text entries.
    - Removing unnecessary elements that the BERT model doesn't need (e.g., specific HTML tags if present, though less likely with API/dataset sources).
    - Ensuring the text is in a consistent format.
    - Basic cleaning such as removing extra whitespace (less common for BERT, which handles context well, but sometimes considered depending on the specific BERT variant and task).
- **Input:** Raw news data from Step 1.
- **Output:** Cleaned and formatted text data within your news records, suitable for input into your chosen BERT model.
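
The cleaning actions above could look roughly like the sketch below. It uses only the standard library; `clean_text` and `build_model_input` are hypothetical helpers, and treating a doubled `''` as an escaped apostrophe is an assumption based on the raw headlines shown in Step 1.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")  # any leftover HTML tag
_WS_RE = re.compile(r"\s+")       # runs of spaces, tabs, newlines

def clean_text(text):
    """Normalize one headline or summary; missing values become ''."""
    if not isinstance(text, str):   # None, or NaN (a float) coming from pandas
        return ""
    text = html.unescape(text)      # e.g. "&amp;" -> "&"
    text = _TAG_RE.sub(" ", text)   # drop HTML tags if present
    text = text.replace("''", "'")  # assumption: '' is an escaped apostrophe
    return _WS_RE.sub(" ", text).strip()

def build_model_input(headline, summary):
    """Join the cleaned headline and summary, skipping empty parts."""
    parts = (clean_text(headline), clean_text(summary))
    return " ".join(part for part in parts if part)

print(build_model_input("\n\t\tNCAA Tournament Top Plays: Women''s", None))
# NCAA Tournament Top Plays: Women's
```

With the notebook's DataFrame, the helper can be applied row by row, for example `news['headline'].map(clean_text)`. Lowercasing, stemming, and stop-word removal are deliberately left out, since BERT tokenizers work on natural text.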

## Step 3: Sentiment Analysis (Applying the Pre-trained BERT Model)

- **Goal:** Determine the sentiment expressed in the cleaned news text for each individual news item.
- **Action:** Load your pre-trained BERT model designed for financial sentiment (e.g., FinBERT). Pass the cleaned `headline` and `summary` text for each news record through the model. The model will analyze the text and output sentiment scores or probabilities (typically for positive, negative, and neutral).
- **Input:** Cleaned news data from Step 2, along with the loaded pre-trained BERT model.
- **Output:** Your dataset now includes the sentiment scores or probabilities for each individual news item, associated with its `datetime` and `symbol`.
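
One possible way to wire this step up is sketched below. It assumes the Hugging Face `transformers` library and the `ProsusAI/finbert` checkpoint; neither is named in this notebook, so treat both as assumptions. `load_finbert`, `score_texts`, and `fake_classifier` are hypothetical helpers. The scoring function accepts any callable, so the example runs with a stand-in classifier and does not download a model.

```python
def load_finbert(model_name="ProsusAI/finbert"):
    """Build a text-classification pipeline (assumed model checkpoint)."""
    # Imported lazily so the rest of this sketch runs without transformers.
    from transformers import pipeline
    # top_k=None asks the pipeline to return scores for every label.
    return pipeline("text-classification", model=model_name, top_k=None)

def score_texts(texts, classifier):
    """Return one {label: probability} dict per input text."""
    # truncation=True keeps long summaries within the model's input limit.
    outputs = classifier(list(texts), truncation=True)
    return [{item["label"]: item["score"] for item in per_text}
            for per_text in outputs]

def fake_classifier(texts, **kwargs):
    """Stand-in that mimics the pipeline's output shape with fixed scores."""
    return [[{"label": "positive", "score": 0.75},
             {"label": "negative", "score": 0.125},
             {"label": "neutral", "score": 0.125}] for _ in texts]

scores = score_texts(["Shares rally after earnings beat"], fake_classifier)
print(scores[0])
# {'positive': 0.75, 'negative': 0.125, 'neutral': 0.125}
```

To score real data, replace `fake_classifier` with `load_finbert()` and pass the cleaned text in batches; the resulting dictionaries can then be stored alongside each record's `datetime` and `symbol`.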