# Financial News Sentiment Analysis

This notebook performs sentiment analysis on financial news articles stored in MongoDB using the FinBERT model. The analysis pipeline includes:

1. **Database Connection**: Connect to MongoDB and retrieve news articles
2. **Model Setup**: Initialize the FinBERT model for financial sentiment analysis  
3. **Data Processing**: Combine article titles and content for analysis
4. **Sentiment Analysis**: Apply FinBERT to classify news sentiment (positive, negative, neutral)
5. **Results Analysis**: Examine sentiment distribution and handle any errors

**Model**: [FinBERT](https://huggingface.co/yiyanghkust/finbert-tone) - A BERT model fine-tuned specifically for financial sentiment analysis.

## 1. Database Connection Setup
Establish a connection to MongoDB using the connection string stored in `.env`.

In [44]:
# Import required libraries for MongoDB connection
from pymongo import MongoClient
from pymongo.server_api import ServerApi
import os
from dotenv import dotenv_values

# Load environment variables from .env file
config = dotenv_values(".env")
MONGODB_CONNECTION_STRING = config.get("CONNECTION_STRING")

# Initialize MongoDB client with server API version 1
client = MongoClient(MONGODB_CONNECTION_STRING, server_api=ServerApi('1'))

# Test the connection
try:
    client.admin.command('ping')
    print("Successfully connected to db")
except Exception as e:
    print(f"Connection failed: {e}")

Successfully connected to db
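
If `CONNECTION_STRING` is missing from `.env`, `config.get` returns `None` and the problem only shows up later, when the ping fails. A minimal sketch of a fail-fast check follows. `require_setting` is a hypothetical helper (not part of `python-dotenv` or `pymongo`), and a plain dict stands in for the result of `dotenv_values(".env")`:

```python
def require_setting(config, key):
    """Return config[key], raising a clear error if it is missing or blank."""
    value = config.get(key)
    if value is None or not str(value).strip():
        raise KeyError(f"Missing required setting {key!r}; check the .env file")
    return value


# A plain dict stands in for dotenv_values(".env") in this sketch.
example_config = {"CONNECTION_STRING": "mongodb://localhost:27017"}
connection_string = require_setting(example_config, "CONNECTION_STRING")
```

In the notebook, `require_setting(config, "CONNECTION_STRING")` could replace the `config.get(...)` call so that a misconfigured environment is reported before any client is created.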


## 2. Import Required Libraries

Import the necessary libraries for sentiment analysis and data processing.

In [45]:
# Additional MongoDB utilities
import pymongo
from bson.objectid import ObjectId

# Transformers library for FinBERT model
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

## 3. Initialize FinBERT Model

Load the pre-trained FinBERT model specifically designed for financial sentiment analysis.

In [46]:
# Load FinBERT tokenizer and model for financial sentiment analysis
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)

# Create sentiment analysis pipeline
finbert = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print("FinBERT model loaded successfully!")



FinBERT model loaded successfully!


## 4. Access News Articles Database

Connect to the specific database and collection containing our financial news articles.

In [47]:
# Access the finance news database
db = client["finance_news_db"]

# Access the news articles collection
news_data_collection = db["news_articles"]

## 5. Explore the Dataset

Let's examine the structure and content of our news articles collection.

In [48]:
import pprint as pp

# Check the total number of articles in our collection
total_articles = news_data_collection.count_documents({})
print(f" Total news articles: {total_articles}")

# Display a sample document to understand the data structure
print("\n Sample article structure:")
sample_article = news_data_collection.find_one()
pp.pprint(sample_article)

 Total news articles: 100

 Sample article structure:
{'_id': ObjectId('692ae48a4a7fefe22f224b8a'),
 'authors': 'Kerry Hannon    · Senior Columnist',
 'content': 'Seniors have embraced Medicare Advantage plans for their free or '
            'deeply discounted perks such as eyeglasses, dental coverage, gym '
            'memberships, and reimbursements for — not joking — golf clubs and '
            'pickleball paddles.\n'
            'While those plan perks are a pleasure to imagine using, many '
            'people never touch them.\n'
            "“Medicare Advantage enrollees often don't know which supplemental "
            'benefits are offered by their plans or how to use them,” Gretchen '
            'Jacobson, vice president for Medicare at the Commonwealth Fund, a '
            'nonprofit research foundation, told Yahoo Finance.\n'
            '“Most Medicare Advantage enrollees say that they would like to '
            'receive notifications about unused benefits,” she added

## 6. Load Data into DataFrame

Convert the MongoDB collection into a pandas DataFrame for easier data manipulation.

In [None]:
import pandas as pd

# Convert MongoDB collection to pandas DataFrame
news_df = pd.DataFrame(list(news_data_collection.find()))
print(f"DataFrame shape: {news_df.shape}")
print(f"Columns: {list(news_df.columns)}")

news_df.head()

DataFrame shape: (100, 8)
Columns: ['_id', 'title', 'publisher', 'tickers', 'link', 'authors', 'time_published', 'content']


| | _id | title | publisher | tickers | link | authors | time_published | content |
|---|---|---|---|---|---|---|---|---|
| 0 | 692ae48a4a7fefe22f224b8a | Medicare Advantage woos seniors with plan perk... | Yahoo Finance | [{'symbol': 'HUM', 'change': '-0.09%'}] | https://finance.yahoo.com/news/medicare-advant... | Kerry Hannon · Senior Columnist | Sat, November 29, 2025 at 6:57 PM GMT+7 | Seniors have embraced Medicare Advantage plans... |
| 1 | 692ae48b4a7fefe22f224b8b | BofA Tracked Credit and Debit Spending By Gene... | Investopedia | [{'symbol': 'BAC', 'change': None}] | https://finance.yahoo.com/news/bofa-tracked-cr... | Adam Hayes | Sat, November 29, 2025 at 6:08 PM GMT+7 | PeopleImages / Getty Images\nU.S. card spendin... |
| 2 | 692ae48c4a7fefe22f224b8c | Palantir uses the '5 Whys' approach to problem... | Business Insider | [{'symbol': 'PLTR', 'change': None}, {'symbol'... | https://finance.yahoo.com/news/palantir-uses-5... | Brent D. Griffiths | Sat, November 29, 2025 at 6:01 PM GMT+7 | Palantir CEO Alex Karp swears by a method that... |
| 3 | 692ae48c4a7fefe22f224b8d | HELOC rates today, November 29, 2025: Rates fa... | Yahoo Personal Finance | [] | https://finance.yahoo.com/personal-finance/mor... | Hal Bundrick, CFP® · Senior Writer Laur... | Sat, November 29, 2025 at 6:00 PM GMT+7 | The national average HELOC rate remains under ... |
| 4 | 692ae48d4a7fefe22f224b8e | Why the World’s Top Coffee Producer is Switchi... | Bloomberg | [] | https://finance.yahoo.com/news/why-world-top-c... | Renata Carlos Daou | Sat, November 29, 2025 at 6:00 PM GMT+7 | A worker packs coffee cherries during a harves... |
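
Only `title` and `content` feed the sentiment step, so for larger collections it may be worth asking MongoDB to return just those fields. The sketch below passes a projection as the second argument of `find`. `fetch_text_fields` and `StubCollection` are hypothetical names; the stub is there only so the block runs without a database:

```python
def fetch_text_fields(collection, fields=("title", "content")):
    """Fetch all documents, requesting only the given fields (plus _id)."""
    projection = {name: 1 for name in fields}
    return list(collection.find({}, projection))


class StubCollection:
    """Minimal stand-in for a pymongo collection so the sketch runs offline."""

    def __init__(self, documents):
        self._documents = documents

    def find(self, query, projection):
        # Mimic an inclusion projection: keep _id and the requested fields.
        wanted = {"_id", *projection}
        return [{k: v for k, v in doc.items() if k in wanted} for doc in self._documents]


docs = [{"_id": 1, "title": "T", "content": "C", "link": "https://example.com"}]
rows = fetch_text_fields(StubCollection(docs))
```

With the real collection, `pd.DataFrame(fetch_text_fields(news_data_collection))` would yield a narrower DataFrame than the eight-column one shown above.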


## 7. Prepare Text for Analysis

Combine article titles and content to create a comprehensive text field for sentiment analysis.

In [50]:
# Combine title and content into a single text field for analysis
# Handle missing values by filling with empty strings
news_df['full_text'] = news_df['title'].fillna('') + ". " + news_df['content'].fillna('')

print("Created 'full_text' column with combined title and content")
print(f"Average text length: {news_df['full_text'].str.len().mean():.0f} characters")

Created 'full_text' column with combined title and content
Average text length: 2373 characters
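
With `fillna('')`, an article that lacks a title still gets a leading `". "`, and one that lacks content gets a trailing one. A small sketch of a join that skips empty parts is shown below. `combine_text` is a hypothetical helper, written in plain Python so it can be checked without the DataFrame:

```python
def combine_text(title, content, separator=". "):
    """Join title and content, skipping parts that are missing or blank."""
    parts = []
    for part in (title, content):
        # Missing values from pandas arrive as NaN (a float), so only keep strings.
        if isinstance(part, str) and part.strip():
            parts.append(part.strip())
    return separator.join(parts)


both = combine_text("HELOC rates today", "The national average HELOC rate remains low")
content_only = combine_text(None, "The national average HELOC rate remains low")
neither = combine_text(float("nan"), "")
```

It could be applied row by row, for example with a list comprehension over `zip(news_df['title'], news_df['content'])`.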


## 8. Define Sentiment Analysis Function

Create a robust function to handle sentiment analysis with error handling and text truncation.

In [52]:
def get_sentiment(text):
    # Validate input text
    if not isinstance(text, str) or len(text) < 5:
        return pd.Series([None, 0.0], index=['sentiment_label', 'sentiment_score'])
    
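    # Truncate by characters as a rough guard against BERT's 512-token limit.
    # ~1500 chars ≈ 512 tokens holds only for ordinary prose; token-dense text
    # (e.g. tables of numbers) can still exceed the limit and raise an error.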
    truncated_text = text[:1500]
    
    try:
        # Run FinBERT sentiment analysis
        result = finbert(truncated_text)[0]
        return pd.Series([result['label'], result['score']], 
                        index=['sentiment_label', 'sentiment_score'])
    except Exception as e:
        # Log error and return error indicator
        print(f"Sentiment analysis failed: {str(e)[:100]}...")
        return pd.Series(["Error", 0.0], index=['sentiment_label', 'sentiment_score'])
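
Character count is only a rough proxy for token count: text dense with numbers or symbols can split into far more tokens per character than ordinary prose. One alternative is to let the tokenizer do the truncation. The sketch below assumes the pipeline forwards `truncation` and `max_length` to the tokenizer at call time, which Hugging Face text-classification pipelines accept as keyword arguments. `classify_truncated` is a hypothetical helper, and the stub classifier exists only so the block runs without downloading the model:

```python
def classify_truncated(classifier, text, max_tokens=512):
    """Classify text, asking the tokenizer to cut the input at max_tokens tokens."""
    result = classifier(text, truncation=True, max_length=max_tokens)[0]
    return result["label"], result["score"]


def stub_classifier(text, **tokenizer_kwargs):
    """Stand-in for the `finbert` pipeline; records the keyword arguments it receives."""
    stub_classifier.last_kwargs = tokenizer_kwargs
    return [{"label": "Neutral", "score": 0.5}]


label, score = classify_truncated(stub_classifier, "Oil futures settled lower on Friday.")
```

In the notebook, `classify_truncated(finbert, text)` would take the place of the `finbert(truncated_text)[0]` call, making the character slice unnecessary.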

## 9. Apply Sentiment Analysis

Process all articles through the FinBERT model to obtain sentiment classifications.

In [53]:
# Apply sentiment analysis to all articles
news_df[['sentiment_label', 'sentiment_score']] = news_df['full_text'].apply(get_sentiment)
print("completed")

Sentiment analysis failed: The size of tensor a (612) must match the size of tensor b (512) at non-singleton dimension 1...
Sentiment analysis failed: The size of tensor a (561) must match the size of tensor b (512) at non-singleton dimension 1...
completed


## 10. Analyze Results

Examine the distribution of sentiment classifications across all news articles.

In [54]:
# Display sentiment distribution
print("Sentiment Classification Results:")
print("=" * 40)
sentiment_counts = news_df['sentiment_label'].value_counts()
print(sentiment_counts)

# Calculate percentages
print("\nSentiment Distribution (%):")
sentiment_percentages = news_df['sentiment_label'].value_counts(normalize=True) * 100
for label, percentage in sentiment_percentages.items():
    print(f"{label}: {percentage:.1f}%")

Sentiment Classification Results:
sentiment_label
Neutral     52
Positive    25
Negative    21
Error        2
Name: count, dtype: int64

Sentiment Distribution (%):
Neutral: 52.0%
Positive: 25.0%
Negative: 21.0%
Error: 2.0%
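
The percentages above keep the `Error` rows in the denominator, so they describe all 100 articles rather than the 98 that were actually classified. Below is a sketch that recomputes the shares over classified articles only, using the counts printed above. `distribution_excluding` is a hypothetical helper written against a plain dict so it runs on its own:

```python
def distribution_excluding(counts, excluded=("Error",)):
    """Return percentage shares over the labels not listed in `excluded`."""
    kept = {label: n for label, n in counts.items() if label not in excluded}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {label: 100 * n / total for label, n in kept.items()}


counts = {"Neutral": 52, "Positive": 25, "Negative": 21, "Error": 2}
shares = distribution_excluding(counts)
for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```

With pandas, the same effect comes from filtering out the `Error` rows before calling `value_counts(normalize=True)`.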


## 11. Error Analysis

Examine any articles that couldn't be processed to understand potential issues.

In [56]:
# Check for articles that resulted in processing errors
error_articles = news_df[news_df['sentiment_label'] == 'Error']
print(f"Articles with processing errors: {len(error_articles)}")

Articles with processing errors: 2


### Detailed Error Investigation

If errors exist, let's examine the problematic articles in detail to understand what went wrong.

In [59]:
# Detailed analysis of problematic articles
error_articles = news_df[news_df['sentiment_label'] == 'Error']

if len(error_articles) > 0:
    print("Detailed Error Analysis:")
    print("=" * 50)
    
    for idx, row in error_articles.iterrows():
        text = row['full_text']
        print(f"\nArticle Index: {idx}")
        print(f"Text Length: {len(text)} characters")
        print(f"Text Type: {type(text)}")
        print(f"Text: '{text}'")
        print("-" * 30)
else:
    print("No errors to investigate - all articles processed successfully!")

Detailed Error Analysis:

Article Index: 68
Text Length: 3369 characters
Text Type: <class 'str'>
Text: 'BC-OILS. NEW YORK (AP) — Futures trading on the New York Mercantile Exchange Friday:
Open
High
Low
Settle
Chg.
Jan 26
58.58
59.64
58.27
58.55
-.10
Feb 26
58.33
59.34
58.02
58.29
-.13
Mar 26
58.18
59.12
57.83
58.09
-.17
Apr 26
58.14
59.01
57.73
58.00
-.20
May 26
58.12
59.00
57.80
58.01
-.22
Jun 26
59.05
57.81
58.06
-.24
Jul 26
58.31
57.98
58.11
-.25
Aug 26
58.39
59.09
57.97
58.10
-.28
Sep 26
58.22
58.07
-.30
Oct 26
58.67
58.93
57.89
58.03
-.32
Nov 26
58.23
58.92
57.88
-.34
Dec 26
57.84
58.05
-.35
Jan 27
-.37
Feb 27
-.38
Mar 27
58.13
Apr 27
May 27
58.34
-.39
Jun 27
58.61
59.30
58.46
Jul 27
Aug 27
58.64
Sep 27
58.73
-.40
Oct 27
58.84
-.41
Nov 27
58.99
Dec 27
59.41
59.94
58.96
59.13
Jan 28
59.22
Feb 28
59.31
Mar 28
59.42
Apr 28
59.54
May 28
59.69
Jun 28
60.07
59.83
Jul 28
59.91
Aug 28
60.00
Sep 28
60.13
Oct 28
60.26
Nov 28
60.41
Dec 28
60.75
61.16
60.42
60.54
Jan 29
60.62
Feb 29
60.72
M
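
Article 68 is a table of futures prices rather than prose, which is consistent with the token overflow: numbers tend to split into several tokens each, so 1,500 characters of them can exceed 512 tokens. Such tables also carry little sentiment, so one option is to flag them before classification. The sketch below is a rough heuristic; `digit_ratio` and `looks_like_price_table` are hypothetical helpers, and the 0.3 threshold is an arbitrary, untuned assumption:

```python
def digit_ratio(text):
    """Fraction of non-whitespace characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def looks_like_price_table(text, threshold=0.3):
    """Heuristic: treat digit-heavy text as tabular data rather than prose."""
    return digit_ratio(text) >= threshold


table_snippet = "Jan 26\n58.58\n59.64\n58.27\n58.55\n-.10"
prose_snippet = "Seniors have embraced Medicare Advantage plans for their perks."
table_flag = looks_like_price_table(table_snippet)
prose_flag = looks_like_price_table(prose_snippet)
```

Flagged rows could be labeled separately (for example as skipped) instead of being sent to FinBERT.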


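**Key Findings:**
- Of the 100 articles, FinBERT labeled 52 as Neutral, 25 as Positive, and 21 as Negative; neutral coverage dominates this sample.
- Two articles failed with a tensor size mismatch because the 1,500-character cut still produced more than 512 tokens (612 and 561). The `try`/`except` in `get_sentiment` let the run finish and marked them as `Error`; the one inspected in detail (index 68) is a futures price table rather than prose.
- Results can be used for further analysis, visualization, or integration into trading strategies.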