# Text Processing and Sentiment Engine

This notebook implements the Natural Language Processing (NLP)
and sentiment extraction component of the Market Mood & Moves project.

The objective is to transform raw financial news headlines into
numerical sentiment signals using:
- classical NLP baselines
- a domain-specific transformer model (FinBERT)

The focus is on understanding the NLP pipeline and producing
clean sentiment signals for downstream financial analysis.


In [1]:
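import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from transformers import pipeline

# Fetch the NLTK resources used in later cells (skipped if already present).
# Newer NLTK releases look for "punkt_tab" rather than "punkt" in word_tokenize.
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet", "vader_lexicon"):
    nltk.download(resource, quiet=True)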


In [2]:
news_df = pd.DataFrame({
    "headline": [
        "Apple reports strong quarterly earnings",
        "Apple faces antitrust scrutiny in the European Union",
        "Amazon expands cloud infrastructure in India",
        "Microsoft acquires an artificial intelligence startup",
        "Technology stocks fall amid inflation concerns",
        "Markets remain cautious ahead of Federal Reserve meeting"
    ]
})

news_df


| | headline |
|---|---|
| 0 | Apple reports strong quarterly earnings |
| 1 | Apple faces antitrust scrutiny in the European... |
| 2 | Amazon expands cloud infrastructure in India |
| 3 | Microsoft acquires an artificial intelligence ... |
| 4 | Technology stocks fall amid inflation concerns |
| 5 | Markets remain cautious ahead of Federal Reser... |


In [3]:
news_df["char_length"] = news_df["headline"].apply(len)
news_df["word_count"] = news_df["headline"].apply(lambda x: len(x.split()))

news_df


| | headline | char_length | word_count |
|---|---|---|---|
| 0 | Apple reports strong quarterly earnings | 39 | 5 |
| 1 | Apple faces antitrust scrutiny in the European... | 52 | 8 |
| 2 | Amazon expands cloud infrastructure in India | 44 | 6 |
| 3 | Microsoft acquires an artificial intelligence ... | 53 | 6 |
| 4 | Technology stocks fall amid inflation concerns | 46 | 6 |
| 5 | Markets remain cautious ahead of Federal Reser... | 56 | 8 |


In [4]:
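stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stop words, then lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha()]  # drops punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]
    # lemmatize() treats every token as a noun by default, so verb forms
    # such as "expands" and "acquires" pass through unchanged.
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

news_df["processed_tokens"] = news_df["headline"].apply(preprocess_text)
news_df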


| | headline | char_length | word_count | processed_tokens |
|---|---|---|---|---|
| 0 | Apple reports strong quarterly earnings | 39 | 5 | [apple, report, strong, quarterly, earnings] |
| 1 | Apple faces antitrust scrutiny in the European... | 52 | 8 | [apple, face, antitrust, scrutiny, european, u... |
| 2 | Amazon expands cloud infrastructure in India | 44 | 6 | [amazon, expands, cloud, infrastructure, india] |
| 3 | Microsoft acquires an artificial intelligence ... | 53 | 6 | [microsoft, acquires, artificial, intelligence... |
| 4 | Technology stocks fall amid inflation concerns | 46 | 6 | [technology, stock, fall, amid, inflation, con... |
| 5 | Markets remain cautious ahead of Federal Reser... | 56 | 8 | [market, remain, cautious, ahead, federal, res... |


In [5]:
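sia = SentimentIntensityAnalyzer()

def vader_sentiment(text):
    """Return VADER's neg/neu/pos proportions and the normalized compound score."""
    return sia.polarity_scores(text)

# Score the raw headlines rather than processed_tokens: VADER's rules use
# cues such as capitalization and punctuation, which preprocessing removes.
vader_scores = news_df["headline"].apply(vader_sentiment)
vader_df = pd.DataFrame(list(vader_scores))

# Column-wise concat aligns on the index; both frames use the default RangeIndex.
news_df = pd.concat([news_df, vader_df], axis=1)
news_df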


| | headline | char_length | word_count | processed_tokens | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|
| 0 | Apple reports strong quarterly earnings | 39 | 5 | [apple, report, strong, quarterly, earnings] | 0.0 | 0.548 | 0.452 | 0.5106 |
| 1 | Apple faces antitrust scrutiny in the European... | 52 | 8 | [apple, face, antitrust, scrutiny, european, u... | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | Amazon expands cloud infrastructure in India | 44 | 6 | [amazon, expands, cloud, infrastructure, india] | 0.0 | 0.563 | 0.437 | 0.2732 |
| 3 | Microsoft acquires an artificial intelligence ... | 53 | 6 | [microsoft, acquires, artificial, intelligence... | 0.0 | 0.617 | 0.383 | 0.4767 |
| 4 | Technology stocks fall amid inflation concerns | 46 | 6 | [technology, stock, fall, amid, inflation, con... | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | Markets remain cautious ahead of Federal Reser... | 56 | 8 | [market, remain, cautious, ahead, federal, res... | 0.167 | 0.833 | 0.0 | -0.1027 |

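The `compound` column can be turned into discrete labels for a quick read of the baseline. The sketch below assumes the conventional VADER cut-offs of ±0.05; the helper name `label_from_compound` is hypothetical and not part of NLTK, and the scores are copied by hand from the output above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the VADER output above, in row order.
compounds = [0.5106, 0.0, 0.2732, 0.4767, 0.0, -0.1027]
labels = [label_from_compound(c) for c in compounds]
print(labels)
# ['positive', 'neutral', 'positive', 'positive', 'neutral', 'negative']
```

Under these thresholds, "Technology stocks fall amid inflation concerns" comes out neutral, which shows how a general-purpose lexicon can miss finance-specific cues.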

### FinBERT: Domain-Specific Sentiment Modeling for Finance

FinBERT is a domain-adapted version of BERT designed specifically for financial text.
Standard BERT is pre-trained on general English corpora such as Wikipedia, so sentiment
models built on it often misread words whose sentiment or meaning shifts in a financial context.
For example, terms like *“liability”*, *“loss”*, or *“cost”* are often routine accounting concepts
in finance but carry negative connotations in everyday language.

To address this domain shift, FinBERT applies transfer learning with domain adaptation.
Starting from a general BERT model, it undergoes further pre-training on large-scale
financial news corpora (TRC2-Financial) using the Masked Language Modeling objective.
This enables the model to internalize the statistical structure and semantics of
financial language before being fine-tuned on labeled financial sentiment data.
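To make the Masked Language Modeling objective concrete, here is a minimal sketch of how one training example could be corrupted. The 15% selection rate and the 80/10/10 split (mask, random token, unchanged) follow the original BERT recipe; that FinBERT's further pre-training used exactly these settings is an assumption. The function `mask_tokens`, the toy vocabulary, and the per-token coin flip (a simplification of BERT's sampling) are all illustrative.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, nothing to predict.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: leave unchanged
    return corrupted, labels

tokens = "technology stocks fall amid inflation concerns".split()
toy_vocab = ["earnings", "bond", "yield", "profit", "debt"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

During further pre-training the model is asked to recover `labels` from `corrupted`, which pushes it to learn which words are plausible in a financial context.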


### How FinBERT Processes Financial Text

FinBERT follows the same transformer-based architecture as BERT.
Input text is first tokenized using WordPiece tokenization, which keeps common words whole
and splits words missing from the vocabulary into smaller subword units. Each input token
representation is the sum of a token embedding, a positional embedding, and a segment embedding.
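The subword splitting can be illustrated with a small sketch of greedy longest-match-first WordPiece tokenization. The vocabulary here is a toy, hypothetical one (the real model ships a vocabulary of tens of thousands of entries), and `wordpiece_tokenize` is an illustrative helper rather than the library's tokenizer.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece units by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"anti", "##trust", "inflation", "stock", "##s"}  # hypothetical entries
print(wordpiece_tokenize("antitrust", toy_vocab))  # ['anti', '##trust']
print(wordpiece_tokenize("stocks", toy_vocab))     # ['stock', '##s']
print(wordpiece_tokenize("dividend", toy_vocab))   # ['[UNK]']
```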

The token sequence is then passed through multiple Transformer encoder layers,
where self-attention lets each token adjust its representation
based on the surrounding context. This contextualization is critical in finance,
where the meaning of a word depends heavily on its usage within a sentence.
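As a rough illustration of that contextualization step, the following is a single-head scaled dot-product self-attention sketch in NumPy. It assumes random weights and tiny dimensions, and it omits multi-head splitting, padding masks, residual connections, and layer normalization, all of which a real encoder layer includes.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4           # illustrative sizes only
X = rng.normal(size=(seq_len, d_model))      # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

context, weights = self_attention(X, W_q, W_k, W_v)
print(context.shape)         # (5, 4)
print(weights.sum(axis=1))   # every row sums to 1 (up to floating-point error)
```

Each row of `weights` says how much one token draws on every other token when its new representation is built.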

For sentiment analysis, a classification head is attached to the final hidden state of the `[CLS]` token.
The model outputs logits corresponding to three classes: Positive, Negative, and Neutral.
These logits are converted into probabilities using a softmax function, allowing the
model to quantify sentiment confidence. In practice, these probabilities can be mapped
to numerical sentiment scores for downstream aggregation and trading strategies.
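One simple way to turn the three class probabilities into a single number is `P(positive) - P(negative)`, which lies in [-1, 1]. The sketch below is one possible mapping, not the project's definitive method: the logits are made up, the label order is an assumption, and `score_headlines` is a hypothetical helper that accepts any callable returning per-class `{"label", "score"}` dictionaries for each headline.

```python
import math

LABELS = ("positive", "negative", "neutral")  # assumed label order

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentiment_score(probs_by_label):
    """Collapse class probabilities into one score in [-1, 1]."""
    return probs_by_label.get("positive", 0.0) - probs_by_label.get("negative", 0.0)

def score_headlines(classifier, headlines):
    """Score headlines with a classifier that returns all class probabilities.

    `classifier(headlines)` is assumed to return, for each headline, a list
    of {"label": ..., "score": ...} dictionaries covering every class.
    """
    scores = []
    for per_label in classifier(headlines):
        probs = {d["label"].lower(): d["score"] for d in per_label}
        scores.append(sentiment_score(probs))
    return scores

# Made-up logits for a single headline, for illustration only.
logits = [2.0, 0.5, 1.0]
probs = dict(zip(LABELS, softmax(logits)))
print(probs)
print(sentiment_score(probs))
```

With the `pipeline` import from the first cell, a classifier of this shape could likely be built as `pipeline("text-classification", model="ProsusAI/finbert", top_k=None)` and passed to `score_headlines` along with `news_df["headline"].tolist()`. The model identifier and its lowercase label names are assumptions to verify against the model card before relying on them.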
