# Sentiment analysis for finance

In this notebook, I present three different approaches to sentiment analysis for finance:

1. Dictionary-based approach
2. FinBert and other BERT-based models
3. LLM models (GPT-3.5, Llama2, etc.)

In [1]:
import pandas as pd

First of all, let's load some financial news data to work with.

In [1]:
from langchain_community.document_loaders import NewsURLLoader

In [2]:
urls = [
    "https://sg.finance.yahoo.com/news/ether-etfs-to-start-trading-after-sec-green-light-money-managers-221846599.html",
    "https://sg.finance.yahoo.com/news/crypto-surges-on-political-respect-as-donald-trump-courts-bitcoin-enthusiasts-153720599.html",
    "https://sg.finance.yahoo.com/news/etf-stands-everything-fits-182106129.html",
    "https://sg.finance.yahoo.com/news/boeing-shares-fall-premarket-trade-143500263.html",
    "https://finance.yahoo.com/m/bc30d3bb-c438-3e52-ada4-b1f8c40f756e/solar-stocks-aren%E2%80%99t-having-a.html",
]

In [3]:
loader = NewsURLLoader(urls=urls)
data = loader.load()

In [4]:
data

[Document(metadata={'title': 'Ether ETFs to start trading after SEC green light: Money managers', 'link': 'https://sg.finance.yahoo.com/news/ether-etfs-to-start-trading-after-sec-green-light-money-managers-221846599.html', 'authors': [], 'language': 'en', 'description': 'BlackRock, Fidelity, and other big money managers said they got the final approval from regulators to start offering ether ETFs to everyday investors.', 'publish_date': None}, page_content='Exchange-traded funds that hold ether (ETH-USD) can start trading following final approval from regulators on Monday, according to the money managers that will oversee the new ETFs.\n\nThe Securities and Exchange Commission gave the green light Monday to BlackRock (BLK), Fidelity, Franklin Templeton, Grayscale, and 21 Shares, the companies said.\n\nTrading could begin as early as Tuesday.\n\nThe moves could make ether, the world’s second-largest cryptocurrency, a potential staple in 401(k)s, IRAs, and pension plans and grant the dig

In [5]:
df = pd.DataFrame(
    [{"title": d.metadata["title"], "text": d.page_content} for d in data]
)

df


## Dictionary-based approach

A dictionary-based approach is a simple way to perform sentiment analysis. It consists of using a list of words with associated sentiment scores. The sentiment score of a sentence is the sum of the sentiment scores of the words it contains. You can also normalize the score by the number of words in the sentence.
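As a minimal sketch of this idea, the snippet below scores a sentence with a tiny word list. The words and scores in `toy_scores` are made up for illustration and do not come from any published dictionary.

```python
# Toy dictionary: the words and their scores are hypothetical, for illustration only.
toy_scores = {"gain": 1, "strong": 1, "loss": -1, "weak": -1}


def dictionary_score(sentence: str) -> tuple[int, float]:
    """Return the raw score and the score normalized by the number of words."""
    words = sentence.lower().split()
    # Words missing from the dictionary contribute 0 to the sum.
    raw = sum(toy_scores.get(w, 0) for w in words)
    normalized = raw / len(words) if words else 0.0
    return raw, normalized


raw, normalized = dictionary_score("Strong gain despite a small loss")
print(raw, normalized)
```

Two positive words and one negative word give a raw score of 1, which is then divided by the six words in the sentence.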


For this type of approach, we will need to use a financial sentiment dictionary. I will use the Loughran-McDonald dictionary, which is widely used in finance. Note that this dictionary is designed for financial statements, so it may not be the best choice for news articles.

You can download the dictionary from the [Loughran-McDonald Master Dictionary page](https://sraf.nd.edu/loughranmcdonald-master-dictionary/) (direct link: [Loughran-McDonald_MasterDictionary_1993-2023.csv](https://drive.google.com/file/d/1ptUgGVeeUGhCbaKL14Ri3Xi5xOKkPkUD/view?usp=sharing)).


### Pre-processing

The preprocessing steps for a dictionary-based approach depend on the dictionary you are using and the final measure you want to obtain. In this case, we will use the Loughran-McDonald dictionary, which contains variations of each word, including plural forms, verb forms, etc. Therefore, we do not need to perform stemming or lemmatization, a common step in text preprocessing.

The preprocessing steps we will perform are:

- Lowercasing
- Removing punctuation
- Removing stopwords
- Removing numbers

The removal of stopwords and numbers is optional, but it affects the sentiment score of the text when the score is measured as a ratio of the number of words in the text, because it changes the denominator. Other common filtering steps include removing URLs, email addresses, city names, company names, etc.
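To see why stopword removal matters for ratio-based scores, here is a small sketch. The stopword and positive-word lists are hypothetical and much shorter than the real ones.

```python
# Hypothetical mini stopword list and positive-word list, for illustration only.
toy_stopwords = {"the", "a", "of", "in", "and"}
toy_positive = {"gain", "strong"}

sentence = "the strong gain in the shares of the company"
words = sentence.split()
kept = [w for w in words if w not in toy_stopwords]

# The number of positive words is the same with or without stopwords.
n_pos = sum(w in toy_positive for w in words)

ratio_with_stopwords = n_pos / len(words)  # 2 positive words out of 9
ratio_without_stopwords = n_pos / len(kept)  # 2 positive words out of 4
print(ratio_with_stopwords, ratio_without_stopwords)
```

The count of positive words does not change, but the ratio more than doubles once stopwords are dropped, so scores are only comparable across texts that were preprocessed the same way.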





In [6]:
# Use the same stopword list as Loughran-McDonald

with open("Stop_Words.txt", "r") as f:
    # splitlines() works whether or not the file ends with a newline
    stopwords = f.read().splitlines()
stopwords[:10]

['about', 'and', 'from', 'now', 'where', 'you', 'am', 'until', 'them', 'in']

In [7]:
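import string


def preprocess_text(text):
    # Lowercase and strip punctuation attached to a token ("Monday," -> "monday")
    words = [w.lower().strip(string.punctuation) for w in text.split()]
    # Keep alphabetic tokens only (drops numbers and empty strings), minus stopwords
    words = [w for w in words if w.isalpha() and w not in stopwords]
    return " ".join(words)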


df["text_clean"] = df["text"].apply(preprocess_text)
df


### Dictionary

Next, we will load the dictionary and make a list of positive words and a list of negative words.

In [8]:
lm_dict = pd.read_csv("Loughran-McDonald_MasterDictionary_1993-2023.csv")
lm_dict


In [9]:
pos_words = lm_dict[lm_dict["Positive"] != 0]["Word"].str.lower().to_list()
neg_words = lm_dict[lm_dict["Negative"] != 0]["Word"].str.lower().to_list()

pos_words[:10]


In [10]:
neg_words[:10]


### Sentiment score

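The sentiment score of a text is the sum of the sentiment scores of the words it contains. Here, each positive word counts as +1 and each negative word as -1, so the raw score is the number of positive words minus the number of negative words. We can also normalize the score by the number of words in the text.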
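As a small worked example with made-up counts, the sketch below computes two common normalizations: the net count divided by the total number of words, and the net count divided by the number of sentiment words. The function names and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def lm_scores(n_pos: int, n_neg: int, n_words: int) -> tuple[float, float]:
    """Return (net tone per word, polarity among sentiment words)."""
    net = n_pos - n_neg
    per_word = net / n_words if n_words else 0.0
    # With no sentiment words at all, the polarity ratio is 0/0; treat it as 0.0.
    total = n_pos + n_neg
    polarity = net / total if total else 0.0
    return per_word, polarity


def label(polarity: float, cutoff: float = 0.3) -> str:
    """Map a polarity in [-1, 1] to a class using a symmetric cutoff."""
    if polarity > cutoff:
        return "positive"
    if polarity < -cutoff:
        return "negative"
    return "neutral"


# Hypothetical counts: 6 positive and 2 negative words in a 200-word text.
per_word, polarity = lm_scores(n_pos=6, n_neg=2, n_words=200)
print(per_word, polarity, label(polarity))
```

The per-word score is small (0.02) because most words carry no sentiment, while the polarity (0.5) only compares positive with negative words. The cutoff value is an arbitrary choice and should be tuned to the application.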

In [11]:
df["n"] = df["text_clean"].apply(lambda x: len(x.split()))
df["n_pos"] = df["text_clean"].apply(
    lambda x: len([w for w in x.split() if w in pos_words])
)
df["n_neg"] = df["text_clean"].apply(
    lambda x: len([w for w in x.split() if w in neg_words])
)

df


In [12]:
df["lm_level"] = df["n_pos"] - df["n_neg"]

df["lm_score1"] = (df["n_pos"] - df["n_neg"]) / df["n"]
df["lm_score2"] = (df["n_pos"] - df["n_neg"]) / (df["n_pos"] + df["n_neg"])

CUTOFF = 0.3
df["lm_sentiment"] = df["lm_score2"].apply(
    lambda x: "positive" if x > CUTOFF else "negative" if x < -CUTOFF else "neutral"
)
df


## BERT-based models

BERT-based models are often called "state-of-the-art" in recent finance papers, even though the original [BERT paper](https://arxiv.org/abs/1810.04805) dates from 2018 and many more advanced models have come along since. They are pre-trained on a large corpus of text and then fine-tuned on a specific task. In this case, we will use FinBert, a BERT model fine-tuned on financial data (see the [FinBert paper](https://arxiv.org/abs/1908.10063)).

These models take a sequence of tokens as input and output a vector of size 768 (or 1024, depending on the model). This vector can be used as input to a classifier to predict the sentiment of the text. The FinBert model includes such a classifier, which outputs one logit for each of three classes: positive, negative, and neutral. Applying a softmax to the logits turns them into probabilities.
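To illustrate how three class scores become probabilities, here is a stdlib-only softmax sketch. The logits and the label order are made up for this example and do not come from FinBert.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one text; the label order here is an assumption.
labels = ["positive", "negative", "neutral"]
logits = [2.0, -1.0, 0.5]

probs = dict(zip(labels, softmax(logits)))
predicted = max(probs, key=probs.get)  # class with the highest probability
score = probs["positive"] - probs["negative"]  # net tone in [-1, 1]
print(probs, predicted, score)
```

The difference between the positive and negative probabilities gives a single continuous score, which is convenient when a ranking is needed rather than a label.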

### Pre-processing

We do not perform any pre-processing on the text before feeding it to the model. The tokenizer takes care of splitting the text and converting it to a sequence of tokens. A common optional pre-processing step for BERT models is masking some words (dates, company names, etc.).
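As a rough sketch of what such masking could look like, the snippet below replaces ticker symbols in parentheses and four-digit years with placeholder tokens. The regular expressions and the `[TICKER]` and `[YEAR]` placeholders are assumptions chosen for this example; whether masking helps depends on the application.

```python
import re

# Hypothetical patterns: tickers in parentheses such as "(BLK)" and four-digit years.
TICKER = re.compile(r"\(([A-Z]{1,5}(?:-[A-Z]{3})?)\)")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def mask_entities(text: str) -> str:
    """Replace ticker symbols and years with placeholder tokens."""
    text = TICKER.sub("([TICKER])", text)
    return YEAR.sub("[YEAR]", text)


masked = mask_entities("BlackRock (BLK) expects ether (ETH-USD) ETFs to grow in 2024.")
print(masked)
```

Masking company names themselves would need a name list or a named-entity recognizer, which is beyond this sketch.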


### Usage

Many BERT models, including FinBert, are available through the Hugging Face Transformers library and can be fetched automatically.



In [13]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import scipy.special  # import the submodule explicitly; softmax lives in scipy.special
import torch

In [14]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")


In [15]:
def finbert_sentiment(text: str) -> tuple[float, float, float, str]:
    with torch.no_grad():
        inputs = tokenizer(
            text, return_tensors="pt", padding=True, truncation=True, max_length=512
        )
        outputs = model(**inputs)
        logits = outputs.logits
        scores = {
            k: v
            for k, v in zip(
                model.config.id2label.values(),
                scipy.special.softmax(logits.numpy().squeeze()),
            )
        }
        return (
            scores["positive"],
            scores["negative"],
            scores["neutral"],
            max(scores, key=scores.get),
        )

In [16]:
# Notice that this is the raw text, no preprocessing
df[["finbert_pos", "finbert_neg", "finbert_neu", "finbert_sentiment"]] = (
    df["text"].apply(finbert_sentiment).apply(pd.Series)
)
df["finbert_score"] = df["finbert_pos"] - df["finbert_neg"]


In [18]:
df[
    [
        "title",
        "text",
        "finbert_pos",
        "finbert_neg",
        "finbert_neu",
        "finbert_sentiment",
        "finbert_score",
    ]
]


## LLM models

LLMs are large language models trained on a large corpus of text. They are often used for text generation, but they can also be used for sentiment analysis. The approach is to design a prompt that makes the model output a sentiment label and score. LangChain is a library that makes it easy to use LLMs for different tasks, including sentiment analysis.
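Before bringing in a library, the basic idea can be sketched in plain Python: build a prompt, send it to a model, and extract a label from the reply. The prompt wording and the helper names are assumptions for this example, and `fake_reply` stands in for a model response, so no model is called.

```python
import re

PROMPT = (
    "Classify the sentiment of the following financial news as "
    "positive, negative, or neutral. Answer with a single word.\n\n{news}"
)


def build_prompt(news: str) -> str:
    """Fill the news text into the prompt template."""
    return PROMPT.format(news=news)


def parse_label(reply: str) -> str:
    """Extract the first sentiment label found in a free-text reply."""
    match = re.search(r"\b(positive|negative|neutral)\b", reply.lower())
    return match.group(1) if match else "error"


# `fake_reply` is written by hand; it is not the output of a real model.
fake_reply = "Negative. The article reports a drop in the share price."
print(build_prompt("Boeing shares fall in premarket trade."))
print(parse_label(fake_reply))
```

Free-text replies are fragile to parse, which is one reason to ask for a structured output format instead.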
    

In [19]:
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

In [20]:
from langchain_community.chat_models import ChatOllama

In [21]:
# For handling errors
from tenacity import retry, stop_after_attempt, RetryError

We will first define the desired output format as a Pydantic model, from which we will
create a PydanticOutputParser object. This object will be used to inject the output
definition into the prompt and to parse the output of the model.

In [22]:
class SentimentClassification(BaseModel):
    sentiment: str = Field(
        ...,
        description="The sentiment of the text",
        enum=["positive", "negative", "neutral"],
    )
    score: float = Field(..., description="The score of the sentiment", ge=-1, le=1)
    justification: str = Field(..., description="The justification of the sentiment")
    main_entity: str = Field(..., description="The main entity discussed in the text")
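To make the parsing step concrete, here is a stdlib-only sketch of the kind of checks a structured-output parser performs on a model reply. It mirrors the four fields above with a plain dataclass. It is an illustration of the idea, not how PydanticOutputParser is implemented, and the reply string is written by hand.

```python
import json
from dataclasses import dataclass

ALLOWED = {"positive", "negative", "neutral"}


@dataclass
class Sentiment:
    sentiment: str
    score: float
    justification: str
    main_entity: str


def parse_reply(reply: str) -> Sentiment:
    """Parse a JSON reply and enforce the label set and the score range."""
    data = json.loads(reply)
    result = Sentiment(
        sentiment=str(data["sentiment"]),
        score=float(data["score"]),
        justification=str(data["justification"]),
        main_entity=str(data["main_entity"]),
    )
    if result.sentiment not in ALLOWED:
        raise ValueError(f"unexpected sentiment: {result.sentiment!r}")
    if not -1 <= result.score <= 1:
        raise ValueError(f"score out of range: {result.score}")
    return result


# Hypothetical model reply, written by hand for this example.
reply = (
    '{"sentiment": "positive", "score": 0.6, '
    '"justification": "Approval of new ETFs.", "main_entity": "ether"}'
)
parsed = parse_reply(reply)
print(parsed)
```

A reply that is not valid JSON, misses a field, or violates a constraint raises an exception, which is what makes retrying the call worthwhile.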

In [25]:
@retry(stop=stop_after_attempt(10))
def run_chain(text: str, chain) -> dict:
    return chain.invoke({"news": text}).dict()


def llm_sentiment(text: str, llm) -> tuple[str, float, str, str]:
    parser = PydanticOutputParser(pydantic_object=SentimentClassification)

    prompt = PromptTemplate(
        template="Describe the sentiment of a text of financial news.\n{format_instructions}\n{news}\n",
        input_variables=["news"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    chain = prompt | llm | parser

    try:
        result = run_chain(text, chain)

        return (
            result["sentiment"],
            result["score"],
            result["justification"],
            result["main_entity"],
        )
    except RetryError as e:
        print(f"Error: {e}")
        return "error", 0, "", ""

In [26]:
# Replace with the correct model, or use ChatOpenAI if you want to use OpenAI
llama2 = ChatOllama(model="llama2", temperature=0.1)

df[
    ["llama2_sentiment", "llama2_score", "llama2_justification", "llama2_main_entity"]
] = (df["text"].apply(lambda x: llm_sentiment(x, llama2)).apply(pd.Series))


In [27]:
df[
    [
        "title",
        "text",
        "llama2_sentiment",
        "llama2_score",
        "llama2_justification",
        "llama2_main_entity",
    ]
]


In [46]:
mixtral = ChatOllama(model="dolphin-mixtral:latest", temperature=0.1)

df[
    [
        "mixtral_sentiment",
        "mixtral_score",
        "mixtral_justification",
        "mixtral_main_entity",
    ]
] = (
    df["text"].apply(lambda x: llm_sentiment(x, mixtral)).apply(pd.Series)
)

Error: RetryError[<Future at 0x145bcedd0 state=finished raised ConnectionError>]
Error: RetryError[<Future at 0x145beb2d0 state=finished raised ConnectionError>]
Error: RetryError[<Future at 0x145bb3bd0 state=finished raised ConnectionError>]
Error: RetryError[<Future at 0x145bb5c10 state=finished raised ConnectionError>]
Error: RetryError[<Future at 0x145bea9d0 state=finished raised ConnectionError>]


In [47]:
df[
    [
        "title",
        "text",
        "mixtral_sentiment",
        "mixtral_score",
        "mixtral_justification",
        "mixtral_main_entity",
    ]
]

| | title | text | mixtral_sentiment | mixtral_score | mixtral_justification | mixtral_main_entity |
|---|---|---|---|---|---|---|
| 0 | Ether ETFs to start trading after SEC green li... | Exchange-traded funds that hold ether (ETH-USD... | error | 0 | | |
| 1 | Crypto surges on political respect as Donald T... | Crypto is surging again on a new wave of polit... | error | 0 | | |
| 2 | ETF Stands for 'Everything That Fits' | One of the main crypto narratives this year ha... | error | 0 | | |
| 3 | Boeing Shares Fall in Premarket Trade After Ea... | Boeing shares fall in premarket trade after ea... | error | 0 | | |
| 4 | Solar Stocks Aren’t Having a Good Run. Why The... | There are no important events for this country... | error | 0 | | |


In [48]:
import textwrap

print(textwrap.fill(df.iloc[0]["text"][:500] + "...") + "\n")
print("Llama2: " + textwrap.fill(df.iloc[0]["llama2_justification"]) + "\n")
print("Mixtral: " + textwrap.fill(df.iloc[0]["mixtral_justification"]))

Exchange-traded funds that hold ether (ETH-USD) can start trading
following final approval from regulators on Monday, according to the
money managers that will oversee the new ETFs.  The Securities and
Exchange Commission gave the green light Monday to BlackRock (BLK),
Fidelity, Franklin Templeton, Grayscale, and 21 Shares, the companies
said.  Trading could begin as early as Tuesday.  The moves could make
ether, the world’s second-largest cryptocurrency, a potential staple
in 401(k)s, IRAs, and...

Llama2: 

Mixtral: 


In [49]:
df[
    [
        "title",
        "text",
        "lm_sentiment",
        "finbert_sentiment",
        "llama2_sentiment",
        "mixtral_sentiment",
    ]
]

| | title | text | lm_sentiment | finbert_sentiment | llama2_sentiment | mixtral_sentiment |
|---|---|---|---|---|---|---|
| 0 | Ether ETFs to start trading after SEC green li... | Exchange-traded funds that hold ether (ETH-USD... | positive | positive | error | error |
| 1 | Crypto surges on political respect as Donald T... | Crypto is surging again on a new wave of polit... | neutral | positive | error | error |
| 2 | ETF Stands for 'Everything That Fits' | One of the main crypto narratives this year ha... | neutral | neutral | error | error |
| 3 | Boeing Shares Fall in Premarket Trade After Ea... | Boeing shares fall in premarket trade after ea... | neutral | negative | error | error |
| 4 | Solar Stocks Aren’t Having a Good Run. Why The... | There are no important events for this country... | neutral | neutral | error | error |
