# Sentiment Analysis - Financial Statement

**Portfolio Project**

This project applies FinBERT, a BERT model trained on financial text, to financial statement commentary in order to classify the sentiment of each passage.

The results were used to identify and analyse patterns and trends in CEO commentary.

# 1. Demo Analysis

In [None]:
# Install the Transformers and xFormers libraries
!pip install transformers
!pip install xformers

In [None]:
# Import the BERT tokenizer and classification model, the pipeline helper, and the data libraries
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline
import numpy as np
import pandas as pd

In [None]:
# The model has three output classes: Negative, Neutral and Positive
FinBERT = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
Tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

In [None]:
# Create a pipeline
nlp = pipeline("sentiment-analysis", model=FinBERT, tokenizer=Tokenizer)

In [None]:
# Demo sentences to pass through the pipeline
sentences = ["there is a shortage of capital, and we need extra financing",
             "growth is strong and we have plenty of liquidity",
             "there are doubts about our finances",
             "profits are flat"]

In [None]:
# Run the pipeline on the demo sentences and print the results
results = nlp(sentences)
print(results)

[{'label': 'Negative', 'score': 0.9966173768043518}, {'label': 'Positive', 'score': 1.0}, {'label': 'Negative', 'score': 0.9999710321426392}, {'label': 'Neutral', 'score': 0.9889441728591919}]
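
The pipeline returns a list of dictionaries, one per input sentence, which is hard to scan by eye. A minimal sketch of tabulating it next to the input text follows. The `sentences` and `results` values are copied from the demo cells above so the snippet runs without downloading the model; the `Sentence` column name and the `demo_df` variable are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Demo inputs and the pipeline output, copied from the cells above
sentences = ["there is a shortage of capital, and we need extra financing",
             "growth is strong and we have plenty of liquidity",
             "there are doubts about our finances",
             "profits are flat"]
results = [{'label': 'Negative', 'score': 0.9966173768043518},
           {'label': 'Positive', 'score': 1.0},
           {'label': 'Negative', 'score': 0.9999710321426392},
           {'label': 'Neutral', 'score': 0.9889441728591919}]

# One row per sentence: the dictionary keys become the columns 'label' and 'score'
demo_df = pd.DataFrame(results)
demo_df.insert(0, "Sentence", sentences)
demo_df["score"] = demo_df["score"].round(4)
print(demo_df.to_string(index=False))
```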


In [None]:
# Test sentence from EY commentary to pass through the pipeline
EYsentence1 = ["With companies facing a convergence of challenges, from climate change and the pandemic to economic uncertainty and shifting consumer habits, the firm is investing in the talent and skills and services needed to help clients transform, grow and build trust with their stakeholders"]

In [None]:
# Run the pipeline on the EY sentence and print the results
EYresults = nlp(EYsentence1)
print(EYresults)

[{'label': 'Positive', 'score': 0.9144045114517212}]


# 2. Sentiment Analysis

In [None]:
# Import the dataset
from google.colab import files
uploaded = files.upload()

Saving Capstone Data - Cleansed.csv to Capstone Data - Cleansed.csv


In [None]:
# Creating a dataframe:
df1 = pd.read_csv('Capstone Data - Cleansed.csv',encoding='cp1252')

In [None]:
# Sense check: view the first 5 rows to confirm the data was read in correctly (no issues found)
df1.head()

Unnamed: 0,Year,Commentary
0,2014,"The firm continues to deliver revenue growth, ..."
1,2014,"Revenues have increased from £ 1,721m to £ 1,8..."
2,2014,The profit for the financial period increased ...
3,2014,The distributable profit (page 6) for the fina...
4,2014,Advisory was the fastest growing service line ...


In [None]:
# Define the function to analyse sentiment
def analyse_sentiment(sentence):
    results = nlp(sentence)
    label = results[0]['label']
    score = results[0]['score']
    return label, score

# Apply sentiment analysis to the 'Commentary' column and create new columns
df1[['Sentiment Label', 'Sentiment Probability Score']] = df1['Commentary'].apply(lambda x: pd.Series(analyse_sentiment(x)))
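
Calling the pipeline once per row works, but the pipeline also accepts a list of strings, so the whole column can be scored in a single call. The sketch below assumes a classifier that behaves like `nlp` (a list of strings in, a list of `{'label', 'score'}` dictionaries out). `add_sentiment_columns` and `stub_classifier` are hypothetical names, and the stub stands in for FinBERT only so the example runs without the model.

```python
import pandas as pd

def add_sentiment_columns(df, classifier, text_col="Commentary"):
    """Return a copy of df with label and score columns from one batched call."""
    results = classifier(df[text_col].tolist())
    out = df.copy()
    out["Sentiment Label"] = [r["label"] for r in results]
    out["Sentiment Probability Score"] = [r["score"] for r in results]
    return out

def stub_classifier(texts):
    # Hypothetical stand-in for the FinBERT pipeline; this is not a real model
    return [{"label": "Positive" if "growth" in t else "Neutral", "score": 1.0}
            for t in texts]

# Invented example rows using the same column names as df1
example = pd.DataFrame({
    "Year": [2014, 2014],
    "Commentary": ["The firm continues to deliver revenue growth.",
                   "Profits are flat."],
})
scored = add_sentiment_columns(example, stub_classifier)
print(scored)
```

In the notebook this would be called as `add_sentiment_columns(df1, nlp)`. BERT-based models accept at most 512 tokens, so for long commentary one option may be to wrap the call, for example `lambda texts: nlp(texts, truncation=True)`; whether truncating is acceptable depends on how long the commentary rows are.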

In [None]:
# Calculate the length of commentary in words
df1['Commentary Length'] = df1['Commentary'].apply(lambda x: len(x.split()))
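
The same word count can also be computed with pandas' vectorised string methods, which tolerate missing values: a missing row yields `NaN` instead of raising `AttributeError`. A small sketch on made-up rows (the `toy` frame is illustrative, not the project data):

```python
import pandas as pd

# Invented rows, including one missing value
toy = pd.DataFrame({"Commentary": ["profits are flat",
                                   "growth is strong and we have plenty of liquidity",
                                   None]})

# Split each string on whitespace, then take the length of each resulting list
toy["Commentary Length"] = toy["Commentary"].str.split().str.len()
print(toy)
```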

In [None]:
print(df1.to_string(index=False))

 Year  Commentary                                                                                        Sentiment Label  Sentiment Probability Score  Commentary Length
 2014  The firm continues to deliver revenue growth, despite the economic difficulties faced in the UK.  Positive         1.000000                     15
 2014

In [None]:
# Export the DataFrame as a CSV file
df1.to_csv("Output - Cleansed.csv")
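
Since the aim is to follow sentiment over time, one possible next step is to summarise the labels per year. The sketch below uses `pd.crosstab` on a few invented rows with the same column names as `df1`; the values are illustrative only and are not results from the project.

```python
import pandas as pd

# Invented rows; column names match df1
toy = pd.DataFrame({
    "Year": [2014, 2014, 2015, 2015, 2015],
    "Sentiment Label": ["Positive", "Neutral", "Positive", "Negative", "Positive"],
})

# Number of commentary rows per label in each year
counts = pd.crosstab(toy["Year"], toy["Sentiment Label"])

# The same table as row-wise shares, so years with different row counts are comparable
shares = pd.crosstab(toy["Year"], toy["Sentiment Label"], normalize="index")

print(counts)
print(shares.round(2))
```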

# 3. Length and Number Analysis

In [None]:
# Import the dataset
from google.colab import files
uploaded = files.upload()

Saving Raw Data.xlsx to Raw Data.xlsx


In [None]:
# Creating a dataframe
# pandas reads .xlsx files with openpyxl; xlrd 2.0+ only supports the legacy .xls format
df2 = pd.read_excel('Raw Data.xlsx', engine='openpyxl')

In [None]:
# Sense check: view the first 5 rows to confirm the data was read in correctly (no issues found)
df2.head()

Unnamed: 0,Year,Commentary
0,2014,"The firm continues to deliver revenue growth, ..."
1,2015,"The firm continues to deliver revenue growth, ..."
2,2016,"The firm continues to deliver revenue growth, ..."
3,2017,"The firm continues to deliver revenue growth, ..."
4,2018,The firm delivered a moderate revenue growth d...


In [None]:
# Calculate the length of commentary in words
df2['Commentary Length'] = df2['Commentary'].apply(lambda x: len(x.split()))

In [None]:
import nltk

# Download the required nltk resources (only needs to be done once)
nltk.download('punkt')

# Function to count the number of sentences in a text
def count_sentences(text):
    sentences = nltk.sent_tokenize(text)
    return len(sentences)

# Apply the function to the 'Commentary' column
df2['Num of Sentences'] = df2['Commentary'].apply(count_sentences)
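
With both a word count and a sentence count per row, the average sentence length follows by division. This is a sketch on invented numbers; the `Words per Sentence` column is a hypothetical addition, not something the original notebook computes.

```python
import pandas as pd

# Invented counts using the same column names as df2
toy = pd.DataFrame({"Commentary Length": [15, 40],
                    "Num of Sentences": [1, 4]})

# Average number of words per sentence in each commentary row
toy["Words per Sentence"] = toy["Commentary Length"] / toy["Num of Sentences"]
print(toy)
```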

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
print(df2.to_string(index=False))


In [None]:
# Export the DataFrame as a CSV file
df2.to_csv("Output - Raw.csv")