# Bank of England Sentiment Analysis
## Employer Project
### Team 8 AnalytIQ, June 2nd, 2025
**Team Members**: Lalitha Vemuri, Christina Tsoulfa, Reka Bodo, Yann Hirsig, Louis Pang, Dr. Karin Agius Ferrante

## Content
1. Approach
2. Load the Data
4. Exploratory Sentiment Analysis & Natural Language Processing (NLP)
5. Exploratory Analysis for Correlation with Economic Indicators
8. Insights & Recommendations

## 1. Approach

The **Bank of England (BoE)**, the UK’s central bank and one of the world’s leading financial institutions, plays a pivotal role in maintaining economic and financial stability, and supporting the UK government’s economic policies. One of its key communication channels with the public and markets is through formal speeches delivered by its representatives. These speeches aim to offer guidance, manage expectations, and provide clarity in times of uncertainty.

However, the effectiveness and impact of these speeches on economic indicators and market behaviour are not fully understood. 

This project explores whether the sentiment and timing of BoE speeches hold analytical or predictive value when compared with economic performance and key events.

### Main Business Questions
**Has the tone or sentiment of the BoE’s speeches evolved over time? If so, how?**<br>
**How do sentiments align with events like interest rate changes, policy reports, or major economic releases?**

**Sub-questions**

1. Are there measurable correlations between speech sentiment and UK economic indicators such as inflation, GDP, employment rates and bond yields?
2. Does a change in speech sentiment lead to changes in economic indicators, or is speech sentiment reactive to those indicators?
3. Can speech sentiment trends be used to predict market reactions or economic outcomes?
4. What broader insights can be drawn to support data-informed communication strategies?
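
Sub-question 2 is a lead-lag question, and one simple way to probe it is to correlate an indicator with sentiment shifted by different numbers of months. The sketch below is illustrative only: the data are synthetic, the two-month echo is built in by construction, and the names `sentiment`, `indicator` and `lag_corr` are assumptions rather than objects from this notebook. A correlation peak at a given lag does not establish causation.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data, for illustration only (not project results).
rng = np.random.default_rng(42)
months = pd.period_range('2000-01', periods=120, freq='M')
sentiment = pd.Series(rng.normal(size=120), index=months, name='sentiment')

# Hypothetical indicator that simply echoes sentiment two months later.
indicator = sentiment.shift(2).rename('indicator')

# Correlate the indicator with sentiment shifted by k months.
# A peak at k > 0 suggests sentiment leads; a peak at k < 0 suggests it lags.
lags = range(-6, 7)
lag_corr = pd.Series({k: indicator.corr(sentiment.shift(k)) for k in lags})

best_lag = int(lag_corr.idxmax())
print(lag_corr.round(2))
print('Strongest correlation at lag:', best_lag)
```

With real data the peak is usually far less clear-cut, so the full lag profile is more informative than the single best lag.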

## 2. Load the Data

### 2.1. Import libraries

In [None]:
# Install the necessary libraries.
# !pip install nltk
# !pip install vaderSentiment
# !pip install textblob
# !pip install pandas openpyxl
# !pip install transformers torch

In [None]:
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('words')
# nltk.download('omw-1.4')

In [1]:
# General libraries
import numpy as np                             # Numerical operations and array handling.
import pandas as pd                            # Data manipulation and analysis.
import contractions                            # Expanding/contracting text contractions.
import re                                      # Regular expression operations on strings.
import os                                      # Interacting with the operating system and file handling.
import matplotlib.pyplot as plt                # Create visualisations.
from matplotlib.colors import rgb2hex          # Colour conversion in plots.
import seaborn as sns                          # Enhanced statistical data visualisations.
import math                                    # Mathematical functions and constants.
from IPython.display import display, Markdown  # Rich output in Jupyter.
from functools import reduce
from sklearn.feature_extraction.text import CountVectorizer
from collections import  Counter
import plotly.express as px
from statsmodels.tsa.seasonal import seasonal_decompose   # Seasonality

In [2]:
# Text and Sentiment Analysis 
from wordcloud import WordCloud                                       # Generating visual word frequency clouds from text.
import nltk                                                           # Natural language processing tasks.
from nltk import word_tokenize, pos_tag                               # Splitting text into words and tagging parts of speech.
from nltk.probability import FreqDist                                 # Calculating frequency distribution of tokens.
from nltk.corpus import stopwords                                     # Providing list of common words to exclude from analysis.
from nltk.corpus import words                                         # English word list.
from nltk.corpus import wordnet as wn                                 # Lexical database for retrieving word relationships & meanings.
from nltk.stem import WordNetLemmatizer, PorterStemmer                # Reducing words to base or root form.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # Assessing sentiment intensity in text.
from textblob import TextBlob                                         # API for text processing tasks including sentiment analysis.
import ast                                                            # Parsing string representations of lists.
from collections import defaultdict                                   # Dictionaries that return a default value for missing keys.
# Note: contractions, re and Counter are already imported in the general libraries cell above.
from transformers import AutoTokenizer                                # FinBERT Model
from transformers import AutoModelForSequenceClassification           # FinBERT Model
import torch                                                          # FinBERT Model
import torch.nn.functional as F                                       # FinBERT Model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
stop_words = set(stopwords.words('english'))

In [3]:
# Import warnings
import warnings
# Settings for the notebook.
warnings.filterwarnings("ignore")

In [5]:
# Set figure style for seaborn.
sns.set_theme(style='darkgrid')

### 2.2. Define functions

**2.2.a. Charts**

In [9]:
def clean_label(label):
    # If label is a Series, return its name.
    if isinstance(label, pd.Series):
        return label.name.replace('_', ' ').title() if label.name else ' '
    elif isinstance(label, str):
        return label.replace('_', ' ').title()
    return ' '
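
As a quick usage check, the sketch below repeats `clean_label` so that it runs on its own and applies it to a string, a named Series and `None`. The sample column names are taken from the indicator data loaded later in this notebook. Note that `str.title()` also title-cases acronyms, so `uk_gdp_growth` becomes `Uk Gdp Growth`.

```python
import pandas as pd

def clean_label(label):
    # If label is a Series, use its name; otherwise use the string itself.
    if isinstance(label, pd.Series):
        return label.name.replace('_', ' ').title() if label.name else ' '
    elif isinstance(label, str):
        return label.replace('_', ' ').title()
    return ' '

from_string = clean_label('uk_gdp_growth')
from_series = clean_label(pd.Series([5.71, 5.55], name='gilts_long'))
from_none = clean_label(None)

print(from_string)      # Uk Gdp Growth
print(from_series)      # Gilts Long
print(repr(from_none))  # ' '
```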

In [11]:
# Define function for scatterplot.
def generate_scatterplot(df, x_axis, y_axis, title, hue, save_path=None):

    # Set figure size & style for seaborn.
    sns.set_theme(style='darkgrid')
    sns.set(rc={'figure.figsize':(8, 6)})

    # Plot the scatterplot.
    sns.scatterplot(data=df, x=x_axis, y=y_axis, hue=hue, color='#0e1b2c')

    # Customize the plot.
    plt.title(title, fontsize=12, fontweight='bold')
    plt.xlabel(clean_label(x_axis), fontsize=10)
    plt.ylabel(clean_label(y_axis), fontsize=10)

    # Add legend ONLY if hue is not None.
    if hue is not None:
        plt.legend(title='Legend', fontsize=10, bbox_to_anchor=(1.05,1), loc='upper left')
    
    # Adjust the layout before saving so the saved file matches the displayed chart.
    plt.tight_layout()

    # Save the plot, if save_path is provided.
    if save_path:
        plt.savefig(save_path, dpi=500, bbox_inches='tight')

    # Display the chart.
    plt.show()

In [13]:
# Define function to plot a lineplot.
def generate_lineplot(df, x_axis, y_axis, title, ylim=None, save_path=None, \
                      rotate_xticks=False):
    
    # Set figure size & style for seaborn.
    sns.set_theme(style='darkgrid')
    sns.set(rc={'figure.figsize':(14, 8)})
    
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.copy()

    # Ensure time column is in datetime format.
    df[x_axis] = pd.to_datetime(df[x_axis])
    
    # Sort DataFrame by the time column.
    df = df.sort_values(by=x_axis)
    
    # Plot the lineplot.
    sns.lineplot(data=df, x=x_axis, y=y_axis, label=clean_label(y_axis))
    
    # Customize the plot.
    plt.title(title, fontsize=16, fontweight='bold')
    plt.xlabel(clean_label(x_axis), fontsize=14)
    plt.ylabel(clean_label(y_axis), fontsize=14)
    plt.legend(title='Legend', fontsize=12, bbox_to_anchor=(1.05,1), loc='upper left')
    plt.tick_params(axis='both', labelsize=12)
    
    # Rotate x-tick labels by 45 degrees, if specified.
    if rotate_xticks:
        plt.xticks(rotation=45)
    
    # Set y-axis limits, if provided.
    if ylim:
        plt.ylim(ylim)

    # Adjust the layout before saving so the saved file matches the displayed chart.
    plt.tight_layout()

    # Save the plot, if save_path is provided.
    if save_path:
        plt.savefig(save_path, dpi=500, bbox_inches='tight')
    
    # Display the chart.
    plt.show()

**2.2.b. NLP analysis**

In [15]:
# Preprocessing function
def preprocess_text(text):
    text = contractions.fix(text)        # Expand contractions, e.g. "I'm not good" becomes "I am not good"
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub('#', '', text)         # Remove the '#' symbol (the hashtag word itself is kept)
    text = re.sub(r'\W', ' ', text)      # Replace non-word characters with spaces
    text = text.lower()                  # Convert to lowercase
    # Remove NLTK's English stop words, but keep 'not' so that negations are preserved.
    stop_words = set(stopwords.words('english')) - {'not'}
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text
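
To make the effect of each cleaning step visible, the sketch below applies the same regular expressions to one sample sentence using only the standard library. It is a simplified stand-in, not the function above: the contraction expansion is omitted, and `DEMO_STOP_WORDS` is a tiny hypothetical list used in place of NLTK's English stop words.

```python
import re

# Tiny illustrative stop-word list (an assumption); 'not' is deliberately kept.
DEMO_STOP_WORDS = {'will', 'is', 'the', 'to', 'not'} - {'not'}

def preprocess_demo(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub('#', '', text)         # Remove the '#' symbol only
    text = re.sub(r'\W', ' ', text)      # Replace non-word characters with spaces
    text = text.lower()                  # Convert to lowercase
    return ' '.join(w for w in text.split() if w not in DEMO_STOP_WORDS)

sample = 'Rates will NOT rise: see http://example.com #inflation is falling'
cleaned = preprocess_demo(sample)
print(cleaned)  # rates not rise see inflation falling
```

Keeping `not` matters for sentiment scoring: dropping it would turn "will not rise" into "rise" and reverse the meaning.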

In [17]:
# Define the tag map for POS tagging.
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Lemmatise the tokens with correct POS tags.
lemma_function = WordNetLemmatizer()

# Lemmatisation function.
def lemmatize_tokens(tokens):
    # Lemmatise each token using the WordNet POS mapped from the first letter of its POS tag.
    lemmatized_tokens = [lemma_function.lemmatize(token, tag_map[tag[0]]) for token, tag in pos_tag(tokens)]
    return lemmatized_tokens
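
The tag map relies on the first letter of each Penn Treebank tag: `J` for adjectives, `V` for verbs, `R` for adverbs, with everything else falling back to noun. The sketch below shows that lookup in isolation. It avoids NLTK so it can run anywhere: the single-letter strings stand in for `wn.NOUN`, `wn.ADJ`, `wn.VERB` and `wn.ADV`, and the tagged pairs are hand-written rather than produced by `pos_tag`.

```python
from collections import defaultdict

# String stand-ins for wn.NOUN, wn.ADJ, wn.VERB and wn.ADV.
NOUN, ADJ, VERB, ADV = 'n', 'a', 'v', 'r'

# Unknown tag prefixes default to noun, as in the notebook's tag_map.
tag_map_demo = defaultdict(lambda: NOUN)
tag_map_demo['J'] = ADJ
tag_map_demo['V'] = VERB
tag_map_demo['R'] = ADV

# Hand-written (token, Penn Treebank tag) pairs standing in for pos_tag output.
tagged = [('rates', 'NNS'), ('rose', 'VBD'), ('sharply', 'RB'),
          ('higher', 'JJR'), ('in', 'IN')]

wordnet_pos = [(token, tag_map_demo[tag[0]]) for token, tag in tagged]
print(wordnet_pos)
```

The preposition `in` (tag `IN`) is mapped to noun by the default, which is harmless because the lemmatiser leaves such words unchanged.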

In [18]:
# VADER Sentiment Intensity Analyzer.
analyzer = SentimentIntensityAnalyzer()

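# Define the function to compute and return sentiment scores.
# Expects a list of tokens; a plain string is scored as-is rather than joined character by character.
def analyse_sentiment(text):
    if isinstance(text, str):
        return analyzer.polarity_scores(text)
    return analyzer.polarity_scores(' '.join(text))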

In [19]:
# Define function to label sentiments.
def get_sentiment_label(compound):
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    else:
        return 'neutral'
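
The cut-offs of ±0.05 on the compound score are the thresholds conventionally used with VADER. The sketch below repeats the function so that it runs on its own and shows how scores on either side of each boundary are labelled; the scores are made-up inputs, not outputs from the speeches.

```python
def get_sentiment_label(compound):
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Made-up compound scores chosen to sit on and around the thresholds.
scores = [0.6, 0.05, 0.049, 0.0, -0.05, -0.3]
labels = [get_sentiment_label(score) for score in scores]

for score, label in zip(scores, labels):
    print(score, label)
```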

In [23]:
# Define a function to extract a polarity score using TextBlob.
def generate_polarity(comment):
    return TextBlob(comment).sentiment.polarity

In [25]:
# Define a function to extract a subjectivity score using TextBlob.
def generate_subjectivity(comment):
    return TextBlob(comment).sentiment.subjectivity

### 2.3. Import and review the data

**2.3.a. Import Bank of England Speeches**

In [27]:
# Load the CSV file of all central bank speeches as speeches_original.
speeches_original = pd.read_csv('/Users/kaferrante/Documents/Python/_Course4_Project/all_speeches.csv')

# View the data.
speeches_original.head()

|   | reference | country | date | title | author | is_gov | text |
|---|---|---|---|---|---|---|---|
| 0 | r901128a_BOA | australia | 1990-11-28 | A Proper Role for Monetary Policy | fraser | 0 | They would no doubt argue that to have two obj... |
| 1 | r911003a_BOA | australia | 1991-10-03 |  | fraser | 0 | Today I wish to talk about real interest rates... |
| 2 | r920314a_BOA | australia | 1992-03-14 |  | fraser | 0 | I welcome this opportunity to talk about prosp... |
| 3 | r920529a_BOA | australia | 1992-05-29 |  | fraser | 0 | It is a pleasure to have this opportunity to a... |
| 4 | r920817a_BOA | australia | 1992-08-17 |  | fraser | 0 | As a long-time fan of Don Sanders, I am deligh... |


In [28]:
# Explore data set.
speeches_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7721 entries, 0 to 7720
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   reference  7721 non-null   object
 1   country    7721 non-null   object
 2   date       7721 non-null   object
 3   title      7721 non-null   object
 4   author     7721 non-null   object
 5   is_gov     7721 non-null   int64 
 6   text       7721 non-null   object
dtypes: int64(1), object(6)
memory usage: 422.4+ KB


In [31]:
# Check for missing values.
speeches_original.isnull().sum()

reference    0
country      0
date         0
title        0
author       0
is_gov       0
text         0
dtype: int64

In [33]:
# Check for duplicates.
speeches_original.duplicated().sum()

0

In [35]:
# Review basic descriptive statistics.
speeches_original.describe()

|   | is_gov |
|---|---|
| count | 7721.0 |
| mean | 0.347235 |
| std | 0.476122 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |


In [37]:
# View the countries.
speeches_original['country'].unique()

array(['australia', 'canada', 'euro area', 'japan', 'sweden',
       'switzerland', 'united kingdom', 'united states'], dtype=object)

In [39]:
# View the authors.
speeches_original['author'].unique()

array(['fraser', 'macfarlane', 'lowe', 'stevens', 'no_info', 'ac',
       'thiessen', 'bonin', 'dodge', 'jenkins', 'kennedy', 'macklem',
       'duguay', 'longworth', 'carney', 'murray', 'lane', 'wolf',
       'boivin', 'cote', 'poloz', 'schembri', 'johnson', 'wilkins',
       'chilcott', 'mendes', 'patterson', 'murchison', 'leduc', 'dinis',
       'beaudry', 'gravelle', 'kozicki', 'rogers', 'morrow', 'lamfalussy',
       'duisenberg', 'vienna', 'london', 'tokyo', 'kong', 'bank',
       'schioppa', 'hamalainen', 'main', 'noyer', 'committee', 'solans',
       'francisco', 'istanbul', 'issing', 'hoogduin', 'bankwashington',
       'efma', 'brussels', 'forum', 'workshop', 'quiros', 'papademos',
       'gugerell', 'trichet', 'network', 'delivered', 'paramo',
       'strasbourg', 'rome', 'berlin', 'smaghi', 'sevilla', 'madrid',
       'stark', 'singapore', 'summit', 'washington', 'aires',
       'bratislava', 'ecb', 'constancio', 'posen', 'praet', 'draghi',
       'coeure', 'asmussen', 'mer

**2.3.b. Import Lexicon Sentiment based on BoE Wordlist**

In [41]:
# Load the CSV file of speeches with lexicon-based sentiment scores (derived from the BoE wordlist).
speeches = pd.read_csv('/Users/kaferrante/Documents/Python/_Course4_Project/speeches_sentiment.csv')

# View the data.
speeches.head()

Unnamed: 0,reference,country,date,title,author,is_gov,text,text_cleaned,text_tokenised,text_lemmatised,...,negative,positive,uncertainty,litigious,strong,weak,constraining,word_count_sentiment,sentiment_lexicon_simple,sentiment_lexicon_weighted
0,r901128a_BOA,australia,1990-11-28,A Proper Role for Monetary Policy,fraser,0,They would no doubt argue that to have two obj...,would doubt argue two objectives like trying c...,"['would', 'doubt', 'argue', 'two', 'objectives...","['would', 'doubt', 'argue', 'two', 'objective'...",...,84,58,32,5,10,15,13,217,-0.119816,0.112442
1,r911003a_BOA,australia,1991-10-03,,fraser,0,Today I wish to talk about real interest rates...,today wish talk real interest rates mainly his...,"['today', 'wish', 'talk', 'real', 'interest', ...","['today', 'wish', 'talk', 'real', 'interest', ...",...,53,28,35,2,3,16,12,149,-0.167785,0.014094
2,r920314a_BOA,australia,1992-03-14,,fraser,0,I welcome this opportunity to talk about prosp...,welcome opportunity talk prospects banks austr...,"['welcome', 'opportunity', 'talk', 'prospects'...","['welcome', 'opportunity', 'talk', 'prospect',...",...,43,67,33,8,11,16,13,191,0.125654,0.421466
3,r920529a_BOA,australia,1992-05-29,,fraser,0,It is a pleasure to have this opportunity to a...,pleasure opportunity address influential gathe...,"['pleasure', 'opportunity', 'address', 'influe...","['pleasure', 'opportunity', 'address', 'influe...",...,62,56,43,6,7,20,8,202,-0.029703,0.227228
4,r920817a_BOA,australia,1992-08-17,,fraser,0,"As a long-time fan of Don Sanders, I am deligh...",long time fan sanders delighted participating ...,"['long', 'time', 'fan', 'sanders', 'delighted'...","['long', 'time', 'fan', 'sander', 'delight', '...",...,72,62,42,6,12,27,13,234,-0.042735,0.22735
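
The figures in the preview above appear consistent with two relationships: `word_count_sentiment` equals the sum of the seven category counts, and `sentiment_lexicon_simple` equals `(positive - negative) / word_count_sentiment`. This is inferred from the displayed values, not from the code that produced the file, so treat it as an assumption. The sketch below checks it against the first three rows, with the numbers copied from that output. The weighting behind `sentiment_lexicon_weighted` cannot be recovered from the preview and is not covered here.

```python
# Category counts and reported scores copied from rows 0-2 of the preview above.
# Order of counts: negative, positive, uncertainty, litigious, strong, weak, constraining.
rows = [
    {'counts': [84, 58, 32, 5, 10, 15, 13], 'total': 217, 'reported': -0.119816},
    {'counts': [53, 28, 35, 2, 3, 16, 12], 'total': 149, 'reported': -0.167785},
    {'counts': [43, 67, 33, 8, 11, 16, 13], 'total': 191, 'reported': 0.125654},
]

results = []
for row in rows:
    negative, positive = row['counts'][0], row['counts'][1]
    total_matches = sum(row['counts']) == row['total']
    recomputed = (positive - negative) / row['total']
    score_matches = abs(recomputed - row['reported']) < 1e-6
    results.append((total_matches, score_matches))
    print(round(recomputed, 6), row['reported'], total_matches, score_matches)
```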


In [42]:
# Explore data set.
speeches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7721 entries, 0 to 7720
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   reference                   7721 non-null   object 
 1   country                     7721 non-null   object 
 2   date                        7721 non-null   object 
 3   title                       7721 non-null   object 
 4   author                      7721 non-null   object 
 5   is_gov                      7721 non-null   int64  
 6   text                        7721 non-null   object 
 7   text_cleaned                7721 non-null   object 
 8   text_tokenised              7721 non-null   object 
 9   text_lemmatised             7721 non-null   object 
 10  text_lemmatised_str         7721 non-null   object 
 11  word_count_text             7721 non-null   int64  
 12  word_count_text_cleaned     7721 non-null   int64  
 13  negative                    7721 

In [43]:
# Check for missing values.
speeches.isnull().sum()

reference                     0
country                       0
date                          0
title                         0
author                        0
is_gov                        0
text                          0
text_cleaned                  0
text_tokenised                0
text_lemmatised               0
text_lemmatised_str           0
word_count_text               0
word_count_text_cleaned       0
negative                      0
positive                      0
uncertainty                   0
litigious                     0
strong                        0
weak                          0
constraining                  0
word_count_sentiment          0
sentiment_lexicon_simple      0
sentiment_lexicon_weighted    0
dtype: int64

In [44]:
# Check for duplicates.
speeches.duplicated().sum()

0

In [45]:
# Review basic descriptive statistics.
speeches.describe()

|   | is_gov | word_count_text | word_count_text_cleaned | negative | positive | uncertainty | litigious | strong | weak | constraining | word_count_sentiment | sentiment_lexicon_simple | sentiment_lexicon_weighted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 | 7721.0 |
| mean | 0.347235 | 3113.002072 | 1776.809222 | 68.278073 | 55.249968 | 45.351768 | 11.074343 | 5.342831 | 18.962958 | 11.482709 | 215.74265 | -0.02359 | 0.201127 |
| std | 0.476122 | 2047.79703 | 1174.241676 | 55.675494 | 37.151788 | 43.173626 | 15.827917 | 4.827564 | 18.883414 | 12.094693 | 154.677042 | 0.1791 | 0.233665 |
| min | 0.0 | 16.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.782609 | -0.833333 |
| 25% | 0.0 | 1906.0 | 1081.0 | 31.0 | 31.0 | 19.0 | 3.0 | 2.0 | 8.0 | 5.0 | 120.0 | -0.146341 | 0.03913 |
| 50% | 0.0 | 2904.0 | 1656.0 | 57.0 | 49.0 | 35.0 | 6.0 | 4.0 | 15.0 | 9.0 | 195.0 | -0.039823 | 0.178968 |
| 75% | 1.0 | 3879.0 | 2219.0 | 91.0 | 71.0 | 60.0 | 14.0 | 7.0 | 25.0 | 15.0 | 277.0 | 0.082902 | 0.341406 |
| max | 1.0 | 37522.0 | 23119.0 | 1251.0 | 1042.0 | 893.0 | 351.0 | 70.0 | 422.0 | 262.0 | 3206.0 | 1.0 | 1.5 |


In [46]:
# Check the number of unique values.
speeches.nunique()

reference                     7721
country                          8
date                          4410
title                         6218
author                         325
is_gov                           2
text                          7692
text_cleaned                  7691
text_tokenised                7691
text_lemmatised               7691
text_lemmatised_str           7691
word_count_text               4183
word_count_text_cleaned       3074
negative                       293
positive                       225
uncertainty                    245
litigious                      120
strong                          42
weak                           126
constraining                    98
word_count_sentiment           681
sentiment_lexicon_simple      5426
sentiment_lexicon_weighted    7276
dtype: int64

In [47]:
# Create a normalized version of the 'text' column
speeches['text_norm'] = speeches['text'].str.strip().str.lower()

# Find duplicate 'text_norm' entries
duplicate_mask = speeches['text_norm'].duplicated(keep=False)

# Extract all duplicates based on normalized text
duplicates = speeches[duplicate_mask]
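
The normalisation step matters because `duplicated()` on whole rows only catches exact matches, whereas speeches delivered twice can differ in case or surrounding whitespace. The sketch below shows the difference on a small made-up DataFrame; `demo` and its contents are hypothetical. Using `keep=False` flags every member of a duplicate group rather than only the later copies.

```python
import pandas as pd

# Made-up rows: the first two texts differ only in case and trailing whitespace.
demo = pd.DataFrame({
    'reference': ['a1', 'a2', 'b1', 'c1'],
    'text': ['Good afternoon. ', 'good afternoon.', 'Thank you.', 'Good morning.'],
})

# Exact comparison of the raw text finds nothing.
exact_duplicates = int(demo['text'].duplicated(keep=False).sum())

# Normalise, then flag every row whose normalised text occurs more than once.
demo['text_norm'] = demo['text'].str.strip().str.lower()
duplicate_mask = demo['text_norm'].duplicated(keep=False)
duplicate_refs = demo.loc[duplicate_mask, 'reference'].tolist()

print(exact_duplicates)  # 0
print(duplicate_refs)    # ['a1', 'a2']
```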

In [51]:
# Show the full rows of these duplicates.
duplicates

Unnamed: 0,reference,country,date,title,author,is_gov,text,text_cleaned,text_tokenised,text_lemmatised,...,positive,uncertainty,litigious,strong,weak,constraining,word_count_sentiment,sentiment_lexicon_simple,sentiment_lexicon_weighted,text_norm
564,r101026a_BOC,canada,2010-10-26,Opening Statement before the House of Commons ...,carney,1,"Governor of the Bank of Canada Good afternoon,...",governor bank canada good afternoon mr chairma...,"['governor', 'bank', 'canada', 'good', 'aftern...","['governor', 'bank', 'canada', 'good', 'aftern...",...,20,19,0,0,3,4,61,0.081967,0.3,"governor of the bank of canada good afternoon,..."
565,r101027a_BOC,canada,2010-10-27,Opening Statement before the Standing Senate C...,carney,1,"Governor of the Bank of Canada Good afternoon,...",governor bank canada good afternoon mr chairma...,"['governor', 'bank', 'canada', 'good', 'aftern...","['governor', 'bank', 'canada', 'good', 'aftern...",...,20,19,0,0,3,4,61,0.081967,0.3,"governor of the bank of canada good afternoon,..."
610,r120424a_BOC,canada,2012-04-24,Opening Statement before the House of Commons ...,carney,1,Governor of the Bank of Canada Good afternoon....,governor bank canada good afternoon tiff pleas...,"['governor', 'bank', 'canada', 'good', 'aftern...","['governor', 'bank', 'canada', 'good', 'aftern...",...,21,20,0,0,8,0,60,0.166667,0.475,governor of the bank of canada good afternoon....
611,r120425a_BOC,canada,2012-04-25,Opening Statement before the Senate Standing C...,carney,1,Governor of the Bank of Canada Good afternoon....,governor bank canada good afternoon tiff pleas...,"['governor', 'bank', 'canada', 'good', 'aftern...","['governor', 'bank', 'canada', 'good', 'aftern...",...,21,20,0,0,8,0,60,0.166667,0.475,governor of the bank of canada good afternoon....
624,r121030a_BOC,canada,2012-10-30,Opening Statement before the House of Commons ...,carney,1,Governor of the Bank of Canada Good afternoon....,governor bank canada good afternoon tiff pleas...,"['governor', 'bank', 'canada', 'good', 'aftern...","['governor', 'bank', 'canada', 'good', 'aftern...",...,23,16,1,0,6,2,68,0.044118,0.286765,governor of the bank of canada good afternoon....
625,r121031a_BOC,canada,2012-10-31,Opening Statement before the Standing Senate C...,carney,1,Governor of the Bank of Canada Good afternoon....,governor bank canada good afternoon tiff pleas...,"['governor', 'bank', 'canada', 'good', 'aftern...","['governor', 'bank', 'canada', 'good', 'aftern...",...,23,16,1,0,6,2,68,0.044118,0.286765,governor of the bank of canada good afternoon....
668,r140429a_BOC,canada,2014-04-29,Opening Statement before the House of Commons ...,poloz,1,Governor of the Bank of Canada Thank you for t...,governor bank canada thank opportunity tiff to...,"['governor', 'bank', 'canada', 'thank', 'oppor...","['governor', 'bank', 'canada', 'thank', 'oppor...",...,31,21,0,1,6,3,83,0.120482,0.393976,governor of the bank of canada thank you for t...
669,r140430a_BOC,canada,2014-04-30,Opening Statement before the Senate Standing C...,poloz,1,Governor of the Bank of Canada Thank you for t...,governor bank canada thank opportunity tiff to...,"['governor', 'bank', 'canada', 'thank', 'oppor...","['governor', 'bank', 'canada', 'thank', 'oppor...",...,31,21,0,1,6,3,83,0.120482,0.393976,governor of the bank of canada thank you for t...
1182,r020121a_ECB,euro area,2002-01-21,Securities and banking: bridges and walls,no_info,0,I once again find myself speaking at the Londo...,find speaking london school economics mileston...,"['find', 'speaking', 'london', 'school', 'econ...","['find', 'speaking', 'london', 'school', 'econ...",...,108,157,70,8,46,25,557,-0.062837,0.105745,i once again find myself speaking at the londo...
1186,r020221a_ECB,euro area,2002-02-21,Securities and banking: bridges and walls,schioppa,0,I once again find myself speaking at the Londo...,find speaking london school economics mileston...,"['find', 'speaking', 'london', 'school', 'econ...","['find', 'speaking', 'london', 'school', 'econ...",...,108,157,70,8,46,25,557,-0.062837,0.105745,i once again find myself speaking at the londo...


In [None]:
# Export to csv
# duplicates.to_csv('/Users/kaferrante/Documents/Python/_Course4_Project/duplicates_full_rows.csv', index=False)

**2.3.c. Import BoE Wordlist**

In [57]:
# Load the Excel file of BoE sentiment labelled wordlist.
sentiment_lexicon = pd.read_excel('/Users/kaferrante/Documents/Python/_Course4_Project/sentiment_labelled_wordlist.xlsx')

# View the data.
sentiment_lexicon.head()

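|   | Word | Negative | Positive | Uncertainty | Litigious | Strong | Weak | Constraining |
|---|---|---|---|---|---|---|---|---|
| 0 | ABANDON | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ABANDONED | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ABANDONING | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ABANDONMENT | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ABANDONMENTS | 1 | 0 | 0 | 0 | 0 | 0 | 0 |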
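
Because the wordlist stores words in upper case with one 0/1 flag per category, counting category hits in a speech comes down to upper-casing each token and checking set membership. The sketch below illustrates this for the `Negative` category using the first three words shown above; the token list is made up, and this is one plausible way of producing per-speech counts, not necessarily how the counts in `speeches_sentiment.csv` were generated.

```python
import pandas as pd

# First three lexicon rows as shown in the output above (Negative and Positive only).
lexicon_demo = pd.DataFrame({
    'Word': ['ABANDON', 'ABANDONED', 'ABANDONING'],
    'Negative': [1, 1, 1],
    'Positive': [0, 0, 0],
})

# Made-up token list standing in for one cleaned, tokenised speech.
tokens = ['bank', 'abandoned', 'plan', 'abandon', 'target', 'abandoned']

category_counts = {}
for category in ['Negative', 'Positive']:
    # Words flagged for this category, as a set for fast lookup.
    flagged = set(lexicon_demo.loc[lexicon_demo[category] == 1, 'Word'])
    # Upper-case each token so it matches the lexicon's casing.
    category_counts[category.lower()] = sum(token.upper() in flagged for token in tokens)

print(category_counts)  # {'negative': 3, 'positive': 0}
```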


In [59]:
# Explore data set.
sentiment_lexicon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3880 entries, 0 to 3879
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Word          3880 non-null   object
 1   Negative      3880 non-null   int64 
 2   Positive      3880 non-null   int64 
 3   Uncertainty   3880 non-null   int64 
 4   Litigious     3880 non-null   int64 
 5   Strong        3880 non-null   int64 
 6   Weak          3880 non-null   int64 
 7   Constraining  3880 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 242.6+ KB


In [61]:
# Check for missing values.
sentiment_lexicon.isnull().sum()

Word            0
Negative        0
Positive        0
Uncertainty     0
Litigious       0
Strong          0
Weak            0
Constraining    0
dtype: int64

In [63]:
# Review basic descriptive statistics.
sentiment_lexicon.describe()

|   | Negative | Positive | Uncertainty | Litigious | Strong | Weak | Constraining |
|---|---|---|---|---|---|---|---|
| count | 3880.0 | 3880.0 | 3880.0 | 3880.0 | 3880.0 | 3880.0 | 3880.0 |
| mean | 0.606959 | 0.092268 | 0.076546 | 0.233247 | 0.004897 | 0.006959 | 0.047423 |
| std | 0.488489 | 0.289441 | 0.265905 | 0.422953 | 0.069815 | 0.083139 | 0.212569 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


**2.3.d. Import UK Economic Indicators (1998-2025)**

In [65]:
# Load the Excel file for UK Economic Indicators
uk_economic_indicators = pd.read_excel('/Users/kaferrante/Documents/Python/_Course4_Project/Consolidated_Eco_KPI _V3.xlsx')

# View the data.
uk_economic_indicators.head()

|   | year | month | year_month | uk_inflation_rate_CPIH | uk_unemployment_rate | uk_gdp_growth | uk_interest_rate | uk_consumer_confidence | gbp_usd_fx | ftse_250 | gilts_short | gilts_medium | gilts_long | uk_credit_growth_no_cc | uk_credit_growth_only_cc | avg_price_all_property_types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1998 | 4 | 1998-04 | 1.815 | 6.3 | 0.6 | 7.25 | 1.1 | 1.67327 | 5554.720972 | 5.91 | 5.7 | 5.71 | 14.1 | 24.7 | 64258 |
| 1 | 1998 | 5 | 1998-05 | 2.039 | 6.3 | 0.6 | 7.25 | 1.2 | 1.636589 | 5799.256322 | 5.82 | 5.57 | 5.55 | 14.4 | 24.5 | 64258 |
| 2 | 1998 | 6 | 1998-06 | 1.675 | 6.3 | 0.6 | 7.5 | -1.3 | 1.650718 | 5739.277233 | 6.17 | 5.64 | 5.43 | 13.9 | 25.5 | 64258 |
| 3 | 1998 | 7 | 1998-07 | 1.443 | 6.3 | 0.3 | 7.5 | -4.3 | 1.643657 | 5595.919582 | 6.06 | 5.57 | 5.38 | 14.6 | 25.6 | 67057 |
| 4 | 1998 | 8 | 1998-08 | 1.327 | 6.2 | 0.3 | 7.5 | -6.5 | 1.63195 | 5173.355054 | 5.52 | 5.19 | 5.11 | 14.6 | 26.1 | 67057 |


In [67]:
# Explore data set.
uk_economic_indicators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321 entries, 0 to 320
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   year                          321 non-null    int64  
 1   month                         321 non-null    int64  
 2   year_month                    321 non-null    object 
 3   uk_inflation_rate_CPIH        321 non-null    float64
 4   uk_unemployment_rate          321 non-null    float64
 5   uk_gdp_growth                 321 non-null    float64
 6   uk_interest_rate              321 non-null    float64
 7   uk_consumer_confidence        321 non-null    float64
 8   gbp_usd_fx                    321 non-null    float64
 9   ftse_250                      321 non-null    float64
 10  gilts_short                   321 non-null    float64
 11  gilts_medium                  321 non-null    float64
 12  gilts_long                    321 non-null    float64
 13  uk_cr

In [69]:
# Check for missing values.
uk_economic_indicators.isnull().sum()

year                            0
month                           0
year_month                      0
uk_inflation_rate_CPIH          0
uk_unemployment_rate            0
uk_gdp_growth                   0
uk_interest_rate                0
uk_consumer_confidence          0
gbp_usd_fx                      0
ftse_250                        0
gilts_short                     0
gilts_medium                    0
gilts_long                      0
uk_credit_growth_no_cc          0
uk_credit_growth_only_cc        0
avg_price_all_property_types    0
dtype: int64

In [71]:
# View column types.
uk_economic_indicators.dtypes

year                              int64
month                             int64
year_month                       object
uk_inflation_rate_CPIH          float64
uk_unemployment_rate            float64
uk_gdp_growth                   float64
uk_interest_rate                float64
uk_consumer_confidence          float64
gbp_usd_fx                      float64
ftse_250                        float64
gilts_short                     float64
gilts_medium                    float64
gilts_long                      float64
uk_credit_growth_no_cc          float64
uk_credit_growth_only_cc        float64
avg_price_all_property_types      int64
dtype: object

### 2.4. Date Transformation

**Speeches**

In [73]:
# Determine the date format for speeches.
speeches.date.head()

0    1990-11-28
1    1991-10-03
2    1992-03-14
3    1992-05-29
4    1992-08-17
Name: date, dtype: object

In [75]:
# Convert the date from 'object' to 'datetime64' and store it in a new column.
speeches['date_format'] = speeches['date'].astype('datetime64[ns]')

In [77]:
# Add a new column for year and month
speeches['year_month'] = pd.to_datetime(speeches['date_format']).dt.to_period('M')

In [79]:
# Add a new column for year only
speeches['year'] = pd.to_datetime(speeches.date).dt.year

In [81]:
# Add a column for year_month in date format
speeches['year_month_dt'] = speeches['year_month'].dt.to_timestamp()
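
The four derived columns can be seen side by side on a two-row example. The sketch below uses the first two dates from the output above and mirrors the transformations in the preceding cells; the variable names are local to the sketch. A monthly `Period` is convenient for grouping and joining, while the timestamp version (first day of the month) is what plotting functions expect.

```python
import pandas as pd

# First two speech dates, as strings, copied from the output above.
dates = pd.Series(['1990-11-28', '1991-10-03'])

date_format = pd.to_datetime(dates)                # datetime64[ns]
year_month = date_format.dt.to_period('M')         # monthly Period, e.g. 1990-11
year = date_format.dt.year                         # integer year
year_month_dt = year_month.dt.to_timestamp()       # first day of the month as a timestamp

year_month_str = year_month.astype(str).tolist()
year_list = year.tolist()
year_month_dt_str = year_month_dt.dt.strftime('%Y-%m-%d').tolist()

print(year_month_str)     # ['1990-11', '1991-10']
print(year_list)          # [1990, 1991]
print(year_month_dt_str)  # ['1990-11-01', '1991-10-01']
```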

In [83]:
# View the DataFrame.
speeches.head()

Unnamed: 0,reference,country,date,title,author,is_gov,text,text_cleaned,text_tokenised,text_lemmatised,...,weak,constraining,word_count_sentiment,sentiment_lexicon_simple,sentiment_lexicon_weighted,text_norm,date_format,year_month,year,year_month_dt
0,r901128a_BOA,australia,1990-11-28,A Proper Role for Monetary Policy,fraser,0,They would no doubt argue that to have two obj...,would doubt argue two objectives like trying c...,"['would', 'doubt', 'argue', 'two', 'objectives...","['would', 'doubt', 'argue', 'two', 'objective'...",...,15,13,217,-0.119816,0.112442,they would no doubt argue that to have two obj...,1990-11-28,1990-11,1990,1990-11-01
1,r911003a_BOA,australia,1991-10-03,,fraser,0,Today I wish to talk about real interest rates...,today wish talk real interest rates mainly his...,"['today', 'wish', 'talk', 'real', 'interest', ...","['today', 'wish', 'talk', 'real', 'interest', ...",...,16,12,149,-0.167785,0.014094,today i wish to talk about real interest rates...,1991-10-03,1991-10,1991,1991-10-01
2,r920314a_BOA,australia,1992-03-14,,fraser,0,I welcome this opportunity to talk about prosp...,welcome opportunity talk prospects banks austr...,"['welcome', 'opportunity', 'talk', 'prospects'...","['welcome', 'opportunity', 'talk', 'prospect',...",...,16,13,191,0.125654,0.421466,i welcome this opportunity to talk about prosp...,1992-03-14,1992-03,1992,1992-03-01
3,r920529a_BOA,australia,1992-05-29,,fraser,0,It is a pleasure to have this opportunity to a...,pleasure opportunity address influential gathe...,"['pleasure', 'opportunity', 'address', 'influe...","['pleasure', 'opportunity', 'address', 'influe...",...,20,8,202,-0.029703,0.227228,it is a pleasure to have this opportunity to a...,1992-05-29,1992-05,1992,1992-05-01
4,r920817a_BOA,australia,1992-08-17,,fraser,0,"As a long-time fan of Don Sanders, I am deligh...",long time fan sanders delighted participating ...,"['long', 'time', 'fan', 'sanders', 'delighted'...","['long', 'time', 'fan', 'sander', 'delight', '...",...,27,13,234,-0.042735,0.22735,"as a long-time fan of don sanders, i am deligh...",1992-08-17,1992-08,1992,1992-08-01


In [85]:
# View column types.
speeches.dtypes

reference                             object
country                               object
date                                  object
title                                 object
author                                object
is_gov                                 int64
text                                  object
text_cleaned                          object
text_tokenised                        object
text_lemmatised                       object
text_lemmatised_str                   object
word_count_text                        int64
word_count_text_cleaned                int64
negative                               int64
positive                               int64
uncertainty                            int64
litigious                              int64
strong                                 int64
weak                                   int64
constraining                           int64
word_count_sentiment                   int64
sentiment_lexicon_simple             float64
sentiment_

**Indicators**

In [87]:
# Add a new column for year and month.
uk_economic_indicators['year_month'] = pd.to_datetime(uk_economic_indicators['year_month']).dt.to_period('M')

In [89]:
# View the DataFrame.
uk_economic_indicators.head()

Unnamed: 0,year,month,year_month,uk_inflation_rate_CPIH,uk_unemployment_rate,uk_gdp_growth,uk_interest_rate,uk_consumer_confidence,gbp_usd_fx,ftse_250,gilts_short,gilts_medium,gilts_long,uk_credit_growth_no_cc,uk_credit_growth_only_cc,avg_price_all_property_types
0,1998,4,1998-04,1.815,6.3,0.6,7.25,1.1,1.67327,5554.720972,5.91,5.7,5.71,14.1,24.7,64258
1,1998,5,1998-05,2.039,6.3,0.6,7.25,1.2,1.636589,5799.256322,5.82,5.57,5.55,14.4,24.5,64258
2,1998,6,1998-06,1.675,6.3,0.6,7.5,-1.3,1.650718,5739.277233,6.17,5.64,5.43,13.9,25.5,64258
3,1998,7,1998-07,1.443,6.3,0.3,7.5,-4.3,1.643657,5595.919582,6.06,5.57,5.38,14.6,25.6,67057
4,1998,8,1998-08,1.327,6.2,0.3,7.5,-6.5,1.63195,5173.355054,5.52,5.19,5.11,14.6,26.1,67057


In [91]:
# View column types.
uk_economic_indicators.dtypes

year                                int64
month                               int64
year_month                      period[M]
uk_inflation_rate_CPIH            float64
uk_unemployment_rate              float64
uk_gdp_growth                     float64
uk_interest_rate                  float64
uk_consumer_confidence            float64
gbp_usd_fx                        float64
ftse_250                          float64
gilts_short                       float64
gilts_medium                      float64
gilts_long                        float64
uk_credit_growth_no_cc            float64
uk_credit_growth_only_cc          float64
avg_price_all_property_types        int64
dtype: object

### 2.5. Data Correction

In [93]:
# Speeches given by Edward George are wrongly not flagged as is_gov.
def correct_is_gov_column(speeches_df: pd.DataFrame) -> pd.DataFrame:
    # Make sure date is datetime first
    speeches_df['date'] = pd.to_datetime(speeches_df['date'], errors='coerce')
    
    # Apply correction
    condition = (
        (speeches_df['author'].str.lower() == 'george') &
        (speeches_df['date'].dt.year > 1993) &
        (speeches_df['date'].dt.year < 2004)
    )
    speeches_df.loc[condition, 'is_gov'] = 1  # 1 means Governor
    
    return speeches_df

# Correct the is_gov column
speeches = correct_is_gov_column(speeches)

# View the DataFrame
display(speeches[speeches['author'].str.lower() == 'george'].head())

Unnamed: 0,reference,country,date,title,author,is_gov,text,text_cleaned,text_tokenised,text_lemmatised,...,weak,constraining,word_count_sentiment,sentiment_lexicon_simple,sentiment_lexicon_weighted,text_norm,date_format,year_month,year,year_month_dt
4961,r980915a_BOE,united kingdom,1998-09-15,Speech,george,1,"Thank you, Chairman. I'm actually very pleased...",thank chairman actually pleased opportunity re...,"['thank', 'chairman', 'actually', 'pleased', '...","['thank', 'chairman', 'actually', 'pleased', '...",...,16,2,160,-0.14375,0.179375,"thank you, chairman. i'm actually very pleased...",1998-09-15,1998-09,1998,1998-09-01
4962,r981021b_BOE,united kingdom,1998-10-21,Britain in Europe,george,1,It's a great pleasure to be here in the beauti...,great pleasure beautiful city bruges honoured ...,"['great', 'pleasure', 'beautiful', 'city', 'br...","['great', 'pleasure', 'beautiful', 'city', 'br...",...,28,17,280,0.028571,0.291071,it's a great pleasure to be here in the beauti...,1998-10-21,1998-10,1998,1998-10-01
4966,r981119a_BOE,united kingdom,1998-11-19,Speech,george,1,Let me put some of the recent newspaper headli...,let put recent newspaper headlines alongside f...,"['let', 'put', 'recent', 'newspaper', 'headlin...","['let', 'put', 'recent', 'newspaper', 'headlin...",...,14,3,162,-0.185185,0.069136,let me put some of the recent newspaper headli...,1998-11-19,1998-11,1998,1998-11-01
4969,r990112a_BOE,united kingdom,1999-01-12,Speech,george,1,I am only too well aware of the pressure curre...,well aware pressure currently facing large par...,"['well', 'aware', 'pressure', 'currently', 'fa...","['well', 'aware', 'pressure', 'currently', 'fa...",...,20,3,195,-0.194872,0.022051,i am only too well aware of the pressure curre...,1999-01-12,1999-01,1999,1999-01-01
4970,r990118a_BOE,united kingdom,1999-01-18,Speech,george,1,It would be a masterly understatement to descr...,would masterly understatement describe past tw...,"['would', 'masterly', 'understatement', 'descr...","['would', 'masterly', 'understatement', 'descr...",...,11,9,168,-0.059524,0.149405,it would be a masterly understatement to descr...,1999-01-18,1999-01,1999,1999-01-01


## 3. Exploratory Sentiment Analysis & Natural Language Processing (NLP)

### 3.1. Prepare the data

**3.1.a. Filter for UK only**

In [None]:
# Bank of England (UK) speeches only
boe_speeches = speeches[speeches['country'].str.lower() == 'united kingdom'].copy()

# View the Dataframe
boe_speeches.head()

In [None]:
# View column types.
boe_speeches.dtypes

**3.1.b. Transformation to lowercase and removal of punctuation**
- Remove elements such as hashtags and urls
- Remove any special characters and punctuation
- Convert text to lower case
- Remove stopwords
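
The `preprocess_text` helper applied below is defined outside this section. As a rough illustration of the four steps listed above, a minimal sketch might look like the following. The name `preprocess_text_sketch` and the small `STOPWORDS_SKETCH` set are assumptions made for this example only; the actual helper presumably relies on the full NLTK stopword list, and its treatment of digits may differ.

```python
import re

# Illustrative stopword subset (assumption); the notebook likely uses
# nltk.corpus.stopwords.words('english') instead.
STOPWORDS_SKETCH = {"the", "a", "an", "is", "to", "of", "and", "in", "i", "this", "it"}

def preprocess_text_sketch(text: str) -> str:
    """Hypothetical stand-in for preprocess_text, following the steps above."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"#\w+", " ", text)                    # remove hashtags
    text = text.lower()                                  # convert to lower case
    # Drop punctuation and special characters (digits too, as an assumption)
    text = re.sub(r"[^a-z\s]", " ", text)
    words = [w for w in text.split() if w not in STOPWORDS_SKETCH]  # remove stopwords
    return " ".join(words)

print(preprocess_text_sketch("I welcome this opportunity: see https://example.org #BoE!"))
# welcome opportunity see
```

Lower-casing before the stopword filter matters here, because the stopword list is lower case.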

In [None]:
# Apply the cleaning function
boe_speeches['text_cleaned'] = boe_speeches['text'].apply(preprocess_text)

# Review the result.
boe_speeches.head()

**3.1.c. Tokenisation of the data**<br>
Split the cleaned text into individual words, so that text can be analysed at word level.

In [None]:
# Tokenise the cleaned text
boe_speeches['text_tokenised'] = boe_speeches['text_cleaned'].apply(word_tokenize)

# Review the result.
boe_speeches.head()

**3.1.d. Lemmatisation of the data**<br>
Reduce each word to its base or dictionary form (the lemma).

In [None]:
# Lemmatise the tokenised text
boe_speeches['text_lemmatised'] = boe_speeches['text_tokenised'].apply(lemmatize_tokens)

# Review the result.
boe_speeches.head()

In [None]:
# Convert list of words into a string
boe_speeches['text_lemmatised_str'] = boe_speeches['text_lemmatised'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

# View the DataFrame.
boe_speeches.head()

**3.1.e. Include wordcount**

In [None]:
# Count the words in the lemmatised text of each speech
boe_speeches['word_count'] = boe_speeches['text_lemmatised_str'].str.split().apply(len)

# View the DataFrame.
boe_speeches.head()

### 3.2. View data in word clouds

In [None]:
def show_wordcloud(counter):
    # Generate and display the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(counter)
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.tight_layout()
    plt.show()

def plot_top_non_stopwords_wordcloud(text):
    stop = set(stopwords.words('english'))
    
    # Flatten all speeches into one list of words, skipping stopwords
    corpus = [word for words in text.str.split() for word in words if word not in stop]

    counter = Counter(corpus)
    # Show the word cloud
    show_wordcloud(counter)

In [None]:
# Create wordcloud of lemmatised text
plot_top_non_stopwords_wordcloud(boe_speeches['text_lemmatised_str'])

In [None]:
# Define bar chart for top words
def top_words_barchart(text):
    stop=set(stopwords.words('english'))
    
    new= text.str.split()
    new=new.values.tolist()
    corpus=[word for i in new for word in i]

    counter=Counter(corpus)
    most=counter.most_common()
    x, y=[], []
    for word,count in most[:40]:
        if (word not in stop):
            x.append(word)
            y.append(count)
            
    # Set plot size
    plt.figure(figsize=(12, 8))
    
    # Plot
    sns.barplot(x=y, y=x)
    
    # Set label font sizes
    plt.xlabel('Count', fontsize=14)
    plt.ylabel('Words', fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    
    plt.title('Top Non-Stopword Words', fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
# Create bar chart to show top words
top_words_barchart(boe_speeches['text_lemmatised_str'])

In [None]:
# Define bar chart for top word groups
def top_word_group_barchart(text, n=2):
    def _get_top_ngram(corpus, n=None):
        # Count every n-gram of length n across all speeches
        vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
        bag_of_words = vec.transform(corpus)
        sum_words = bag_of_words.sum(axis=0)
        words_freq = [(word, sum_words[0, idx])
                      for word, idx in vec.vocabulary_.items()]
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
        return words_freq[:10]

    # Plot the ten most frequent n-grams
    top_ngrams = _get_top_ngram(text, n)
    x, y = map(list, zip(*top_ngrams))
    sns.barplot(x=y, y=x)

In [None]:
# Plot top phrases with 2 words
top_word_group_barchart(boe_speeches['text_lemmatised_str'],2)

In [None]:
# Plot top phrases with 3 words
top_word_group_barchart(boe_speeches['text_lemmatised_str'],3)

In [None]:
# Plot top phrases with 4 words
top_word_group_barchart(boe_speeches['text_lemmatised_str'],4)

In [None]:
# Convert tokens into a single string.
boe_speeches_text = ' '.join(boe_speeches['text_lemmatised_str'])

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white', random_state=42).generate(boe_speeches_text)

# Display the word cloud
plt.figure(figsize=(8, 4))
plt.imshow(wordcloud, interpolation='bilinear')

# Hide the axis.
plt.axis('off') 

# Display the word cloud.
plt.tight_layout()
plt.show()

### Sentiment Analysis using VADER Sentiment Intensity Analyzer 

In [None]:
# Apply VADER sentiment analysis to the lemmatised text.
boe_speeches['sentiment_score_vader'] = boe_speeches['text_lemmatised'].apply(analyse_sentiment)

# View the DataFrame.
boe_speeches.head()

In [None]:
# Extract individual sentiment scores for speeches.
boe_speeches['text_neg'] = boe_speeches['sentiment_score_vader'].apply(lambda x: x['neg'])
boe_speeches['text_neu'] = boe_speeches['sentiment_score_vader'].apply(lambda x: x['neu'])
boe_speeches['text_pos'] = boe_speeches['sentiment_score_vader'].apply(lambda x: x['pos'])
boe_speeches['text_compound'] = boe_speeches['sentiment_score_vader'].apply(lambda x: x['compound'])

In [None]:
# View the DataFrame.
boe_speeches.head()

In [None]:
# Categorise VADER sentiment according to compound_score
def vader_sentiment(compound_score):
    if compound_score >= 0.05:
        return 'Positive'
    elif compound_score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the sentiment labels to speeches
boe_speeches['vader_sentiment_score'] = boe_speeches['text_compound'].apply(vader_sentiment)

# View the DataFrame.
boe_speeches.head()

In [None]:
# Plot a histogram of the vader sentiment score for summary
# Set the number of bins.
num_bins = 15
# Set the plot area.
plt.figure(figsize=(8,5))

# Define the bars.
n, bins, patches = plt.hist(boe_speeches['text_compound'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('VADER Compound Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of VADER Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

In [None]:
# Standardise the sentiment score
# Calculate mean and standard deviation
mean_score_vader = boe_speeches['text_compound'].mean()
std_score_vader = boe_speeches['text_compound'].std()

# Create a new column for standardized scores
boe_speeches['sentiment_score_vader_std'] = (boe_speeches['text_compound'] - mean_score_vader) / std_score_vader

# View the DataFrame.
boe_speeches.head()
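
The standardisation above is a z-score: each score has the column mean subtracted and is then divided by the column standard deviation. The same three lines recur for every sentiment measure in this section, so a small helper could express the step once. The function name `standardise` and the toy values are assumptions for illustration only.

```python
import pandas as pd

def standardise(scores: pd.Series) -> pd.Series:
    """Return z-scores: (score - mean) / sample standard deviation."""
    # Series.std() uses ddof=1 (sample standard deviation) by default,
    # which matches the calculation in the cell above.
    return (scores - scores.mean()) / scores.std()

# Toy values standing in for boe_speeches['text_compound'] (illustrative only).
example_scores = pd.Series([0.2, 0.4, 0.6, 0.8])
example_z = standardise(example_scores)
print(example_z.round(3).tolist())
```

After standardisation the column has mean 0 and standard deviation 1, which is what makes thresholds such as ±1 comparable across the different sentiment measures.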

In [None]:
# Plot a histogram of the vader sentiment score for summary
# Set the number of bins.
num_bins = 15
# Set the plot area.
plt.figure(figsize=(8,5))

# Define the bars.
n, bins, patches = plt.hist(boe_speeches['sentiment_score_vader_std'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('Standardised VADER Compound Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of Standardised VADER Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

### 3.2. Sentiment Analysis with BoE Sentiment Wordlist for BoE speeches

In [None]:
# Prepare the lexicon
sentiment_lexicon = sentiment_lexicon.copy()

# Define categories
categories = [
     'Negative',
     'Positive',
     'Uncertainty',
     'Litigious',
     'Strong',
     'Weak',
     'Constraining',
 ]

# Create dictionary of categories, containing words that belong to that category based on your sentiment lexicon.
word_sets = {
    cat: set(sentiment_lexicon.loc[sentiment_lexicon[cat] == 1, 'Word'].str.lower())
    for cat in categories
}

In [None]:
# Define function to apply the lexicon to the text
def lexicon_counts(tokens):
    return pd.Series({
        cat: sum(t in word_sets[cat] for t in tokens)
        for cat in categories
    })

# Compute counts and add new columns for each category
boe_speeches = pd.concat(
    [boe_speeches, boe_speeches['text_lemmatised'].apply(lexicon_counts)], axis=1
 )

boe_speeches.head()

In [None]:
# Initialise a dictionary to store word counts per category
word_counts_in_category = {cat: {} for cat in categories}

# Loop through each tokenised text
for tokens in boe_speeches['text_lemmatised']:
    tokens_lower = [t.lower() for t in tokens]
    for cat in categories:
        category_words = word_sets[cat]
        for t in tokens_lower:
            if t in category_words:
                # Count occurrences
                word_counts_in_category[cat][t] = word_counts_in_category[cat].get(t, 0) + 1

# Create a new DataFrame
records = []

for cat in categories:
    for word, count in word_counts_in_category[cat].items():
        records.append({'Word': word, 'Category': cat, 'Count': count})

words_df = pd.DataFrame(records)

# Sort alphabetically or by count
words_df = words_df.sort_values(['Category', 'Word'])

# Display the DataFrame
words_df

In [None]:
# Export the wordlist to Excel
words_df.to_excel('found_words_counts.xlsx', index=False)

print("DataFrame was exported successfully.")

In [None]:
# Filter the data for governor speeches only
boe_speeches_gov = boe_speeches[boe_speeches['is_gov'] == 1]

# View the DataFrame
boe_speeches_gov.head()

In [None]:
# Initialise a dictionary to store word counts per category for governor speeches only
word_counts_in_category = {cat: {} for cat in categories}

# Loop through each tokenized text
for tokens in boe_speeches_gov['text_lemmatised']:
    tokens_lower = [t.lower() for t in tokens]
    for cat in categories:
        category_words = word_sets[cat]
        for t in tokens_lower:
            if t in category_words:
                # Count occurrences
                word_counts_in_category[cat][t] = word_counts_in_category[cat].get(t, 0) + 1

# Create a DataFrame from this data
records_gov = []

for cat in categories:
    for word, count in word_counts_in_category[cat].items():
        records_gov.append({'Word': word, 'Category': cat, 'Count': count})

words_df_gov = pd.DataFrame(records_gov)

# Optional: sort alphabetically or by count
words_df_gov = words_df_gov.sort_values(['Category', 'Word'])

# Display the DataFrame
words_df_gov
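
The counting loop is written out twice, once for all speeches and once for governor speeches. A reusable function would avoid copying it and reduce the risk of mixing up the intermediate lists. The sketch below is one possible shape, not the notebook's own code: the name `count_category_words` and the toy inputs are assumptions for illustration.

```python
from collections import Counter
import pandas as pd

def count_category_words(token_lists, word_sets):
    """Count how often each lexicon word occurs, per category."""
    counters = {cat: Counter() for cat in word_sets}
    for tokens in token_lists:
        for token in tokens:
            token = token.lower()
            for cat, words in word_sets.items():
                if token in words:
                    counters[cat][token] += 1
    records = [
        {'Word': word, 'Category': cat, 'Count': count}
        for cat, counter in counters.items()
        for word, count in counter.items()
    ]
    result = pd.DataFrame(records, columns=['Word', 'Category', 'Count'])
    return result.sort_values(['Category', 'Word']).reset_index(drop=True)

# Toy inputs (illustrative only, not taken from the BoE wordlist).
toy_word_sets = {'Negative': {'risk', 'loss'}, 'Positive': {'growth'}}
toy_tokens = [['growth', 'risk', 'risk'], ['loss', 'growth', 'policy']]
toy_counts = count_category_words(toy_tokens, toy_word_sets)
print(toy_counts)
```

With such a helper, the two tables could be produced by calling it once with `boe_speeches['text_lemmatised']` and once with `boe_speeches_gov['text_lemmatised']`, passing `word_sets` in both cases.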

In [None]:
# Export to Excel
words_df_gov.to_excel('found_words_counts_gov.xlsx', index=False)

print("DataFrame was exported successfully.")

**Observations**: The percentage shares of negative words (32%) and positive words (25%) do not change between governor and non-governor speeches.

In [None]:
# Calculate the number of words found in each category in all the speeches
category_sums = boe_speeches[categories].sum()

# Sort the sums in descending order
category_sums_sorted = category_sums.sort_values(ascending=False)

# View the results
category_sums_sorted

**3.2.a. BoE Sentiment Score based on Positive & Negative Scores**

In [None]:
# Calculate the sentiment score: subtract the negative count from the
# positive count and divide by the total number of words.
boe_speeches['sentiment_score_lexicon'] = (boe_speeches['Positive'] - boe_speeches['Negative'])/ boe_speeches['word_count']

# View the DataFrame.
boe_speeches.head()
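
As a compact restatement of the calculation above (the symbols are introduced here for illustration and do not appear in the notebook): for speech $i$ with $P_i$ words from the Positive list, $N_i$ words from the Negative list and $W_i$ words in the lemmatised text,

$$
S_i = \frac{P_i - N_i}{W_i}
$$

A hypothetical speech with 12 positive words, 20 negative words and 400 words in total would score $(12 - 20) / 400 = -0.02$. Dividing by $W_i$ keeps long and short speeches comparable, at the cost of small absolute values when few lexicon words occur.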

In [None]:
# Plot a histogram of the lexicon sentiment score for summary
# Set the number of bins.
num_bins = 15
# Set the plot area.
plt.figure(figsize=(8,5))

# Define the bars.
n, bins, patches = plt.hist(boe_speeches['sentiment_score_lexicon'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('BoE Wordlist Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of BoE Wordlist Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

In [None]:
# View distribution of sentiment scores
boe_speeches['sentiment_score_lexicon'].describe()

In [None]:
# Standardise the sentiment score
# Calculate mean and standard deviation
mean_score = boe_speeches['sentiment_score_lexicon'].mean()
std_score = boe_speeches['sentiment_score_lexicon'].std()

# Create a new column for standardized scores
boe_speeches['sentiment_score_lexicon_std'] = (boe_speeches['sentiment_score_lexicon'] - mean_score) / std_score

# View the DataFrame.
boe_speeches.head()

In [None]:
# Plot a histogram of the standardised lexicon sentiment score for summary
# Set the number of bins.
num_bins = 15
# Set the plot area.
plt.figure(figsize=(8,5))

# Define the bars.
n, bins, patches = plt.hist(boe_speeches['sentiment_score_lexicon_std'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('BoE Wordlist Standardised Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of BoE Wordlist Standardised Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

In [None]:
# View distribution of sentiment scores
boe_speeches['sentiment_score_lexicon_std'].describe()

**3.2.b. BoE Sentiment Score based on all Categories**

In [None]:
# Assign weights to the categories
category_weights = {
    'Negative': -1,
    'Positive': 1.5,
    'Uncertainty': 0.2,
    'Litigious': -0.2,
    'Strong': 1.5,
    'Weak': 0.5,
    'Constraining': -0.5
}

In [None]:
# Define function to apply the lexicon to the text
def lexicon_score_weighted(tokens):
    score = 0
    for cat in categories:
        count = sum(t in word_sets[cat] for t in tokens)
        score += count * category_weights[cat]
    return score

# Compute the weighted score per word and store it as a new column
boe_speeches['sentiment_score_lexicon_weighted'] = boe_speeches['text_lemmatised'].apply(lexicon_score_weighted) \
                                                    / boe_speeches['word_count']

# View the DataFrame
boe_speeches.head()
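
To make the weighting concrete, the sketch below works through a five-word example by hand. The toy lexicon, the reduced weight table and the function name `toy_weighted_score` are assumptions for illustration; only the three weight values are copied from `category_weights` above.

```python
# Toy lexicon and weights (illustrative; not the BoE wordlist itself).
toy_word_sets = {
    'Negative': {'risk'},
    'Positive': {'growth'},
    'Uncertainty': {'may'},
}
toy_weights = {'Negative': -1, 'Positive': 1.5, 'Uncertainty': 0.2}

def toy_weighted_score(tokens):
    """Sum category counts times category weights, then divide by length."""
    raw = sum(
        toy_weights[cat] * sum(t in words for t in tokens)
        for cat, words in toy_word_sets.items()
    )
    return raw / len(tokens)

tokens = ['growth', 'may', 'slow', 'risk', 'risk']
toy_score = toy_weighted_score(tokens)
# (2 * -1) + (1 * 1.5) + (1 * 0.2) = -0.3, divided by 5 words
print(round(toy_score, 2))  # -0.06
```

Note that a word listed in more than one category contributes once per category, so the weights of overlapping categories add up for that word.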

In [None]:
# Plot a histogram of the weighted lexicon sentiment score for summary
# Set the number of bins.
num_bins = 15
# Set the plot area.
plt.figure(figsize=(8,5))

# Define the bars.
n, bins, patches = plt.hist(boe_speeches['sentiment_score_lexicon_weighted'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('BoE Wordlist Weighted Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of BoE Wordlist Weighted Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

In [None]:
# View distribution of sentiment scores
boe_speeches['sentiment_score_lexicon_weighted'].describe()

In [None]:
# Standardise the sentiment score
# Calculate mean and standard deviation
mean_score_weighted = boe_speeches['sentiment_score_lexicon_weighted'].mean()
std_score_weighted = boe_speeches['sentiment_score_lexicon_weighted'].std()

# Create a new column for standardized scores
boe_speeches['sentiment_score_lexicon_weighted_std'] = (boe_speeches['sentiment_score_lexicon_weighted'] - mean_score_weighted) / std_score_weighted

# View the DataFrame.
boe_speeches.head()

In [None]:
# View distribution of sentiment scores
boe_speeches['sentiment_score_lexicon_weighted_std'].describe()

In [None]:
# Plot a histogram of the standardised weighted lexicon sentiment score for summary
# Set the number of bins.
num_bins = 15
# Set the plot area.
plt.figure(figsize=(8,5))

# Define the bars.
n, bins, patches = plt.hist(boe_speeches['sentiment_score_lexicon_weighted_std'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('BoE Wordlist Standardised Weighted Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of BoE Wordlist Standardised Weighted Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

**3.2.c. BoE Sentiment Score Labelling**

In [None]:
# Categorise sentiment according to sentiment score
def categorise_sentiment(sentiment_score, pos_threshold=1, neg_threshold=-1):
    if sentiment_score >= pos_threshold:
        return 'Positive'
    elif sentiment_score <= neg_threshold:
        return 'Negative'
    else:
        return 'Neutral'

In [None]:
# Apply the sentiment labels to speeches
boe_speeches['lexicon_label'] = boe_speeches['sentiment_score_lexicon_std'].apply(categorise_sentiment)
boe_speeches['lexicon_label_2'] = boe_speeches['sentiment_score_lexicon_std'].apply(categorise_sentiment, pos_threshold=0.5, neg_threshold=-0.5)
boe_speeches['lexicon_label_weighted'] = boe_speeches['sentiment_score_lexicon_weighted_std'].apply(categorise_sentiment)
boe_speeches['lexicon_label_weighted_2'] = boe_speeches['sentiment_score_lexicon_weighted_std'].apply(categorise_sentiment, pos_threshold=0.5, neg_threshold=-0.5)

# View the DataFrame.
boe_speeches.head()
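
The row-by-row `apply` above is easy to read. For larger tables, the same rule can be applied to a whole column at once with `numpy.select`. The sketch below is an alternative formulation under the same thresholds, not the notebook's own method; the name `categorise_sentiment_vectorised` and the toy scores are assumptions.

```python
import numpy as np
import pandas as pd

def categorise_sentiment_vectorised(scores: pd.Series,
                                    pos_threshold: float = 1.0,
                                    neg_threshold: float = -1.0) -> pd.Series:
    """Label a whole Series at once; same rules as categorise_sentiment."""
    values = scores.to_numpy()
    labels = np.select(
        [values >= pos_threshold, values <= neg_threshold],
        ['Positive', 'Negative'],
        default='Neutral',
    )
    return pd.Series(labels, index=scores.index)

# Toy standardised scores (illustrative only).
toy_z = pd.Series([-1.4, -0.6, 0.0, 0.7, 1.0])
labels_wide = categorise_sentiment_vectorised(toy_z)
labels_narrow = categorise_sentiment_vectorised(toy_z, pos_threshold=0.5, neg_threshold=-0.5)
print(labels_wide.tolist())
print(labels_narrow.tolist())
```

Tightening the thresholds from ±1 to ±0.5 moves scores such as -0.6 and 0.7 out of the Neutral band, which is why the `_2` label columns contain fewer Neutral speeches.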

In [None]:
# Categorise sentiment according to percentile thresholds
lower_thresh = boe_speeches['sentiment_score_lexicon_std'].quantile(0.20)
upper_thresh = boe_speeches['sentiment_score_lexicon_std'].quantile(0.80)

def classify_score(z):
    if z >= upper_thresh:
        return 'Positive'
    elif z <= lower_thresh:
        return 'Negative'
    else:
        return 'Neutral'

In [None]:
# Apply the sentiment labels to speeches
boe_speeches['lexicon_label_percentile'] = boe_speeches['sentiment_score_lexicon_std'].apply(classify_score)

# View the DataFrame.
boe_speeches.head()

In [None]:
# Plot the sentiments distribution for lexicon label
sentiment_labels_lexicon = boe_speeches['lexicon_label']

# Create a figure
plt.figure(figsize=(5, 4))

# Calculate the counts and percentages
sentiment_counts = pd.Series(sentiment_labels_lexicon.value_counts())
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100  # Calculate percentages

# Plot the bar chart
sentiment_counts.plot(kind='bar', color=['#339848', '#482173', '#557cbb'])

# Add labels.
plt.title('Sentiment Distribution for Lexicon Sentiment Label (thresholds ±1)', fontsize=12, fontweight='bold')
plt.xlabel('Sentiment Label', fontsize=10)
plt.ylabel('Number of Speeches', fontsize=10)
plt.xticks(rotation=0)

# Annotate the bars with percentages
for index, value in enumerate(sentiment_counts):
    plt.text(index, value + 0.5, f'{sentiment_percentages.iloc[index]:.0f}%', ha='center', fontsize=10)

# Save the plot.
# plt.savefig('Fig_Sentiment_Reviews.png', dpi=500, bbox_inches='tight')

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Plot the sentiments distribution for lexicon label
sentiment_labels_lexicon = boe_speeches['lexicon_label_2']

# Create a figure
plt.figure(figsize=(5, 4))

# Calculate the counts and percentages
sentiment_counts = pd.Series(sentiment_labels_lexicon.value_counts())
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100  # Calculate percentages

# Plot the bar chart
sentiment_counts.plot(kind='bar', color=['#339848', '#482173', '#557cbb'])

# Add labels.
plt.title('Sentiment Distribution for Lexicon Sentiment Label (thresholds ±0.5)', fontsize=12, fontweight='bold')
plt.xlabel('Sentiment Label', fontsize=10)
plt.ylabel('Number of Speeches', fontsize=10)
plt.xticks(rotation=0)

# Annotate the bars with percentages
for index, value in enumerate(sentiment_counts):
    plt.text(index, value + 0.5, f'{sentiment_percentages.iloc[index]:.0f}%', ha='center', fontsize=10)

# Save the plot.
# plt.savefig('Fig_Sentiment_Reviews.png', dpi=500, bbox_inches='tight')

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Plot the sentiments distribution for lexicon label
sentiment_labels_lexicon = boe_speeches['lexicon_label_weighted']

# Create a figure
plt.figure(figsize=(5, 4))

# Calculate the counts and percentages
sentiment_counts = pd.Series(sentiment_labels_lexicon.value_counts())
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100  # Calculate percentages

# Plot the bar chart
sentiment_counts.plot(kind='bar', color=['#339848', '#482173', '#557cbb'])

# Add labels.
plt.title('Sentiment Distribution for Weighted Lexicon Sentiment Label (thresholds ±1)', fontsize=12, fontweight='bold')
plt.xlabel('Sentiment Label', fontsize=10)
plt.ylabel('Number of Speeches', fontsize=10)
plt.xticks(rotation=0)

# Annotate the bars with percentages
for index, value in enumerate(sentiment_counts):
    plt.text(index, value + 0.5, f'{sentiment_percentages.iloc[index]:.0f}%', ha='center', fontsize=10)

# Save the plot.
# plt.savefig('Fig_Sentiment_Reviews.png', dpi=500, bbox_inches='tight')

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Plot the sentiments distribution for lexicon label
sentiment_labels_lexicon = boe_speeches['lexicon_label_weighted_2']

# Create a figure
plt.figure(figsize=(5, 4))

# Calculate the counts and percentages
sentiment_counts = pd.Series(sentiment_labels_lexicon.value_counts())
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100  # Calculate percentages

# Plot the bar chart
sentiment_counts.plot(kind='bar', color=['#339848', '#482173', '#557cbb'])

# Add labels.
plt.title('Sentiment Distribution for Weighted Lexicon Sentiment Label (thresholds ±0.5)', fontsize=12, fontweight='bold')
plt.xlabel('Sentiment Label', fontsize=10)
plt.ylabel('Number of Speeches', fontsize=10)
plt.xticks(rotation=0)

# Annotate the bars with percentages
for index, value in enumerate(sentiment_counts):
    plt.text(index, value + 0.5, f'{sentiment_percentages.iloc[index]:.0f}%', ha='center', fontsize=10)

# Save the plot.
# plt.savefig('Fig_Sentiment_Reviews.png', dpi=500, bbox_inches='tight')

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Plot the sentiments distribution for lexicon percentile
sentiment_labels_lexicon_percentile = boe_speeches['lexicon_label_percentile']

# Create a figure
plt.figure(figsize=(5, 4))

# Calculate the counts and percentages
sentiment_counts = pd.Series(sentiment_labels_lexicon_percentile.value_counts())
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100  # Calculate percentages

# Plot the bar chart
sentiment_counts.plot(kind='bar', color=['#339848', '#482173', '#557cbb'])

# Add labels.
plt.title('Sentiment Distribution for Lexicon Sentiment Label Percentile', fontsize=12, fontweight='bold')
plt.xlabel('Sentiment Label', fontsize=10)
plt.ylabel('Number of Speeches', fontsize=10)
plt.xticks(rotation=0)

# Annotate the bars with percentages
for index, value in enumerate(sentiment_counts):
    plt.text(index, value + 0.5, f'{sentiment_percentages.iloc[index]:.0f}%', ha='center', fontsize=10)

# Save the plot.
# plt.savefig('Fig_Sentiment_Reviews.png', dpi=500, bbox_inches='tight')

# Display the plot.
plt.tight_layout()
plt.show()
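
The five bar charts above differ only in the label column and the title, so the plotting steps could be wrapped in one function. The following is a sketch of that idea rather than the notebook's own code: the name `plot_sentiment_distribution` and the toy labels are assumptions, while the colours and figure size are copied from the cells above.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_sentiment_distribution(labels: pd.Series, title: str):
    """Bar chart of label counts, annotated with percentage shares."""
    counts = labels.value_counts()
    percentages = counts / counts.sum() * 100

    fig, ax = plt.subplots(figsize=(5, 4))
    counts.plot(kind='bar', ax=ax, color=['#339848', '#482173', '#557cbb'])
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_xlabel('Sentiment Label', fontsize=10)
    ax.set_ylabel('Number of Speeches', fontsize=10)
    ax.tick_params(axis='x', labelrotation=0)

    # Annotate each bar with its percentage share (positional access via .iloc).
    for position, count in enumerate(counts):
        ax.text(position, count + 0.5, f'{percentages.iloc[position]:.0f}%',
                ha='center', fontsize=10)

    fig.tight_layout()
    return ax, percentages

# Toy labels standing in for a column such as boe_speeches['lexicon_label'].
toy_labels = pd.Series(['Neutral'] * 6 + ['Positive'] * 3 + ['Negative'])
ax, toy_percentages = plot_sentiment_distribution(toy_labels, 'Toy Sentiment Distribution')
```

Because `value_counts()` sorts by frequency, the three colours follow bar order rather than label; mapping colours by label name would be needed if a fixed colour per sentiment is wanted.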

### 3.3. Sentiment Analysis with GPT Sentiment

In [None]:
# View the DataFrame
gpt_sentiment.head()

In [None]:
# Define the mapping dictionary
sentiment_mapping = {
    'Positive': 1,
    'Neutral': 0,
    'Negative': -1
}

# Apply the mapping to the sentiment column
gpt_sentiment['gpt_sentiment_numeric'] = gpt_sentiment['gpt_sentiment'].map(sentiment_mapping)

# View the DataFrame
gpt_sentiment.head()

In [None]:
# Plot the sentiment distribution for the GPT sentiment labels
sentiment_labels_gpt = gpt_sentiment['gpt_sentiment']

# Create a figure
plt.figure(figsize=(5, 4))

# Calculate the counts and percentages
sentiment_counts = pd.Series(sentiment_labels_gpt.value_counts())
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100  # Calculate percentages

# Plot the bar chart
sentiment_counts.plot(kind='bar', color=['#339848', '#482173', '#557cbb'])

# Add labels.
plt.title('Sentiment Distribution for GPT Sentiment Label', fontsize=12, fontweight='bold')
plt.xlabel('Sentiment Label', fontsize=10)
plt.ylabel('Number of Speeches', fontsize=10)
plt.xticks(rotation=0)

# Annotate the bars with percentages
for index, value in enumerate(sentiment_counts):
    plt.text(index, value + 0.5, f'{sentiment_percentages.iloc[index]:.0f}%', ha='center', fontsize=10)

# Save the plot.
# plt.savefig('Fig_Sentiment_Reviews.png', dpi=500, bbox_inches='tight')

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Standardise the sentiment score
# Calculate mean and standard deviation
mean_score_gpt = gpt_sentiment['gpt_sentiment_numeric'].mean()
std_score_gpt = gpt_sentiment['gpt_sentiment_numeric'].std()

# Create a new column for standardised scores
gpt_sentiment['gpt_sentiment_std'] = (gpt_sentiment['gpt_sentiment_numeric'] - mean_score_gpt) / std_score_gpt

# View the DataFrame.
gpt_sentiment.head()

### 3.4. Sentiment Analysis with FinBERT for BoE speeches using the yiyanghkust model

In [None]:
# Define a function to predict probabilities in batches
def predict_batch(texts, tokenizer, model, max_length=128):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors='pt')  # Tokenize batch of texts
    inputs = {k: v.to(device) for k, v in inputs.items()}                 # Move inputs to the device (GPU or CPU)
    with torch.no_grad():                                                 # Get model outputs without computing gradients
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=1)                              # Convert logits to probabilities
    return probs.cpu().numpy()

In [None]:
# Check the order of labels in the model
model_yiyang.config.id2label

In [None]:
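# Column that the model should be applied to (lemmatised text as plain strings,
# so that the tokenizer does not see list brackets and quotes)
texts = boe_speeches['text_lemmatised_str'].astype(str).tolist()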

# Specify batch size for efficiency
batch_size = 32
all_probs = []

for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i + batch_size]
    batch_probs = predict_batch(batch_texts, tokenizer_yiyang, model_yiyang)
    all_probs.extend(batch_probs)

# Store the predicted probabilities back into your DataFrame
boe_speeches['yiyang_probs'] = all_probs

# Extract top labels using the order established above
labels = ['Neutral', 'Positive', 'Negative']

def get_probs_yiyang_dict(probs):
    # Assumes probs is an array/list like [neutral_score, positive_score, negative_score]
    return {
        'yiyang_neutral': probs[0],
        'yiyang_positive': probs[1],
        'yiyang_negative': probs[2]
    }

def get_top_label(probs):
    idx = probs.argmax()
    return labels[idx], probs.max()

In [None]:
# Create a DataFrame with all class probabilities
probs_yiyang = boe_speeches['yiyang_probs'].apply(lambda x: get_probs_yiyang_dict(x)).apply(pd.Series)

# Assign back to your main DataFrame
boe_speeches = pd.concat([boe_speeches, probs_yiyang], axis=1)

# Apply and extract label + confidence
boe_speeches[['yiyang_label', 'yiyang_confidence']] = boe_speeches['yiyang_probs'].apply(lambda x: get_top_label(x)).apply(pd.Series)

# Now, your DataFrame has the probabilities, top label, and confidence score
boe_speeches.head()

In [None]:
# Define weights 
weights = {
    'Positive': 1,
    'Neutral': 0,
    'Negative': -1
}

In [None]:
# Define function to calculate one sentiment score
def compute_tone_score(probs):
    class_labels = ['Neutral', 'Positive', 'Negative']
    prob_dict = dict(zip(class_labels, probs))
    return (
        prob_dict['Positive'] * weights['Positive'] +
        prob_dict['Neutral'] * weights['Neutral'] +
        prob_dict['Negative'] * weights['Negative']
    )
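Because the neutral weight is 0, the tone score reduces to P(Positive) minus P(Negative). The sketch below checks this on made-up probabilities; the values and the names `toy_probs` and `weight_vector` are assumptions for illustration only, with columns ordered [Neutral, Positive, Negative] as in the label list above.

```python
import numpy as np

# Hypothetical class probabilities for two speeches,
# ordered [Neutral, Positive, Negative] (illustrative values only)
toy_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
])

# Weights in the same order: Neutral = 0, Positive = 1, Negative = -1
weight_vector = np.array([0.0, 1.0, -1.0])

# Weighted sum per row, equivalent to P(Positive) - P(Negative)
toy_tone = toy_probs @ weight_vector
print(toy_tone)
```

The score therefore lies between -1 (fully negative) and 1 (fully positive), and a confidently neutral speech scores close to 0.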

In [None]:
# Apply to all rows
boe_speeches['sentiment_score_yiyang'] = boe_speeches['yiyang_probs'].apply(compute_tone_score)

# View the dataframe
boe_speeches.head()

In [None]:
# Plot a histogram of the weighted sentiment score for BoE speeches
# Set the number of bins.
num_bins = 15

# Set the plot area.
plt.figure(figsize=(8,5))

# Define the plot.
n, bins, patches = plt.hist(boe_speeches['sentiment_score_yiyang'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('Yiyang Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of Yiyang Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

In [None]:
# Standardise the sentiment score
# Calculate mean and standard deviation
mean_score_yiyang = boe_speeches['sentiment_score_yiyang'].mean()
std_score_yiyang = boe_speeches['sentiment_score_yiyang'].std()

# Create a new column for standardized scores
boe_speeches['sentiment_score_yiyang_std'] = (boe_speeches['sentiment_score_yiyang'] - mean_score_yiyang) / std_score_yiyang

# View the DataFrame.
boe_speeches.head()

In [None]:
# Plot a histogram of the standardised sentiment score for BoE speeches
# Set the number of bins.
num_bins = 15

# Set the plot area.
plt.figure(figsize=(8,5))

# Define the plot.
n, bins, patches = plt.hist(boe_speeches['sentiment_score_yiyang_std'], num_bins, facecolor='#3bd5d7', alpha=0.6)

# Set the labels.
plt.xlabel('Standardised Yiyang Sentiment Score', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.title('Histogram of Standardised Yiyang Sentiment Score for BoE Speeches', fontsize=12, fontweight='bold')

# Display the chart.
plt.tight_layout()
plt.show()

In [None]:
# Plot the sentiments distribution for yiyang
sentiment_labels_yiyang = boe_speeches['yiyang_label']

# Create a figure
plt.figure(figsize=(5, 4))

# Calculate the counts and percentages
sentiment_counts = pd.Series(sentiment_labels_yiyang.value_counts())
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100  # Calculate percentages

# Plot the bar chart
sentiment_counts.plot(kind='bar', color=['#339848', '#482173', '#557cbb'])

# Add labels.
plt.title('Sentiment Distribution for Yiyang Sentiment Label', fontsize=12, fontweight='bold')
plt.xlabel('Sentiment Label', fontsize=10)
plt.ylabel('Number of Speeches', fontsize=10)
plt.xticks(rotation=0)

# Annotate the bars with percentages
for index, value in enumerate(sentiment_counts):
    # Use positional access (.iloc) because the Series is indexed by label names, not integers
    plt.text(index, value + 0.5, f'{sentiment_percentages.iloc[index]:.0f}%', ha='center', fontsize=10)

# Save the plot.
# plt.savefig('Fig_Sentiment_Reviews.png', dpi=500, bbox_inches='tight')

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# View the weighted Sentiment scores.
boe_speeches['sentiment_score_yiyang'].describe()

### 3.5. Compare sentiment scores

Create a new DataFrame with only the relevant columns.

In [None]:
boe_speeches.columns

In [None]:
# Merge gpt_sentiment with boe_speeches
boe_speeches_new = pd.merge(boe_speeches, gpt_sentiment[['reference', 'gpt_sentiment', 'gpt_sentiment_numeric', \
                                                         'gpt_sentiment_std']], on='reference', how='left')

# View the DataFrame
boe_speeches_new.head()

In [None]:
# Create a new DataFrame with all sentiment scores to include for comparison.
boe_speeches_sentiment = boe_speeches_new[['reference', 'country', 'date', 'title', 'author', 'is_gov', 'text',
                                           'date_format', 'year_month', 'year_month_dt', 'year',
                                           'text_lemmatised', 'text_lemmatised_str', 'word_count',
                                           'sentiment_score_lexicon', 'sentiment_score_lexicon_std', 'sentiment_score_lexicon_weighted_std',
                                           'lexicon_label', 'lexicon_label_2', 'lexicon_label_weighted', 'lexicon_label_weighted_2', 
                                           'lexicon_label_percentile',
                                           'sentiment_score_yiyang', 'sentiment_score_yiyang_std', 'yiyang_label', 
                                           'gpt_sentiment', 'gpt_sentiment_numeric', 'gpt_sentiment_std',
                                           'text_neg', 'text_neu', 'text_pos', 'text_compound', 'vader_sentiment_score',
                                           'sentiment_score_vader_std']].copy()  # Copy so later column assignments do not act on a view

# View the DataFrame.
boe_speeches_sentiment.head()

In [None]:
# Review the data
boe_speeches_sentiment.info()

In [None]:
# Correlation between BoE Wordlist and Finbert sentiment scores
from scipy.stats import pearsonr
corr, p_value = pearsonr(boe_speeches_sentiment['sentiment_score_lexicon_std'], boe_speeches_sentiment['sentiment_score_yiyang_std'])
print(f'Correlation: {corr:.2f}')

In [None]:
# Scatterplot for BoE dictionary and Finbert sentiment scores
sns.scatterplot(x='sentiment_score_lexicon_std', y='sentiment_score_yiyang_std', data=boe_speeches_sentiment)
plt.title('Comparison of Standardized Tone Indices')
plt.show()

In [None]:
# Correlation between GPT sentiment and Finbert
from scipy.stats import pearsonr
corr, p_value = pearsonr(boe_speeches_sentiment['gpt_sentiment_std'], boe_speeches_sentiment['sentiment_score_yiyang_std'])
print(f'Correlation: {corr:.2f}')

In [None]:
# Correlation between GPT sentiment and BoE Wordlist
from scipy.stats import pearsonr
corr, p_value = pearsonr(boe_speeches_sentiment['gpt_sentiment_std'], boe_speeches_sentiment['sentiment_score_lexicon_std'])
print(f'Correlation: {corr:.2f}')

In [None]:
# Correlation between GPT sentiment and weighted BoE Wordlist
from scipy.stats import pearsonr
corr, p_value = pearsonr(boe_speeches_sentiment['gpt_sentiment_std'], boe_speeches_sentiment['sentiment_score_lexicon_weighted_std'])
print(f'Correlation: {corr:.2f}')

In [None]:
# Correlation between GPT sentiment and VADER
from scipy.stats import pearsonr
corr, p_value = pearsonr(boe_speeches_sentiment['gpt_sentiment_std'], boe_speeches_sentiment['sentiment_score_vader_std'])
print(f'Correlation: {corr:.2f}')

In [None]:
# Correlation between BoE Wordlist and VADER
from scipy.stats import pearsonr
corr, p_value = pearsonr(boe_speeches_sentiment['sentiment_score_lexicon_std'], boe_speeches_sentiment['sentiment_score_vader_std'])
print(f'Correlation: {corr:.2f}')

In [None]:
# Correlation between weighted BoE Wordlist and VADER
from scipy.stats import pearsonr
corr, p_value = pearsonr(boe_speeches_sentiment['sentiment_score_lexicon_weighted_std'], boe_speeches_sentiment['sentiment_score_vader_std'])
print(f'Correlation: {corr:.2f}')
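The pairwise checks above can also be read off a single correlation matrix. This is a sketch on synthetic data, assuming only pandas and NumPy; the column names `score_a`, `score_b` and `score_c` are placeholders, not columns of `boe_speeches_sentiment`. `DataFrame.corr` excludes missing values pair by pair, which matters here because the left merge with the GPT results can leave gaps.

```python
import numpy as np
import pandas as pd

# Synthetic standardised scores (illustrative only): score_b tracks score_a, score_c is independent
rng = np.random.default_rng(0)
base = rng.normal(size=50)
toy = pd.DataFrame({
    'score_a': base,
    'score_b': base + rng.normal(scale=0.5, size=50),
    'score_c': rng.normal(size=50),
})

# Simulate a speech without a score for one method
toy.loc[3, 'score_b'] = np.nan

# Pearson correlation for every pair of columns; NaN rows are dropped per pair
corr_matrix = toy.corr(method='pearson')
print(corr_matrix.round(2))
```

In the notebook the same call could be applied to the list of standardised score columns, which keeps every pair labelled by its column names.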

In [None]:
# Export the DataFrame to an Excel file
# boe_speeches_sentiment.to_excel('boe_speeches_sentiment.xlsx', index=False)

# print("DataFrame was exported successfully.")

In [None]:
# Create a new column for agreement (1 for agree, 0 for disagree)
boe_speeches_sentiment['agreement_lexicon_yiyang'] = (boe_speeches_sentiment['lexicon_label'] == boe_speeches_sentiment['yiyang_label']).astype(int)

boe_speeches_sentiment.head()

In [None]:
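# Visualize the agreement vs disagreement
# Reindex so the bars are always ordered 0 (disagree), 1 (agree), matching the colours and tick labels below
agreement_count = boe_speeches_sentiment['agreement_lexicon_yiyang'].value_counts().reindex([0, 1], fill_value=0)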

# Plotting
plt.figure(figsize=(8, 6))
agreement_count.plot(kind='bar', color=['red', 'green'])

# Adding labels and title
plt.title('Agreement vs Disagreement between Lexicon and FinBERT', fontsize=16)
plt.xlabel('Agreement (1: Agree, 0: Disagree)', fontsize=14)
plt.ylabel('Number of Speeches', fontsize=14)
plt.xticks([0, 1], ['Disagree (0)', 'Agree (1)'], rotation=0)
plt.tight_layout()

# Display the plot
plt.show()

In [None]:
# Create a new column for agreement (1 for agree, 0 for disagree)
boe_speeches_sentiment['agreement_gpt_yiyang'] = (boe_speeches_sentiment['gpt_sentiment'] == boe_speeches_sentiment['yiyang_label']).astype(int)

boe_speeches_sentiment.head()

In [None]:
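# Visualize the agreement vs disagreement
# Reindex so the bars are always ordered 0 (disagree), 1 (agree), matching the colours and tick labels below
agreement_count = boe_speeches_sentiment['agreement_gpt_yiyang'].value_counts().reindex([0, 1], fill_value=0)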

# Plotting
plt.figure(figsize=(8, 6))
agreement_count.plot(kind='bar', color=['red', 'green'])

# Adding labels and title
plt.title('Agreement vs Disagreement between GPT results and FinBERT', fontsize=16)
plt.xlabel('Agreement (1: Agree, 0: Disagree)', fontsize=14)
plt.ylabel('Number of Speeches', fontsize=14)
plt.xticks([0, 1], ['Disagree (0)', 'Agree (1)'], rotation=0)
plt.tight_layout()

# Display the plot
plt.show()

In [None]:
# Create a new column for agreement (1 for agree, 0 for disagree)
boe_speeches_sentiment['agreement_gpt_lexicon'] = (boe_speeches_sentiment['gpt_sentiment'] == boe_speeches_sentiment['lexicon_label']).astype(int)

boe_speeches_sentiment.head()

In [None]:
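# Visualize the agreement vs disagreement
# Reindex so the bars are always ordered 0 (disagree), 1 (agree), matching the colours and tick labels below
agreement_count = boe_speeches_sentiment['agreement_gpt_lexicon'].value_counts().reindex([0, 1], fill_value=0)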

# Plotting
plt.figure(figsize=(8, 6))
agreement_count.plot(kind='bar', color=['red', 'green'])

# Adding labels and title
plt.title('Agreement vs Disagreement between GPT results and Lexicon', fontsize=16)
plt.xlabel('Agreement (1: Agree, 0: Disagree)', fontsize=14)
plt.ylabel('Number of Speeches', fontsize=14)
plt.xticks([0, 1], ['Disagree (0)', 'Agree (1)'], rotation=0)
plt.tight_layout()

# Display the plot
plt.show()

In [None]:
# Create a new column for agreement (1 for agree, 0 for disagree)
boe_speeches_sentiment['agreement_gpt_lexicon_2'] = (boe_speeches_sentiment['gpt_sentiment'] == boe_speeches_sentiment['lexicon_label_2']).astype(int)

boe_speeches_sentiment.head()

In [None]:
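# Visualize the agreement vs disagreement
# Reindex so the bars are always ordered 0 (disagree), 1 (agree), matching the colours and tick labels below
agreement_count = boe_speeches_sentiment['agreement_gpt_lexicon_2'].value_counts().reindex([0, 1], fill_value=0)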

# Plotting
plt.figure(figsize=(8, 6))
agreement_count.plot(kind='bar', color=['red', 'green'])

# Adding labels and title
plt.title('Agreement vs Disagreement between GPT results and Lexicon (lexicon_label_2)', fontsize=16)
plt.xlabel('Agreement (1: Agree, 0: Disagree)', fontsize=14)
plt.ylabel('Number of Speeches', fontsize=14)
plt.xticks([0, 1], ['Disagree (0)', 'Agree (1)'], rotation=0)
plt.tight_layout()

# Display the plot
plt.show()
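Raw agreement counts do not account for agreement that would occur by chance, for example when both methods label most speeches as neutral. One common complement is Cohen's kappa. The sketch below implements it with pandas only; the function name `cohen_kappa` and the toy labels are assumptions for illustration, and this metric is not part of the original analysis.

```python
import pandas as pd


def cohen_kappa(labels_a: pd.Series, labels_b: pd.Series) -> float:
    """Chance-corrected agreement between two label columns."""
    table = pd.crosstab(labels_a, labels_b)
    # Make the table square so every category appears in both rows and columns
    categories = table.index.union(table.columns)
    table = table.reindex(index=categories, columns=categories, fill_value=0)

    n = table.to_numpy().sum()
    observed = table.to_numpy().diagonal().sum() / n                   # share of matching labels
    expected = (table.sum(axis=1) * table.sum(axis=0)).sum() / n ** 2  # agreement expected by chance
    # Note: undefined when expected == 1 (both columns contain a single identical label)
    return (observed - expected) / (1 - expected)


# Toy labels for six speeches from two hypothetical methods
toy_a = pd.Series(['Positive', 'Neutral', 'Negative', 'Neutral', 'Positive', 'Neutral'])
toy_b = pd.Series(['Positive', 'Neutral', 'Neutral', 'Neutral', 'Negative', 'Neutral'])

toy_kappa = cohen_kappa(toy_a, toy_b)
print(round(toy_kappa, 3))
```

A kappa near 0 means the methods agree no more than chance would predict, while a value near 1 means near-perfect agreement. The comparison assumes both columns use the same label spelling and capitalisation.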

**Accuracy Test**

In [None]:
# Randomly sample 5 speeches from the DataFrame.
# accuracy_test = boe_speeches_sentiment.sample(n=5, random_state=101)

# Keep only the speech text and the sentiment labels to compare.
# accuracy_test = accuracy_test[['year_month','text', 'lexicon_label', 'yiyang_label', 'gpt_sentiment']]

# Add columns for manually labelling the sentiment.
# accuracy_test['sentiment_labelled'] = ''

# Display the sampled speeches.
# accuracy_test

In [None]:
# Extract a CSV file.
#  accuracy_test.to_csv('sentiment_accuracy_test.csv', index = True)

# print("DataFrame was exported successfully.")

## 4. Display the data

In [None]:
boe_speeches_sentiment.dtypes

**Filter for 2012 to 2022**

In [None]:
# Filter the data for 2012 to 2022
start_date = '2012-01'
end_date = '2022-12'

In [None]:
# Filter the data for specified period
boe_speeches_sentiment_12_22 = boe_speeches_sentiment[(boe_speeches_sentiment['year_month'] >= start_date) & (boe_speeches_sentiment['year_month'] <= end_date)]

# View the DataFrame
boe_speeches_sentiment_12_22

### Speech Statistics

**Word Count per Year**

In [None]:
# Group and average the word count by month
wordcount_monthly = boe_speeches_sentiment_12_22.groupby('year_month_dt')['word_count'].mean().reset_index()

# View the DataFrame
wordcount_monthly.head()

In [None]:
# Plot a line chart of average word count per month
plt.figure(figsize=(10, 6))
plt.plot(wordcount_monthly['year_month_dt'], wordcount_monthly['word_count'])
plt.xlabel('Month')
plt.ylabel('Average Word Count')
plt.title('Average Word Count per Month')
plt.xticks(rotation=45)

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
# Group and average the word count by year
wordcount_yearly = boe_speeches_sentiment_12_22.groupby('year')['word_count'].mean().reset_index()

# View the DataFrame
wordcount_yearly.head()

In [None]:
# Plot a line chart of average word count per year
plt.figure(figsize=(10, 6))
plt.plot(wordcount_yearly['year'], wordcount_yearly['word_count'])
plt.xlabel('Year')
plt.ylabel('Average Word Count')
plt.title('Average Word Count per Year')
plt.xticks(rotation=45)

# Display the plot
plt.tight_layout()
plt.show()

In [None]:
# Group by year: average word count and number of speeches
wordcount_count_yearly = boe_speeches_sentiment_12_22.groupby('year').agg({'word_count': 'mean',
                                                                           'reference': 'count'
                                                                          }).reset_index().rename(columns={'reference': 'speech_count'})

# View the DataFrame
wordcount_count_yearly.head()

In [None]:
# Create the main plot
fig, ax1 = plt.subplots(figsize=(10, 6))

# Plot 'word_count' on the primary y-axis
ax1.plot(wordcount_count_yearly['year'], wordcount_count_yearly['word_count'], color='blue', label='Average Word Count')
ax1.set_xlabel('Year')
ax1.set_ylabel('Average Word Count')
ax1.tick_params(axis='y')

# Create a secondary y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot 'speech_count' on the secondary y-axis
ax2.plot(wordcount_count_yearly['year'], wordcount_count_yearly['speech_count'], color='green', label='Number of Speeches')
ax2.set_ylabel('Number of Speeches')
ax2.tick_params(axis='y')

# Add titles and grid if needed
plt.title('Avg Word Count and Number of Speeches per Year')

# Add a combined legend (names chosen so the global `labels` list used by get_top_label is not overwritten)
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')

# Display the plot
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Number of Speeches per Year**

In [None]:
# Group and aggregate number of speeches by governor by year
governor_speeches_monthly = boe_speeches_sentiment_12_22.groupby(['year', 'is_gov']).size().reset_index(name='count')

# View the DataFrame
governor_speeches_monthly.head()

In [None]:
# Pivot the data for plotting
governor_speeches_monthly_pivot = governor_speeches_monthly.pivot(index='year', columns='is_gov', values='count').fillna(0)

# View the DataFrame
governor_speeches_monthly_pivot.head()

In [None]:
# Plot a stacked bar chart per year
ax = governor_speeches_monthly_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))

# Customize plot
ax.set_xlabel('Year')
ax.set_ylabel('Count of Speeches')
ax.set_title('Count of Speeches by BoE Governor and other Staff per Year')
ax.legend(title='Staff Member')

# Display the plot
plt.show()

In [None]:
boe_speeches_sentiment_12_22.head()

In [None]:
# Group and aggregate sentiment scores by year
gpt_sentiment_yearly= boe_speeches_sentiment_12_22.groupby(['year', 'gpt_sentiment']).size().reset_index(name='count')

# View the DataFrame
gpt_sentiment_yearly.head()

In [None]:
# Pivot the data for plotting
gpt_sentiment_yearly_pivot = gpt_sentiment_yearly.pivot(index='year', columns='gpt_sentiment', values='count').fillna(0)

# View the DataFrame
gpt_sentiment_yearly_pivot.head()

In [None]:
# Plot a stacked bar chart per year
ax = gpt_sentiment_yearly_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))

# Customize plot
ax.set_xlabel('Year')
ax.set_ylabel('Count of GPT Sentiments')
ax.set_title('GPT Sentiment Distribution per Year')
ax.legend(title='Sentiment')

# Display the plot
plt.show()

In [None]:
# Group and aggregate sentiment scores by year
lexicon_sentiment_yearly= boe_speeches_sentiment_12_22.groupby(['year', 'lexicon_label']).size().reset_index(name='count')

# View the DataFrame
lexicon_sentiment_yearly.head()

In [None]:
# Pivot the data for plotting
lexicon_sentiment_yearly_pivot = lexicon_sentiment_yearly.pivot(index='year', columns='lexicon_label', values='count').fillna(0)

# View the DataFrame
lexicon_sentiment_yearly_pivot.head()

In [None]:
# Plot a stacked bar chart per year
ax = lexicon_sentiment_yearly_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))

# Customize plot
ax.set_xlabel('Year')
ax.set_ylabel('Count of BoE Wordlist Sentiments')
ax.set_title('BoE Wordlist Sentiment Distribution per Year')
ax.legend(title='Sentiment')

# Display the plot
plt.show()

In [None]:
# Group and aggregate sentiment scores by year
lexicon_sentiment_2_yearly= boe_speeches_sentiment_12_22.groupby(['year', 'lexicon_label_2']).size().reset_index(name='count')

# View the DataFrame
lexicon_sentiment_2_yearly.head()

In [None]:
# Pivot the data for plotting
lexicon_sentiment_2_yearly_pivot = lexicon_sentiment_2_yearly.pivot(index='year', columns='lexicon_label_2', values='count').fillna(0)

# View the DataFrame
lexicon_sentiment_2_yearly_pivot.head()

In [None]:
# Plot a stacked bar chart per year
ax = lexicon_sentiment_2_yearly_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))

# Customize plot
ax.set_xlabel('Year')
ax.set_ylabel('Count of BoE Wordlist Sentiments')
ax.set_title('BoE Wordlist Sentiment Distribution per Year (lexicon_label_2)')
ax.legend(title='Sentiment')

# Display the plot
plt.show()

In [None]:
# Group and aggregate sentiment scores by year
lexicon_sentiment_weighted_yearly= boe_speeches_sentiment_12_22.groupby(['year', 'lexicon_label_weighted']).size().reset_index(name='count')

# View the DataFrame
lexicon_sentiment_weighted_yearly.head()

In [None]:
# Pivot the data for plotting
lexicon_sentiment_weighted_yearly_pivot = lexicon_sentiment_weighted_yearly.pivot(index='year', columns='lexicon_label_weighted', values='count').fillna(0)

# View the DataFrame
lexicon_sentiment_weighted_yearly_pivot.head()

In [None]:
# Plot a stacked bar chart per year
ax = lexicon_sentiment_weighted_yearly_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))

# Customize plot
ax.set_xlabel('Year')
ax.set_ylabel('Count of BoE Wordlist Sentiments')
ax.set_title('Weighted BoE Wordlist Sentiment Distribution per Year (lexicon_label_weighted)')
ax.legend(title='Sentiment')

# Display the plot
plt.show()

In [None]:
# Group and aggregate sentiment scores by year
lexicon_sentiment_weighted_2_yearly= boe_speeches_sentiment_12_22.groupby(['year', 'lexicon_label_weighted_2']).size().reset_index(name='count')

# View the DataFrame
lexicon_sentiment_weighted_2_yearly.head()

In [None]:
# Pivot the data for plotting
lexicon_sentiment_weighted_2_yearly_pivot = lexicon_sentiment_weighted_2_yearly.pivot(index='year', columns='lexicon_label_weighted_2', values='count').fillna(0)

# View the DataFrame
lexicon_sentiment_weighted_2_yearly_pivot.head()

In [None]:
# Plot a stacked bar chart per year
ax = lexicon_sentiment_weighted_2_yearly_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))

# Customize plot
ax.set_xlabel('Year')
ax.set_ylabel('Count of BoE Wordlist Sentiments')
ax.set_title('Weighted BoE Wordlist Sentiment Distribution per Year (lexicon_label_weighted_2)')
ax.legend(title='Sentiment')

# Display the plot
plt.show()

In [None]:
# Group and aggregate sentiment scores by year
yiyang_sentiment_yearly= boe_speeches_sentiment_12_22.groupby(['year', 'yiyang_label']).size().reset_index(name='count')

# View the DataFrame
yiyang_sentiment_yearly.head()

In [None]:
# Pivot the data for plotting
yiyang_sentiment_yearly_pivot = yiyang_sentiment_yearly.pivot(index='year', columns='yiyang_label', values='count').fillna(0)

# View the DataFrame
yiyang_sentiment_yearly_pivot.head()

In [None]:
# Plot a stacked bar chart per year
ax = yiyang_sentiment_yearly_pivot.plot(kind='bar', stacked=True, figsize=(12, 6))

# Customize plot
ax.set_xlabel('Year')
ax.set_ylabel('Count of FinBERT (Yiyang) Sentiments')
ax.set_title('Yiyang Sentiment Distribution per Year')
ax.legend(title='Sentiment')

# Display the plot
plt.show()
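The group, pivot and plot pattern above is repeated for every label column. As a sketch, `pd.crosstab` produces the same year-by-label count table in one call; the helper name `yearly_label_counts` and the toy data are assumptions for illustration.

```python
import pandas as pd


def yearly_label_counts(df: pd.DataFrame, label_col: str, year_col: str = 'year') -> pd.DataFrame:
    """Count rows per year and label; combinations that never occur are filled with 0."""
    return pd.crosstab(df[year_col], df[label_col])


# Toy data standing in for the speeches DataFrame (illustrative values only)
toy_speeches = pd.DataFrame({
    'year': [2012, 2012, 2013, 2013, 2013],
    'label': ['Positive', 'Neutral', 'Neutral', 'Neutral', 'Negative'],
})

toy_counts = yearly_label_counts(toy_speeches, 'label')
print(toy_counts)

# The stacked bar chart then follows the same call as in the cells above:
# toy_counts.plot(kind='bar', stacked=True, figsize=(12, 6))
```

Looping such a helper over the label columns would keep titles and axis labels tied to the column being plotted.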

### 4.1. Compare the sentiment scores over time

In [None]:
# Group and aggregate sentiment scores by month
sentiment_monthly = boe_speeches_sentiment.groupby('year_month_dt')[['sentiment_score_yiyang_std','sentiment_score_lexicon_std', \
                                    'gpt_sentiment_std', 'sentiment_score_yiyang','sentiment_score_lexicon', \
                                    'gpt_sentiment_numeric', 'sentiment_score_lexicon_weighted_std']].mean().reset_index()
# View the DataFrame
sentiment_monthly.head()

In [None]:
# Review the DataFrame
sentiment_monthly.info()

**4.1.a. Analysis by date**

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_monthly,
     x='year_month_dt',
     y=['sentiment_score_yiyang_std','sentiment_score_lexicon_std', 'gpt_sentiment_numeric'],
     title="Average monthly sentiment scores – Bank of England speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_monthly,
     x='year_month_dt',
     y=['sentiment_score_yiyang_std','sentiment_score_lexicon_std'],
     title="Average monthly sentiment scores – Bank of England speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_monthly,
     x='year_month_dt',
     y=['sentiment_score_lexicon_std','gpt_sentiment_numeric'],
     title="Average monthly BoE Wordlist and GPT sentiment scores – Bank of England speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

**4.1.b. Analysis by year**

In [None]:
boe_speeches_sentiment.columns

In [None]:
# Group and aggregate sentiment scores by year
sentiment_yearly = boe_speeches_sentiment.groupby('year')[['sentiment_score_yiyang_std','sentiment_score_lexicon_std', \
                                    'gpt_sentiment_std', 'sentiment_score_yiyang','sentiment_score_lexicon', \
                                    'gpt_sentiment_numeric', 'sentiment_score_lexicon_weighted_std', \
                                                           'sentiment_score_vader_std']].mean().reset_index()
# View the DataFrame
sentiment_yearly.head()

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_yearly,
     x='year',
     y=['sentiment_score_lexicon_std','gpt_sentiment_std'],
     title="Average yearly BoE Wordlist and GPT sentiment scores – BoE Speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_yearly,
     x='year',
     y=['sentiment_score_lexicon_std','gpt_sentiment_std', 'sentiment_score_vader_std'],
     title="Average yearly BoE Wordlist, GPT and VADER sentiment scores – BoE Speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_yearly,
     x='year',
     y=['sentiment_score_lexicon_std','gpt_sentiment_std', 'sentiment_score_yiyang_std', 'sentiment_score_vader_std'],
     title="Average yearly BoE Wordlist, GPT, FinBERT and VADER sentiment scores – BoE Speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

**4.1.c. Analysis by quarter/ 3 month averages**

In [None]:
# Group and aggregate sentiment scores by quarter
sentiment_quarterly = boe_speeches_sentiment.groupby(pd.Grouper(key='year_month_dt', freq='Q'))[
    ['sentiment_score_yiyang_std', 'sentiment_score_lexicon_std',
     'gpt_sentiment_std', 'sentiment_score_yiyang', 'sentiment_score_lexicon',
     'gpt_sentiment_numeric', 'sentiment_score_vader_std']
].mean().reset_index()

# View the DataFrame
sentiment_quarterly.head()

In [None]:
# Review the DataFrame
sentiment_quarterly.info()

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_quarterly,
     x='year_month_dt',
     y=['sentiment_score_lexicon_std','gpt_sentiment_std', 'sentiment_score_vader_std'],
     title="Average quarterly BoE Wordlist, GPT and VADER sentiment scores – BoE Speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

In [None]:
# Plot sentiment scores over time

# Define the plot
fig = px.line(
     sentiment_quarterly,
     x='year_month_dt',
     y=['sentiment_score_lexicon_std','gpt_sentiment_std', 'sentiment_score_yiyang_std'],
     title="Average quarterly BoE Wordlist, GPT and FinBERT sentiment scores – BoE Speeches (1997–2022)",
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Display the plot
fig.show()

### 4.2. Seasonality Analysis

In [None]:
# View DataFrame
sentiment_monthly.columns

In [None]:
# Filter the data for 2012 to 2022
start_date = '2012-01'
end_date = '2022-12'

In [None]:
# Filter the data for specified period
seasonality = sentiment_monthly[(sentiment_monthly['year_month_dt'] >= start_date) & (sentiment_monthly['year_month_dt'] <= end_date)]

In [None]:
# Set the datetime as index (reassign rather than modify the filtered slice in place)
seasonality = seasonality.set_index('year_month_dt')

# View the DataFrame
seasonality
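`seasonal_decompose` does not accept missing values, and with `period=12` it treats every 12 consecutive rows as one year. If some months contain no speeches, the monthly index has gaps and the seasonal positions drift. A minimal sketch of one way to regularise the index first, assuming month-start timestamps and linear interpolation as the gap-filling choice (both are assumptions, not the notebook's confirmed approach):

```python
import pandas as pd

# Toy monthly series with March 2012 missing (illustrative values only)
toy_monthly = pd.Series(
    [0.2, -0.1, 0.4, 0.0],
    index=pd.to_datetime(['2012-01-01', '2012-02-01', '2012-04-01', '2012-05-01']),
    name='sentiment_score',
)

# Reindex to a complete month-start calendar, then fill the gap by linear interpolation
toy_regular = toy_monthly.asfreq('MS').interpolate(method='linear')
print(toy_regular)
```

Interpolation smooths over months without speeches, so any seasonal pattern found afterwards should be read with that in mind.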

In [None]:
# Perform decomposition
decomposed_lexicon = seasonal_decompose(seasonality['sentiment_score_lexicon_std'], model='additive', period=12)

# Plot the decomposition
decomposed_lexicon.plot()
plt.show()

In [None]:
# Perform decomposition
decomposed_gpt = seasonal_decompose(seasonality['gpt_sentiment_std'], model='additive', period=12)

# Plot the decomposition
decomposed_gpt.plot()
plt.show()

In [None]:
# Perform decomposition
decomposed_finbert = seasonal_decompose(seasonality['sentiment_score_yiyang_std'], model='additive', period=12)

# Plot the decomposition
decomposed_finbert.plot()
plt.show()

**Observations: Peaks**
- August 2013: Peak coinciding with a speech announcing Jane Austen on the 10 GBP note and a discussion of the evolution of monetary policy since the 2008-2009 crisis.
- January 2022: Speech about inflation conveying a sense of urgency about controlling it, with reassurance that the Bank is actively monitoring and prepared to act responsibly.

In [None]:
sentiment_quarterly.columns

In [None]:
# Filter the data for 2012 to 2022
start_date = '2012-01'
end_date = '2022-12'

In [None]:
# Filter the data for specified period
seasonality_quarterly = sentiment_quarterly[(sentiment_quarterly['year_month_dt'] >= start_date) & (sentiment_quarterly['year_month_dt'] <= end_date)]

In [None]:
# Set the datetime as index (reassign rather than modify the filtered slice in place)
seasonality_quarterly = seasonality_quarterly.set_index('year_month_dt')

# View the DataFrame
seasonality_quarterly

In [None]:
# Perform decomposition (quarterly data, so one annual cycle spans 4 periods)
decomposed_lexicon_quarterly = seasonal_decompose(seasonality_quarterly['sentiment_score_lexicon_std'], model='additive', period=4)

# Plot the decomposition
decomposed_lexicon_quarterly.plot()
plt.show()

In [None]:
# Perform decomposition (quarterly data, so one annual cycle spans 4 periods)
decomposed_gpt_quarterly = seasonal_decompose(seasonality_quarterly['gpt_sentiment_std'], model='additive', period=4)

# Plot the decomposition
decomposed_gpt_quarterly.plot()
plt.show()

## Governor speeches only

In [None]:
# Filter speeches for governors only
boe_speeches_finbert_gov = boe_speeches_finbert[boe_speeches_finbert['is_gov'] == 1]

# View the DataFrame
boe_speeches_finbert_gov.head()

In [None]:
# Review the DataFrame
boe_speeches_finbert_gov.info()

In [None]:
# Group and aggregate sentiment scores by month
finbert_monthly_gov = boe_speeches_finbert_gov.groupby('year_month')[['yiyang_neutral', 'yiyang_positive', 'yiyang_negative', \
                                                              'yiyang_confidence', 'sentiment_score_yiyang', \
                                                                 'sentiment_score_yiyang_std']].mean().reset_index()
finbert_monthly_gov.head()

In [None]:
# Change the date format
finbert_monthly_gov['year_month_dt'] = finbert_monthly_gov['year_month'].dt.to_timestamp()

In [None]:
# Filter the data for 10 years
start_date = '2012-01'
end_date = '2022-12'

In [None]:
# Filter the data for specified period
seasonality_finbert_gov_10 = finbert_monthly_gov[(finbert_monthly_gov['year_month'] >= start_date) & \
                          (finbert_monthly_gov['year_month'] <= end_date)]

In [None]:
# Set the datetime as index (reassign rather than modify the filtered slice in place)
seasonality_finbert_gov_10 = seasonality_finbert_gov_10.set_index('year_month_dt')

# View the DataFrame
seasonality_finbert_gov_10

In [None]:
# Perform decomposition
decomposed = seasonal_decompose(seasonality_finbert_gov_10['sentiment_score_yiyang_std'], model='additive', period=12)

# Plot the decomposition
decomposed.plot()
plt.show()

### 4.1. Covid

In [None]:
# Filter the data by date for Covid
start_date = '2020-01'
end_date = '2022-12'

In [None]:
# Filter the data for specified period
finbert_covid = finbert_monthly[(finbert_monthly['year_month_dt'] >= start_date) & (finbert_monthly['year_month_dt'] <= end_date)]

In [None]:
fig = px.line(
     finbert_covid,
     x='year_month_dt',
     y=['yiyang_neutral','yiyang_positive', 'yiyang_negative'],
     title='Average monthly Finbert scores – Bank of England speeches during Covid (2020–2022)',
     labels={'value': 'Average score', 'variable': 'Metric'},
    color_discrete_sequence=['blue', 'green', 'red']
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Show the figure
fig.show()

In [None]:
fig = px.line(
     finbert_covid,
     x='year_month_dt',
     y=['sentiment_score_yiyang_std'],
     title='Average monthly standardised Finbert scores – Bank of England speeches during Covid (2020–2022)',
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Show the figure
fig.show()

In [None]:
# Filter the data for specified period
finbert_gov_covid = finbert_monthly_gov[(finbert_monthly_gov['year_month_dt'] >= start_date) & (finbert_monthly_gov['year_month_dt'] <= end_date)]

In [None]:
fig = px.line(
     finbert_gov_covid,
     x='year_month_dt',
     y=['yiyang_neutral','yiyang_positive', 'yiyang_negative'],
     title='Average monthly Finbert scores – Bank of England governor speeches during Covid (2020–2022)',
     labels={'value': 'Average score', 'variable': 'Metric'},
    color_discrete_sequence=['blue', 'green', 'red']
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Show the figure
fig.show()

### 4.2. Brexit vote

In [None]:
# Filter the data by date for before and after Brexit
start_date = '2016-01'
end_date = '2017-06'

In [None]:
# Filter the data for specified period
finbert_brexit = finbert_monthly[(finbert_monthly['year_month_dt'] >= start_date) & (finbert_monthly['year_month_dt'] <= end_date)]

In [None]:
fig = px.line(
     finbert_brexit,
     x='year_month_dt',
     y=['yiyang_neutral','yiyang_positive', 'yiyang_negative'],
     title='Average monthly Finbert scores – Bank of England speeches during Brexit (2016–2017)',
     labels={'value': 'Average score', 'variable': 'Metric'},
    color_discrete_sequence=['blue', 'green', 'red']
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Show the figure
fig.show()

In [None]:
fig = px.line(
     finbert_brexit,
     x='year_month_dt',
     y=['sentiment_score_yiyang_std'],
     title='Average monthly standardised Finbert scores – Bank of England speeches during Brexit (2016–2017)',
     labels={'value': 'Average score', 'variable': 'Metric'}
)

# Adjust size
fig.update_layout(width=1100, height=600)

# Move the legend
fig.update_layout(
    legend=dict(
        x=0.8,
        y=1,
        xanchor='left',
        yanchor='top'
    )
)

# Show the figure
fig.show()

### Filter the dataframe

In [None]:
target_words = {
    'inflation': ['inflation'],           
    'monetary policy': ['monetary policy'],               
    'price stability': ['price stability'],
    'exchange rate': ['exchange rate'],
    'growth': ['growth'],
    'financial market': ['financial market']
}

In [None]:
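def count_word(text, word):
    # Count the target as a token sequence so that multi-word phrases
    # such as 'monetary policy' are matched, not only single tokens.
    tokens = [w.lower() for w in word_tokenize(str(text))]
    target = word.lower().split()
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

# Counts are added as new columns to boe_speeches_sentiment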

for category, words in target_words.items():
    # Since each category only has one word, no need to iterate over list, just take the first
    word = words[0]
    # Count occurrences of the word in each speech
    boe_speeches_sentiment[category + '_count'] = boe_speeches_sentiment['text_lemmatised'].apply(lambda x: count_word(x, word))

# Now, create separate dataframes per category
inflation_df = boe_speeches_sentiment[['text_lemmatised', 'inflation_count']].copy()
monetary_policy_df = boe_speeches_sentiment[['text_lemmatised', 'monetary policy_count']].copy()
price_stability_df = boe_speeches_sentiment[['text_lemmatised', 'price stability_count']].copy()
exchange_rate_df = boe_speeches_sentiment[['text_lemmatised', 'exchange rate_count']].copy()
growth_df = boe_speeches_sentiment[['text_lemmatised', 'growth_count']].copy()
financial_market_df = boe_speeches_sentiment[['text_lemmatised', 'financial market_count']].copy()

# Keep only speeches where the target word appears more than 3 times
inflation_df = inflation_df[inflation_df['inflation_count'] > 3]
monetary_policy_df = monetary_policy_df[monetary_policy_df['monetary policy_count'] > 3]
price_stability_df = price_stability_df[price_stability_df['price stability_count'] > 3]
exchange_rate_df = exchange_rate_df[exchange_rate_df['exchange rate_count'] > 3]
growth_df = growth_df[growth_df['growth_count'] > 3]
financial_market_df = financial_market_df[financial_market_df['financial market_count'] > 3]

In [None]:
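def count_word(text, word):
    # Count the target as a token sequence so that multi-word phrases
    # such as 'monetary policy' are matched, not only single tokens.
    tokens = [w.lower() for w in word_tokenize(str(text))]
    target = word.lower().split()
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

# Repeat the counts on the raw 'text' column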

for category, words in target_words.items():
    # Since each category only has one word, no need to iterate over list, just take the first
    word = words[0]
    # Count occurrences of the word in each speech
    boe_speeches_sentiment[category + '_count'] = boe_speeches_sentiment['text'].apply(lambda x: count_word(x, word))

# Now, create separate dataframes per category
inflation_df = boe_speeches_sentiment[['text', 'inflation_count']].copy()
monetary_policy_df = boe_speeches_sentiment[['text', 'monetary policy_count']].copy()
price_stability_df = boe_speeches_sentiment[['text', 'price stability_count']].copy()
exchange_rate_df = boe_speeches_sentiment[['text', 'exchange rate_count']].copy()
growth_df = boe_speeches_sentiment[['text', 'growth_count']].copy()
financial_market_df = boe_speeches_sentiment[['text', 'financial market_count']].copy()

# Keep only speeches where the target word appears more than 3 times
inflation_df = inflation_df[inflation_df['inflation_count'] > 3]
monetary_policy_df = monetary_policy_df[monetary_policy_df['monetary policy_count'] > 3]
price_stability_df = price_stability_df[price_stability_df['price stability_count'] > 3]
exchange_rate_df = exchange_rate_df[exchange_rate_df['exchange rate_count'] > 3]
growth_df = growth_df[growth_df['growth_count'] > 3]
financial_market_df = financial_market_df[financial_market_df['financial market_count'] > 3]

In [None]:
inflation_df.head()

In [None]:
monetary_policy_df.head()

In [None]:
price_stability_df.head()

In [None]:
exchange_rate_df.head()

In [None]:
growth_df.head()

In [None]:
financial_market_df.head()

In [None]:
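def count_target_words(text, target_words):
    # Tokenise once, then count each target as a token sequence so that
    # multi-word phrases such as 'monetary policy' are matched as well.
    tokens = [w.lower() for w in word_tokenize(str(text))]
    count = 0
    for phrase in target_words:
        target = phrase.lower().split()
        n = len(target)
        count += sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))
    return count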

# Apply and add as a new column
target_words = ['inflation', 'monetary policy', 'price stability', 'exchange rate', 'growth', 'financial market']
boe_speeches_sentiment['target_word_freq'] = boe_speeches_sentiment['text_lemmatised_str'].apply(lambda x: count_target_words(x, target_words))

# View the dataFrame
boe_speeches_sentiment.head()

In [None]:
# Speeches where the target words appear more than 5 times in total
inflation_df = boe_speeches_sentiment[boe_speeches_sentiment['target_word_freq'] > 5]

inflation_df.head()
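
The keyword filters above depend on counting multi-word phrases such as "monetary policy" correctly, which a per-token count cannot do. As a minimal, standard-library sketch of an alternative to token matching, the hypothetical helper `count_phrase` below uses a regular expression with word boundaries. It is an illustration on a made-up sentence, not the method used elsewhere in this notebook.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive whole-word occurrences of a word or phrase."""
    # Allow any amount of whitespace between the words of the phrase.
    words = [re.escape(part) for part in phrase.lower().split()]
    pattern = r'\b' + r'\s+'.join(words) + r'\b'
    return len(re.findall(pattern, str(text).lower()))

sample = "Monetary policy targets inflation. Inflationary pressure shapes monetary  policy."
print(count_phrase(sample, 'monetary policy'))
print(count_phrase(sample, 'inflation'))
```

The word boundaries mean that "inflationary" is not counted as "inflation"; whether that is desirable depends on whether the counts are run on raw or lemmatised text.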

## 4. Exploratory Analysis for Correlation with Economic Indicators

### 4.1. Prepare the data

In [None]:
# List of DataFrames to merge
dataframes_to_merge = [uk_economic_indicators]

# Use reduce to merge all DataFrames in the list
boe_speeches_indicators = reduce(lambda left, right: left.merge(right, on='year_month', how='left'), dataframes_to_merge, boe_speeches_new)

# View the merged DataFrame
boe_speeches_indicators.head()

In [None]:
# Check for missing values.
boe_speeches_indicators.isnull().sum()

In [None]:
# Explore the DataFrame.
boe_speeches_indicators.info()

### 4.2. Plot the data

In [None]:
# Display all column names.
boe_speeches_indicators.columns

In [None]:
boe_speeches_indicators['year_month'].dtypes

In [None]:
# Convert 'year_month' Period to datetime
# boe_speeches_indicators['date'] = boe_speeches_indicators['year_month'].dt.to_timestamp()

In [None]:
# boe_speeches_indicators['year_month'].dtypes

**Confidence Index**

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot VADER text_compound on primary y-axis.
ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators['text_compound'], color='blue', label='VADER compound score')
ax1.set_xlabel('Date')
ax1.set_ylabel('VADER compound score', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Confidence Index on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['confidence_index'], color='red', label='Confidence Index')
ax2.set_ylabel('Confidence Index', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England VADER Sentiment Score vs UK Confidence Index')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot Loughran-McDonald weighted sentiment score on primary y-axis.
ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators['sentiment_score_lm_weighted'], color='blue', label='sentiment_score_lm_weighted')
ax1.set_xlabel('Date')
ax1.set_ylabel('sentiment_score_lm_weighted', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Confidence Index on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['confidence_index'], color='red', label='Confidence Index')
ax2.set_ylabel('Confidence Index', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England Loughran-McDonald Weighted Sentiment Score vs UK Confidence Index')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot VADER positive, negative and neutral scores on primary y-axis.
for col, colour in zip(['text_pos', 'text_neg', 'text_neu'], ['green', 'orange', 'blue']):
    ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators[col], color=colour, label=f'VADER {col}')
ax1.set_xlabel('Date')
ax1.set_ylabel('VADER Scores', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Confidence Index on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['confidence_index'], color='red', label='Confidence Index')
ax2.set_ylabel('Confidence Index', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England VADER Sentiment Score vs UK Confidence Index')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot lexicon positive, negative and uncertainty scores on primary y-axis.
for col, colour in zip(['Positive', 'Negative', 'Uncertainty'], ['green', 'orange', 'blue']):
    ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators[col], color=colour, label=f'Lexicon {col}')
ax1.set_xlabel('Date')
ax1.set_ylabel('Lexicon Sentiment', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Confidence Index on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['confidence_index'], color='red', label='Confidence Index')
ax2.set_ylabel('Confidence Index', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England Lexicon Sentiment Scores vs UK Confidence Index')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

**Inflation**

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot Loughran-McDonald negative, positive and uncertainty scores on primary y-axis.
for col, colour in zip(['negative_lm', 'positive_lm', 'uncertainty_lm'], ['orange', 'green', 'blue']):
    ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators[col], color=colour, label=col)
ax1.set_xlabel('Date')
ax1.set_ylabel('weighted_sentiment_LM', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Inflation Rate on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['Inflation Rate'], color='red', label='Inflation Rate')
ax2.set_ylabel('Inflation Rate', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England Loughran-McDonald Weighted Sentiment Score vs UK Inflation Rate')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot Loughran-McDonald weighted sentiment score on primary y-axis.
ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators['sentiment_score_lm_weighted'], color='blue', label='sentiment_score_lm_weighted')
ax1.set_xlabel('Date')
ax1.set_ylabel('weighted_sentiment_LM', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Unemployment Rate on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['Unemployment rate'], color='red', label='Unemployment Rate')
ax2.set_ylabel('Unemployment Rate', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England Loughran-McDonald Weighted Sentiment Score vs UK Unemployment Rate')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

**Interest Rates**

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot Loughran-McDonald weighted sentiment score on primary y-axis.
ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators['sentiment_score_lm_weighted'], color='blue', label='sentiment_score_lm_weighted')
ax1.set_xlabel('Date')
ax1.set_ylabel('weighted_sentiment_LM', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Bank Rate on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['Bank Rate'], color='red', label='Bank Rate')
ax2.set_ylabel('Bank Rate', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England Loughran-McDonald Weighted Sentiment Score vs UK Bank Rate')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

In [None]:
# Create figure and axis
fig, ax1 = plt.subplots(figsize=(20, 6))

# Plot VADER positive, negative and neutral scores on primary y-axis.
for col, colour in zip(['text_pos', 'text_neg', 'text_neu'], ['green', 'orange', 'blue']):
    ax1.plot(boe_speeches_indicators['date'], boe_speeches_indicators[col], color=colour, label=f'VADER {col}')
ax1.set_xlabel('Date')
ax1.set_ylabel('VADER Scores', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# Plot the Bank Rate on secondary y-axis
ax2.plot(boe_speeches_indicators['date'], boe_speeches_indicators['Bank Rate'], color='red', label='Bank Rate')
ax2.set_ylabel('Bank Rate', color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Add title and legend
plt.title('Bank of England VADER Sentiment Score vs UK Bank Rate')
fig.legend(loc='upper left', bbox_to_anchor=(0.1, 0.9))

# Display the chart
plt.tight_layout()
plt.show()

### 4.3. Initial statistical analysis

In [None]:
boe_speeches_indicators.columns

**4.3.a. GPT Analysis**

In [None]:
# Create a pairplot for GPT sentiment score and all economic indicators.
columns_sentiment_gpt = ['gpt_sentiment_std', 'uk_inflation_rate_CPIH', 'uk_unemployment_rate', 'uk_gdp_growth',
                             'uk_interest_rate', 'uk_consumer_confidence', 'gbp_usd_fx', 'ftse_250', 'gilts_short ', 
                             'gilts_medium ', 'gilts_long ', 'uk_credit_growth_no_cc', 'uk_credit_growth_only_cc',
                             'avg_price_all_property_types']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_gpt], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix for GPT sentiment score with all economic indicators
correlation_matrix_gpt = boe_speeches_indicators[columns_sentiment_gpt].corr()

# Display the correlation matrix
correlation_matrix_gpt

In [None]:
# Heatmap of the correlation matrix for GPT sentiment score with all economic indicators
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_gpt, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of all Economic Indicators with GPT Sentiment')
plt.show()

In [None]:
# Create a pairplot for GPT sentiment score and price/ inflation indicators
columns_sentiment_gpt_price = ['gpt_sentiment_std', 'uk_inflation_rate_CPIH', 'uk_interest_rate',
                             'avg_price_all_property_types']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_gpt_price], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix for GPT sentiment score with price/ inflation indicators
correlation_matrix_gpt_price = boe_speeches_indicators[columns_sentiment_gpt_price].corr()

# Display the correlation matrix
correlation_matrix_gpt_price

In [None]:
# Heatmap of the correlation matrix with price/ inflation indicators
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_gpt_price, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of Price/Inflation Indicators with GPT Sentiment')
plt.show()

In [None]:
# Create a pairplot for GPT sentiment score and macroeconomic indicators
columns_sentiment_gpt_macro = ['gpt_sentiment_std', 'uk_gdp_growth', 'uk_unemployment_rate', 
                            'uk_credit_growth_no_cc', 'uk_consumer_confidence']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_gpt_macro], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix for GPT sentiment score with macroeconomic indicators
correlation_matrix_gpt_macro = boe_speeches_indicators[columns_sentiment_gpt_macro].corr()

# Display the correlation matrix
correlation_matrix_gpt_macro

In [None]:
# Heatmap of the correlation matrix with macroeconomic indicators
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_gpt_macro, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of Macroeconomic Indicators with GPT Sentiment')
plt.show()

In [None]:
# Create a pairplot for GPT sentiment score and financial indicators
columns_sentiment_gpt_finance = ['gpt_sentiment_std', 'gbp_usd_fx', 'ftse_250', 'gilts_short ', 
                             'gilts_medium ', 'gilts_long ']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_gpt_finance], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix for GPT sentiment score with financial indicators
correlation_matrix_gpt_finance = boe_speeches_indicators[columns_sentiment_gpt_finance].corr()

# Display the correlation matrix
correlation_matrix_gpt_finance

In [None]:
# Heatmap of the correlation matrix with finance indicators
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_gpt_finance, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of Financial Indicators with GPT Sentiment')
plt.show()

In [None]:
# Create a pairplot for GPT sentiment score and most impactful indicators
columns_sentiment_gpt_top = ['gpt_sentiment_std', 'uk_inflation_rate_CPIH', 'uk_unemployment_rate', 
                            'uk_credit_growth_no_cc', 'uk_consumer_confidence', 'ftse_250']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_gpt_top], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix for GPT sentiment score with the most impactful indicators
correlation_matrix_gpt_top = boe_speeches_indicators[columns_sentiment_gpt_top].corr()

# Display the correlation matrix
correlation_matrix_gpt_top

In [None]:
# Heatmap of the correlation matrix with most impactful indicators
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_gpt_top, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of Most Impactful Indicators with GPT Sentiment')
plt.show()

**4.3.b. FinBert Analysis**

In [None]:
# Create a pairplot for FinBert sentiment score and all economic indicators.
columns_sentiment_finbert = ['sentiment_score_yiyang_std', 'uk_inflation_rate_CPIH', 'uk_unemployment_rate', 'uk_gdp_growth',
                             'uk_interest_rate', 'uk_consumer_confidence', 'gbp_usd_fx', 'ftse_250', 'gilts_short ', 
                             'gilts_medium ', 'gilts_long ', 'uk_credit_growth_no_cc', 'uk_credit_growth_only_cc',
                             'avg_price_all_property_types']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_finbert], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Apply the layout before saving so the saved figure matches the displayed one.
plt.tight_layout()

# Save figure.
plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.show()

In [None]:
# Create the correlation matrix for the sentiment score with the indicators only.
correlation_matrix_finbert = boe_speeches_indicators[columns_sentiment_finbert].corr()

# Display the correlation matrix
correlation_matrix_finbert

In [None]:
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_finbert, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of Economic Indicators')
plt.show()

In [None]:
# Create a pairplot for FinBert sentiment score and price/inflation indicators.
columns_sentiment_finbert_inflation = ['sentiment_score_yiyang_std', 'uk_inflation_rate_CPIH', 'uk_interest_rate',
                             'avg_price_all_property_types']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_finbert_inflation], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Create the correlation matrix for the sentiment score with the indicators only.
correlation_matrix_finbert_inflation = boe_speeches_indicators[columns_sentiment_finbert_inflation].corr()

# Display the correlation matrix
correlation_matrix_finbert_inflation

In [None]:
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_finbert_inflation, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of Economic Indicators on Price & Inflation')
plt.show()

In [None]:
# Create a pairplot for FinBert sentiment score and macroeconomic indicators.
columns_sentiment_finbert_macro = ['sentiment_score_yiyang_std', 'uk_gdp_growth', 'uk_unemployment_rate', 
                            'uk_credit_growth_no_cc', 'uk_consumer_confidence']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_finbert_macro], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Create the correlation matrix for the sentiment score with the indicators only.
correlation_matrix_finbert_macro = boe_speeches_indicators[columns_sentiment_finbert_macro].corr()

# Display the correlation matrix
correlation_matrix_finbert_macro

In [None]:
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6)) 
sns.heatmap(correlation_matrix_finbert_macro, annot=True, fmt=".2f", cmap='viridis', cbar=True)

# Customize title and labels
plt.title('Correlation Heatmap of Macroeconomic Indicators')
plt.show()

In [None]:
# Create a pairplot for FinBert sentiment score and financial indicators.
columns_sentiment_finbert_finance = ['sentiment_score_yiyang_std', 'gbp_usd_fx', 'ftse_250', 'gilts_short ', 
                             'gilts_medium ', 'gilts_long ']

# Create a pairplot using only the specified columns
sns.pairplot(boe_speeches_indicators[columns_sentiment_finbert_finance], plot_kws={'alpha': 0.5, 'color': '#0e1b2c'})

# Save figure.
# plt.savefig('Fig_Pairplot_Indicators.png', dpi=500)

# Display the plot.
plt.tight_layout()
plt.show()

In [None]:
# Create the correlation matrix for the sentiment score with the indicators only.
correlation_matrix_finbert_finance = boe_speeches_indicators[columns_sentiment_finbert_finance].corr()

# Display the correlation matrix
correlation_matrix_finbert_finance
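
The correlation matrices above compare sentiment and indicators in the same month only, so they cannot show whether sentiment tends to lead or follow an indicator (sub-question 2). One simple way to explore this is to shift the sentiment series by a range of lags and recompute the correlation each time. The sketch below is an illustration under stated assumptions: `lagged_correlations` is a hypothetical helper, and the data are synthetic, built so that the indicator copies sentiment two months later.

```python
import numpy as np
import pandas as pd

def lagged_correlations(sentiment, indicator, max_lag=3):
    """Pearson correlation between lagged sentiment and the indicator.

    A positive lag k pairs sentiment at month t-k with the indicator at month t
    (sentiment leads); a negative lag pairs later sentiment with the indicator.
    """
    return pd.Series(
        {lag: sentiment.shift(lag).corr(indicator) for lag in range(-max_lag, max_lag + 1)},
        name='correlation',
    )

# Synthetic monthly data: the indicator repeats sentiment with a two-month delay.
rng = np.random.default_rng(0)
months = pd.date_range('2015-01-01', periods=60, freq='MS')
sentiment = pd.Series(rng.normal(size=60), index=months)
indicator = sentiment.shift(2)

lag_corr = lagged_correlations(sentiment, indicator, max_lag=3)
print(lag_corr)
print(lag_corr.idxmax())  # lag with the strongest positive correlation
```

On real data the series would first need to be aggregated to one row per month. A peak at a positive lag is only suggestive of sentiment leading the indicator, not evidence of causation.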

## 5. Random Forests

In [None]:
# List of DataFrames to merge
dataframes_to_merge = [uk_economic_indicators]

# Use reduce to merge all DataFrames in the list
boe_speeches_indicators = reduce(lambda left, right: left.merge(right, on='year_month', how='left'), dataframes_to_merge, boe_speeches_new)

# View the merged DataFrame
boe_speeches_indicators.head()

In [None]:
boe_speeches_indicators.columns

In [None]:
boe_speeches_indicators['date_format'].dtypes

In [None]:
# Copy the DataFrame for further manipulation
boe_rf = boe_speeches_indicators.copy()

In [None]:
# Step 1: Convert 'date' to datetime if it's not already
boe_rf['date_time'] = pd.to_datetime(boe_rf['date'])

In [None]:
# Set 'date_time' as index for resampling
boe_rf.set_index('date_time', inplace=True)

In [None]:
# View the DataFrame
boe_rf.head()

### 5.1. Monthly Analysis with BoE Wordlist Sentiment Score

In [None]:
# Aggregate sentiment scores and economic indicators monthly
boe_rf_monthly = boe_rf.resample('M').agg({
            'sentiment_score_lexicon_std': 'mean',
            'gpt_sentiment_std': 'mean',
            'uk_consumer_confidence': 'mean',
            'uk_inflation_rate_CPIH': 'mean',
            'uk_unemployment_rate': 'mean',
            'uk_gdp_growth': 'mean',
            'uk_interest_rate': 'mean',
            'gbp_usd_fx': 'mean',
            'ftse_250': 'mean',
            'gilts_short ': 'mean',
            'gilts_medium ': 'mean', 
            'gilts_long ': 'mean', 
            'uk_credit_growth_no_cc': 'mean',
            'uk_credit_growth_only_cc': 'mean',
            'avg_price_all_property_types': 'mean'
})

# Reset index to turn 'date_time' back into a column
boe_rf_monthly.reset_index(inplace=True)

In [None]:
# View the DataFrame
boe_rf_monthly.head()

In [None]:
# Create date-related features
boe_rf_monthly['month'] = boe_rf_monthly['date_time'].dt.month
boe_rf_monthly['quarter'] = boe_rf_monthly['date_time'].dt.quarter
boe_rf_monthly['year'] = boe_rf_monthly['date_time'].dt.year

In [None]:
# View the DataFrame
boe_rf_monthly.head()

In [None]:
# Create lagged feature for sentiment score with 1 month lag
boe_rf_monthly['sentiment_score_lexicon_std_lag_1m'] = boe_rf_monthly['sentiment_score_lexicon_std'].shift(1)

In [None]:
# Create lagged feature for sentiment score with 3 month lag
boe_rf_monthly['sentiment_score_lexicon_std_lag_3m'] = boe_rf_monthly['sentiment_score_lexicon_std'].shift(3)

In [None]:
# View the DataFrame
boe_rf_monthly.head()

In [None]:
# Drop first row(s) with NaN values due to lagging
boe_rf_monthly.dropna(inplace=True)
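
Called without arguments, `dropna` removes every row that contains any missing value, not only the first rows lost to lagging. If an indicator has gaps, months are dropped from the middle of the series too. The sketch below shows the difference on a small made-up frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame: a sentiment column and an indicator with one gap.
df = pd.DataFrame({
    'sentiment': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'indicator': [1.0, 1.1, 1.2, 1.3, np.nan, 1.5],
})
df['sentiment_lag_1m'] = df['sentiment'].shift(1)
df['sentiment_lag_3m'] = df['sentiment'].shift(3)

# dropna() with no arguments removes every row containing any NaN.
dropped_all = df.dropna()

# Restricting to the lag columns removes only the rows lost to lagging.
lag_cols = ['sentiment_lag_1m', 'sentiment_lag_3m']
dropped_lags = df.dropna(subset=lag_cols)

print(len(df), len(dropped_all), len(dropped_lags))
```

Comparing the row counts before and after the drop shows how many months are lost to indicator gaps instead of to lagging.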

In [None]:
# Prepare features (X)
feature_cols = ['sentiment_score_lexicon_std', 'sentiment_score_lexicon_std_lag_1m', 'sentiment_score_lexicon_std_lag_3m', 'month', 'quarter', 'year']
X = boe_rf_monthly[feature_cols]

# For each indicator (target)
for target in ['uk_consumer_confidence',
            'uk_inflation_rate_CPIH',
            'uk_unemployment_rate',
            'uk_gdp_growth',
            'uk_interest_rate',
            'gbp_usd_fx',
            'ftse_250',
            'gilts_short ',
            'gilts_medium ', 
            'gilts_long ', 
            'uk_credit_growth_no_cc',
            'uk_credit_growth_only_cc',
            'avg_price_all_property_types']:
            y = boe_rf_monthly[target]

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


# List of targets
targets = [
    'uk_consumer_confidence',
    'uk_inflation_rate_CPIH',
    'uk_unemployment_rate',
    'uk_gdp_growth',
    'uk_interest_rate',
    'gbp_usd_fx',
    'ftse_250',
    'gilts_short ',
    'gilts_medium ', 
    'gilts_long ', 
    'uk_credit_growth_no_cc',
    'uk_credit_growth_only_cc',
    'avg_price_all_property_types'
]

# Features
feature_cols = [
    'sentiment_score_lexicon_std', 
    'sentiment_score_lexicon_std_lag_1m',
    'sentiment_score_lexicon_std_lag_3m',
    'month', 'quarter', 'year'
]

# Loop through each target
for target in targets:
    print(f"\nTraining model for: {target}")
    y = boe_rf_monthly[target]
    X = boe_rf_monthly[feature_cols]

    # Time-series aware split without shuffling
    split_idx = int(len(boe_rf_monthly) * 0.8)
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Initialize and train the model
    rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)

    # Make predictions
    y_pred = rf.predict(X_test)

    # Evaluate
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"RMSE: {rmse:.3f}")
    print(f"R^2: {r2:.3f}")
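
The same split, fit and score loop recurs for every sentiment score and frequency in this section. As a minimal sketch, it could be wrapped in one helper that returns the metrics as a table. The function name `evaluate_targets`, the `train_frac` parameter and the synthetic `demo` frame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_targets(df, feature_cols, targets, train_frac=0.8):
    """Fit one random forest per target on a chronological split and collect the scores."""
    split_idx = int(len(df) * train_frac)
    X = df[feature_cols]
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]

    rows = []
    # dict.fromkeys drops repeated target names while keeping their order
    for target in dict.fromkeys(targets):
        y = df[target]
        y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

        rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)

        rows.append({
            'target': target,
            'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
            'r2': r2_score(y_test, y_pred),
        })
    return pd.DataFrame(rows).set_index('target')


# Synthetic stand-in data (NOT the project data), only to show the call
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'sentiment': rng.normal(size=60),
    'month': np.tile(np.arange(1, 13), 5),
    'indicator_a': rng.normal(size=60),
    'indicator_b': rng.normal(size=60),
})
results = evaluate_targets(demo, ['sentiment', 'month'],
                           ['indicator_a', 'indicator_b', 'indicator_a'])
print(results)
```

With the notebook's own frames, the call would presumably look like `evaluate_targets(boe_rf_monthly, feature_cols, targets)`. Returning a DataFrame rather than printing makes it easier to compare the monthly and quarterly runs side by side.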

### 5.2. Quarterly Analysis with BoE Wordlist Sentiment Score

In [None]:
# Aggregate sentiment scores and economic indicators quarterly
boe_rf_quarterly = boe_rf.resample('Q').agg({
            'sentiment_score_lexicon_std': 'mean',
            'gpt_sentiment_std': 'mean',
            'uk_consumer_confidence': 'mean',
            'uk_inflation_rate_CPIH': 'mean',
            'uk_unemployment_rate': 'mean',
            'uk_gdp_growth': 'mean',
            'uk_interest_rate': 'mean',
            'gbp_usd_fx': 'mean',
            'ftse_250': 'mean',
            'gilts_short ': 'mean',
            'gilts_medium ': 'mean', 
            'gilts_long ': 'mean', 
            'uk_credit_growth_no_cc': 'mean',
            'uk_credit_growth_only_cc': 'mean',
            'avg_price_all_property_types': 'mean'
})

# Reset index to turn 'date_time' back into a column
boe_rf_quarterly.reset_index(inplace=True)

In [None]:
# Create date-related features
boe_rf_quarterly['month'] = boe_rf_quarterly['date_time'].dt.month
boe_rf_quarterly['quarter'] = boe_rf_quarterly['date_time'].dt.quarter
boe_rf_quarterly['year'] = boe_rf_quarterly['date_time'].dt.year

In [None]:
# Create a lagged feature: a 1-quarter lag replaces the 3-month lag used in the monthly analysis
boe_rf_quarterly['sentiment_score_lexicon_std_lag_1q'] = boe_rf_quarterly['sentiment_score_lexicon_std'].shift(1)

In [None]:
# View the DataFrame
boe_rf_quarterly.head()

In [None]:
# Drop NaNs due to lag
boe_rf_quarterly.dropna(inplace=True)
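
To make the effect of the quarterly lag and the subsequent `dropna` concrete, here is a small sketch on synthetic numbers (not the project data). The column names `score` and `score_lag_1q` are illustrative assumptions. The sketch groups by calendar quarter with `dt.to_period('Q')` instead of `resample('Q')`, which yields the same quarterly means for this example.

```python
import pandas as pd

# Nine synthetic monthly observations (NOT the project data)
dates = pd.period_range('2020-01', periods=9, freq='M').to_timestamp()
demo = pd.DataFrame({
    'date_time': dates,
    'score': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

# Mean score per calendar quarter
quarterly = (
    demo.groupby(demo['date_time'].dt.to_period('Q'))['score']
        .mean()
        .to_frame()
)

# shift(1) moves each value down one row, so every quarter sees the previous quarter's score
quarterly['score_lag_1q'] = quarterly['score'].shift(1)
print(quarterly)

# The first quarter has no predecessor, so its lag is NaN and dropna removes that row
print(len(quarterly), '->', len(quarterly.dropna()))
```

Each added lag therefore costs one row at the start of the series, which matters more at quarterly frequency because there are only a third as many observations as in the monthly frame.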

In [None]:
# Prepare features (X)
feature_cols = ['sentiment_score_lexicon_std', 'sentiment_score_lexicon_std_lag_1q', 'month', 'quarter', 'year']
X = boe_rf_quarterly[feature_cols]

# For each indicator (target)
for target in ['uk_consumer_confidence',
            'uk_inflation_rate_CPIH',
            'uk_unemployment_rate',
            'uk_gdp_growth',
            'uk_interest_rate',
            'gbp_usd_fx',
            'ftse_250',
            'gilts_short ',
            'gilts_medium ', 
            'gilts_long ', 
            'uk_credit_growth_no_cc',
            'uk_credit_growth_only_cc',
            'avg_price_all_property_types']:
            y = boe_rf_quarterly[target]

In [None]:
# List of targets
targets = [
    'uk_consumer_confidence',
    'uk_inflation_rate_CPIH',
    'uk_unemployment_rate',
    'uk_gdp_growth',
    'uk_interest_rate',
    'gbp_usd_fx',
    'ftse_250',
    'gilts_short ',
    'gilts_medium ', 
    'gilts_long ', 
    'uk_credit_growth_no_cc',
    'uk_credit_growth_only_cc',
    'avg_price_all_property_types'
]

# Features
feature_cols = [
    'sentiment_score_lexicon_std', 
    'sentiment_score_lexicon_std_lag_1q',
    'month', 'quarter', 'year'
]

# Loop through each target
for target in targets:
    print(f"\nTraining model for: {target}")
    y = boe_rf_quarterly[target]
    X = boe_rf_quarterly[feature_cols]

    # Time-series aware split without shuffling
    split_idx = int(len(boe_rf_quarterly) * 0.8)
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Initialize and train the model
    rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)

    # Make predictions
    y_pred = rf.predict(X_test)

    # Evaluate
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"RMSE: {rmse:.3f}")
    print(f"R^2: {r2:.3f}")

### 5.3. Monthly Analysis with GPT Sentiment Score

In [None]:
# Aggregate sentiment scores and consumer confidence monthly
boe_rf_monthly_gpt = boe_rf.resample('M').agg({
            'gpt_sentiment_std': 'mean',
            'uk_consumer_confidence': 'mean',
            'uk_inflation_rate_CPIH': 'mean',
            'uk_unemployment_rate': 'mean',
            'uk_gdp_growth': 'mean',
            'uk_interest_rate': 'mean',
            'gbp_usd_fx': 'mean',
            'ftse_250': 'mean',
            'gilts_short ': 'mean',
            'gilts_medium ': 'mean', 
            'gilts_long ': 'mean', 
            'uk_credit_growth_no_cc': 'mean',
            'uk_credit_growth_only_cc': 'mean',
            'avg_price_all_property_types': 'mean'
})

# Reset index to turn 'date_time' back into a column
boe_rf_monthly_gpt.reset_index(inplace=True)

In [None]:
# View the DataFrame
boe_rf_monthly_gpt.head()

In [None]:
# Create date-related features
boe_rf_monthly_gpt['month'] = boe_rf_monthly_gpt['date_time'].dt.month
boe_rf_monthly_gpt['quarter'] = boe_rf_monthly_gpt['date_time'].dt.quarter
boe_rf_monthly_gpt['year'] = boe_rf_monthly_gpt['date_time'].dt.year

In [None]:
# View the DataFrame
boe_rf_monthly_gpt.head()

In [None]:
# Create lagged feature for sentiment score with 1 month lag
boe_rf_monthly_gpt['gpt_sentiment_std_lag_1m'] = boe_rf_monthly_gpt['gpt_sentiment_std'].shift(1)

In [None]:
# Create lagged feature for sentiment score with 3 month lag
boe_rf_monthly_gpt['gpt_sentiment_std_lag_3m'] = boe_rf_monthly_gpt['gpt_sentiment_std'].shift(3)

In [None]:
# View the DataFrame
boe_rf_monthly_gpt.head()

In [None]:
# Drop first row(s) with NaN values due to lagging
boe_rf_monthly_gpt.dropna(inplace=True)

In [None]:
# Prepare features (X)
feature_cols = ['gpt_sentiment_std', 'gpt_sentiment_std_lag_1m', 'gpt_sentiment_std_lag_3m', 'month', 'quarter', 'year']
X = boe_rf_monthly_gpt[feature_cols]

# For each indicator (target)
for target in ['uk_consumer_confidence',
            'uk_inflation_rate_CPIH',
            'uk_unemployment_rate',
            'uk_gdp_growth',
            'uk_interest_rate',
            'uk_consumer_confidence', 
            'gbp_usd_fx',
            'ftse_250',
            'gilts_short ',
            'gilts_medium ', 
            'gilts_long ', 
            'uk_credit_growth_no_cc',
            'uk_credit_growth_only_cc',
            'avg_price_all_property_types']:
            y = boe_rf_monthly_gpt[target]

In [None]:
# List of targets
targets = [
    'uk_consumer_confidence',
    'uk_inflation_rate_CPIH',
    'uk_unemployment_rate',
    'uk_gdp_growth',
    'uk_interest_rate',
    'gbp_usd_fx',
    'ftse_250',
    'gilts_short ',
    'gilts_medium ', 
    'gilts_long ', 
    'uk_credit_growth_no_cc',
    'uk_credit_growth_only_cc',
    'avg_price_all_property_types'
]

# Features
feature_cols = [
    'gpt_sentiment_std', 
    'gpt_sentiment_std_lag_1m',
    'gpt_sentiment_std_lag_3m',
    'month', 'quarter', 'year'
]

# Loop through each target
for target in targets:
    print(f"\nTraining model for: {target}")
    y = boe_rf_monthly_gpt[target]
    X = boe_rf_monthly_gpt[feature_cols]

    # Time-series aware split without shuffling
    split_idx = int(len(boe_rf_monthly_gpt) * 0.8)
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Initialize and train the model
    rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)

    # Make predictions
    y_pred = rf.predict(X_test)

    # Evaluate
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"RMSE: {rmse:.3f}")
    print(f"R^2: {r2:.3f}")

### 5.4. Quarterly Analysis with GPT Sentiment Score

In [None]:
# Aggregate sentiment scores and economic indicators quarterly
boe_rf_quarterly_gpt = boe_rf.resample('Q').agg({
            'gpt_sentiment_std': 'mean',
            'uk_consumer_confidence': 'mean',
            'uk_inflation_rate_CPIH': 'mean',
            'uk_unemployment_rate': 'mean',
            'uk_gdp_growth': 'mean',
            'uk_interest_rate': 'mean',
            'gbp_usd_fx': 'mean',
            'ftse_250': 'mean',
            'gilts_short ': 'mean',
            'gilts_medium ': 'mean', 
            'gilts_long ': 'mean', 
            'uk_credit_growth_no_cc': 'mean',
            'uk_credit_growth_only_cc': 'mean',
            'avg_price_all_property_types': 'mean'
})

# Reset index to turn 'date_time' back into a column
boe_rf_quarterly_gpt.reset_index(inplace=True)

In [None]:
# Create date-related features
boe_rf_quarterly_gpt['month'] = boe_rf_quarterly_gpt['date_time'].dt.month
boe_rf_quarterly_gpt['quarter'] = boe_rf_quarterly_gpt['date_time'].dt.quarter
boe_rf_quarterly_gpt['year'] = boe_rf_quarterly_gpt['date_time'].dt.year

In [None]:
# Create a lagged feature: a 1-quarter lag replaces the 3-month lag used in the monthly analysis
boe_rf_quarterly_gpt['gpt_sentiment_std_lag_1q'] = boe_rf_quarterly_gpt['gpt_sentiment_std'].shift(1)

In [None]:
# View the DataFrame
boe_rf_quarterly_gpt.head()

In [None]:
# Drop NaNs due to lag
boe_rf_quarterly_gpt.dropna(inplace=True)

In [None]:
# Prepare features (X)
feature_cols = ['gpt_sentiment_std', 'gpt_sentiment_std_lag_1q', 'month', 'quarter', 'year']
X = boe_rf_quarterly_gpt[feature_cols]

# For each indicator (target)
for target in ['uk_consumer_confidence',
            'uk_inflation_rate_CPIH',
            'uk_unemployment_rate',
            'uk_gdp_growth',
            'uk_interest_rate',
            'gbp_usd_fx',
            'ftse_250',
            'gilts_short ',
            'gilts_medium ', 
            'gilts_long ', 
            'uk_credit_growth_no_cc',
            'uk_credit_growth_only_cc',
            'avg_price_all_property_types']:
            y = boe_rf_quarterly_gpt[target]

In [None]:
# List of targets
targets = [
    'uk_consumer_confidence',
    'uk_inflation_rate_CPIH',
    'uk_unemployment_rate',
    'uk_gdp_growth',
    'uk_interest_rate',
    'gbp_usd_fx',
    'ftse_250',
    'gilts_short ',
    'gilts_medium ', 
    'gilts_long ', 
    'uk_credit_growth_no_cc',
    'uk_credit_growth_only_cc',
    'avg_price_all_property_types'
]

# Features
feature_cols = [
    'gpt_sentiment_std', 
    'gpt_sentiment_std_lag_1q',
    'month', 'quarter', 'year'
]

# Loop through each target
for target in targets:
    print(f"\nTraining model for: {target}")
    y = boe_rf_quarterly_gpt[target]
    X = boe_rf_quarterly_gpt[feature_cols]

    # Time-series aware split without shuffling
    split_idx = int(len(boe_rf_quarterly_gpt) * 0.8)
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Initialize and train the model
    rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)

    # Make predictions
    y_pred = rf.predict(X_test)

    # Evaluate
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"RMSE: {rmse:.3f}")
    print(f"R^2: {r2:.3f}")
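
Because every model above mixes sentiment features with calendar features (`month`, `quarter`, `year`), RMSE and R^2 alone do not show how much the sentiment columns contribute. One possible check, sketched here on synthetic data (not the project data), is to read the fitted forest's impurity-based `feature_importances_`. The `demo` frame and its `sentiment`, `year` and `indicator` columns are illustrative assumptions; the target is deliberately built from `year` only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the project data): the target depends on 'year' only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'sentiment': rng.normal(size=200),
    'year': rng.integers(2000, 2020, size=200),
})
demo['indicator'] = demo['year'] * 1.0 + rng.normal(scale=0.1, size=200)

feature_cols = ['sentiment', 'year']
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(demo[feature_cols], demo['indicator'])

# Impurity-based importances are normalised to sum to 1 across the features
importances = pd.Series(rf.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

In this constructed example `year` ranks first because the target was generated from it. Applied inside the loops above, the same two lines after `rf.fit(...)` would indicate, per target, whether the sentiment features or the calendar features carry most of the splits.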