## Bank speeches starter template

The [Kaggle data set](https://www.kaggle.com/datasets/davidgauthier/central-bank-speeches/data) used in this project contains speeches by senior central bankers from a number of influential central banks. The corpus runs from 1997 to 2022. Central banks are the institutions that set monetary policy, so their speeches are widely followed and can have a major influence on financial markets.

You can also refer to the raw data set and article used [here](https://www.kaggle.com/datasets/magnushansson/central-bank-speeches).

Note that, because the data set contains a large number of speeches, processing-intensive steps such as sentiment analysis can take a long time on the full data set (30-60 mins). It is recommended that you work on a subset of the data while prototyping and run the full data set, if required, only once the code behaves as expected. For example, you can restrict the data to the United Kingdom (or another country) or to a specific date range.
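A minimal sketch of the subsetting described above, using a small made-up frame in place of `all_speeches.csv`. The `country` and `date` column names match those used later in this notebook; the rows and the `subset_speeches` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical stand-in for the speeches data; the real file has many more rows.
speeches = pd.DataFrame({
    'country': ['united kingdom', 'united kingdom', 'japan', 'united kingdom'],
    'date': ['1999-03-01', '2008-10-15', '2008-11-02', '2021-06-30'],
    'text': ['speech a', 'speech b', 'speech c', 'speech d'],
})

def subset_speeches(df, country=None, start=None, end=None):
    """Return rows matching a country and/or an inclusive date range."""
    dates = pd.to_datetime(df['date'])
    mask = pd.Series(True, index=df.index)
    if country is not None:
        mask &= df['country'] == country
    if start is not None:
        mask &= dates >= pd.Timestamp(start)
    if end is not None:
        mask &= dates <= pd.Timestamp(end)
    return df[mask].reset_index(drop=True)

# UK speeches between 2007 and 2022 only.
uk_recent = subset_speeches(speeches, country='united kingdom',
                            start='2007-01-01', end='2022-12-31')
print(uk_recent)
```

Prototyping on a frame like `uk_recent` keeps the slow sentiment steps short; the same function can be called without arguments to fall back to the full data.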

The code is not extensive and you will be expected to use the provided code as a starting point only. You will also need to use your own creativity and logic to identify useful patterns in the data. You can explore sentiment, polarity and entities/keywords, and should use appropriate levels of granularity and aggregation in order to analyse patterns contained in the data.
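As an illustration of what different levels of granularity and aggregation can look like, the sketch below groups per-speech sentiment scores by month and by year. The scores are made up; in practice they would come from a sentiment step such as VADER or TextBlob, and the `compound` column name is an assumption borrowed from VADER's output.

```python
import pandas as pd

# Hypothetical per-speech sentiment scores.
scores = pd.DataFrame({
    'date': pd.to_datetime(['2008-09-10', '2008-09-25', '2008-10-05', '2009-01-20']),
    'compound': [0.2, -0.4, -0.6, 0.5],
})

# Monthly granularity: mean score and number of speeches per calendar month.
monthly = scores.groupby(scores['date'].dt.to_period('M'))['compound'].agg(['mean', 'count'])

# Yearly granularity: a coarser view of the same data.
yearly = scores.groupby(scores['date'].dt.year)['compound'].mean()

print(monthly)
print(yearly)
```

Keeping the speech count next to the mean helps to spot periods where an average rests on only one or two speeches.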

In [10]:
# Install the necessary libraries.
#!pip install nltk
#!pip install vaderSentiment
#!pip install textblob

!pip install vaderSentiment



In [12]:
# Import relevant libraries.
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from datetime import datetime

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
stop_words = set(stopwords.words('english'))

In [None]:
# Load dataset. Change directory as required.
df = pd.read_csv('data2/all_speeches.csv')

In [None]:
df.head()

In [None]:
df.country.value_counts()

In [None]:
df[df['country']=='united kingdom'].sort_values('date').head()

In [None]:
# Demo: Example of adding a column to calculate the string length per speech.
df['len'] = df['text'].str.len()
df

In [None]:
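# Demo: Convert to lower case, collapse repeated whitespace and remove punctuation.
df['text'] = df['text'].apply(lambda x: " ".join(word.lower() for word in x.split()))
# Use a raw string so that \w and \s are passed to the regex engine unchanged.
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)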
df
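To see what this cleaning step does to a single string, here is a small standalone sketch using the standard `re` module with the same pattern. The sample sentence is made up.

```python
import re

sample = "Inflation,  at 2.1%, remains 'well-anchored'."

# Lower-case each word and collapse repeated whitespace.
lowered = " ".join(word.lower() for word in sample.split())

# Remove every character that is neither a word character nor whitespace.
cleaned = re.sub(r'[^\w\s]', '', lowered)
print(cleaned)
```

Note that decimal points and hyphens are removed along with other punctuation, so `2.1` becomes `21` and `well-anchored` becomes `wellanchored`. This may matter if numbers or hyphenated terms are relevant to your analysis.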

In [None]:
# Subset the data to reduce processing time.
dfi = df[df['country']=='united kingdom'].sort_values('date').reset_index(drop=True)
dfi.info()

In [None]:
%%time
# Demo: Sentiment intensity analysis using VADER and a for loop.
sia = SentimentIntensityAnalyzer()
scores = []
for j in dfi.index:
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'.
    scores.append(sia.polarity_scores(dfi.loc[j, 'text']))
# Build the score columns in one step rather than concatenating row by row.
dft = pd.DataFrame(scores, index=dfi.index, columns=['neg', 'neu', 'pos', 'compound'])
dfi = pd.concat([dfi, dft], axis=1, join="inner")
dfi

In [None]:
%%time
# Demo: Using a user-defined function with TextBlob to calculate polarity and subjectivity.
def generate_polarity_subjectivity(dfs):
    dft2 = TextBlob(dfs).sentiment
    return pd.Series([dft2[0], dft2[1]])

# Apply the function to the data and add two new columns
dfi[['polarity','subjectivity']] = dfi['text'].apply(generate_polarity_subjectivity)
dfi.head()

In [None]:
%%time
# Demo: Frequency distribution review of a single speech.

# Tokenise the text data.
stop_words=set(stopwords.words('english'))
filtered_text = []

# Example speech using iloc to reference (Hint: Can be used in loops if required).
tokenized_word = word_tokenize(dfi.iloc[0,6])

# Filter the tokenised words.
for each_word in tokenized_word:
    if each_word.lower() not in stop_words and each_word.isalpha():
        filtered_text.append(each_word.lower())

# Display the filtered list.
#print('Tokenised list without stop words: {}'.format(filtered_text))

# Create a frequency distribution object.
freq_dist_of_words = FreqDist(filtered_text)

# Show the ten most common elements in the data set.
freq_dist_of_words.most_common(10)

In [None]:
# Set plotting options and plot the data.
fig, ax = plt.subplots(dpi=100)
fig.set_size_inches(20, 10)
freq_dist_of_words.plot(25, cumulative=False)
plt.show()
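The single-speech frequency count above can be extended to several speeches with a loop, as the hint suggests. The sketch below uses only the standard library: `collections.Counter` stands in for `FreqDist`, a simple regular expression approximates `word_tokenize` plus the `isalpha()` filter, and the two texts and the short stop word set are made up for illustration.

```python
from collections import Counter
import re

# Hypothetical mini-corpus and stop word list; in the notebook these would be
# dfi['text'] and NLTK's English stop words.
texts = [
    'Inflation is expected to fall as the bank rate rises.',
    'The bank expects inflation to remain above target.',
]
stop_words = {'is', 'to', 'as', 'the', 'above'}

word_counts = Counter()
for text in texts:
    # Lower-case and keep alphabetic tokens only.
    tokens = re.findall(r'[a-z]+', text.lower())
    word_counts.update(t for t in tokens if t not in stop_words)

print(word_counts.most_common(3))
```

Accumulating into one counter gives corpus-level frequencies; keeping one counter per speech (or per year) instead would allow comparisons over time.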

## Imported data

The imported data was checked for duplicates, missing values, etc.
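The checks themselves are not shown in this notebook. A minimal sketch of how such checks might look in pandas, on a made-up frame with one duplicated row and one missing value (the column names mirror the metrics file, the values are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one duplicated row and one missing value.
metrics = pd.DataFrame({
    'Year': [1997, 1997, 1997, 1998],
    'Month': ['Jan', 'Feb', 'Feb', 'Jan'],
    'CPI rate%': [2.1, 1.9, 1.9, np.nan],
})

n_duplicates = metrics.duplicated().sum()   # rows identical to an earlier row
missing_per_column = metrics.isna().sum()   # NaN count per column

print(f'Duplicate rows: {n_duplicates}')
print(missing_per_column)

# One possible clean-up: drop exact duplicates, then rows missing a CPI value.
cleaned = metrics.drop_duplicates().dropna(subset=['CPI rate%']).reset_index(drop=True)
```

Whether to drop, fill or keep missing values depends on the analysis; dropping is only one option.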

In [14]:
# Import the UK economic metrics (data from 1997 onwards).
import pandas as pd

# Load the metrics workbook.
df_uk_metrics = pd.read_excel('CPI_Unemployment_RPI_Bank rates_GDP.xlsx')

df_uk_metrics

Unnamed: 0,Year,Month,Unemployment %,CPI rate%,GDP,RPI%,Bank Rate %
0,1997,Jan,7.5,2.1,63.3398,3.2,5.94
1,1997,Feb,7.3,1.9,63.9959,3.3,5.94
2,1997,Mar,7.2,1.7,64.0355,3.3,5.94
3,1997,Apr,7.2,1.6,64.6273,3.3,5.94
4,1997,May,7.2,1.6,64.1371,3.3,6.25
...,...,...,...,...,...,...,...
313,2023,Feb,3.9,10.4,102.5735,19.8,4.00
314,2023,Mar,3.8,10.1,102.2569,20.2,4.25
315,2023,Apr,4.0,8.7,102.4903,13.1,4.25
316,2023,May,4.2,8.7,102.2648,13.7,4.50


In [32]:
# Keep only the rows from 2007 to 2022 (inclusive).
filtered_df = df_uk_metrics[(df_uk_metrics['Year'] >= 2007) & (df_uk_metrics['Year'] <= 2022)]

filtered_df

Unnamed: 0,Year,Month,Unemployment %,CPI rate%,GDP,RPI%,Bank Rate %
120,2007,Jan,5.5,2.7,84.7402,6.1,5.25
121,2007,Feb,5.5,2.8,85.0445,6.3,5.25
122,2007,Mar,5.5,3.1,85.2548,5.9,5.25
123,2007,Apr,5.4,2.8,85.3718,4.9,5.25
124,2007,May,5.4,2.5,85.7773,4.6,5.50
...,...,...,...,...,...,...,...
307,2022,Aug,3.6,9.9,102.0639,14.8,2.25
308,2022,Sep,3.7,10.1,101.4602,15.4,1.75
309,2022,Oct,3.7,11.1,102.1466,19.9,1.75
310,2022,Nov,3.7,10.7,102.2050,19.9,3.50


In [33]:
# Convert 'Year' and 'Month' to a monthly period in 'YYYY-MM' format.
# Work on an explicit copy so that adding a column does not trigger
# pandas' SettingWithCopyWarning.
filtered_df = filtered_df.copy()
filtered_df['YYYY-MM'] = pd.to_datetime(filtered_df['Year'].astype(str) + '-' + filtered_df['Month'], format='%Y-%b').dt.to_period('M')

# Reorder columns with 'YYYY-MM' as the first column
filtered_df = filtered_df[['YYYY-MM'] + [col for col in filtered_df.columns if col not in ['Year', 'Month', 'YYYY-MM']]]

# Display the updated DataFrame
filtered_df


Unnamed: 0,YYYY-MM,Unemployment %,CPI rate%,GDP,RPI%,Bank Rate %
120,2007-01,5.5,2.7,84.7402,6.1,5.25
121,2007-02,5.5,2.8,85.0445,6.3,5.25
122,2007-03,5.5,3.1,85.2548,5.9,5.25
123,2007-04,5.4,2.8,85.3718,4.9,5.25
124,2007-05,5.4,2.5,85.7773,4.6,5.50
...,...,...,...,...,...,...
307,2022-08,3.6,9.9,102.0639,14.8,2.25
308,2022-09,3.7,10.1,101.4602,15.4,1.75
309,2022-10,3.7,11.1,102.1466,19.9,1.75
310,2022-11,3.7,10.7,102.2050,19.9,3.50


In [16]:
# Import the Bassam metrics data (FTSE etc.).
# Load the metrics workbook.
df_uk_other_metrics = pd.read_excel('Bassam_metrics.xlsx')

df_uk_other_metrics 

Unnamed: 0,Date,FTSE,Unemployment rate (aged 16 and over),CPI,RPI,GDP,BoE Interest Rate,GDP ( Billions of US $),Per Capita (US $),Annual % Change
0,2007-05,38.5,5.4,2.5,4.3,64.1,6.25,3093.0,50438.2,2.6
1,2007-06,36.5,5.3,2.4,4.3,64.5,6.50,3093.0,50438.2,2.6
2,2007-07,35.2,5.3,1.9,4.3,64.8,6.75,3093.0,50438.2,2.6
3,2007-08,37.8,5.3,1.8,4.3,64.9,7.00,3093.0,50438.2,2.6
4,2007-09,39.6,5.2,1.8,4.3,65.0,7.00,3093.0,50438.2,2.6
...,...,...,...,...,...,...,...,...,...,...
183,2022-08,98.2,3.6,9.9,11.6,87.3,0.50,3070.7,45850.4,4.1
184,2022-09,95.4,3.7,10.1,11.6,86.3,0.50,3070.7,45850.4,4.1
185,2022-10,94.7,3.7,11.1,11.6,86.7,0.50,3070.7,45850.4,4.1
186,2022-11,101.9,3.7,10.7,11.6,86.9,0.50,3070.7,45850.4,4.1


In [22]:
# Import the quarterly GDP growth figures and view them.

df_gdp_growth = pd.read_excel('all_GDP.xlsx')

df_gdp_growth

Unnamed: 0,Year,Quarter,Quarter on Quarter growth%
0,1956,Q1,1.1
1,1956,Q2,-0.1
2,1956,Q3,-0.1
3,1956,Q4,0.6
4,1957,Q1,1.9
...,...,...,...
266,2022,Q3,-0.1
267,2022,Q4,0.1
268,2023,Q1,0.3
269,2023,Q2,0.2
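The quarterly figures can be given a period column in the same spirit as the `YYYY-MM` conversion applied to the monthly metrics. The sketch below is one way to do this, assuming the `Year` and `Quarter` columns shown above; the three rows are copied from the output for illustration, and the `YYYY-QQ` column name is hypothetical.

```python
import pandas as pd

# A few rows in the same shape as df_gdp_growth.
gdp = pd.DataFrame({
    'Year': [2022, 2022, 2023],
    'Quarter': ['Q3', 'Q4', 'Q1'],
    'Quarter on Quarter growth%': [-0.1, 0.1, 0.3],
})

# Build strings such as '2022Q3' and parse them as quarterly periods.
gdp['YYYY-QQ'] = pd.PeriodIndex(gdp['Year'].astype(str) + gdp['Quarter'], freq='Q')

# Keep the same 2007-2022 window used for the monthly metrics.
gdp_window = gdp[(gdp['Year'] >= 2007) & (gdp['Year'] <= 2022)]
print(gdp_window)
```

A quarterly period column makes it possible to align this table with monthly data by converting the monthly periods to quarters before joining.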
