## Bank speeches starter template

The [Kaggle data set](https://www.kaggle.com/datasets/davidgauthier/central-bank-speeches/data) used in this project contains speeches by senior central bankers from various influential central banks. The corpus runs from 1997 to 2022. Central banks are the institutions that set monetary policy, so their speeches are widely followed and have a major influence on financial markets.

You can also refer to the raw data set and the accompanying article [here](https://www.kaggle.com/datasets/magnushansson/central-bank-speeches).

Note that, because the data set contains a large number of speeches, processing-intensive steps such as sentiment analysis can take a long time on the full data set (30-60 minutes). It is recommended that you work with a subset of the data while prototyping and run the full data set, if required, only once the code behaves as expected. For example, you can restrict the data to speeches from the United Kingdom (or another country) or to a specific date range.
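
As a minimal sketch of the subsetting described above, the snippet below filters by country and by date range in one step. The `country`, `date` and `text` column names follow the speeches data set; the toy rows, the `df_demo` and `df_subset` names and the chosen date window are assumptions for illustration, and the sketch assumes the dates can be parsed by `pd.to_datetime`.

```python
import pandas as pd

# Toy stand-in for all_speeches.csv; only the columns used here are included.
df_demo = pd.DataFrame({
    "country": ["united kingdom", "united states", "united kingdom", "japan"],
    "date": ["1999-03-01", "2005-06-15", "2012-11-20", "2020-01-10"],
    "text": ["speech a", "speech b", "speech c", "speech d"],
})

# Parse dates once so that range comparisons are chronological, not textual.
df_demo["date"] = pd.to_datetime(df_demo["date"])

# Keep one country and a date window (both bounds inclusive).
start, end = pd.Timestamp("2010-01-01"), pd.Timestamp("2015-12-31")
mask = (df_demo["country"] == "united kingdom") & df_demo["date"].between(start, end)
df_subset = df_demo[mask].reset_index(drop=True)
print(df_subset)
```

Resetting the index keeps positional and label-based lookups aligned on the subset, which matters if later cells loop over `df_subset.index`.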

The code is not extensive and you will be expected to use the provided code as a starting point only. You will also need to use your own creativity and logic to identify useful patterns in the data. You can explore sentiment, polarity and entities/keywords, and should use appropriate levels of granularity and aggregation in order to analyse patterns contained in the data.

In [3]:
# Install the necessary libraries.
#!pip install nltk
#!pip install vaderSentiment
#!pip install textblob

!pip install vaderSentiment



In [6]:
# Import relevant libraries.
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from datetime import datetime

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to C:\Users\David
[nltk_data]     k\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\David
[nltk_data]     k\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to C:\Users\David
[nltk_data]     k\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\David
[nltk_data]     k\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
# Load dataset. Change directory as required.
df = pd.read_csv('all_speeches.csv')

In [None]:
df.head()

In [None]:
df.country.value_counts()

In [None]:
df[df['country']=='united kingdom'].sort_values('date').head()

In [None]:
# Demo: Example of adding a column to calculate the string length per speech.
df['len'] = df['text'].str.len()
df

In [None]:
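# Demo: Convert to lower case, collapse whitespace and remove punctuation.
# Note: this overwrites the original text. VADER uses capitalisation and
# punctuation as intensity cues, so consider keeping a copy of the raw text.
df['text'] = df['text'].apply(lambda text: " ".join(word.lower() for word in text.split()))
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df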
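
The cleaning step above replaces the `text` column in place. Because VADER treats capitalisation and punctuation as intensity cues, it may be worth keeping the raw text for sentiment scoring and using the cleaned version only for token-based work. The sketch below writes the cleaned text to a separate column; the `text_clean` column name and the toy sentences are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the speeches DataFrame.
df_demo = pd.DataFrame({
    "text": ["Inflation is HIGH!  Rates must rise.", "Growth, at last, returned."]
})

# Write the cleaned text to a new column so the original is preserved.
df_demo["text_clean"] = (
    df_demo["text"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
    .str.split()
    .str.join(" ")                            # collapse repeated whitespace
)
print(df_demo)
```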

In [None]:
# Subset the data to reduce processing time.
dfi = df[df['country']=='united kingdom'].sort_values('date').reset_index(drop=True)
dfi.info()

In [None]:
%%time
# Demo: Sentiment intensity analysis using VADER and a for loop.
sia = SentimentIntensityAnalyzer()

# Collect one score dictionary per speech and build the DataFrame once;
# concatenating inside the loop is much slower on large subsets.
scores = []
for j in dfi.index:
    # Select the speech by column name rather than by column position.
    scores.append(sia.polarity_scores(dfi.loc[j, 'text']))
dft = pd.DataFrame(scores, index=dfi.index)
dfi = pd.concat([dfi, dft], axis=1, join="inner")
dfi

In [None]:
%%time
# Demo: Using a user-defined function with TextBlob to calculate polarity and subjectivity.
def generate_polarity_subjectivity(dfs):
    dft2 = TextBlob(dfs).sentiment
    return pd.Series([dft2[0], dft2[1]])

# Apply the function to the data and add two new columns
dfi[['polarity','subjectivity']] = dfi['text'].apply(generate_polarity_subjectivity)
dfi.head()
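
Once each speech has a score, one way to look at patterns at a coarser granularity is to aggregate the scores per calendar month. The sketch below assumes a datetime `date` column alongside the `compound` and `polarity` columns created above; the toy values and the `scores` and `monthly` names are made up for illustration.

```python
import pandas as pd

# Toy per-speech scores; in the notebook these would come from dfi.
scores = pd.DataFrame({
    "date": pd.to_datetime(["2008-09-03", "2008-09-25", "2008-10-14", "2008-10-30"]),
    "compound": [0.50, -0.10, -0.40, -0.20],
    "polarity": [0.12, 0.04, -0.02, 0.00],
})

# Group by calendar month and summarise each score column.
monthly = (
    scores
    .groupby(scores["date"].dt.to_period("M"))[["compound", "polarity"]]
    .agg(["mean", "count"])
)
print(monthly)
```

Reporting the count next to the mean shows how many speeches each monthly average is based on, which helps when some months contain only one or two speeches.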

In [None]:
%%time
# Demo: Frequency distribution review of a single speech.

# Tokenise the text data.
stop_words=set(stopwords.words('english'))
filtered_text = []

# Example speech, selected by position with iloc (Hint: Can be used in loops if required).
# Selecting the 'text' column by name avoids relying on its column position.
tokenized_word = word_tokenize(dfi['text'].iloc[0])

# Filter the tokenised words.
for each_word in tokenized_word:
    if each_word.lower() not in stop_words and each_word.isalpha():
        filtered_text.append(each_word.lower())

# Display the filtered list.
#print('Tokenised list without stop words: {}'.format(filtered_text))

# Create a frequency distribution object.
freq_dist_of_words = FreqDist(filtered_text)

# Show the 10 most common words in the speech.
freq_dist_of_words.most_common(10)
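
The single-speech example above can be extended to several speeches by accumulating counts in a loop. The sketch below uses only the standard library so that it runs on its own: `collections.Counter` stands in for `FreqDist`, a simple regular expression stands in for `word_tokenize`, and the two toy speeches and the tiny stop-word set are assumptions for illustration.

```python
import re
from collections import Counter

# Toy speeches and a tiny stop-word list (for illustration only; the
# notebook uses NLTK's English stop words and word_tokenize).
speeches = [
    "Inflation is above target and inflation expectations matter.",
    "The committee expects inflation to return to target.",
]
stop_words_demo = {"is", "and", "the", "to", "above"}

corpus_counts = Counter()
for speech in speeches:
    # Simple alphabetic tokenisation; lower-case so counts ignore case.
    tokens = re.findall(r"[a-z]+", speech.lower())
    corpus_counts.update(t for t in tokens if t not in stop_words_demo)

print(corpus_counts.most_common(2))
```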

In [None]:
# Set plotting options and plot the data.
fig, ax = plt.subplots(dpi=100)
fig.set_size_inches(20, 10)
freq_dist_of_words.plot(25, cumulative=False)
plt.show()

## Imported data
The imported data is checked for duplicates, missing values, etc.

In [21]:
# Import the UK economic metrics (monthly data from 1997 onwards).
import pandas as pd

# Load the metrics workbook. Change the path as required.
df_uk_metrics = pd.read_excel('CPI_Unemployment_RPI_Bank rates_GDP.xlsx')

df_uk_metrics

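| | Year | Month | Unemployment % | CPI rate% | RPI% | Bank Rate % | GDP % | GDP(Billions of US $) | Per Capita (US $) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1997 | Jan | 7.5 | 2.1 | 3.2 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 1 | 1997 | Feb | 7.3 | 1.9 | 3.3 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 2 | 1997 | Mar | 7.2 | 1.7 | 3.3 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 3 | 1997 | Apr | 7.2 | 1.6 | 3.3 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 4 | 1997 | May | 7.2 | 1.6 | 3.3 | 6.25 | 4.5 | 1561.7 | 26779.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 313 | 2023 | Feb | 3.9 | 10.4 | 19.8 | 4.00 | NaN | NaN | NaN |
| 314 | 2023 | Mar | 3.8 | 10.1 | 20.2 | 4.25 | NaN | NaN | NaN |
| 315 | 2023 | Apr | 4.0 | 8.7 | 13.1 | 4.25 | NaN | NaN | NaN |
| 316 | 2023 | May | 4.2 | 8.7 | 13.7 | 4.50 | NaN | NaN | NaN |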


In [22]:
# Convert 'Year' and 'Month' to the 'YYYY-MM' format
df_uk_metrics['YYYY-MM'] = pd.to_datetime(df_uk_metrics['Year'].astype(str) + '-' + df_uk_metrics['Month'], format='%Y-%b').dt.to_period('M')

# Reorder columns with 'YYYY-MM' as the first column
df_uk_metrics = df_uk_metrics[['YYYY-MM'] + [col for col in df_uk_metrics.columns if col not in ['Year', 'Month', 'YYYY-MM']]]

# Display the updated DataFrame
df_uk_metrics



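| | YYYY-MM | Unemployment % | CPI rate% | RPI% | Bank Rate % | GDP % | GDP(Billions of US $) | Per Capita (US $) |
|---|---|---|---|---|---|---|---|---|
| 0 | 1997-01 | 7.5 | 2.1 | 3.2 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 1 | 1997-02 | 7.3 | 1.9 | 3.3 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 2 | 1997-03 | 7.2 | 1.7 | 3.3 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 3 | 1997-04 | 7.2 | 1.6 | 3.3 | 5.94 | 4.5 | 1561.7 | 26779.8 |
| 4 | 1997-05 | 7.2 | 1.6 | 3.3 | 6.25 | 4.5 | 1561.7 | 26779.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 313 | 2023-02 | 3.9 | 10.4 | 19.8 | 4.00 | NaN | NaN | NaN |
| 314 | 2023-03 | 3.8 | 10.1 | 20.2 | 4.25 | NaN | NaN | NaN |
| 315 | 2023-04 | 4.0 | 8.7 | 13.1 | 4.25 | NaN | NaN | NaN |
| 316 | 2023-05 | 4.2 | 8.7 | 13.7 | 4.50 | NaN | NaN | NaN |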
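
With the metrics keyed by a monthly period, one possible next step is to join them to speech sentiment aggregated to the same period. The sketch below is an illustration under assumptions, not part of the original notebook: the `metrics` frame copies the bank rate from the first three rows above, while the `speeches` frame, its `compound` values and the `monthly` and `merged` names are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_uk_metrics (bank rate copied from the first rows above).
metrics = pd.DataFrame({
    "YYYY-MM": pd.period_range("1997-01", periods=3, freq="M"),
    "Bank Rate %": [5.94, 5.94, 5.94],
})

# Hypothetical per-speech scores with a datetime 'date' column.
speeches = pd.DataFrame({
    "date": pd.to_datetime(["1997-01-14", "1997-01-28", "1997-03-05"]),
    "compound": [0.6, 0.2, -0.1],
})

# Average the scores per month, then left-join onto the metrics so that
# months without speeches are kept with NaN sentiment.
speeches["YYYY-MM"] = speeches["date"].dt.to_period("M")
monthly = speeches.groupby("YYYY-MM", as_index=False)["compound"].mean()
merged = metrics.merge(monthly, on="YYYY-MM", how="left")
print(merged)
```

A left join from the metrics side keeps every month in the result, so gaps in the speech record stay visible instead of silently dropping rows.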


In [23]:
# Check for missing values in each column.
missing_values = df_uk_metrics.isnull().sum()
print(missing_values)

YYYY-MM                  0
Unemployment %           0
CPI rate%                0
RPI%                     0
Bank Rate %              0
GDP %                    6
GDP(Billions of US $)    6
Per Capita (US $)        6
dtype: int64


In [24]:
# Check for NaN values. Note: isna() is an alias of isnull(), so this
# returns the same counts as the previous cell.
nan_values = df_uk_metrics.isna().sum()
print(nan_values)

YYYY-MM                  0
Unemployment %           0
CPI rate%                0
RPI%                     0
Bank Rate %              0
GDP %                    6
GDP(Billions of US $)    6
Per Capita (US $)        6
dtype: int64
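
The column totals above show how many values are missing but not which rows they are in, and they do not cover duplicates. A minimal sketch of both checks follows; the toy frame uses made-up values and the `metrics`, `n_duplicates`, `rows_with_gaps` and `metrics_clean` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_uk_metrics with one repeated row and one gap
# (made-up values, not taken from the workbook).
metrics = pd.DataFrame({
    "YYYY-MM": ["2000-01", "2000-02", "2000-02", "2000-03"],
    "CPI rate%": [1.0, 2.0, 2.0, 3.0],
    "GDP %": [0.5, 0.5, 0.5, np.nan],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = metrics.duplicated().sum()

# Show the rows that contain at least one missing value.
rows_with_gaps = metrics[metrics.isna().any(axis=1)]

# Drop the repeated rows; missing values are left for a separate decision.
metrics_clean = metrics.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(rows_with_gaps)
```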
