Importing both the ECB and FX CSV files

In [65]:
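import pandas as pd
import numpy as np
import nltk

nltk.download('punkt')  # tokenizer models used by nltk.word_tokenize

# custom English stop-word list, split on whitespace
with open('stop_words_english.txt', 'r') as f:
    stop_words = f.read().split()

# the speeches file is pipe-delimited; only the date and the speech text are needed
ECB = pd.read_csv("speeches.csv", sep='|', usecols=['date', 'contents'])
# skip the first 6 lines of the FX file and name the columns explicitly
FX = pd.read_csv("fx.csv", skiprows=6, names=["date", "value", "status", "comment"])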

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [53]:
FX

Unnamed: 0,date,value,status,comment
0,2021-11-19,1.1271,Normal value (A),
1,2021-11-18,1.1345,Normal value (A),
2,2021-11-17,1.1316,Normal value (A),
3,2021-11-16,1.1368,Normal value (A),
4,2021-11-15,1.1444,Normal value (A),
...,...,...,...,...
5917,1999-01-08,1.1659,Normal value (A),
5918,1999-01-07,1.1632,Normal value (A),
5919,1999-01-06,1.1743,Normal value (A),
5920,1999-01-05,1.1790,Normal value (A),


Merging the datasets into a single DataFrame

In [41]:
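# left join: keep every FX date and attach the speech text where one exists
df = FX.join(ECB.set_index('date'), on='date')
# a date with several speeches yields several rows; keep only the first per date
df = df.drop_duplicates(['date'])
df = df.sort_values('date', ascending=True)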
df

Unnamed: 0,date,value,status,comment,contents
5921,1999-01-04,1.1789,Normal value (A),,
5920,1999-01-05,1.1790,Normal value (A),,
5919,1999-01-06,1.1743,Normal value (A),,
5918,1999-01-07,1.1632,Normal value (A),,
5917,1999-01-08,1.1659,Normal value (A),,
...,...,...,...,...,...
4,2021-11-15,1.1444,Normal value (A),,
3,2021-11-16,1.1368,Normal value (A),,
2,2021-11-17,1.1316,Normal value (A),,
1,2021-11-18,1.1345,Normal value (A),,
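
Note that drop_duplicates(['date']) in the cell above keeps only the first row per date, so any further speeches given on the same day are discarded. One possible alternative, sketched here on made-up toy frames (fx_demo, ecb_demo and their values are hypothetical, not the real data), is to concatenate all speeches of a day before joining:

```python
import pandas as pd

# Toy stand-ins for the FX and ECB frames (values are made up for illustration)
fx_demo = pd.DataFrame({"date": ["2020-01-02", "2020-01-03"], "value": ["1.10", "1.11"]})
ecb_demo = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-02"],
    "contents": ["first speech", "second speech"],
})

# Concatenate all speeches given on the same date into one string
speeches_per_day = (
    ecb_demo.dropna(subset=["contents"])
    .groupby("date")["contents"]
    .agg(" ".join)
)

# Left join keeps every FX date; dates without a speech get NaN in 'contents'
merged_demo = fx_demo.join(speeches_per_day, on="date").sort_values("date")
print(merged_demo)
```

With this variant the join already produces one row per date, so no duplicates need to be dropped afterwards.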


In [17]:
df.dtypes

date         object
value        object
status       object
comment     float64
contents     object
dtype: object

In [42]:
# replace the placeholder '-' entries in the value column with NaN
df['value'] = df['value'].replace('-', np.nan)

In [43]:
# 62 invalid value entries were converted to NaN
df['value'].isna().sum()

62

In [44]:
# forward-fill: replace each NaN with the most recent valid value
df['value'] = df['value'].ffill()
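
The cleaning steps above (mark '-' as missing, count the gaps, carry the previous value forward) can also be sketched with pd.to_numeric, which converts the column to float at the same time. The sample series below is made up for illustration; it only mimics the shape of the value column.

```python
import pandas as pd

# Made-up sample mimicking the 'value' column, where '-' marks a missing entry
raw = pd.Series(["1.1789", "-", "1.1743", "-", "1.1659"])

# errors="coerce" turns anything that is not a number into NaN
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# carry the last valid value forward into the gaps
filled = numeric.ffill()
print(n_missing, filled.tolist())
```

One caveat of this approach: every non-numeric string becomes NaN, not only '-', so it is worth checking the count against expectations.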

In [45]:
# daily rate of return: (value_t - value_{t-1}) / value_{t-1}
value = df['value'].astype(float)
df['return'] = (value - value.shift()) / value.shift()
df

Unnamed: 0,date,value,status,comment,contents,return
5921,1999-01-04,1.1789,Normal value (A),,,
5920,1999-01-05,1.1790,Normal value (A),,,0.000085
5919,1999-01-06,1.1743,Normal value (A),,,-0.003986
5918,1999-01-07,1.1632,Normal value (A),,,-0.009452
5917,1999-01-08,1.1659,Normal value (A),,,0.002321
...,...,...,...,...,...,...
4,2021-11-15,1.1444,Normal value (A),,,-0.000349
3,2021-11-16,1.1368,Normal value (A),,,-0.006641
2,2021-11-17,1.1316,Normal value (A),,,-0.004574
1,2021-11-18,1.1345,Normal value (A),,,0.002563
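
As a cross-check of the return formula, the same quantity can be obtained with pandas' built-in pct_change. The sketch below uses the first four rates from the table above; the variable names are illustrative only.

```python
import pandas as pd

# First four rates from the table above (1999-01-04 to 1999-01-07)
prices = pd.Series([1.1789, 1.1790, 1.1743, 1.1632])

# Manual formula, as in the cell above
manual = (prices - prices.shift()) / prices.shift()
# Built-in equivalent
builtin = prices.pct_change()

# Both give NaN for the first row and matching returns afterwards
pd.testing.assert_series_equal(manual, builtin)
print(builtin.round(6).tolist())
```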


In [46]:
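# flag a day as good news if the return is at least +0.5%,
# and as bad news if it is at most -0.5% (flags are stored as the strings '1'/'0')
df['good_news'] = np.where(df['return'] >= 0.005, '1', '0')
df['bad_news'] = np.where(df['return'] <= -0.005, '1', '0')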
df


Unnamed: 0,date,value,status,comment,contents,return,good_news,bad_news
5921,1999-01-04,1.1789,Normal value (A),,,,0,0
5920,1999-01-05,1.1790,Normal value (A),,,0.000085,0,0
5919,1999-01-06,1.1743,Normal value (A),,,-0.003986,0,0
5918,1999-01-07,1.1632,Normal value (A),,,-0.009452,0,1
5917,1999-01-08,1.1659,Normal value (A),,,0.002321,0,0
...,...,...,...,...,...,...,...,...
4,2021-11-15,1.1444,Normal value (A),,,-0.000349,0,0
3,2021-11-16,1.1368,Normal value (A),,,-0.006641,0,1
2,2021-11-17,1.1316,Normal value (A),,,-0.004574,0,0
1,2021-11-18,1.1345,Normal value (A),,,0.002563,0,0
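
The flags above are stored as the strings '1' and '0'. A possible variant, sketched below under the assumption that integer flags are acceptable, builds them directly from the comparisons. THRESHOLD and the sample returns are illustrative; the last return value is made up.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.005  # 0.5% daily move, the cutoff used above

# NaN for the first day, two returns from the table above, one made-up value
returns = pd.Series([np.nan, 0.000085, -0.009452, 0.0062])

# Comparisons with NaN evaluate to False, so the first row gets 0 in both columns
good = (returns >= THRESHOLD).astype(int)
bad = (returns <= -THRESHOLD).astype(int)
print(good.tolist(), bad.tolist())
```

If integer flags were used, any later filter on these columns would have to compare against 1 rather than the string '1'.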


In [47]:
# keep only the dates that have a speech (non-missing contents)
dffinal = df.dropna(subset=['contents'])
dffinal

Unnamed: 0,date,value,status,comment,contents,return,good_news,bad_news
5913,1999-01-14,1.1653,Normal value (A),,The euro has arrived Speech by the Preside...,-0.007749,0,1
5912,1999-01-15,1.1626,Normal value (A),,European economic and monetary union - lates...,-0.002317,0,0
5911,1999-01-18,1.1612,Normal value (A),,Hearing at the European Parliament's Sub-Com...,-0.001204,0,0
5906,1999-01-25,1.1584,Normal value (A),,Finnish savers and investors in the euro are...,0.001470,0,0
5903,1999-01-28,1.1410,Normal value (A),,The euro - four weeks after the start Prof...,-0.010322,0,1
...,...,...,...,...,...,...,...,...
29,2021-10-11,1.1574,Normal value (A),,SPEECH The monetary policy toolbox and the...,0.000432,0,0
28,2021-10-12,1.1555,Normal value (A),,SPEECH The contribution of finance to comb...,-0.001642,0,0
26,2021-10-14,1.1602,Normal value (A),,SPEECH IMFC Statement Statement by Chri...,0.003460,0,0
23,2021-10-19,1.1655,Normal value (A),,SPEECH “Hic sunt leones” – open research q...,0.004395,0,0


In [84]:
# split into two datasets: good-news days and bad-news days
good_news = dffinal[dffinal['good_news'] == '1']
bad_news = dffinal[dffinal['bad_news'] == '1']

In [96]:
# additional non-English stop words that appear in the speeches
stop_words2 = ['la','der','und','en']
stop_words.extend(stop_words2)
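
Because the stop-word list is consulted once per token, membership tests on a plain list can become slow for long speeches. A minimal sketch of the usual remedy, converting the list to a set, is shown below. stop_words_demo is a hypothetical, shortened stand-in for the contents of stop_words_english.txt.

```python
# Hypothetical, shortened stop-word list standing in for stop_words_english.txt
stop_words_demo = ["the", "of", "and", "a"]
stop_words_demo.extend(["la", "der", "und", "en"])

# A set gives constant-time average lookups instead of scanning the list
stop_set = set(stop_words_demo)

tokens = ["the", "euro", "und", "der", "policy", "of", "ecb"]
kept = [t for t in tokens if t not in stop_set]
print(kept)
```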

In [60]:
# lowercase each speech and tokenize it; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
good_news = good_news.assign(tokens=good_news.contents.str.lower().apply(nltk.word_tokenize))


In [77]:
# flatten the per-speech token lists, keeping only alphanumeric tokens
goodwords = good_news.tokens.tolist()
goodwords = [word for tokens in goodwords for word in tokens if word.isalnum()]

In [97]:
goodwordsfinal = nltk.FreqDist(w for w in goodwords if w not in stop_words)

In [98]:
good_indicators = pd.DataFrame(goodwordsfinal.most_common(20), columns=['word', 'freq'])
good_indicators

Unnamed: 0,word,freq
0,euro,5142
1,financial,4623
2,policy,4368
3,monetary,4208
4,area,3554
5,ecb,2732
6,central,2682
7,economic,2681
8,market,2565
9,die,2504


In [100]:
# lowercase each speech and tokenize it; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
bad_news = bad_news.assign(tokens=bad_news.contents.str.lower().apply(nltk.word_tokenize))


In [101]:
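# flatten the per-speech token lists, keeping only alphanumeric tokens
badwords = bad_news.tokens.tolist()
badwords = [word for tokens in badwords for word in tokens if word.isalnum()]
# count the remaining tokens after removing stop words
badwordsfinal = nltk.FreqDist(w for w in badwords if w not in stop_words)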

In [102]:
bad_indicators = pd.DataFrame(badwordsfinal.most_common(20), columns=['word', 'freq'])
bad_indicators

Unnamed: 0,word,freq
0,euro,5946
1,financial,4913
2,policy,4419
3,monetary,4323
4,area,4192
5,banks,2962
6,market,2761
7,central,2757
8,economic,2721
9,ecb,2658
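
The two top-word lists are dominated by the same terms (euro, financial, policy, monetary, area), so raw counts alone do little to separate good-news from bad-news speeches. One possible follow-up, sketched below on made-up toy tokens rather than the real goodwords and badwords, is to rank words by the difference in their relative frequency between the two sets. All names and data here are illustrative assumptions.

```python
from collections import Counter

# Toy token lists standing in for goodwords / badwords after stop-word removal
good_tokens = ["euro", "policy", "growth", "growth", "recovery"]
bad_tokens = ["euro", "policy", "banks", "banks", "crisis"]

good_counts, bad_counts = Counter(good_tokens), Counter(bad_tokens)
good_total, bad_total = sum(good_counts.values()), sum(bad_counts.values())

# Relative frequency in the good set minus relative frequency in the bad set
vocab = set(good_counts) | set(bad_counts)
diff = {w: good_counts[w] / good_total - bad_counts[w] / bad_total for w in vocab}

# Ties are broken alphabetically so the ranking is deterministic
most_good = sorted(diff, key=lambda w: (-diff[w], w))[:2]
most_bad = sorted(diff, key=lambda w: (diff[w], w))[:2]
print(most_good, most_bad)
```

Words shared equally by both sets score zero and drop out of either ranking, which is the point of normalizing by the total token count.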


In [103]:
good_indicators.to_csv('good_indicators.csv', encoding='utf-8', index=False)
bad_indicators.to_csv('bad_indicators.csv', encoding='utf-8', index=False)