# Preprocessing data for the Data Challenge

In [4]:
!pip install langdetect



In [5]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from sklearn.model_selection import train_test_split

import time
import re
from langdetect import detect

from collections import Counter

import gc

In [6]:
FILENAME = "../data/train_series.csv"
FILENAME_ECB = "../data/ecb_data.csv"
FILENAME_FED = "../data/fed_data.csv"

In [7]:
ecb = pd.read_csv(FILENAME_ECB, index_col=0)
fed = pd.read_csv(FILENAME_FED, index_col=0)

In [8]:
ecb.head()

Unnamed: 0,title,speaker,text
0,Comments by Yves Mersch at Financial Services ...,Yves Mersch,Comments by Yves Mersch at Financial Service...
1,Securing sustained economic growth in the euro...,Vítor Constâncio,Securing sustained economic growth in the eu...
2,The role of monetary policy in addressing the ...,Mario Draghi,The role of monetary policy in addressing th...
3,The pandemic emergency: the three challenges f...,Philip R. Lane,SPEECH The pandemic emergency: the three c...
4,Transmission channels of monetary policy in th...,Peter Praet,Transmission channels of monetary policy in ...


In [9]:
fed.head()

Unnamed: 0,title,speaker,text
0,The Importance of Economic Education and Finan...,Governor Frederic S. Mishkin,As ...
1,Financial Innovation and Consumer Protection,Chairman Ben S. Bernanke,"The concept of financial innovation, it seems..."
2,Implementing Basel II in the United States,Governor Randall S. Kroszner,Good afternoon. I would like to thank Standar...
3,An Assessment of the U.S. Economy,Vice Chair for Supervision Randal K. Quarles,Thank you for the opportunity to take part in...
4,Monetary Policy since the Onset of the Crisis,Chairman Ben S. Bernanke,When we convened in Jackson Hole in August 20...


## Treating NA

In [6]:
ecb.isna().sum()

title       0
speaker     1
text       34
dtype: int64

In [7]:
fed.isna().sum()

title      0
speaker    2
text       0
dtype: int64

In [12]:
test_text = ecb.iloc[5]["text"]
print(test_text[:500])
print("...")
print(test_text[-500:])

  Domestic and cross-border spillovers  of unconventional monetary policies   Remarks by Benoît Cœuré, Member of the Executive Board of the ECB, at the at the SNB-IMF Conference "Monetary Policy Challenges in a Changing World",Zurich, 12 May 2015  ***   Summary   Discussion has recently emerged on the global financial market implications of diverging monetary policy cycles. Central banks in large advanced economies can free themselves from the global financial cycle and regain monetary independe
...
efficiency imply, from a long-run perspective, that the real growth rate exceeds the real interest rate on physical capital.    [13] Tirole, J. ( 1985): “Asset Bubbles and Overlapping Generations” Econometrica, Vol. 53, No. 6 (Nov., 1985), pp. 1499-1528    [14]“Reinforcing financial stability in the euro area”, speech given by Vítor Constâncio, Vice-President of the ECB, at the OMFIF City Lecture, London, 8 May 2015.       SEE ALSO  Find out more about related content   Related information 

In [13]:
def numbered_reference_removal(text):
    # Footnote markers usually look like [n]: once in the body, once in the footnote list.
    matches = re.findall(r'\[[0-9]+\]', text)
    if len(matches) == 0:
        return text

    # Every marker must occur exactly twice; otherwise leave the text untouched.
    counter = Counter(matches)
    if any(count != 2 for count in counter.values()):
        return text

    # Markers must read [1]..[N] in the body, then [1]..[N] again in the footnote list.
    N = len(matches) // 2
    for n, s in enumerate(matches):
        if s != f'[{n % N + 1}]':
            return text

    # Keep everything up to and including the second [1]; the leftover markers
    # are stripped afterwards by reference_removal_en.
    # re.DOTALL lets '.' span line breaks, and the guard avoids an IndexError
    # when the pattern does not match.
    res = re.search(r'^.*?\[1\].+?\[1\]', text, flags=re.DOTALL)
    return res.group(0) if res else text
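
The check above can be illustrated on a toy string. The following is a minimal sketch, not the notebook's function: the helper names and the toy sentence are made up for illustration, and unlike `numbered_reference_removal` it cuts just before the second `[1]` instead of keeping it.

```python
import re

MARKER = re.compile(r"\[[0-9]+\]")

def footnotes_are_well_formed(text):
    """True if the markers read [1]..[N] in the body, then [1]..[N] again."""
    markers = MARKER.findall(text)
    if not markers or len(markers) % 2:
        return False
    n = len(markers) // 2
    expected = [f"[{i}]" for i in range(1, n + 1)] * 2
    return markers == expected

def cut_at_footnote_block(text):
    """Drop everything from the second occurrence of [1] onwards."""
    if not footnotes_are_well_formed(text):
        return text
    first = text.index("[1]")
    second = text.index("[1]", first + len("[1]"))
    return text[:second].rstrip()

# Toy input (hypothetical, not taken from the dataset).
toy = "Rates fell [1] and credit grew [2]. Thank you. [1] Smith (2010). [2] Jones (2012)."
print(cut_at_footnote_block(toy))
```

Texts whose markers are unbalanced or out of order are returned unchanged, which mirrors the conservative behaviour of the cell above.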

In [14]:
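def reference_removal_en(text):
    # Drop everything from "see also" (any case) or "References" to the end of the text.
    # re.DOTALL lets '.' span line breaks.
    res = re.sub(r'see also.*', '', text, flags=re.IGNORECASE | re.DOTALL)
    res = re.sub(r'References.*', '', res, flags=re.DOTALL)
    # Drop the remaining inline markers such as [3] (raw string avoids invalid-escape warnings).
    res = re.sub(r'\[[0-9]+\]', '', res)
    return res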

In [15]:
test_nb = 10  # number of modified texts to display
counter = 0
for text_ in ecb["text"]:
    if isinstance(text_, str):
        text = numbered_reference_removal(text_)
        text2 = reference_removal_en(text)
        if text2 != text:
            print(text[:500])
            print("...")
            print(text[-300:])
            print("-------------")
            print(text2[:500])
            print("...")
            print(text2[-600:])
            print("====================")
            counter += 1
    if counter == test_nb:
        break

  The role of monetary policy in addressing the crisis in the euro area   Speech by Mario Draghi, President of the ECB, at the “Room for discussion” of the Study Association SEFA and the Faculty of Economics and Business, Amsterdam,15 April 2013  Introduction There was a time, not too long ago, when central banking was considered to be a rather boring and unexciting occupation. In the era of the “Great Moderation”, mostly seen as the period between the mid-1980s and the beginning of the global f
...
dy taken lies ahead. All parties involved in this comprehensive reform agenda must persevere. And we must all work with conviction and determination towards our shared vision. If we do so, I am confident that restoring stability and ensuring prosperity in the euro area are within our reach.      [1]
-------------
  The role of monetary policy in addressing the crisis in the euro area   Speech by Mario Draghi, President of the ECB, at the “Room for discussion” of the Study Association SEFA a

Sometimes the title is repeated at the beginning of an ECB text. We can remove it using regular expressions.
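
One pitfall when a title is used as a regular expression is that characters such as `?` or `(` are treated as operators. The sketch below shows the difference on a made-up title and text; `strip_title` is a hypothetical helper, not part of the notebook.

```python
import re

def strip_title(title, text):
    # re.escape turns regex metacharacters in the title into literal characters.
    return re.sub(re.escape(title), "", text).strip()

# Toy data (hypothetical, not taken from the dataset).
title = "Quo vadis, euro area? (A toy title)"
text = "  Quo vadis, euro area? (A toy title)   Speech by a hypothetical speaker"

# Without escaping, "a?" means "optional a" and the parentheses form a group,
# so the pattern no longer matches the literal title.
naive = re.sub(title, "", text)
print(naive == text)

print(strip_title(title, text))
```

Plain `str.replace` would work as well when no pattern features are needed.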

In [16]:
# Run this before handling missing values: rows without text fall back to their title,
# so removing the title afterwards would undo that fallback.
# tag_removal also strips the "TRANSCRIPT" and "SPEECH" tags.
# Most speeches open with a header that ends in a date, so first_date_extractor
# drops everything up to and including the first date.

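def first_date_extractor(text):
    # Drop everything up to and including the first date such as "15 April 2013".
    # Always returns a string (also for empty input), so the caller can chain .strip().
    return re.sub(r'^.*?[1-9][0-9]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|'
                  + r'Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) (?:19|20)[0-9][0-9]',
                  '', text, count=1)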

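def remove_title(x):
    # Missing text is read as NaN (a float), so the isinstance check covers it.
    if isinstance(x["text"], str):
        # re.escape: titles may contain regex metacharacters such as "?" or "(".
        return re.sub(re.escape(x["title"]), '', x["text"]).strip()
    # No text: fall back to the title.
    return x["title"]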

def website_remover(text):
    # Remove URLs. Note: pipeline_en below does not call this helper.
    regex = "((http|https)://)(www\\.)?" \
        + "[a-zA-Z0-9@:%._\\+~#?&//=]{2,256}\\.[a-z]" \
        + "{2,6}\\b([-a-zA-Z0-9@:%._\\+~#?&//=]*)"
    res = re.sub(regex, "", text).strip()
    return res

def tag_removal(text):
    # Remove tags
    res = re.sub('SPEECH', '', text)
    res = re.sub('TRANSCRIPT', '', res)
    res = re.sub("Introduction", "", res)
    res = re.sub("Summary", "", res)
    return res

def pipeline_en(x):
    res = remove_title(x)
    if res is None:
        return res
    res = numbered_reference_removal(res)
    res = reference_removal_en(res)
    res = tag_removal(res).strip()
    res = first_date_extractor(res).strip()

    # print("================================================")
    # print(res[:200])
    # print("...")
    # print(res[-500:])
    return res

ecb["text_preprocessed"] = ecb.apply(pipeline_en, axis=1)
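
To make the date step concrete, here is a minimal sketch of stripping a speech header that ends in a date. The pattern is a compact variant of the one in `first_date_extractor`; the names `MONTHS`, `LEADING_DATE` and `strip_up_to_first_date`, as well as the toy sentence, are assumptions made for illustration.

```python
import re

MONTHS = ("Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
          "Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?")

# Lazy prefix, then "<day> <month> <year>"; re.DOTALL lets the prefix span line breaks.
LEADING_DATE = re.compile(
    rf"^.*?\b[1-9][0-9]? (?:{MONTHS}) (?:19|20)[0-9]{{2}}", re.DOTALL
)

def strip_up_to_first_date(text):
    # count=1 makes the intent explicit: only the leading header is removed.
    return LEADING_DATE.sub("", text, count=1).strip()

# Toy input (hypothetical, not taken from the dataset).
toy = ("Speech by a hypothetical speaker, Amsterdam, 15 April 2013 "
       "There was a time when central banking was boring.")
print(strip_up_to_first_date(toy))
```

A caveat of this approach: when the header contains no date, the first date found anywhere in the body is used instead, and everything before it is lost. Dates written in other styles, such as "4. September 2012", are not matched.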

Which rows have missing text? Are they also missing a speaker or a title?

In [17]:
ecb[ecb.isna().any(axis=1)]

Unnamed: 0,title,speaker,text,text_preprocessed
40,"Die EZB, das Geld und die Wirtschaft: Von A wi...",Sabine Lautenschläger,,"Die EZB, das Geld und die Wirtschaft: Von A wi..."
104,Economic situation and outlook,Peter Praet,,Economic situation and outlook
146,Economic developments in the euro area,Peter Praet,,Economic developments in the euro area
172,Unequal scars – distributional consequences of...,Isabel Schnabel,,Unequal scars – distributional consequences of...
220,The economic outlook for the euro area,Philip R. Lane,,The economic outlook for the euro area
257,Mehr Europa für eine stabile gemeinsame Währung,Isabel Schnabel,,Mehr Europa für eine stabile gemeinsame Währung
275,Better Regulation“ im Finanzsektor - die Sicht...,Gertrude Tumpel-Gugerell,,Better Regulation“ im Finanzsektor - die Sicht...
291,The ECB's monetary policy strategy review - IM...,Philip R. Lane,,The ECB's monetary policy strategy review - IM...
332,Sources of risk and vulnerabilities for financ...,Luis de Guindos,,Sources of risk and vulnerabilities for financ...
413,Policy Frameworks and Strategies for an Open E...,Philip R. Lane,,Policy Frameworks and Strategies for an Open E...


In [18]:
fed[fed.isna().any(axis=1)]

Unnamed: 0,title,speaker,text
461,Reflections on a Year of Crisis,,Chairman Bernanke delivered the same remarks ...
494,Liquidity Provision by the Federal Reserve,,Chairman Bernanke presented identical remarks...


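The Fed data is fine: the two rows without a speaker still have their text. In the ECB data, every row without text still has a title, which text_preprocessed falls back to, so we can still work with those rows.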

Let us view the speakers.

In [19]:
ecb["speaker"].value_counts()

Jean-Claude Trichet            216
Benoît Cœuré                   191
Mario Draghi                   186
Yves Mersch                    161
Peter Praet                    129
Vítor Constâncio               126
Lorenzo Bini Smaghi            105
Gertrude Tumpel-Gugerell        99
Sabine Lautenschläger           85
José Manuel González-Páramo     84
Jürgen Stark                    80
Luis de Guindos                 57
Jörg Asmussen                   51
Lucas Papademos                 45
Isabel Schnabel                 42
Christine Lagarde               41
Philip R. Lane                  40
Fabio Panetta                   26
Frank Elderson                   7
Name: speaker, dtype: int64

In [20]:
fed["speaker"].value_counts()

Chairman Ben S. Bernanke                        145
Governor Lael Brainard                           68
Governor Daniel K. Tarullo                       54
Governor Jerome H. Powell                        50
Vice Chairman Stanley Fischer                    45
Governor Elizabeth A. Duke                       44
Chair Janet L. Yellen                            44
Vice Chairman Donald L. Kohn                     41
Vice Chair for Supervision Randal K. Quarles     39
Governor Randall S. Kroszner                     36
Chairman Jerome H. Powell                        32
Governor Frederic S. Mishkin                     26
Vice Chairman Richard H. Clarida                 26
Vice Chair Janet L. Yellen                       22
Governor Sarah Bloom Raskin                      17
Governor Kevin Warsh                             16
Governor Jeremy C. Stein                         16
Governor Michelle W. Bowman                      13
Governor Susan S. Bies                            3
Name: speaker, dtype: int64

Let us view the text languages.

In [21]:
ecb["lang"] = ecb["text_preprocessed"].apply(lambda x: detect(x[:500]))
fed["lang"] = fed["text"].apply(lambda x : detect(x[:500]))
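
Language detection can fail on rows that are empty or contain no letters, so a defensive wrapper may be useful. The sketch below is an assumption-laden illustration: `safe_detect` and `toy_detector` are hypothetical names, and the toy detector only stands in for `langdetect.detect` so that the example runs without the library.

```python
def safe_detect(text, detector, default="unknown", max_chars=500):
    """Apply `detector` to the first `max_chars` characters, or return `default`."""
    # NaN (a float) and blank strings are not worth passing to the detector.
    if not isinstance(text, str) or not text.strip():
        return default
    try:
        return detector(text[:max_chars])
    except Exception:
        # langdetect raises LangDetectException when the text has no usable features.
        return default

def toy_detector(snippet):
    # Stand-in for langdetect.detect, for illustration only.
    if not any(ch.isalpha() for ch in snippet):
        raise ValueError("no features in text")
    return "de" if " und " in snippet else "en"

print(safe_detect("Sehr geehrte Damen und Herren", toy_detector))
print(safe_detect("12345", toy_detector))
print(safe_detect(float("nan"), toy_detector))
```

With the real library, one would pass `detect` as the `detector` argument, for example `ecb["text_preprocessed"].apply(lambda x: safe_detect(x, detect))`. Because langdetect is non-deterministic by default, setting `DetectorFactory.seed = 0` (from `langdetect`) before the first call keeps the results reproducible.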

In [22]:
ecb["lang"].value_counts()

en    1646
de      75
fr      31
es      16
it       4
Name: lang, dtype: int64

In [23]:
fed["lang"].value_counts()

en    739
Name: lang, dtype: int64

In [24]:
ecb[ecb["lang"] != "en"]["text_preprocessed"].str.len().sum(skipna=True)

2565448

In [25]:
ecb.loc[31]["text"]

'  Die Finanzmarktunion als Element einer stabilen Währungsunion?   Rede von Jörg Asmussen, Mitglied des Direktoriums der EZB, Handelsblatt Jahrestagung „Banken im Umbruch“, Frankfurt am Main, 4. September 2012 Sehr geehrte Damen und Herren,  Ein Satz in der Schlusserklärung des Euro-Gipfels vom 28. Juni dieses Jahres sorgte für Aufhorchen und viel Diskussionsstoff über den Sommer – ich zitiere: „ Sobald unter Einbeziehung der EZB ein wirksamer einheitlicher Aufsichts\xadmechanismus für Banken des Euro-Währungs\xadgebiets eingerichtet worden ist, hätte der ESM nach einem ordentlichen Beschluss die Möglich\xadkeit, Banken direkt zu rekapitalisieren.“  Hier wurde der Anfang einer Finanzmarktunion beschlossen. Als Beitrag zu dieser Diskussion möchte ich im Folgenden gerne drei Aspekte aufgreifen:    Wo stehen wir bei der Finanzmarkt\xadregulierung? Wie stellt sich aktuell die Lage an den Finanzmärkten dar?   Warum ist eine Finanzmarktunion not\xadwendig geworden? Wie soll diese aus\xadseh