# Preprocessing data for the Data Challenge

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from sklearn.model_selection import train_test_split

import time
import re
from langdetect import detect

from collections import Counter

import gc

In [2]:
FILENAME = "../data/train_series.csv"
FILENAME_ECB = "../data/ecb_data.csv"
FILENAME_FED = "../data/fed_data.csv"

In [3]:
ecb = pd.read_csv(FILENAME_ECB, index_col=0)
fed = pd.read_csv(FILENAME_FED, index_col=0)

In [4]:
ecb.head()

Unnamed: 0,title,speaker,text
0,Comments by Yves Mersch at Financial Services ...,Yves Mersch,Comments by Yves Mersch at Financial Service...
1,Securing sustained economic growth in the euro...,Vítor Constâncio,Securing sustained economic growth in the eu...
2,The role of monetary policy in addressing the ...,Mario Draghi,The role of monetary policy in addressing th...
3,The pandemic emergency: the three challenges f...,Philip R. Lane,SPEECH The pandemic emergency: the three c...
4,Transmission channels of monetary policy in th...,Peter Praet,Transmission channels of monetary policy in ...


In [5]:
fed.head()

Unnamed: 0,title,speaker,text
0,The Importance of Economic Education and Finan...,Governor Frederic S. Mishkin,As ...
1,Financial Innovation and Consumer Protection,Chairman Ben S. Bernanke,"The concept of financial innovation, it seems..."
2,Implementing Basel II in the United States,Governor Randall S. Kroszner,Good afternoon. I would like to thank Standar...
3,An Assessment of the U.S. Economy,Vice Chair for Supervision Randal K. Quarles,Thank you for the opportunity to take part in...
4,Monetary Policy since the Onset of the Crisis,Chairman Ben S. Bernanke,When we convened in Jackson Hole in August 20...


## Summarizing

In [6]:
import gensim.downloader as api
# gensim.summarization was removed in gensim 4.0, so this cell needs gensim<4.0.
from gensim.summarization import summarize

In [110]:
text = (

"Wandern in der Natur"
"Im Urlaub fahren wir eine Woche zum Wandern in die Berge. Dort ist die Luft besser als in der Stadt. Wir wandern zu einem See und wollen dort mit einem Boot fahren. Auf dem Wanderweg zum See gehen wir zuerst lange durch einen dunklen Wald. Im Wald sind viele Bäume und es riecht nach Erde. Weil wir nicht laut sind, sehen wir ein Reh und beobachten es. Wir kommen an einem großen Felsen vorbei. Auf dem Weg liegen viele Steine und wir brauchen gute Wanderschuhe, damit wir uns nicht verletzen."

"Nach dem Wald kommen wir auf Felder und Wiesen. Das Gras auf der Wiese ist Futter für die Tiere eines Bauern. Auch schöne Blumen wachsen dort und wir pflücken einen kleinen Blumenstrauß beim Heimweg. Zum See führt ein kleiner Bach. In dem Bach gibt es Fische. Ich möchte dort gerne angeln."

"Nach dem Urlaub in den Bergen fliegt die ganze Familie noch ein paar Tage ans Meer. Der Strand ist ganz flach und das Wasser ist nicht tief. Die Kinder spielen gerne im feinen Sand. Die Sonne ist sehr stark und man braucht Sonnencreme. Am Meer bläst immer Wind. Das ist bei der Hitze angenehm. Am Meer ist ein anderes Klima als in der Stadt und die Luft ist sehr feucht. Das Wetter ist fast immer gut und es gibt selten Regen." )
print(summarize(text))

Wandern in der NaturIm Urlaub fahren wir eine Woche zum Wandern in die Berge.
Das Gras auf der Wiese ist Futter für die Tiere eines Bauern.
Ich möchte dort gerne angeln.Nach dem Urlaub in den Bergen fliegt die ganze Familie noch ein paar Tage ans Meer.
Am Meer ist ein anderes Klima als in der Stadt und die Luft ist sehr feucht.
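
The run-together fragments in the output above ("NaturIm", "angeln.Nach") come from Python joining adjacent string literals without any separator. A minimal illustration with shortened, made-up strings rather than the notebook's full text:

```python
# Adjacent string literals inside parentheses are concatenated with no separator.
joined = (
    "Wandern in der Natur"
    "Im Urlaub fahren wir in die Berge."
)

# Joining explicit paragraphs with a newline keeps the boundaries visible
# to whatever sentence splitter runs afterwards.
paragraphs = [
    "Wandern in der Natur.",
    "Im Urlaub fahren wir in die Berge.",
]
separated = "\n".join(paragraphs)

print(joined)
print(separated)
```

Ending each literal with a space or newline would have the same effect as the explicit join.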


In [26]:
text = (
    "Thomas A. Anderson is a man living two lives. By day he is an "
    "average computer programmer and by night a hacker known as "
    "Neo. Neo has always questioned his reality, but the truth is "
    "far beyond his imagination. Neo finds himself targeted by the "
    "police when he is contacted by Morpheus, a legendary computer "
    "hacker branded a terrorist by the government. Morpheus awakens "
    "Neo to the real world, a ravaged wasteland where most of "
    "humanity have been captured by a race of machines that live "
    "off of the humans' body heat and electrochemical energy and "
    "who imprison their minds within an artificial reality known as "
    "the Matrix. As a rebel against the machines, Neo must return to "
    "the Matrix and confront the agents: super-powerful computer "
    "programs devoted to snuffing out Neo and the entire human "
    "rebellion. "
)
print(text)

Thomas A. Anderson is a man living two lives. By day he is an average computer programmer and by night a hacker known as Neo. Neo has always questioned his reality, but the truth is far beyond his imagination. Neo finds himself targeted by the police when he is contacted by Morpheus, a legendary computer hacker branded a terrorist by the government. Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix. As a rebel against the machines, Neo must return to the Matrix and confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion. 


In [27]:
print(summarize(text))

Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix.


In [28]:
print(summarize(text, split=True))

["Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix."]


In [29]:
print(summarize(text, ratio=0.5))

By day he is an average computer programmer and by night a hacker known as Neo. Neo has always questioned his reality, but the truth is far beyond his imagination.
Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix.
As a rebel against the machines, Neo must return to the Matrix and confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion.


In [39]:
len(str.split(ecb["text"][0]))

912

In [42]:
ecb["text"].apply(lambda x: len(str.split(str(x)))).median()

2583.0

In [54]:
ecb["text"][ecb["text"].isna()].index

Int64Index([  40,  104,  146,  172,  220,  257,  275,  291,  332,  413,  497,
             509,  528,  649,  695,  739,  747,  753,  788,  829,  832, 1025,
            1058, 1280, 1383, 1412, 1467, 1493, 1594, 1595, 1618, 1697, 1763,
            1765],
           dtype='int64')

In [66]:
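# Collect the rows that summarize() cannot handle: NaN entries and texts
# with too few sentences both raise an exception.
oneSentenceTextsIndexes = []
for i in range(ecb["text"].shape[0]):
    try:
        summarize(ecb["text"][i])
    except Exception:
        oneSentenceTextsIndexes.append(i)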

In [None]:
oneSentenceTextsIndexes

In [75]:
oneSentenceTextsWithoutNA = set(oneSentenceTextsIndexes).difference(set(ecb["text"][ecb["text"].isna()].index))

In [None]:
ecb["text"].iloc[list(oneSentenceTextsWithoutNA)].apply(lambda x: len(str.split(x)))

In [105]:
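newEcb = ecb.copy()
# Summarize only the rows where summarize() succeeded.
# Assigning through .loc avoids chained assignment on a possible copy.
canSummarize = ~newEcb.index.isin(oneSentenceTextsIndexes)
newEcb.loc[canSummarize, "text"] = newEcb.loc[canSummarize, "text"].apply(
    lambda x: summarize(x, word_count=500))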

In [106]:
newEcb["text"].apply(lambda x: len(str.split(str(x)))).describe()

count    1772.000000
mean      480.100451
std        95.562474
min         0.000000
25%       491.000000
50%       500.000000
75%       508.000000
max       554.000000
Name: text, dtype: float64

In [116]:
ecb["text"].iloc[31]

'  Die Finanzmarktunion als Element einer stabilen Währungsunion?   Rede von Jörg Asmussen, Mitglied des Direktoriums der EZB, Handelsblatt Jahrestagung „Banken im Umbruch“, Frankfurt am Main, 4. September 2012 Sehr geehrte Damen und Herren,  Ein Satz in der Schlusserklärung des Euro-Gipfels vom 28. Juni dieses Jahres sorgte für Aufhorchen und viel Diskussionsstoff über den Sommer – ich zitiere: „ Sobald unter Einbeziehung der EZB ein wirksamer einheitlicher Aufsichts\xadmechanismus für Banken des Euro-Währungs\xadgebiets eingerichtet worden ist, hätte der ESM nach einem ordentlichen Beschluss die Möglich\xadkeit, Banken direkt zu rekapitalisieren.“  Hier wurde der Anfang einer Finanzmarktunion beschlossen. Als Beitrag zu dieser Diskussion möchte ich im Folgenden gerne drei Aspekte aufgreifen:    Wo stehen wir bei der Finanzmarkt\xadregulierung? Wie stellt sich aktuell die Lage an den Finanzmärkten dar?   Warum ist eine Finanzmarktunion not\xadwendig geworden? Wie soll diese aus\xadseh
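
The German speech above contains `\xad` characters (U+00AD, soft hyphen) inside words such as `Aufsichts\xadmechanismus`. One possible cleanup, which is not part of the original pipeline, is to drop them so that words compare equal to their unhyphenated form. The helper name `strip_soft_hyphens` is hypothetical.

```python
SOFT_HYPHEN = "\xad"  # U+00AD, a hyphenation hint that is normally not displayed

def strip_soft_hyphens(text):
    """Remove soft hyphens so words match their unhyphenated spelling."""
    return text.replace(SOFT_HYPHEN, "")

# Fragment copied from the speech shown above.
sample = "ein wirksamer einheitlicher Aufsichts\xadmechanismus für Banken"
cleaned = strip_soft_hyphens(sample)
print(cleaned)
```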

In [109]:
print(newEcb["text"].iloc[1])

The OECD referred to the present situation as a “self-reinforcing low growth trap” and the IMF”s Maurice Obstfeld stated “Without determined policy action to support economic activity over the short and longer terms, sub-par growth at recent levels risks perpetuating itself – through the negative economic and political forces it is unleashing.”.
Prospects for the euro area economy Turning to recent euro area developments, the recovery is continuing its moderate but steady pace, supported mainly by the ECB’s expansionary policies, which have significantly improved financial conditions, reduced financial fragmentation and supported economic activity and inflation.
One positive aspect is that our policies have contributed to a decrease in financial fragmentation and that the dispersion of GDP growth and inflation across euro area countries is at the lowest level since the beginning of monetary union in 1999.
Indeed, with the five packages of measures we have adopted since June 2014, we ha

## Treating NA

In [82]:
ecb.isna().sum()

title       0
speaker     1
text       34
dtype: int64

In [83]:
fed.isna().sum()

title      0
speaker    2
text       0
dtype: int64

In [84]:
test_text = ecb.iloc[5]["text"]
print(test_text[:500])
print("...")
print(test_text[-5000:])

  Domestic and cross-border spillovers  of unconventional monetary policies   Remarks by Benoît Cœuré, Member of the Executive Board of the ECB, at the at the SNB-IMF Conference "Monetary Policy Challenges in a Changing World",Zurich, 12 May 2015  ***   Summary   Discussion has recently emerged on the global financial market implications of diverging monetary policy cycles. Central banks in large advanced economies can free themselves from the global financial cycle and regain monetary independe
...
feguard financial stability. Macro-prudential instruments can be targeted more efficiently to those sectors and countries where systemic risks may be materialising [14]. Finally, we encourage national authorities to do whatever is in their power to place the euro area on a more dynamic growth path, thereby creating attractive investment projects that generate high, but fundamentally justified, returns. These are the conditions for unconventional monetary policies to bring economies back to 

In [7]:
def numbered_reference_removal(text):
    # Refs are typically in the form [n] in the text, and appear a second
    # time, in the same order, in the reference list at the end.
    matches = re.findall(r'\[[0-9]+\]', text)
    if len(matches) == 0:
        return text
    counter = Counter(matches)
    # Every marker must occur exactly twice, otherwise we cannot remove safely.
    if any(count != 2 for count in counter.values()):
        return text

    # The markers must read [1]..[N] and then [1]..[N] again.
    N = len(matches) // 2
    for n, s in enumerate(matches):
        if s != f'[{n % N + 1}]':
            return text

    # Keep everything up to the second [1], where the reference list starts.
    # re.DOTALL lets the match span line breaks; without a match the text
    # is returned unchanged instead of raising an IndexError.
    res = re.search(r'^.*?\[1\].+?\[1\]', text, flags=re.DOTALL)
    return res.group(0) if res else text
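
To make the intent of `numbered_reference_removal` concrete, here is a standalone sketch of the same idea on a toy string: if every marker `[n]` occurs exactly twice and in the order `[1]..[N]`, `[1]..[N]`, the text is cut where the second `[1]` starts. The helper name `cut_reference_list` and the sample text are illustrative assumptions. Unlike the function above, this variant also drops the second `[1]` marker itself.

```python
import re
from collections import Counter

def cut_reference_list(text):
    """Drop a trailing reference list whose markers mirror the in-text ones."""
    markers = [m.group(0) for m in re.finditer(r"\[[0-9]+\]", text)]
    if not markers:
        return text
    # Each marker must appear exactly twice: in the body and in the list.
    if any(count != 2 for count in Counter(markers).values()):
        return text
    n_refs = len(markers) // 2
    expected = [f"[{i % n_refs + 1}]" for i in range(len(markers))]
    if markers != expected:
        return text
    # Position of the second "[1]", i.e. the start of the reference list.
    first = text.find("[1]")
    second = text.find("[1]", first + 1)
    return text[:second].rstrip()

toy = "Rates fell [1] while spreads widened [2]. [1] ECB (2015). [2] IMF (2014)."
print(cut_reference_list(toy))
```

Texts that fail either check (a marker cited twice in the body, or an unordered list) are returned untouched, which mirrors the conservative behavior of the original function.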

In [8]:
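def reference_removal_en(text):
    # Drop everything from "see also" (any case) to the end of the line.
    res = re.sub(r'(?i)see also.*', '', text)
    # Drop everything from "References" to the end of the line.
    res = re.sub(r'References.*', '', res)
    # Drop the remaining numbered markers such as [12] (raw string avoids an invalid escape).
    res = re.sub(r"\[[0-9]+\]", "", res)
    return res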
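
Because `.` does not match a newline by default, the `see also.*` and `References.*` patterns delete from the keyword to the end of the line. On a speech stored as a single line, that drops everything after the first occurrence. A small self-contained check of this behavior follows; the sample strings are made up, and the function is repeated so the snippet runs on its own.

```python
import re

def reference_removal_en(text):
    res = re.sub(r'(?i)see also.*', '', text)
    res = re.sub(r'References.*', '', res)
    res = re.sub(r'\[[0-9]+\]', '', res)
    return res

# Single-line text: everything after "See also" is lost.
one_line = "Inflation is low [3]. See also the annex. Growth is weak."
# Multi-line text: only the "References" line is removed.
two_lines = "Inflation is low [3].\nReferences Smith (2010)\nGrowth is weak."

print(repr(reference_removal_en(one_line)))
print(repr(reference_removal_en(two_lines)))
```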

In [9]:
test_nb = 10  # maximum number of modified texts to display
counter = 0
for text_ in ecb["text"]:
    if isinstance(text_, str):
        text = numbered_reference_removal(text_)
        text2 = reference_removal_en(text)
        if text2 != text:
            print(text[:500])
            print("...")
            print(text[-3000:])
            print("-------------")
            print(text2[:500])
            print("...")
            print(text2[-6000:])
            print("====================")
            counter += 1
    if counter == test_nb:
        break

  The role of monetary policy in addressing the crisis in the euro area   Speech by Mario Draghi, President of the ECB, at the “Room for discussion” of the Study Association SEFA and the Faculty of Economics and Business, Amsterdam,15 April 2013  Introduction There was a time, not too long ago, when central banking was considered to be a rather boring and unexciting occupation. In the era of the “Great Moderation”, mostly seen as the period between the mid-1980s and the beginning of the global f
...
et up to complement the national shock absorption capacity of euro area countries. The creation of the EFSF and more recently the ESM has addressed this last shortcoming. And in our joint work with the Presidents of the European Council, the European Commission and the Eurogroup, we have set out a vision and a process for building a genuine Economic and Monetary Union. This is designed to cover the other gaps in the institutional architecture that I previously referred to.  The genuine Econ

Sometimes, the title appears at the beginning of an ECB text. We can remove it using regular expressions.

In [16]:
# Do this before dealing with N/A; otherwise this edit would just undo the other one.
# The same approach removes the "TRANSCRIPT" and "SPEECH" tags.
# Most of the time the content is preceded by a date, so first_date_extractor removes everything up to the first date.

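def first_date_extractor(text):
    # Remove everything up to and including the first "<day> <month> <year>" date.
    if len(text) == 0:
        # Return the empty string rather than None so callers can still call .strip().
        return text
    res = re.sub(r'^.*?[1-9][0-9]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|'
                 + r'Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) (?:19|20)[0-9][0-9]',
                 '', text)
    return res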

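def remove_title(x):
    if isinstance(x["text"], str):
        # Escape the title: it is literal text, not a regular expression,
        # and may contain characters such as "?" or "(".
        res = re.sub(re.escape(x["title"]), '', x["text"]).strip()
        return res
    else:
        # Missing text (NaN): fall back to the title.
        return x["title"]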

def website_remover(text):
    # Remove websites
    regex = "((http|https)://)(www.)?" \
        + "[a-zA-Z0-9@:%._\\+~#?&//=]{2,256}\\.[a-z]" \
        + "{2,6}\\b([-a-zA-Z0-9@:%._\\+~#?&//=]*)"
    res = re.sub(regex, "", text).strip()
    return res

def tag_removal(text):
    # Remove tags
    res = re.sub('SPEECH', '', text)
    res = re.sub('TRANSCRIPT', '', res)
    res = re.sub("Introduction", "", res)
    res = re.sub("Summary", "", res)
    return res

def summarizeLine(text, tolist=False):
    try:
        res = summarize(text, word_count=500)
        if tolist:
            res = res.split("\n")
    except Exception:
        # summarize() fails on very short inputs; keep the text unchanged.
        return text
    return res

def pipeline_en(x, tolist=False):
    res = remove_title(x)
    if res is None:
        return res
    res = numbered_reference_removal(res)
    res = reference_removal_en(res)
    res = tag_removal(res).strip()
    res = first_date_extractor(res).strip()
    res = summarizeLine(res, tolist)

    # print("================================================")
    # print(res[:200])
    # print("...")
    # print(res[-500:])
    return res

ecb["text_preprocessed"] = ecb.apply(lambda x: pipeline_en(x, tolist=True), axis=1)
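
The comments above note that the content is usually preceded by a date. As a hedged sketch of that date-prefix removal, the variant below also tolerates a period after the day, as in the German header "4. September 2012" seen in this dataset. The names `MONTHS`, `DATE_PREFIX` and `strip_up_to_first_date` are illustrative, and only full English month names are covered here (abbreviations such as "Sep" are left out for brevity).

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Lazy prefix, then "<day> <month> <year>". The optional "." after the day
# covers German-style dates; re.DOTALL lets the prefix span line breaks.
DATE_PREFIX = re.compile(
    rf"^.*?\b[0-9]{{1,2}}\.? (?:{MONTHS}) (?:19|20)[0-9]{{2}}\b",
    flags=re.DOTALL,
)

def strip_up_to_first_date(text):
    """Remove everything up to and including the first date, if there is one."""
    return DATE_PREFIX.sub("", text, count=1).strip()

header = ("Speech by Mario Draghi, President of the ECB, Amsterdam, "
          "15 April 2013  Introduction There was a time")
print(strip_up_to_first_date(header))
```

One caveat applies to this sketch and to `first_date_extractor` alike: when a speech has no date in its header, the first date mentioned in the body is matched instead, and everything before it is removed.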

In [17]:
ecb["text_preprocessed"].apply(lambda x: len(str.split(str(x)))).describe()

count    1772.000000
mean      478.733634
std        98.570976
min         1.000000
25%       491.000000
50%       499.000000
75%       507.000000
max       583.000000
Name: text_preprocessed, dtype: float64

What are the N/A entries for text? Do they also have no speaker or no title?

In [13]:
ecb[ecb.isna().any(axis=1)]

Unnamed: 0,title,speaker,text,text_preprocessed
40,"Die EZB, das Geld und die Wirtschaft: Von A wi...",Sabine Lautenschläger,,"Die EZB, das Geld und die Wirtschaft: Von A wi..."
104,Economic situation and outlook,Peter Praet,,Economic situation and outlook
146,Economic developments in the euro area,Peter Praet,,Economic developments in the euro area
172,Unequal scars – distributional consequences of...,Isabel Schnabel,,Unequal scars – distributional consequences of...
220,The economic outlook for the euro area,Philip R. Lane,,The economic outlook for the euro area
257,Mehr Europa für eine stabile gemeinsame Währung,Isabel Schnabel,,Mehr Europa für eine stabile gemeinsame Währung
275,Better Regulation“ im Finanzsektor - die Sicht...,Gertrude Tumpel-Gugerell,,Better Regulation“ im Finanzsektor - die Sicht...
291,The ECB's monetary policy strategy review - IM...,Philip R. Lane,,The ECB's monetary policy strategy review - IM...
332,Sources of risk and vulnerabilities for financ...,Luis de Guindos,,Sources of risk and vulnerabilities for financ...
413,Policy Frameworks and Strategies for an Open E...,Philip R. Lane,,Policy Frameworks and Strategies for an Open E...


In [98]:
ecb["text"].apply(lambda x: len(str.split(str(x)))).describe()

count     1772.000000
mean      2838.625282
std       1682.959967
min          1.000000
25%       1712.250000
50%       2583.000000
75%       3784.500000
max      13282.000000
Name: text, dtype: float64

In [14]:
fed[fed.isna().any(axis=1)]

Unnamed: 0,title,speaker,text
461,Reflections on a Year of Crisis,,Chairman Bernanke delivered the same remarks ...
494,Liquidity Provision by the Federal Reserve,,Chairman Bernanke presented identical remarks...


The FED data has no missing text; only two rows lack a speaker. In the ECB data, every row with missing text still has a title, so we can fall back on the title.

Let us view the speakers.

In [15]:
ecb["speaker"].value_counts()

Jean-Claude Trichet            216
Benoît Cœuré                   191
Mario Draghi                   186
Yves Mersch                    161
Peter Praet                    129
Vítor Constâncio               126
Lorenzo Bini Smaghi            105
Gertrude Tumpel-Gugerell        99
Sabine Lautenschläger           85
José Manuel González-Páramo     84
Jürgen Stark                    80
Luis de Guindos                 57
Jörg Asmussen                   51
Lucas Papademos                 45
Isabel Schnabel                 42
Christine Lagarde               41
Philip R. Lane                  40
Fabio Panetta                   26
Frank Elderson                   7
Name: speaker, dtype: int64

In [16]:
fed["speaker"].value_counts()

Chairman Ben S. Bernanke                        145
Governor Lael Brainard                           68
Governor Daniel K. Tarullo                       54
Governor Jerome H. Powell                        50
Vice Chairman Stanley Fischer                    45
Governor Elizabeth A. Duke                       44
Chair Janet L. Yellen                            44
Vice Chairman Donald L. Kohn                     41
Vice Chair for Supervision Randal K. Quarles     39
Governor Randall S. Kroszner                     36
Chairman Jerome H. Powell                        32
Governor Frederic S. Mishkin                     26
Vice Chairman Richard H. Clarida                 26
Vice Chair Janet L. Yellen                       22
Governor Sarah Bloom Raskin                      17
Governor Kevin Warsh                             16
Governor Jeremy C. Stein                         16
Governor Michelle W. Bowman                      13
Governor Susan S. Bies                            3
Name: speaker, dtype: int64

Let us view the text languages.

In [17]:
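from langdetect import DetectorFactory
DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed for reproducible labels
ecb["lang"] = ecb["text_preprocessed"].apply(lambda x: detect(x[:500]))
fed["lang"] = fed["text"].apply(lambda x: detect(x[:500]))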

In [18]:
ecb["lang"].value_counts()

en    1646
de      75
fr      31
es      16
it       4
Name: lang, dtype: int64

In [19]:
fed["lang"].value_counts()

en    739
Name: lang, dtype: int64

In [28]:
ecb[ecb["lang"] != "en"]["text_preprocessed"].str.len().sum(skipna=True)

2565448

In [21]:
ecb.loc[31]["text"]

'  Die Finanzmarktunion als Element einer stabilen Währungsunion?   Rede von Jörg Asmussen, Mitglied des Direktoriums der EZB, Handelsblatt Jahrestagung „Banken im Umbruch“, Frankfurt am Main, 4. September 2012 Sehr geehrte Damen und Herren,  Ein Satz in der Schlusserklärung des Euro-Gipfels vom 28. Juni dieses Jahres sorgte für Aufhorchen und viel Diskussionsstoff über den Sommer – ich zitiere: „ Sobald unter Einbeziehung der EZB ein wirksamer einheitlicher Aufsichts\xadmechanismus für Banken des Euro-Währungs\xadgebiets eingerichtet worden ist, hätte der ESM nach einem ordentlichen Beschluss die Möglich\xadkeit, Banken direkt zu rekapitalisieren.“  Hier wurde der Anfang einer Finanzmarktunion beschlossen. Als Beitrag zu dieser Diskussion möchte ich im Folgenden gerne drei Aspekte aufgreifen:    Wo stehen wir bei der Finanzmarkt\xadregulierung? Wie stellt sich aktuell die Lage an den Finanzmärkten dar?   Warum ist eine Finanzmarktunion not\xadwendig geworden? Wie soll diese aus\xadseh