In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

import time
import re
from langdetect import detect

from collections import Counter

import reprlib

import gc

In [2]:
FILENAME = "../data/train_series.csv"
FILENAME_ECB = "../data/ecb_data.csv"
FILENAME_FED = "../data/fed_data.csv"
FILENAME_ECB_P = "../data/ecb_data_preprocessed.csv"
FILENAME_FED_P = "../data/fed_data_preprocessed.csv"

In [3]:
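def text_print(text, line_char_lim=150):
    """Print text in chunks of at most line_char_lim characters per line."""
    for subtext in text.split('\n'):
        # Fixed-width slices; slicing clamps at the end of the string, and
        # range() avoids the extra empty line the old `k <= n` loop printed.
        for k in range(0, len(subtext), line_char_lim):
            print(subtext[k:k + line_char_lim])
        print('\n')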

In [4]:
fed = pd.read_csv(FILENAME_FED_P, index_col=0)

In [5]:
# Character-length statistics of the raw texts
fed["text"].apply(len).describe()

count      739.000000
mean     22104.081191
std      10999.279224
min       2011.000000
25%      13922.000000
50%      22575.000000
75%      28200.500000
max      69492.000000
Name: text, dtype: float64

In [6]:
# View texts

In [13]:
text_print(fed.iloc[1]['text'])

 The concept of financial innovation, it seems, has fallen on hard times. Subprime mortgage loans, credit default swaps, structured investment vehicle
s, and other more-recently developed financial products have become emblematic of our present financial crisis. Indeed, innovation, once held up as th
e solution, is now more often than not perceived as the problem. I think that perception goes too far, and innovation, at its best, has been and will 
continue to be a tool for making our financial system more efficient and more inclusive. But, as we have seen only too clearly during the past two yea
rs, innovation that is inappropriately implemented can be positively harmful. In short, it would be unwise to try to stop financial innovation, but we
 must be more alert to its risks and the need to manage those risks properly. My remarks today will focus on the consumer protection issues raised by 
financial innovation. First, though, I want to say how pleased I am to join you for the sixth 

In [52]:
text_print(fed[~fed["text"].str.contains("ootnote")]["text"].iloc[0])

 Good afternoon. I would like to thank Standard and Poor's for the invitation to speak today at this impressive conference. I am quite pleased to be a
ble to offer some remarks on Basel II implementation in the United States. I am even more pleased that in today's speech I can now talk about U.S. imp
lementation of Basel II in the present tense, since within the past ten days each of the U.S. banking agencies approved the U.S. final rule for Basel 
II. While work on Basel II--for both bankers and supervisors--is far from complete, adoption of the Basel II rule is nevertheless a very important acc
omplishment.  I would also like to offer thanks and extend congratulations to all the parties involved in the successful adoption of Basel II. This in
cludes staff at each of the U.S. banking agencies, who worked tirelessly and with incredible determination and patience to see this rulemaking to its 
completion, as well as the principals at the other agencies, who worked very hard to find comm

In [66]:
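def find_footnote(x):
    # Keep the text up to the last "footnote" marker (case-insensitive).
    # "." does not cross newlines and re.match anchors at the start,
    # so only the first line is searched. Unmatched text is returned as-is.
    found = re.match(r"(.*)footnote", x, re.IGNORECASE)
    if found is not None:
        # Drop every occurrence of the word itself from the kept part.
        insensitive_footnote = re.compile(re.escape('footnote'), re.IGNORECASE)
        return insensitive_footnote.sub("", found.group()).strip()
    return x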
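
A minimal sketch of what this footnote rule does on a toy string. The function body repeats the cell's logic so the snippet runs on its own; the `sample` text is made up for illustration. Note that the greedy `(.*)` keeps everything up to the last marker, and the word "footnote" is also deleted from the body.

```python
import re

def find_footnote(x):
    # Same logic as the notebook cell, repeated so this sketch is self-contained.
    found = re.match(r"(.*)footnote", x, re.IGNORECASE)
    if found is not None:
        return re.sub("footnote", "", found.group(), flags=re.IGNORECASE).strip()
    return x

# Hypothetical text: one in-body mention, then a trailing footnote section.
sample = "Body text. Footnote 1 in the body. More body. Footnotes 1. Some source."
cleaned = find_footnote(sample)
print(repr(cleaned))

# Text without the marker passes through unchanged.
untouched = find_footnote("no marker here")
print(repr(untouched))
```

The in-body mention leaves a double space behind, which matters later if words are counted by splitting on a single space.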

In [67]:
# Find amount of text we can remove by removing footnotes
without_footnote = fed["text"].apply(
    find_footnote
)

In [68]:
text_print(without_footnote.iloc[1])

The concept of financial innovation, it seems, has fallen on hard times. Subprime mortgage loans, credit default swaps, structured investment vehicles
, and other more-recently developed financial products have become emblematic of our present financial crisis. Indeed, innovation, once held up as the
 solution, is now more often than not perceived as the problem. I think that perception goes too far, and innovation, at its best, has been and will c
ontinue to be a tool for making our financial system more efficient and more inclusive. But, as we have seen only too clearly during the past two year
s, innovation that is inappropriately implemented can be positively harmful. In short, it would be unwise to try to stop financial innovation, but we 
must be more alert to its risks and the need to manage those risks properly. My remarks today will focus on the consumer protection issues raised by f
inancial innovation. First, though, I want to say how pleased I am to join you for the sixth b

In [99]:
without_footnote[without_footnote.str.contains("Math")].iloc[0]



In [155]:
def find_useless_thanks(x):
    # Remove every sentence containing "thank " or " congratulat" (case-insensitive).
    if x is None:
        return x
    # re.findall returns a list of (sentence, keyword) tuples, never None.
    found = re.findall(r"([^.]*?(thank | congratulat)[^.]*\.)", x, re.IGNORECASE)
    res = x
    for sentence, _keyword in found:
        res = res.replace(sentence, "")
    return res
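
A small sketch of the sentence-removal rule on made-up input, with the regex copied from the cell above. It also shows a side effect of the trailing space in `thank `: "thanks." is not matched, so such sentences survive. The helper name `drop_thanks` and both sample strings are assumptions for illustration.

```python
import re

THANKS_PATTERN = r"([^.]*?(thank | congratulat)[^.]*\.)"

def drop_thanks(text):
    # Find each matching sentence, then delete it from the text.
    res = text
    for sentence, _keyword in re.findall(THANKS_PATTERN, text, re.IGNORECASE):
        res = res.replace(sentence, "")
    return res

# Hypothetical speech fragments.
removed = drop_thanks(" I would like to thank you all. Inflation is low. Thank you.")
kept = drop_thanks(" Many thanks. Inflation is low.")
print(repr(removed))
print(repr(kept))
```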

In [101]:
# Thanks are useless. Let's just remove them.
without_thanks = without_footnote.apply(find_useless_thanks)

[]
[]
[(" I would like to thank Standard and Poor's for the invitation to speak today at this impressive conference.", 'thank '), ('  I would also like to offer thanks and extend congratulations to all the parties involved in the successful adoption of Basel II.', ' congratulat'), (' Of course, I would also like to thank the many industry participants--some of whom may be here today--who spent considerable time and effort providing valuable comments on our proposals over the past several years.', 'thank ')]
[(' Thank you for the opportunity to take part in this important and influential conference.', 'Thank ')]
[]
[(' Thank you again for the opportunity to speak with you today.', 'Thank ')]
[]
[('  I would like to thank NeighborWorks America for inviting me to be with you today to continue the very important discussion of community stabilization in the wake of rising foreclosures in neighborhoods across the country.', 'thank ')]
[]
[('         Thank you.', 'Thank '), ("              Th

In [117]:
fed["text"].str.split(" ").apply(len).describe()

count      739.000000
mean      3543.518268
std       1730.606814
min        309.000000
25%       2256.500000
50%       3592.000000
75%       4530.500000
max      10676.000000
Name: text, dtype: float64

In [118]:
without_thanks.str.split(" ").apply(len).describe()

count      739.000000
mean      3435.964817
std       1693.646100
min        286.000000
25%       2175.500000
50%       3458.000000
75%       4461.000000
max      10676.000000
Name: text, dtype: float64
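
A caveat on these word counts, sketched in plain Python: splitting on a single space produces an empty token for every extra space in a run, and the texts contain long runs of spaces. `str.split()` with no argument splits on any whitespace run instead. The sample string is made up.

```python
# Hypothetical fragment with a run of five spaces, as seen in the raw texts.
text = "Thank you." + " " * 5 + "Over the last years"

naive_count = len(text.split(" "))  # counts the empty strings between spaces
word_count = len(text.split())      # collapses whitespace runs

print(naive_count, word_count)
```

In pandas the whitespace-collapsing equivalent would be `.str.split()` without a separator, so the absolute numbers above are best read as upper bounds on the true word counts.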

In [114]:
text_print(without_thanks[without_thanks.str.split(" ").apply(len) > 10000].iloc[0])

 As one would expect of a piece of legislation that has sixteen titles and runs 849 pages in the Statutes at Large, the Dodd-Frank Wall Street Reform 
and Consumer Protection Act ranges widely in addressing problems both directly and indirectly associated with the financial crisis. Taken as a whole, 
though, the primary aim of those 849 pages can fairly be read as a reorientation of financial regulation towards safeguarding "financial stability" th
rough the containment of "systemic risk," phrases that both recur dozens of times throughout the statute. The law, explicitly in many provisions and i
mplicitly in many others, directs the bank regulatory agencies to broaden their focus beyond the soundness of individual banking institutions, and the
 market regulatory agencies to move beyond their traditional focus on transaction-based investor protection. This emphasis on financial stability and 
systemic risk is hardly surprising in light of the damage done by the financial crisis and the

In [363]:
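def remove_video_code(text):
    if text is not None:
        # Strip the embedded video-player help text and its JavaScript.
        # Raw strings avoid invalid-escape warnings for \. \( \) and \}.
        res = re.sub(r"Accessible Keys for Video.*myPlayer\.play\(\);(.*?)\}(.*?)\}", "", text).strip()
        res = re.sub(r"^(Watch|View) Video", "", res)
        return res.strip()
    return text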

In [364]:
def remove_refs_fed(text):
    if text is not None:
        res = re.sub(r'References.*', '', text)
        res = re.sub(r'1\.(.+?)Return to text.*', '', res)
        res = re.sub(r'Return to text.*', '', res)
        return res.strip()
    return text
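
A sketch of a possible pitfall in this rule: `References.*` cuts from the first occurrence of the word anywhere in the text, not only from a trailing reference list. The sample sentence is invented, and whether this happens in the real data has not been checked here.

```python
import re

def remove_refs_fed(text):
    # Same three substitutions as the notebook cell.
    res = re.sub(r'References.*', '', text)
    res = re.sub(r'1\.(.+?)Return to text.*', '', res)
    res = re.sub(r'Return to text.*', '', res)
    return res.strip()

# Hypothetical speech where "References" starts an ordinary sentence.
speech = "Policy is data dependent. References to 2 percent inflation are common. We will act."
truncated = remove_refs_fed(speech)
print(repr(truncated))
```

This might be one contributor to very short cleaned texts; requiring the word at the start of a line or near the end of the text would be a more conservative variant.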

In [365]:
def website_remover(text):
    # Remove http(s) URLs; raw strings keep the regex escapes readable.
    regex = (r"((http|https)://)(www\.)?"
             r"[a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]"
             r"{2,6}\b([-a-zA-Z0-9@:%._\+~#?&//=]*)")
    return re.sub(regex, "", text).strip()

In [366]:
def remove_greetings(text):
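    if text is not None:
        # Drops everything from the start through the first sentence containing a greeting.
        res = re.sub(r'^(.*?)Good (morning|afternoon|evening)[^.]*\.', '', text)
        # "." is unescaped, so this removes "Hello" plus whatever character follows it.
        res = re.sub(r'Hello.', '', res)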
        return res.strip()
    return text
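
A sketch of how the greeting rule behaves on two invented inputs. Because the pattern is anchored at the start and `(.*?)` can cross sentence boundaries, a greeting that appears late in a text removes everything before it. The speaker name and both strings are hypothetical.

```python
import re

def remove_greetings(text):
    # Same substitutions as the notebook cell.
    res = re.sub(r'^(.*?)Good (morning|afternoon|evening)[^.]*\.', '', text)
    res = re.sub(r'Hello.', '', res)
    return res.strip()

# Greeting in the opening sentence: only the opening is dropped.
opening = "Chair Smith, Good morning, everyone. Inflation remains elevated."
# Greeting quoted late in the text: the whole body before it is dropped.
late = "Inflation remains elevated. As the host said, Good morning to markets. Thank you."

opening_cleaned = remove_greetings(opening)
late_cleaned = remove_greetings(late)
print(repr(opening_cleaned))
print(repr(late_cleaned))
```

Limiting the search to the first few hundred characters would be one way to avoid the second case, at the cost of missing greetings that follow a long introduction.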

In [367]:
def pipeline_fed(fed):
    res = fed["text"].apply(find_footnote)
    res = res.apply(remove_video_code)
    res = res.apply(find_useless_thanks)
    res = res.apply(remove_refs_fed)
    res = res.apply(remove_greetings)
    res = res.apply(website_remover)
    # Note: this adds the "text_" column to the caller's DataFrame in place.
    fed["text_"] = res
    return fed

In [368]:
fed_ = pipeline_fed(fed)

In [374]:
fed_["text_"].str.split(" ").apply(len).describe()

count     739.000000
mean     2591.119080
std      1429.253326
min        27.000000
25%      1326.000000
50%      2818.000000
75%      3596.000000
max      8479.000000
Name: text_, dtype: float64

In [372]:
weird_entries = fed_["text_"][fed_["text_"].str.contains("Video")]
weird_entries

713    The Federal Reserve is best known for its role...
Name: text_, dtype: object

In [373]:
weird_text = weird_entries.iloc[0]
weird_text

'The Federal Reserve is best known for its role in the national economy and monetary policy. But through the 12 Federal Reserve Banks across the country, it also gets involved in efforts to support local communities and their economies. This work helps to enhance our understanding of the pace of economic recovery and further creates a backdrop for a national dialogue about common problems and their potential solutions.              Over the last several years, every community across the country has felt the effects of the financial crisis. Foreclosed, vacant, and abandoned properties threaten neighborhoods nationwide, and community leaders are working to stabilize those neighborhoods. While the problem touches every community, it doesn\'t look the same in each because it\'s shaped by the circumstances that prevailed in those neighborhoods before the crisis hit.              Neighborhood stabilization efforts are critical, now more than ever, as not all communities will be stabilized wi

In [331]:
weird_entries.iloc[0]

'           Accessible Keys for Video [Space Bar] toggles play/pause; [Right/Left Arrows] seeks the video forwards and back (5 sec ); [Up/Down Arrows] increase/decrease volume; [M] toggles mute on/off; [F] toggles fullscreen on/off (Except IE 11); The [Tab] key may be used in combination with the [Enter/Return] key to navigate and activate control buttons, such as caption on/off.                  videojs(\'frb-video6917\').ready(function() {                 var myPlayer;                 myPlayer = this;                 myPlayer.on(\'loadstart\',function(){                   var videoInfo = "";                   var transcriptLinkLabel;                   if (myPlayer.mediainfo.custom_fields["actualdatetext"]) {                     videoInfo += "<span class=\'col-xs-6\'>" + myPlayer.mediainfo.custom_fields["actualdatetext"] + "</span>";                   }                                if (myPlayer.mediainfo.custom_fields["transcriptlinkurl"]) {                     if (myPlayer.mediainf

In [332]:
weird_entries.iloc[2]

'           Accessible Keys for Video [Space Bar] toggles play/pause; [Right/Left Arrows] seeks the video forwards and back (5 sec ); [Up/Down Arrows] increase/decrease volume; [M] toggles mute on/off; [F] toggles fullscreen on/off (Except IE 11); The [Tab] key may be used in combination with the [Enter/Return] key to navigate and activate control buttons, such as caption on/off.                  videojs(\'frb-video6899\').ready(function() {                 var myPlayer;                 myPlayer = this;                 myPlayer.on(\'loadstart\',function(){                   var videoInfo = "";                   var transcriptLinkLabel;                   if (myPlayer.mediainfo.custom_fields["actualdatetext"]) {                     videoInfo += "<span class=\'col-xs-6\'>" + myPlayer.mediainfo.custom_fields["actualdatetext"] + "</span>";                   }                                if (myPlayer.mediainfo.custom_fields["transcriptlinkurl"]) {                     if (myPlayer.mediainf