# Handelsblatt data

*Handelsblatt* is a popular German daily newspaper. According to the IVW, Informationsgemeinschaft zur Feststellung der Verbreitung von Werbeträgern (Information Community for the Assessment of the Circulation of Media), it had a circulation of 140612 daily copies in the first quarter of 2021. Its focus on the economy makes Handelsblatt an appealing source for forecasting economic activity.

We purchased Handelsblatt data from Genios, a German provider of business information. The corpus includes **980516** articles from January 1994 to November 2018. The data was purchased in February 2019. 

The data set consists of 25 subfolders, one per year (e.g., HB_1994). Each subfolder contains a few XML files, which we parse to extract the information relevant to our research project. Unfortunately, copyright restrictions prevent us from publishing the data.

## Load the data

First, we need to read in the data. We create a list with the paths of the 25 subfolders (`folder_list`) and apply the function `hb_load` to them in parallel using Python's `multiprocessing` library.

We extract the following XML elements:

* datum - publication date
* worte - the word count
* ressort - section/subsection of the newspaper
* titel-liste/serientitel - the name of the series
* quelle/name - the name of the newspaper
* titel-liste/titel - article's title
* titel-liste/dachzeile - article's kicker
* titel-liste/untertitel - article's subheading
* inhalt/vorspann - annotation
* inhalt/text - text of the article
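
The function `hb_load` lives in `hb_load.py` and is not reproduced here. As a rough illustration of how the elements listed above could be read with the standard library, the following sketch parses a small made-up XML snippet. The root tag `artikel-liste`, the wrapping tag `artikel`, the date format, the column names, and the helper name `parse_articles` are assumptions made for this example; the actual Genios schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real Genios files may nest these elements differently.
SAMPLE_XML = """<artikel-liste>
  <artikel>
    <datum>03.01.1994</datum>
    <worte>68</worte>
    <ressort>Nachrichten</ressort>
    <quelle><name>Handelsblatt</name></quelle>
    <titel-liste>
      <titel>Nachrichten.</titel>
    </titel-liste>
    <inhalt>
      <text>Yuan abgewertet dpa PEKING.</text>
    </inhalt>
  </artikel>
</artikel-liste>"""

# Illustrative column name -> element path (paths taken from the list above).
FIELDS = {
    "date": "datum",
    "word_c": "worte",
    "rubrics": "ressort",
    "series_title": "titel-liste/serientitel",
    "newspaper": "quelle/name",
    "title": "titel-liste/titel",
    "kicker": "titel-liste/dachzeile",
    "subheading": "titel-liste/untertitel",
    "annotation": "inhalt/vorspann",
    "text": "inhalt/text",
}

def parse_articles(xml_string):
    """Return one dict per <artikel>; missing elements become empty strings."""
    root = ET.fromstring(xml_string)
    records = []
    for article in root.iter("artikel"):
        record = {}
        for column, element_path in FIELDS.items():
            value = article.findtext(element_path, default="")
            record[column] = (value or "").strip()
        records.append(record)
    return records

records = parse_articles(SAMPLE_XML)
print(records[0]["newspaper"], records[0]["word_c"])
```

A list of such dictionaries can be passed to `pd.DataFrame` to obtain one row per article.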

In [1]:
import os

# Handelsblatt is the main folder with 25 subfolders in it
path = 'E:\\Userhome\\mokuneva\\Handelsblatt' # your path here

# Create the list of all subfolders within the Handelsblatt main folder.
# os.listdir(path) returns the names of the entries in the main folder
folder_list = [os.path.join(path, f) for f in os.listdir(path)]

In [2]:
import multiprocessing as mp # use multiprocessing module for parallel computing

NUM_CORE = max(1, mp.cpu_count() - 4) # leave four cores free, but use at least one

print("The number of cores that will be used: {}".format(NUM_CORE))

The number of cores that will be used: 12


In [3]:
from datetime import datetime
startTime = datetime.now() # track time

import pandas as pd # load pandas: python data analysis library
import hb_load # import a function that loads the data from one folder (see hb_load.py file for details)

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    data_intermediate = pool.map(hb_load.hb_load, folder_list) # load data from each folder in parallel
    data = pd.concat(data_intermediate) # concatenate DataFrames corresponding to different folders
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)

0:01:16.444038


In [4]:
data = data.sort_values(['year', 'month', 'day'], ascending=[True, True, True]) # sort the data in chronological order
data = data.reset_index(drop=True) # reset the index of the DataFrame and discard the old one

In [5]:
data[10:15]

Unnamed: 0,day,kicker,month,newspaper,rubrics,series_title,texts,title,title_only,word_c,year
10,3,,1,Handelsblatt,Nachrichten,,Nachrichten. Yuan abgewertet dpa PEKING. China...,Nachrichten.,Nachrichten.,68,1994
11,3,,1,Handelsblatt,Nachrichten,,Nachrichten. Ruecktritt abgelehnt afp NEU DELH...,Nachrichten.,Nachrichten.,58,1994
12,3,,1,Handelsblatt,Nachrichten,,Nachrichten. Ungarn wertet Forint ab rtr BUDAP...,Nachrichten.,Nachrichten.,33,1994
13,3,,1,Handelsblatt,Nachrichten,,Nachrichten. FDP zum Beamtenrecht dpa BONN. Di...,Nachrichten.,Nachrichten.,55,1994
14,3,,1,Handelsblatt,Nachrichten,,Nachrichten. Neue Drohungen dpa HAMBURG. Der F...,Nachrichten.,Nachrichten.,86,1994


In [6]:
# the number of articles before pre-processing
len(data)

980516

## Light pre-processing

### Remove short articles (<100 words)

The first part of article 327123 contains stray symbols (e.g., zw|lf) in place of umlauts. The second part, which we keep, has the same content in the correct format and starts with 'NEU DELHI.'.

In [7]:
# keep only the second, correctly formatted part of the article, which starts with 'NEU DELHI.'
text = data.at[327123, 'texts']
data.at[327123, 'texts'] = text[text.find('NEU DELHI.'):]

An important feature of the text is its length. There are a few ways to count the number of words in a text:

1. Use the metadata ('worte') of the XML file. Beware that numbers are counted as words. This is the column 'word_c' in the DataFrame.
2. Quick and dirty approach: split the text on spaces and take the length of the resulting list. Numbers and other non-alphabetic tokens are counted as words. If a space is used as a thousands separator (100 000), the number is counted as two tokens.
3. Use the `count_words_mp` function, which counts only words. This is our preferred method because we exclude numbers from both the sentiment analysis and the topic modeling.
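
The difference between the second and third approaches can be seen on a single sentence. The sketch below is only an illustration: `count_alpha_words` is a hypothetical stand-in, and the actual `count_words_mp` function may tokenize differently (for instance, this version counts both halves of a hyphenated word separately).

```python
import re

# One or more Unicode letters; digits, underscores and punctuation are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")

def count_alpha_words(text):
    """Count alphabetic tokens only (hypothetical stand-in for count_words_mp)."""
    return len(WORD_RE.findall(text))

sample = "Der Umsatz stieg um 100 000 Euro auf 2,5 Mrd. Euro."

naive_count = len(sample.split(' '))    # approach 2: '100' and '000' are two tokens
alpha_count = count_alpha_words(sample) # approach 3: numbers are ignored

print(naive_count, alpha_count)
```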

In [8]:
# the second approach
# data['w_count'] = data['texts'].str.split(' ').str.len()

In [8]:
from datetime import datetime
startTime = datetime.now() # track time

import count_words_mp # import the function calculating the number of words in a text

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    count_results = pool.map(count_words_mp.count_words_mp, [text for text in data['texts']]) 
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)

0:00:40.811674


In [9]:
# Save the result as a new column "word_count"
data['word_count'] = count_results

The difference between the first ('word_c' column) and third ('word_count' column) approaches can be substantial.

In [10]:
diff = (data['word_c']-data['word_count'])

In [11]:
sorted(diff, reverse=True)[:10]

[695, 691, 591, 430, 429, 429, 404, 390, 387, 387]

This happens because the third approach does not count numbers as words, so number-heavy texts such as the league table below show the largest gaps.

In [12]:
data['texts'][list(diff).index(387)][:300]

'Fußball-Bundesliga. \\\\\\\\\\\\\\ zu Hause\\ auswärts\\ Sp.\\ g.\\ u.\\ v.\\ Tore\\ Diff.\\ Pkt.\\ Sp.\\ g.\\ u.\\ v.\\ Tore\\ Pkt.\\ Sp.\\ g.\\ u.\\ v.\\ Tore\\ Pkt.\\ 1. ( 1)\\ Bayern München 34\\ 20\\ 11\\ 3\\ 68:34\\ +34\\ 71\\ 17\\ 13\\ 4\\ 0\\ 37:12\\ 43\\ 17\\ 7\\ 7\\ 3\\ 31:22\\ 28\\ 2. ( 2)\\ Bayer Leverkusen 34\\ 21\\ 6\\ 7\\ 69:41\\ +28\\ 69'

Because short texts have sparse semantic features, topic models and BOW-based sentiment tools perform better on longer texts. This is why we keep only texts with at least 100 words.

In [13]:
# remove articles with the number of words<100
data = data[data['word_count']>=100]

In [14]:
# the number of articles after removing short articles
len(data)

673430

### Remove exact duplicates

A few examples of duplicates in our corpus:
* The same article enters the corpus twice with different publication dates (e.g., 6.4.1994 and 27.4.1994). In this case, a natural solution is to keep the first entry.
* The same article appears twice with a slight variation in the metadata (e.g., one of the duplicates includes the title of the series, or the word count is a little different even though the articles are identical).
* The same article enters the corpus twice with the same publication date and metadata.

In [15]:
# All the duplicated articles are saved as 'hb_duplicates' for further exploration.
hb_duplicates = data[data['texts'].duplicated(keep = False)]

In [16]:
hb_duplicates[:10]

Unnamed: 0,day,kicker,month,newspaper,rubrics,series_title,texts,title,title_only,word_c,year,word_count
18404,29,,4,Handelsblatt,Weltwirtschaft,,Importlizenzen im Vergleichsverfahren. HB DUES...,Importlizenzen im Vergleichsverfahren.,Importlizenzen im Vergleichsverfahren.,121,1994,112
18618,2,,5,Handelsblatt,Weltwirtschaft,,Importlizenzen im Vergleichsverfahren. HB DUES...,Importlizenzen im Vergleichsverfahren.,Importlizenzen im Vergleichsverfahren.,122,1994,112
29129,7,,7,Handelsblatt,Finanzzeitung; Geld und Kredit,,NACHTREPORT / Dow-Jones-Index steigt um 22 Pun...,NACHTREPORT / Dow-Jones-Index steigt um 22 Pun...,NACHTREPORT / Dow-Jones-Index steigt um 22 Pun...,638,1994,604
29637,8,,7,Handelsblatt,Finanzzeitung; Geld und Kredit,,NACHTREPORT / Dow-Jones-Index steigt um 22 Pun...,NACHTREPORT / Dow-Jones-Index steigt um 22 Pun...,NACHTREPORT / Dow-Jones-Index steigt um 22 Pun...,636,1994,604
32437,28,,7,Handelsblatt,Wirtschaft und Politik,,Glossar. Das Verarbeitende Gewerbe ist in West...,Glossar.,Glossar.,139,1994,130
35680,19,,8,Handelsblatt,Karriere,,Solide Basisausbildung zu DDR-Zeiten. Ost-Inge...,Solide Basisausbildung zu DDR-Zeiten. Ost-Inge...,Solide Basisausbildung zu DDR-Zeiten. Ost-Inge...,176,1994,169
35962,19,,8,Handelsblatt,Karriere,,Solide Basisausbildung zu DDR-Zeiten. Ost-Inge...,Solide Basisausbildung zu DDR-Zeiten. Ost-Inge...,Solide Basisausbildung zu DDR-Zeiten. Ost-Inge...,176,1994,169
40771,20,,9,Handelsblatt,Unternehmen und Maerkte,,Hochtief will seine Beteiligung bei Holzmann e...,Hochtief will seine Beteiligung bei Holzmann e...,Hochtief will seine Beteiligung bei Holzmann e...,738,1994,715
40776,20,,9,Handelsblatt,Unternehmen und Maerkte,,Hochtief will seine Beteiligung bei Holzmann e...,Hochtief will seine Beteiligung bei Holzmann e...,Hochtief will seine Beteiligung bei Holzmann e...,738,1994,715
41355,22,,9,Handelsblatt,Wirtschaft und Politik; Konjunkturbarometer,,Glossar. Das Verarbeitende Gewerbe ist in West...,Glossar.,Glossar.,133,1994,126


In [17]:
# drop the exact duplicates, keep the article with the earlier publication date ('first')
data = data.drop_duplicates(['texts'], keep='first')
data = data.reset_index(drop=True) # reset the index of the DataFrame and discard the old one

In [18]:
# the number of articles after removing exact duplicates
len(data)

670682

## Filtering

### Section

Handelsblatt's articles are organized into 2073 unique sections/subsections. We examine the most frequent ones and remove the articles published in a few of them.

In [19]:
# Sections and the number of articles per section
import collections
counter_sections = collections.Counter(data.rubrics)
print(counter_sections.most_common(10))

[('Unternehmen und Märkte', 87617), ('Wirtschaft und Politik', 33223), ('Deutschland', 29652), ('Finanzzeitung', 28197), ('Unternehmen & Märkte', 19289), ('Meinung und Analyse', 17238), ('Titelseite', 16957), ('Beilage oder Sonderseite', 16278), ('International', 16252), ('Europa', 15709)]


To ensure consistency in topical content over time, we remove articles from a few sections that were covered only within a limited time period:

* 1) 'Karriere': news related to work, career; due to the organizational changes, starting from 2014, this section is only present in a weekend edition of the newspaper.
* 2) 'Weekend Journal': the section was introduced in 2002 and did not receive any coverage after 2008. Moreover, the news discussed within this section is on average longer (663 words) than the news from the main section of interest "Wirtschaft und Politik" (436 words).
* 3) 'Panorama': news from around the world; the section was introduced in 1997 and existed until 2001.
* 4) 'Business-Travel': travel-related news, e.g. weather, 2001-2006, only 366 articles in total.
* 5) 'Fortschritt': scientific findings, new devices, 2001-2003.
* 6) 'Wochenende': weekend news, 2012-2018, average article length is 1579 words.
* 7) 'perspektiven': perspectives, non-economic topics like a prize for women in science, working abroad, or new beauty standards popularized by Dove's advertising campaign; 2007-2009.

In [20]:
abandoned_sections = ['Karriere', 'Weekend Journal', 'Panorama', 'Business-Travel', 
                    'Fortschritt', 'Wochenende', 'perspektiven']

In [21]:
data = data[~data['rubrics'].isin(abandoned_sections)]

In [22]:
# the number of articles after removing articles from the sections covered only within a certain time period
len(data)

658481

Next, we exclude the articles from non-economic sections. The list of irrelevant sections follows.

* 1) 'Sportreport': sport news.
* 2) 'Kunstmarkt': art market.
* 3) 'Sportreport      Nachrichten': sport news.
* 4) 'Neue Bücher': an advertisement for new books.
* 5) 'Wissenschaft & Debatte': science and debate.
* 6) 'Literatur': literature.
* 7) 'Leserbriefe': letters to the editor; starting from 2013, there are almost no letters/comments from the readers, because all the comments became digital.
* 8) 'Reisen und Tagen': travel.
* 9) 'Kultur': culture.
* 10) 'Termin- und Optionsmärkte': futures and options markets, mostly quantitative info.
* 11) 'Galerie des Stils': style and fashion.
* 12) 'Galerie': non-economic news about a pop band, Nutella restaurant etc.
* 13) 'Neue Buecher': new books.
* 14) 'Online': news about the websites, digitalization, online services.
* 15) 'Forschung und Technik': research and technology.
* 16) 'Business-Service': news unrelated to the economy, e.g. telephone service for weather forecast, travel information etc.
* 17) 'Computer und Kommunikation': computers and communication.
* 18) 'Ökonomie & Bildung': economics and education, Handelsblatt's materials for school lessons, MBA studies, educating articles about how the economy works.
* 19) 'Auto-Mobil': new car models.
* 20) 'Die Handelsblatt-Woche': announcement of events taking place this week.
* 21) 'Wirtschaftsbuch': economics books.

In [23]:
non_economic_sections = ['Kunstmarkt', 'Sportreport', 'Sportreport      Nachrichten', 
                  'Neue Bücher', 'Wissenschaft & Debatte',  
                  'Literatur', 'Leserbriefe', 'Reisen und Tagen', 
                  'Kultur',  'Termin- und Optionsmärkte',
                  'Galerie des Stils', 'Galerie', 'Neue Buecher', 'Online', 'Forschung und Technik', 
                  'Business-Service', 'Computer und Kommunikation', 'Ökonomie & Bildung', 'Auto-Mobil', 
                  'Die Handelsblatt-Woche', 'Wirtschaftsbuch']

In [24]:
data = data[~data['rubrics'].isin(non_economic_sections)]

In [25]:
# the number of articles after removing articles from the non-economic sections
len(data)

621317

Additionally, we remove articles from the related sections and subsections. 

* This means that we not only remove articles from a section 'Kunstmarkt', but also from the subsections 'Kunstmarkt      Anleihen', 'Kunstmarkt      Aus den Galerien' etc. 

* We consider 'Kunstmarkt' and 'Themen und Trends      Kunstmarkt' as two close sections. 

* The full list of sections/subsections we remove can be found in 'subsections_hb.txt'.
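
The list in 'subsections_hb.txt' is the authoritative one. As a hedged illustration of how candidate entries for such a list could be found, the sketch below splits each section label on runs of two or more spaces and flags labels in which any part matches an already excluded section. Both the helper `related_rubrics` and the assumption that a run of spaces separates section from subsection are ours, inferred from the examples above.

```python
import re

def related_rubrics(all_rubrics, excluded_sections):
    """Return labels that contain an excluded section as one of their parts."""
    related = set()
    for rubric in all_rubrics:
        # 'Kunstmarkt      Anleihen' -> ['Kunstmarkt', 'Anleihen']
        parts = [part.strip() for part in re.split(r"\s{2,}", rubric) if part.strip()]
        is_exact_match = rubric in excluded_sections
        if not is_exact_match and any(part in excluded_sections for part in parts):
            related.add(rubric)
    return related

rubrics = [
    'Kunstmarkt',
    'Kunstmarkt      Anleihen',
    'Themen und Trends      Kunstmarkt',
    'Wirtschaft und Politik',
]
candidates = related_rubrics(rubrics, {'Kunstmarkt'})
print(sorted(candidates))
```

Candidates found this way would still need a manual check before being added to the exclusion list.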

In [26]:
# this function loads the dictionary with the related sections and subsections we want to exclude
import os

def dictionary_open(name):
    # read the file from the current working directory as UTF-8
    with open(os.path.join(os.getcwd(), name), 'r', encoding='utf-8') as f:
        # skip the first and the last line of the file, keep the rest as a set
        dictionary = set(f.read().splitlines()[1:-1])
    return dictionary

In [27]:
subsections_clean = dictionary_open('subsections_hb.txt')

In [28]:
data = data[~data['rubrics'].isin(subsections_clean)]
data = data.reset_index(drop=True) # reset the index of the DataFrame and discard the old one

In [29]:
# the number of articles after removing articles from the related subsections
len(data)

588837

In [30]:
# Sections and the number of articles per section
counter_sections=collections.Counter(data['rubrics'])
print(counter_sections.most_common(10))

[('Unternehmen und Märkte', 87617), ('Wirtschaft und Politik', 33223), ('Deutschland', 29652), ('Finanzzeitung', 28197), ('Unternehmen & Märkte', 19289), ('Meinung und Analyse', 17238), ('Titelseite', 16957), ('Beilage oder Sonderseite', 16278), ('International', 16252), ('Europa', 15709)]


### Section + text

Remove articles for which the following two conditions are met: the section is '' (unclassified), and the text contains the string 'KARRIERE'. As discussed above, articles of this type are removed because of organizational changes.

In [31]:
data = data[~((data['rubrics']=='') & (data['texts'].str.contains('KARRIERE')))]

Remove articles that are characterized by the following conditions: a section contains a string "Recht und Steuern" (Law and taxes), and a text contains one of the legal terms from the list `legal_terms`.

* Az./AZ/Aktenzeichen/AKTENZEICHEN - docket number
* LAG/Landesarbeitsgericht - Regional Labour Court
* BAG/Bundesarbeitsgericht - Federal Labour Court
* VI R/IV A - case/law reference
* BFH - Bundesfinanzhof, German Federal Fiscal Court
* Abs. - section
* BGH/Bundesgerichtshof - Federal Supreme Court
* BGB - Bürgerliches Gesetzbuch, Civil Code
* BSG - Bundessozialgericht, Federal Social Court
* EStG/Einkommensteuergesetz - Income Tax Act
* OLG/Oberlandesgericht - Higher Regional Court
* Anwaltgerichtshof - Lawyers' Court
* Finanzgericht - Tax Court

We remove these articles because they contain very detailed explanations of laws, tax deduction rules, and court decisions, which are unlikely to be relevant for forecasting economic activity.
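
The filter joins the legal terms into one regular expression with `|`, which is why the dots in abbreviations are escaped. The sketch below reproduces the idea on plain strings with Python's `re` module, using only a subset of the terms; the helper name `is_legal_digest` is hypothetical. pandas' `str.contains` treats its pattern as a regular expression by default, so the matching logic is comparable.

```python
import re

# Subset of the legal terms, for illustration; '\\.' matches a literal dot.
legal_terms_subset = ['Az\\.', 'BFH', 'Abs\\.', 'EStG']
legal_pattern = re.compile('|'.join(legal_terms_subset))

def is_legal_digest(rubric, text):
    """True if the section is 'Recht und Steuern' and the text has a legal term."""
    return 'Recht und Steuern' in rubric and legal_pattern.search(text) is not None

flag_a = is_legal_digest('Recht und Steuern', 'Urteil des BFH vom 12. Mai (Az. VI R 10/93)')
flag_b = is_legal_digest('Recht und Steuern', 'Die Steuerreform kommt später.')
flag_c = is_legal_digest('Wirtschaft und Politik', 'Der BFH hat entschieden.')

# Why escaping matters: an unescaped dot matches any character.
unescaped_hit = re.search('Az.', 'Azubi') is not None
escaped_hit = re.search('Az\\.', 'Azubi') is not None

print(flag_a, flag_b, flag_c, unescaped_hit, escaped_hit)
```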

In [32]:
legal_terms = ['Az\\.', 'AZ', 'Az\\:', 'Aktenzeichen', 'AKTENZEICHEN', 'LAG', 'Landesarbeitsgericht', 'Bundesarbeitsgericht', 'BAG', 'Arbeitsgericht', 'VI R','IV A','BFH','Abs\\.','BGH', 'Bundesgerichtshof', 'BGB','BSG','EStG','Einkommensteuergesetz','OLG', 'Oberlandesgericht', 'Anwaltgerichtshof', 'Finanzgericht']
data = data[~((data.rubrics.str.contains('Recht und Steuern')) & (data.texts.str.contains('|'.join(legal_terms))))]

In [33]:
# the number of articles after filtering out articles based on the section and text
len(data)

583767

### Section + title

Remove articles satisfying two conditions: the section is 'Inhalt' (content), and the title is not 'Termine des Tages.'. The section 'Inhalt' mainly includes news reports about Handelsblatt and its organization. Articles with the title 'Termine des Tages.' may include short news on the economy, so we keep them.

In [34]:
data = data[~(((data['rubrics']=='Inhalt') | (data['rubrics']=='Inhalt ;') | (data['rubrics']=='Inhalt      Der Werber-Rat'))
              & (data['title']!='Termine des Tages.'))]

In [35]:
# the number of articles after removing articles from the section 'Inhalt'
len(data)

582998

### Title

Exclude articles with the following title patterns:

* book suggestions (Buchtip)
* some very short news on a few people (BUSINESS LOUNGE.)
* articles about Handelsblatt and its organization (Liebe Leser)
* articles with comments from the readers (DIE MEINUNG UNSERER LESER.)
* announcement of events taking place this week (DIE HANDELSBLATT-WOCHE.)
* non-economic news related to numbers, e.g., how many people watched a movie this week (Bilanz des Wochenendes.)
* sport news (SPORT TELEGRAMM.) 
* news related to the project 'Handelsblatt macht Schule' promoting economic education

In [36]:
title_patterns = ['BUCHTIP', 'Buchtip', 'BUSINESS LOUNGE\\.', 'Liebe Leser', 'DIE MEINUNG UNSERER LESER\\.', 
                 'DIE HANDELSBLATT\\-WOCHE\\.', 'Die Handelsblatt\\-Woche\\.', 'Die Handelsblattwoche\\.',
                 'DIE HANDELSBLATTWOCHE\\.', 'Bilanz des Wochenendes\\.', 'BILANZ DES WOCHENENDES\\.',
                 'SPORT TELEGRAMM\\.']
data = data[~data['title_only'].str.contains('|'.join(title_patterns))]

kicker_title_patterns = ['Handelsblatt macht Schule', 'HANDELSBLATT MACHT SCHULE\\.', 
                         'Aktuelles Wirtschaftswissen für den Unterricht', 
                         'Wirtschaftswissen für Schüler leicht gemacht',
                        'Ökonomie ganz leicht gemacht für Jugendliche',
                        'Aktuelles Wirtschaftswissen für Jugendliche',
                        'Neuer Stoff für den Unterricht im Fach Wirtschaft',
                        'Handelsblatt bringt Praxis in die Schulen',
                        'Experten aus der Praxis in den Schulen',
                        'Der neue Newcomer ist da',
                        'Die neue Zeitung für Schüler ist da',
                        'Wettbewerb\\: Eine Welt ohne Geld\\?',
                        'Wettbewerb zum Thema Geld',
                        'Wettbewerb für Schüler',
                        'Schlauere Azubis dank Zeitunglesen',
                        'Förderung für Auszubildende'
                        ]
data = data[~data.title.str.contains('|'.join(kicker_title_patterns))]

Exclude articles with the following titles:

* Kontakte. (contacts of organizations: addresses and phone numbers).

In [37]:
data = data[~(data['title_only']=='Kontakte.')]

In [38]:
# the number of articles after excluding articles based on the title patterns
len(data)

580596

### Series

Exclude articles from the following non-economic series:

* Neue Wirtschaftsliteratur (Handelsblatt-Beilage): economic books;
* Golf (Handelsblatt-Beilage): golf;
* Literatur (Handelsblatt-Beilage): literature;
* Macher des Handelsblatts (Handelsblatt-Serie): about journalists working for Handelsblatt.

In [39]:
series_exclude = ['Neue Wirtschaftsliteratur (Handelsblatt-Beilage)', 
                  'Golf (Handelsblatt-Beilage)', 'Literatur (Handelsblatt-Beilage)', 
                  'Literatur (Handelsblatt - Beilage)', 'Macher des Handelsblatts (Handelsblatt-Serie)']
data = data[~data['series_title'].isin(series_exclude)]
data = data.reset_index(drop=True) # reset the index of the DataFrame and discard the old one

In [40]:
# the number of articles after excluding articles from the non-economic series
len(data)

579948

### Text

Exclude articles that contain the following strings:

* TERMINMÄRKTE / Tagesbericht: options trading, mainly quantitative information;
* Deutsche Börsen Düsseldorf: stock market, mainly quantitative information;
* DIGIX. Der E-Business-Index. or DIGIX. DIGIX Der E-Business-Index.: text corresponding to a graph.

In [41]:
text_strings = ['TERMINMÄRKTE \\/ Tagesbericht', 'TERMINMAERKTE \\/ Tagesbericht', 
                'Deutsche Börsen Düsseldorf', 'Deutsche Boersen Duesseldorf', 'DIGIX\\. Der E\\-Business\\-Index.',
               'DIGIX\\. DIGIX Der E\\-Business\\-Index\\.']
data = data[~(data.texts.str.contains('|'.join(text_strings)))]
data = data.reset_index(drop=True) # reset the index of the DataFrame and discard the old one

In [42]:
# the number of articles after excluding articles based on the text patterns
len(data)

579822

In [43]:
counter=collections.Counter(data['year'])
print(counter)

Counter({2000: 33949, 1999: 32861, 2001: 30730, 1996: 27326, 1997: 27084, 1998: 26911, 2004: 26432, 1995: 26169, 2002: 25083, 1994: 25072, 2005: 24726, 2003: 24350, 2006: 24167, 2007: 23816, 2010: 23811, 2008: 23520, 2009: 23329, 2011: 22196, 2012: 20667, 2013: 17506, 2014: 17397, 2015: 16297, 2016: 13103, 2017: 12152, 2018: 11168})


## Umlauts

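In some articles, umlauts and ß are transliterated (ae, oe, ue, ss). To fix this issue, we use the notebook Umlauts_fix, which is written in Python 2: we export the affected articles published before 1999, correct them there with a spellchecker, and read the corrected texts back in.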
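
The Umlauts_fix notebook is not reproduced here. To show why a blind search-and-replace is not enough, the following sketch restores an umlaut only when the resulting word is found in a vocabulary. The helper names and the tiny `vocabulary` set are hypothetical, the 'ss'/'ß' case is left out, and words in which only some digraphs should be converted are not handled.

```python
import re

# Transliterated digraphs and the umlauts they may stand for.
DIGRAPHS = [("ae", "ä"), ("oe", "ö"), ("ue", "ü"),
            ("Ae", "Ä"), ("Oe", "Ö"), ("Ue", "Ü")]

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters

def restore_word(word, vocabulary):
    """Replace digraphs only if the result is a known word."""
    candidate = word
    for digraph, umlaut in DIGRAPHS:
        candidate = candidate.replace(digraph, umlaut)
    return candidate if candidate in vocabulary else word

def restore_umlauts(text, vocabulary):
    return WORD_RE.sub(lambda m: restore_word(m.group(0), vocabulary), text)

# Hypothetical vocabulary; a spellchecker dictionary would play this role.
vocabulary = {"Beiträge", "für", "kräftigen"}

fixed = restore_umlauts("Steuern und Beitraege steigen fuer das neue Jahr", vocabulary)
print(fixed)
```

In this example 'Steuern' and 'neue' stay untouched because 'Steürn' and 'neü' are not in the vocabulary, whereas a naive replacement of every 'ue' would corrupt them.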

In [44]:
umlauts = ['ä', 'ö', 'ü', 'ß', 'Ä', 'Ö', 'Ü']
umlauts_replace = ['ae', 'oe', 'ue', 'ss', 'AE', 'OE', 'UE']

In [45]:
hb_umlauts_fix = data[(data.texts.str.contains('|'.join(umlauts_replace))) & (~data.texts.str.contains('|'.join(umlauts))) & (data.year<1999)]

In [46]:
# example of the text that we try to fix with a spellchecker
hb_umlauts_fix.texts[0]

'1994 wird ein Jahr des Wandels. Auf ein Neues!. Steuern und Beitraege steigen, staatliche Leistungen werden gesenkt - wahrlich kein berauschender Auftakt fuer das neue Jahr. Ohnehin sind die Erwartungen nicht sehr rosig. Zwar wird sich die Wirtschaft erholen. Ob es zu einem kraeftigen Aufschwung reicht, ist ungewiss. Sicher ist dagegen der Anstieg der Arbeitslosigkeit. Fuer Spannung sorgt der Wahlmarathon, der erst mit den Bundestagswahlen im Oktober entschieden wird. Kein Zweifel: 1994 wird ein Jahr der Unsicherheit und des Wandels. Ob es auch ein gutes Jahr wird, haengt davon ab, wie dieser Wandel bewaeltigt wird. 1993 war das Jahr der Krisenerkenntnis. Die Einsicht in die Kluft zwischen wirtschaftlich Machbarem und Anspruchsdenken hat erste Ansaetze zur Krisenbewaeltigung ermoeglicht. In der Tarifpolitik und bei den Sozialleistungen hat es - noch vor kurzem undenkbare - Einschnitte gegeben. Auch 1994 wird es Kuerzungen geben. Nur: Mit dem Rotstift allein kann kein Ausweg aus der Kr

In [47]:
hb_umlauts_fix.to_csv('hb_umlauts_fix.csv', encoding='utf-8-sig', sep = ';')

In [53]:
import pandas as pd
hb_umlauts_fixed = pd.read_csv('hb_umlauts_fixed.csv', encoding = 'utf-8', sep=';')



In [55]:
data.loc[hb_umlauts_fixed['Unnamed: 0.1'], 'texts'] = hb_umlauts_fixed.texts.values

In [56]:
# fixed version
data.texts[0]

'1994 wird ein Jahr des Wandels. Auf ein Neues!. Steuern und Beiträge steigen, staatliche Leistungen werden gesenkt - wahrlich kein berauschender Auftakt für das neue Jahr. Ohnehin sind die Erwartungen nicht sehr rosig. Zwar wird sich die Wirtschaft erholen. Ob es zu einem kräftigen Aufschwung reicht, ist ungewiss. Sicher ist dagegen der Anstieg der Arbeitslosigkeit. Für Spannung sorgt der Wahlmarathon, der erst mit den Bundestagswahlen im Oktober entschieden wird. Kein Zweifel: 1994 wird ein Jahr der Unsicherheit und des Wandels. Ob es auch ein gutes Jahr wird, hängt davon ab, wie dieser Wandel bewältigt wird. 1993 war das Jahr der Krisenerkenntnis. Die Einsicht in die Kluft zwischen wirtschaftlich Machbarem und Anspruchsdenken hat erste Ansätze zur Krisenbewältigung ermöglicht. In der Tarifpolitik und bei den Sozialleistungen hat es - noch vor kurzem undenkbare - Einschnitte gegeben. Auch 1994 wird es Kürzungen geben. Nur: Mit dem Rotstift allein kann kein Ausweg aus der Krise gezeic

## Remove English articles

In the next pre-processing step, we filter out news articles written in English. To do so, we use the **langdetect** library.

In [57]:
startTime = datetime.now()

import identify_eng_2

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    eng_results = pool.map(identify_eng_2.identify_eng_2, [text for text in data['texts']]) 
    pool.close()
    pool.join()

print(datetime.now()-startTime)

0:32:41.138686


In [58]:
data['language'] = eng_results
data = data[data.language==0]
data = data.reset_index(drop=True) # reset the index of the DataFrame and discard the old one

In [59]:
# the number of articles after excluding English articles
len(data)

579765

## Remove URLs

Remove URLs and HTML file references, e.g., 'de/studie99.html', 'Amazon.com', etc.
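
The function `correct_url` is defined in a separate file. A minimal sketch of the idea, assuming a simple regular expression is sufficient, could look as follows. The pattern, the list of endings (`html`, `com`, `de`, `org`, `net`), and the helper name `remove_urls` are our assumptions and will miss or over-match some cases.

```python
import re

URL_RE = re.compile(
    r"(?:https?://|www\.)\S+"                     # explicit web addresses
    r"|\S+\.(?:html?|com|de|org|net)\b(?:/\S*)?"  # file references and bare domains
)

def remove_urls(text):
    """Delete URL-like tokens and collapse the whitespace left behind."""
    cleaned = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

example_1 = remove_urls("Die Studie steht unter de/studie99.html bereit.")
example_2 = remove_urls("Amazon.com meldet Gewinn")
example_3 = remove_urls("Der Umsatz stieg um 2,5 Mrd. DM.")

print(example_1)
print(example_2)
print(example_3)
```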

In [60]:
startTime = datetime.now()

import correct_url

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    url_corrected = pool.map(correct_url.correct_url, [text for text in data['texts']]) 
    pool.close()
    pool.join()

print(datetime.now()-startTime)

0:02:41.383332


In [61]:
data['texts'] = url_corrected

## O instead of 0 problem

In some cases, Optical Character Recognition (OCR) cannot distinguish between '0' and 'O' ('o'). As a result, the corpus contains tokens like '1OO'. Using regular expressions, we identify the problematic tokens and replace 'O' ('o') with '0'.
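
The actual rules are implemented in `ocr_replace`. One way such a rule could be written is sketched below: a token that starts with a digit and otherwise consists only of digits and the letters 'O'/'o' is treated as a number. The pattern and the helper name `fix_ocr_zeros` are assumptions for illustration; tokens that begin with 'O' (e.g., 'O,5') are not covered.

```python
import re

# A token that starts with a digit and contains only digits or the letters O/o.
OCR_TOKEN_RE = re.compile(r"\b\d[\dOo]*\b")

def fix_ocr_zeros(text):
    """Replace 'O' and 'o' with '0' inside number-like tokens."""
    return OCR_TOKEN_RE.sub(
        lambda m: m.group(0).replace("O", "0").replace("o", "0"), text
    )

corrected = fix_ocr_zeros("Der Index stieg um 1OO Punkte auf 2o5O.")
untouched = fix_ocr_zeros("3Sat sendet um 10 Uhr.")

print(corrected)
print(untouched)
```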

In [62]:
startTime = datetime.now()

import ocr_replace

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    ocr_corrected = pool.map(ocr_replace.ocr_replace, [text for text in data['texts']]) 
    pool.close()
    pool.join()

print(datetime.now()-startTime)

0:00:15.676602


In [63]:
data['texts'] = ocr_corrected

## Fixing tokens containing a number and a word

In quite a few cases, a number and a word are erroneously merged into a single token. Splitting such tokens in two has the following benefits:

1. Hyphen-separated words will have the same spelling across the whole corpus. Examples: 20Jährige/20-Jährige, 100prozentig/100-prozentig, 30minütigen/30-minütigen etc.

2. Numbers and the names of the currencies will represent two different tokens. Examples: 100DM/100 DM, 30Euro/30 Euro.

3. Numbers and the measures of time/weight/distance etc. will be split into two tokens. Examples: 100km/100 km, 1Mill/1 Mill, 16Uhr/16 Uhr.

4. Simple mistakes will be corrected: 30bis 40/30 bis 40, 10Fahrzeuge/10 Fahrzeuge, 10und/10 und.

5. Mistakes at the beginning of sentences will be corrected: 10Welche/10 Welche, 8Wie/8 Wie, etc.

The most frequent exceptions to this rule are taken into account:

1. Names of the companies/organizations: 1822direkt, 3Sat, 4MBO etc.
2. Names of smartphone/airplane/satellite models: 4S, 1B, 1k, 328Jet, 5C, 777X etc.
3. German nouns like 90er and English adjectives like 21st, 3rd.
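
The splitting rule and its exceptions can be sketched with a single regular expression. This is an illustration under our own assumptions, not the code of `split_number_word`: the helper `split_merged_tokens`, the exception set (taken from the examples above), and the suffix rule are hypothetical, and the sketch always inserts a space, so the hyphenated variants in item 1 would need an extra rule.

```python
import re

# Exceptions taken from the examples in the text; the real list is longer.
EXCEPTIONS = {"1822direkt", "3Sat", "4MBO", "4S", "1B", "1k", "328Jet", "5C", "777X"}
# Suffixes that mark decades (90er) or English ordinals (21st, 3rd).
KEEP_SUFFIXES = {"er", "st", "rd"}

# Digits immediately followed by letters, as one whole token.
MERGED_RE = re.compile(r"\b(\d+)([^\W\d_]+)\b")

def split_merged_tokens(text):
    """Insert a space between a number and a word unless the token is an exception."""
    def repl(match):
        token, number, word = match.group(0), match.group(1), match.group(2)
        if token in EXCEPTIONS or word in KEEP_SUFFIXES:
            return token
        return number + " " + word
    return MERGED_RE.sub(repl, text)

split_text = split_merged_tokens(
    "Um 16Uhr kostete die Fahrt 100DM, berichtet 3Sat über die 90er Jahre."
)
print(split_text)
```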

In [64]:
startTime = datetime.now()

import split_number_word

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    split_corrected = pool.map(split_number_word.split_number_word, [text for text in data['texts']]) 
    pool.close()
    pool.join()

print(datetime.now()-startTime)

0:00:24.513277


In [65]:
data['texts'] = split_corrected

## Remove fuzzy duplicates

By fuzzy duplicates we mean near-duplicate articles. These are:

* drafts/minor revisions of the articles saved in the database;
* slightly changed advertisements which are published several times during a month;
* 'NACHT - REPORT' news reports about stock market which are published a few hours earlier than almost identical 'WALL STREET' articles.

We identify fuzzy duplicates using cosine similarity and choose a threshold of 93% based on visual exploration of the results. We used the following article by Ryan Basques as a reference: [Link](https://towardsdatascience.com/a-laymans-guide-to-fuzzy-document-deduplication-a3b3cf9a05a7).

There are also a few exceptions that we take into account:

* 'DEUTSCHE AKTIEN' articles are often erroneously classified as duplicates because they contain very similar information;
* 'Fortsetzung von Seite' (continuation of an article from another page): not surprisingly, these documents are also very close to each other. Later we concatenate them into one article;
* cosine similarity approach does not work perfectly for extremely long articles. A few of them (e.g., 'Stuttgart 21: Ist das Projekt noch zu stoppen?') we list as exceptions.
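
The cosine-similarity comparison described above can be illustrated in pure Python. The sketch uses raw term counts as document vectors, which is a simplification; the vectorization, the exceptions, and the exact comparison used in `fuzzy_duplicates` are not shown in this notebook and may differ. The helper names and the `>=` comparison against the 0.93 threshold are assumptions.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased term counts of a text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_fuzzy_duplicates(texts, threshold=0.93):
    """Return positions of texts that are too similar to an earlier text."""
    vectors = [bag_of_words(text) for text in texts]
    duplicates = []
    for j in range(len(vectors)):
        if any(cosine_similarity(vectors[i], vectors[j]) >= threshold for i in range(j)):
            duplicates.append(j)
    return duplicates

texts = [
    "Der Dow Jones Index steigt um 22 Punkte und schließt fester",
    "Hochtief will seine Beteiligung bei Holzmann erhöhen",
    "Der Dow Jones Index steigt um 22 Punkte und schließt deutlich fester",
]
duplicates = find_fuzzy_duplicates(texts)
print(duplicates)
```

Comparing every pair is quadratic in the number of articles, which is one reason to run the comparison separately for each month-year combination, as done below.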

In [66]:
# Required input for the function 'fuzzy_duplicates': a dataframe for each month-year combination.
# List with a year
inputs_year = []
# List with a month
inputs_month = []
# List with the dataframes containing 'year', 'month', and 'texts' columns
inputs_month_year = []
for year in list(set(data['year'])):
    for month in list(set(data['month'])):
        inputs_year.append(year)
        inputs_month.append(month)
        inputs_month_year.append(data[(data['year'] == year) & (data['month'] == month)][["month", "year", "texts"]])
        
inputs = list(zip(inputs_year, inputs_month, inputs_month_year))

In [67]:
startTime = datetime.now() # track time

import fuzzy_duplicates # import a function that outputs the indices of fuzzy duplicates 
delete_indices = []

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    # apply function to all combinations of month-year in parallel
    delete_intermediate = pool.map(fuzzy_duplicates.fuzzy_duplicates, inputs)
    delete_indices = delete_indices + delete_intermediate # create one list of indices
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)



0:08:08.459927


### Duplicates exploration

In case you want to have a look at the identified duplicates, use the function 'fuzzy_duplicates_test' from the file 'fuzzy_duplicates_test_all'.

In [68]:
startTime = datetime.now() # track time

import fuzzy_duplicates_test_all 

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    dup_intermediate = pool.map(fuzzy_duplicates_test_all.fuzzy_duplicates_test, inputs) 
    duplicates = pd.concat(dup_intermediate) 
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)

duplicates.to_csv('duplicates.csv', encoding = 'utf-8-sig', sep = ';')

0:08:10.531109


### Drop the duplicates

In [69]:
# Free memory
inputs = None
# List of indices corresponding to the duplicated articles
delete_indices = [item for sublist in delete_indices for item in sublist]
# List of unique indices
delete_indices = list(set(delete_indices))
# Drop the fuzzy duplicates
data.drop(data.index[delete_indices], inplace = True)
data = data.reset_index(drop=True) # reset the index of the DataFrame and discard the old one