In this notebook we identify collocations using a part-of-speech (POS) tagger. As in [Hansen (2018)](https://academic.oup.com/qje/article/133/2/801/4582916), we later use the collocations in the topic model. By collocations we mean two-word and three-word sequences that have a specific meaning (e.g., labour market).

We follow the article by [Markus Konrad](https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/) and use supervised classification for POS tagging. This means that the tagger is trained on a large [text corpus](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/).

Here are the steps we take to train the tagger (please see the article by Markus Konrad for a detailed explanation):

Step 1: Download the TIGER corpus in CONLL09 format ('tigercorpus-2.2.conll09.tar.gz') [here](https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/download/start.html), unpack it, and save it to your working directory.

Step 2: Read the corpus with the NLTK library:

In [1]:
import nltk
corp = nltk.corpus.ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
                                    ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                    encoding = 'utf-8')
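
The column specification in the cell above tells the reader to take the second column as the word and the fifth as the POS tag, and to skip the rest. The same idea can be sketched without any library. Here `read_tagged_sents` is a hypothetical helper, the sample rows are invented for illustration rather than copied from the TIGER corpus, and tab-separated columns with blank lines between sentences are assumed.

```python
def read_tagged_sents(lines, word_col=1, pos_col=4):
    """Yield sentences as lists of (word, pos) pairs.

    Assumes tab-separated columns and blank lines between sentences.
    """
    sentence = []
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split('\t')
        sentence.append((cols[word_col], cols[pos_col]))
    if sentence:
        yield sentence


# Invented sample rows: ID, word, lemma, placeholder, POS.
sample = [
    "1\tDas\tder\t_\tART",
    "2\tist\tsein\t_\tVAFIN",
    "",
    "1\tTest\tTest\t_\tNN",
]
sents = list(read_tagged_sents(sample))
print(sents)
```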

Step 3: Use the Python class ClassifierBasedGermanTagger to determine POS. This tagger inspects words for prefixes, suffixes, and other attributes and also takes the sequence of words into account. Download the folder 'ClassifierBasedGermanTagger' [here](https://github.com/ptnplanet/NLTK-Contributions) and save it to your working directory.

In [2]:
# import the tagger:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger

# load the sentences from the corpus:
tagged_sents = list(corp.tagged_sents())

# train the tagger with the complete corpus (the accuracy is around 96%)
tagger = ClassifierBasedGermanTagger(train=tagged_sents)

# determine the POS tag of each word in a sentence:
tagger.tag(['Das', 'ist', 'ein', 'einfacher', 'Test', 'dritte'])

[('Das', u'ART'),
 ('ist', u'VAFIN'),
 ('ein', u'ART'),
 ('einfacher', u'ADJA'),
 ('Test', u'NN'),
 ('dritte', u'ADJA')]
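
The accuracy quoted in the training comment can only be checked on sentences the tagger has not seen during training. A minimal sketch of such a check follows, assuming part of `tagged_sents` is held out before training. `tagging_accuracy` and the stand-in `AlwaysNounTagger` are hypothetical names introduced here; any object with a `tag(words)` method that returns `(word, tag)` pairs should work in its place.

```python
def tagging_accuracy(tagger, gold_sents):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    return float(correct) / total if total else 0.0


class AlwaysNounTagger(object):
    """Stand-in tagger, used only to make the example runnable."""

    def tag(self, words):
        return [(word, 'NN') for word in words]


gold = [[('Das', 'ART'), ('Haus', 'NN')], [('Test', 'NN'), ('Markt', 'NN')]]
acc = tagging_accuracy(AlwaysNounTagger(), gold)
print(acc)  # 0.75
```

With the real data, one option would be to train on roughly 90% of the corpus sentences and pass the remaining 10% as `gold_sents`.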

The part-of-speech tagset for the TIGER corpus can be found on page 12 [here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj_1-L0q6j5AhUFOHoKHZQlBRoQFnoECAYQAQ&url=https%3A%2F%2Fwww.ims.uni-stuttgart.de%2Fdocuments%2Fressourcen%2Fkorpora%2Ftiger-corpus%2Fannotation%2Ftiger_introduction.pdf&usg=AOvVaw2jtxYYF9U_nY7Lt-c-ZbzL).

Step 4: Save the whole tagger object to disk using **pickle**.

In [3]:
import pickle

with open('nltk_pos.pickle', 'wb') as f:
    pickle.dump(tagger, f, protocol=2)
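
In a later session the saved tagger can be restored without retraining. A minimal sketch is shown below, where `load_tagger` is a hypothetical helper and a small dictionary stands in for the tagger so that the example runs on its own.

```python
import os
import pickle
import tempfile


def load_tagger(path):
    """Read a pickled object back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip with a stand-in object instead of the trained tagger.
stand_in = {'name': 'demo-tagger'}
tmp_path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')
with open(tmp_path, 'wb') as f:
    pickle.dump(stand_in, f, protocol=2)

restored = load_tagger(tmp_path)
print(restored)
```

When the real `nltk_pos.pickle` is loaded, the `ClassifierBasedGermanTagger` class has to be importable in the loading session, because pickle stores a reference to the class rather than its code.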

Next we identify collocations based on our data.

In [4]:
import os
import pandas as pd
from ast import literal_eval

# Set the path variable to point to the 'newspaper_data_processing' directory.
path = os.getcwd().replace('\\Collocations', '')

# Load pre-processed 'dpa' dataset from a CSV file.
dpa = pd.read_csv(path + '\\dpa\\' + 'dpa_prepro_final.csv', encoding = 'utf-8', sep=';', index_col = 0,  keep_default_na=False,
                   dtype = {'rubrics': 'str', 
                            'source': 'str',
                            'keywords': 'str',
                            'title': 'str',
                            'city': 'str',
                            'genre': 'str',
                            'wordcount': 'str'},
                  converters = {'paragraphs': literal_eval})

# Keep only the article texts and their respective publication dates.
dpa = dpa[['texts', 'day', 'month', 'year']]

# Load pre-processed 'SZ' dataset from a CSV file.
sz = pd.read_csv(path + '\\SZ\\' + 'sz_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'newspaper_2': 'str',
                                                                                                 'quelle_texts': 'str',
                                                                                                 'page': 'str',
                                                                                                 'rubrics': 'str'})
sz.page = sz.page.fillna('')
sz.newspaper = sz.newspaper.fillna('')
sz.newspaper_2 = sz.newspaper_2.fillna('')
sz.rubrics = sz.rubrics.fillna('')
sz.quelle_texts = sz.quelle_texts.fillna('')

# Keep only the article texts and their respective publication dates.
sz = sz[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Handelsblatt' dataset from a CSV file.
hb = pd.read_csv(path + '\\Handelsblatt\\' + 'hb_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'kicker': 'str',
                                                                                                 'page': 'str',
                                                                                                 'series_title': 'str',
                                                                                                 'rubrics': 'str'})
hb.page = hb.page.fillna('')
hb.series_title = hb.series_title.fillna('')
hb.kicker = hb.kicker.fillna('')
hb.rubrics = hb.rubrics.fillna('')

# Keep only the article texts and their respective publication dates.
hb = hb[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Welt' dataset from a CSV file.
welt = pd.read_csv(path + '\\Welt\\' + 'welt_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'rubrics': 'str',
                                                                                                 'title': 'str'})
welt.title = welt.title.fillna('')
welt.rubrics = welt.rubrics.fillna('')

# Keep only the article texts and their respective publication dates.
welt = welt[['texts', 'day', 'month', 'year']]

# Concatenate the 'dpa', 'sz', 'hb', and 'welt' DataFrames into a single DataFrame 'data'.
data = pd.concat([dpa, sz, hb, welt])

# The number of articles in the final dataset.
print(len(data))

# Sort the data in chronological order.
data = data.sort_values(['year', 'month', 'day'], ascending=[True, True, True])
# Reset the index of the DataFrame
data.reset_index(inplace=True, drop=True)
data.head()

3336299


                                               texts  day  month  year
0  Schalck: Milliardenkredit sicherte Zahlungsfäh...    1      1  1991
1  Welajati: Iran bleibt bei einem Krieg am Golf ...    1      1  1991
2  Bush will offenbar seinen Außenminister erneut...    1      1  1991
3  Sperrfrist 1. Januar 1000 HBV fordert umfassen...    1      1  1991
4  Schamir weist Nahost-Äußerungen des neuen EG-P...    1      1  1991


In this project, we limit our training dataset to include only articles published from 1991 through 2009, inclusive. This decision is driven by our later goal of performing an out-of-sample forecast of GDP growth for the years 2010 to 2018. By excluding data from 2010 onward during the model training phase, we ensure that our forecast does not incorporate information that was not yet available.

In [5]:
# Filter the dataset to include only articles published before 2010.
data = data[data.year < 2010]

We tag every word in the news articles using the tagger trained above. We define collocations as two-word or three-word sequences that adhere to the part-of-speech (POS) patterns proposed by [Lang, Schneider, and Suchowolec (2016)](https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://heiup.uni-heidelberg.de/catalog/view/361/509/81161&ved=2ahUKEwi4itSg2IqGAxX2evEDHWMbCYcQFnoECBgQAQ&usg=AOvVaw3gvsMJQoJ5J8pNfZZ3FDje). Each collocation must appear in over 100 texts for two-word sequences and over 50 texts for three-word sequences. However, to maintain computational efficiency and focus on the most salient collocations, we only consider the top 10,000 most frequently occurring bigrams and trigrams. This means that even if a certain bigram or trigram appears in more than 100 or 50 texts respectively, it will be overlooked if it is not among the 10,000 most common collocations.

The patterns considered are A N, N N, N Prep N, N Det N, and A A N, where A stands for adjectives, N for nouns, Prep for prepositions, and Det for determiners.
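
To illustrate how such patterns can be matched against tagged text, here is a minimal sketch. It is not the code in `bigrams_trigrams_multi`, which is not shown in this notebook. The mapping from STTS tags to the coarse classes (for example, counting proper nouns `NE` as N and `APPRART` as Prep) is an assumption made for this example, and `match_patterns` is a hypothetical name.

```python
# Assumed mapping from STTS tags to the coarse classes used in the patterns.
COARSE = {
    'ADJA': 'A',
    'NN': 'N',
    'NE': 'N',
    'APPR': 'Prep',
    'APPRART': 'Prep',
    'ART': 'Det',
}
BIGRAM_PATTERNS = {('A', 'N'), ('N', 'N')}
TRIGRAM_PATTERNS = {('N', 'Prep', 'N'), ('N', 'Det', 'N'), ('A', 'A', 'N')}


def match_patterns(tagged_sentence):
    """Return (bigrams, trigrams) whose coarse tag sequence is an allowed pattern."""
    words = [word for word, _ in tagged_sentence]
    coarse = [COARSE.get(tag) for _, tag in tagged_sentence]
    bigrams = [tuple(words[i:i + 2]) for i in range(len(words) - 1)
               if tuple(coarse[i:i + 2]) in BIGRAM_PATTERNS]
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)
                if tuple(coarse[i:i + 3]) in TRIGRAM_PATTERNS]
    return bigrams, trigrams


tagged = [('Die', 'ART'), ('hohe', 'ADJA'), ('Arbeitslosigkeit', 'NN'),
          ('in', 'APPR'), ('Deutschland', 'NE')]
bigrams, trigrams = match_patterns(tagged)
print(bigrams)   # [('hohe', 'Arbeitslosigkeit')]
print(trigrams)  # [('Arbeitslosigkeit', 'in', 'Deutschland')]
```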

In [6]:
# Download the nltk_data and save it to the folder Collocations.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
from datetime import datetime
startTime = datetime.now() # track time

import multiprocessing as mp
NUM_CORE = 4

import bitridic_german
import bigrams_trigrams_multi
import most_freq

if __name__ == "__main__":
    
    list_of_texts = data.texts
    # Split each text into sentences.
    list_of_objects = [bitridic_german.BiTriDic(i) for i in list_of_texts]
    
    pool = mp.Pool(NUM_CORE)
    # Tag each word in the news article, create a list of bigrams and trigrams satisfying the POS patterns
    # proposed by Lang, Schneider, and Suchowolec (2016).
    list_of_bigrams_trigrams = pool.map(bigrams_trigrams_multi.worker_bigr_trigr, list_of_objects)
    pool.close()
    pool.join()

# A list of bigrams based on all articles.
list_of_bigrams = [item[0][0] for item in list_of_bigrams_trigrams]
# A list of trigrams based on all articles.
list_of_trigrams = [item[1][0] for item in list_of_bigrams_trigrams]

# Two-word collocations whose frequency is above 100, restricted to the top 10,000 most frequent bigrams.
most_freq_bigrams = most_freq.most_freq(list_of_bigrams, 'bigrams', 100)
# Three-word collocations whose frequency is above 50, restricted to the top 10,000 most frequent trigrams.
most_freq_trigrams = most_freq.most_freq(list_of_trigrams, 'trigrams', 50)

print(datetime.now()-startTime)

4 days, 7:55:04.370000
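
The frequency filter described earlier (a minimum number of texts combined with a cap of 10,000 n-grams) can be sketched as follows. This is an illustration under stated assumptions, not the code in `most_freq`, which is not shown in this notebook: `frequent_ngrams` is a hypothetical name, and the sketch counts each n-gram at most once per text.

```python
from collections import Counter


def frequent_ngrams(ngrams_per_text, min_texts, top_n=10000):
    """Keep n-grams found in more than `min_texts` texts, capped at the `top_n` most common."""
    doc_freq = Counter()
    for ngrams in ngrams_per_text:
        # A set ensures each n-gram is counted once per text.
        doc_freq.update(set(ngrams))
    return [ngram for ngram, count in doc_freq.most_common(top_n)
            if count > min_texts]


texts = [
    [('hohe', 'Arbeitslosigkeit'), ('hohe', 'Arbeitslosigkeit')],
    [('hohe', 'Arbeitslosigkeit'), ('neue', 'Regierung')],
    [('neue', 'Regierung')],
    [('hohe', 'Arbeitslosigkeit')],
]
result = frequent_ngrams(texts, min_texts=2, top_n=10)
print(result)  # [('hohe', 'Arbeitslosigkeit')]
```

Applying the cap before or after the threshold gives the same set apart from ties, since the n-grams above the threshold are also the highest ranked.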
