In this notebook we identify collocations using a part-of-speech (POS) tagger. As in [Hansen (2018)](https://academic.oup.com/qje/article/133/2/801/4582916), we later use the collocations in the topic model. By collocations we mean two-word and three-word sequences that carry a specific meaning (e.g., labour market).

We follow the article by [Markus Konrad](https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/) and use a supervised classifier as the POS (part-of-speech) tagger. This means that the tagger is trained on a large annotated [text corpus](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/).

Here are the steps we take to train the tagger (please see the article by Markus Konrad for a detailed explanation):

Step 1: Download the TIGER corpus in CONLL09 format ('tigercorpus-2.2.conll09.tar.gz') [here](https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/download/start.html), extract it, and save it to your working directory.

Step 2: Read the corpus with the NLTK library:

In [1]:
import nltk
corp = nltk.corpus.ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
                                    ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                    encoding = 'utf-8')
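
The column list passed to `ConllCorpusReader` tells NLTK which fields of each token line to keep: the second field (the word form) and the fifth (the POS tag), while the others are ignored. The following minimal sketch mimics that selection in plain Python to illustrate what `tagged_sents()` yields. The helper name `read_tagged_sentences` and the sample lines are illustrative assumptions, not part of NLTK or the TIGER release; the sketch assumes tab-separated fields, one token per line, and a blank line between sentences.

```python
def read_tagged_sentences(lines, word_col=1, pos_col=4):
    """Group CoNLL-style token lines into sentences of (word, tag) pairs.

    Hypothetical helper: column indices are zero-based, so the defaults pick
    the second and fifth fields, mirroring
    ['ignore', 'words', 'ignore', 'ignore', 'pos'].
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[word_col], fields[pos_col]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample lines in the assumed layout (ID, FORM, LEMMA, PLEMMA, POS).
sample = [
    "1\tDas\t_\t_\tART",
    "2\tist\t_\t_\tVAFIN",
    "",
    "1\tEin\t_\t_\tART",
    "2\tTest\t_\t_\tNN",
]
sample_sents = read_tagged_sentences(sample)
```

Each sentence comes back as a list of `(word, tag)` tuples, which is the shape the tagger in Step 3 expects as training data.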

Step 3: Use the Python class ClassifierBasedGermanTagger to assign POS tags. This tagger inspects words for prefixes, suffixes, and other attributes, and also takes the sequence of words into account. Download the folder 'ClassifierBasedGermanTagger' [here](https://github.com/ptnplanet/NLTK-Contributions) and save it to your working directory.

In [2]:
# import the tagger:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger

# load the tagged sentences from the corpus into a list:
tagged_sents = list(corp.tagged_sents())

# train the tagger with the complete corpus (the accuracy is around 96%)
tagger = ClassifierBasedGermanTagger(train=tagged_sents)

# determine POS of a word in a sentence:
tagger.tag(['Das', 'ist', 'ein', 'einfacher', 'Test', 'dritte'])

[('Das', u'ART'),
 ('ist', u'VAFIN'),
 ('ein', u'ART'),
 ('einfacher', u'ADJA'),
 ('Test', u'NN'),
 ('dritte', u'ADJA')]
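
An accuracy figure such as the one quoted in the comment above is usually obtained by holding out part of the corpus, training on the rest, and comparing predicted tags with the gold tags. The sketch below shows one way to do that; the helper names `split_sentences` and `tagging_accuracy`, the 10% test share, and the fixed seed are assumptions for illustration, not the procedure used in the notebook. It only relies on the tagger having a `tag(words)` method that returns `(word, tag)` pairs, as in the output above.

```python
import random


def split_sentences(tagged_sents, test_fraction=0.1, seed=0):
    """Shuffle tagged sentences and split them into (train, test) lists."""
    sents = list(tagged_sents)
    random.Random(seed).shuffle(sents)
    n_test = max(1, int(len(sents) * test_fraction))
    return sents[n_test:], sents[:n_test]


def tagging_accuracy(tagger, test_sents):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in test_sents:
        words = [word for word, _ in sent]
        gold = [tag for _, tag in sent]
        predicted = [tag for _, tag in tagger.tag(words)]
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0


# Intended use with the objects from the cell above (not run here):
# train_sents, test_sents = split_sentences(tagged_sents)
# held_out_tagger = ClassifierBasedGermanTagger(train=train_sents)
# print(tagging_accuracy(held_out_tagger, test_sents))
```

Once the accuracy has been checked this way, retraining on the complete corpus, as done above, makes use of all available annotated sentences.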

The part-of-speech tagset for the TIGER corpus can be found on page 12 [here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj_1-L0q6j5AhUFOHoKHZQlBRoQFnoECAYQAQ&url=https%3A%2F%2Fwww.ims.uni-stuttgart.de%2Fdocuments%2Fressourcen%2Fkorpora%2Ftiger-corpus%2Fannotation%2Ftiger_introduction.pdf&usg=AOvVaw2jtxYYF9U_nY7Lt-c-ZbzL).

Step 4: Save the whole tagger object to disk using **pickle**.

In [3]:
import pickle

with open('nltk_pos.pickle', 'wb') as f:
    pickle.dump(tagger, f, protocol=2)
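
Saving the tagger is only useful if it can be restored later without retraining. The sketch below shows the round trip with two small helpers; the names `save_object` and `load_object` are hypothetical and not part of the notebook. When unpickling the tagger in a new session, the `ClassifierBasedGermanTagger` class has to be importable, because pickle stores a reference to the class rather than its code.

```python
import pickle


def save_object(obj, path, protocol=2):
    """Serialize obj to path with pickle (protocol 2, as in the cell above)."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=protocol)


def load_object(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Intended use in a later session (not run here):
# from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
# tagger = load_object('nltk_pos.pickle')
# tagger.tag(['Das', 'ist', 'ein', 'Test'])
```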

Next we identify collocations based on the Handelsblatt data. We use only one newspaper for this task to reduce computational cost. 

In [4]:
import pandas as pd
# Columns that should be read as strings.
str_cols = ['kicker', 'page', 'series_title', 'rubrics']
data = pd.read_csv('E:\\Userhome\\mokuneva\\newspaper_data_processing\\Handelsblatt\\hb_prepro_final.csv',
                   encoding='utf-8-sig', sep=';', index_col=0,
                   dtype={col: 'str' for col in str_cols})
# Replace missing values in the string columns with empty strings.
data[str_cols] = data[str_cols].fillna('')

We tag every word in the news articles using the tagger trained above. We treat as collocations those two-word (three-word) sequences that match the POS patterns proposed by [Lang, Schneider, and Suchowolec (2016)](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjGi66hu6j5AhUBVPEDHRVuDf0QFnoECAcQAQ&url=https%3A%2F%2Fheiup.uni-heidelberg.de%2Freader%2Fdownload%2F361%2F361-69-81161-1-10-20180515.pdf&usg=AOvVaw2-EKkmc6ZA62Re2L3d0SwS) and that occur more than 100 (50) times.

The patterns considered are A N, N N, N Prep N, N Det N, and A A N, where A stands for adjectives, N for nouns, Prep for prepositions, and Det for determiners.
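
To make the pattern matching concrete, the sketch below slides a window over a tagged sentence and keeps the word sequences whose tag classes match one of the patterns. The mapping from TIGER tags to pattern classes (`ADJA` for A, `NN` for N, `APPR` for Prep, `ART` for Det) is an assumption made for illustration, as are the names `TAG_CLASS` and `matching_ngrams`; the project modules used later in this notebook may map tags differently, for example by also counting proper nouns as N.

```python
# Assumed mapping from TIGER POS tags to the pattern classes.
TAG_CLASS = {
    'ADJA': 'A',     # attributive adjective
    'NN': 'N',       # common noun
    'APPR': 'Prep',  # preposition
    'ART': 'Det',    # article
}

BIGRAM_PATTERNS = {('A', 'N'), ('N', 'N')}
TRIGRAM_PATTERNS = {('N', 'Prep', 'N'), ('N', 'Det', 'N'), ('A', 'A', 'N')}


def matching_ngrams(tagged_sent, n, patterns):
    """Return the n-word sequences whose tag classes match one of the patterns."""
    words = [word for word, _ in tagged_sent]
    # Tags outside the mapping become None and can never match a pattern.
    classes = [TAG_CLASS.get(tag) for _, tag in tagged_sent]
    hits = []
    for i in range(len(tagged_sent) - n + 1):
        if tuple(classes[i:i + n]) in patterns:
            hits.append(tuple(words[i:i + n]))
    return hits


# Tagged sentence in the format returned by tagger.tag(...).
tagged = [('Das', 'ART'), ('ist', 'VAFIN'), ('ein', 'ART'),
          ('einfacher', 'ADJA'), ('Test', 'NN')]
bigrams = matching_ngrams(tagged, 2, BIGRAM_PATTERNS)
trigrams = matching_ngrams(tagged, 3, TRIGRAM_PATTERNS)
```

For this sentence the only candidate is the A N bigram `('einfacher', 'Test')`; no trigram matches.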

In [5]:
# Download the nltk_data and save it to the folder Collocations.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [None]:
from datetime import datetime
startTime = datetime.now() # track time

import multiprocessing as mp
NUM_CORE = 4

import bitridic_german
import bigrams_trigrams_multi
import most_freq

if __name__ == "__main__":
    
    list_of_texts = data.texts
    # Split each text into sentences.
    list_of_objects = [bitridic_german.BiTriDic(i) for i in list_of_texts]
    
    pool = mp.Pool(NUM_CORE)
    # Tag each word in the news article, create a list of bigrams and trigrams satisfying the POS patterns
    # proposed by Lang, Schneider, and Suchowolec (2016).
    list_of_bigrams_trigrams = pool.map(bigrams_trigrams_multi.worker_bigr_trigr, list_of_objects)
    pool.close()
    pool.join()

# A list of bigrams based on all articles.
list_of_bigrams = [item[0][0] for item in list_of_bigrams_trigrams]
# A list of trigrams based on all articles.
list_of_trigrams = [item[1][0] for item in list_of_bigrams_trigrams]

# Two-word collocations whose frequency is above 100.
most_freq_bigrams = most_freq.most_freq(list_of_bigrams, 'bigrams', 100)
# Three-word collocations whose frequency is above 50.
most_freq_trigrams = most_freq.most_freq(list_of_trigrams, 'trigrams', 50)

print(datetime.now()-startTime)
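
The `most_freq` module is project code that is not shown in this notebook. As a rough illustration of the frequency filter described above, the sketch below counts n-grams over all articles with `collections.Counter` and keeps those occurring more than a given number of times. The function name `frequent_ngrams` and its input format (one list of n-gram tuples per article) are assumptions, not the actual interface of `most_freq.most_freq`.

```python
from collections import Counter


def frequent_ngrams(ngrams_per_article, min_count):
    """Return {ngram: count} for n-grams occurring more than min_count times."""
    counts = Counter()
    for ngrams in ngrams_per_article:
        counts.update(ngrams)
    return {ngram: c for ngram, c in counts.items() if c > min_count}


# Toy input: three "articles", each a list of bigram tuples.
toy = [
    [('labour', 'market'), ('central', 'bank')],
    [('labour', 'market')],
    [('labour', 'market'), ('central', 'bank'), ('new', 'car')],
]
frequent = frequent_ngrams(toy, min_count=1)
```

With the thresholds used in this notebook, the calls would use `min_count=100` for bigrams and `min_count=50` for trigrams.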