In this notebook we identify collocations using a part-of-speech (POS) tagger. As in [Hansen (2018)](https://academic.oup.com/qje/article/133/2/801/4582916), we later use these collocations in the topic model. By collocations we mean two-word and three-word sequences that carry a specific meaning (e.g., labour market).

In [1]:
import os
import pandas as pd
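# The data folder sits two levels above this notebook (Windows-style paths).
project_root = os.getcwd().replace('\\analysis\\analysis_topics', '')
path_to_file = os.path.join(project_root, 'finance data')
data = pd.read_csv(os.path.join(path_to_file, 'articles_daily_ts.csv'), encoding='utf-8-sig', sep=';')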

We tag every word in the news articles using the NLTK POS tagger. We treat as collocations those two-word (three-word) sequences that match the POS patterns proposed by [Justeson and Katz (1995)](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/technical-terminology-some-linguistic-properties-and-an-algorithm-for-identification-in-text/D5F076938C4E3F24B11EDC2E831216AF#access-block) and that occur more than 100 (50) times across all articles.

The patterns considered are AN and NN for two-word sequences and AAN, ANN, NAN, NNN, and NPrepN for three-word sequences, where A stands for adjectives, N for nouns, and Prep for prepositions.
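
To make the pattern filter concrete, here is a minimal sketch of how such patterns could be matched on a sentence that has already been tagged. It is not the code in `bitridic_english` or `bigrams_trigrams_multi`, whose internals are not shown here. The helper names `coarse_tag` and `matching_ngrams` are hypothetical, and the input is assumed to carry Penn Treebank tags of the kind returned by `nltk.pos_tag`.

```python
# Illustrative sketch only; helper names are hypothetical.
# Assumes Penn Treebank tags (e.g. from nltk.pos_tag).
BIGRAM_PATTERNS = {"AN", "NN"}
TRIGRAM_PATTERNS = {"AAN", "ANN", "NAN", "NNN", "NPN"}  # P stands for Prep


def coarse_tag(tag):
    """Map a Penn Treebank tag to A (adjective), N (noun), P (preposition) or X (other)."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":  # note: IN also covers subordinating conjunctions
        return "P"
    return "X"


def matching_ngrams(tagged_sentence, n, patterns):
    """Return the lowercased n-grams whose coarse tag sequence is in `patterns`."""
    words = [word.lower() for word, _ in tagged_sentence]
    codes = "".join(coarse_tag(tag) for _, tag in tagged_sentence)
    return [
        tuple(words[i:i + n])
        for i in range(len(words) - n + 1)
        if codes[i:i + n] in patterns
    ]


# A hand-tagged example sentence, so that no tagger download is needed.
tagged = [
    ("The", "DT"), ("labour", "NN"), ("market", "NN"), ("shows", "VBZ"),
    ("a", "DT"), ("high", "JJ"), ("rate", "NN"), ("of", "IN"),
    ("unemployment", "NN"),
]

bigrams = matching_ngrams(tagged, 2, BIGRAM_PATTERNS)
trigrams = matching_ngrams(tagged, 3, TRIGRAM_PATTERNS)
print(bigrams)   # [('labour', 'market'), ('high', 'rate')]
print(trigrams)  # [('rate', 'of', 'unemployment')]
```

Matching is done within a single sentence, so that candidate sequences do not span sentence boundaries. Candidates found this way still have to pass the frequency threshold before they count as collocations.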

In [2]:
#import nltk
# Download the nltk_data and save it to the folder analysis_topics.
#nltk.download()

In [3]:
from datetime import datetime
startTime = datetime.now() # track time

import multiprocessing as mp
NUM_CORE = 4

import bitridic_english
import bigrams_trigrams_multi
import most_freq

if __name__ == "__main__":
    
    list_of_texts = data.texts
    # Split each text into sentences.
    list_of_objects = [bitridic_english.BiTriDic(i) for i in list_of_texts]
    
    pool = mp.Pool(NUM_CORE)
    # Tag each word in the news articles in parallel and build, for each article,
    # the lists of bigrams and trigrams that satisfy the POS patterns
    # proposed by Justeson and Katz (1995).
    list_of_bigrams_trigrams = pool.map(bigrams_trigrams_multi.worker_bigr_trigr, list_of_objects)
    pool.close()
    pool.join()

# A list of bigrams based on all articles.
list_of_bigrams = [item[0][0] for item in list_of_bigrams_trigrams]
# A list of trigrams based on all articles.
list_of_trigrams = [item[1][0] for item in list_of_bigrams_trigrams]

# Two-word collocations whose frequency is above 100.
most_freq_bigrams = most_freq.most_freq(list_of_bigrams, 'bigrams', 100)
# Three-word collocations whose frequency is above 50.
most_freq_trigrams = most_freq.most_freq(list_of_trigrams, 'trigrams', 50)

print(datetime.now()-startTime)

0:01:28.170000
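
The frequency filter applied above can be sketched with `collections.Counter`. This is an illustration under stated assumptions, not the implementation of `most_freq.most_freq`, which is not shown in this notebook. The function name `frequent_ngrams` is hypothetical, and the input is assumed to be one list of n-gram tuples per article.

```python
from collections import Counter


def frequent_ngrams(ngrams_per_article, min_count):
    """Count n-grams over all articles and keep those occurring more than `min_count` times."""
    counts = Counter(
        ngram for article in ngrams_per_article for ngram in article
    )
    return {ngram: count for ngram, count in counts.items() if count > min_count}


# Toy input: three "articles", each given as a list of candidate bigrams.
articles = [
    [("labour", "market"), ("interest", "rate")],
    [("labour", "market")],
    [("labour", "market"), ("interest", "rate")],
]

frequent = frequent_ngrams(articles, 2)
print(frequent)  # {('labour', 'market'): 3}
```

With the thresholds used in this notebook, the same idea would keep bigrams that occur more than 100 times and trigrams that occur more than 50 times.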
