### Word2vec example

This notebook contains the example from Section 3.3 of the article "Machine learning in management accounting research: Literature review and pathways for the future".

Note! This code does not use the paper's language model, which was used to infer phrases similar to "restructuring" and "growth_strategy", because that model is several gigabytes in size. For demonstration purposes, the code instead trains a simpler model on the text8 dataset available through Gensim, which contains text from Wikipedia. The code does, however, show how to construct a language model from 10-K filings similar to the one used in the paper.

The tasks are split into separate parts below. In practical applications, they can of course be combined into one pipeline.

<div class="alert-warning">
Yellow marks parts of the code that are less relevant to the method itself, such as pre-processing operations.
</div>

<div class="alert-info">
Blue is used for the relevant parts of the code.
</div>

In [2]:
import gensim
import os
import spacy
import numpy as np
import pandas as pd

-----

### Part 1: Replacing named entities with tags

In [2]:
source_dir = 'your_source_dir_here'

In [3]:
dest_dir = 'your_dest_dir_here'

<div class="alert-info">
Load the spaCy model
</div>

<div class="alert-info">
We add the "merge_entities" component to the pipeline so that entities consisting of several words are merged into single tokens.
</div>

In [3]:
nlp = spacy.load('en_core_web_lg')

In [4]:
nlp.add_pipe('merge_entities')

<function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>

<div class="alert-warning">
The algorithm below transforms the files in "source_dir" so that named entities are replaced with "ner_(entity_type)" tags. np.setdiff1d is used to collect the files that have not yet been processed.
</div>
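The resume logic can also be sketched with plain Python sets. This is a minimal illustration, not the notebook's own code: the helper name `remaining_files` and the file names are hypothetical, and the result is sorted to mirror the sorted output of np.setdiff1d.

```python
import os
import tempfile

def remaining_files(source_dir, dest_dir):
    """Return, in sorted order, the file names in source_dir that are missing from dest_dir."""
    return sorted(set(os.listdir(source_dir)) - set(os.listdir(dest_dir)))

# Toy demonstration with temporary directories (hypothetical file names).
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ["a.txt", "b.txt", "c.txt"]:
        open(os.path.join(src, name), "w").close()
    # Pretend that b.txt was already processed in an earlier run.
    open(os.path.join(dst, "b.txt"), "w").close()
    todo = remaining_files(src, dst)

print(todo)  # ['a.txt', 'c.txt']
```

Because processed files are written under their original names, rerunning the loop after an interruption only touches the files that are still missing from the destination directory.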

In [None]:
files1 = os.listdir(source_dir)

In [7]:
files2 = os.listdir(dest_dir)

In [9]:
remaining_files = np.setdiff1d(files1,files2)

<div class="alert-info">
Algorithm for replacing named entities with a tag ner_(type of named entity)
</div>

In [11]:
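for fname in remaining_files:
    # Read the filing, skip the first 500 characters and truncate at character 1,000,000
    with open(os.path.join(source_dir, fname)) as f:
        raw = f.read()
    raw = raw[500:1000000].lower()
    raw = " ".join(gensim.utils.simple_preprocess(raw))
    # Skip the pipeline components that are not needed for entity recognition
    doc = nlp(raw, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
    # Write the tagged text to dest_dir under the same file name
    with open(os.path.join(dest_dir, fname), mode='w') as f:
        f.write(' '.join([t.text if not t.ent_type_ else 'ner_' + t.ent_type_ for t in doc]))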
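To make the replacement rule itself easier to follow, here is a small sketch that runs without spaCy. The `(text, entity_type)` pairs and the helper `replace_entities` are hypothetical stand-ins for the tokens of a spaCy Doc after entity merging; the rule is the same as above: keep the token text unless the token carries an entity type.

```python
def replace_entities(tokens):
    """tokens: iterable of (text, ent_type) pairs; ent_type is '' for non-entities."""
    return " ".join(text if not ent_type else "ner_" + ent_type for text, ent_type in tokens)

# Hypothetical token/label pairs standing in for a processed document.
toy_tokens = [
    ("we", ""), ("file", ""), ("reports", ""), ("with", ""), ("the", ""),
    ("securities and exchange commission", "ORG"), ("in", ""), ("washington", "GPE"),
]
tagged = replace_entities(toy_tokens)
print(tagged)  # we file reports with the ner_ORG in ner_GPE
```

Merging multi-word entities first matters here: without it, each word of the entity would produce its own tag.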

#### Example

Below is an example of the output that the algorithm above produces.

In [14]:
example_file = './example22_data/20180814_707549.txt'

In [15]:
raw = open(example_file).read()
raw = raw[500:1000000].lower()
raw = " ".join(gensim.utils.simple_preprocess(raw))
doc=nlp(raw,disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

In [33]:
' '.join([t.text if not t.ent_type_ else 'ner_' + t.ent_type_ for t in doc])[9600:10000]

' we make with the ner_ORG are available on our website free of charge as soon as reasonably practical after we file them with or furnish them to the ner_ORG and are also available online at the ner_ORG website at www sec gov any materials we file with the ner_ORG may also be read and copied at the ner_ORG public reference room at street ne ner_GPE to obtain information on the operation of the publ'

---

### Part 2: Create noun chunks 

In [3]:
source_dir = 'your_source_dir_here'

In [4]:
dest_dir = 'your_dest_dir_here'

In [35]:
nlp = spacy.load('en_core_web_lg')

<div class="alert-info">
Add the component that merges noun chunks into single tokens to the pipeline.
</div>

In [36]:
nlp.add_pipe('merge_noun_chunks')

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

<div class="alert-warning">
The algorithm below transforms the files in "source_dir" so that the words of each noun chunk are joined with "_". np.setdiff1d is used to collect the files that have not yet been processed.
</div>

In [7]:
files1 = os.listdir(source_dir)

In [8]:
files2 = os.listdir(dest_dir)

In [9]:
import numpy as np

In [10]:
remaining_files = np.setdiff1d(files1,files2)

<div class="alert-info">
Combine the words of noun chunks with '_'
</div>

In [13]:
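for fname in remaining_files:
    with open(os.path.join(source_dir, fname)) as f:
        raw = f.read()
    # The tagger and parser stay enabled because noun chunks depend on them
    doc = nlp(raw, disable=["lemmatizer", "ner"])
    # Write the chunked text to dest_dir under the same file name
    with open(os.path.join(dest_dir, fname), mode='w') as f:
        f.write(' '.join([t.text.replace(' ', '_') for t in doc]))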
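The joining step can be illustrated without spaCy as follows. This is only a sketch: the token list is a hypothetical stand-in for a Doc in which each noun chunk has already been merged into one multi-word token, and `join_chunks` is an illustrative helper name.

```python
def join_chunks(tokens):
    """tokens: list of token texts, where a merged noun chunk is one multi-word string."""
    return " ".join(text.replace(" ", "_") for text in tokens)

# Hypothetical tokens after noun chunk merging.
toy_tokens = [
    "subject", "to", "the safe harbor provisions", "created", "by",
    "the private securities litigation reform act",
]
joined = join_chunks(toy_tokens)
print(joined)
```

After this step, a whitespace tokenizer sees each noun chunk as a single word, which is what allows a word embedding model to learn vectors for whole phrases.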

#### Example

Below is an example of the output that the algorithm above produces.

In [37]:
example_file = './example22_data/20180814_707549.txt'

In [41]:
raw = open(example_file).read()
raw = " ".join(gensim.utils.simple_preprocess(raw))
doc=nlp(raw,disable=["lemmatizer","ner"])

In [48]:
' '.join([t.text.replace(' ','_') for t in doc])[5000:5200]

'ject to the_safe_harbor_provisions created by the_private_securities_litigation_reform_act of certain but not all of the_forward_looking_statements in this_report are specifically identified as forwar'

---

### Part 3: Identify similar phrases using a word embedding model

<div class="alert-info">
Use a word2vec model trained on 10-Ks to infer the words most similar to specified keywords.
</div>

<div class="alert-info">
Main routine to create the word2vec model. NOTE! For this code to work, you need a collection of texts. In the article, 180,000 10-K filings were used to train the word2vec model. The routine streams the files one at a time, so the whole corpus is never loaded into memory at once.
</div>

In [54]:
class MySentences(object):
    """Stream the files in a directory as lists of tokens, one file at a time."""
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as f:
                raw = f.read()
            raw = raw.lower()
            yield gensim.utils.simple_preprocess(raw)

In [9]:
docs_10K = MySentences(source_dir)

In [10]:
model = gensim.models.Word2Vec(docs_10K)
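Gensim's Word2Vec iterates over the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), which is why the corpus is wrapped in a class with `__iter__` rather than in a plain generator. The sketch below shows the difference with toy data; `TokenStream` and `token_generator` are hypothetical names, and the whitespace split is a simplification of the preprocessing used above.

```python
class TokenStream:
    """Restartable iterable: every call to __iter__ starts a fresh pass over the texts."""
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield text.lower().split()

def token_generator(texts):
    """Plain generator: it can be consumed only once."""
    for text in texts:
        yield text.lower().split()

texts = ["Growth strategy update", "Restructuring charges increased"]

stream = TokenStream(texts)
stream_passes = [len(list(stream)), len(list(stream))]

gen = token_generator(texts)
gen_passes = [len(list(gen)), len(list(gen))]

print(stream_passes)  # [2, 2]
print(gen_passes)     # [2, 0]
```

The generator is empty on the second pass, so a model fed with it would have nothing left to train on after the vocabulary scan.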

#### Example

NOTE! Below is an example of how the model above can be used to search for phrases similar to the selected keywords. However, the example uses the text8 dataset included in Gensim, not the model that was used in the paper. The text8 dataset contains text from Wikipedia, so the closest words differ from those given by a model trained on 10-Ks. (This is a further reason to use domain-specific textual data.)

NOTE! The text8 dataset does not contain ner-tags or noun chunks, and therefore, there are no phrases in the results.

In [50]:
import gensim.downloader

In [51]:
corpus = gensim.downloader.load('text8')

In [53]:
model = gensim.models.Word2Vec(corpus)

<div class="alert-info">
The 4 words closest to the keyword 'restructuring'
</div>

In [55]:
model.wv.most_similar(positive=['restructuring'],topn=4)

[('liberalization', 0.8058362603187561),
 ('modernization', 0.7889036536216736),
 ('privatization', 0.754657506942749),
 ('stabilization', 0.7462006211280823)]

<div class="alert-info">
Calculate the centroid vector of the word vectors representing the words 'restructuring', 'liberalization', 'modernization', 'privatization' and 'stabilization'. At this point, it is possible to fine-tune the seed words by adding your own words.
</div>

In [58]:
word_list = ['restructuring','liberalization','modernization','privatization','stabilization']
restr_centroid = np.zeros(model.vector_size)  # 100 with Gensim's default settings
for word in word_list:
    restr_centroid = np.add(restr_centroid, model.wv[word])

In [59]:
restr_centroid = restr_centroid/len(word_list)
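The centroid is simply the element-wise mean of the seed word vectors. The following sketch shows the arithmetic with hypothetical three-dimensional toy vectors in place of the model's 100-dimensional ones; the helper name `centroid` is illustrative only.

```python
def centroid(vectors):
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical toy vectors; real word vectors come from model.wv.
toy_vectors = {
    "restructuring": [1.0, 0.0, 2.0],
    "privatization": [3.0, 2.0, 0.0],
    "modernization": [2.0, 4.0, 1.0],
}
toy_centroid = centroid(list(toy_vectors.values()))
print(toy_centroid)  # [2.0, 2.0, 1.0]
```

Adding or removing seed words shifts the centroid, which is how the seed list can be tuned toward the concept of interest.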

<div class="alert-info">
Collect the 100 words whose vectors are closest to the centroid
</div>

In [61]:
restr_keywords = [word for (word,_) in model.wv.most_similar(positive=restr_centroid,topn=100)]
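As a rough illustration of what "closest" means here, the sketch below ranks toy words by cosine similarity to a target vector, which is the similarity measure that word2vec neighbour searches are usually based on. The vectors, words and helper names are hypothetical, and the code is a simplified stand-in rather than Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_words(target, vectors, topn):
    """Return the topn words whose vectors have the highest cosine similarity to target."""
    ranked = sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True)
    return ranked[:topn]

# Hypothetical two-dimensional toy vectors.
toy_vectors = {
    "reform": [1.0, 1.0],
    "overhaul": [1.0, 0.5],
    "banana": [-1.0, 0.2],
}
top2 = closest_words([1.0, 0.9], toy_vectors, topn=2)
print(top2)  # ['reform', 'overhaul']
```

Cosine similarity compares directions and ignores vector length, so frequent words with large vectors do not dominate the ranking.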

In [62]:
restr_keywords

['restructuring',
 'liberalization',
 'privatization',
 'modernization',
 'privatisation',
 'financing',
 'deregulation',
 'stabilization',
 'industrialization',
 'decentralization',
 'macroeconomic',
 'banking',
 'austerity',
 'liberalized',
 'policies',
 'perestroika',
 'nationalisation',
 'privatizations',
 'infrastructure',
 'reforms',
 'consolidation',
 'initiatives',
 'deficits',
 'instability',
 'mismanagement',
 'welfare',
 'nationalization',
 'dismantling',
 'downturn',
 'investment',
 'democratization',
 'collectivization',
 'fiscal',
 'subsidies',
 'investments',
 'overhaul',
 'financial',
 'enterprises',
 'diversification',
 'sponsorship',
 'liberalisation',
 'finances',
 'multilateral',
 'monetary',
 'tariff',
 'lobbying',
 'reform',
 'eradication',
 'glasnost',
 'adjustment',
 'employment',
 'economic',
 'bureaucratic',
 'procurement',
 'monopolies',
 'taxation',
 'policy',
 'securities',
 'crises',
 'outsourcing',
 'centralization',
 'administration',
 'incentives',
 'su

### Part 4: Build the measure

<div class="alert-info">
Once the phrases are identified, the code below can be used to count the occurrences of the keywords in the 10-Ks.
</div>

In [149]:
docs_10K = MySentences(source_dir)

In [150]:
count_list = []
for doc in docs_10K:
    temp_sum = 0
    for word in restr_keywords:
        temp_sum+=doc.count(word)
    count_list.append(temp_sum)
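A possible variation of the counting step, sketched with toy data: tallying each document once with `collections.Counter` avoids scanning the token list separately for every keyword. The documents, keywords and the helper `keyword_count` are hypothetical, and dividing by document length is an added assumption, not something the notebook does.

```python
from collections import Counter

def keyword_count(tokens, keywords):
    """Total number of occurrences of the keywords in a tokenised document."""
    freq = Counter(tokens)  # one pass over the document
    return sum(freq[word] for word in keywords)

# Hypothetical tokenised documents and keyword list.
toy_docs = [
    ["restructuring", "costs", "and", "reform", "costs"],
    ["no", "relevant", "terms"],
]
toy_keywords = ["restructuring", "reform"]

counts = [keyword_count(doc, toy_keywords) for doc in toy_docs]
# Assumption: scale by document length so that long filings do not dominate.
shares = [count / len(doc) for count, doc in zip(counts, toy_docs)]

print(counts)  # [2, 0]
print(shares)  # [0.4, 0.0]
```

With 100 keywords and long filings, the single tally per document can be noticeably faster than 100 separate `list.count` calls.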