### Text preprocessing for conventional text analysis methods

This code includes the example of Section 2.2 in the article "Machine learning in management accounting research: Literature review and pathways for the future". The article is forthcoming in European Accounting Review, but the working paper version can be downloaded from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3822650

For demonstration purposes, this example loads everything into memory at each step. This is not efficient, and an iterative approach should be used with larger textual datasets. The example data consists of 100 10-X filings of manufacturing companies from the year 2018.
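One possible iterative approach for larger collections is to stream the filings one at a time with a generator instead of collecting them in a list. The sketch below is illustrative only: the helper name `iter_documents`, the UTF-8 encoding, and the decision to ignore undecodable bytes are assumptions, not part of the original notebook.

```python
import os

def iter_documents(data_path, encoding="utf-8"):
    """Yield (filename, text) pairs one file at a time (hypothetical helper)."""
    for name in sorted(os.listdir(data_path)):
        full_path = os.path.join(data_path, name)
        if not os.path.isfile(full_path):
            continue  # skip sub-directories
        # Assumption: undecodable bytes are silently dropped
        with open(full_path, "r", encoding=encoding, errors="ignore") as handle:
            yield name, handle.read()
```

Each later step could then consume `iter_documents(data_path)` document by document, so that only one filing is held in memory at a time.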

Information texts are color-coded to make it easier to find parts relevant for the discussion of the accompanying article.

<div class="alert-warning">
Yellow is used for parts of the code that are not relevant to the discussion and perform, for example, pre-processing operations.
</div>

<div class="alert-info">
Blue is used for the relevant parts of the code.
</div>

<div class="alert-warning">
Libraries
</div>

In [5]:
import os
from nltk.corpus import stopwords
import gensim_lda_library as gl
from gensim import corpora
from gensim.models import CoherenceModel
import gensim
import spacy

<div class="alert-info">
Part 1: Transform PDFs to plain text
</div>

* The convert_pdfminer() function from the dedicated library is used to convert the PDF files to plain text, and the texts are inserted as long strings into the raw_text list. The details of the function are explained in the library.

In [2]:
data_path = './example22_data/'
files = os.listdir(data_path)

In [3]:
raw_text = []
for file in files:
    # Read each filing as one long string; the context manager closes the file
    with open(os.path.join(data_path, file), 'r') as f:
        raw_text.append(f.read())

In [4]:
raw_text[0][3000:3400]

"atements of Operations   \n           2   \n\nCondensed Consolidated Statements of Changes in Shareholders  Deficiency   \n           3   \n\nCondensed Consolidated Statements of Cash Flows   \n           4   \n\nNotes to Condensed Consolidated Financial Statements     \n           5   \n \n      Item\n    2.     Management's Discussion and Analysis of Financial Condition and Results of Operations   \n         "

<div class="alert-info">
Part 2: First cleaning steps
</div>

* Define the stopwords list (NLTK library). The list can easily be extended with the extend() method. Here it is extended with the words 'https', 'doi' and 'org'.
* The simple_preprocess() function of Gensim can be used for basic cleaning steps. With default settings, it lowercases the text and removes numbers, words shorter than 2 characters and words longer than 15 characters.
* Use the stopwords list of the NLTK library to remove the stopwords from the texts.
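To make the effect of these cleaning steps concrete, the following standard-library sketch imitates them on a toy sentence. It is only a rough approximation of `simple_preprocess()`: the regular expression, the helper name `rough_preprocess`, and the tiny stop word set are assumptions for illustration, not Gensim's or NLTK's actual implementation.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens, drop too-short and too-long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

# Illustrative stop word set; the notebook uses NLTK's English list instead.
toy_stop_words = {"the", "of", "in", "see", "https", "doi", "org"}

text = "See Note 5 of the Financial Statements, Item 2 A: https://doi.org"
tokens = rough_preprocess(text)
kept = [tok for tok in tokens if tok not in toy_stop_words]
print(tokens)
print(kept)  # ['note', 'financial', 'statements', 'item']
```

Storing the stop words in a set rather than a list keeps each membership test fast, which matters for long documents.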

In [5]:
stop_words = stopwords.words("english")
stop_words.extend(['https','doi','org'])

In [6]:
docs_cleaned = []
for item in raw_text:
    tokens = gensim.utils.simple_preprocess(item)
    docs_cleaned.append(tokens)

In [24]:
docs_cleaned[0][500:506]

['unaudited', 'financial', 'statements', 'table', 'of', 'contents']

In [10]:
docs_nostops = []
for item in docs_cleaned:
    red_tokens = [word for word in item if word not in stop_words]
    docs_nostops.append(red_tokens)

In [25]:
docs_nostops[0][500:506]

['common', 'stock', 'shares', 'note', 'going', 'concern']

<div class="alert-info">
Part 3: Use the spaCy deep learning language model to remove specific parts-of-speech
</div>

* Define which parts-of-speech (PoS) will be kept
* Load the deep learning model. For this application, we do not need the parser and named-entity-recognition modules
* Go through the texts and keep only the lemmas of nouns

In [15]:
#allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
allowed_postags=['NOUN']

In [16]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [17]:
docs_lemmas = []
for red_tokens in docs_nostops:
    doc = nlp(" ".join(red_tokens))
    docs_lemmas.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])

In [26]:
docs_lemmas[0][500:506]

['good', 'service', 'customer', 'time', 'amount', 'contract']

<div class="alert-info">
Part 4: Create bigrams/trigrams
</div>

In [32]:
bigram = gensim.models.Phrases(docs_lemmas,threshold = 25, min_count=2)
#trigram = gensim.models.Phrases(bigram[docs_lemmas], threshold=25, min_count=2)  
bigram_mod = gensim.models.phrases.Phraser(bigram)
#trigram_mod = gensim.models.phrases.Phraser(trigram)

In [33]:
docs_bigrams = [bigram_mod[doc] for doc in docs_lemmas]

In [37]:
docs_bigrams[0][540:546]

['credit_worthiness', 'trend', 'account', 'effort', 'collection', 'company']
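The `threshold` and `min_count` arguments control which word pairs are merged into phrases. As a rough illustration, the sketch below scores adjacent pairs with the formula that the Gensim documentation gives for its default scorer, (count(a, b) - min_count) * N / (count(a) * count(b)), where N is the vocabulary size. The toy documents, the helper name `score_bigrams`, the toy threshold, and the choice of N as the number of distinct single words are assumptions for illustration; Gensim's internal counts can differ in detail.

```python
from collections import Counter

def score_bigrams(docs, min_count=2):
    """Score adjacent pairs: (count(a,b) - min_count) * N / (count(a) * count(b))."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    n_vocab = len(unigrams)  # assumption: N = number of distinct single words
    return {
        (a, b): (count - min_count) * n_vocab / (unigrams[a] * unigrams[b])
        for (a, b), count in bigrams.items()
    }

toy_docs = [
    ["cash", "flow", "statement", "company"],
    ["cash", "flow", "risk", "company"],
    ["cash", "flow", "company", "statement"],
]
scores = score_bigrams(toy_docs, min_count=2)
threshold = 0.4  # toy value; the notebook uses threshold=25 on the real corpus
merged = sorted("_".join(pair) for pair, s in scores.items() if s > threshold)
print(merged)  # ['cash_flow']
```

Pairs that occur rarely relative to `min_count` receive a low or negative score, so raising `threshold` or `min_count` yields fewer merged phrases.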

In [38]:
#docs_trigrams = [trigram_mod[doc] for doc in docs_bigrams]

<div class="alert-info">
Part 5: Create dictionary
</div>

* The dictionary is created from the cleaned texts in docs_bigrams.
* Extreme words are removed from the dictionary:
    - Words that are present in fewer than 2 texts are removed
    - Words that are present in more than 70 % of the texts are removed
    - Of the remaining words, at most the 50000 most common are kept
* The dictionary is used to create a bag-of-words representation of the texts, which is saved to the corpus list. For each text, the list contains (word id, count) pairs for the dictionary words that occur in it.
* For example, the word "allowance" (id 5) occurs 7 times in the first text.
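The filtering and bag-of-words logic can be illustrated without Gensim. The following standard-library sketch mimics the document-frequency rules on toy data. The helper names `filter_vocab` and `to_bow` are hypothetical, the `keep_n` cap is omitted, and token ids are assigned alphabetically here, which need not match the ids Gensim assigns.

```python
from collections import Counter

def filter_vocab(docs, no_below=2, no_above=0.7):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: idx for idx, tok in enumerate(kept)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens that survived filtering."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

toy_docs = [
    ["allowance", "allowance", "company", "inventory"],
    ["allowance", "company", "warranty"],
    ["company", "inventory", "lease"],
    ["company", "lease"],
]
token2id = filter_vocab(toy_docs)
print(token2id)                       # {'allowance': 0, 'inventory': 1, 'lease': 2}
print(to_bow(toy_docs[0], token2id))  # [(0, 2), (1, 1)]
```

In this toy data, "company" is dropped because it occurs in every document, and "warranty" is dropped because it occurs in only one.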

In [39]:
id2word = corpora.Dictionary(docs_bigrams)
id2word.filter_extremes(no_below=2, no_above=0.7, keep_n=50000)

In [40]:
corpus = [id2word.doc2bow(text) for text in docs_bigrams]

In [42]:
corpus[0][0:6]

[(0, 1), (1, 1), (2, 2), (3, 2), (4, 1), (5, 7)]

In [44]:
id2word[5]

'allowance'