### Text preprocessing for conventional text analysis methods

The code below is color-coded to make the parts relevant to the accompanying article easier to find.

<div class="alert-warning">
Yellow marks parts of the code that are not central to the discussion, such as imports and auxiliary pre-processing operations.
</div>

<div class="alert-info">
Blue is used for the relevant parts of the code.
</div>

<div class="alert-warning">
Libraries
</div>

In [2]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
import gensim_lda_library as gl
from gensim import corpora
from gensim.models import CoherenceModel
import gensim
import pyLDAvis.gensim
import spacy

<div class="alert-info">
Convert PDFs to plain text
</div>

* Use the convert_pdfminer() function from the accompanying library (gensim_lda_library, imported as gl) to convert the PDF files to plain text, and append each text as one long string to the raw_text list. The details of the function are explained in the library.

In [6]:
files = os.listdir(data_path)

In [None]:
raw_text = []
for file in files:
    # os.path.join works whether or not data_path ends with a path separator
    temp1 = gl.convert_pdfminer(os.path.join(data_path, file))
    raw_text.append(temp1)
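
Note that os.listdir() returns every entry in the folder, including subfolders and non-PDF files. A minimal sketch of a stricter file listing follows; the helper name list_pdf_paths is hypothetical and not part of the accompanying library.

```python
import os

def list_pdf_paths(folder):
    """Return the sorted full paths of the .pdf files directly inside folder."""
    paths = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # Skip subdirectories and files with other extensions
        if os.path.isfile(full_path) and name.lower().endswith(".pdf"):
            paths.append(full_path)
    return paths
```

With such a helper, the conversion loop could iterate over list_pdf_paths(data_path) and pass each path directly to gl.convert_pdfminer().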

<div class="alert-info">
First cleaning steps
</div>

* Define the stopwords list (NLTK library). The example data consists of academic articles, so the stopwords list is extended with the words 'firm', 'af', 'rm', 'topic', 'journal', 'https', 'doi', 'org'.
* The simple_preprocess() function of Gensim can be used for basic cleaning steps. With default settings, it lowercases the text and removes numbers, words shorter than 2 characters, and words longer than 15 characters.
* Use the stopwords list of the NLTK library to remove the stopwords from the texts

In [8]:
stop_words = stopwords.words("english")
stop_words.extend(['firm','af','rm','topic','journal','https','doi','org'])

In [9]:
docs_cleaned = []
for item in raw_text:
    tokens = gensim.utils.simple_preprocess(item)
    docs_cleaned.append(tokens)

In [11]:
docs_nostops = []
for item in docs_cleaned:
    red_tokens = [word for word in item if word not in stop_words]
    docs_nostops.append(red_tokens)
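
To make the effect of these two steps concrete, here is a rough standard-library sketch. It only approximates what simple_preprocess() does with default settings and is not Gensim's implementation; the function names, the sample sentence, and the small DEMO_STOP_WORDS set are assumptions made for illustration.

```python
import re

# Stand-in stop list for illustration; the notebook uses NLTK's English list
# extended with the corpus-specific words.
DEMO_STOP_WORDS = {"the", "of", "and", "firm", "https", "doi", "org"}

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, alphabetic tokens, length filter."""
    # Runs of word characters that contain no digits, so "2019" is dropped
    tokens = re.findall(r"(?:(?!\d)\w)+", text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

def drop_stop_words(tokens, stop_words):
    """Remove every token that appears in stop_words."""
    return [t for t in tokens if t not in stop_words]

sample = "The firm's 2019 profits rose 12% (https://doi.org/x)."
tokens = rough_preprocess(sample)
print(tokens)
print(drop_stop_words(tokens, DEMO_STOP_WORDS))
```

Using a set for the stop words keeps the membership test fast on long documents; the notebook's list works the same way, only more slowly.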

<div class="alert-info">
Use the spaCy deep learning language model to keep only specific parts of speech
</div>

* Define which parts of speech (PoS) are kept
* Load the deep learning model. For this application, we do not need the parser and named-entity-recognition modules
* Go through the texts and keep only the lemmas of nouns

In [13]:
#allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
allowed_postags=['NOUN']

In [14]:
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

In [15]:
docs_lemmas = []
for red_tokens in docs_nostops:
    doc = nlp(" ".join(red_tokens))
    docs_lemmas.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
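
The filtering logic can be checked without loading the language model. The sketch below separates it into a function and runs it on stand-in tokens; FakeToken and lemmas_with_pos are hypothetical names, and the only assumption about real spaCy tokens is that they expose the lemma_ and pos_ attributes used above.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token with the two attributes used here
FakeToken = namedtuple("FakeToken", ["lemma_", "pos_"])

def lemmas_with_pos(doc, allowed_postags):
    """Return the lemmas of the tokens whose coarse PoS tag is allowed."""
    allowed = set(allowed_postags)
    return [token.lemma_ for token in doc if token.pos_ in allowed]

demo_doc = [
    FakeToken("bank", "NOUN"),
    FakeToken("increase", "VERB"),
    FakeToken("risky", "ADJ"),
    FakeToken("loan", "NOUN"),
]
print(lemmas_with_pos(demo_doc, ["NOUN"]))         # nouns only
print(lemmas_with_pos(demo_doc, ["NOUN", "ADJ"]))  # nouns and adjectives
```

With the real model, the same function would be called as lemmas_with_pos(nlp(text), allowed_postags).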

<div class="alert-info">
Create bigrams/trigrams
</div>

In [17]:
bigram = gensim.models.Phrases(docs_lemmas, threshold=75, min_count=3)
#trigram = gensim.models.Phrases(bigram[docs_lemmas], threshold=1, min_count=2)
bigram_mod = gensim.models.phrases.Phraser(bigram)
#trigram_mod = gensim.models.phrases.Phraser(trigram)

In [18]:
docs_bigrams = [bigram_mod[doc] for doc in docs_lemmas]

In [20]:
#docs_trigrams = [trigram_mod[doc] for doc in docs_bigrams]
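
The threshold and min_count arguments are easier to interpret with the scoring formula in view. The sketch below reproduces the default Phrases score as described in the Gensim documentation; treat the exact formula and the strict comparison as assumptions to verify against the installed version. The function name and all counts are made up for illustration.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram (a, b) from raw corpus counts."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

THRESHOLD = 75  # same value as in the notebook

# Two rare words that always occur together: high score
rare_pair = default_phrase_score(count_a=5, count_b=5, count_ab=5,
                                 vocab_size=1000, min_count=3)

# Two frequent words that co-occur only sometimes: low score
common_pair = default_phrase_score(count_a=200, count_b=150, count_ab=40,
                                   vocab_size=1000, min_count=3)

print(rare_pair > THRESHOLD)    # this pair would be merged into one token
print(common_pair > THRESHOLD)  # this pair would be left as two tokens
```

A higher threshold therefore yields fewer, more distinctive bigrams, while min_count discounts pairs that have been seen only a few times.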

<div class="alert-info">
Create dictionary
</div>

* The dictionary is created from the cleaned texts in docs_bigrams
* Rare and overly common words are removed from the dictionary
    - Words that are present in fewer than 2 texts are removed (no_below=2)
    - Words that are present in more than 70 % of the texts are removed (no_above=0.7)
    - Of the remaining words, at most the 50000 most common are kept (keep_n=50000)
* The dictionary is used to create a bag-of-words representation of the texts, which is saved to the corpus list

In [None]:
id2word = corpora.Dictionary(docs_bigrams)
id2word.filter_extremes(no_below=2, no_above=0.7, keep_n=50000)

In [22]:
corpus = [id2word.doc2bow(text) for text in docs_bigrams]
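
The following standard-library sketch illustrates the document-frequency filtering and the bag-of-words format on a toy corpus. It mimics the behaviour of filter_extremes() and doc2bow() as described above but is not Gensim's implementation; the function names and toy texts are hypothetical, and ids are assigned alphabetically here only for readability.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=2, no_above=0.7, keep_n=50000):
    """Map each kept token to an integer id, based on document frequency."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))  # absolute cap on document frequency
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens found in the most documents (ties broken alphabetically)
    kept = sorted(kept, key=lambda t: (-doc_freq[t], t))[:keep_n]
    return {token: idx for idx, token in enumerate(sorted(kept))}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [
    ["risk", "bank", "loan"],
    ["risk", "bank", "audit"],
    ["risk", "tax", "audit"],
    ["risk", "loan", "bond"],
]
token2id = filter_vocabulary(toy_docs)
print(token2id)
print(to_bow(["loan", "risk", "loan", "bank"], token2id))
```

In the toy corpus, "risk" appears in every text and "tax" and "bond" in only one, so all three are dropped and contribute nothing to the bag-of-words vectors.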