# 2.1 LDA Preprocessing

LDA is a generative model that ignores the context around words and instead operates on a bag-of-words view of each document.
That is, the order of words in a document is not considered, only how often each word occurs.
In particular, two tokens must be spelled identically to count as the same word. Because many languages, including English, have several inflected forms of the same word (for example "jump", "jumps", and "jumped"), we normalize each word to its base form. This is called lemmatization.
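
The bag-of-words view can be illustrated with a minimal sketch that uses only the standard library. The helper name `bag_of_words` and the naive whitespace tokenization are assumptions made for illustration; they are not part of the pipeline built below.

```python
from collections import Counter


def bag_of_words(text):
    # Hypothetical helper: naive whitespace tokenization, for illustration only
    return Counter(text.lower().split())


a = bag_of_words("the dog chases the cat")
b = bag_of_words("the cat chases the dog")

# Word order is lost: both sentences map to the same bag
print(a == b)

# Different surface forms are counted as different words
c = bag_of_words("the dog chased the cats")
print(sorted(set(c) - set(a)))
```

Without normalization, "cat" and "cats" end up as unrelated vocabulary entries, which is the problem lemmatization addresses.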

Further, we want to remove as much noise as possible from our documents. This includes punctuation, numbers, and common words that carry little meaning, such as "the", "and", and "a". These words are called stop words.
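
As a rough illustration of why stop words count as noise, the sketch below counts tokens in a short sentence. The tiny `toy_stop_words` set is an assumption for this example only; the pipeline below relies on spaCy's list instead.

```python
from collections import Counter

# Toy stop word list, assumed for illustration only
toy_stop_words = {"the", "and", "a", "of", "on"}

tokens = "the cat and the dog sat on the edge of a mat".split()

# The most frequent token is a stop word that says nothing about the topic
counts = Counter(tokens)
print(counts.most_common(1))

# After filtering, only content-bearing tokens remain
content_tokens = [t for t in tokens if t not in toy_stop_words]
print(content_tokens)
```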

### Loading the Data

In [1]:
import json


def load_dataset(path):
    with open(path) as f:
        data = json.load(f)

    texts = [d['text'] for d in data]
    labels = [d['label'] for d in data]
    return texts, labels


path = "../data/articles/train.json"
texts, labels = load_dataset(path)


### Building the Pipeline

In our simple preprocessing pipeline, we use spaCy for both lemmatization and stop word removal. Other popular options include NLTK and gensim.

It is sensible to perform lemmatization first, as this reduces the number of word forms that need to be checked against the stop word list.
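
The effect of the ordering can be sketched with toy stand-ins. The lemma table `toy_lemmas`, the list `toy_stop_words`, and both helper functions are hypothetical and only mimic what spaCy does in the actual pipeline.

```python
# Toy stand-ins, assumed for illustration: the real pipeline uses spaCy
toy_lemmas = {"is": "be", "was": "be", "were": "be", "cats": "cat"}
toy_stop_words = {"be", "the"}  # base forms only


def toy_lemmatize(tokens):
    # Map each token to its base form, leaving unknown tokens unchanged
    return [toy_lemmas.get(t, t) for t in tokens]


def toy_remove_stopwords(tokens):
    return [t for t in tokens if t not in toy_stop_words]


tokens = "the cats were asleep while the dog was awake".split()

# Lemmatize first: "were" and "was" become "be" and are filtered out
lemma_first = toy_remove_stopwords(toy_lemmatize(tokens))
print(lemma_first)

# Filter first: the inflected forms slip past the base-form stop word list
filter_first = toy_lemmatize(toy_remove_stopwords(tokens))
print(filter_first)
```

With lemmatization first, the stop word list only has to contain base forms; in the reverse order, every inflected form would need its own entry.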

In [9]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# We only need the lemmatizer, so we disable the parser and ner
# en_core_web_md is a medium-sized model trained on written web text
# https://spacy.io/models/en#en_core_web_md
# Larger models, such as en_core_web_lg, are more accurate but take longer to load
lemmatizer = spacy.load('en_core_web_md', disable=['parser', 'ner'])

# We extend the list of stopwords with our own words.
# This is usually an iterative process, where you try out the model and add/remove words
stopwords = STOP_WORDS | {"Mr.", "Mrs.", "Ms.", "Dr.", "$", "s"}


def lemmatize(text):
    doc = lemmatizer(text)
    return [token.lemma_ for token in doc if not token.is_punct and not token.is_space]


def remove_stopwords(tokens):
    return [token for token in tokens if token not in stopwords]


def preprocess(text):
    return remove_stopwords(lemmatize(text))


### Example

In [11]:
example = """Mr. Brown Fox jumps. 
This then is another sentence, i.e., another heap of words. 
Thank you, Weizenbaum Institute."""

example_lemmatized = lemmatize(example)

print("The lemmatized version of the example is:", end="\n>> ")
print(" ".join(example_lemmatized))

print("\nRemoving the stopwords yields", end="\n>> ")
print(" ".join(remove_stopwords(example_lemmatized)))


The lemmatized version of the example is:
>> Mr. Brown Fox jump this then be another sentence i.e. another heap of word thank you Weizenbaum Institute

Removing the stopwords yields
>> Brown Fox jump sentence i.e. heap word thank Weizenbaum Institute


Our pipeline lemmatized all words and removed punctuation and stop words.
The lemmatization step also lowercased all words except proper nouns.

### Preprocess the Data

We apply the preprocessing pipeline to all documents in our dataset. As this may take a while, we skip a file if its preprocessed output already exists. If you want to re-run the preprocessing, delete the corresponding `.preprocessed.json` file.

In [12]:
from os.path import exists
from tqdm import tqdm


def preprocess_file(path):
    output_filename = path.replace(".json", ".preprocessed.json")
    if exists(output_filename):
        print("File {} already exists, not overwriting.".format(output_filename))
        return

    texts, _ = load_dataset(path)
    result = [preprocess(text) for text in tqdm(texts, ncols=80)]

    with open(output_filename, "w") as f:
        json.dump(result, f)


preprocess_file("../data/articles/train.json")
preprocess_file("../data/articles/test.json")

preprocess_file("../data/headlines/train.json")
preprocess_file("../data/headlines/test.json")


100%|███████████████████████████████████████| 2977/2977 [02:45<00:00, 18.00it/s]
100%|█████████████████████████████████████████| 745/745 [00:38<00:00, 19.19it/s]


File ../data/headlines/train.preprocessed.json already exists, not overwriting.
File ../data/headlines/test.preprocessed.json already exists, not overwriting.
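
The skip-if-exists behaviour seen in the output above can be tried in isolation. The following is a minimal sketch under simplifying assumptions: `process_once` is a hypothetical helper, the input is a plain JSON list of strings rather than the article format, and `str.split` stands in for the real `preprocess` function.

```python
import json
import tempfile
from pathlib import Path


def process_once(path, transform):
    # Hypothetical helper: writes <name>.preprocessed.json next to the input
    # unless that file already exists. Returns True if work was done.
    path = Path(path)
    output = path.with_name(path.stem + ".preprocessed.json")
    if output.exists():
        return False
    with open(path) as f:
        data = json.load(f)
    with open(output, "w") as f:
        json.dump([transform(d) for d in data], f)
    return True


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "train.json"
    source.write_text(json.dumps(["a b", "c d"]))

    first = process_once(source, str.split)   # does the work
    second = process_once(source, str.split)  # skipped, output exists
    print(first, second)

    # Deleting the output file forces a re-run
    (Path(tmp) / "train.preprocessed.json").unlink()
    third = process_once(source, str.split)
    print(third)
```

Note that a check like this only looks at whether the output file exists, so changes to the stop word list or the model do not trigger a re-run on their own.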


**Preprocessing is done!**

*Please continue with [the next notebook](LDA-03-Training.ipynb).*