# LDA

## Loading the Data

In [5]:
import json

def load_dataset(path):
	# The file is expected to hold a JSON list of objects with 'text' and 'label' keys.
	with open(path) as f:
		data = json.load(f)

	texts = [ d['text'] for d in data ]
	labels = [ d['label'] for d in data ]
	return texts, labels

In [6]:
path = "../data/articles/train.json"
texts, labels = load_dataset(path)

## Preprocessing

LDA is a generative model that does not account for the context around words; it operates on a bag-of-words view of each document.
That is, the order of words in a document is discarded, and only the frequency of each word is kept.
As a consequence, two tokens must be spelled identically to count as the same word. Many languages, including English, inflect a word into several forms (for example "jump", "jumps", and "jumped"), so we normalize each word to its base form. This is called lemmatization.
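
To make the bag-of-words view concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the naive whitespace tokenization are illustrative assumptions, not part of the pipeline in this notebook.

```python
from collections import Counter

def bag_of_words(text):
    # Naive whitespace tokenization, for illustration only.
    return Counter(text.lower().split())

a = bag_of_words("the dog chases the cat")
b = bag_of_words("the cat chases the dog")

# Word order differs, but the two bags are identical.
print(a == b)  # True

# Without normalization, inflected forms end up as separate entries.
c = bag_of_words("the dogs chase the cat")
print(c["dog"], c["dogs"])  # 0 1
```

The last line shows why lemmatization matters here: "dogs" is counted, but it contributes nothing to the count of "dog".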

Further, we want to remove as much noise as possible from our documents. This includes removing punctuation, numbers, and common words that do not carry much meaning, such as "the", "and", "a", etc. These words are called stop words.
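
As a rough, library-free illustration of this kind of filtering, the sketch below drops punctuation-only tokens, numbers, and stop words from a list of tokens. The tiny `TOY_STOP_WORDS` set and the `remove_noise` helper are hypothetical stand-ins; the notebook itself relies on spaCy's stop word list.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration.
TOY_STOP_WORDS = {"the", "and", "a", "of", "in"}

def remove_noise(tokens):
    kept = []
    for token in tokens:
        if all(ch in string.punctuation for ch in token):
            continue  # punctuation-only token such as "," or "..."
        if token.replace(".", "", 1).replace(",", "").isdigit():
            continue  # plain numbers such as "42", "3.14", or "1,000"
        if token.lower() in TOY_STOP_WORDS:
            continue  # common word that carries little topical meaning
        kept.append(token)
    return kept

tokens = ["the", "fox", "ate", "3", "eggs", "and", "1,000", "seeds", "in", "2023", "."]
print(remove_noise(tokens))  # ['fox', 'ate', 'eggs', 'seeds']
```

The stop word check lowercases each token first, so capitalized forms such as "The" are filtered as well.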

In [7]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

lemmatizer = spacy.load('en_core_web_md', disable=['parser', 'ner'])

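def lemmatize(text):
	# Run the spaCy pipeline and keep each token's lemma, skipping punctuation and whitespace.
	doc = lemmatizer(text)
	return [ token.lemma_ for token in doc if not token.is_punct and not token.is_space ]

def remove_stopwords(tokens):
	# Drop tokens that appear in spaCy's English stop word list.
	return [ token for token in tokens if token not in STOP_WORDS ]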


Let's look at an example:

In [8]:
example = "The quick brown fox jumps over the lazy dogs. This then is another sentence, i.e., another heap of words."
example_lemmatized = lemmatize(example)

print("The lemmatized version of the example is:")
print(" ".join(example_lemmatized))

print("\nRemoving the stopwords yields")
print(" ".join(remove_stopwords(example_lemmatized)))

The lemmatized version of the example is:
the quick brown fox jump over the lazy dog this then be another sentence i.e. another heap of word

Removing the stopwords yields
quick brown fox jump lazy dog sentence i.e. heap word


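In this example, the lemmatization step also lowercased all of our words, which is a common practice. This matters because the stop word check is a case-sensitive set lookup. spaCy lemmas are not guaranteed to be lowercase, though: proper nouns typically keep their capitalization.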

Let's apply this pipeline to our documents:

In [9]:
from tqdm import tqdm

def preprocess_corpus(texts):
    return [ remove_stopwords(lemmatize(text)) for text in tqdm(texts, ncols=80) ]

In [10]:
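def preprocess(path):
	# Labels are not needed for preprocessing, so they are discarded.
	texts, _ = load_dataset(path)
	result = preprocess_corpus(texts)

	# Write the token lists next to the input, e.g. train.json -> train.preprocessed.json.
	with open(path.replace(".json", ".preprocessed.json"), "w") as f:
		json.dump(result, f)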
  
#preprocess("../data/articles/train.json")
#preprocess("../data/articles/test.json")

#preprocess("../data/headlines/train.json")
preprocess("../data/headlines/test.json")

100%|████████████████████████████████████| 24000/24000 [01:47<00:00, 222.92it/s]
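
Once a `.preprocessed.json` file has been written, a quick sanity check is to load it back and look at the most frequent tokens. The sketch below assumes the file holds a JSON list of token lists, as `preprocess` writes it. The helper names `load_preprocessed` and `top_tokens` are hypothetical, and the commented-out path simply follows the naming scheme used by `preprocess`.

```python
import json
from collections import Counter

def load_preprocessed(path):
    # Expects the format written by preprocess(): a JSON list of token lists.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def top_tokens(docs, n=10):
    # Count every token across all documents and return the n most common.
    counts = Counter(token for doc in docs for token in doc)
    return counts.most_common(n)

# Example usage:
# docs = load_preprocessed("../data/headlines/test.preprocessed.json")
# print(top_tokens(docs))
```

If stop words or punctuation show up near the top of this list, the preprocessing step probably needs another look.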
