## Getting Started

In this workbook we loosely follow the example from "Towards Data Science" on
[Topic Modeling with spaCy and gensim](https://towardsdatascience.com/building-a-topic-modeling-pipeline-with-spacy-and-gensim-c5dc03ffc619). First, we need to install gensim: open a command window (I had to run mine in "administrator"
mode) and run `pip install gensim`. We're also going to do some data visualization, so run `pip install pyLDAvis` as well. The imports below also assume that spaCy's `en_core_web_sm` model and NLTK's Brown corpus are already downloaded.


In [None]:
from nltk.corpus import brown

import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, LdaMulticore, Phrases
from gensim.models.phrases import Phraser
from gensim.corpora import Dictionary

import pyLDAvis
import pyLDAvis.gensim_models

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

from pprint import pprint
from collections import Counter, defaultdict

nlp = spacy.load('en_core_web_sm')

Lemmatizer = nlp.get_pipe("lemmatizer")


## Getting to Know the Brown Corpus

Let's spend a bit of time getting to know what's in the Brown corpus, our NLTK example of an "overlapping" corpus.

In [None]:
# categories of articles in Brown corpus
print(brown.categories())

for category in brown.categories() :
    print(f"For {category} we have {len(brown.fileids(categories=category))} articles.")


Let's create a list of the articles in the editorial, government, news, and romance categories.

In [None]:
for_modeling = []

for category in ['editorial','government','news','romance'] :
    for file_id in brown.fileids(categories=category) :
        text = brown.words(fileids=file_id)
        for_modeling.append(" ".join(text))
        
print(f"We have {len(for_modeling)} documents.")

In [None]:
# Updates spaCy's default stop words list with my additional words. 
stop_list = ['`',"Mr.","Mrs.","Ms."]
nlp.Defaults.stop_words.update(stop_list)

# Iterates over the words in the stop words list and resets the "is_stop" flag.
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

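These next two cells prepare our documents for the LDA algorithm. The first lemmatizes each article and keeps only nouns, adjectives, verbs, and adverbs. The second builds a gensim dictionary and converts each article into a bag of words.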

In [None]:
doc_list = []
allowed_postags=['NOUN','ADJ','VERB','ADV']

# Iterates through each article in the corpus.
for doc in for_modeling :
    # Passes that article through the pipeline and adds to a new list.
    pr = nlp(doc)
    doc_list.append([token.lemma_ for token in pr if token.pos_ in allowed_postags])

In [None]:

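# Map each unique lemma to an integer id.
id2word = Dictionary(doc_list)
# Drop lemmas found in fewer than 10 articles or in more than 40% of them.
id2word.filter_extremes(no_below=10, no_above=0.4)
# Reassign ids so they run consecutively with no gaps.
id2word.compactify()
# Convert each article to a bag of words: a list of (token_id, count) pairs.
corpus = [id2word.doc2bow(doc) for doc in doc_list]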
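
To make the dictionary step less of a black box, here is a minimal pure-Python sketch of the idea behind `filter_extremes` and `doc2bow`. It is an illustration, not gensim's implementation: the helper names, the toy documents, and the alphabetical id assignment are all assumptions made for this example, and gensim's own rounding and id ordering may differ.

```python
from collections import Counter

def filter_extremes(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents and in at
    most a `no_above` fraction of all documents (hypothetical helper)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items()
                  if no_below <= df <= max_docs)
    # Assign ids alphabetically; this ordering is an assumption of the sketch.
    return {tok: idx for idx, tok in enumerate(kept)}

def doc2bow(doc, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs,
    silently dropping tokens that were filtered out of the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy documents, invented for illustration only.
toy_docs = [
    ["tax", "state", "vote", "say"],
    ["love", "eye", "love", "say"],
    ["tax", "vote", "city", "say"],
    ["love", "eye", "hand", "say"],
]

token2id = filter_extremes(toy_docs, no_below=2, no_above=0.5)
toy_corpus = [doc2bow(doc, token2id) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```

In the toy data, "say" appears in every document and "state", "city", and "hand" appear in only one, so all four are dropped. That is the same kind of pruning that `no_below=10, no_above=0.4` applies to the Brown articles.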


And now we fit the actual model.

In [None]:
num_topics = 7

lda_model = LdaMulticore(corpus=corpus,
                         id2word=id2word,
                         num_topics=num_topics,
                         random_state=1,
                         chunksize=30,
                         passes=20,
                         alpha=0.31,
                         eta=0.91,
                         eval_every=1,
                         per_word_topics=True,
                         workers=1)

Let's take a look at the model, both in terms of the words that define the model and via the visualization package `pyLDAvis`. 

In [None]:
pprint(lda_model.print_topics(num_words=10))

In [None]:
pyLDAvis.enable_notebook()

In [None]:
# Older pyLDAvis releases exposed this module as pyLDAvis.gensim.
pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)

Let's take a look at our topic classifications by document and see how well LDA recovers our original categories. We'll take each document one at a time, parse it as a joined string, and apply the same processing as before.

You can pass the processed document into the LDA model using square brackets (this is a bit odd) and receive a tuple back. Because the model was fit with `per_word_topics=True`, the first element of the tuple contains the topics and their associated probabilities. The topic with the highest probability is the assigned topic.
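
Before running this against the real model, it may help to see the shape of that return value on its own. The sketch below uses a hand-written stand-in (the name `fake_result` and all of its numbers are made up for illustration, not output from the model) and picks the most probable topic the same way the cells below do.

```python
# Hypothetical stand-in for lda_model[doc_new] when per_word_topics=True.
# Every value here is invented; only the structure is the point.
fake_result = (
    [(0, 0.05), (3, 0.12), (5, 0.83)],   # (topic_id, probability) pairs
    [(17, [5, 3])],                      # per-word topic ids, most likely first
    [(17, [(3, 0.4), (5, 1.6)])],        # per-word (topic_id, phi) weights
)

# The first element holds the document-level topic distribution.
topic_probs = fake_result[0]

# max() with a key compares on the probability and returns the whole pair.
best_topic, best_prob = max(topic_probs, key=lambda pair: pair[1])

print(f"Assigned topic {best_topic} with probability {best_prob:.2f}")
```

Topics whose probability falls below the model's minimum threshold are left out of the list, so the pairs will not always cover all seven topics.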

In [None]:
topic_assignments = []

for file_id in brown.fileids(categories="romance") :
    doc = brown.words(fileids=file_id)
    pr = nlp(" ".join(doc))
    doc = [token.lemma_ for token in pr if token.pos_ in allowed_postags]
    doc_new = id2word.doc2bow(doc)
    
    topic_probs = lda_model[doc_new][0]
    topic = max(topic_probs,key=lambda x: x[1])
    topic_assignments.append(topic[0])
    
    
    
    

Now let's look at those topic assignments:

In [None]:
Counter(topic_assignments)

Looks like topic five is overwhelmingly romance. Let's do this for every category we worked with.

In [None]:
topic_assignments = defaultdict(list)

for category in ['editorial','government','news','romance'] :
    for file_id in brown.fileids(categories=category) :

        doc = brown.words(fileids=file_id)
        pr = nlp(" ".join(doc))
        doc = [token.lemma_ for token in pr if token.pos_ in allowed_postags]
        doc_new = id2word.doc2bow(doc)

        topic_probs = lda_model[doc_new][0]
        topic = max(topic_probs,key=lambda x: x[1])
        topic_assignments[category].append(topic[0])

        
        

In [None]:
for cat, topic_list in topic_assignments.items() :
    print(f"In {cat} we had the following:")
    topic_count = Counter(topic_list).most_common()
    
    for topic, count in topic_count : 
        print(f"    {count} articles were classified as topic {topic}.")
    
    

As we can see, the assignment is far from perfect, but the categories themselves overlap heavily, particularly the first three (editorial, government, and news). Romance seems to be safely identified on its own.
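
One rough way to put a number on "imperfect" is to ask, for each category, what share of its articles landed in that category's most common topic. The sketch below is one possible summary, not part of the original analysis. The function name `dominant_topic_share` and the toy assignments are assumptions made up for illustration. With the real results, you would pass in the `topic_assignments` dictionary built above.

```python
from collections import Counter

def dominant_topic_share(assignments):
    """For each category, return (most common topic, share of articles in it).
    Hypothetical helper; expects a dict of category -> list of topic ids."""
    summary = {}
    for category, topics in assignments.items():
        topic, count = Counter(topics).most_common(1)[0]
        summary[category] = (topic, count / len(topics))
    return summary

# Toy assignments, invented for illustration only.
toy_assignments = {
    "news": [1, 1, 2, 4],
    "romance": [5, 5, 5, 5],
}

summary = dominant_topic_share(toy_assignments)
for category, (topic, share) in summary.items():
    print(f"{category}: topic {topic} covers {share:.0%} of articles")
```

A share near 100% means the category maps cleanly onto a single topic, while a share near `1 / num_topics` means the category is spread almost evenly across topics.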