## Getting Started

In this workbook we loosely follow the example from "Towards Data Science" on
[Topic Modeling with spaCy and gensim](https://towardsdatascience.com/building-a-topic-modeling-pipeline-with-spacy-and-gensim-c5dc03ffc619). First, we need to install gensim, so open up a command window (I had to run mine in "administrator"
mode) and run `pip install gensim`. We're also going to do some data viz, so run `pip install pyLDAvis` as well.
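
Before running the imports, it can help to confirm which versions actually got installed. As far as I can tell, the code in this workbook is written in the spaCy 2.x style (it hands plain functions to `nlp.add_pipe` and filters out the `-PRON-` lemma), and it imports `pyLDAvis.gensim`, which newer pyLDAvis releases renamed to `pyLDAvis.gensim_models`. The helper below is a minimal sketch, not part of the original example; the name `installed_versions` is my own.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report on the packages this workbook imports.
for name, ver in installed_versions(["spacy", "gensim", "pyLDAvis", "nltk"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a package shows up as "not installed", rerun the matching `pip install` command before going further.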


In [None]:
from nltk.corpus import brown
from nltk.corpus import stopwords

import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

from pprint import pprint
from collections import Counter, defaultdict

nlp = spacy.load('en_core_web_sm')

In [None]:
# Some functions we'll use later in the topic modeling.

def lemmatizer(doc):
    # This takes in a doc of tokens from the NER and lemmatizes them.
    # Pronouns (like "I" and "you") get lemmatized to '-PRON-', so I'm removing those.
    doc = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
    doc = u' '.join(doc)
    return nlp.make_doc(doc)
    
def remove_stopwords(doc):
    # This will remove stopwords and punctuation.
    # Use token.text to return strings, which we'll need for Gensim.
    doc = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return doc

## Getting to Know the Brown Corpus

Let's spend a bit of time getting to know what's in the Brown corpus, our NLTK example of an "overlapping" corpus.

In [None]:
# categories of articles in Brown corpus
print(brown.categories())

for category in brown.categories() :
    print(f"For {category} we have {len(brown.fileids(categories=category))} articles.")


Let's create a list of the articles in the editorial, government, news, and romance categories.

In [None]:
for_modeling = []

for category in ['editorial','government','news','romance'] :
    for file_id in brown.fileids(categories=category) :
        text = brown.words(fileids=file_id)
        for_modeling.append(" ".join(text))
        
print(f"We have {len(for_modeling)} documents.")

In [None]:
# Updates spaCy's default stop words list with my additional words. 
stop_list = ['`',"Mr.","Mrs.","Ms."]
nlp.Defaults.stop_words.update(stop_list)

# Iterates over the updated stop words list and sets the "is_stop" flag on each
# word's vocabulary entry so the additions above take effect.
for word in nlp.Defaults.stop_words:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

In [None]:
# The add_pipe function appends our functions to the default pipeline.
nlp.add_pipe(lemmatizer,name='lemmatizer',after='ner')
nlp.add_pipe(remove_stopwords, name="stopwords", last=True)

In [None]:
doc_list = []

# Iterates through each article in the corpus.
for doc in for_modeling :
    # Passes that article through the pipeline and adds to a new list.
    pr = nlp(doc)
    doc_list.append([t.lower() for t in pr if t.isalpha()])

In [None]:
# Create a mapping of word IDs to words.
words = corpora.Dictionary(doc_list)

# Turns each document into a bag of words.
corpus = [words.doc2bow(doc) for doc in doc_list]
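
To make the "bag of words" idea concrete, here is a small pure-Python sketch of the same two steps on toy data: build a word-to-ID mapping, then turn each document into a list of `(word_id, count)` pairs. The names `toy_docs`, `word_to_id`, and `to_bow` are made up for illustration, and the IDs gensim assigns may come out in a different order than the ones here.

```python
from collections import Counter

# Two tiny "documents", already tokenized and lowercased.
toy_docs = [["court", "jury", "court"], ["love", "court", "love"]]

# Assign each distinct word an integer ID in order of first appearance.
word_to_id = {}
for doc in toy_docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))


def to_bow(doc):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(word_to_id[w] for w in doc if w in word_to_id)
    return sorted(counts.items())


toy_corpus = [to_bow(doc) for doc in toy_docs]
print(word_to_id)
print(toy_corpus)
```

Each document ends up as word counts with the word order thrown away, which is the only thing LDA gets to see.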

Now for the actual fitting of our model. We ask for four topics, one for each category we pulled from the corpus.

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=4, 
                                           random_state=2,
                                           update_every=1,
                                           passes=15,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
pprint(lda_model.print_topics(num_words=15))

In [None]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, words)

Let's take a look at our topic classifications by document and see how good a job LDA is doing recovering our original topics. We'll take each document one at a time, parse it (as a joined string), and do basically the same processing as we did before. 

You can pass the processed document into the LDA model using square brackets (this is a bit odd) and receive a tuple back. The first element of the tuple contains the topics and their associated probabilities. We assign the document to the topic with the highest probability.
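
Here's a minimal sketch of that "take the max" step on its own, using a made-up stand-in for the model's return value. I'm assuming the tuple has three elements because we set `per_word_topics=True` (topic probabilities first, then two per-word lists we don't use); the helper name `most_likely_topic` and the numbers are hypothetical.

```python
def most_likely_topic(model_output):
    """Return the (topic_id, probability) pair with the highest probability."""
    topic_probs = model_output[0]  # first element: list of (topic_id, probability)
    return max(topic_probs, key=lambda pair: pair[1])


# Made-up output in the assumed shape: (topic probabilities, word topics, phi values).
fake_output = ([(0, 0.05), (2, 0.80), (3, 0.15)], [], [])

topic_id, prob = most_likely_topic(fake_output)
print(f"Assigned topic {topic_id} with probability {prob:.2f}")
```

Notice that topic 1 is missing from the made-up list. As far as I know, gensim leaves out topics whose probability falls below a small threshold, so it's safer to take the max over whatever comes back than to index by topic number.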

In [None]:
topic_assignments = []

for file_id in brown.fileids(categories="romance") :
    doc = brown.words(fileids=file_id)
    pr = nlp(" ".join(doc))
    doc = [t.lower() for t in pr if t.isalpha()]
    doc_new = words.doc2bow(doc)
    
    topic_probs = lda_model[doc_new][0]
    topic = max(topic_probs,key=lambda x: x[1])
    topic_assignments.append(topic[0])
    

Now let's look at those topic assignments:

In [None]:
Counter(topic_assignments)

Looks like topic zero is overwhelmingly romance. Let's do this for every category we worked with.

In [None]:
topic_assignments = defaultdict(list)

for category in ['editorial','government','news','romance'] :
    for file_id in brown.fileids(categories=category) :

        doc = brown.words(fileids=file_id)
        pr = nlp(" ".join(doc))
        doc = [t.lower() for t in pr if t.isalpha()]
        doc_new = words.doc2bow(doc)

        topic_probs = lda_model[doc_new][0]
        topic = max(topic_probs,key=lambda x: x[1])
        topic_assignments[category].append(topic[0])


In [None]:
for cat, topic_list in topic_assignments.items() :
    print(f"In {cat} we had the following:")
    topic_count = Counter(topic_list).most_common()
    
    for topic, count in topic_count : 
        print(f"    {count} articles were classified as topic {topic}.")
    
    

As we can see, the assignment is far from perfect, but the categories themselves overlap pretty heavily, particularly the first three (editorial, government, and news). Romance seems to be safely identified on its own.
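
If you'd like one number per category instead of eyeballing the counts, one option is the share of a category's documents that land in its most common topic. The sketch below assumes a dictionary shaped like `topic_assignments` (category name mapped to a list of topic numbers); the function name `summarize` and the toy numbers are made up for illustration and are not the workbook's results.

```python
from collections import Counter


def summarize(assignments):
    """For each category, return (dominant_topic, share of documents assigned to it)."""
    summary = {}
    for category, topics in assignments.items():
        topic, count = Counter(topics).most_common(1)[0]
        summary[category] = (topic, count / len(topics))
    return summary


# Made-up assignments for illustration only.
toy_assignments = {
    "news": [1, 1, 2, 1],
    "romance": [0, 0, 0, 0],
}

for category, (topic, share) in summarize(toy_assignments).items():
    print(f"{category}: dominant topic {topic} covers {share:.0%} of documents")
```

A share near 100% means the category maps cleanly onto a single topic. If two categories share the same dominant topic, LDA isn't separating them.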