# Topic Modeling
Liz McQuillan
4/30/2018

## Topic Modeling Uses

Given a large corpus of text, Topic Modeling is one way to get a general overview of the different topics within the text, the proportion of these themes, or even find hidden patterns within the corpus. This is different from rules-based approaches in text mining (like keyword searches), in that it's an unsupervised technique for finding linked groups of words ("topics") in a large corpus. 

Topics are generally defined as "a repeating pattern of co-occurring terms". Topic Models are used for clustering documents, feature selection, and information retrieval, among other things.

## Tools and Methods

There are a handful of techniques for getting topics from text, including Term Frequency-Inverse Document Frequency (TF-IDF), Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Hierarchical Dirichlet Process (HDP), and Latent Dirichlet Allocation (LDA). LDA is the most popular topic modeling technique, so that's what I'll focus on here.

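For LDA, Gensim is one of the most widely used Python packages. scikit-learn also offers topic modeling algorithms, including NMF and its own LDA implementation (`sklearn.decomposition.LatentDirichletAllocation`). Be careful with the abbreviation, though: elsewhere in scikit-learn, "LDA" usually refers to Linear Discriminant Analysis.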

LDA can be viewed as a matrix factorization technique, so it takes a document-term matrix as input. It tries to work out which topics would have produced those documents, on the assumption that each document is a mixture of topics and each topic is a probability distribution over words.
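
To make that generative assumption concrete, here is a minimal sketch of how LDA imagines a document being written. The toy vocabulary, the number of topics, and the prior values `alpha` and `eta` are all assumptions chosen for illustration; nothing here is fitted to the newsgroups data.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["key", "encryption", "chip", "car", "engine", "dealer"]  # toy vocabulary (assumed)
n_topics = 2
alpha = 0.5  # Dirichlet prior on per-document topic mixtures (assumed value)
eta = 0.5    # Dirichlet prior on per-topic word distributions (assumed value)

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet([eta] * len(vocab), size=n_topics)

def generate_document(n_words):
    """Draw one document: pick a topic mixture, then a topic and a word per position."""
    doc_topic = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topic)        # topic for this word position
        w = rng.choice(len(vocab), p=topic_word[z])  # word drawn from that topic
        words.append(vocab[w])
    return doc_topic, words

doc_topic, words = generate_document(8)
print(doc_topic.round(2), words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and topic-word distributions that most plausibly produced them.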

The interim steps in this process include converting the document-term matrix into two matrices, a document-topics matrix and a topic-terms matrix, which contain initial document/topic and topic/word distributions. Here's where LDA actually starts working. LDA aims to improve these matrices through a variety of sampling techniques. Basically, LDA iterates through each word in each document and adjusts that word's topic assignment (assuming all other current topic/word assignments are correct) until a steady state is reached.
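
The word-by-word update described above corresponds to collapsed Gibbs sampling. Below is a toy sketch of that sampler, not the code Gensim runs (Gensim's `LdaModel` is based on online variational Bayes). The priors `alpha` and `beta`, the integer word ids, and the fixed number of sweeps (used here instead of a steady-state check) are assumptions for illustration.

```python
import random

def gibbs_lda(docs, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. `docs` is a list of lists of integer word ids."""
    rng = random.Random(seed)
    vocab_size = max(w for doc in docs for w in doc) + 1

    # Count tables: document-topic, topic-word, and per-topic totals
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial topic assignment for every word occurrence
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this word's current assignment from the counts
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Score each topic, treating every other assignment as correct
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]

                # Record the newly sampled assignment
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word

# Four tiny documents over six word ids (toy data)
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2)
print(doc_topic)
```

The two count tables play the role of the document-topics and topic-terms matrices: normalizing their rows (after adding the priors) gives the estimated distributions.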

In [1]:
#import necessary libraries

import pandas as pd 
import sklearn
import spacy
import en_core_web_sm  # or any other model you downloaded via spacy download or pip
nlp = en_core_web_sm.load()

import gensim
from gensim import corpora
from gensim.models import CoherenceModel

#### Import Data
Let's pull in some text data to work with. 

We're going to use a subset of the 20 Newsgroups dataset, via scikit-learn.

By using the Pandas package we can enforce a tabular structure on the data. This is especially helpful if you're used to working in SQL, SAS, or Excel.

In [2]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame([newsgroups.data, newsgroups.target.tolist()]).T
df.columns = ['text', 'target']
targets = pd.DataFrame( newsgroups.target_names)
targets.columns=['title']
news_data = pd.merge(df, targets, left_on='target', right_index=True)

#### Cleaning the Data
Then we'll do some basic pre-processing to clean the data.

See this page (https://github.com/LizMcQuillan/NLP/blob/master/NLP%20Pre-processing.ipynb) for a more thorough explanation of NLP pre-processing techniques.

In [3]:
tokens = []
lemma = []
pos = []

# Strip punctuation; regex=True makes pandas treat the pattern as a regular expression
news_data['text'] = news_data['text'].str.replace(r'[^\w\s]+', '', regex=True)

for doc in nlp.pipe(news_data['text'].astype(str).values, batch_size=100):
    # spaCy v3+ check for a dependency parse; on spaCy v2 the equivalent is doc.is_parsed
    if doc.has_annotation("DEP"):
        tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.like_url])
        lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.like_url])
        pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.like_url])
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries of the original Dataframe, so add some blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

news_data['tokens'] = tokens
news_data['lemmas'] = lemma 
news_data['pos'] = pos

news_data.head()

| | text | target | title | tokens | lemmas | pos |
|---|---|---|---|---|---|---|
| 0 | I was wondering if anyone out there could enli... | 7 | rec.autos | [wondering, enlighten, car, saw, day, 2door, s... | [wonder, enlighten, car, see, day, 2door, spor... | [VERB, VERB, NOUN, VERB, NOUN, NUM, NOUN, NOUN... |
| 17 | I recently posted an article asking what kind ... | 7 | rec.autos | [recently, posted, article, asking, kind, rate... | [recently, post, article, ask, kind, rate, sin... | [ADV, VERB, NOUN, VERB, NOUN, NOUN, ADJ, NOUN,... |
| 29 | \nIt depends on your priorities A lot of peop... | 7 | rec.autos | [depends, priorities, lot, people, higher, pri... | [depend, priority, lot, people, high, priority... | [VERB, NOUN, NOUN, NOUN, ADJ, NOUN, NOUN, NOUN... |
| 56 | an excellent automatic can be found in the sub... | 7 | rec.autos | [excellent, automatic, found, subaru, legacy, ... | [excellent, automatic, find, subaru, legacy, s... | [ADJ, NOUN, VERB, PROPN, NOUN, VERB, NOUN, NOU... |
| 64 | Ford and his automobile I need information o... | 7 | rec.autos | [Ford, automobile, need, information, Ford, pa... | [Ford, automobile, need, information, Ford, pa... | [PROPN, NOUN, VERB, NOUN, PROPN, ADV, ADJ, NOU... |


#### Building the Corpus
Now, let's take only the lemmas to build the dictionary and doc-term matrix.

In [4]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index. 
dictionary = corpora.Dictionary(lemma)

# Converting corpus into Document-Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in lemma]
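
For intuition about what the dictionary and `doc2bow` produce, here is a plain-Python sketch of the same idea: map each unique term to an integer id, then represent each document as (term id, count) pairs. The toy documents are assumptions, and the ids Gensim assigns may differ from the first-appearance order used here.

```python
from collections import Counter

docs = [["car", "engine", "car"], ["key", "chip", "key", "car"]]  # toy lemma lists

# Assign an integer id to each unique term, in order of first appearance
term_to_id = {}
for doc in docs:
    for term in doc:
        term_to_id.setdefault(term, len(term_to_id))

def to_bow(doc):
    """Return sorted (term_id, count) pairs for one document."""
    counts = Counter(term_to_id[term] for term in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(term_to_id)   # {'car': 0, 'engine': 1, 'key': 2, 'chip': 3}
print(bow_corpus)   # [[(0, 2), (1, 1)], [(0, 1), (2, 2), (3, 1)]]
```

Only terms that actually occur in a document are stored, which is why this sparse list-of-pairs form is used instead of a full matrix.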

## Building the Topic Model
Now that we have the dictionary and doc-term matrix we can start building the LDA model. LDA requires the number of topics as an input. I've also specified chunksize (number of docs to be used in each training "chunk") and passes (the number of training passes).

The LDA model might take a while to run. In a future notebook we'll talk about how to optimize runtime, automate hyperparameter tuning, and implement multiprocessing to speed this up considerably.

In [5]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.LdaModel

# Running and Training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, 
               num_topics=10, 
               random_state = 100, 
               chunksize = 100, 
               id2word = dictionary, 
               passes=10, 
               alpha='auto')

#### View the Topics
Using print_topics will print keywords for each topic and the relative weights of each word.

##### Interpreting Topics
In the case of our data, Topic 0 is represented as ('0.025*"not" + 0.014*"people" + 0.012*"know"').

This means the top words that contribute to the topic are "not", "people", and "know" and the weights represent the importance of each word. LDA requires a good deal of interpretation - it will not label the topics with a word or phrase, so it's up to the analyst to determine how to label each topic.
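
If you want the words and weights as data instead of one formatted string, a small helper can split the `weight*"word"` format. This is a sketch that assumes the string layout shown in this notebook's output; the helper name `parse_topic` is hypothetical. Gensim's `show_topic`, used later in this notebook, returns the (word, weight) pairs directly.

```python
import re

def parse_topic(topic_string):
    """Split a string like '0.025*"not" + 0.014*"people"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.025*"not" + 0.014*"people" + 0.012*"know"'
for word, weight in parse_topic(topic_0):
    print(f"{word:<10}{weight:.3f}")
```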

In [6]:
ldamodel.print_topics(num_topics=10, num_words=3)

[(0, '0.025*"not" + 0.014*"people" + 0.012*"know"'),
 (1, '0.026*"happy" + 0.010*"suck" + 0.006*"selection"'),
 (2, '0.078*"Q" + 0.049*"MR" + 0.041*"STEPHANOPOULOS"'),
 (3, '0.091*"anonymous" + 0.049*"process" + 0.028*"archive"'),
 (4, '0.015*"tv" + 0.014*"reverse" + 0.013*"select"'),
 (5, '0.031*"1" + 0.022*"2" + 0.015*"3"'),
 (6, '0.000*"Rofekamp" + 0.000*"Notre" + 0.000*"films"'),
 (7, '0.020*"key" + 0.011*"use" + 0.010*"encryption"'),
 (8, '0.130*"government" + 0.044*"God" + 0.033*"device"'),
 (9, '0.061*"criminal" + 0.018*"variable" + 0.017*"publication"')]

### Finding the Optimal Number of Topics

There's some disagreement among data scientists about what the best number of topics even means - is it coherence? comprehensiveness? something else? Personally, I err on the side of interpretability and meaningfulness - I don't want the same words repeated over and over throughout the topics. In practice this means building a handful of models with various k values (numbers of topics) and picking the one with the highest coherence score. The coherence score for our model is ~0.4 here - not ideal, but not terrible. It's a bit of a balancing act getting a "good" coherence score while maintaining an acceptable level of readability.
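
The "build several models and compare" loop can be sketched as follows. `select_num_topics` and `gensim_cv_score` are hypothetical helper names, and the scores in the usage example are made-up placeholders, not results from this dataset. The Gensim calls inside `gensim_cv_score` mirror the ones used elsewhere in this notebook.

```python
def select_num_topics(candidate_ks, score_fn):
    """Score every candidate k and return the best one plus all scores."""
    scores = {k: score_fn(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores

def gensim_cv_score(k, corpus, dictionary, texts):
    """Train an LDA model with k topics and return its c_v coherence (can be slow)."""
    from gensim.models import CoherenceModel, LdaModel
    model = LdaModel(corpus, num_topics=k, id2word=dictionary,
                     random_state=100, passes=10)
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
    return cm.get_coherence()

# Usage with placeholder scores (illustrative numbers only)
fake_scores = {5: 0.38, 10: 0.43, 15: 0.41}
best_k, scores = select_num_topics([5, 10, 15], fake_scores.get)
print(best_k, scores)
```

With the objects from this notebook, the scoring function would be something like `lambda k: gensim_cv_score(k, doc_term_matrix, dictionary, lemma)`. It is still worth reading the topics of the top few candidates, since the highest score does not always give the most readable topics.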

In [7]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=lemma, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda) #higher is better


Coherence Score:  0.4262870033224271


## Improving the LDA Model

The results of the LDA model are completely dependent on the data used (garbage in = garbage out) and the parameter choices we make. Since doc-term matrices are typically sparse, dimensionality reduction may improve the results.

### Frequency Filter
Since terms which appear less often in the corpus are also less likely to appear in the results, the lowest frequency terms can be excluded. Some basic exploratory analysis of term frequencies is required to pinpoint an appropriate frequency threshold.
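
As a rough sketch of a frequency filter, the helper below drops terms that appear in fewer than `min_doc_freq` documents. The helper name, the threshold, and the toy documents are assumptions for illustration. Gensim's `Dictionary` also provides `filter_extremes(no_below=..., no_above=...)`, which applies this kind of cut-off to the dictionary itself.

```python
from collections import Counter

def filter_rare_terms(docs, min_doc_freq=2):
    """Drop terms that appear in fewer than min_doc_freq documents."""
    # Document frequency: count each term once per document
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = [[term for term in doc if doc_freq[term] >= min_doc_freq] for doc in docs]
    return kept, doc_freq

docs = [["car", "engine", "bricklin"], ["car", "dealer"], ["engine", "car", "dealer"]]
filtered, doc_freq = filter_rare_terms(docs, min_doc_freq=2)
print(doc_freq)
print(filtered)
```

Inspecting `doc_freq` first (for example, how many terms occur in only one document) is one way to do the exploratory analysis needed to pick the threshold.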

### Parts of Speech Filter
Earlier in this code some types of tokens were filtered out (stop words, punctuation, whitespace, and URLs). Depending on the data being analyzed, it may improve the model's results to strip out further types of words, whether these are additional filler words (e.g. "within", "may") or other words that occur in ways that render them meaningless.
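
Since the pre-processing step stored a part-of-speech tag alongside every lemma, a POS filter can be a simple paired filter over the two lists. This is a sketch: the set of tags to keep is an assumption and should be tuned for your data. The example lists are the first few lemmas and tags of the first document shown earlier.

```python
KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB"}  # assumed choice; tune for your data

def filter_by_pos(lemmas, pos_tags, keep=KEEP_POS):
    """Keep only the lemmas whose part-of-speech tag is in `keep`."""
    return [lemma for lemma, tag in zip(lemmas, pos_tags) if tag in keep]

lemmas = ["wonder", "enlighten", "car", "see", "day", "2door"]
pos_tags = ["VERB", "VERB", "NOUN", "VERB", "NOUN", "NUM"]
print(filter_by_pos(lemmas, pos_tags))  # ['wonder', 'enlighten', 'car', 'see', 'day']
```

Applied to the whole corpus this would look like `[filter_by_pos(l, p) for l, p in zip(lemma, pos)]`, skipping any documents whose entries are `None`.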

### Automated Hyperparameter Optimization

When building a model like LDA for use in production, it's necessary to automate much of the modeling process. This includes tuning hyperparameters like alpha, beta, and the number of topics. In this notebook we used Gensim's 'auto' arg for our alpha hyperparameter, but there are better ways to optimize this that are outside the scope of this notebook.

### Document Pooling
There's research suggesting that creating macro-documents for LDA training might increase the accuracy and/or usability of topics by enriching the content in each document (http://users.cecs.anu.edu.au/~ssanner/Papers/sigir13.pdf). However, the documents being used here are quite long (average ~700 words) and presumably have sufficient co-occurrence of terms within each document to be used for training without aggregation. If the documents being analyzed were shorter (like Tweets, SMS texts, and the like) it may be worthwhile to aggregate at some level.
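
A minimal sketch of pooling, assuming each short text comes with some grouping key (author, hashtag, thread, and so on). The helper name, the hashtag key, and the toy token lists are hypothetical.

```python
from collections import defaultdict

def pool_documents(token_lists, keys):
    """Concatenate the token lists that share a pooling key into one macro-document."""
    pooled = defaultdict(list)
    for tokens, key in zip(token_lists, keys):
        pooled[key].extend(tokens)
    return dict(pooled)

# Hypothetical short texts, pooled by a hypothetical hashtag
token_lists = [["new", "car"], ["engine", "noise"], ["key", "escrow"]]
hashtags = ["#autos", "#autos", "#crypto"]
print(pool_documents(token_lists, hashtags))
```

The pooled token lists can then be passed to the dictionary and doc-term matrix steps in place of the individual short documents.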

### Assigning Documents to Topics

In [8]:
def format_topics_sentences(ldamodel, corpus, texts):
    """Return one row per document: dominant topic, that topic's weight, and its keywords."""
    rows = []

    # Get dominant topic in each document
    for doc_topics in ldamodel[corpus]:
        # doc_topics is a list of (topic_num, proportion) pairs; keep the largest
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    # Build the frame in one step (DataFrame.append is not available in recent pandas)
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the contents of `texts` as the final column
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df



df_topic_sents_keywords = format_topics_sentences(ldamodel=ldamodel, corpus=doc_term_matrix, texts=dictionary)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contribution', 'Keywords', 'Text']

# Print the first 5 rows
df_dominant_topic.head()

| | Document_No | Dominant_Topic | Topic_Perc_Contribution | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.4349 | not, people, know, think, say, right, go, s, l... | 2door |
| 1 | 1 | 0.0 | 0.4451 | not, people, know, think, say, right, go, s, l... | 60 |
| 2 | 2 | 0.0 | 0.6864 | not, people, know, think, say, right, go, s, l... | 70 |
| 3 | 3 | 7.0 | 0.5117 | key, use, encryption, system, chip, number, DB... | Bricklin |
| 4 | 4 | 0.0 | 0.4117 | not, people, know, think, say, right, go, s, l... | addition |
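
Note that the Text column above holds single terms ("2door", "60", ...) where you might expect document text. That appears to be because the gensim `dictionary` was passed as `texts`. Here is a minimal sketch of attaching the original document text instead, using toy stand-ins for the model output; the helper name `dominant_topics` and all values are assumptions for illustration.

```python
import pandas as pd

def dominant_topics(doc_topic_dists, texts):
    """One row per document: most probable topic, its weight, and the source text."""
    rows = []
    for dist, text in zip(doc_topic_dists, texts):
        # dist is a list of (topic_id, probability) pairs; keep the most probable
        topic_num, weight = max(dist, key=lambda pair: pair[1])
        rows.append({"Dominant_Topic": topic_num,
                     "Perc_Contribution": round(weight, 4),
                     "Text": text})
    return pd.DataFrame(rows)

# Toy stand-ins: each inner list mimics the pairs returned by ldamodel[bow]
doc_topic_dists = [[(0, 0.43), (7, 0.30)], [(7, 0.51), (0, 0.20)]]
texts = ["first document text", "second document text"]
result = dominant_topics(doc_topic_dists, texts)
print(result)
```

In this notebook, `doc_topic_dists` would be `ldamodel[doc_term_matrix]` and `texts` would be `news_data['text'].tolist()`, which keeps the rows in the same order as the corpus.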


### Finding the Document That's Representative of Each Topic

In [9]:
# Take the single highest-contribution document under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values('Perc_Contribution', ascending=False).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contribution", "Keywords", "Text"]

# Print top 5 rows
sent_topics_sorteddf_mallet.head()

| | Topic_Num | Topic_Perc_Contribution | Keywords | Text |
|---|---|---|---|---|
| 0 | 0.0 | 0.8589 | not, people, know, think, say, right, go, s, l... | speedomete |
| 1 | 1.0 | 0.6828 | happy, suck, selection, plate, Wednesday, pain... | surrounding |
| 2 | 2.0 | 0.9524 | Q, MR, STEPHANOPOULOS, release, search, win, f... | B |
| 3 | 5.0 | 0.8797 | 1, 2, 3, 4, 5, proposal, Security, April, Univ... | M |
| 4 | 7.0 | 0.8412 | key, use, encryption, system, chip, number, DB... | 740 |


### Get the Distribution of Topics

In [10]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Format
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Print top 5 rows
df_dominant_topics.head()

| | Dominant_Topic | Topic_Keywords | Num_Documents | Perc_Documents |
|---|---|---|---|---|
| 0.0 | 0.0 | not, people, know, think, say, right, go, s, l... | 7187.0 | 0.6352 |
| 1.0 | 0.0 | not, people, know, think, say, right, go, s, l... | 1.0 | 0.0001 |
| 2.0 | 0.0 | not, people, know, think, say, right, go, s, l... | 10.0 | 0.0009 |
| 3.0 | 7.0 | key, use, encryption, system, chip, number, DB... | | |
| 4.0 | 0.0 | not, people, know, think, say, right, go, s, l... | | |
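
In the table above, the per-topic counts are lined up by index against per-document rows, so Num_Documents on row 0 is the count for topic 0, not a property of document 0. Here is a sketch of building one row per topic instead, using a toy stand-in for `df_topic_sents_keywords`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_topic_sents_keywords (values are illustrative only)
df = pd.DataFrame({
    "Dominant_Topic": [0, 0, 7, 0, 7, 2],
    "Topic_Keywords": ["not, people", "not, people", "key, use",
                       "not, people", "key, use", "Q, MR"],
})

# Count documents per topic and keep each topic's keywords once
summary = (
    df.groupby("Dominant_Topic")
      .agg(Topic_Keywords=("Topic_Keywords", "first"),
           Num_Documents=("Topic_Keywords", "size"))
      .reset_index()
)

# Share of documents whose dominant topic is this one
summary["Perc_Documents"] = (summary["Num_Documents"] / summary["Num_Documents"].sum()).round(4)
print(summary)
```

Grouping on `Dominant_Topic` keeps the keywords, counts, and percentages aligned on the topic number, so each topic appears exactly once.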
