# Topic Modeling
Liz McQuillan
4/30/2018

## Topic Modeling Uses

Given a large corpus of text, Topic Modeling is one way to get a general overview of the different topics within the text, the proportion of these themes, or even find hidden patterns within the corpus. This is different from rules-based approaches in text mining (like keyword searches), in that it's an unsupervised technique for finding linked groups of words ("topics") in a large corpus. 

Topics are generally defined as "a repeating pattern of co-occuring terms". Topic Models are used for clustering documents, feature selection, and information retrieval among other things. 

## Tools and Methods

There are a handful of techniques for getting topics from text, including Term Frequency-Inverse Document Frequency (TF-IDF), NonNegative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Heirarchical Dirichlet Process (HPD), and Latent Dirichlet Allocation (LDA). LDA is the most popular topic modeling technique, so that's what I'll focus on here. 

For LDA, Gensim is the premier Python Package. scikit-learn has some alternative algorithms, like NMF, but doesn't have LDA (LDA in scikit-learn = Linear Discriminant Analysis).

LDA is a matrix factorization technique, and thus requires a document-term matrix as input. LDA takes a document-term matrix and tries to figure out which topics would create those documents based on the assumption that documents are produced from a bunch of topics which themselves are made up of words based on various probability distributions. 

The interim steps in this process include converting the document-term matrix into two matrices, a document-topics matric and a topic-terms matrix, which contain initial document/topic and topic/word distributions. Here's where LDA actually starts working. LDA aims to improve these matrices through a variety of sampling techniques. Basically, LDA iterates thorugh each word for each document to adjust the topic-word assignment (assuming all current topic/word assignments are correct) until a steady state is achieved.

In [1]:
#import necessary libraries

import pandas as pd 
import numpy as np
from numpy import random
from sqlalchemy import create_engine, MetaData, Table, select
import sklearn
import nltk
from nltk import word_tokenize
import spacy
import en_core_web_sm  # or any other model you downloaded via spacy download or pip
nlp = en_core_web_sm.load()
import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import pprint
import matplotlib.pyplot as plt
%matplotlib inline

#### Import Data
Let's pull in some text data to work with. 

We're going to use a subset of the 20 Newsgroups dataset, via Sci-Kit Learn. 

By using the Pandas package we can enforce a tabular structure on the data. This is especially helpful if you're used to working in SQL, SAS, or Excel.

In [3]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame([newsgroups.data, newsgroups.target.tolist()]).T
df.columns = ['text', 'target']
targets = pd.DataFrame( newsgroups.target_names)
targets.columns=['title']
news_data = pd.merge(df, targets, left_on='target', right_index=True)

In [13]:
# Create the connection to all dbs
cnxn = pyodbc.connect('DRIVER={ODBC Driver 11 for SQL Server};SERVER=ES11vADOSQL006;DATABASE=master;Trusted_Connection=yes;')

In [14]:
# Pull data from SQL Server
#Create an additional column with all text concatenated
sql3 = """
SELECT TIPImprovementPlan1, TIPActionPlan, TIPTimelinePlan, TIPSupportPlan, TIPAssessmentPlan,(TIPImprovementPlan1 + ' ' + TIPActionPlan + ' ' + TIPTimeLinePlan + ' ' + TIPSupportPlan + ' ' + TIPAssessmentPlan) as TIP_all_txt
FROM [APPR_EXT].[dbo].[APPRTIP]
where IsSubmitted = 'Y' and TIPEndedAppeal = 'N' and FiscalYear = 2017
"""
APPRTIP = pd.io.sql.read_sql(sql3, cnxn) #assign the SQL query to a pandas dataframe called APPRTIP
APPRTIP.head()

Unnamed: 0,TIPImprovementPlan1,TIPActionPlan,TIPTimelinePlan,TIPSupportPlan,TIPAssessmentPlan,TIP_all_txt
0,1E: Designing Coherent Instruction: Design les...,1E: Designing Coherent Instruction:\r\n\t• Des...,See above,1) You will schedule inter-visitations to obse...,"In our second and third meetings, we will revi...",1E: Designing Coherent Instruction: Design les...
1,Based on prior observations from the 2015-2016...,For 1e:\r\n\r\nA) Establish regular time(s) to...,See action steps/activities for specifics,1) Choose PD Cycle to support the steps in you...,You are responsible for gathering and providin...,Based on prior observations from the 2015-2016...
2,Based on prior observations and feedback from ...,For 1e:\r\n\r\n1. Establish regular time(s) to...,See action steps/activities for specific time ...,1) Choose to participate in a PD cycle to sup...,You are responsible for gathering and providin...,Based on prior observations and feedback from ...
3,"After reviewing last year's TIP, MOSL assessme...",1:Addressing the learning needs of small group...,Refer to the timelines included at the end of ...,1. Mr. Louie will participate in 1:1 coaching...,1. In our next 2 meetings we will review the ...,"After reviewing last year's TIP, MOSL assessme..."
4,1. Having learning activities aligned with the...,1. For improved alignment of learning activiti...,See above-ongoing,-Collaborate with your co-teachers to follow T...,1. Learning activities are aligned with the in...,1. Having learning activities aligned with the...


#### Cleaning the Data
Then we'll do some basic pre-processing to clean the data

See this page (https://github.com/LizMGagne/NLP/blob/master/NLP%20Pre-processing.ipynb) for a more thorough explaination of NLP pre-processing techniques

In [15]:
tokens = []
lemma = []
pos = []

for doc in nlp.pipe(APPRTIP['TIP_all_txt'].astype('unicode').values, batch_size=100,
                        n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.like_url and n.is_alpha])
        lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.like_url and n.is_alpha])
        pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.like_url and n.is_alpha])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

APPRTIP['tokens_all_txt'] = tokens
APPRTIP['lemmas_all_txt'] = lemma
APPRTIP['pos_all_txt'] = pos

APPRTIP.head()

Unnamed: 0,TIPImprovementPlan1,TIPActionPlan,TIPTimelinePlan,TIPSupportPlan,TIPAssessmentPlan,TIP_all_txt,tokens_all_txt,lemmas_all_txt,pos_all_txt
0,1E: Designing Coherent Instruction: Design les...,1E: Designing Coherent Instruction:\r\n\t• Des...,See above,1) You will schedule inter-visitations to obse...,"In our second and third meetings, we will revi...",1E: Designing Coherent Instruction: Design les...,"[Designing, Coherent, Instruction, Design, les...","[designing, coherent, instruction, design, les...","[PROPN, PROPN, PROPN, PROPN, VERB, ADV, VERB, ..."
1,Based on prior observations from the 2015-2016...,For 1e:\r\n\r\nA) Establish regular time(s) to...,See action steps/activities for specifics,1) Choose PD Cycle to support the steps in you...,You are responsible for gathering and providin...,Based on prior observations from the 2015-2016...,"[Based, prior, observations, school, year, add...","[base, prior, observation, school, year, addit...","[VERB, ADJ, NOUN, NOUN, NOUN, NOUN, VERB, VERB..."
2,Based on prior observations and feedback from ...,For 1e:\r\n\r\n1. Establish regular time(s) to...,See action steps/activities for specific time ...,1) Choose to participate in a PD cycle to sup...,You are responsible for gathering and providin...,Based on prior observations and feedback from ...,"[Based, prior, observations, feedback, previou...","[base, prior, observation, feedback, previous,...","[VERB, ADJ, NOUN, VERB, ADJ, NOUN, VERB, VERB,..."
3,"After reviewing last year's TIP, MOSL assessme...",1:Addressing the learning needs of small group...,Refer to the timelines included at the end of ...,1. Mr. Louie will participate in 1:1 coaching...,1. In our next 2 meetings we will review the ...,"After reviewing last year's TIP, MOSL assessme...","[After, reviewing, year, TIP, MOSL, assessment...","[after, review, year, tip, mosl, assessment, o...","[ADP, VERB, NOUN, PROPN, PROPN, NOUN, ADJ, NOU..."
4,1. Having learning activities aligned with the...,1. For improved alignment of learning activiti...,See above-ongoing,-Collaborate with your co-teachers to follow T...,1. Learning activities are aligned with the in...,1. Having learning activities aligned with the...,"[Having, learning, activities, aligned, instru...","[have, learn, activity, align, instructional, ...","[VERB, VERB, NOUN, VERB, ADJ, NOUN, VERB, ADJ,..."


#### Building the Corpus
Now, let's take only the lemmas to build the dictionary and doc-term matrix.

In [16]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index. 
dictionary = corpora.Dictionary(lemma)

# Converting corpus into Document-Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in lemma]

## Building the Topic Model
Now that we have the dictionary and doc-term matrix we can start building the LDA model. LDA requires the number of topics as an input. I've also specified chunksize (number of docs to be used in each training "chunk"), update_every (how often the model should be updated), and passes (the number of training passes).

In [17]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.LdaModel

# Running and Training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=25, random_state = 100, update_every=1, chunksize = 100, id2word = dictionary, passes=100, alpha='auto')

#### View the Topics
Using print_topics will print keywords for each topic and the relative weights of each word.

##### Interpreting Topics
In the case of our data Topic 0 is represented as (0, '0.044*"student" + 0.044*"lesson" + 0.044*"activity"')

This means the top words that contribute to the topic are "student", "lesson", and "activity" and the weights represent the importance of each word. LDA requires a good deal of interpretation - it will not label the topics with a word or phrase, so it's up to the analyst to determine how to label each topic.

In [18]:
#print(ldamodel.print_topics(num_topics=25, num_words=3))
ldamodel.print_topics(num_topics=25, num_words=3)

[(0, '0.151*"science" + 0.042*"tuesday" + 0.021*"expand"'),
 (1, '0.069*"student" + 0.051*"discussion" + 0.032*"use"'),
 (2, '0.149*"scholar" + 0.017*"appeal" + 0.013*"furthermore"'),
 (3, '0.113*"content" + 0.050*"knowledge" + 0.043*"concept"'),
 (4, '0.047*"activity" + 0.041*"student" + 0.038*"lesson"'),
 (5, '0.311*"teacher" + 0.106*"the" + 0.018*"sandi"'),
 (6, '0.027*"year" + 0.027*"meeting" + 0.024*"observation"'),
 (7, '0.110*"student" + 0.084*"behavior" + 0.057*"classroom"'),
 (8, '0.021*"pedagogical" + 0.020*"enhance" + 0.020*"approach"'),
 (9, '0.069*"staff" + 0.065*"reading" + 0.058*"group"'),
 (10, '0.098*"o" + 0.036*"submit" + 0.022*"goldstein"'),
 (11, '0.155*"domain" + 0.082*"by" + 0.071*"component"'),
 (12, '0.054*"rigorous" + 0.047*"rigor" + 0.042*"task"'),
 (13, '0.079*"student" + 0.030*"step" + 0.028*"action"'),
 (14, '0.036*"project" + 0.033*"art" + 0.020*"evaluator"'),
 (15, '0.203*"principal" + 0.127*"assistant" + 0.026*"ms"'),
 (16, '0.161*"b" + 0.129*"c" + 0.075

### Finding the Optimal Number of Topics

There's some disagreement among data scientists about what the best number of topics even means - is it coherence? comprehensiveness? something else? Personally, I err on the side of interpretability and meaningfulness - I don't want the same words repeated over and over throughout the topics. In practice this means building a handful of models with various k values and picking the one with the highest coherence score. The coherence score for our model is ~0.4 here - not ideal, but not terrible. It's a bit of a balancing act getting a "good" coehrence score, while maintaining an aceptable level of readability.

In [19]:
# Compute Perplexity
print('\nPerplexity: ', ldamodel.log_perplexity(doc_term_matrix))  #lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=lemma, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda) #higher is better


Perplexity:  -6.17109307576

Coherence Score:  0.41855547143


## Improving the LDA Model

The results of the LDA model are completely dependent on the data used (garbage in = garbage out). Since doc-term matrices are typically sparse, dimensionailty reduction may improve the results.

### Frequency Filter
Since terms which appear less often in the corpus are also less likely to appear in the results, the lowest frequency terms can be excluded. Some basic exploratory analysis of term frequencies is required to pinpoint an appropriate frequency threshold.

### Parts of Speech Filter
Earlier in this code some types of strings were filtered out (stop words, numbers, etc). Depending on the data being analyzed it may improve the model's accuracy to strip out further types of words. Whether these are additional filler words (i.e. "within", "may", etc) or some other words which occur in ways that render them meaningless.

### Document Pooling
There's research to support creating macro-documents for LDA training might increase the accuracy and/or usability of topics by enriching the content in each document (http://users.cecs.anu.edu.au/~ssanner/Papers/sigir13.pdf). However, the documents being used here are quite long (average ~700 words) and assumedy have sufficient co-occurance of terms within each document to be used for training without aggregation. If the documents being analyzed were shorter (like Tweets, texts, and the like) it may be worthwhile to aggregate at some level.

### Assigning Documents to Topics

In [20]:
def format_topics_sentences(ldamodel=ldamodel, corpus=lemma, texts=dictionary):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get dominant topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the dominant topic, percent Contribution and keywords for each doc
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(ldamodel=ldamodel, corpus=doc_term_matrix, texts=dictionary)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contribution', 'Keywords', 'Text']

# Print the first 5 rows
df_dominant_topic.head()

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contribution,Keywords,Text
0,0,13.0,0.2227,"student, step, action, learning, lesson, area,...",0
1,1,20.0,0.2331,"student, lesson, plan, assessment, work, datum...",1
2,2,13.0,0.279,"student, step, action, learning, lesson, area,...",2
3,3,20.0,0.2897,"student, lesson, plan, assessment, work, datum...",3
4,4,4.0,0.855,"activity, student, lesson, area, action, step,...",4


### Finding the Document That's Representitive of Each Topic

In [21]:
# Group top 5 sentences under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contribution", "Keywords", "Text"]

# Print top 5 rows
sent_topics_sorteddf_mallet.head()

Unnamed: 0,Topic_Num,Topic_Perc_Contribution,Keywords,Text
0,1.0,0.5216,"student, discussion, use, questioning, questio...",143
1,3.0,0.3238,"content, knowledge, concept, plan, lesson, ped...",292
2,4.0,0.9551,"activity, student, lesson, area, action, step,...",1752
3,5.0,0.5604,"teacher, the, sandi, development, this, profes...",1690
4,6.0,0.6951,"year, meeting, observation, school, profession...",129


### Get the Distribution of Topics

In [22]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Format
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Print top 5 rows
df_dominant_topics.head()

Unnamed: 0,Dominant_Topic,Topic_Keywords,Num_Documents,Perc_Documents
0,13.0,"student, step, action, learning, lesson, area,...",,
1,20.0,"student, lesson, plan, assessment, work, datum...",181.0,0.052
2,13.0,"student, step, action, learning, lesson, area,...",,
3,20.0,"student, lesson, plan, assessment, work, datum...",6.0,0.0017
4,4.0,"activity, student, lesson, area, action, step,...",675.0,0.1939
