# Usage

This document is intended as a fast way to get an idea of what LDA can produce. Actual research should be done using a full experimental process including the use of the "LDA Job manager" notebook.

To make and inspect a quick topic model:
1. Use a fully functional notebook viewer, such as VS Code (best) or Jupyter Notebook, and take advantage of features like collapsing sections and input cells. Other viewers, like JupyterLab or custom web views, can be configured to work, but you will need to set them up yourself.
1. Prepare a dataset with at least two columns: a unique document ID and the text you want to process, with a single textual response per row. LDA does not require preprocessed text to function, but the results are easier to interpret if you run the preprocessing notebook first.
1. Edit the data import section ([click here](#data)) with the path, column names, etc. for your dataset.
1. Run the notebook.
1. Look at the results in the model inspection section ([click here](#model-inspection)).
1. If you want to look at particular subsets of your data, see the examples section ([click here](#examples-of-how-to-look-at-subsets-of-your-set-of-documents)).
1. Keep in mind that LDA works best on a large textual dataset (many comments) in which the comments are long. We did not find it necessary to remove short comments, but the dataset does need to contain long ones.
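
As a rough illustration of the dataset layout described in the list above, the sketch below writes a tiny CSV with one response per row. The column names match the defaults set in the "Data Details" cell of this notebook; the rows themselves are invented placeholders, not real survey data.

```python
import csv
import io

# Column names follow the defaults in the "Data Details" cell of this notebook.
# The rows are invented placeholders for illustration only.
columns = ["unique_comment_ID", "answer", "Preprocessed answer"]
rows = [
    [1, "The lectures were clear and well paced.", "lecture clear well pace"],
    [2, "Canvas was hard to navigate.", "canvas hard navigate"],
]

buffer = io.StringIO()  # Stand-in for a real file on disk
writer = csv.writer(buffer)
writer.writerow(columns)  # Header row
writer.writerows(rows)    # One textual response per row

csv_text = buffer.getvalue()
print(csv_text)
```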


# Imports

## Libraries

In [None]:
import pandas as pd
import numpy as np
from scipy.stats.mstats import gmean

In [None]:
from ipywidgets import interact, Combobox
from IPython.display import display, display_html

In [None]:
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel

## Data

### Data Details

In [None]:
index_col = "unique_comment_ID" # Unique number/code for each document
text_col = "Preprocessed answer" # Text to be fed to LDA
nice_text_col = "answer" # Unprocessed text for viewing. Can be same as text_col

### Import Data

In [None]:
data_path = "/home/azureuser/cloudfiles/code/Data/pp-20210830_SES_and_SET.csv"

In [None]:
raw_df = pd.read_csv(data_path) # Import data
raw_df.set_index(index_col, inplace=True) # Set the document index as index
raw_df.dropna(subset=[text_col],inplace=True) # Remove all rows with missing text
raw_df[text_col] = raw_df[text_col].astype('string') # Make sure the text column is all strings

If your dataset is large, you may want to reduce the size of raw_df by selecting a subset of rows, which cuts computation time initially. For instance, we normally look only at comments from our newer SES survey, even though that makes up only 250k of our 1.5 million textual responses.

In [None]:
display(f"Number of comments: {len(raw_df)}")
raw_df.head(3)

## Gensim Components from Data

This section is helpful if you want to understand the steps involved in feeding textual data into a computational framework like Gensim.
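
The steps in this section (tokenize, build a dictionary, create a corpus) can be pictured in plain Python before handing them to Gensim. This is only a sketch of the idea, not Gensim's implementation: the `token2id` mapping and `doc2bow` helper here are hand-rolled stand-ins, the documents are invented, and the integer IDs Gensim assigns may differ.

```python
from collections import Counter

# Invented example documents (already preprocessed, whitespace-separated)
docs = ["canvas hard navigate", "lecture clear lecture helpful"]

# 1. Tokenize: split each document on whitespace
tokenized = [doc.split() for doc in docs]

# 2. Dictionary: give each distinct word an integer ID (hand-rolled stand-in)
token2id = {}
for tokens in tokenized:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# 3. Corpus: represent each document as sorted (word_id, count) pairs
def doc2bow(tokens):
    counts = Counter(token2id[token] for token in tokens)
    return sorted(counts.items())

bow_corpus = [doc2bow(tokens) for tokens in tokenized]
print(token2id)
print(bow_corpus)
```

Word order is discarded at the last step: each document keeps only which words it contains and how often.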

### Tokenize Documents

In [None]:
texts = raw_df[[text_col]].applymap(str.split)
texts.head(2)

### Generate Dictionary

In [None]:
dictionary = Dictionary(texts[text_col])
display(f"Number of Words: {len(dictionary)}")

In [None]:
words = [*dictionary.token2id]

### Create Corpus

In [None]:
corpus = texts.applymap(dictionary.doc2bow)
corpus.head(2)

# Other Definitions

In [None]:
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html() + ("\xa0" * 5) # Spaces
    display_html(html_str.replace('<table','<table style="display:inline"'),raw=True)

# Topic Model Setup

You should not have to edit anything in this section.

## Helper Functions

In [None]:
def convert_row_to_term_score(row):
    '''Converts one row of a term-topic matrix into term scores.
    Input should be a series of probabilities for a single term, one per topic (intent is that the term is the row index)'''
    normalizer = gmean(row) # Compute geometric mean of the word probabilities
    term_score_row = row.apply(lambda b: b*(np.log(b/normalizer))) #applying the transformation
    return term_score_row
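
To make the transformation concrete, here is a small worked example in plain Python that mirrors what `convert_row_to_term_score` does for one term. The probabilities are invented and chosen so that their geometric mean is 0.2; the helper name `term_scores` is hypothetical.

```python
import math

def term_scores(probabilities):
    """Return p * ln(p / geometric_mean) for each topic probability of one term."""
    log_mean = sum(math.log(p) for p in probabilities) / len(probabilities)
    geometric_mean = math.exp(log_mean)
    return [p * math.log(p / geometric_mean) for p in probabilities]

# Invented P(term | topic) values for one term across three topics
row = [0.8, 0.2, 0.05]
scores = term_scores(row)
for topic, (p, score) in enumerate(zip(row, scores)):
    print(f"topic {topic}: p={p:.2f}, term score={score:.4f}")
```

A term scores highest in the topic where its probability sits well above its geometric mean across topics. A term that is equally probable in every topic scores zero everywhere, which pushes down words that are common across the whole corpus.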

## LDA Class Definition

In [None]:
class QuickLDA(object):
    def __init__(self,doc_ids, num_topics = 7):
        '''Takes a list of doc ids and creates all the LDA components'''
        self.doc_ids = list(corpus.loc[doc_ids].index) # Making sure this is ordered correctly. Probably not necessary
        self.num_topics = num_topics
        self.sub_corpus = corpus.loc[doc_ids][text_col] # This is not a dataframe, just an iterable
        self.num_docs = len(self.sub_corpus)
        self.fit_lda()
        self.score_lda()
        self.make_term_matrices()
        self.make_doc_topic_matrix()

    def fit_lda(self):
        lda = LdaModel(
            id2word = dictionary,
            passes = int(np.ceil(50000/self.num_docs)), # Extra fitting for small corpora
            num_topics = self.num_topics,
            alpha = "auto"
        )
        lda.update(self.sub_corpus)
        self.lda = lda

    def score_lda(self):
        self.perplexity = 2**(-self.lda.log_perplexity(self.sub_corpus))
        c_model = CoherenceModel(
            model = self.lda,
            texts = texts.loc[self.doc_ids][text_col], #Again can't have dataframe
            dictionary = dictionary,
            coherence = "c_v"
        )
        self.cv_score = c_model.get_coherence()
        
    def make_term_matrices(self):
        self.term_topic_matrix = pd.DataFrame(self.lda.get_topics()).transpose()
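        self.term_topic_matrix.rename(
            index = {token_id: token for token, token_id in dictionary.token2id.items()}, # id2token is filled lazily, so build the id-to-word mapping from token2id
            inplace=True
        )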
        self.term_score_matrix = self.term_topic_matrix.apply(convert_row_to_term_score,axis=1)
        
    def make_doc_topic_matrix(self):
        document_topic_matrix = pd.DataFrame(
            [{doc_tuple[0]:doc_tuple[1] for doc_tuple in doc_tuple_list} for doc_tuple_list in self.lda[self.sub_corpus]])
        # Fill Missing Values
        document_topic_matrix.fillna(0,inplace = True)
        # Sort columns by topic number
        document_topic_matrix = document_topic_matrix.reindex(sorted(document_topic_matrix.columns), axis=1)
        document_topic_matrix.index = self.sub_corpus.index
        self.document_topic_matrix = document_topic_matrix
        self.topic_means = document_topic_matrix.mean().apply(lambda x: round(x, 3))

## LDA Visuals Definitions

In [None]:
def plot_term(lda, word = "class"):
    try:
        term_row = lda.term_topic_matrix.loc[[word]] # Raises KeyError if the word is not in the vocabulary
    except KeyError:
        print("Waiting for valid input")
        return
    display_html(f"<h4> Probability(term|topic) for \"{word}\"</h4>",raw=True)
    term_row.transpose().plot.bar(ylabel = "Conditional term probability",xlabel = "Topic") # One bar per topic

In [None]:
def get_top_responses(topic_name,number_responses,lda, doc_metadata = None, max_words = 1000):
    doc_ids = lda.document_topic_matrix.sort_values(by=topic_name,ascending = False)
    doc_ids = doc_ids.index.tolist()
    doc_ids = list(filter(
        lambda doc_id: len(texts.loc[doc_id][text_col]) < max_words, 
        doc_ids))
    doc_ids = doc_ids[:number_responses]
    # Print results
    for doc_id in doc_ids:
        if doc_metadata is not None: # Check if we want to display metadata with each comment
            display(doc_metadata.loc[[doc_id]].style.hide_index()) # hide_index was removed in pandas 2.0; on pandas >= 1.4 use .style.hide(axis="index")
        display_html(" • " + raw_df.loc[doc_id][nice_text_col] + "<br><br><br>", raw = True)

# Examples of how to look at subsets of your set of documents

Below is a set of examples showing how to select particular subsets of documents and fit an LDA model on each subset. If you have a dataframe you like, an easy way to get the list of document IDs is to use .index.tolist(). The examples here are separate, but you can combine them or bring in your own list of document IDs based on something else, such as sentiment analysis.

## Getting all doc_ids for a particular question

In this example I wanted to get all of the answers to "what specific change in clarity would help learning". I use the .isin method to check whether a column's value appears in a list that I supply, so you can list as many question IDs as you need.

In [None]:
# clarity_ids = raw_df[raw_df["question_ID"].isin(
#     ["X840307","Your Document Code Here"]
#     )].index.tolist()


In [None]:
# display_html("<h4>Sample Selected Texts:", raw=True)
# for row in raw_df.loc[clarity_ids][nice_text_col].head(3):
#     display(row)

## Getting all Document IDs for a certain list of words

This example looks at all responses containing particular words and does the full LDA exploration for that set of documents.

In [None]:
# @interact(word = Combobox(options = words,continuous_update = False))
# def show_words(word):
#     display_html("Type in here if you want to see what the kernel thinks are words", raw=True)

#### Each document will need to contain at least one word from this list


In [None]:
# req_words = ["canvas"]

The following code gets all responses whose preprocessed answer contains at least one word from the req_words list. For each document, the generator expression yields True or False for each required word, depending on whether it appears in the tokenized text, and "any" collapses those into a single True if at least one word matched. The result of apply, a Series of True/False values indexed by document ID, is used to select a subset of the larger data as usual, and then the index is extracted as a list.


In [None]:
# word_doc_ids = texts[texts[text_col].apply(
#     lambda tokenized_text: any(word in tokenized_text for word in req_words)
# )].index.tolist()
# display_html(f"<b>Number of doc ids: {len(word_doc_ids)}",raw=True)
# display_html("<h4>Sample Selected Texts:",raw= True)
# for row in raw_df.loc[word_doc_ids][nice_text_col].head(2):
#     display(row)
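
The word filter above can be illustrated on a toy example in plain Python. The documents and IDs below are invented; `req_words` plays the same role as in the notebook, and the helper `contains_required_word` is hypothetical.

```python
# Invented tokenized documents keyed by document ID
tokenized_docs = {
    101: ["canvas", "hard", "navigate"],
    102: ["lecture", "clear", "helpful"],
    103: ["quiz", "canvas", "late"],
}
req_words = ["canvas", "zoom"]

def contains_required_word(tokens):
    # The generator yields True/False per required word; any() stops at the first True
    return any(word in tokens for word in req_words)

word_doc_ids = [doc_id for doc_id, tokens in tokenized_docs.items()
                if contains_required_word(tokens)]
print(word_doc_ids)
```

Because the match is against whole tokens, a required word only counts if it appears as a complete word after preprocessing, not as part of a longer word.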

In [None]:
# word_lda = QuickLDA(doc_ids=word_doc_ids,num_topics=8)

# Model Inspection

After an initial run of the notebook, you only need to rerun the cells from here down to change your model and output.

In [None]:
doc_ids = raw_df[raw_df["survey"] == "SES"].index.tolist()

In [None]:
basic_lda = QuickLDA(doc_ids = doc_ids,num_topics= 7) # Fit a topic model on all of the supplied textual data

In [None]:
lda = basic_lda # Set the topic model to be inspected.

Check the topic means to make sure that the model actually worked. If the topic means are concentrated on a single topic, change the number of topics or select more documents.
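
One rough way to put a number on that check is to look at the share of total weight held by the largest topic. The sketch below is only an illustration: the topic means are invented, the helper `dominant_topic_share` is hypothetical, and the 0.5 cut-off is an arbitrary assumption rather than a recommendation from this notebook.

```python
def dominant_topic_share(topic_means):
    """Return (topic_index, share) for the topic with the largest mean weight."""
    total = sum(topic_means)
    top_index = max(range(len(topic_means)), key=lambda i: topic_means[i])
    return top_index, topic_means[top_index] / total

# Invented topic means for a 4-topic model
topic_means = [0.82, 0.07, 0.06, 0.05]
HYPOTHETICAL_THRESHOLD = 0.5  # Arbitrary cut-off for illustration only

index, share = dominant_topic_share(topic_means)
if share > HYPOTHETICAL_THRESHOLD:
    print(f"Topic {index} holds {share:.0%} of the weight; consider changing num_topics")
```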

In [None]:
display_html(f"<b> Coherence Score (c_v): </b> {lda.cv_score}",raw = True)
display_html(f"<b> Perplexity: </b> {lda.perplexity}",raw = True)
display(lda.topic_means)
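
The perplexity displayed above comes from the conversion in `score_lda`, which treats the model's `log_perplexity` value as a per-word bound in base 2. A minimal sketch of that arithmetic, with invented bounds and a hypothetical helper name:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) to perplexity."""
    return 2 ** (-per_word_bound)

# Invented bounds for illustration: a less negative bound gives lower perplexity
for bound in (-9.0, -8.0, -7.5):
    print(f"bound={bound}: perplexity={perplexity_from_bound(bound):.1f}")
```

Lower perplexity means the model fits the documents it was scored on more closely. Here the score is computed on the same documents used for fitting, so it is best used to compare models built from the same subset.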

### Explore the distribution of a particular term

In [None]:
@interact(word = Combobox(options = list(lda.term_score_matrix.index), continuous_update = False)) # continuous_update is a widget option, so it goes inside Combobox
def f(word):
    plot_term(lda,word)

### Raw display of top words for all topics

In [None]:
@interact(show = False,num_top_words = (5,30,1)) # Slider tuple is (min, max, step)
def relevant_words(show,num_top_words = 14):
    # Display top words per topic
    if show:
        for c in lda.term_score_matrix.columns:
            print(f'\n Topic {c} -- {lda.topic_means[c]} \n',
                lda.term_score_matrix[c]
                .sort_values(ascending=False) #Sort most relevant words by their term score in column 'c'
                .head(num_top_words) #Take the num_top_words most relevant words
                .index #The index is the word itself
                .tolist() #Feel free to replace with some nicer display function
                )

### Top Words per Topic

In [None]:
@interact(topic = lda.document_topic_matrix.columns, num = (5,100), cols = (1,10),include_term_score = True)
def top_words(topic,num = 30, cols = 4, include_term_score = True):
    sorted_term_score = lda.term_score_matrix.sort_values(by = topic, ascending = False)[[topic]] # Prepare terms sorted by score
    sorted_term_score.columns = ["Term Score"]
    display_html(f"<h4><u> Most Relevant words for Topic {topic} ({lda.topic_means[topic]}):", raw = True) # Heading
    if include_term_score:
        per_col = int(np.ceil(num/cols)) # Figure out how many words to put per column
        top_terms = sorted_term_score.head(num) # Keep only the requested number of words so the last column cannot overshoot
        display_side_by_side(*[top_terms.iloc[x: x + per_col] for x in range(0,num,per_col)]) # Display the columns. *[] unpacks the list of column chunks
    else:
        print(sorted_term_score.head(num).index.tolist()) # Print them out plainly if we want that for some reason.

### Top Comments by Topic

In [None]:
@interact(
    topic = lda.document_topic_matrix.columns, # Choose a topic from the doc-topic matrix
    number_responses = [1,5,10,20,100,1000], # Choose a number of responses
    max_words = [5,10,20,50,1000], # Max number of words in the responses
    include_topic_distributions = False # Choose whether you want to show the entry from the doc-topic matrix for each response
)
def top_resp(topic, number_responses = 5, include_topic_distributions = False, max_words = 1000):
    if include_topic_distributions:
        metadata = lda.document_topic_matrix # Set the metadata to display and populate it
    else: metadata = None
    display_html(f"<h2><u> Top Responses for Topic {topic} ({lda.topic_means[topic]}):", raw = True)
    return get_top_responses(topic_name = topic, number_responses = number_responses, doc_metadata = metadata, lda = lda, max_words = max_words)