# Capstone: Employee Review Monitoring

---

#### 04: <b>Topic Modeling - LDA</b>

## Library and data import

In [1]:
# Load libraries
import numpy as np
import pandas as pd
import regex as re
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.decomposition import LatentDirichletAllocation


In [2]:
# Load data
df_lda = pd.read_csv("../data/dataset.csv")

In [3]:
print(df_lda.shape)
display(df_lda.head())

(30336, 20)


Unnamed: 0,ID,Place,location,date,status,job_title,summary,positives,negatives,advice_to_mgmt,work_life_balance,culture_and_values,career_opportunities,compensation_and_benefits,senior_management,helpful_count,overall,positivesLength,negativesLength,year
0,1,startup_1,,"Dec 11, 2018",Current Employee,Anonymous Employee,Best Company to work for,People are smart and friendly,Bureaucracy is slowing things down,,4.0,5.0,5.0,4.0,5.0,0,5.0,5,5,2018
1,2,startup_1,"Mountain View, CA","Jun 21, 2013",Former Employee,Program Manager,"Moving at the speed of light, burn out is inev...","1) Food, food, food. 15+ cafes on main campus ...",1) Work/life balance. What balance? All those ...,1) Don't dismiss emotional intelligence and ad...,2.0,3.0,3.0,5.0,3.0,2094,5.0,160,401,2013
2,3,startup_1,"New York, NY","May 10, 2014",Current Employee,Software Engineer III,Great balance between big-company security and...,"* If you're a software engineer, you're among ...","* It *is* becoming larger, and with it comes g...",Keep the focus on the user. Everything else wi...,5.0,4.0,5.0,5.0,4.0,949,5.0,630,176,2014
3,4,startup_1,"Mountain View, CA","Feb 8, 2015",Current Employee,Anonymous Employee,The best place I've worked and also the most d...,You can't find a more well-regarded company th...,I live in SF so the commute can take between 1...,Keep on NOT micromanaging - that is a huge ben...,2.0,5.0,5.0,4.0,5.0,498,4.0,295,503,2015
4,10,startup_1,,"Dec 9, 2018",Current Employee,Anonymous Employee,Execellent for engineers,Impact driven. Best tech in the world.,Size matters. Engineers are a bit disconnected...,,5.0,5.0,5.0,5.0,5.0,0,4.0,7,9,2018


<u> LDA Topic Modeling </u>

Now let’s apply the LDA model to estimate the topic distribution of each document and the highest-probability words in each topic. Here we look specifically at the negative reviews to find out which aspects the organization should focus on improving.


In [6]:
#Create a function to build the optimal LDA model
def optimal_lda_model(df_review, review_colname):
    '''
    INPUTS:
        df_review - dataframe that contains the reviews
        review_colname - name of column that contains reviews

    OUTPUTS:
        best_lda_model - best Latent Dirichlet Allocation (LDA) model found by the search
        dtm_tfidf - document-term matrix in the tfidf format
        tvec - fitted TfidfVectorizer (holds the vocabulary)
    '''
    docs_raw = df_review[review_colname].tolist()

    #************   Step 1: Convert to document-term matrix   ************#

    #Transform text to vector form using the vectorizer object 
    tvec = TfidfVectorizer(strip_accents = 'unicode',
                           stop_words = 'english',
                           lowercase = True,
                           token_pattern = r'\b[a-zA-Z]{3,}\b', # keep tokens of 3+ letters to avoid some meaningless words
                           max_df = 0.9,                        # discard words that appear in > 90% of the reviews
                           min_df = 10)                         # discard words that appear in < 10 reviews    

    #convert to document-term matrix
    dtm_tfidf = tvec.fit_transform(docs_raw)  

    print("The shape of the tfidf is {}, meaning that there are {} {} and {} tokens made through the filtering process.".\
              format(dtm_tfidf.shape,dtm_tfidf.shape[0], review_colname, dtm_tfidf.shape[1]))

    
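    #*******   Step 2: Randomized search & parameter tuning to find the optimal LDA model   *******#

    # Define search parameters (3 x 3 x 2 = 18 combinations)
    search_params = {'n_components': [5, 10, 15],
                     'learning_decay': [.5, .7, .9],
                     'batch_size': [64, 128]}

    # Initiate the Model
    # Note: learning_decay and batch_size only take effect with learning_method='online';
    # with the default learning_method='batch' they do not change the fit
    lda = LatentDirichletAllocation()

    # Initiate the randomized search: it samples n_iter = 10 of the 18 combinations
    model = RandomizedSearchCV(lda,
                               param_distributions=search_params,
                               n_iter = 10,
                               cv=5,
                               n_jobs=-1)

    # Fit the randomized search
    model.fit(dtm_tfidf)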


    #*****  Step 3: Output the optimal lda model and its parameters  *****#

    # Best Model
    best_lda_model = model.best_estimator_

    # Model Parameters
    print("Best Model's Params: ", model.best_params_)

    # Log-likelihood score (mean over the held-out CV folds): higher is better
    print("Model Log Likelihood Score: ", model.best_score_)

    # Perplexity (computed here on the full matrix): lower is better.
    # Perplexity = exp(-1. * log-likelihood per word)
    print("Model Perplexity: ", best_lda_model.perplexity(dtm_tfidf))


#     #***********   Step 4: Compare LDA Model Performance Scores   ***********#

#     #Get Log Likelyhoods from Grid Search Output
#     gscore=model.fit(dtm_tfidf).cv_results_
#     n_topics = [5, 10,15]

#     log_likelyhoods_5 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay']==0.5]
#     log_likelyhoods_7 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay']==0.7]
#     log_likelyhoods_9 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay']==0.9]

#     # Show graph
#     plt.figure(figsize=(12, 8))
#     plt.plot(n_topics, log_likelyhoods_5, label='0.5')
#     plt.plot(n_topics, log_likelyhoods_7, label='0.7')
#     plt.plot(n_topics, log_likelyhoods_9, label='0.9')
#     plt.title("Choosing Optimal LDA Model")
#     plt.xlabel("Num Topics")
#     plt.ylabel("Log Likelyhood Scores")
#     plt.legend(title='Learning decay', loc='best')
#     plt.show()
    
    return best_lda_model, dtm_tfidf, tvec

In [7]:
best_lda_model, dtm_tfidf, tvec = optimal_lda_model(df_lda, 'negatives')

The shape of the tfidf is (30336, 4034), meaning that there are 30336 negatives and 4034 tokens made through the filtering process.
Best Model's Params:  {'n_components': 5, 'learning_decay': 0.7, 'batch_size': 64}
Model Log Likelihood Score:  -159529.82953911513
Model Perplexity:  3201.6362219109474
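The relation between the two scores can be checked on a small example. The sketch below uses a toy corpus invented for illustration and assumes scikit-learn's `score` and `perplexity` methods, where the "per word" normalization divides by the sum of all entries of the matrix. Note that the log-likelihood printed above is the mean score over held-out cross-validation folds, while the perplexity is computed on the full matrix, so the two printed numbers cannot be converted into each other directly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy reviews, invented for illustration only
toy_docs = [
    "long hours and long shifts with short breaks",
    "management politics and bad managers",
    "long shifts, few breaks, long working days",
    "politics in management slow down the company",
]

X = TfidfVectorizer().fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

log_likelihood = lda.score(X)           # approximate log-likelihood (variational bound)
per_word = log_likelihood / X.sum()     # normalize by the total weight in the matrix
manual_perplexity = np.exp(-per_word)   # perplexity = exp(-log-likelihood per word)

print(manual_perplexity, lda.perplexity(X))
```

Both printed values should agree when they are computed on the same matrix.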


In [8]:

#Create a function to inspect the topics we created 
def display_topics(model, feature_names, n_top_words):
    '''
    INPUTS:
        model - the fitted topic model
        feature_names - the word that each column in the matrix represents
        n_top_words - number of top words to display
    OUTPUTS:
        a dataframe that contains the top words of each topic and their weights
    '''
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx+1)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-n_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx+1)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-n_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)


display_topics(best_lda_model, tvec.get_feature_names_out(), n_top_words = 15) 

Unnamed: 0,Topic 1 words,Topic 1 weights,Topic 2 words,Topic 2 weights,Topic 3 words,Topic 3 weights,Topic 4 words,Topic 4 weights,Topic 5 words,Topic 5 weights
0,hours,360.1,cons,523.4,work,297.1,work,652.9,company,402.8
1,long,301.1,think,324.8,people,271.0,life,512.4,big,259.3
2,work,228.4,management,152.6,management,250.4,balance,500.1,people,203.9
3,time,166.7,bad,145.1,time,220.6,hours,181.5,slow,169.9
4,shifts,135.5,company,144.5,managers,203.9,pay,178.9,large,166.7
5,day,112.0,good,143.5,don,200.6,hard,176.9,career,161.1
6,breaks,105.5,say,134.8,job,200.3,high,164.2,growth,154.8
7,short,93.8,really,125.6,employees,190.3,salary,160.2,hard,154.2
8,working,93.2,great,110.5,like,184.1,environment,157.1,politics,137.5
9,days,92.4,politics,110.3,just,142.1,low,155.7,lot,135.7
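The top words in the table come from the slice `argsort()[:-n_top_words - 1:-1]`, which can be hard to read at first. The small sketch below shows what it does; the five-word vocabulary and the weights are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical weights for one topic over a five-word vocabulary
feature_names = np.array(["hours", "pay", "long", "team", "shifts"])
topic = np.array([360.1, 12.0, 301.1, 3.5, 135.5])

n_top_words = 3

# argsort() sorts indices by ascending weight; the slice walks backwards
# from the end and keeps the n_top_words largest, in descending order
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = feature_names[top_idx].tolist()

print(top_words)  # ['hours', 'long', 'shifts']
```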


In [9]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [11]:
# Topic Modelling Visualization for the Negative Reviews
pyLDAvis.sklearn.prepare(best_lda_model, dtm_tfidf, tvec)



pyLDAvis is a useful tool for interpreting individual topics and the relationships between them. A good topic model tends to produce fairly large, non-overlapping bubbles scattered across the chart rather than clustered in one quadrant.

On the left-hand side of the visualization, each topic is represented by a bubble. The larger the bubble, the more prevalent the topic. The index inside each bubble gives its rank by area, with 1 being the most prevalent topic and 5 the least. The distance between two bubbles represents topic similarity. However, this is only an approximation of the original topic similarity matrix, because a two-dimensional scatter plot cannot fully represent the spatial distribution of all 5 topics.
The right-hand side shows the 30 most relevant terms for the topic selected on the left. The blue bar represents the overall term frequency, and the red bar the estimated term frequency within the selected topic. So if a bar shows both red and blue, the term also appears in other topics. Hover over a term to see which other topics include it.
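To give a feel for what the bubbles encode, here is a rough sketch of the two quantities described above: topic prevalence (bubble size) and pairwise topic distance (bubble placement). It assumes Jensen-Shannon distance as the similarity measure and uses small hand-made matrices. All values and names are hypothetical, and this illustrates the idea only. It is not pyLDAvis's actual code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical inputs: 3 topics over a 4-word vocabulary, 5 documents
topic_word = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.10, 0.40, 0.40],
    [0.60, 0.30, 0.05, 0.05],
])
doc_topic = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.2, 0.2],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
])
doc_lengths = np.array([10, 20, 5, 15, 10])

# Bubble size: share of all tokens attributed to each topic
topic_mass = doc_topic.T @ doc_lengths
prevalence = topic_mass / topic_mass.sum()

# Bubble distance: pairwise Jensen-Shannon distance between topic-word distributions
n_topics = topic_word.shape[0]
dist = np.zeros((n_topics, n_topics))
for i in range(n_topics):
    for j in range(i + 1, n_topics):
        dist[i, j] = dist[j, i] = jensenshannon(topic_word[i], topic_word[j])

print(prevalence.round(3))
print(dist.round(3))
```

In this toy example, topics 1 and 3 have similar word distributions, so their distance is much smaller than the distance between topics 1 and 2. A two-dimensional projection of `dist` would place their bubbles close together.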

In [12]:
import pickle

# The with-statement closes the file automatically, even if dump() raises
with open("lda_model.pkl", mode = "wb") as pickle_out:
    pickle.dump(best_lda_model, pickle_out)
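When the model is loaded later, new reviews have to be transformed with the same fitted vectorizer, so it can be convenient to save both objects together. The sketch below shows one possible round trip on a toy corpus. The file name `lda_bundle.pkl`, the dictionary layout and the toy reviews are assumptions for illustration and are not part of this notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy reviews, invented for illustration only
toy_docs = [
    "long hours and long shifts with short breaks",
    "management politics and bad managers",
    "long shifts, few breaks, long working days",
    "politics in management slow down the company",
]

tvec_toy = TfidfVectorizer()
X = tvec_toy.fit_transform(toy_docs)
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Save the model and the vectorizer together (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "lda_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"model": lda_toy, "vectorizer": tvec_toy}, f)

# Load the bundle back and score a new review
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_doc = ["long hours and bad management"]
new_topics = bundle["model"].transform(bundle["vectorizer"].transform(new_doc))
print(new_topics.round(2))
```

Pickle files should only be loaded from trusted sources, and they are generally best reloaded with the same scikit-learn version that produced them.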

