## Max Wienandts

Problem Statement:
- Are BERT and DistilBERT appropriate for sentiment analysis? Is Latent Dirichlet Allocation (LDA) appropriate for topic modeling?

|SUMMARY|
|:------------------------------------------------|
|[**1. Intertopic Distance Map**](#Intertopic_Distance_Map)|
|[**2. Train model**](#Train_model)|
|[**3. Predict on test**](#Predict_on_test)|

This project has 4 Jupyter notebooks:
- 1 ETL and EDA.ipynb;
- 2 LSTM BERT DistilBERT.ipynb;
- 3 Topic modeling.ipynb; and
- 4 Production.ipynb.

They should be run in order. \
The first notebook explores the dataset and cleans it. \
The second notebook contains the models for sentiment analysis. \
The third one contains the topic modeling. \
Finally, the last notebook shows how to apply the sentiment and topic models to a dataset and to a custom review.

In [12]:
# Work around a display bug caused by pyLDAvis. This only restores missing icons in JupyterLab; it is not needed to run the code.
from IPython.display import HTML, display
css_str = '<style> \
.jp-Button path { fill: #616161;} \
text.terms { fill: #616161;} \
.jp-icon-warn0 path {fill: var(--jp-warn-color0);} \
.bp3-button-text path { fill: var(--jp-inverse-layout-color3);} \
.jp-icon-brand0 path { fill: var(--jp-brand-color0);} \
</style>'
display(HTML(css_str))

In [8]:
from pprint import pprint
import pickle

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk
import gensim
from gensim import  models
import gensim.corpora as corpora

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  

In [50]:
# Notebook variables
path_read_dataset = 'df_sentiment_analysis_topic_modeling.csv'

# Operating system. Use 1 for Linux, 0 otherwise.
OS_system = 0

# Path to save model
LDA_path_win = r"C:\Users\maxwi\Python\Harvard\Electives\1 CSCI S-89 Introduction to Deep Learning\Project\1 sentiment analysis\Glassdoor 2\models\LDA\LDA_model.pk"
LDA_path_linux = "/home/max/Python/Harvard/Electives/1 CSCI S-89 Introduction to Deep Learning/Project/1 sentiment analysis/Glassdoor 2/models/LDA/LDA_model.pk"

tfidf_path_win = r"C:\Users\maxwi\Python\Harvard\Electives\1 CSCI S-89 Introduction to Deep Learning\Project\1 sentiment analysis\Glassdoor 2\models\LDA\tfidf.pk"
tfidf_path_linux = "/home/max/Python/Harvard/Electives/1 CSCI S-89 Introduction to Deep Learning/Project/1 sentiment analysis/Glassdoor 2/models/LDA/tfidf.pk"

if OS_system == 1:
    LDA_path = LDA_path_linux
    tfidf_path = tfidf_path_linux
else:
    LDA_path = LDA_path_win
    tfidf_path = tfidf_path_win
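
The manual operating-system switch above could be avoided by building the paths with `pathlib`, which chooses the right separator on its own. A minimal sketch, assuming the models are stored in a `models/LDA` folder relative to the notebook (the `MODEL_DIR` name and the relative location are assumptions, not the project's actual layout):

```python
from pathlib import Path

# Assumed location: a "models/LDA" folder next to the notebook.
MODEL_DIR = Path("models") / "LDA"

# Same file names as in the cell above.
LDA_path = MODEL_DIR / "LDA_model.pk"
tfidf_path = MODEL_DIR / "tfidf.pk"

print(LDA_path.name, tfidf_path.name)
```

`Path` objects can be passed directly to `open`, so the later save and load cells would not need to change.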

In [5]:
df = pd.read_csv(path_read_dataset)
df.head(3)

Unnamed: 0,firm,review,target,n_words,review_lower,review_without_stopwords
0,179,I can not think of any,0,6,i can not think of any,
1,174,little advancement based on owners pretty cheap,0,9,little advancement based on owners pretty cheap,advancement based owners cheap
2,47,Low career advancement opportunity and politics,0,6,low career advancement opportunity and politics,career advancement opportunity politics


<div id='Intertopic_Distance_Map' />
    
## Intertopic Distance Map

First, the libraries gensim and pyLDAvis are used to help decide how many topics the Latent Dirichlet Allocation (LDA) model should have.

In [6]:
# Split into train, validation and test. The validation set is not used here, but the split is kept to reproduce the datasets used for the sentiment analysis.
df_train_2, df_test = train_test_split(df, test_size = 0.20, random_state = 1)
df_train, df_val = train_test_split(df_train_2, test_size = 0.20, random_state = 1)

# Drop rows with missing values (e.g. an empty review_without_stopwords). Reassigning a copy avoids pandas' SettingWithCopyWarning.
df_train = df_train.dropna().copy()
df_test = df_test.dropna().copy()

# Tokenize words
df_train['review_token'] = df_train['review_without_stopwords'].apply(lambda x: x.split())
df_test['review_token'] = df_test['review_without_stopwords'].apply(lambda x: x.split())

# Stem words. The stems will be used for the topic modeling.
def stem_words(vec_words):
    porter_stemmer = nltk.stem.PorterStemmer()
    vec_stemming_word = []
    for word in vec_words:
        vec_stemming_word.append(porter_stemmer.stem(word))
    return vec_stemming_word

df_train['review_token'] = df_train.apply(lambda row: stem_words(row['review_token']), axis = 1) 
df_test['review_token'] = df_test.apply(lambda row: stem_words(row['review_token']), axis = 1) 

# Index to word
id2word = corpora.Dictionary(df_train.review_token)
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df_train.review_token]

df_train.head(3)

Unnamed: 0,firm,review,target,n_words,review_lower,review_without_stopwords,review_token
1472032,147,"Long working hours, somewhat repetitive work",0,6,long working hours somewhat repetitive work,working hours somewhat repetitive work,"[work, hour, somewhat, repetit, work]"
933606,140,If you want to make cash - but lose you moral...,1,23,if you want to make cash but lose you moral...,cash lose morals friends along place,"[cash, lose, moral, friend, along, place]"
137200,212,"Sociable, caring, supportive, teamwork, bonuses",1,5,sociable caring supportive teamwork bonuses,sociable caring supportive teamwork bonuses,"[sociabl, care, support, teamwork, bonus]"
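
The `id2word` and `doc2bow` steps above turn each token list into (token id, count) pairs. The following pure-Python sketch illustrates the idea on the three stemmed reviews shown in the table; it is not gensim's implementation, and the ids are assigned in order of first appearance, which may differ from the ids gensim assigns. The helper names `build_vocab` and `to_bow` are made up for this illustration.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id (order of first appearance)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# The three stemmed reviews displayed in the table above.
docs = [
    ["work", "hour", "somewhat", "repetit", "work"],
    ["cash", "lose", "moral", "friend", "along", "place"],
    ["sociabl", "care", "support", "teamwork", "bonus"],
]

token2id = build_vocab(docs)
bow = [to_bow(doc, token2id) for doc in docs]
print(bow[0])  # "work" appears twice, the other tokens once
```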


In [4]:
# Build the LDA model with 6 topics. Glassdoor has the topics: Culture and Values, Diversity and Inclusion, Work/Life Balance, Senior Management, Compensation and Benefits, and Career Opportunities.
lda_model = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                           id2word = id2word,
                                           num_topics = 6, 
                                           random_state = 1,
                                           update_every = 0,
                                           passes = 10,
                                           per_word_topics = False)

# Print the keywords of the 6 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

In [22]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis
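
The intertopic distance map places topics according to how different their word distributions are; by default pyLDAvis measures this with the Jensen-Shannon divergence and then projects the distances to two dimensions. The sketch below computes that divergence by hand for made-up distributions (the three `topic_*` vectors are hypothetical, not taken from the trained model), to show why topics that share most of their probability mass end up overlapping on the map.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a 4-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]  # nearly the same words as topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]  # dominated by different words

near = js_divergence(topic_a, topic_b)
far = js_divergence(topic_a, topic_c)
print(near < far)
```

Topics whose pairwise divergence is small, like `topic_a` and `topic_b` here, are drawn close together, which is the kind of overlap used in this notebook as a signal to reduce the number of topics.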

In [4]:
# Groups 4 and 5 overlap too much. Let's reduce the number of topics to 5.
# Build the LDA model with 5 topics. Glassdoor has the topics: Culture and Values, Diversity and Inclusion, Work/Life Balance, Senior Management, Compensation and Benefits, and Career Opportunities.
lda_model_2 = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                           id2word = id2word,
                                           num_topics = 5, 
                                           random_state = 1,
                                           update_every = 0,
                                           passes = 10,
                                           per_word_topics = False)

# Print the keywords of the 5 topics
pprint(lda_model_2.print_topics())
doc_lda_2 = lda_model_2[corpus]

[(0,
  '0.025*"work" + 0.022*"manag" + 0.017*"benefit" + 0.016*"hour" + 0.014*"pay" '
  '+ 0.013*"cultur" + 0.012*"peopl" + 0.011*"opportun" + 0.011*"team" + '
  '0.011*"career"'),
 (1,
  '0.090*"work" + 0.034*"peopl" + 0.024*"life" + 0.021*"balanc" + 0.019*"time" '
  '+ 0.016*"hour" + 0.014*"job" + 0.012*"place" + 0.011*"manag" + '
  '0.010*"flexibl"'),
 (2,
  '0.031*"work" + 0.023*"employe" + 0.023*"manag" + 0.020*"peopl" + '
  '0.016*"benefit" + 0.016*"salari" + 0.012*"learn" + 0.012*"compani" + '
  '0.011*"staff" + 0.010*"friendli"'),
 (3,
  '0.068*"work" + 0.034*"compani" + 0.020*"manag" + 0.015*"pay" + '
  '0.015*"opportun" + 0.013*"hour" + 0.012*"get" + 0.012*"environ" + '
  '0.011*"flexibl" + 0.010*"time"'),
 (4,
  '0.029*"manag" + 0.028*"work" + 0.016*"team" + 0.011*"pay" + 0.011*"job" + '
  '0.010*"get" + 0.009*"peopl" + 0.009*"compani" + 0.008*"project" + '
  '0.008*"environ"')]


In [7]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis_2 = gensimvis.prepare(lda_model_2, corpus, id2word)
vis_2

In [8]:
# Groups 4 and 5 still overlap too much. Let's reduce the number of topics to 4.
# Build the LDA model with 4 topics. Glassdoor has the topics: Culture and Values, Diversity and Inclusion, Work/Life Balance, Senior Management, Compensation and Benefits, and Career Opportunities.
lda_model_3 = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                           id2word = id2word,
                                           num_topics = 4, 
                                           random_state = 1,
                                           update_every = 0,
                                           passes = 10,
                                           per_word_topics = False)

# Print the keywords of the 4 topics
pprint(lda_model_3.print_topics())
doc_lda_3 = lda_model_3[corpus]

[(0,
  '0.024*"manag" + 0.023*"work" + 0.015*"benefit" + 0.014*"hour" + 0.014*"pay" '
  '+ 0.013*"team" + 0.011*"cultur" + 0.011*"peopl" + 0.010*"opportun" + '
  '0.010*"train"'),
 (1,
  '0.081*"work" + 0.032*"peopl" + 0.022*"life" + 0.018*"balanc" + 0.018*"time" '
  '+ 0.016*"job" + 0.014*"hour" + 0.012*"manag" + 0.011*"place" + '
  '0.009*"opportun"'),
 (2,
  '0.029*"work" + 0.025*"manag" + 0.020*"employe" + 0.018*"peopl" + '
  '0.015*"salari" + 0.014*"benefit" + 0.011*"compani" + 0.011*"learn" + '
  '0.011*"staff" + 0.010*"team"'),
 (3,
  '0.064*"work" + 0.032*"compani" + 0.021*"manag" + 0.016*"pay" + '
  '0.014*"opportun" + 0.012*"hour" + 0.012*"get" + 0.012*"environ" + '
  '0.010*"flexibl" + 0.010*"time"')]


In [12]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis_3 = gensimvis.prepare(lda_model_3, corpus, id2word)
vis_3

<div id='Train_model' />

## Train model

Now that we have settled on the number of topics (4), let's switch to the LDA model from sklearn. It is easier to train and to predict topics for new documents with sklearn.

In [7]:
# Preprocessing
# Join the stemmed tokens back into strings so they can be vectorized with TF-IDF
df_train['review_clean'] = df_train['review_token'].apply(' '.join)
df_test['review_clean'] = df_test['review_token'].apply(' '.join)

# Initialize regex tokenizer to use TF-IDF 
tokenizer = RegexpTokenizer(r'\w+')
# Vectorize document using TF-IDF
tfidf = TfidfVectorizer(ngram_range = (1,1), tokenizer = tokenizer.tokenize)
# Fit and Transform the documents
train_data = tfidf.fit_transform(df_train['review_clean'])
# Transform for test
test_data = tfidf.transform(df_test['review_clean'])
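
To make the TF-IDF step less of a black box, the sketch below recomputes the weights by hand for a few stemmed reviews from this dataset. It assumes sklearn's default settings for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the function name `tfidf_rows` is made up for this illustration and the code is not sklearn's implementation.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Guard against empty documents, which have zero norm.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = [
    ["work", "hour", "time"],
    ["work", "hour", "work", "life", "balanc"],
    ["wfh", "work", "life", "balanc", "onsit"],
]

rows = tfidf_rows(docs)
# "work" occurs in every review, so within the first review it gets
# a lower weight than the rarer term "time".
print(rows[0]["work"] < rows[0]["time"])
```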



In [9]:
# Create LDA object. No random_state is set, so the topic order can change between runs.
model = LatentDirichletAllocation(n_components = 4)
# Fit the LDA model and transform the training data
lda_predict_train = model.fit_transform(train_data)

# Get Components 
lda_components = model.components_
# Print the topics with their terms
terms = tfidf.get_feature_names_out()
for index, component in enumerate(lda_components):
    zipped = zip(terms, component)
    top_terms_key=sorted(zipped, key = lambda t: t[1], reverse=True)[:5]
    top_terms_list=list(dict(top_terms_key).keys())
    print("Topic "+str(index)+": ",top_terms_list)

Topic 0:  ['hour', 'pay', 'salari', 'work', 'benefit']
Topic 1:  ['none', 'busi', 'custom', 'staff', 'get']
Topic 2:  ['manag', 'compani', 'chang', 'work', 'employe']
Topic 3:  ['work', 'life', 'balanc', 'peopl', 'environ']


Some terms do not help to discriminate between topics, e.g., work, get, compani (company), and custom. \
Ideally, we would add the corresponding words to our vector new_stopwords in "1 ETL and EDA.ipynb", remove them, and run the algorithm again. \
However, we will not do this here because of time constraints.
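
The stop-word extension described above could be sketched as follows. This is only an illustration: the set `extra_stopwords` and the helper `drop_stopwords` are hypothetical names, and the filter is applied to already stemmed tokens here, whereas the notebook "1 ETL and EDA.ipynb" would remove the unstemmed words.

```python
# Hypothetical extra stop words, written as stems because the tokens are stemmed.
extra_stopwords = {"work", "get", "compani", "custom"}

def drop_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stopwords]

tokens = ["work", "hour", "work", "life", "balanc"]
filtered = drop_stopwords(tokens, extra_stopwords)
print(filtered)
```

After such a filter, the TF-IDF vectorizer and the LDA model would have to be fitted again, since the vocabulary changes.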

We can label the topics as:
- 0: Compensation and Benefits;
- 1: Staff;
- 2: Senior Management; and
- 3: Work/Life Balance.

In [51]:
# Save the model and the TF-IDF vectorizer; the with blocks close the files
with open(LDA_path, 'wb') as f:
    pickle.dump(model, f)
with open(tfidf_path, 'wb') as f:
    pickle.dump(tfidf, f)

<div id='Predict_on_test' />

## Predict on test

In [18]:
# Load model
with open(LDA_path, 'rb') as f:
    lda_best_model = pickle.load(f)

In [20]:
# Apply the model to the test set
lda_predict_test = lda_best_model.transform(test_data)

In [49]:
# Select the most relevant topic for each review
topic_names = {
    0: 'Compensation and Benefits',
    1: 'Staff',
    2: 'Senior Management',
    3: 'Work/Life Balance',
}
lda_predict_first_topic_test = [topic_names[int(np.argmax(e))] for e in lda_predict_test]

# Add the topics to the dataset
df_test['first_LDA_topics'] = lda_predict_first_topic_test

df_test.head()

Unnamed: 0,firm,review,target,n_words,review_lower,review_without_stopwords,review_token,review_clean,first_LDA_topics
761696,74,"dynamic corporate culture, empowering, career ...",1,12,dynamic corporate culture empowering career ...,dynamic corporate culture empowering career ad...,"[dynam, corpor, cultur, empow, career, advanc,...",dynam corpor cultur empow career advanc opport...,Work/Life Balance
960448,114,Long working hours at times,0,5,long working hours at times,working hours times,"[work, hour, time]",work hour time,Compensation and Benefits
214882,286,"Long working hour, no work life balance",0,7,long working hour no work life balance,working hour work life balance,"[work, hour, work, life, balanc]",work hour work life balanc,Work/Life Balance
1186606,179,"WFH, Work Life balance, Onsite",1,5,wfh work life balance onsite,wfh work life balance onsite,"[wfh, work, life, balanc, onsit]",wfh work life balanc onsit,Work/Life Balance
1408488,148,"Safe, process driven, good perspectives",1,5,safe process driven good perspectives,safe process driven perspectives,"[safe, process, driven, perspect]",safe process driven perspect,Senior Management
