This work extracts comments from Glassdoor reviews and performs polarity analysis and topic modeling on them. The Glassdoor review dataset contains fields such as date, pros and cons. The change in intensity of the comments is calculated for every quarter of the year, and the topics of these fields are also identified.
The solution is split into 4 tasks:
(1) Data cleaning
(2) Sort the dataset quarterly
(3) Polarity Analysis
(4) Topic Modeling
The dataset contains bytes that cannot be decoded as unicode text. The data is cleaned by reading each text field as unicode and ignoring the bytes that fail to decode (the `unicode` built-in used here is Python 2 only):
text1 = unicode(text, errors='ignore')
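The `unicode` built-in does not exist in Python 3. A minimal sketch of an equivalent cleaning step follows. It assumes that dropping every non-ASCII byte is acceptable, since Python 2's `unicode(text, errors='ignore')` decodes with the default ASCII codec. The helper name `clean_text` and the sample string are hypothetical.

```python
def clean_text(raw):
    """Return the text with every non-ASCII byte or character dropped."""
    if isinstance(raw, bytes):
        # Raw bytes: decode as ASCII and skip anything that fails to decode.
        return raw.decode("ascii", errors="ignore")
    # Already a str: round-trip through ASCII to drop non-ASCII characters.
    return raw.encode("ascii", errors="ignore").decode("ascii")

# Hypothetical sample, not taken from the dataset.
sample = b"Great caf\xc3\xa9 culture"
cleaned = clean_text(sample)
print(cleaned)
```

Note that this removes accented letters entirely rather than transliterating them, which may or may not be desirable for the later sentiment and topic steps.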
The dataset is sorted on the date field. For each sorted date, the quarter to which the date belongs is identified, and this value is added as a new column in the dataset.
list.append(pd.Timestamp(dt.date(year,month,date)).quarter)
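The quarter lookup can also be sketched with the standard library alone. The helper `quarter_of` and the sample dates below are hypothetical; the arithmetic gives the same 1 to 4 value as the `quarter` attribute of a pandas `Timestamp`.

```python
import datetime as dt

def quarter_of(d):
    """Return the calendar quarter (1-4) that a date falls in."""
    return (d.month - 1) // 3 + 1

# Hypothetical review dates, not taken from the dataset.
review_dates = [dt.date(2016, 11, 3), dt.date(2016, 2, 14), dt.date(2017, 7, 1)]

# Sort by date, then label each date with its year and quarter.
labelled = [(d, d.year, quarter_of(d)) for d in sorted(review_dates)]
for d, year, quarter in labelled:
    print(d.isoformat(), "%d-Q%d" % (year, quarter))
```

Keeping the year next to the quarter avoids mixing, for example, Q1 of 2016 with Q1 of 2017 when the reviews span several years.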
On the quarter-wise sorted data, the polarity of the pros and cons is identified for each quarter. From this output, the change in intensity of the comments between quarters can be calculated.
scores = sid.polarity_scores(sentence)
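One possible way to turn per-comment scores into a change in intensity is sketched below. It assumes that intensity means the mean score per quarter and that change means the difference between consecutive quarters; the source does not state the exact formula. The scorer is passed in as a function, and `toy_score` is a stand-in so the sketch runs without any sentiment library. All names and sample rows are hypothetical.

```python
from collections import defaultdict

def mean_polarity_by_quarter(rows, score):
    """Average a polarity score per quarter.

    rows  : iterable of (quarter_label, text) pairs
    score : callable mapping a text to a number
    """
    totals = defaultdict(lambda: [0.0, 0])  # quarter -> [sum of scores, count]
    for quarter, text in rows:
        totals[quarter][0] += score(text)
        totals[quarter][1] += 1
    return {q: s / n for q, (s, n) in totals.items()}

def quarterly_change(means):
    """Difference in mean polarity between consecutive quarters, sorted by label."""
    quarters = sorted(means)
    return {cur: means[cur] - means[prev]
            for prev, cur in zip(quarters, quarters[1:])}

def toy_score(text):
    # Stand-in scorer (assumption): +1 if the text mentions "good", else -1.
    return 1.0 if "good" in text else -1.0

rows = [("2016-Q1", "good pay"), ("2016-Q1", "long hours"),
        ("2016-Q2", "good team"), ("2016-Q2", "good perks")]
means = mean_polarity_by_quarter(rows, toy_score)
changes = quarterly_change(means)
print(means, changes)
```

If `sid` is an NLTK `SentimentIntensityAnalyzer` (an assumption based on the `polarity_scores` call), the scorer could be `lambda s: sid.polarity_scores(s)["compound"]`, applied to the pros and the cons separately.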
Topic modeling is done on the fields title, pros and cons. Each field is taken separately, and a topic set is created for each field. Each field is preprocessed by tokenization, and LDA modeling is then performed on these tokens for topic identification. LDA is performed using Gensim. The number of topics is set to 5, and each topic contains 6 words.
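A minimal sketch of the tokenize-then-LDA step for one field follows. The stop-word list, the regular-expression tokenizer, the `passes` and `random_state` values, and the helper names `tokenize` and `fit_topics` are assumptions, since the source does not describe them. Only the choice of Gensim, 5 topics and 6 words per topic comes from the text.

```python
import re

# Small illustrative stop-word list (assumption; the source does not specify one).
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in", "for"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words and one-letter tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 1]

def fit_topics(texts, num_topics=5, num_words=6):
    """Fit an LDA model with Gensim and return the top words of each topic."""
    # Imported here so tokenize() stays usable without Gensim installed.
    from gensim import corpora, models

    tokens = [tokenize(t) for t in texts]
    dictionary = corpora.Dictionary(tokens)               # token <-> integer id
    corpus = [dictionary.doc2bow(doc) for doc in tokens]  # bag-of-words vectors
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                          passes=10, random_state=0)
    topics = lda.show_topics(num_topics=num_topics, num_words=num_words,
                             formatted=False)
    return [[word for word, _ in words] for _, words in topics]

print(tokenize("The work-life balance is great, and the pay is good."))
```

Calling `fit_topics` once per field, for example on the list of all pros texts, would give one topic set for title, one for pros and one for cons. The vocabulary of a field needs at least 6 distinct tokens for each topic to list 6 words.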