<a href="https://colab.research.google.com/github/AnoVando/MSIS/blob/master/MSIS521_IA3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt') # tokenizer
nltk.download('wordnet') # lemmatizer
nltk.download('stopwords') # used to handle words like a, an, the
nltk.download('averaged_perceptron_tagger') # Part of Speech

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
np.random.seed(2018)

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import *
from sklearn.metrics.pairwise import *
from sklearn.cluster import AgglomerativeClustering



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [0]:
url = 'https://github.com/AnoVando/MSIS/raw/master/IA3.csv'
data = pd.read_csv(url, header='infer')
reviews = data['review'].tolist()

In [3]:
print(data[:5])

   id                                             review       label
0   1  About the shop: There is a restaurant in Soi L...  restaurant
1   2  About the shop: Through this store for about t...  restaurant
2   3  Roast Coffee &amp; Eatery is a restaurant loca...  restaurant
3   4  Eat from the children. The shop is opposite. P...  restaurant
4   5  The Ak 1 shop at another branch tastes the sam...  restaurant


In [0]:
custom_list = ['quot', 'ha', 'wa']

Part 1. Topic Model
There are 1000 reviews for restaurants and films in a collection under the attached csv file. All of those
reviews are saved as text files. In this assignment, you are required to investigate the topics of those
reviews. In particular, please follow the steps listed below:
1. Transform those reviews into a term‐document matrix, lemmatize all the words, remove the
stop‐words and punctuations, set the minimal document frequency for each term to be 5 and
include 2‐gram.
2. Use the LDA model to extract the topics of each document. In particular, we assume there are 6
topics.
3. Report the topic distribution and the top‐2 topics of the first 10 restaurant reviews (id = [1:10])
and the first 10 movie reviews (id = [501:510]).
4. Find the top‐5 terms (terms with the top‐5 highest weights) for each of the 6 topics. Based on
those terms, describe what those topics are about.
5. Based on finding in 3 and 4, describe what review 1 [ID=1] and review 501 [ID=501] are about?
Please submit 1 file:
A word file includes python code with your comment #, and one screenshot on your Jupyter
Notebook showing that your code has run through successfully for each of the first four steps (4
screenshots in total). Also, report your answers to question 3, 4, and 5 at the end of the word
file.

1. Transform those reviews into a term‐document matrix, lemmatize all the words, remove the
stop‐words and punctuations, set the minimal document frequency for each term to be 5 and
include 2‐gram.

In [0]:
# Tokenize, Lemmatize and Remove Stop Words
lemmatizer = nltk.stem.WordNetLemmatizer()
processed = []
for review in reviews:
    tokens = nltk.word_tokenize(review.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if not token in stopwords.words('english')]
    tokens = [token for token in tokens if not token in STOPWORDS]
    tokens = [token for token in tokens if not token in custom_list]
    processed.append(tokens)


In [6]:
# Generate TF-IDF Vectors
processed_tfidf = [" ".join(x) for x in processed]

tfidf = TfidfVectorizer(ngram_range=(2, 2), min_df=5) # 2-grams and min. document frequency of 5
tfidf.fit(processed_tfidf)
vector = tfidf.transform(processed_tfidf)
vector = vector.toarray()
print(tfidf.vocabulary_)
print(vector)
print(vector.shape)

{'restaurant decorated': 876, 'average price': 19, 'price baht': 836, 'soup sweet': 977, 'sweet taste': 1045, 'cream cheese': 142, 'topic lt': 1129, 'gt atmosphere': 455, 'atmosphere nice': 14, 'lt food': 644, 'food taste': 376, 'taste gt': 1056, 'gt lt': 457, 'lt service': 645, 'service gt': 935, 'gt good': 456, 'good service': 442, 'service good': 934, 'good food': 421, 'lt value': 646, 'price expensive': 837, 'atmosphere good': 13, 'good delicious': 416, 'delicious food': 175, 'like forgive': 560, 'forgive think': 380, 'think comment': 1083, 'comment win': 126, 'win competition': 1189, 'competition good': 132, 'good restaurant': 441, 'restaurant thank': 884, 'time try': 1118, 'restaurant located': 880, 'located soi': 599, 'bottle baht': 79, 'chicken wing': 104, 'spicy taste': 991, 'united state': 1147, 'value money': 1149, 'money gt': 693, 'menu like': 674, 'look good': 607, 'want eat': 1164, 'eat restaurant': 233, 'menu recommended': 681, 'sauce delicious': 911, 'delicious deliciou

2. Use the LDA model to extract the topics of each document. In particular, we assume there are 6
topics.

In [7]:
dictionary = gensim.corpora.Dictionary(processed)
bow_corpora = [dictionary.doc2bow(doc) for doc in processed]
print(bow_corpora[1])

lda_model = gensim.models.LdaModel(bow_corpora, num_topics=6, id2word=dictionary)

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

lda_model[bow_corpora[1]]

[(1, 1), (2, 2), (4, 1), (8, 1), (9, 1), (18, 1), (21, 3), (22, 1), (26, 3), (27, 4), (28, 1), (33, 1), (34, 2), (35, 4), (39, 1), (47, 1), (49, 2), (50, 1), (53, 3), (54, 1), (55, 1), (59, 1), (63, 2), (64, 1), (66, 2), (69, 1), (73, 1), (74, 1), (75, 2), (76, 1), (77, 1), (78, 2), (80, 1), (81, 1), (82, 2), (87, 1), (90, 1), (91, 3), (92, 2), (93, 2), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 2), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1), (111, 1), (112, 2), (113, 1), (114, 1), (115, 2), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 1), (122, 1), (123, 1), (124, 1), (125, 1), (126, 1), (127, 1), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1), (136, 1), (137, 1), (138, 2), (139, 1), (140, 1), (141, 1), (142, 3), (143, 1), (144, 1)]
Topic: 0 
Words: 0.010*"film" + 0.008*"good" + 0.007*"eat" + 0.006*"like" + 0.006*"delicious" + 0.006*"people" + 0.006*"shop" + 0.005*"food"

[(0, 0.99311703)]

3. Report the topic distribution and the top‐2 topics of the first 10 restaurant reviews (id = [1:10])
and the first 10 movie reviews (id = [501:510]).

In [8]:
url = 'https://github.com/AnoVando/MSIS/raw/master/IA3.csv'
data = pd.read_csv(url, header='infer')
data1 = data[0:9]
data2 = data[500:509]
data3 = data1.append(data2)
reviews2 = data3['review'].tolist()

# Tokenize, Lemmatize and Remove Stop Words
lemmatizer = nltk.stem.WordNetLemmatizer()
processed2 = []
for review in reviews2:
    tokens = nltk.word_tokenize(review.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if not token in stopwords.words('english')]
    tokens = [token for token in tokens if not token in STOPWORDS]
    tokens = [token for token in tokens if not token in custom_list]
    processed2.append(tokens)

text = gensim.corpora.Dictionary(processed2)
corpus = [text.doc2bow(doc) for doc in processed2]

lda_model2 = gensim.models.LdaModel(corpus, num_topics=2, id2word=text)

for idx, topic in lda_model2.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.013*"film" + 0.010*"like" + 0.008*"good" + 0.007*"people" + 0.007*"dolphin" + 0.006*"eat" + 0.005*"price" + 0.005*"life" + 0.005*"delicious" + 0.005*"baht"
Topic: 1 
Words: 0.010*"film" + 0.007*"good" + 0.006*"like" + 0.006*"eat" + 0.005*"time" + 0.005*"dolphin" + 0.005*"restaurant" + 0.005*"food" + 0.005*"think" + 0.005*"people"


In [0]:
# install pyLDAvis if necessary
# !pip install pyLDAvis

In [10]:
# Visualize the topics
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model2, corpus, text)
vis

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


4. Find the top‐5 terms (terms with the top‐5 highest weights) for each of the 6 topics. Based on
those terms, describe what those topics are about.

In [11]:
lda_model.show_topics(num_topics=6, num_words=5)

[(0,
  '0.010*"film" + 0.008*"good" + 0.007*"eat" + 0.006*"like" + 0.006*"delicious"'),
 (1,
  '0.012*"people" + 0.010*"time" + 0.009*"film" + 0.007*"love" + 0.006*"like"'),
 (2,
  '0.008*"film" + 0.007*"time" + 0.006*"people" + 0.005*"eat" + 0.005*"food"'),
 (3,
  '0.010*"film" + 0.008*"good" + 0.007*"like" + 0.007*"time" + 0.006*"people"'),
 (4,
  '0.011*"like" + 0.011*"good" + 0.009*"people" + 0.007*"film" + 0.005*"love"'),
 (5,
  '0.020*"film" + 0.009*"love" + 0.009*"people" + 0.007*"like" + 0.006*"good"')]

Topic 0 is about films and good food.
Topic 1 is about people and films about times they love.
Topic 2 is about people and films about eating good food.
Topic 3 is about people and films about people they like.
Topic 4 is about films that people love.
Topic 5 is about films about loving people.