<a href="https://colab.research.google.com/github/AnoVando/MSIS/blob/master/MSIS521_IA3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt') # tokenizer
nltk.download('wordnet') # lemmatizer
nltk.download('stopwords') # used to handle words like a, an, the
nltk.download('averaged_perceptron_tagger') # Part of Speech

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer
np.random.seed(2018)

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import *
from sklearn.metrics.pairwise import *
from sklearn.cluster import AgglomerativeClustering



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [0]:
url = 'https://github.com/AnoVando/MSIS/raw/master/IA3.csv'
data = pd.read_csv(url, header='infer')
reviews = data['review'].tolist()

In [3]:
print(data[:5])

   id                                             review       label
0   1  About the shop: There is a restaurant in Soi L...  restaurant
1   2  About the shop: Through this store for about t...  restaurant
2   3  Roast Coffee &amp; Eatery is a restaurant loca...  restaurant
3   4  Eat from the children. The shop is opposite. P...  restaurant
4   5  The Ak 1 shop at another branch tastes the sam...  restaurant


In [0]:
custom_list = ['quot', 'ha', 'wa']

Part 1. Topic Model
There are 1000 reviews for restaurants and films in a collection under the attached csv file. All of those
reviews are saved as text files. In this assignment, you are required to investigate the topics of those
reviews. In particular, please follow the steps listed below:
1. Transform those reviews into a term‐document matrix, lemmatize all the words, remove the
stop‐words and punctuations, set the minimal document frequency for each term to be 5 and
include 2‐gram.
2. Use the LDA model to extract the topics of each document. In particular, we assume there are 6
topics.
3. Report the topic distribution and the top‐2 topics of the first 10 restaurant reviews (id = [1:10])
and the first 10 movie reviews (id = [501:510]).
4. Find the top‐5 terms (terms with the top‐5 highest weights) for each of the 6 topics. Based on
those terms, describe what those topics are about.
5. Based on the findings in 3 and 4, describe what review 1 [ID=1] and review 501 [ID=501] are about.
Please submit 1 file:
A Word file that includes your Python code with # comments, and one screenshot of your Jupyter
Notebook showing that your code ran successfully for each of the first four steps (4
screenshots in total). Also, report your answers to questions 3, 4, and 5 at the end of the Word
file.

1. Transform those reviews into a term‐document matrix, lemmatize all the words, remove the
stop‐words and punctuations, set the minimal document frequency for each term to be 5 and
include 2‐gram.

In [0]:
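stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Build the stop-word set once; calling stopwords.words() for every token is slow.
stop_words = set(stopwords.words('english')) | set(custom_list)

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then apply the Porter stemmer.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stop words, then lemmatize and stem each token.
    return [lemmatize_stemming(token)
            for token in gensim.utils.simple_preprocess(text)
            if token not in stop_words]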

In [0]:
processed_docs = data['review'].fillna('').astype(str).map(preprocess)

In [7]:
print(processed_docs)

0      [shop, restaur, soi, langsuan, road, insid, lu...
1      [shop, store, three, year, first, time, tri, r...
2      [roast, coffe, amp, eateri, restaur, locat, se...
3      [eat, children, shop, opposit, phra, prathat, ...
4      [ak, shop, anoth, branch, tast, concentr, tell...
                             ...                        
995    [peopl, aliv, never, die, difficult, know, go,...
996    [first, time, know, chen, tianx, movi, nuclear...
997    [film, time, tear, liter, tear, ya, also, leav...
998    [rememb, child, teacher, alway, take, troubl, ...
999    [abil, episod, mouth, sasuk, year, old, gradua...
Name: review, Length: 1000, dtype: object


In [0]:
dictionary = gensim.corpora.Dictionary(processed_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
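
Step 1 also asks for 2-grams and a minimum document frequency of 5, which the dictionary cell above does not apply as written. The following is a minimal plain-Python sketch of both operations on token lists shaped like `processed_docs`. The helper names `add_bigrams` and `filter_by_doc_freq` and the toy tokens are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter

def add_bigrams(tokens):
    """Return the unigrams followed by adjacent-pair bigrams joined with '_'."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams

def filter_by_doc_freq(docs, min_df):
    """Keep only terms that occur in at least min_df documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_df] for doc in docs]

# Toy corpus (hypothetical tokens); min_df=2 is used only because it is tiny.
toy_docs = [
    ["good", "food", "shop"],
    ["good", "food", "price"],
    ["good", "film", "stori"],
]
with_bigrams = [add_bigrams(doc) for doc in toy_docs]
filtered = filter_by_doc_freq(with_bigrams, min_df=2)
print(filtered)
```

On the real data the same idea would be applied with `min_df=5` before building the dictionary. In gensim, the frequency filter could likely be done instead with `dictionary.filter_extremes(no_below=5, no_above=1.0)` before calling `doc2bow`.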

In [0]:
# Generate TF-IDF Vectors
#processed_tfidf = [" ".join(x) for x in processed_docs]

#tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=5) # 1- and 2-grams, min. document frequency of 5
#tfidf_matrix = tfidf.fit_transform(processed_tfidf)


2. Use the LDA model to extract the topics of each document. In particular, we assume there are 6
topics.

In [9]:
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=6, id2word=dictionary)

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))


Topic: 0 
Words: 0.007*"good" + 0.007*"like" + 0.007*"film" + 0.006*"also" + 0.006*"eat" + 0.006*"peopl" + 0.004*"delici" + 0.004*"tast" + 0.004*"stori" + 0.004*"one"
Topic: 1 
Words: 0.010*"also" + 0.008*"film" + 0.008*"peopl" + 0.006*"like" + 0.005*"one" + 0.005*"love" + 0.005*"say" + 0.005*"time" + 0.004*"stori" + 0.004*"make"
Topic: 2 
Words: 0.008*"also" + 0.008*"film" + 0.008*"peopl" + 0.007*"time" + 0.007*"good" + 0.006*"love" + 0.006*"like" + 0.005*"eat" + 0.005*"say" + 0.004*"think"
Topic: 3 
Words: 0.008*"love" + 0.008*"film" + 0.008*"peopl" + 0.008*"like" + 0.007*"also" + 0.006*"good" + 0.005*"life" + 0.004*"one" + 0.004*"make" + 0.004*"see"
Topic: 4 
Words: 0.012*"film" + 0.007*"peopl" + 0.007*"make" + 0.006*"good" + 0.006*"like" + 0.005*"love" + 0.005*"eat" + 0.005*"time" + 0.005*"movi" + 0.005*"also"
Topic: 5 
Words: 0.014*"film" + 0.009*"peopl" + 0.009*"like" + 0.008*"good" + 0.007*"time" + 0.007*"love" + 0.006*"think" + 0.006*"say" + 0.005*"also" + 0.004*"make"


3. Report the topic distribution and the top‐2 topics of the first 10 restaurant reviews (id = [1:10])
and the first 10 movie reviews (id = [501:510]).

In [10]:
for n in range(10):
    print(lda_model[bow_corpus[n]])

[(0, 0.9939986)]
[(0, 0.9935906)]
[(0, 0.85997045), (3, 0.13639477)]
[(0, 0.6581291), (4, 0.33355817)]
[(0, 0.9728731)]
[(0, 0.99535745)]
[(0, 0.22570696), (4, 0.69325566), (5, 0.07516668)]
[(0, 0.7837508), (2, 0.20763251)]
[(0, 0.8798198), (1, 0.023958737), (2, 0.024051702), (3, 0.024031134), (4, 0.024066806), (5, 0.024071801)]
[(2, 0.95060885)]


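Summed over the first 10 restaurant reviews, the two heaviest topics are 0 and 2: topic 0 is the top topic in 8 of the 10 reviews, and topic 2 narrowly outweighs topic 4 for second place.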
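
Step 3 asks for the top-2 topics of each review. The sketch below ranks each distribution printed above and also sums the weight per topic across the 10 reviews. The values are copied from that output; the helper name `top_k_topics` is an assumption made for illustration. Note that gensim omits topics below its minimum probability threshold, so some reviews list only one topic.

```python
def top_k_topics(distribution, k=2):
    """Return the k (topic, probability) pairs with the highest probability."""
    return sorted(distribution, key=lambda pair: pair[1], reverse=True)[:k]

# Distributions copied from the output above (reviews with id 1 to 10).
restaurant_dists = [
    [(0, 0.9939986)],
    [(0, 0.9935906)],
    [(0, 0.85997045), (3, 0.13639477)],
    [(0, 0.6581291), (4, 0.33355817)],
    [(0, 0.9728731)],
    [(0, 0.99535745)],
    [(0, 0.22570696), (4, 0.69325566), (5, 0.07516668)],
    [(0, 0.7837508), (2, 0.20763251)],
    [(0, 0.8798198), (1, 0.023958737), (2, 0.024051702),
     (3, 0.024031134), (4, 0.024066806), (5, 0.024071801)],
    [(2, 0.95060885)],
]

# Top-2 topics per review.
for review_id, dist in enumerate(restaurant_dists, start=1):
    print(review_id, top_k_topics(dist))

# Total weight per topic across the 10 reviews, ranked from heaviest to lightest.
totals = {}
for dist in restaurant_dists:
    for topic, prob in dist:
        totals[topic] = totals.get(topic, 0.0) + prob
ranked = sorted(totals, key=totals.get, reverse=True)
print(ranked)
```

The same function can be applied to the movie reviews by passing the distributions for rows 500 to 509.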

In [11]:
# Rows 500-509 hold the reviews with id 501-510 (ids start at 1).
for n in range(500, 510):
    print(lda_model[bow_corpus[n]])

[(2, 0.035257466), (5, 0.9596021)]
[(2, 0.1539163), (3, 0.84395987)]
[(3, 0.49371675), (5, 0.50510055)]
[(2, 0.96498054)]
[(3, 0.99588525)]
[(0, 0.7140922), (2, 0.27881503)]
[(2, 0.99357164)]
[(5, 0.9752509)]
[(1, 0.55670434), (5, 0.43529034)]
[(3, 0.68732864), (5, 0.31074795)]


Summed over the first 10 movie reviews, the two heaviest topics are 5 and 3, followed by topic 2.

4. Find the top‐5 terms (terms with the top‐5 highest weights) for each of the 6 topics. Based on
those terms, describe what those topics are about.

In [12]:
lda_model.show_topics(num_topics=6, num_words=5)

[(0,
  '0.007*"good" + 0.007*"like" + 0.007*"film" + 0.006*"also" + 0.006*"eat"'),
 (1,
  '0.010*"also" + 0.008*"film" + 0.008*"peopl" + 0.006*"like" + 0.005*"one"'),
 (2,
  '0.008*"also" + 0.008*"film" + 0.008*"peopl" + 0.007*"time" + 0.007*"good"'),
 (3,
  '0.008*"love" + 0.008*"film" + 0.008*"peopl" + 0.008*"like" + 0.007*"also"'),
 (4,
  '0.012*"film" + 0.007*"peopl" + 0.007*"make" + 0.006*"good" + 0.006*"like"'),
 (5,
  '0.014*"film" + 0.009*"peopl" + 0.009*"like" + 0.008*"good" + 0.007*"time"')]

Topic 0 is about films and good food.
Topic 1 is about people and films about times they love.
Topic 2 is about people and films about eating good food.
Topic 3 is about people and films about people they like.
Topic 4 is about films that people love.
Topic 5 is about films about loving people.
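
To make the comparison between topics concrete, the sketch below parses the top-5 terms from the `show_topics` strings printed above and lists the terms shared by all six topics. The strings are copied from that output; `parse_terms` is a hypothetical helper, not a gensim function.

```python
import re
from collections import Counter

# Strings copied from the show_topics output above.
topics = [
    (0, '0.007*"good" + 0.007*"like" + 0.007*"film" + 0.006*"also" + 0.006*"eat"'),
    (1, '0.010*"also" + 0.008*"film" + 0.008*"peopl" + 0.006*"like" + 0.005*"one"'),
    (2, '0.008*"also" + 0.008*"film" + 0.008*"peopl" + 0.007*"time" + 0.007*"good"'),
    (3, '0.008*"love" + 0.008*"film" + 0.008*"peopl" + 0.008*"like" + 0.007*"also"'),
    (4, '0.012*"film" + 0.007*"peopl" + 0.007*"make" + 0.006*"good" + 0.006*"like"'),
    (5, '0.014*"film" + 0.009*"peopl" + 0.009*"like" + 0.008*"good" + 0.007*"time"'),
]

def parse_terms(topic_string):
    """Extract (term, weight) pairs from a topic string such as 0.007*"good" + ..."""
    return [(term, float(weight))
            for weight, term in re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)]

# Top-5 terms per topic, in descending weight order as printed.
top_terms = {tid: [term for term, _ in parse_terms(s)] for tid, s in topics}

# Terms that appear in the top-5 list of every topic.
term_counts = Counter(term for terms in top_terms.values() for term in terms)
shared = [term for term, n in term_counts.items() if n == len(topics)]
print(top_terms)
print(shared)
```

Heavy overlap between the top-term lists suggests the six topics are hard to tell apart from their top-5 terms alone, which is worth keeping in mind when reading the descriptions above.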

In [0]:
# install pyLDAvis if necessary
# !pip install pyLDAvis

In [25]:
# Visualize the topics
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary) 
vis



5. Based on the findings in 3 and 4, describe what review 1 [ID=1] and review 501 [ID=501] are about.
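
The assignment refers to reviews by id, while `bow_corpus` and `data.review` are indexed by position. Below is a minimal sketch of the mapping, assuming the ids run from 1 to 1000 in file order, as the first rows printed earlier suggest. The name `position_of` is hypothetical.

```python
# Assumption: ids run 1..1000 in file order, matching the head of the data shown earlier.
ids = list(range(1, 1001))

# Map each review id to its 0-based row position.
position_of = {review_id: pos for pos, review_id in enumerate(ids)}

print(position_of[1])    # row position of review ID=1
print(position_of[501])  # row position of review ID=501
```

With the DataFrame itself, the position could be looked up without that assumption via `data.index[data['id'] == 501][0]`.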

In [19]:
# Review ID=1 is the first row, so its position in bow_corpus is 0.
print(lda_model[bow_corpus[0]])
data.review[0]

[(0, 0.9935906)]


'About the shop: There is a restaurant in Soi Langsuan (Road) inside of Luxx Hotel. The décor of the restaurant: decorated in a rustic style, white walls, glass tables, red chairs, parquet floors, dim lights open at night, the romantic atmosphere: Duck l&#39;orange Pork Wellington and French onion soup. Average Price: 250-450 Baht Food Review: Duck l&#39;orange (455) Duck breast sliced Pork Wellington (445) is a piece of tender pork stuffed with stuffing and wrapped in a thin pastry and then baked to serve with the sauce. Duck roll (285) is a Duck wrapped with vegetables and dough wrap not much delicious French onion soup (235) Sweet taste Garnish with cheese Bake Scallop (225) is a scallop in a thick cream with cheese. Serve with a thin toast to eat together. Score by topic: &lt;Atmosphere&gt; 8/10 Atmosphere nice romantic &lt;food taste&gt; 7/10&lt;Service&gt; 9/10 Good service, good food recommendation &lt;Value&gt; 6/10 price is quite expensive and about 1,000 baht if not ordered t

In [21]:
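# Rows are 0-indexed: if ids run sequentially, position 501 is the row with id=502,
# and the row with id=501 is at position 500.
print(lda_model[bow_corpus[501]])
data.review[501]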

[(2, 0.15490067), (3, 0.8429755)]


'In the summer, go to Bali with friends. Rottweiler woke up at five in the morning and took a spider boat out to sea. To see the dolphins. I have never seen dolphins before except TV or dolls. Just remember that the smile of the dolphins is the biggest misunderstanding of mankind. This tells us that this should be derived from the phrase &quot;Dolphin Smile&quot; is the most illusory disguise in the world. The impression of the dolphins in my heart is lovely and sweet, accompanied by the early morning of Rottweiler and the bright yellow clouds of the sky. The dolphins jumped out of the sea in the morning light and were chased by a boat. The boat followed the direction in which the dolphins appeared. I couldn&#39;t help thinking, why should we chase, not wait? The dolphins were trapped in this bay full of boats. Because we were in a hurry, we were too attached to the encounter with the dolphins. I knew very early that there was a documentary about Dolphin Bay, but it took a long time to