
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling, sentiment analysis, and regression analysis.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you understand topic modeling, which aims to capture the human-interpretable meaning of documents, and how to visualize topic modeling results. Using the Yelp review data (only the review text is used for this question), which can be downloaded from Dropbox: https://www.dropbox.com/s/59hsrk56sfwh9u2/Assignment%20four%20data%20Yelp%20%28question%201%20and%202%29.zip?dl=0, **select two models** and write a Python program to **identify the top 20 topics (with 15 words for each topic) in the dataset**. Before answering this question, please review the materials in lesson 8 and the introductions to these models at the links provided.

(1)   Labeled LDA (LLDA): https://github.com/JoeZJH/Labeled-LDA-Python

(2)   Biterm Topic Model (BTM): https://github.com/markoarnauto/biterm

(3)   HMM-LDA: https://github.com/dongwookim-ml/python-topic-model

(4)   SupervisedLDA: https://github.com/dongwookim-ml/python-topic-model/tree/master/notebook

(5)   Relational Topic Model: https://github.com/dongwookim-ml/python-topic-model/tree/master/notebook

(6)   LDA2VEC: https://github.com/cemoody/lda2vec

(7)   BERTopic: https://github.com/MaartenGr/BERTopic

(8)   LDA+BERT Topic Modeling: https://www.kaggle.com/dskswu/topic-modeling-bert-lda

(9)   Clustering for Topic models: (paper: https://arxiv.org/abs/2004.14914), (code: https://github.com/adalmia96/Cluster-Analysis)


**The following information should be reported:**

(1) Top 20 clusters for topic modeling.

(2) Summarize and describe the topic for each cluster. 

(3) Visualize the topic modeling results using pyLDAvis: https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/#14.-pyLDAVis


In [1]:
from zipfile import ZipFile
import json
import pandas as pd
import os

In [2]:
# EXTRACTING THE ZIP FOLDER
file_name = '/content/Assignment four data Yelp (question 1 and 2).zip'
with ZipFile(file_name, 'r') as zf:  # avoid shadowing the built-in zip()
    zf.extractall()
    print('Extracted the zip now')

Extracted the zip now


In [3]:
reviews = list()
ratings = list()
files = os.listdir('/content/Assignment four data Yelp (question 1 and 2)/')

In [4]:
# LOADING AND APPENDING ALL THE JSON FILES TO A LIST
i=0
for file in files:
  with open('/content/Assignment four data Yelp (question 1 and 2)/'+file, 'r') as readfile:
    i +=1
    json_data = json.load(readfile)
    for data in json_data:
      reviews.append(data['text'])
      ratings.append(data['stars'])
    if(i==30):
      break

In [5]:
# CONVERTING TO DATAFRAME
json_df = pd.DataFrame(reviews,columns = ["Reviews"])
json_df["Ratings"] = ratings

In [6]:
json_df.head(10)

Unnamed: 0,Reviews,Ratings
0,Yumalisciousness! I'm not even a big sandwich ...,5.0
1,The place has the look and feel of a dingy yo...,2.0
2,Horrible customer service.....have been withou...,1.0
3,Such a cute little cafe! close to my house I l...,4.0
4,Been going here for years. I've never eaten an...,4.0
5,If you're looking for a vegetarian place that ...,3.0
6,So sad! This used to be one of my fave places....,1.0
7,Love the large layout and the max speed is ver...,4.0
8,1. Parking: parking is across the street from ...,3.0
9,Not fun for older kids 15+. It is so annoying ...,1.0


In [7]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
nltk.download('wordnet')
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [8]:
# PRE PROCESSING

# Special characters removal: replace every non-alphanumeric character with a space
json_df['After noise removal'] = json_df['Reviews'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]", ' ', x))

# Punctuation removal (regex=True is required for pattern matching in recent pandas versions)
json_df['Punctuation removal'] = json_df['After noise removal'].str.replace(r'[^\w\s]', '', regex=True)

# Remove numbers
json_df['Remove numbers'] = json_df['Punctuation removal'].str.replace(r'\d+', '', regex=True)



In [9]:
# Stopwords removal
stop_word = stopwords.words('english')
json_df['Stopwords removal'] = json_df['Remove numbers'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_word))

# Lower Casing
json_df['Lower casing'] = json_df['Stopwords removal'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Tokenization
json_df['Tokenization'] = json_df['Lower casing'].apply(lambda x: TextBlob(x).words)

# Stemming
st = PorterStemmer()
json_df['Stemming'] = json_df['Tokenization'].apply(lambda x: " ".join([st.stem(word) for word in x]))

# Lemmatization
json_df['Lemmatization'] = json_df['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

In [10]:
json_df.head(10)

Unnamed: 0,Reviews,Ratings,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization
0,Yumalisciousness! I'm not even a big sandwich ...,5.0,Yumalisciousness I m not even a big sandwich ...,Yumalisciousness I m not even a big sandwich ...,Yumalisciousness I m not even a big sandwich ...,Yumalisciousness I even big sandwich person Ik...,yumalisciousness i even big sandwich person ik...,"[yumalisciousness, i, even, big, sandwich, per...",yumalisci i even big sandwich person ike defin...,yumalisci i even big sandwich person ike defin...
1,The place has the look and feel of a dingy yo...,2.0,The place has the look and feel of a dingy yo...,The place has the look and feel of a dingy yo...,The place has the look and feel of a dingy yo...,The place look feel dingy yoga studio due plai...,the place look feel dingy yoga studio due plai...,"[the, place, look, feel, dingy, yoga, studio, ...",the place look feel dingi yoga studio due plai...,the place look feel dingi yoga studio due plai...
2,Horrible customer service.....have been withou...,1.0,Horrible customer service have been withou...,Horrible customer service have been withou...,Horrible customer service have been withou...,Horrible customer service without Internet tel...,horrible customer service without internet tel...,"[horrible, customer, service, without, interne...",horribl custom servic without internet telepho...,horribl custom servic without internet telepho...
3,Such a cute little cafe! close to my house I l...,4.0,Such a cute little cafe close to my house I l...,Such a cute little cafe close to my house I l...,Such a cute little cafe close to my house I l...,Such cute little cafe close house I loved deco...,such cute little cafe close house i loved deco...,"[such, cute, little, cafe, close, house, i, lo...",such cute littl cafe close hous i love decor a...,such cute littl cafe close hous i love decor a...
4,Been going here for years. I've never eaten an...,4.0,Been going here for years I ve never eaten an...,Been going here for years I ve never eaten an...,Been going here for years I ve never eaten an...,Been going years I never eaten anything I like...,been going years i never eaten anything i like...,"[been, going, years, i, never, eaten, anything...",been go year i never eaten anyth i like new ow...,been go year i never eaten anyth i like new ow...
5,If you're looking for a vegetarian place that ...,3.0,If you re looking for a vegetarian place that ...,If you re looking for a vegetarian place that ...,If you re looking for a vegetarian place that ...,If looking vegetarian place makes say Wow I be...,if looking vegetarian place makes say wow i be...,"[if, looking, vegetarian, place, makes, say, w...",if look vegetarian place make say wow i believ...,if look vegetarian place make say wow i believ...
6,So sad! This used to be one of my fave places....,1.0,So sad This used to be one of my fave places ...,So sad This used to be one of my fave places ...,So sad This used to be one of my fave places ...,So sad This used one fave places Over last yea...,so sad this used one fave places over last yea...,"[so, sad, this, used, one, fave, places, over,...",so sad thi use one fave place over last year i...,so sad thi use one fave place over last year i...
7,Love the large layout and the max speed is ver...,4.0,Love the large layout and the max speed is ver...,Love the large layout and the max speed is ver...,Love the large layout and the max speed is ver...,Love large layout max speed enjoyable They arc...,love large layout max speed enjoyable they arc...,"[love, large, layout, max, speed, enjoyable, t...",love larg layout max speed enjoy they arcad ga...,love larg layout max speed enjoy they arcad ga...
8,1. Parking: parking is across the street from ...,3.0,1 Parking parking is across the street from ...,1 Parking parking is across the street from ...,Parking parking is across the street from h...,Parking parking across street half complex Thi...,parking parking across street half complex thi...,"[parking, parking, across, street, half, compl...",park park across street half complex thi may s...,park park across street half complex thi may s...
9,Not fun for older kids 15+. It is so annoying ...,1.0,Not fun for older kids 15 It is so annoying ...,Not fun for older kids 15 It is so annoying ...,Not fun for older kids It is so annoying ho...,Not fun older kids It annoying blow whistle ev...,not fun older kids it annoying blow whistle ev...,"[not, fun, older, kids, it, annoying, blow, wh...",not fun older kid it annoy blow whistl everi l...,not fun older kid it annoy blow whistl everi l...
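
In the table above, capitalized stop words such as "The" and "If" survive the stop-word step and only get lowercased afterwards, because NLTK's English stop-word list is lowercase and the filter runs before lowercasing. A minimal sketch of the alternative ordering (lowercase first, then filter) is shown below. The `clean` helper and the tiny `STOP_WORDS` set are hypothetical and only stand in for the notebook's full pipeline.

```python
import re

# Small illustrative stop-word set (assumption); the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "a", "is", "it", "so", "this", "to", "of", "and", "if"}

def clean(text):
    """Lowercase, keep letters only, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)  # digits and punctuation become spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(clean("The place has the look of a dingy yoga studio."))
```

Lowercasing before filtering means a single lowercase stop-word list is enough, regardless of where a word appears in the sentence.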


### Model 1 - BiTerm Topic Modeling

In [11]:
pip install biterm

Collecting biterm
  Downloading https://files.pythonhosted.org/packages/36/ca/5a43511e6ea8ca02cc9e8be1b8898ad79b140c055d4400342dc210ba23bb/biterm-0.1.5.tar.gz (79kB)
Building wheels for collected packages: biterm
  Building wheel for biterm (setup.py) ... done
  Created wheel for biterm: filename=biterm-0.1.5-cp36-cp36m-linux_x86_64.whl size=195421 sha256=05aab37ce806cdef65e1e2d408502e18622b3518aec13590632de3131acfa3ff
  Stored in directory: 

In [12]:
import numpy as np

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(json_df['Reviews'].head(500).values).toarray()

In [17]:
from biterm.utility import vec_to_biterms
vocab = np.array(vec.get_feature_names())
biterms = vec_to_biterms(X)
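
For intuition about what the biterm step produces: a biterm is an unordered pair of distinct words that co-occur in the same short document. The sketch below is a plain-Python illustration under that definition, not the library's own code; `doc_to_biterms` and the toy vocabulary are hypothetical names.

```python
from itertools import combinations

def doc_to_biterms(word_ids):
    """Return every unordered pair of distinct word ids in one document."""
    unique_ids = sorted(set(word_ids))
    return list(combinations(unique_ids, 2))

# Toy document encoded as vocabulary indices (hypothetical data).
toy_vocab = ["bread", "sauce", "staff", "friendly"]
toy_doc = [0, 1, 1, 3]  # "bread sauce sauce friendly"

pairs = doc_to_biterms(toy_doc)
print([(toy_vocab[a], toy_vocab[b]) for a, b in pairs])
```

The number of pairs grows quadratically with the number of distinct words per document, which is consistent with the long training times on full-length reviews.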

In [18]:
from biterm.btm import oBTM
btm = oBTM(num_topics=20, V=vocab)
topics = btm.fit_transform(biterms, iterations=10)


  0%|          | 0/10 [00:00<?, ?it/s]
 10%|█         | 1/10 [04:12<37:52, 252.49s/it]
 20%|██        | 2/10 [08:23<33:37, 252.17s/it]
 30%|███       | 3/10 [12:35<29:23, 251.97s/it]
 40%|████      | 4/10 [16:46<25:09, 251.57s/it]
 50%|█████     | 5/10 [20:56<20:56, 251.30s/it]
 60%|██████    | 6/10 [25:07<16:45, 251.28s/it]
 70%|███████   | 7/10 [29:18<12:33, 251.20s/it]
 80%|████████  | 8/10 [33:29<08:21, 250.98s/it]
 90%|█████████ | 9/10 [37:41<04:11, 251.29s/it]
100%|██████████| 10/10 [41:53<00:00, 251.30s/it]


In [19]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)

In [20]:
import numpy as np
import pyLDAvis
from biterm.btm import oBTM 
from sklearn.feature_extraction.text import CountVectorizer
from biterm.utility import vec_to_biterms, topic_summuary # helper functions

if __name__ == "__main__":
    texts = json_df["Lemmatization"].head(100).values
    # vectorize texts
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

    # get vocabulary
    vocab = np.array(vec.get_feature_names())

    # get biterms
    biterms = vec_to_biterms(X)

    # create btm
    btm = oBTM(num_topics=20, V=vocab)

    print("\n\n Train Online BTM ..")
    for i in range(0, len(biterms), 100): # process chunks of 100 texts
        biterms_chunk = biterms[i:i + 100]
        btm.fit(biterms_chunk, iterations=10)
    topics = btm.transform(biterms)

    print("\n\n Visualize Topics ..")
    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    # pyLDAvis.save_html(vis, './vis/online_btm.html')  # path to output
    print(vis)

    print("\n\n Topic coherence ..")
    topic_summuary(btm.phi_wz.T, X, vocab, 10)

    # print("\n\n Texts & Topics ..")
    # for i in range(len(texts)):
    #     print("{} (topic: {})".format(texts[i], topics[i].argmax()))



 Train Online BTM ..



  0%|          | 0/10 [00:00<?, ?it/s]
 10%|█         | 1/10 [00:25<03:51, 25.70s/it]
 20%|██        | 2/10 [00:51<03:26, 25.78s/it]
 30%|███       | 3/10 [01:17<03:00, 25.76s/it]
 40%|████      | 4/10 [01:42<02:34, 25.72s/it]
 50%|█████     | 5/10 [02:08<02:08, 25.63s/it]
 60%|██████    | 6/10 [02:34<01:43, 25.79s/it]
 70%|███████   | 7/10 [03:00<01:17, 25.82s/it]
 80%|████████  | 8/10 [03:26<00:51, 25.97s/it]
 90%|█████████ | 9/10 [03:52<00:25, 25.92s/it]
100%|██████████| 10/10 [04:18<00:00, 25.89s/it]




 Visualize Topics ..
PreparedData(topic_coordinates=              x         y  topics  cluster      Freq
topic                                               
0      0.136729  0.115347       1        1  9.367696
4      0.123923  0.128127       2        1  8.423392
8      0.096017  0.084911       3        1  7.551605
13     0.124381  0.068550       4        1  6.991518
7      0.121284  0.106933       5        1  6.081147
16     0.057685 -0.018511       6        1  5.955514
6      0.143144  0.048399       7        1  5.704505
5      0.079611  0.041879       8        1  5.694070
12     0.083281  0.018830       9        1  5.661890
10     0.064097 -0.367454      10        1  5.386652
1     -0.474744  0.028376      11        1  4.932292
17     0.100625  0.073875      12        1  4.773647
18     0.056582 -0.382914      13        1  4.330498
11     0.001745 -0.052851      14        1  4.241870
9      0.088624  0.006368      15        1  4.170538
19     0.036013 -0.034054      16        1  3

In [21]:
# display() takes the prepared data; its second positional argument is the `local` flag, not an output path
pyLDAvis.display(vis)
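
The assignment asks for 15 words per topic. As a minimal sketch of how top words can be read off a topic-by-word probability table (assumed shape: one row per topic, one column per vocabulary word), the hypothetical helper `top_words` below ranks each row and keeps the first `n` entries. The toy matrix and vocabulary are made up for illustration.

```python
def top_words(topic_word_probs, vocab, n=15):
    """For each topic row, return the n vocabulary words with the highest probability."""
    result = []
    for row in topic_word_probs:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n]])
    return result

# Toy data (hypothetical): 2 topics over a 5-word vocabulary.
toy_vocab = ["food", "room", "staff", "hotel", "order"]
toy_phi = [
    [0.40, 0.05, 0.20, 0.05, 0.30],
    [0.05, 0.45, 0.10, 0.35, 0.05],
]
for k, words in enumerate(top_words(toy_phi, toy_vocab, n=3)):
    print(k, words)
```

With a real model, the same function could be called with `n=15` on the fitted topic-word matrix and the vocabulary array, assuming they are aligned column by column.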



### Model 2 - LDA through Gensim

In [22]:
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


In [23]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [24]:
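# Note: .values.tolist() turns every row into a list of ALL columns (raw review plus each
# preprocessing stage), so each document built from it contains the review text several times.
json_data = json_df.values.tolist()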

In [26]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; tokenization itself drops punctuation

data_words = list(sent_to_words(json_data))

print(data_words[:1])

[['not', 'even', 'big', 'sandwich', 'person', 'but', 'ike', 'definitely', 'has', 'my', 'love', 'the', 'dutch', 'crunch', 'bread', 'is', 'the', 'bomb', 'and', 'the', 'dirty', 'sauce', 'is', 'just', 'garlicky', 'goodness', 'staff', 'is', 'fun', 'and', 'friendly', 'and', 'the', 'menu', 'offers', 'one', 'of', 'kind', 'exciting', 'combinations', 'all', 'sandwiches', 'are', 'hot', 'and', 'made', 'to', 'order', 'and', 'customized', 'to', 'your', 'level', 'of', 'awesomeness', 'this', 'place', 'is', 'always', 'packed', 'at', 'lunch', 'time', 'so', 'plan', 'accordingly', 'not', 'even', 'big', 'sandwich', 'person', 'but', 'ike', 'definitely', 'has', 'my', 'love', 'the', 'dutch', 'crunch', 'bread', 'is', 'the', 'bomb', 'and', 'the', 'dirty', 'sauce', 'is', 'just', 'garlicky', 'goodness', 'staff', 'is', 'fun', 'and', 'friendly', 'and', 'the', 'menu', 'offers', 'one', 'of', 'kind', 'exciting', 'combinations', 'all', 'sandwiches', 'are', 'hot', 'and', 'made', 'to', 'order', 'and', 'customized', 'to',

In [27]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#Trigram
print(trigram_mod[bigram_mod[data_words[0]]])



['not', 'even', 'big', 'sandwich', 'person', 'but', 'ike', 'definitely', 'has', 'my', 'love', 'the', 'dutch_crunch_bread', 'is', 'the', 'bomb', 'and', 'the', 'dirty', 'sauce', 'is', 'just', 'garlicky_goodness', 'staff', 'is', 'fun', 'and', 'friendly', 'and', 'the', 'menu', 'offers', 'one', 'of', 'kind', 'exciting_combinations', 'all', 'sandwiches', 'are', 'hot', 'and', 'made', 'to', 'order', 'and', 'customized', 'to', 'your', 'level', 'of', 'awesomeness', 'this', 'place', 'is', 'always', 'packed', 'at', 'lunch', 'time', 'so', 'plan_accordingly', 'not', 'even', 'big', 'sandwich', 'person', 'but', 'ike', 'definitely', 'has', 'my', 'love', 'the', 'dutch_crunch_bread', 'is', 'the', 'bomb', 'and', 'the', 'dirty', 'sauce', 'is', 'just', 'garlicky_goodness', 'staff', 'is', 'fun', 'and', 'friendly', 'and', 'the', 'menu', 'offers', 'one', 'of', 'kind', 'exciting_combinations', 'all', 'sandwiches', 'are', 'hot', 'and', 'made', 'to', 'order', 'and', 'customized', 'to', 'your', 'level', 'of', 'awe

In [29]:
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [30]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize the small English spaCy model, disabling the parser and NER for efficiency
# python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['even', 'big', 'person', 'ike', 'definitely', 'love', 'bread', 'bomb', 'dirty', 'sauce', 'garlicky_goodness', 'staff', 'fun', 'friendly', 'menu', 'offer', 'kind', 'exciting_combination', 'sandwich', 'hot', 'make', 'order', 'customize', 'level', 'awesomeness', 'place', 'always', 'pack', 'lunch', 'time', 'plan', 'accordingly', 'even', 'big', 'sandwich', 'person', 'ike', 'definitely', 'love', 'bread', 'bomb', 'dirty', 'sauce', 'garlicky_goodness', 'staff', 'fun', 'friendly', 'menu', 'offer', 'kind', 'exciting_combination', 'sandwich', 'hot', 'make', 'order', 'customize', 'level', 'awesomeness', 'place', 'always', 'pack', 'lunch', 'time', 'plan', 'accordingly', 'even', 'big', 'sandwich', 'person', 'ike', 'definitely', 'love', 'bread', 'bomb', 'dirty', 'sauce', 'garlicky_goodness', 'staff', 'fun', 'friendly', 'menu', 'offer', 'kind', 'exciting_combination', 'sandwich', 'hot', 'make', 'order', 'customize', 'level', 'awesomeness', 'place', 'always', 'pack', 'lunch', 'time', 'plan', 'accordi

In [31]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 7), (1, 2), (2, 7), (3, 2), (4, 7), (5, 9), (6, 9), (7, 9), (8, 2), (9, 7), (10, 2), (11, 7), (12, 7), (13, 9), (14, 2), (15, 7), (16, 7), (17, 7), (18, 7), (19, 9), (20, 9), (21, 9), (22, 9), (23, 9), (24, 9), (25, 9), (26, 9), (27, 9), (28, 9), (29, 9), (30, 9), (31, 9), (32, 9), (33, 13), (34, 2), (35, 7), (36, 7), (37, 2), (38, 9), (39, 1)]]


In [32]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('accordingly', 7),
  ('alway', 2),
  ('always', 7),
  ('awesom', 2),
  ('awesomeness', 7),
  ('big', 9),
  ('bomb', 9),
  ('bread', 9),
  ('custom', 2),
  ('customize', 7),
  ('definit', 2),
  ('definitely', 7),
  ('dirty', 7),
  ('even', 9),
  ('excit', 2),
  ('exciting_combination', 7),
  ('friendly', 7),
  ('fun', 7),
  ('garlicky_goodness', 7),
  ('hot', 9),
  ('ike', 9),
  ('kind', 9),
  ('level', 9),
  ('love', 9),
  ('lunch', 9),
  ('make', 9),
  ('menu', 9),
  ('offer', 9),
  ('order', 9),
  ('pack', 9),
  ('person', 9),
  ('place', 9),
  ('plan', 9),
  ('sandwich', 13),
  ('sauc', 2),
  ('sauce', 7),
  ('staff', 7),
  ('thi', 2),
  ('time', 9),
  ('wordlist', 1)]]
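
To make the bag-of-words representation above concrete, here is a minimal plain-Python sketch that maps tokens to integer ids and counts them per document. It only illustrates the idea of `(token_id, count)` pairs; the helpers `build_vocab` and `to_bow` are hypothetical, and the id assignment here (first-seen order) is an assumption that need not match gensim's.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Toy corpus (hypothetical data).
toy_docs = [["bread", "sauce", "bread"], ["staff", "sauce"]]
vocab = build_vocab(toy_docs)
print(to_bow(toy_docs[0], vocab))
```

Word order is discarded in this representation; only how often each token occurs in a document is kept.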

In [33]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [35]:
# Print the keywords of the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.075*"order" + 0.065*"come" + 0.038*"friend" + 0.036*"table" + '
  '0.033*"take" + 0.033*"go" + 0.030*"back" + 0.029*"food" + 0.026*"server" + '
  '0.025*"get"'),
 (1,
  '0.085*"room" + 0.048*"old" + 0.046*"stay" + 0.037*"hotel" + 0.028*"kid" + '
  '0.022*"different" + 0.018*"would" + 0.014*"place" + 0.014*"floor" + '
  '0.014*"check"'),
 (2,
  '0.055*"give" + 0.045*"make" + 0.043*"say" + 0.031*"would" + 0.018*"call" + '
  '0.018*"get" + 0.018*"money" + 0.018*"almost" + 0.015*"talk" + 0.014*"fix"'),
 (3,
  '0.063*"night" + 0.031*"get" + 0.029*"late" + 0.026*"wife" + 0.026*"dog" + '
  '0.016*"fact" + 0.015*"decent" + 0.015*"play" + 0.015*"total" + '
  '0.013*"leave"'),
 (4,
  '0.069*"love" + 0.054*"staff" + 0.049*"make" + 0.045*"always" + '
  '0.043*"great" + 0.032*"friendly" + 0.025*"place" + 0.024*"clean" + '
  '0.022*"amazing" + 0.018*"owner"'),
 (5,
  '0.062*"day" + 0.029*"work" + 0.023*"must" + 0.021*"see" + 0.015*"provide" + '
  '0.015*"meet" + 0.014*"light" + 0.014*"new"

In [36]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # log_perplexity returns the per-word likelihood bound, not perplexity itself; higher (closer to 0) is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -6.844708117272013

Coherence Score:  0.533089203468305
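
The negative "Perplexity" value printed above is a per-word likelihood bound rather than a perplexity. As a small sketch, the bound can be converted with `2 ** (-bound)`, which is the conversion gensim uses when it logs a perplexity estimate. The helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate via 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

bound = -6.844708117272013  # value printed by the cell above
print(round(bound_to_perplexity(bound), 1))
```

On this scale, lower perplexity is better, which corresponds to a bound closer to zero.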


In [37]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

# **Question 2: Yelp Review Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: use **80% of the data for training and 20% for testing**.
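
As a minimal sketch of the 80/20 split, the hypothetical helper below shuffles the records with a fixed seed and cuts the list at 80%. The seed value and function name are assumptions for illustration; a library splitter would serve the same purpose.

```python
import random

def train_test_split_80_20(items, seed=42):
    """Shuffle a copy of the items and return (train, test) with an 80/20 cut."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split_80_20(range(10))
print(len(train), len(test))
```

Splitting before fitting any feature extractor keeps information from the test reviews out of training.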

The data can be downloaded from Dropbox: https://www.dropbox.com/s/59hsrk56sfwh9u2/Assignment%20four%20data%20Yelp%20%28question%201%20and%202%29.zip?dl=0

The data is saved in JSON format. Here is an example record (for this task, you only need the star rating and the review text fields):

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

The sentiment can be assessed from the star rating; if a record has no star information, remove that record. Star ratings map to sentiment levels as follows:

Very positive = 5 stars

Positive = 4 stars

Neutral = 3 stars

Negative = 2 stars

Very negative = 1 star
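
The mapping above can be sketched as a lookup table plus a filter that skips records without a star rating. The names `STAR_TO_SENTIMENT` and `label_reviews` and the toy records are hypothetical; only the `stars` and `text` fields come from the data description.

```python
STAR_TO_SENTIMENT = {
    5: "very positive",
    4: "positive",
    3: "neutral",
    2: "negative",
    1: "very negative",
}

def label_reviews(records):
    """Yield (text, sentiment) pairs, skipping records without a usable star rating."""
    for rec in records:
        stars = rec.get("stars")
        if stars is None:
            continue  # no star information: drop the record
        label = STAR_TO_SENTIMENT.get(int(stars))
        if label is not None:
            yield rec["text"], label

# Toy records (hypothetical data).
toy = [
    {"text": "Great place", "stars": 4},
    {"text": "No rating here"},
    {"text": "Horrible service", "stars": 1.0},
]
print(list(label_reviews(toy)))
```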

Here is code for yelp data preprocessing: https://github.com/Yelp/dataset-examples. 

Answer the following questions:

(1) Describe the features used for sentiment classification and explain why you selected them (tf-idf, sentiment lexicon, word2vec, etc.). Aim for the best performance you can achieve.

(2) Select two supervised learning algorithms from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, and build a sentiment classifier with each.

(3) Compare the accuracy, precision, recall, and F1 score of the two algorithms you selected. For reference on how to calculate these metrics, see: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
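
To make the four metrics concrete, here is a plain-Python sketch that computes accuracy and macro-averaged precision, recall, and F1 from true and predicted labels. Macro averaging (an unweighted mean over classes) is an assumption chosen for this illustration; the function name and toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy labels (hypothetical data).
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(classification_metrics(y_true, y_pred))
```

Running the same function on the predictions of both classifiers gives directly comparable numbers for the same test set.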

In [None]:
# Write your code here





# **Question 3: House price prediction**

(40 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Dropbox: https://www.dropbox.com/s/52j9hpxppfo921o/assignment4-question3-data.zip?dl=0. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.
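
As a minimal illustration of what a regression model fits, the sketch below computes an ordinary least squares line for a single explanatory variable. It is a simplification of the 79-variable task; the helper `fit_line` and the toy area and price values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (hypothetical): living area in square feet vs. sale price.
areas = [1000, 1500, 2000, 2500]
prices = [150000, 200000, 250000, 300000]
slope, intercept = fit_line(areas, prices)
print(slope, intercept)
```

With many explanatory variables the same least squares idea applies, with one coefficient per variable instead of a single slope.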


In [None]:
# Write your code here