                 Name: Shreya            Andrew ID: sshreya             Zillow Zestimate Prediction     

### The Amazon online marketplace relies on product reviews that customers write based on their experience. The Amazon review dataset has over 21.9 million product reviews for 1.2 million products. We hypothesize the classes of concerns that customers express by applying unsupervised learning: Latent Dirichlet Allocation (LDA) generates topic models from the review text, and each topic (cluster) is described by the words that predict membership in it.

In [1]:
import csv
import pandas as pd
import json
import random

# from google.colab import drive

# drive.mount('/content/gdrive/', force_remount=True)

# # !pip install numpy
# %cd /content/gdrive/My Drive/Amazon

In [2]:
# Set the file path and read the JSON-lines file in chunks of 10,000 rows,
# keeping a random 10% sample of each chunk in a list of DataFrames
file_path = "Home_and_Kitchen.json"
dfs = []
for chunk in pd.read_json(file_path, lines=True, chunksize=10000):
    dfs.append(chunk.sample(frac=0.1))

In [3]:
# Concatenate the list of dataframes into a single dataframe
review_df = pd.concat(dfs, ignore_index=True)
len(review_df)

2192857
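
The 10% sample above changes on every run because `sample` is called without a seed. A minimal sketch of a reproducible variant follows. It assumes a small stand-in file, and the helper name `sample_json_lines`, the file name `demo_reviews.json`, and the `random_state` argument are illustrative additions that are not part of the original cell.

```python
import json
import os
import tempfile

import pandas as pd


def sample_json_lines(path, frac=0.1, chunksize=10000, seed=0):
    """Read a JSON-lines file in chunks and keep a seeded random fraction of each chunk."""
    parts = []
    for i, chunk in enumerate(pd.read_json(path, lines=True, chunksize=chunksize)):
        # Offset the seed per chunk so every chunk does not pick the same row positions.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


# Build a tiny stand-in file (hypothetical data, not the Amazon dataset).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json")
with open(demo_path, "w") as fh:
    for n in range(100):
        fh.write(json.dumps({"overall": n % 5 + 1, "reviewText": f"review {n}"}) + "\n")

first = sample_json_lines(demo_path, frac=0.1, chunksize=20, seed=42)
second = sample_json_lines(demo_path, frac=0.1, chunksize=20, seed=42)
print(len(first), first.equals(second))
```

With a fixed seed, two runs return the same rows, which makes the downstream counts and topic models easier to compare across sessions.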

In [4]:
review_df.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image
0,5,,True,"03 13, 2017",A1G91EWH4WCOHS,1933682612,JESSICA HAYES,Bought this for my stepdaughter when she start...,A classic,1489363200,{'Format:': ' Toy'},
1,4,,False,"09 14, 2017",A35JNJZKZY0S8Q,B00002N601,Houstonian,"Generally speaking, this product is good enoug...",Only one little thing not that good,1505347200,{'Size:': ' 6 qt'},
2,5,,True,"11 10, 2016",A4P9EJZ09I1BP,710105482X,Carlpak1,There are 100's of uses for this product howev...,Useful,1478736000,,
3,4,,True,"01 30, 2013",A8THV1OK4VNFP,0983124248,K. Christensen,After years of just putting stickers on paper ...,Sticker book,1359504000,{'Format:': ' Spiral-bound'},
4,5,,True,"12 17, 2013",A2V1K4KARXFH42,B0000224M6,kd,"Smells good. I'm happy,\nSunflowers are a nice...",Nice,1387238400,{'Size:': ' 428'},


### Since we are focusing on the classes of concerns expressed by customers, only the lower-rated reviews are kept
#### Count the reviews per rating, then remove those with an overall rating of 4 or above

In [5]:
review_df['overall'].unique()
rating_counts = review_df['overall'].value_counts()

# print the result
print(rating_counts)

5    1415539
4     303188
1     201343
3     161625
2     111162
Name: overall, dtype: int64


In [6]:
# Drop reviews rated 4 or 5 so that only lower-rated reviews (1 to 3 stars) remain
high_rating_index = review_df[review_df['overall'] >= 4].index
review_df.drop(high_rating_index, inplace=True)
print(len(review_df))

474130
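
As a check, the three low-rating counts printed earlier add up to the remaining row count: 201,343 + 161,625 + 111,162 = 474,130. The same filter can also be written as a boolean mask instead of dropping by index. The sketch below uses a toy frame with hypothetical data, and the names `toy_df` and `low_df` are illustrative only.

```python
import pandas as pd

# Toy ratings (hypothetical data) standing in for review_df.
toy_df = pd.DataFrame({
    "overall": [5, 4, 1, 3, 2, 5],
    "reviewText": ["a", "b", "c", "d", "e", "f"],
})

# Keep only ratings below 4; reset_index gives the result a clean 0..n-1 index.
low_df = toy_df[toy_df["overall"] < 4].reset_index(drop=True)
print(low_df["overall"].tolist())

# The arithmetic check on the counts reported in the notebook.
print(201343 + 161625 + 111162)
```

A mask returns a new frame rather than modifying the original in place, which can be easier to reason about when cells are re-run out of order.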


In [7]:
review_df.isnull().sum()

overall                0
vote              373831
verified               0
reviewTime             0
reviewerID             0
asin                   0
reviewerName          35
reviewText           186
summary               69
unixReviewTime         0
style             221735
image             455253
dtype: int64

In [8]:
## Remove the rows with missing reviewText

review_df = review_df.dropna(subset = ['reviewText'])

In [9]:
len(review_df)

473944

### Remove stop words from each review, then tokenize and stem the text

In [10]:
from gensim import models, similarities
from gensim.corpora.dictionary import Dictionary
# Plus a few other assorted imports.
import numpy as np
# We'd typically start by tokenizing the data 
from gensim.utils import tokenize
# And then stem all words.
from gensim.parsing.porter import PorterStemmer

!pip install pattern
import nltk
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kashy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
review_df['reviewText'].iloc[0]

'It is big and cools the house. It is more powerful than the smaller two-fan models which is why I got it.\n\nBUT BE SURE TO CHECK WINDOW SIZE. I could not put it where I intended. Only the largest windows in the living room were big enough.'

In [12]:
# Replace every run of non-alphanumeric characters in reviewText with a single space.
# regex=True makes pandas treat the pattern as a regular expression regardless of the version's default.

review_df['reviewText'] = review_df['reviewText'].str.replace('[^A-Za-z0-9]+', ' ', regex=True)
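
To see what the pattern `[^A-Za-z0-9]+` does to a single review, here is a small sketch using Python's `re` module on a fragment of the sample review shown earlier. The variable names `sample` and `cleaned` are illustrative.

```python
import re

# A fragment of the first review, including punctuation, a hyphen, and newlines.
sample = "two-fan models which is why I got it.\n\nBUT BE SURE TO CHECK WINDOW SIZE."

# Each run of characters that are not letters or digits becomes one space.
cleaned = re.sub(r"[^A-Za-z0-9]+", " ", sample)
print(repr(cleaned))
```

Note that the hyphen in "two-fan" becomes a space, and the final period leaves a trailing space. The later `split()` call in the stop word step discards such extra whitespace.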


In [13]:
# remove all stop words from the reviewText

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# define a function to remove stop words
def remove_stopwords(text):
    words = [word.lower() for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

# apply the function to the 'reviewText' column
review_df['reviewText'] = review_df['reviewText'].apply(remove_stopwords)
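
The effect of the stop word filter on one sentence can be sketched without downloading the NLTK list. The stop list below is a tiny hypothetical subset, and `remove_stopwords_demo` is an illustrative name; the notebook itself uses NLTK's full English list.

```python
# A tiny hypothetical stop list; NLTK's English list is much longer.
demo_stop_words = {"it", "is", "and", "the", "be", "to"}


def remove_stopwords_demo(text, stop_words):
    """Lowercase each whitespace-separated word and drop those found in stop_words."""
    return " ".join(w.lower() for w in text.split() if w.lower() not in stop_words)


result = remove_stopwords_demo("It is big and cools the house", demo_stop_words)
print(result)
```

Because the comparison is made on the lowercased word, capitalized stop words such as "It" are removed as well.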

In [14]:
#create tokenized list of text from each review

tokenized_texts = [list(tokenize(text)) for text in review_df['reviewText']]
print(tokenized_texts[:1])

[['big', 'cools', 'house', 'powerful', 'smaller', 'two', 'fan', 'models', 'got', 'sure', 'check', 'window', 'size', 'could', 'put', 'intended', 'largest', 'windows', 'living', 'room', 'big', 'enough']]


In [15]:
# And then stem all words.
from gensim.parsing.porter import PorterStemmer

stemmer = PorterStemmer()
stemmed_texts = [[stemmer.stem(word) for word in text] for text in tokenized_texts]
len(stemmed_texts)

473944

In [16]:
# Drop documents that are left with no tokens after cleaning.
# (Despite the variable name, the texts are stemmed, not lemmatized.)
clean_lemmatized_text = []
for doc in stemmed_texts:
    # Keep only non-empty documents
    if len(doc)!=0:
        clean_lemmatized_text.append(doc)
len(clean_lemmatized_text)

473746

In [17]:
# Create a corpus from a list of texts.
from gensim.corpora.dictionary import Dictionary
# The dictionary just extracts and numbers each distinct word.
dictionary = Dictionary(clean_lemmatized_text, prune_at=20000)
# A corpus is a sparse datastore containing the number of times each word appears in each document.
corpus = [dictionary.doc2bow(text) for text in clean_lemmatized_text]
print(len(corpus))

473746


In [18]:
# Print the 10 most frequent words in the first document, with their dictionary indices and counts.
top_words_in_doc_0 = sorted(corpus[0], key=lambda e: e[1], reverse=True)[:10]
for word_index, count in top_words_in_doc_0:
  print(f'{dictionary[word_index]}\tindex: {word_index}\tcount: {count:,}')

big	index: 0	count: 2
window	index: 19	count: 2
check	index: 1	count: 1
cool	index: 2	count: 1
could	index: 3	count: 1
enough	index: 4	count: 1
fan	index: 5	count: 1
got	index: 6	count: 1
hous	index: 7	count: 1
intend	index: 8	count: 1
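
The bag-of-words idea behind `doc2bow` can be illustrated in plain Python. This is a simplified sketch, not gensim's implementation: the id assignment here (order of first appearance) is an assumption made for the example, and `token2id` and `bow` are illustrative names.

```python
from collections import Counter

# A short stemmed document (hypothetical, modeled on the first review).
doc = ["big", "cool", "hous", "window", "big", "window"]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for token in doc:
    token2id.setdefault(token, len(token2id))

# Bag of words: sorted (token_id, count) pairs; word order is discarded.
bow = sorted(Counter(token2id[t] for t in doc).items())
print(bow)
```

Each document is reduced to counts per word id, which is the sparse representation that the LDA model consumes.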


# Models trained on the review text

####  Let's start with 10 topics.

In [30]:
# Build an LDA model.
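# Note: scikit-learn offers a comparable model, sklearn.decomposition.LatentDirichletAllocation
# (LinearDiscriminantAnalysis is a different, supervised method that shares the LDA acronym).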

# Let's start with 10 topics.
num_topics = 10
model = models.LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary, passes=10, workers=4, dtype=np.float64)

In [31]:
# Show the 10 highest-weighted words for each topic. np.argsort sorts ascending,
# so the strongest word in each line is printed last. The topics have no labels.
for ix in range(num_topics):
  top10 = np.argsort(model.get_topics()[ix])[-10:]
  print(f'{ix}:  {" ".join([dictionary[index] for index in top10])}')  # See any patterns?

0:  rust pot get like leak clean filter us smell water
1:  nice price would cheap expect qualiti pictur color like look
2:  small open mug bottl on fit glass cup us lid
3:  make work blade like stick get cook candl cut us
4:  screw look came togeth put box arriv broken on piec
5:  light fan turn us make time clock unit work coffe
6:  fit cover soft mattress comfort like wash bed sheet pillow
7:  purchas ship on item review amazon order receiv return product
8:  back floor would hold get work vacuum us bag chair
9:  broke first last year monei time month us work on
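
The word lists above are ordered from weakest to strongest because `np.argsort` sorts in ascending order. A minimal sketch with hypothetical weights shows the ordering and how to reverse it; `vocab` and `weights` are made-up values, not taken from the trained model.

```python
import numpy as np

vocab = ["water", "smell", "filter", "clean", "leak"]
weights = np.array([0.30, 0.10, 0.05, 0.20, 0.15])  # hypothetical topic-word weights

top3_ascending = np.argsort(weights)[-3:]   # weakest of the top 3 first
top3_descending = top3_ascending[::-1]      # strongest first

ascending_words = [vocab[i] for i in top3_ascending]
descending_words = [vocab[i] for i in top3_descending]
print(ascending_words)
print(descending_words)
```

gensim models also provide a `show_topic` method that returns words with their probabilities, which may be more convenient than indexing the topic matrix by hand.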


In [32]:
# How coherent are these topics?
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=model, texts=clean_lemmatized_text, dictionary=dictionary, coherence='c_v')
coherence = cm.get_coherence()
print("Coherence metric for number of topics = 10: ",coherence)

Coherence metric for number of topics = 10:  0.5571307163553102


In [33]:
# Uses the trained LdaMulticore model `model`, the Dictionary `dictionary`,
# and the list of stemmed reviews `clean_lemmatized_text` from the cells above
num_topics = 10  # number of topics to use for the analysis

# create a list to store the reviews for each topic
reviews_by_topic = [[] for _ in range(num_topics)]

# loop over each review and assign it to its most probable topic
for review in clean_lemmatized_text:
    bow = dictionary.doc2bow(review)
    topic_dist = model.get_document_topics(bow)
    topic_id = max(topic_dist, key=lambda x: x[1])[0]
    reviews_by_topic[topic_id].append(review)
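
The assignment step picks the topic with the largest probability from a list of `(topic_id, probability)` pairs. A small sketch with a hypothetical distribution follows; the values in `topic_dist` are made up for illustration.

```python
# Hypothetical result for one review: (topic_id, probability) pairs.
# Topics with very low probability may be missing from such a list.
topic_dist = [(0, 0.12), (4, 0.61), (7, 0.27)]

# max compares the pairs by their second element, the probability.
topic_id, prob = max(topic_dist, key=lambda pair: pair[1])
print(topic_id, prob)
```

This is a hard assignment: a review that mixes several concerns is counted only under its single most probable topic, so the per-topic review counts should be read with that in mind.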

In [34]:
# Print the number of reviews assigned to each topic, with the first review as an example
for i, reviews in enumerate(reviews_by_topic):
    print(f'Topic {i}: {len(reviews)} reviews')
    for r in reviews[:1]:
        print(f'\t {r}')

Topic 0: 25373 reviews
	 ['almost', 'explod', 'air', 'vent', 'lock', 'pressur', 'regul', 'never', 'move', 'never', 'releas', 'steam', 'like', 'suppos', 'us', 'metal', 'tong', 'remov', 'pressur', 'regul', 'tremend', 'blast', 'steam', 'escap', 'whew']
Topic 1: 87124 reviews
	 ['bit', 'small', 'slat', 'bit', 'flimsi']
Topic 2: 45190 reviews
	 ['ask', 'wan', 'good', 'person', 'man', 'woman', 'god', 'read', 'bibl', 'thing', 'god', 'said', 'god', 'well', 'prayer', 'god', 'well', 'rightou', 'well', 'answer', 'prayer', 'wicket', 'sin', 'god', 'well', 'bless', 'save', 'put', 'god', 'first', 'becus', 'ask', 'faith', 'well', 'happen', 'put', 'trust', 'god', 'count']
Topic 3: 34308 reviews
	 ['eminem', 'terribl', 'white', 'rapper', 'want', 'peopl', 'laugh', 'feel', 'sorri', 'screw', 'man', 'tri', 'make', 'concept', 'album', 'mix', 'togeth', 'childhood', 'life', 'mix', 'poor', 'parodi', 'celebrati', 'on', 'thing', 'describ', 'littl', 'stori', 'confus', 'hell', 'sh', 'ty', 'skit', 'mention', 'eminem

#### Next, change the number of topics, a hyperparameter of the LDA model, to 20 and then to 5 to see the impact on coherence.
#### First, check number of topics = 20

In [35]:
# Build a second LDA model with a different number of topics.
# Note: scikit-learn offers a comparable model, sklearn.decomposition.LatentDirichletAllocation.

# This time use 20 topics.
num_topics = 20
model1 = models.LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary, passes=10, workers=4, dtype=np.float64)

In [36]:
# Show the 10 highest-weighted words for each topic. np.argsort sorts ascending,
# so the strongest word in each line is printed last. The topics have no labels.
for ix in range(num_topics):
  top10 = np.argsort(model1.get_topics()[ix])[-10:]
  print(f'{ix}:  {" ".join([dictionary[index] for index in top10])}')  # See any patterns?

0:  get make strong trai scent us mold ic like smell
1:  would star gift first us time order got back on
2:  even terribl cheap poor worth product wast bui qualiti monei
3:  dai time broke week on last year us month work
4:  knife make work get blade hand bowl cut handl us
5:  hair carpet suction brush us get rug floor clean vacuum
6:  cheapli thin look materi candl feel like cheap pillow made
7:  ok smaller much qualiti good would price small size expect
8:  air set turn fan time batteri unit clock work light
9:  ship came product receiv packag broken box arriv item return
10:  shower top seal cup lid mug bottl leak glass water
11:  print curtain photo nice white chair like pictur color look
12:  fabric soft wash comfort fit cover mattress bag bed sheet
13:  tini room poster love purpos kid realli small cute us
14:  fall on hold top back assembl screw piec togeth put
15:  name lock space close juic video imag contain fit lid
16:  link us grind maker make machin filter water cup coffe


In [37]:
# How coherent are these topics?
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=model1, texts=clean_lemmatized_text, dictionary=dictionary, coherence='c_v')
coherence = cm.get_coherence()
print("Coherence metric for number of topics = 20: ",coherence)

Coherence metric for number of topics = 20:  0.5746281338368437


In [38]:
# Uses the trained LdaMulticore model `model1`, the Dictionary `dictionary`,
# and the list of stemmed reviews `clean_lemmatized_text` from the cells above
num_topics = 20  # number of topics to use for the analysis

# create a list to store the reviews for each topic
reviews_by_topic = [[] for _ in range(num_topics)]

# loop over each review and assign it to its most probable topic
for review in clean_lemmatized_text:
    bow = dictionary.doc2bow(review)
    topic_dist = model1.get_document_topics(bow)
    topic_id = max(topic_dist, key=lambda x: x[1])[0]
    reviews_by_topic[topic_id].append(review)

In [39]:
# print the number of reviews in each topic, with the first review as an example
for i, reviews in enumerate(reviews_by_topic):
    print(f'Topic {i}: {len(reviews)} reviews')
    for r in reviews[:1]:
        print(f'\t {r}')

Topic 0: 12175 reviews
	 ['seem', 'like', 'eminem', 'fallen', 'path', 'rapper', 'rap', 'topic', 'rapper', 'happend', 'rappin', 'bout', 'shroom', 'etc', 'first', 'album', 'quit', 'shame', 'someon', 'differ', 'revolutionairi', 'conform', 'album', 'fade', 'career']
Topic 1: 29713 reviews
	 ['first', 'tell', 'would', 'take', 'month', 'receiv', 'product', 'right', 'purchas', 'annoi', 'wait', 'product', 'daughter', 'want', 'product', 'final', 'arriv', 'differ', 'item', 'look', 'suppos', 'quilt', 'instead', 'pillow', 'hole', 'could', 'stick', 'hand', 'wai', 'pretti', 'much', 'useless', 'bother', 'return', 'try', 'exchang', 'year', 'old', 'want', 'wait', 'anoth', 'month', 'awar', 'seller']
Topic 2: 19124 reviews
	 ['worth', 'bui']
Topic 3: 36707 reviews
	 ['eminem', 'terribl', 'white', 'rapper', 'want', 'peopl', 'laugh', 'feel', 'sorri', 'screw', 'man', 'tri', 'make', 'concept', 'album', 'mix', 'togeth', 'childhood', 'life', 'mix', 'poor', 'parodi', 'celebrati', 'on', 'thing', 'describ', 'litt

#### Now for number of topics = 5

In [40]:
# Build a third LDA model with a smaller number of topics.
# Note: scikit-learn offers a comparable model, sklearn.decomposition.LatentDirichletAllocation.

# This time use 5 topics.
num_topics = 5
model2 = models.LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary, passes=10, workers=4, dtype=np.float64)

In [41]:
# How coherent are these topics?
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=model2, texts=clean_lemmatized_text, dictionary=dictionary, coherence='c_v')
coherence = cm.get_coherence()
print("Coherence metric for number of topics = 5: ",coherence)

Coherence metric for number of topics = 5:  0.508363460152388


In [42]:
# Uses the trained LdaMulticore model `model2`, the Dictionary `dictionary`,
# and the list of stemmed reviews `clean_lemmatized_text` from the cells above
num_topics = 5  # number of topics to use for the analysis

# create a list to store the reviews for each topic
reviews_by_topic = [[] for _ in range(num_topics)]

# loop over each review and assign it to its most probable topic
for review in clean_lemmatized_text:
    bow = dictionary.doc2bow(review)
    topic_dist = model2.get_document_topics(bow)
    topic_id = max(topic_dist, key=lambda x: x[1])[0]
    reviews_by_topic[topic_id].append(review)

# print the number of reviews in each topic, with the first review as an example
for i, reviews in enumerate(reviews_by_topic):
    print(f'Topic {i}: {len(reviews)} reviews')
    for r in reviews[:1]:
        print(f'\t {r}')

Topic 0: 56392 reviews
	 ['tough', 'clean', 'like', 'stainless', 'cookwar', 'job', 'like', 'better', 'solut', 'found', 'though']
Topic 1: 96499 reviews
	 ['ask', 'wan', 'good', 'person', 'man', 'woman', 'god', 'read', 'bibl', 'thing', 'god', 'said', 'god', 'well', 'prayer', 'god', 'well', 'rightou', 'well', 'answer', 'prayer', 'wicket', 'sin', 'god', 'well', 'bless', 'save', 'put', 'god', 'first', 'becus', 'ask', 'faith', 'well', 'happen', 'put', 'trust', 'god', 'count']
Topic 2: 183100 reviews
	 ['product', 'scratch', 'base', 'also', 'top', 'cover', 'even', 'though', 'bought', 'new', 'item', 'complet', 'satisfi']
Topic 3: 74556 reviews
	 ['almost', 'explod', 'air', 'vent', 'lock', 'pressur', 'regul', 'never', 'move', 'never', 'releas', 'steam', 'like', 'suppos', 'us', 'metal', 'tong', 'remov', 'pressur', 'regul', 'tremend', 'blast', 'steam', 'escap', 'whew']
Topic 4: 63199 reviews
	 ['big', 'cool', 'hous', 'power', 'smaller', 'two', 'fan', 'model', 'got', 'sure', 'check', 'window', 's

### Writing the best model to Pickle file

##### Since the LDA model with number of topics = 20 has the highest coherence (0.575, versus 0.557 for 10 topics and 0.508 for 5 topics), it is the best model, and hence we will write this model to a pickle file
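
The comparison of the three coherence scores can be sketched as a lookup. The values are copied from the outputs printed above, and the dictionary name `coherence_by_num_topics` is illustrative. Each model was trained once without a fixed random seed, so the gaps between scores may partly reflect run-to-run variation.

```python
# c_v coherence scores reported above, keyed by number of topics.
coherence_by_num_topics = {
    5: 0.508363460152388,
    10: 0.5571307163553102,
    20: 0.5746281338368437,
}

# A higher c_v score is generally read as more coherent topics.
best_num_topics = max(coherence_by_num_topics, key=coherence_by_num_topics.get)
print(best_num_topics, coherence_by_num_topics[best_num_topics])
```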

In [43]:
import pickle

# Use a context manager so the file is closed after writing
filename = 'topic.model'
with open(filename, 'wb') as f:
    pickle.dump(model1, f)
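
Before relying on a saved file, it can be worth confirming that it loads back correctly. The sketch below shows a pickle round trip with context managers. It uses a stand-in dictionary (`demo_model`, hypothetical) rather than the trained model, and writes to a temporary directory so that it does not overwrite `topic.model`.

```python
import os
import pickle
import tempfile

# Stand-in object (hypothetical); in the notebook this would be the trained model.
demo_model = {"num_topics": 20, "note": "placeholder for a trained model"}

path = os.path.join(tempfile.mkdtemp(), "topic.model")

# The with-blocks close the file handles even if dump or load raises.
with open(path, "wb") as fh:
    pickle.dump(demo_model, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == demo_model)
```

gensim models also have their own `save` and `load` methods, which are an alternative to pickle for persisting a trained model.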