# Process Amazon Reviews to detect any potential health issues

Using the Amazon dataset, we try to detect potentially harmful products by analyzing user reviews. To do so, we use **topic modelling** with the **Latent Dirichlet Allocation (LDA)** model. Our hope is that reviews of potentially harmful products will be grouped into a topic of their own, which we can then identify by inspecting the word weights associated with each topic. Our pipeline is as follows:
* We first import our data and remove the columns we do not need.
* We then preprocess the reviews: we remove stop words and stem the remaining words with the **nltk** library to standardize them.
* Using the **gensim** library, we create a corpus representing all stemmed reviews in a **bag of words** representation.
* Finally, we run the LDA model to create our topics.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import gensim
%matplotlib inline

In [2]:
# Download the stopwords and wordnet corpora (only needs to be executed once).
# wordnet is only required for lemmatization; the cells below stem instead.
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/fares/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/fares/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
REVIEWS_PATH = "cleaned_tenth.json"

We start by importing the data:

In [4]:
reviews = pd.read_json(REVIEWS_PATH, lines=True)

#TBD: Which columns to keep/remove
reviews = reviews.drop(columns=['reviewerName', 'helpful', 'reviewTime'])

#Convert the utc timestamp to readable dates
reviews['unixReviewTime'] = pd.to_datetime(reviews['unixReviewTime'],unit='s')

reviews.head()

Unnamed: 0,reviewerID,asin,unixReviewTime,reviewText,overall,summary
0,A1ZQZ8RJS1XVTX,0657745316,2013-10-11,"No sugar, no GMO garbage, no fillers that come...",5,Best vanilla I've ever had
1,A31W38VGZAUUM4,0700026444,2012-12-06,"This is my absolute, undisputed favorite tea r...",5,Terrific Tea!
2,A3I0AV0UJX5OH0,1403796890,2013-12-02,I ordered spongbob slippers and I got John Cen...,1,grrrrrrr
3,A3QAAOLIXKV383,1403796890,2011-06-12,The cart is fine and works for the purpose for...,3,Storage on Wheels Cart
4,AB1A5EGHHVA9M,141278509X,2012-03-24,This product by Archer Farms is the best drink...,5,The best drink mix


We create a function that processes the reviews using the nltk library:
* tokenize the sentence,
* remove any stop words,
* remove tokens consisting only of punctuation (such as '!!!' or '...', which were quite common),
* remove words below a given length,
* stem the words so that they are all represented in a standardized way.

In [5]:
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def process_text(sentence):
    """Tokenize a review, filter its tokens, and stem what remains."""
    tokens = nltk.word_tokenize(sentence)
    # Note: the stop-word test runs on the original casing, so capitalized
    # stop words such as "My" or "It" pass the filter and are kept.
    kept = [word.lower() for word in tokens
            if word not in stop_words
            and not all(c in string.punctuation for c in word)  # drop '!!!', '...'
            and len(word) >= 2]                                 # drop 1-character tokens
    return [stemmer.stem(word) for word in kept]

print(process_text('I ordered spongbob slippers and I got John'))

['order', 'spongbob', 'slipper', 'got', 'john']
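
The stop-word test in `process_text` is case-sensitive, while nltk's English stop-word list is lowercase. The following library-free sketch illustrates the difference between testing before and after lowercasing. The tiny `STOP_WORDS` set, the regex `tokenize` helper and both filter functions are assumptions made for illustration only; they stand in for nltk's stop-word list and `word_tokenize` and are not part of the pipeline above.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses nltk's full list.
STOP_WORDS = {"i", "it", "my", "do", "the", "and", "was", "a"}

def tokenize(sentence):
    # Simplified stand-in for nltk.word_tokenize: runs of letters and apostrophes.
    return re.findall(r"[A-Za-z']+", sentence)

def filter_case_sensitive(sentence):
    # Stop-word test on the original casing, lowercasing only afterwards.
    return [w.lower() for w in tokenize(sentence)
            if w not in STOP_WORDS and len(w) >= 2]

def filter_case_insensitive(sentence):
    # Lowercase first, then test against the lowercase stop-word set.
    lowered = (w.lower() for w in tokenize(sentence))
    return [w for w in lowered if w not in STOP_WORDS and len(w) >= 2]

sample = "My bad. It was a mistake"
print(filter_case_sensitive(sample))    # capitalized stop words survive
print(filter_case_insensitive(sample))  # they are removed
```

Lowercasing before the test would shrink the vocabulary passed to LDA, at the cost of changing the stemmed output and the topics shown in this notebook.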


We add a new column to our dataframe containing the processed reviewText. Note that we only keep reviews with a rating below 3, on the assumption that reviews reporting health issues tend to have a low rating.

In [6]:
# Keep only the reviews with a rating below 3
stemmed = reviews[reviews['overall'] < 3].copy()
stemmed['reviewStemmed'] = stemmed['reviewText'].apply(process_text)

stemmed.head()

Unnamed: 0,reviewerID,asin,unixReviewTime,reviewText,overall,summary,reviewStemmed
2,A3I0AV0UJX5OH0,1403796890,2013-12-02,I ordered spongbob slippers and I got John Cen...,1,grrrrrrr,"[order, spongbob, slipper, got, john, cena, ha..."
5,A3DTB6RVENLQ9Q,1453060375,2013-03-03,Don't buy this item - rip off at this price. ...,1,Oops. Made a mistake and ordered this. I mis...,"[do, n't, buy, item, rip, price, my, bad, mist..."
46,A3KJ9TZ2HLL7SA,5901002482,2012-11-28,I wrote an earlier scathing review of this pro...,1,Packaging problem,"[wrote, earlier, scath, review, product, while..."
48,ACEL2LY99MAB0,6162362183,2014-04-21,I read the reviews before I bought it. It got ...,2,Very disappointed.,"[read, review, bought, it, got, excit, review,..."
61,A2F3CK8F9VIFPL,616719923X,2013-07-29,I bought it because i like green tea but the t...,1,Yuck,"[bought, like, green, tea, tast, bad, came, me..."


Here, we simply store the dataframe in a pickle file for later use.

In [7]:
stemmed.to_pickle("reviews_stemmed_tenth")

Now, we create a dictionary containing all the words found in our processed reviews, and a corpus consisting of all reviews in a bag of words representation.
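
To make the bag of words representation concrete, here is a library-free sketch of the idea: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The helpers `build_dictionary` and `doc_to_bow` are hypothetical names used for illustration; gensim's `Dictionary` and `doc2bow` play these roles in the notebook, and the ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_dictionary(docs):
    # Assign each distinct token an integer id in order of first appearance.
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    # Count the known tokens and return sorted (token_id, count) pairs.
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two toy stemmed reviews (made up for illustration)
docs = [["tast", "bad", "bad"], ["green", "tea", "tast"]]
token2id = build_dictionary(docs)
bows = [doc_to_bow(d, token2id) for d in docs]

print(token2id)
print(bows)
```

Word order is discarded in this representation; only the counts per document remain, which is the input LDA works with.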

In [9]:
from gensim import corpora
dictionary = corpora.Dictionary(stemmed['reviewStemmed'])

corpus = [dictionary.doc2bow(text) for text in stemmed['reviewStemmed'].values]
# Uncomment to inspect the word counts of the first review:
# for c in corpus[:1]:
#     for word_id, freq in c:
#         print(dictionary[word_id] + ": " + str(freq))

We now create our LDA model and look at the topics it finds.

In [22]:
ldamodel = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics = 75, passes=15, id2word=dictionary, minimum_probability=0)

ldamodel.print_topics(num_topics=-1)

[(0,
  '0.048*"\'s" + 0.019*"joe" + 0.018*"trader" + 0.016*"n\'t" + 0.012*"one" + 0.011*"nut" + 0.010*"hazelnut" + 0.009*"like" + 0.009*"\'re" + 0.009*"flavor"'),
 (1,
  '0.060*"brand" + 0.028*"cocoa" + 0.022*"lime" + 0.018*"juic" + 0.018*"crust" + 0.017*"sour" + 0.016*"costco" + 0.013*"key" + 0.011*"planter" + 0.010*"fell"'),
 (2,
  '0.050*"jerki" + 0.037*"lipton" + 0.028*"beef" + 0.025*"n\'t" + 0.012*"look" + 0.011*"money" + 0.011*"turkey" + 0.010*"tri" + 0.010*"like" + 0.010*"thi"'),
 (3,
  '0.080*"packag" + 0.061*"veri" + 0.051*"bag" + 0.035*"disappoint" + 0.031*"the" + 0.026*"oatmeal" + 0.025*"plastic" + 0.016*"away" + 0.013*"candi" + 0.011*"one"'),
 (4,
  '0.010*"pistachio" + 0.008*"bad" + 0.008*"he" + 0.008*"\'s" + 0.007*"it" + 0.007*"n\'t" + 0.007*"even" + 0.007*"decemb" + 0.007*"2014" + 0.006*"ick"'),
 (5,
  '0.035*"product" + 0.032*"cereal" + 0.018*"\'s" + 0.017*"one" + 0.014*"food" + 0.012*"tast" + 0.010*"eat" + 0.008*"would" + 0.008*"it" + 0.007*"good"'),
 (6,
  '0.050*"mix

In [28]:
ldamodel.save('lda_tenth.model')
#load : 
# ldamodel =  gensim.models.LdaModel.load('lda_tenth.model')
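
With 75 topics, reading every word list by hand is tedious. One possible way to shortlist candidates is to score each topic by the total weight it gives to a set of health-related stems. The sketch below assumes the topics are available as `(topic_id, [(word, weight), ...])` pairs, which to our knowledge is the shape returned by `ldamodel.show_topics(num_topics=-1, formatted=False)`. The `HEALTH_STEMS` set, the `score_topics` helper and the second toy topic are hypothetical and only serve as an illustration.

```python
# Hypothetical list of stemmed health-related keywords (assumption).
HEALTH_STEMS = {"sick", "allergi", "cancer", "vomit", "poison"}

def score_topics(topics, keywords):
    # Sum, per topic, the weights of the words that belong to the keyword set,
    # then rank the topics from highest to lowest score.
    scores = {
        topic_id: sum(weight for word, weight in terms if word in keywords)
        for topic_id, terms in topics
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy input: topic 0 reuses a few weights printed above, topic 1 is made up.
toy_topics = [
    (0, [("joe", 0.019), ("trader", 0.018), ("nut", 0.011)]),
    (1, [("sick", 0.030), ("allergi", 0.020), ("tast", 0.010)]),
]

ranked = score_topics(toy_topics, HEALTH_STEMS)
print(ranked)  # topics with the highest health-keyword weight come first
```

Such a score only narrows down the search; the top-ranked topics still need to be checked by reading some of their reviews.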

Now that we have our topics, we add a new column to our dataframe that records the most probable topic of each review:

In [23]:
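def dominant_topic(tokens):
    # Return the id of the most probable topic for a tokenized review
    topic_probs = ldamodel.get_document_topics(dictionary.doc2bow(tokens))
    return max(topic_probs, key=lambda pair: pair[1])[0]

stemmed['topic'] = stemmed['reviewStemmed'].apply(dominant_topic)
stemmed.head()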

Unnamed: 0,reviewerID,asin,unixReviewTime,reviewText,overall,summary,reviewStemmed,topic
2,A3I0AV0UJX5OH0,1403796890,2013-12-02,I ordered spongbob slippers and I got John Cen...,1,grrrrrrr,"[order, spongbob, slipper, got, john, cena, ha...",10
5,A3DTB6RVENLQ9Q,1453060375,2013-03-03,Don't buy this item - rip off at this price. ...,1,Oops. Made a mistake and ordered this. I mis...,"[do, n't, buy, item, rip, price, my, bad, mist...",12
46,A3KJ9TZ2HLL7SA,5901002482,2012-11-28,I wrote an earlier scathing review of this pro...,1,Packaging problem,"[wrote, earlier, scath, review, product, while...",36
48,ACEL2LY99MAB0,6162362183,2014-04-21,I read the reviews before I bought it. It got ...,2,Very disappointed.,"[read, review, bought, it, got, excit, review,...",16
61,A2F3CK8F9VIFPL,616719923X,2013-07-29,I bought it because i like green tea but the t...,1,Yuck,"[bought, like, green, tea, tast, bad, came, me...",68
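
The assignment above boils down to an argmax over `(topic_id, probability)` pairs. The minimal sketch below shows that step in isolation; the function name `dominant_topic_with_prob` and the probabilities in `example` are made up for illustration. Returning the probability alongside the id is a design choice that would allow discarding reviews whose best topic is only weakly supported.

```python
def dominant_topic_with_prob(topic_probs):
    # topic_probs: list of (topic_id, probability) pairs for one document.
    # max() with a key avoids sorting the whole list just to take one element.
    return max(topic_probs, key=lambda pair: pair[1])

# Made-up distribution over three topics for a single review
example = [(0, 0.05), (43, 0.70), (12, 0.25)]

topic_id, prob = dominant_topic_with_prob(example)
print(topic_id, prob)
```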


Topic 43 seems to correspond to what we are looking for:

In [27]:
healthReviews = stemmed[stemmed['topic'] == 43].reviewText.values

# Print the first three reviews assigned to this topic
for review in healthReviews[:3]:
    print(review)

We were looking for food coloring that is natural, NOT artificial, due to concerns about health issues caused by artificial coloring (ADHD, cancer, thyroid issues, etc.). These contain Red 40 and other baddies :( We will not be using them.
The main ingredient in this is monosodium glutamate. If it is for miso soup, keep looking.
I guess someone in the warehouse needed a pack of gum because when it arrive the box was ripped an a pack of gum removed.  Your very welcome.


Let us now visualise our topics. For that we use the **pyLDAvis** library, which makes it easy to create interactive topic modelling plots:

In [2]:
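import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in recent pyLDAvis releases

pyLDAvis.enable_notebook()  # render the visualisation inline in the notebook

lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary=ldamodel.id2word, sort_topics=False)
lda_display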
