In [1]:
#load in the relevant libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from lxml import html  
import requests
import csv

#### The scraper
Note that scraping this many pages can take 15-20 minutes, which is why I included the results as a CSV file, so you only have to load the CSV. Of course you can still run the code.
If you do decide to run the scraper, please enable the line below the scraper that saves the DataFrame to CSV.

In [65]:
#Set up the Amazon product we want to scrape, in this case a water-resistant speaker. We send a browser User-Agent string, because
#without it the request identifies itself as a script and Amazon is likely to block it
df = pd.DataFrame()
#here we put the whole page scraper in a loop that changes the page number in the url, so we don't scrape 1 page with 8 reviews, but 700 pages with 8 reviews each
#page numbers start at 1, so we loop over pages 1 to 700
for i in range(1, 701):
    amazon_url = f"https://www.amazon.com/VicTsing-Wireless-Water-Resistant-Hands-Free-Speakerphone/product-reviews/B00MYYCGKW/ref=cm_cr_getr_d_paging_btm_{i}?ie=UTF8&reviewerType=all_reviews&pageNumber={i}"

    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    
    #request the page and parse its html
    headers = {"User-Agent": user_agent}
    page = requests.get(amazon_url, headers = headers)
    parser = html.fromstring(page.content)
    
    #here we link every element we want to scrape to the right html element. For now I selected only a few html elements, but of course you can scrape others as well, like the username,
    #related pictures and how many people found the review useful
    xpath_reviews = '//div[@data-hook="review"]'
    xpath_rating  = './/i[@data-hook="review-star-rating"]//text()' 
    xpath_title   = './/a[@data-hook="review-title"]//text()'
    xpath_date    = './/span[@data-hook="review-date"]//text()'
    xpath_body    = './/span[@data-hook="review-body"]//text()'
    
    
    #here we select the html of every review on the page, then loop over the reviews and pull out each element we want.
    #after that we put the data elements in a dict and append the dict to our dataframe
    reviews = parser.xpath(xpath_reviews)
    for review in reviews:
        rating  = review.xpath(xpath_rating)
        title   = review.xpath(xpath_title)
        date    = review.xpath(xpath_date)
        body    = review.xpath(xpath_body)

        review_dict = {'rating': rating,
                       'title': title,             
                       'date': date,
                       'body': body
                       }
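        #DataFrame.append was removed in pandas 2.0, so we concatenate a one-row dataframe instead
        df = pd.concat([df, pd.DataFrame([review_dict])], ignore_index=True)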
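
To see what the XPath expressions in the loop return, here is a minimal sketch that runs them against a small hand-written HTML snippet instead of a live Amazon page. The snippet is an assumption that only mimics the `data-hook` attributes used above; real review pages contain much more markup and may change over time.

```python
from lxml import html

# Hypothetical markup that mimics the data-hook attributes used by the scraper.
sample_page = """
<html><body>
  <div data-hook="review">
    <i data-hook="review-star-rating"><span>4.0 out of 5 stars</span></i>
    <a data-hook="review-title">Good deal</a>
    <span data-hook="review-body">Works fine in the shower.</span>
  </div>
</body></html>
"""

parser = html.fromstring(sample_page)
review = parser.xpath('//div[@data-hook="review"]')[0]

# Each xpath call ending in //text() returns a list of text nodes, not a plain string.
rating = review.xpath('.//i[@data-hook="review-star-rating"]//text()')
title = review.xpath('.//a[@data-hook="review-title"]//text()')
body = review.xpath('.//span[@data-hook="review-body"]//text()')

print(rating, title, body)
```

Because every field comes back as a list, the saved CSV contains values such as `['4.0 out of 5 stars']`, which is why the rating function compares against that exact string.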

In [2]:
#df.to_csv("Amazonwaterspeaker.csv", index = False, quotechar='"') #enable this line when running the scraper.
#open the data file and show the dataframe
df = pd.read_csv("Amazonwaterspeaker.csv")
df.head()

Unnamed: 0,body,date,rating,title
0,['This is a good deal for the price. I bought...,"['November 12, 2017']",['4.0 out of 5 stars'],['This is a good deal for the price']
1,"[""[I am changing my review from 5 stars to 3 s...","['January 10, 2018']",['3.0 out of 5 stars'],"['Great shower speaker, I use it every day - u..."
2,"[""Exactly what I was looking for... I just wan...","['February 20, 2017']",['5.0 out of 5 stars'],['Made me a shower rock star']
3,"[""This is my second VicTsing shower speaker in...","['August 3, 2017']",['5.0 out of 5 stars'],['Great sound quality']
4,['Used it pretty much daily in the shower but ...,"['May 9, 2017']",['2.0 out of 5 stars'],['Stopped recharging after about 8 months']


In [3]:
#here we define a function which transforms our 1-5 star ratings into 2 classes, positive (4 or 5 stars) and negative (1 to 3 stars)
def ratingclass(rating):
    if rating == "['4.0 out of 5 stars']" or rating == "['5.0 out of 5 stars']":
        return "positive"
    else:
        return "negative"
    
#apply the function to our dataframe
df["rating"] = df["rating"].apply(ratingclass)
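
The function above matches two exact strings, so any variation in how a rating is stored falls into the negative class. As an alternative, here is a minimal sketch that extracts the number from the stored text first. The name `rating_to_class` and the `threshold` parameter are hypothetical and not part of the original notebook.

```python
import re

def rating_to_class(raw, threshold=4.0):
    """Map a stored rating such as "['4.0 out of 5 stars']" to a class label.

    Returns None when no star value can be found, so unparseable rows
    can be inspected or dropped instead of silently counting as negative.
    """
    match = re.search(r"(\d+(?:\.\d+)?) out of 5", str(raw))
    if match is None:
        return None
    stars = float(match.group(1))
    return "positive" if stars >= threshold else "negative"

print(rating_to_class("['5.0 out of 5 stars']"))
print(rating_to_class("['3.0 out of 5 stars']"))
```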


In [4]:
#read out the new dataframe
df.head()

Unnamed: 0,body,date,rating,title
0,['This is a good deal for the price. I bought...,"['November 12, 2017']",positive,['This is a good deal for the price']
1,"[""[I am changing my review from 5 stars to 3 s...","['January 10, 2018']",negative,"['Great shower speaker, I use it every day - u..."
2,"[""Exactly what I was looking for... I just wan...","['February 20, 2017']",positive,['Made me a shower rock star']
3,"[""This is my second VicTsing shower speaker in...","['August 3, 2017']",positive,['Great sound quality']
4,['Used it pretty much daily in the shower but ...,"['May 9, 2017']",negative,['Stopped recharging after about 8 months']


In [5]:
#here we transform our reviews to a text matrix, so we can use it in our model
text = df["body"].values.astype("U")
vect = CountVectorizer(stop_words='english').fit(text)
matrix = vect.transform(text)

#### Bag of words model
Here we apply the bag of words model, which basically counts how often each word occurs in each review text. The goal is to turn text into numeric data that a machine learning model can use.
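
As a small illustration of what the count matrix looks like, here is a sketch with two made-up review texts (the texts are assumptions, not rows from the dataset). Each row of the matrix is one review and each column is one word from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, only for illustration.
reviews = ["great speaker great price", "speaker stopped working"]

vect = CountVectorizer().fit(reviews)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
counts = vect.transform(reviews).toarray()

print(vocab)   # ['great', 'price', 'speaker', 'stopped', 'working']
print(counts)  # first row: [2 1 1 0 0], second row: [0 0 1 1 1]
```

Word order is lost in this representation: only the counts per word remain.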

In [6]:
#split our data into train and test set. Our X is the matrix we created from the bag of words model and our Y is the rating class
X = matrix
y = df["rating"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [7]:
#put the train data in our model and train our model.
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#### Naive Bayes classifer model
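The model we are using is Naïve Bayes. The Naïve Bayes classifier is a supervised machine learning model that classifies based on probability. From the training data it estimates, for each class, how likely each word is to appear. For a new review it multiplies the probabilities of the words it contains (together with the overall probability of the class) to see what chance each class has, and picks the most likely one. Of course, the more training examples it sees, the better it can estimate these probabilities.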
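
The multiplication can be sketched by hand. The word counts and class priors below are invented for illustration and are not taken from the dataset; `alpha = 1.0` mirrors the smoothing value shown in the `MultinomialNB` output. Logarithms are summed instead of multiplying raw probabilities, which gives the same ranking while avoiding very small numbers.

```python
import math

# Hypothetical training counts: how often each word appeared per class.
counts = {
    "positive": {"great": 8, "broke": 1},
    "negative": {"great": 2, "broke": 6},
}
priors = {"positive": 0.8, "negative": 0.2}  # assumed share of each class
vocab = ["great", "broke"]
alpha = 1.0  # additive smoothing, so unseen words never get probability 0

def log_score(words, label):
    """Log of prior times the smoothed probability of every word in the review."""
    total = sum(counts[label].values()) + alpha * len(vocab)
    score = math.log(priors[label])
    for word in words:
        score += math.log((counts[label].get(word, 0) + alpha) / total)
    return score

def classify(words):
    """Return the class with the highest score."""
    return max(priors, key=lambda label: log_score(words, label))

print(classify(["great"]))
print(classify(["broke", "broke"]))
```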

In [8]:
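#calculate our accuracy on the full dataset (training and test rows together), so this number is optimistic compared with test-only accuracy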
clf.score(X, y)

0.9199640287769785

In [9]:
#evaluate our predictions on the test set
from sklearn.metrics import classification_report
y_test_pred = clf.predict(X_test)
print(classification_report(y_test, y_test_pred))

             precision    recall  f1-score   support

   negative       0.75      0.62      0.68       287
   positive       0.92      0.96      0.94      1381

avg / total       0.89      0.90      0.90      1668



### Evaluating our model
We can see that positive reviews are much easier to predict than negative reviews. This makes sense because, firstly, there are a lot more positive reviews than negative ones.
Secondly, most positive reviews are very short and straight to the point, while negative reviews are longer, which gives a larger margin of error because our model only predicts on words and not sentences.
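
To put the class imbalance in numbers, here is a small sketch that uses only the support and recall values printed in the classification report. It computes the accuracy of a trivial model that always predicts positive, and roughly how many negative test reviews the classifier missed (approximate, because the report rounds recall to two decimals).

```python
# Support values and negative recall copied from the classification report.
support = {"negative": 287, "positive": 1381}
negative_recall = 0.62

total = sum(support.values())

# Accuracy of always predicting the majority class.
baseline_accuracy = support["positive"] / total

# Approximate number of negative reviews predicted as positive.
missed_negatives = round((1 - negative_recall) * support["negative"])

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"missed negative reviews (approx.): {missed_negatives}")
```

The overall score of about 0.90 should be read against this baseline of roughly 0.83 rather than against 0.50.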

In [10]:
#predict on our current dataframe, so we can see where it went wrong
df["prediction"] = clf.predict(X)

In [11]:
#show our new dataframe with predictions so we can analyse specific cases
df.head()

Unnamed: 0,body,date,rating,title,prediction
0,['This is a good deal for the price. I bought...,"['November 12, 2017']",positive,['This is a good deal for the price'],positive
1,"[""[I am changing my review from 5 stars to 3 s...","['January 10, 2018']",negative,"['Great shower speaker, I use it every day - u...",negative
2,"[""Exactly what I was looking for... I just wan...","['February 20, 2017']",positive,['Made me a shower rock star'],positive
3,"[""This is my second VicTsing shower speaker in...","['August 3, 2017']",positive,['Great sound quality'],negative
4,['Used it pretty much daily in the shower but ...,"['May 9, 2017']",negative,['Stopped recharging after about 8 months'],negative
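
One way to find candidates for closer inspection is to keep only the rows where the label and the prediction disagree. The sketch below rebuilds just the `rating` and `prediction` columns of the five rows shown above; on the full dataframe the same comparison would be applied to `df` directly.

```python
import pandas as pd

# The rating and prediction columns of the five rows shown in df.head().
example = pd.DataFrame({
    "rating":     ["positive", "negative", "positive", "positive", "negative"],
    "prediction": ["positive", "negative", "positive", "negative", "negative"],
})

# Boolean mask: True where the model disagrees with the label.
wrong = example[example["rating"] != example["prediction"]]

print(wrong)
```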


#### Looking at the cases that were predicted wrong
Here we look at 3 reviews which were predicted wrong, and why. You don't have to read the reviews in full; you can skip to the "Why it was wrong" paragraph after each one. The 3 reviews I chose were:

13: I like the product, and it serves its purpose of being a shower bluetooth speaker with solid range, but the carabiner hook broke within a day which I wasn't too happy with, and the design is somewhat flawed.

Cons
- The carabiner hook broke as soon as I tried to attach it to a command hook, and the quality of the carabiner was also very shoddy
- The skip song/previous song buttons are the same as the volume buttons so sometimes when I'm trying to increase the volume I skip the song instead
- There's no indicator to how much battery you have left
- It take awhile to shut the speaker off while pressing down the power button

Pros
+ Price: This is a solid speaker for the price you pay for it. It serves the purpose that I needed which was just playing music in my shower. I've accidentally splashed some water on it, and nothing bad happened
+ Pairs quickly to computer or phone with good range
+ Audio quality is good enough for my needs (haven't tried phone function yet.

Why it was wrong: 13 is wrongly predicted as positive. I think this is because it contains a lot of words such as like and quality, which are very common in the positive reviews. The review gives some credit to the speaker and then explains everything that is wrong with it, but it does so with measured criticism rather than with negative buzzwords like garbage or bad.

28: So, I bought this speaker because I love listening to music while I'm showering. It's just so peaceful. After seeing so many awesome reviews, I thought I'd love it. Yeah, no. Not so much...

PROS: It is actually waterproof and is very water resistant. It's not made of cheap plastic and is durable. The selection has different colors and it is sturdy. This is not battery operated and it charges via USB cord that comes inside the box. It pairs very easily with your device, within seconds. The speaker notifies you that it is turned on or is turning off with a voice.

CONS: The suction cup is extremely weak and couldn't hold the speaker to the inside of my shower. It's only tile! The sound quality is not great. The music can be blasted and it's still hard to make out the words in the song. Blurry output... This takes forever to dry. Only when it's dry is the sound quality at its best, which defeats the purpose of a 'Bluetooth shower speaker'. The voice that tells you whether the device is on or turning off is so loud and there is no way of controlling the volume of this. My phone was no more than two feet away from the speaker and it kept breaking up. My phone said that the signal was strong, so it wasn't that. It was the overall quality of the speaker. It's frustrating having to wonder when the next break in the song will be. Nothing very peaceful about that...

I would give this product 2.5 stars, but I am so disappointed with it that I refuse to give it the second half to make it 3 stars, so it's staying at 2. Hopefully this company moves on to making other products because speakers are not their strong point.

This product is worth testing personally if you find a good deal. I hope that it was just my luck with this speaker and all those great reviews are from customers who actually loved it. Whether or not you decide to keep it, Amazon's customer service is great! I had it shipped back a day after my complaint.

So, it looks like I'm back to singing in the shower. Good luck to everyone who is trying to solve the same issue. Lol

UPDATE: After I had returned this product back to the Amazon, I was asked if I would consider buying another speaker of a different brand. I had already purchased a new speaker to replace this one because I was not fond of the quality. It has now been a while since I returned the speaker and I am receiving emails from the seller, VicTsing, itself. Because of its attempts to settle a dispute between one of its products and a customer, I have raised my review one star. Yes, I am still disappointed with the product that I had purchased; yet, I am overly satisfied with the company's customer service. They are polite, attentive, helpful and genuinely eager to satisfy the customer's needs. For this, I thank you.

Why it was wrong: Here we see the same problem. The reviewer can't believe there are so many great reviews and that people are so positive, which already adds a lot of positive buzzwords. This review also shows the problem with bag of words. She says "not great" and "disappointed with the quality", but of course the algorithm counts great and quality as positive buzzwords, because it interprets words and not sentences.
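
The "not great" problem can be reproduced in a few lines. The two sentences below are made up for illustration. With `stop_words='english'`, as used in this notebook, the word "not" is on scikit-learn's built-in stop word list, so both sentences end up with identical counts. Counting word pairs with `ngram_range=(1, 2)` and no stop word list is one possible way to keep "not great" as its own feature; this is a sketch of an option, not something the original model does.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with opposite meaning.
docs = ["the sound quality is great", "the sound quality is not great"]

# Same settings as the notebook: single words, English stop words removed.
unigram = CountVectorizer(stop_words="english").fit(docs)
row_a, row_b = unigram.transform(docs).toarray()
identical = bool((row_a == row_b).all())

# Alternative sketch: keep all words and also count pairs of adjacent words.
bigram = CountVectorizer(ngram_range=(1, 2)).fit(docs)
keeps_negation = "not great" in bigram.vocabulary_

print(sorted(unigram.vocabulary_))
print("identical rows:", identical)
print("bigram model has 'not great':", keeps_negation)
```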

29: My daughters love their speakers. Mainly bought them for when they travel for competition. they can stick these speakers on the bus window and it's good to go. No one has to worry about holding the speaker or having it fall off. The suction lasts a long time - hours. Because the speakers are waterproof, it is okay to hang on their backpacks even when it rains. Perfect size speaker. Charge lasts long too. Easy to use.

Update 3 mo after purchase: One of the speakers the power button no longer works. The sound is off too. Basically, when it worked it was GREAT. The other speaker still works and my daughter uses it constantly. Maybe the second speaker was just a bad buy. My daughter with the not working speaker wants me to order another for her cause she really liked it. So... what to do? Order one more and keep all the packaging because if this one stops working - I am returning to Amazon!

Why it was wrong: The last one is the most obvious case. She was at first very positive about the product, but the quality soon declined and she updated her review to negative. I think this kind of review is among the hardest for a text mining model to get right and will almost always be classified wrong.
