# Project - Analysing Amazon.com reviews

 **Note**: *This data is based off Amazon's public pages, and the aim of this project is  **NOT** to violate any of Amazon's policies*.  

Today I will use web scraping tools, Natural Language Processing, data cleaning and data analysis to find the most common words in Amazon reviews of books. 

I will use "Harry Potter and the Deathly Hallows" as a standing example.

Let's start with the libraries that need to be imported.

In [3]:
# make HTTP calls
import requests

# parse HTML files
from bs4 import BeautifulSoup

# use NLP
import nltk

# pretty printing
import pprint

There is one large function, which will be explained step by step with screenshots and extensive comments throughout the code.

In [362]:
# function signature: takes the book name, author and list of tags (to be explained later)
def word_freq(name, author, list_of_tags):

    # The goal here is to fully automate extracting reviews. 
    # To start with, we need the corresponding Amazon page URL that users see, 
    # where the book's types and prices are listed. 
    # This page contains the ASIN (Amazon Standard Identification Numbers: https://www.amazon.com/gp/seller/asin-upc-isbn-info.html), 
    # which will be extremely crucial in getting the reviews.

    # Using that, one can identify the URL of the book's reviews page.
    # To that end, a Google search request is made. 
    # According to the official documentation, the method to encode the 
    # search query is 'http://google.com/search?q=[query]+Amazon.com+Book' with
    # + as the delimiter between the words in the query.
    # Amazon.com and Book are always included in the search: Amazon.com is
    # our target site, and 'Book' is mentioned to disambiguate the book from any
    # movie of the same name, so that the search results are URLs for the book
    # and not the movie.

    # For example, to make a search for "Harry Potter and the Deathly Hallows" with author "JK Rowling", 
    # the correct encoding is 'http://google.com/search?q=Harry+Potter+and+the+Deathly+Hallows+JK+Rowling+Amazon.com+Book'.

    # To that end, the next four statements encode the book name and author in the above format.

    book_list = name.split()
    author_list = author.split()

    query = "+".join(book_list)
    query += "+" + "+".join(author_list)

    # The next statement puts the query string together and makes a request using the requests library.
    # The 'dummy' part of the variable names will be explained later.
    
    dummy_r = requests.get('http://google.com/search?q=' + query + '+Amazon.com+Book')
    
    # extract the text (HTML) content of the requested page
    dummy_source_code = dummy_r.text
    
    # make this text more accessible by converting it to HTML Document Object Tree (if you don't know what this term means,
    # just think of the output of dummy_soup as a tree with the first tag (<html>) as the root, and other nested tags in it
    # such as <head> and <body> as its children. The conversion makes accessing the children and particular tags easier).
    # Done through the BeautifulSoup library
    dummy_soup = BeautifulSoup(dummy_source_code, "lxml")
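The query-encoding step at the start of the function can be tried on its own. Below is a minimal sketch, written for Python 3 (in Python 2, `quote_plus` lives in `urllib` instead of `urllib.parse`). The helper name `build_search_url` is my own and not part of the function above. Using `quote_plus` instead of a plain `"+".join(...)` is an assumption on my part that titles may contain characters such as apostrophes, which need escaping in a URL.

```python
from urllib.parse import quote_plus

def build_search_url(name, author):
    # quote_plus turns spaces into '+' and escapes characters that are unsafe in a URL
    query = quote_plus(name + " " + author)
    return "http://google.com/search?q=" + query + "+Amazon.com+Book"

url = build_search_url("Harry Potter and the Deathly Hallows", "JK Rowling")
print(url)

# a title with an apostrophe, to show the escaping
tricky = build_search_url("Ender's Game", "Orson Scott Card")
print(tricky)
```

For plain titles like the Harry Potter one, this produces the same URL as the join-based code in the function.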

Now that we have the HTML content of the search results page, we need the correct URL to make a request once again. 

We can't just do with ANY of the search results; we need 'amazon.com'! 
To find the correct one, we will have to get a list of all the URLs and 'find' the one we want.

<img src = "GS1.png" style = "width: 50%;">
<img src = "GS2.png" style = "width: 50%;">

The above images show the results page with our query. 
We want the Amazon.com one, which has been boxed for convenience.

Note that the first few entries are NOT '.com' entries. Google probably prefers local Amazon sites over the global American one. 
Since I am running this search from India, the '.in' makes sense.


But how does one get the URL now that we know where it is? Chrome's Inspector comes to help.

After some digging I found that each search result 'rectangle' is enclosed within an 'h3' tag.
Within that, the URL is the 'href' attribute of the 'a' tag inside it.

See image below for details.

<img src = "GS3.png" style = "width: 50%;">
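The nesting described above ('h3', then 'a', then its 'href') can be checked on a tiny hand-written snippet. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup, purely so that it runs without extra installs. The class name `ResultLinkParser` and the sample HTML are my own illustrations, not Google's real markup.

```python
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collect the href of every <a> tag that sits inside an <h3> tag."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

# hand-written sample: one link inside an h3, one link outside of it
sample = (
    '<div>'
    '<h3><a href="https://www.amazon.com/Harry-Potter-Deathly-Hallows-Rowling/dp/1606868829">Book</a></h3>'
    '<a href="/somewhere-else">not a result title</a>'
    '</div>'
)

parser = ResultLinkParser()
parser.feed(sample)
links = parser.links
print(links)
```

Only the link inside the 'h3' is kept, which is the same filtering that the BeautifulSoup code performs.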

Time to use BeautifulSoup again!

In [None]:
    # purpose explained above
    h3_lst = dummy_soup.find_all('h3')
    list_of_urls = []
    
    # loop for all h3 elements
    for items in h3_lst:
        try:
            a = items.find('a')
            href = a['href']

            # only want the main Amazon site. Note the '/' at the end is important.
            # Without it, sites such as amazon.com.au will also match.
            # A more sophisticated way would be to use regex, but this does the trick!
            if 'amazon.com/' in href: 
                
                # remove extraneous URL information that is sometimes found in certain URLs
                start = href.index("=")
                end = href.index("&")
                href_stripped = href[start + 1:end]
                
                # If you look at the extracted URL, it will be like
                # 'https://www.amazon.com/Harry-Potter-Deathly-Hallows-Rowling/dp/1606868829'
                # The number after '/dp/' is the ASIN!
                
                # Turns out to get the main review page, we just need to replace the 'dp' with
                # 'product-reviews'. That's it!
                
                href_new = href_stripped.replace('/dp/', '/product-reviews/')
                
                list_of_urls.append(href_new)
                
        except (TypeError, KeyError, ValueError):
            # no 'a' tag, no 'href' attribute, or no '=' / '&' in the URL: skip this result
            pass

    # There may be multiple matching results; I take the first one.
    # This is because Google ranks results by 'popularity', and presuming
    # 'popularity' here translates to accuracy, the first one is the most
    # likely to be the page we want
    dummy_url = list_of_urls[0]

    # extract text and convert to HTML, as before
    amazon = requests.get(dummy_url)
    amazon_soup = BeautifulSoup(amazon.text, "lxml")
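The comments above mention that a more sophisticated check than `'amazon.com/' in href` is possible. One option, sketched below for Python 3, is to let `urllib.parse` split the URL instead of slicing between '=' and '&'. This assumes that the link has the form `/url?q=<target>&...`, which is what the slicing code implies. The helper names `extract_target` and `to_reviews_url`, and the `&sa=U` suffix in the sample, are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the value of the 'q' parameter of a redirect-style link, or None."""
    params = parse_qs(urlparse(href).query)
    targets = params.get("q", [])
    return targets[0] if targets else None

def to_reviews_url(product_url):
    """Map an amazon.com '/dp/' URL to its '/product-reviews/' URL, or None."""
    if not product_url:
        return None
    parsed = urlparse(product_url)
    # compare the host exactly, so that amazon.com.au or amazon.in do not match
    if parsed.netloc not in ("amazon.com", "www.amazon.com"):
        return None
    if "/dp/" not in parsed.path:
        return None
    return product_url.replace("/dp/", "/product-reviews/", 1)

sample_href = ("/url?q=https://www.amazon.com/Harry-Potter-Deathly-Hallows-Rowling"
               "/dp/1606868829&sa=U")
target = extract_target(sample_href)
reviews_url = to_reviews_url(target)
print(reviews_url)
```

Comparing the host name exactly removes the need for the trailing '/' trick, at the cost of a few more lines.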

This is what one should see as the product reviews page for the book

<img src = "review1.png" style = "width: 80%;">

If one scrolls down past the 'Top positive review' area, one will notice that there are 10 reviews on the page.

While it would be good to get these reviews, it would be better to get ALL possible reviews for this book.

There are 2 factors to consider here:
1. How many pages of reviews are there?
2. How does one extract the above fact?

<img src = "review2.png" style = "width: 80%;">

The first question is solved by scrolling down to the bottom of the page, where one finds the buttons to move through the pages. One wants the last page number here (boxed). 

Now for the second question! As done above, snoop around with the Inspector, and you find that the buttons are part of a 'ul' tag, and this element is the last 'li' element in this list. 
The below code helps achieve exactly that!

In [None]:
    # find ul tag
    ul = amazon_soup.find("ul", class_ = "a-pagination")

    # find last page number
    last_li = ul.find_all('li', class_ = "page-button")[-1]
    
    # convert string value to integer
    last_li = int(last_li.text)

In [None]:
    # now that we have the total number of pages, we can loop over that number to get the URLs of all the pages

    # store all page URLs in this list
    review_pages = []

    # Turns out that the pages follow a common format, with only the page number value changing in the URL.
    # Look at the below code for reference

    # review pages are numbered from 1 to last_li, so the range starts at 1
    for i in range(1, last_li + 1):
        next_page = dummy_url + '/ref=cm_cr_getr_d_paging_btm_' + str(i) + '?pageNumber=' + str(i)
        review_pages.append(next_page)
        
    # Now I explain why I called the earlier HTML page 'dummy'.
    # The reason was that that page was only to get the generic product reviews page URL
    # Using that, we have the actual page URLs that we need and don't need the earlier variables relating
    # to that URL
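The page-URL pattern can be checked without making any request. Here is a small sketch with the loop pulled out into a helper; the name `build_review_pages` is my own, and it assumes that review pages are numbered from 1 up to the last page number shown in the pagination bar.

```python
def build_review_pages(base_url, last_page):
    """Return the URLs of review pages 1..last_page for a product-reviews URL."""
    pages = []
    for page in range(1, last_page + 1):
        pages.append(base_url + '/ref=cm_cr_getr_d_paging_btm_' + str(page)
                     + '?pageNumber=' + str(page))
    return pages

base = 'https://www.amazon.com/Harry-Potter-Deathly-Hallows-Rowling/product-reviews/1606868829'
pages = build_review_pages(base, 3)
for p in pages:
    print(p)
```

Printing the list for a small page count is a quick way to confirm that the first and last URLs look right before scraping hundreds of pages.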

In [None]:
    reviews_list = []
    
    # [review_pages] now contains the URLs of all the review pages.
    # Loop through each, making a request turn-by-turn
    for pages in review_pages:

        r = requests.get(pages)    
        source_code = r.text
        soup = BeautifulSoup(source_code, "lxml")

        # find actual text
        # Again, found by fiddling with the Inspector
        reviews = soup.find_all("div", class_ = "a-section review")

        # [reviews] is the list of 10 reviews found on each page.
        # Loop through each of them and extract the review text
        for review in reviews:
            review_text = review.find("span", class_ = "a-size-base review-text").text
            reviews_list.append(review_text)

In [None]:
    # [reviews_list] is a list of all review text.
    # we will use a dictionary to keep a count of word frequency
    dict_tags = {}

    for reviews in reviews_list:
        
        # use NLP to identify each word with its part of speech.
        # Refer to http://www.nltk.org/data.html for an example of what the next
        # 2 lines will do.
        tokens = nltk.word_tokenize(reviews)
        tagged = nltk.pos_tag(tokens)

        for tags in tagged:
            # As written below, [tag_text] would be something like 'the'
            # and [tag_type] would be 'DT', which stands for determiner,
            # as 'the' is a determiner.
            
            # Refer to http://nishutayaltech.blogspot.in/2015/02/penn-treebank-pos-tags-in-natural.html
            # for a list of all types of possible parts of speech (POS) and their abbreviations
            tag_text = tags[0]
            tag_type = tags[1]

            # only consider parts of speech in [list_of_tags]
            
            # Thus, [list_of_tags] is a list of the parts of speech used as filters.
            # This is done because if we just make a dictionary of all possible words in the
            # reviews, common words such as 'the' and 'a' will have the highest frequency.
            # Now, if that is what you want, then you can pass in such parts of speech.
            # But this function is meant to work with any set of parts of speech the user provides
            
            if tag_type in list_of_tags:

                # normal mapping of frequencies of words in a dictionary
                if tag_text in dict_tags:
                    dict_tags[tag_text] += 1
                else:
                    dict_tags[tag_text] = 1
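The filter-and-count step can also be written with `collections.Counter` from the standard library. The sketch below works on hand-tagged (word, tag) pairs, so it runs without NLTK or its data files; the helper name `count_words` and the sample sentences are my own, and in the real function the pairs would come from `nltk.pos_tag`.

```python
from collections import Counter

def count_words(tagged_reviews, wanted_tags):
    """Count words whose part-of-speech tag is in wanted_tags."""
    counts = Counter()
    for tagged in tagged_reviews:
        # each element of tagged is a (word, tag) pair
        counts.update(word for word, tag in tagged if tag in wanted_tags)
    return counts

# hand-tagged sample standing in for the output of nltk.pos_tag
sample = [
    [("Harry", "NNP"), ("is", "VBZ"), ("the", "DT"), ("hero", "NN")],
    [("Harry", "NNP"), ("defeats", "VBZ"), ("Voldemort", "NNP")],
]
counts = count_words(sample, {"NNP", "NNPS"})
print(counts)
```

With only proper nouns allowed, 'the' and 'hero' are dropped and 'Harry' is counted twice.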


In [None]:
    # [dict_sorted] is a list of (word, count) pairs sorted by descending frequency of occurrence
    dict_sorted = sorted(dict_tags.items(), key=lambda x:x[1], reverse = True) # descending

    # finding relative frequency of each word in the dictionary
    
    # sum of frequencies
    sum_freq = float(sum(dict_tags.values()))

    rel_freq_list = []

    # calculate relative frequencies
    for tags in dict_sorted:
        
        # add each word to a new list, with the absolute frequency replaced by the relative frequency
        tags_text = ( round(tags[1]/sum_freq, 4) )
        rel_freq_list.append( (tags[0], tags_text) )

    # [rel_freq_list] is the list of (word, relative frequency) pairs
    return rel_freq_list
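To see the relative-frequency step in isolation, here is a small sketch on made-up counts. The helper name `relative_frequencies` and the guard against an empty dictionary are my additions; the sorting and rounding mirror the code above.

```python
def relative_frequencies(counts, ndigits=4):
    """Return (word, share of total count) pairs, most frequent first."""
    total = float(sum(counts.values()))
    if total == 0:
        # nothing matched the chosen parts of speech
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, round(count / total, ndigits)) for word, count in ranked]

# made-up counts: 10 words in total
rel = relative_frequencies({"Harry": 6, "Potter": 3, "Ron": 1})
print(rel)
```

Each count is divided by the total of 10, so the three shares are 0.6, 0.3 and 0.1.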

Woohoo! We're done!

Below is the entire code written in snippets above in one function that takes in the book name, author and the parts of speech worth exploring.

Try it out after going through the entire example!

In [9]:
def word_freq(name, author, list_of_tags):

    # extract book name and author to form search query
    book_list = name.split()
    author_list = author.split()

    query = "+".join(book_list)
    query += "+" + "+".join(author_list)

    # search the web
    dummy_r = requests.get('http://google.com/search?q=' + query + '+Amazon.com+Book')

    dummy_source_code = dummy_r.text
    # print dummy_source_code

    dummy_soup = BeautifulSoup(dummy_source_code, "lxml")

    h3_lst = dummy_soup.find_all('h3')
    list_of_urls = []
    
    for items in h3_lst:
        try:
            a = items.find('a')
            href = a['href']
            if 'amazon.com/' in href: # main Amazon site
                start = href.index("=")
                end = href.index("&")
                href_stripped = href[start + 1:end]
                href_new = href_stripped.replace('/dp/', '/product-reviews/')
                list_of_urls.append(href_new)
        except (TypeError, KeyError, ValueError):
            pass

    
    # use first search query
    dummy_url = list_of_urls[0]

    amazon = requests.get(dummy_url)
    amazon_soup = BeautifulSoup(amazon.text, "lxml")

    # find number of pages of reviews
    ul = amazon_soup.find("ul", class_ = "a-pagination")

    # find last page number
    last_li = ul.find_all('li', class_ = "page-button")[-1]
    last_li = int(last_li.text)

    # store all page URLs in this list
    review_pages = []

    # review pages are numbered from 1 to last_li
    for i in range(1, last_li + 1):
        next_page = dummy_url + '/ref=cm_cr_getr_d_paging_btm_' + str(i) + '?pageNumber=' + str(i)
        review_pages.append(next_page)

    # print "Review pages: "
    # print review_pages

    reviews_list = []

    for pages in review_pages:
        # print "Page: "
        # print page
        # make a request to all pages turn-by-turn
        r = requests.get(pages)    
        source_code = r.text
        soup = BeautifulSoup(source_code, "lxml")

        # find actual text
        reviews = soup.find_all("div", class_ = "a-section review")

        # loop over all reviews
        for review in reviews:
            review_text = review.find("span", class_ = "a-size-base review-text").text
            reviews_list.append(review_text)


    # print reviews_list

    # make a dictionary of all the words in the reviews
    dict_tags = {}

    for reviews in reviews_list:
        tokens = nltk.word_tokenize(reviews)
        tagged = nltk.pos_tag(tokens)

        for tags in tagged:
            tag_text = tags[0]
            tag_type = tags[1]

            # only consider parts of speech in [list_of_tags]
            if tag_type in list_of_tags:

                if tag_text in dict_tags:
                    dict_tags[tag_text] += 1
                else:
                    dict_tags[tag_text] = 1

    dict_sorted = sorted(dict_tags.items(), key=lambda x:x[1], reverse = True) # descending

    # find sum of frequencies
    sum_freq = float(sum(dict_tags.values()))

    rel_freq_list = []

    # calculate relative frequencies
    for tags in dict_sorted:
        tags_text = ( round(tags[1]/sum_freq, 4) )
        rel_freq_list.append( (tags[0], tags_text) )

    return rel_freq_list

In [10]:
dic = word_freq('Harry Potter and the Deathly Hallows', 'JK Rowling', ['NNP', 'NNPS']) #only proper nouns

In [15]:
# analysing the top 20 entries
dic[:20]

[(u'Harry', 0.1887),
 (u'Potter', 0.103),
 (u'Rowling', 0.0443),
 (u'Voldemort', 0.0212),
 (u'Hogwarts', 0.0141),
 (u'J.K.', 0.014),
 (u'Dumbledore', 0.0136),
 (u'Great', 0.0129),
 (u'Hallows', 0.0124),
 (u'JK', 0.0124),
 (u'Book', 0.0122),
 (u'Ron', 0.0121),
 (u'Phoenix', 0.0116),
 (u'Deathly', 0.0116),
 (u'Hermione', 0.0108),
 (u'Dale', 0.0101),
 (u'Jim', 0.0096),
 (u'Snape', 0.0089),
 (u'Lord', 0.0083),
 (u'HP', 0.0079)]

Before we analyze the above, let us clean this data up a bit by combining terms that convey the same meaning:
1. 'Harry', 'Potter', and 'HP' all correspond to Harry Potter, the titular character (also in the book title). So, we can replace these 3 terms with one super term having frequency = 0.2996
2. Similarly 'Rowling', 'JK' and 'J.K.' all correspond to JK Rowling, the author of the series. The new frequency is 0.0707
3. 'Voldemort' and 'Lord' both refer to Lord Voldemort, the series' antagonist. New frequency = 0.0295
4. 'Deathly' and 'Hallows' correspond to the Deathly Hallows, the trio of power (also in the book title). New frequency = 0.024

Thus the top 5 entries are:
1. Harry Potter - 0.2996
2. JK Rowling - 0.0707
3. Voldemort - 0.0295
4. Deathly Hallows - 0.024
5. Hogwarts - 0.0141
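The merging of terms can be reproduced in a few lines. The sketch below takes the top 20 entries printed earlier and sums the frequencies of words that map to the same 'super term'. The alias table and the helper name `merge_terms` are my own; the choice of which words belong together is the manual judgment described in the list of merges.

```python
top20 = [
    ('Harry', 0.1887), ('Potter', 0.103), ('Rowling', 0.0443), ('Voldemort', 0.0212),
    ('Hogwarts', 0.0141), ('J.K.', 0.014), ('Dumbledore', 0.0136), ('Great', 0.0129),
    ('Hallows', 0.0124), ('JK', 0.0124), ('Book', 0.0122), ('Ron', 0.0121),
    ('Phoenix', 0.0116), ('Deathly', 0.0116), ('Hermione', 0.0108), ('Dale', 0.0101),
    ('Jim', 0.0096), ('Snape', 0.0089), ('Lord', 0.0083), ('HP', 0.0079),
]

# manual mapping from individual words to the term they stand for
aliases = {
    'Harry': 'Harry Potter', 'Potter': 'Harry Potter', 'HP': 'Harry Potter',
    'Rowling': 'JK Rowling', 'JK': 'JK Rowling', 'J.K.': 'JK Rowling',
    'Lord': 'Voldemort',
    'Deathly': 'Deathly Hallows', 'Hallows': 'Deathly Hallows',
}

def merge_terms(freqs, aliases):
    """Sum the frequencies of words that share an alias; sort descending."""
    merged = {}
    for word, freq in freqs:
        key = aliases.get(word, word)
        merged[key] = merged.get(key, 0.0) + freq
    rounded = [(term, round(total, 4)) for term, total in merged.items()]
    return sorted(rounded, key=lambda kv: kv[1], reverse=True)

merged = merge_terms(top20, aliases)
print(merged[:5])
```

Words without an alias, such as 'Hogwarts', simply keep their own frequency.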

These entries make sense. The last book tells how Harry discovers the Deathly Hallows and eventually defeats Voldemort after the Battle of Hogwarts, and it seems natural to also mention the author in a review (as metadata for the book), so our analysis seems pretty good! 


In [16]:
dic2 = word_freq('Harry Potter and the Deathly Hallows', 'JK Rowling', ['JJ', 'RB']) # adjectives and adverbs

In [17]:
dic2[:20] # top 20

[(u"n't", 0.0407),
 (u'not', 0.0384),
 (u'so', 0.0252),
 (u'great', 0.0218),
 (u'good', 0.0191),
 (u'just', 0.0191),
 (u'very', 0.0176),
 (u'much', 0.0159),
 (u'first', 0.014),
 (u'really', 0.0131),
 (u'even', 0.0118),
 (u'last', 0.0118),
 (u'many', 0.0116),
 (u'other', 0.0116),
 (u'only', 0.0111),
 (u'well', 0.0111),
 (u'again', 0.0095),
 (u'new', 0.0095),
 (u'as', 0.0095),
 (u'too', 0.0082)]

Given that the series, and especially this book, was a hit, it is not surprising that 'great' and 'good' rank 4th and 5th, together taking about 4% of the frequency values (0.0218 + 0.0191 = 0.0409). This is once again consistent with our analysis!

# Conclusion & Future improvements

## Conclusion

What is the biggest learning I got from this process? 
Experimentation is paramount! It was only after I used the Chrome Inspector that I found out where the actual information resided. Fiddling around was key.

## Future improvements

Better knowledge of NLP will enable me to analyse this text corpus better.

Machine learning can be used to better club words with similar meanings.