### Task 1

Name: Sai Srujan<br>

This notebook covers Task 1 - Data Collection.<br>
In this task we scrape the complete set of web pages from http://mlg.ucd.ie/modules/python/assign2/21200803/.<br>
We scrape the product reviews for every year from all the pages reachable from the link above.

In [1]:
import json
import urllib.request
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# Common web page url prefix for scraping the reviews from all the web pages
web_page_url = "http://mlg.ucd.ie/modules/python/assign2/21200803"

The main web page is downloaded first, and the BeautifulSoup library is then used to build a tree representation of it. From that tree we extract the links to the product review pages for each year, so that those pages can be downloaded and their review data scraped.

In [3]:
# Download the web page to scrape the data
web_page = requests.get(web_page_url)

# Build the tree representation of the downloaded page
web_content = BeautifulSoup(web_page.content, "html.parser")

# Finding all the anchor tags, as this contains the links to the pages where the review data is present
a_tags = web_content.find_all("a")

review_pages_url = []
# For every anchor tag in the page, read the 'href' attribute, which holds the link to a review page
for tag in a_tags:
    # tag.get returns None instead of raising an error when an anchor has no 'href'
    link = tag.get('href')
    # Keep only the review pages, whose urls contain the word 'reviews'
    if link and 'reviews' in link:
        review_pages_url.append(link)
print("No. of review links: %d" % len(review_pages_url))

No. of review links: 72


There is one link for every month of the year, and we are scraping 6 years of data, so we expect 6 x 12 = 72 links. The output confirms that all of them were obtained.
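
The link-filtering step can also be sketched without third-party packages. The example below is only an illustration under stated assumptions, not the notebook's method: it swaps BeautifulSoup for the standard-library `html.parser`, and the sample markup and the `ReviewLinkParser` class are made up for the demonstration. Anchors without an `href` attribute are skipped rather than raising an error, and `urljoin` is used to build the full url of each page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical sample markup standing in for the index page.
SAMPLE_HTML = """
<a href="index.html">Home</a>
<a name="top">No href here</a>
<a href="reviews-2016-jan-01.html">January 2016</a>
<a href="reviews-2016-feb-01.html">February 2016</a>
"""


class ReviewLinkParser(HTMLParser):
    """Collects the href of every anchor tag whose link contains 'reviews'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # None when the anchor has no href
        if href and "reviews" in href:
            self.links.append(href)


parser = ReviewLinkParser()
parser.feed(SAMPLE_HTML)

# urljoin resolves each relative link against the base url (note the trailing slash).
base_url = "http://mlg.ucd.ie/modules/python/assign2/21200803/"
full_urls = [urljoin(base_url, link) for link in parser.links]
print(full_urls)
```

Joining with `urljoin` avoids hand-written `"/"` concatenation, which is easy to get wrong when the base url does or does not end with a slash.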

Next, we find the number of reviews available for each year and compute the total, so that we can verify later that every review has been scraped.

In [4]:
# Find all the divs with class 'list-title' which would provide the information on the number of reviews for each year
year_divs = web_content.find_all("div", {"class":["list-title"]})
total_reviews = 0
for year in year_divs:
    year_text = year.text.strip()
    print(year_text)
    # From the year text we split and extract the number of reviews
    num_year_reviews = year_text.split("(")[1].split(" ")[0]
    # Convert it into int and calculate the total count
    total_reviews += int(num_year_reviews.replace(',', ''))
print("\nTotal number of reviews:%d" % total_reviews )

Year: 2016 (1,528 reviews)
Year: 2017 (1,560 reviews)
Year: 2018 (1,528 reviews)
Year: 2019 (1,578 reviews)
Year: 2020 (1,524 reviews)
Year: 2021 (1,526 reviews)

Total number of reviews:9244
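
As an alternative to chained `split` calls, the count can be pulled out of each heading with a regular expression. This is a minimal sketch: the heading texts are copied from the output above, while the `parse_review_count` helper is a hypothetical name introduced here for illustration.

```python
import re

# Heading texts copied from the output above.
year_headings = [
    "Year: 2016 (1,528 reviews)",
    "Year: 2017 (1,560 reviews)",
    "Year: 2018 (1,528 reviews)",
    "Year: 2019 (1,578 reviews)",
    "Year: 2020 (1,524 reviews)",
    "Year: 2021 (1,526 reviews)",
]


def parse_review_count(heading):
    """Return the review count inside the parentheses as an int."""
    match = re.search(r"\(([\d,]+) reviews?\)", heading)
    if match is None:
        raise ValueError("No review count found in: %r" % heading)
    # Remove the thousands separator before converting to int
    return int(match.group(1).replace(",", ""))


counts = [parse_review_count(heading) for heading in year_headings]
print(sum(counts))  # 9244
```

Unlike positional splitting, the pattern raises a clear error if a heading does not have the expected shape.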


We download one of the review pages from the links obtained in the previous step and display its tree structure, to see how the review data can be scraped.

In [5]:
test_url = web_page_url + "/" + review_pages_url[0]
page = requests.get(test_url)
content = BeautifulSoup(page.content, "html.parser")
print(content.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="noindex" name="robots"/>
  <meta content="Content on this site is posted for teaching purposes only. Original data is from theguardian.com" name="description"/>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Product Reviews Archive January 2016
  </title>
  <link href="../images/favicon.ico" rel="shortcut icon"/>
  <!-- Bootstrap core CSS -->
  <link href="../assets/css/bootstrap.css" rel="stylesheet"/>
  <!-- Custom styles for this template -->
  <link href="../assets/css/style.css" rel="stylesheet"/>
  <link href="../assets/css/font-awesome.min.css" rel="stylesheet"/>
  <script src="../assets/js/modernizr.js">
  </script>
 </head>
 <body>
  <div class="container mtb">
   <div class="jumbotron">
    <h1>
     <a href="index.html">
      Product Reviews Archive
     </a>
    </h1>
    <p>
     Review collectio

The tree structure above shows that the product reviews for each month are spread across several pages, as only around 30 reviews are displayed per page. To get the data from all the pages, we first find the results div, which states the number of pages (e.g. Page 1 of 6), and then increment the page number in the url to scrape the reviews from every page of that month.<br>
The links on the main web page point to the first review page of each month, so we access that page to learn how many pages that month has in a particular year. We can then download every page and scrape its review data.
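
The pagination logic described above can be sketched in isolation. The helper names `parse_page_count` and `build_page_urls` are hypothetical, and the sketch assumes that page numbers keep the same zero-padded width as the first page (so page 10 would be `-10`), which the source does not confirm.

```python
import re


def parse_page_count(results_text):
    """Extract the total page count from text such as 'Page 1 of 6'."""
    match = re.search(r"of\s+(\d+)\s*$", results_text.strip())
    if match is None:
        raise ValueError("Unexpected results text: %r" % results_text)
    return int(match.group(1))


def build_page_urls(first_page, num_pages):
    """Generate the file name of every page from the first page's file name."""
    match = re.fullmatch(r"(.*-)(\d+)(\.html)", first_page)
    if match is None:
        raise ValueError("Unexpected page name: %r" % first_page)
    prefix, digits, extension = match.groups()
    width = len(digits)  # assumption: the zero-padding width stays constant
    return [
        prefix + str(page).zfill(width) + extension
        for page in range(1, num_pages + 1)
    ]


num_pages = parse_page_count("Page 1 of 6")
print(build_page_urls("reviews-2016-jan-01.html", num_pages))
```

Replacing the whole numeric suffix, rather than only its last digit, keeps the generated names valid if a month ever has ten or more pages, provided the padding assumption holds.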

We iterate through all the urls obtained from the main web page and download the sub-pages for every month as described. For each review we create an object with the required details and add it to a list.<br>
We scrape the rating, title, body and helpfulness information of each review.

In [6]:
product_reviews = []
# Iterate through each url to scrape the product reviews
for url in review_pages_url:
    print("Scraping data from %s" % url)
    # Creating the complete url to download the page
    review_page_url = web_page_url + "/" + url
    # Download the web page to scrape the data
    review_page = requests.get(review_page_url)
    # Build the tree representation of the downloaded page
    review_content = BeautifulSoup(review_page.content, "html.parser")
    # Find the results div to obtain the number of pages of product reviews that exist for that month
    results_info = review_content.find_all(class_=["results"])
    for result in results_info:
        # Split the text and retrieve the last part of the text which contain the total number of pages
        result_text = result.text.split(" ")
        num_pages = result_text[-1]
        
        # Download every page by changing the page number in the url, so we loop over the page numbers
        for i in range(1,int(num_pages)+1):
            # The first-page url looks like 'reviews-2016-jan-01.html'.
            # We split off the extension, drop the last digit of the file name,
            # append the required page number and rebuild the url.
            # Note: replacing a single digit only works while a month has fewer than 10 pages.
            url_name, url_extension = url.rsplit(".", 1)
            final_url = web_page_url + "/" + url_name[:-1] + str(i) + "." + url_extension
            # Download the page obtained from the url after replacing it with the required page number
            review_month_pages = requests.get(final_url)
            # Build the tree representation of the downloaded page
            review_month_content = BeautifulSoup(review_month_pages.content, "html.parser")
            
            # Obtain all the reviews by extracting every div with the class 'review' or 'review-alt',
            # as one of these two classes is used to represent each product review
            reviews = review_month_content.find_all("div", {"class":["review", "review-alt"]})
            
            # Parse each review obtained in that page to get the required information
            for review in reviews:
                product_review = {}
                # Retrieve the review title heading (an h5 tag)
                title = review.find("h5", {"class":["review-title", "review-title2"]})

                # The review title heading also contains the rating information.
                # We retrieve the image tag that shows the star rating and read its 'alt' attribute.
                # The alt attribute holds the image name, such as '5-star', '3-star' or '2-star',
                # depending on the rating.
                # Splitting this value on '-' and taking the first part gives the rating. Ex: 3, 5, etc.
                product_review['rating'] = title.find("img").attrs['alt'].split("-")[0]
                
                # The title of the review
                product_review['title'] = title.text.strip()
                
                # The review body is held in the paragraph tag with the class name 'review-body',
                # so we find that paragraph and strip its text to get the review body.
                product_review['review_body'] = review.find("p", {"class":["review-body"]}).text.strip()
                
                # Helpfulness information is encapsulated in the paragraph tag with 'metadata' class name
                helpful = review.find_all("p", {"class":["metadata"]})
                for metadata in helpful:
                    helpful_review = metadata.text.strip()
                    # Several paragraphs can share this class name, so only the one
                    # containing the word 'helpful' is considered
                    if "helpful" in helpful_review:
                        helpful_words = helpful_review.split(" ")
                        # Number of users who found the review helpful, out of those who voted
                        product_review['helpful_user'] = helpful_words[0]
                        # Total number of users who voted on the review
                        product_review['total_user'] = helpful_words[3]
                product_reviews.append(product_review)

Scraping data from reviews-2016-jan-01.html
Scraping data from reviews-2016-feb-01.html
Scraping data from reviews-2016-mar-01.html
Scraping data from reviews-2016-apr-01.html
Scraping data from reviews-2016-may-01.html
Scraping data from reviews-2016-jun-01.html
Scraping data from reviews-2016-jul-01.html
Scraping data from reviews-2016-aug-01.html
Scraping data from reviews-2016-sep-01.html
Scraping data from reviews-2016-oct-01.html
Scraping data from reviews-2016-nov-01.html
Scraping data from reviews-2016-dec-01.html
Scraping data from reviews-2017-jan-01.html
Scraping data from reviews-2017-feb-01.html
Scraping data from reviews-2017-mar-01.html
Scraping data from reviews-2017-apr-01.html
Scraping data from reviews-2017-may-01.html
Scraping data from reviews-2017-jun-01.html
Scraping data from reviews-2017-jul-01.html
Scraping data from reviews-2017-aug-01.html
Scraping data from reviews-2017-sep-01.html
Scraping data from reviews-2017-oct-01.html
Scraping data from reviews-2017-
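
The field parsing used in the loop can be tried out on its own. In this sketch the helper names are hypothetical, and the helpfulness sentence is an assumed example: the source only shows that the first and fourth space-separated words hold the two counts, not the exact wording. The counts 472 and 477 are taken from the first review in the dataframe preview. Unlike the notebook, which stores these values as strings, the sketch converts them to integers.

```python
def parse_rating(alt_text):
    """Turn an image alt text such as '5-star' into the integer 5."""
    return int(alt_text.split("-")[0])


def parse_helpfulness(text):
    """Return (helpful_votes, total_votes) from the first and fourth words."""
    words = text.split()
    return int(words[0]), int(words[3])


# Assumed wording; only the positions of the two numbers come from the notebook.
sample_text = "472 out of 477 users found this review helpful"

rating = parse_rating("5-star")
helpful_votes, total_votes = parse_helpfulness(sample_text)
print(rating, helpful_votes, total_votes)
```

Converting to integers at scraping time means later steps can compute ratios such as `helpful_votes / total_votes` without a separate type conversion.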

We create a dataframe from all the reviews obtained across all the pages.<br>
We then check that the reviews from every year have been extracted by comparing the number of rows with the count obtained previously.

In [7]:
product_review_df = pd.DataFrame(product_reviews)
if total_reviews == len(product_review_df):
    print("Total reviews extracted: %d" % len(product_review_df))
    print("\nAll the reviews were extracted successfully, matching the total count obtained before.")
else:
    # Report a mismatch instead of staying silent
    print("Extracted %d reviews, but expected %d." % (len(product_review_df), total_reviews))

Total reviews extracted: 9244

All the reviews were extracted successfully, matching the total count obtained before.


In [8]:
product_review_df.head(5)

| | rating | title | review_body | helpful_user | total_user |
|---|---|---|---|---|---|
| 0 | 5 | This filter works PERFECT! | Seriously, I love my Keurig. I love the conven... | 472 | 477 |
| 1 | 4 | This stuff is great for muffins | There's a recipe on the back of the package fo... | 17 | 17 |
| 2 | 1 | Curiously awful | Cola is by far my favorite drink. My wife and ... | 1 | 14 |
| 3 | 1 | Rancid! | I love chia, but I have gotten two different p... | 23 | 26 |
| 4 | 1 | They taste like boogers | If you don't like the sound of a salty, vinega... | 4 | 19 |


In [9]:
# Save the review data to a csv file named Product_Reviews.csv
product_review_df.to_csv("Product_Reviews.csv", index=False)
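
A saved file can be sanity-checked by reading it back and comparing it with the data that was written. The sketch below is an illustration only: it uses the standard-library `csv` module rather than pandas, writes to a temporary file with the hypothetical name `Product_Reviews_sample.csv`, and uses two rows taken from the preview above with the `review_body` column left out for brevity.

```python
import csv
import os
import tempfile

# Two rows taken from the preview above (review_body omitted for brevity).
rows = [
    {"rating": "5", "title": "This filter works PERFECT!",
     "helpful_user": "472", "total_user": "477"},
    {"rating": "1", "title": "Rancid!",
     "helpful_user": "23", "total_user": "26"},
]
fieldnames = ["rating", "title", "helpful_user", "total_user"]

# Write to a temporary location so no real output file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "Product_Reviews_sample.csv")
with open(path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and compare it with what was written.
with open(path, newline="", encoding="utf-8") as csv_file:
    loaded = list(csv.DictReader(csv_file))

print(len(loaded) == len(rows))
```

Note that values read back from a csv file are always strings, so numeric columns need converting again after loading.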