# Title : Product Reviews Data Collection


### Comp Number : 21202384

### URL : 'http://mlg.ucd.ie/modules/python/assign2/21202384/'

In [1]:
# Import needed libraries
import requests
from bs4 import BeautifulSoup
import re # Regular Expression Library
import os
import json

Beautiful Soup is a Python library that makes web scraping easier: it parses HTML and makes it simple to locate tags and read their contents.
The Requests library queries the site and returns the raw HTML content of the page. 're' is the regular expression library, used here for string manipulation.

We aim to collect product reviews and related information from the URL above. The main page lists the product reviews by month, with a link to each month's reviews. We first scrape this main page to get the links for all the months and store them in a Python list.

In [2]:
url_to_site='http://mlg.ucd.ie/modules/python/assign2/21202384/'
main_page=requests.get(url_to_site) # Queries the URL passed to it and returns a response holding the raw HTML content if the request is successful
main_page.status_code

200

A status code of 200 indicates a successful request.

In [3]:
# A View of The Page 
main_page.text



The cell above shows the raw HTML content of the main page. The page contains numerous HTML tags. Each 'a' tag has an 'href' attribute that holds the link to a reviews page. We first convert the raw text of this page into a Beautiful Soup object, then extract the 'href' attribute of every 'a' tag from that object.

In [4]:
# Turning the request result into a BeautifulSoup object using the HTML parser
soup=BeautifulSoup(main_page.text,'html.parser')
soup.title

<title>Product Reviews Archive Home</title>

The Code below extracts the HTML links from the page.

In [5]:
links=[] # Empty list to be populated with the links
for link in soup.find_all('a'):
    links.append(link.get('href')) # Populates the list with the links
links[:5]

['index.html',
 'reviews-2016-jan-01.html',
 'reviews-2016-feb-01.html',
 'reviews-2016-mar-01.html',
 'reviews-2016-apr-01.html']

A list of the links on the main page is shown above. The first entry, 'index.html', is not a reviews page and is therefore dropped.

In [6]:
del links[0]
links[:5]

['reviews-2016-jan-01.html',
 'reviews-2016-feb-01.html',
 'reviews-2016-mar-01.html',
 'reviews-2016-apr-01.html',
 'reviews-2016-may-01.html']

The list of available links has been extracted from the main page; these pages contain the product reviews to be collected. We iterate over the list and append each relative link to the main page URL to construct the full address of the reviews page for that month and year. An example of a full link constructed this way is 'http://mlg.ucd.ie/modules/python/assign2/21202384/reviews-2016-jan-01.html'.
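As a minimal sketch of the link construction described above, the standard library's urljoin can combine the base URL with a relative link. The helper name build_review_url is hypothetical and not part of this notebook, which uses plain string concatenation instead.

```python
from urllib.parse import urljoin

BASE_URL = 'http://mlg.ucd.ie/modules/python/assign2/21202384/'

def build_review_url(relative_link, base_url=BASE_URL):
    """Join a relative link taken from the main page onto the base URL."""
    return urljoin(base_url, relative_link)

print(build_review_url('reviews-2016-jan-01.html'))
```

Because the base URL ends with a slash, urljoin simply appends the relative link, which matches the concatenation used in the rest of the notebook.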

To break the code into small pieces for ease of explanation and presentation, we use Python functions and classes where applicable.

The reviews of each month span multiple pages. The first link has reviews that span 6 pages, and review data needs to be extracted from every page of each link. The cell below contains the function 'Num_of_Pages'. It receives a product review link as input, looks up the page-count information on that page, and returns the total number of pages for that link.

In [7]:
def Num_of_Pages(html_link):
    url=url_to_site+html_link
    page=requests.get(url) # Request the first page of this link
    soup1=BeautifulSoup(page.text,'html.parser')
    h4=soup1.find_all('h4') # The first h4 tag holds the page-count information
    # Take the last number in the h4 text, so counts of 10 or more pages are read in full
    return int(re.findall(r'\d+',str(h4[0].contents[0]))[-1])
num_of_pages=Num_of_Pages(links[0])
num_of_pages

6

The product reviews of a single link span multiple pages. The code above scrapes the first page, finds the tag containing the page-count information (the 'h4' tag), then extracts the total number of pages as an integer.
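The page-count step can be sketched in isolation, without a network request. The heading text 'Page 1 of 6' is an assumed example of what the 'h4' tag might contain, and parse_page_count is a hypothetical helper name. The only thing the notebook confirms is that the heading ends with the total number of pages.

```python
import re

def parse_page_count(heading_text):
    """Return the last number found in the heading text as an integer."""
    numbers = re.findall(r'\d+', heading_text)
    if not numbers:
        raise ValueError('no page count found in: {!r}'.format(heading_text))
    return int(numbers[-1])

# Assumed heading formats, used for illustration only
print(parse_page_count('Page 1 of 6'))
print(parse_page_count('Page 3 of 12'))
```

Reading the last run of digits rather than the last character keeps the result correct if a month ever has ten or more pages.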

The function below iterates from 1 to the total number of pages for the link passed to it and yields each page. Because it uses 'yield', calling the function does not return the pages themselves; it returns a generator that produces each parsed page on demand.

In [8]:
# The function below iterates over the number of pages and yields each parsed page
def Link_Pages(html_link):
    url=url_to_site+html_link
    num_of_pages=Num_of_Pages(html_link)
    for page in range(1,num_of_pages+1):
        # Swap the two-digit page number before '.html' (assumes zero-padded names such as '-01.html')
        page_url=re.sub(r'\d{2}(?=\.html$)','{:02d}'.format(page),url)
        request_page=requests.get(page_url) # Request this page
        yield BeautifulSoup(request_page.text,'html.parser')
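The lazy behaviour of a generator can be seen in a small sketch that needs no network access. Here fake_pages is a hypothetical stand-in for Link_Pages that yields a label instead of a parsed page.

```python
def fake_pages(num_of_pages):
    """Hypothetical stand-in for Link_Pages: yields a label instead of a parsed page."""
    for page in range(1, num_of_pages + 1):
        print('fetching page', page)  # runs only when this page is requested
        yield 'page-{:02d}'.format(page)

pages = fake_pages(3)    # nothing has been fetched yet
first = next(pages)      # the body runs up to the first yield
remaining = list(pages)  # the remaining pages are produced on demand
print(first, remaining)
```

A generator can only be consumed once, which is why the extraction cells later call Link_Pages again for every link.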

The function below extracts the titles from each page of a link. The titles are contained in the 'h5' tags as text.

In [10]:
# Function to extract titles
def Title_Extraction(pages):
    body=pages.body # Gets the body of the page
    titles_list=[]
    for title in body.find_all("h5"): # Find every h5 tag
        # The last content of the h5 tag is the title; naming the parser avoids a bs4 warning
        titles_list.append(BeautifulSoup(str(title.contents[-1]),'html.parser').get_text(strip=True))
    return titles_list

The h5 tag contains two pieces of information: the source of the review and the title. The code above extracts just the title.

The star rating is contained in the 'img' tag, as an image in the HTML. The function below uses a regular expression to extract the star rating from the 'img' tag.

In [11]:
# Function to extract Star Ratings
def Star_Ratings_Extract(pages):
    body=pages.body
    rating_list=[]
    for rating in body.find_all('img'):
        rating_list.append(re.findall(r'\d+-star',str(rating))[0])
    return rating_list

The regular expression '\d+-star' matches one or more digits followed by '-star', so an entry such as '5-star' is extracted.
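As a small illustration of this pattern, the sketch below adds a capture group so the rating comes back as an integer. The 'img' markup in the example is an assumption made for illustration, since the real tag is not shown in this notebook, and star_rating_to_int is a hypothetical helper name.

```python
import re

def star_rating_to_int(img_html):
    """Return the number in front of '-star' as an int, or None if there is no match."""
    match = re.search(r'(\d+)-star', img_html)
    return int(match.group(1)) if match else None

# Assumed markup, for illustration only
print(star_rating_to_int('<img alt="5-star" src="5-star.png"/>'))
print(star_rating_to_int('<img src="logo.png"/>'))
```

Returning None for a tag without a rating avoids the IndexError that indexing an empty findall result would raise.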

The review helpfulness information is contained in 'p' tags with the class 'metadata'. Other information shares this tag and class, but only the helpfulness text starts with a digit. This is used to distinguish it from the other entries with the same tag and class.

In [12]:
# Review Helpfulness Extraction Function
def Review_HelpFulness_Extraction(pages):
    body=pages.body
    review_help_list=[]
    for review_help in body.find_all("p", class_="metadata"):
        text=review_help.contents[0]
        if text[0].isdigit(): # Keeps only entries that start with a digit
            review_help_list.append(text)
    return review_help_list
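The extracted helpfulness strings follow the form 'N out of M users found this review helpful', as the outputs later in this notebook show. A minimal sketch of turning such a string into two integers is given below; parse_helpfulness is a hypothetical helper and is not used elsewhere in the notebook.

```python
import re

def parse_helpfulness(text):
    """Return (helpful_votes, total_votes), or None if the text has another form."""
    match = re.match(r'(\d+) out of (\d+)', text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_helpfulness('20 out of 21 users found this review helpful'))
```

The same check that filters the 'metadata' entries (a leading digit) is implied by the pattern, since re.match only succeeds when the string starts with a number.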

The Review Text information is contained in html tag 'p' and class 'review-body'.

In [13]:
# Function to extract Review Text
def Reviews_Extraction(pages):
    body=pages.body
    reviews_list=[]
    for review in body.find_all("p", class_="review-body"):
        reviews_list.append(review.contents[0])
    return reviews_list

### Extracting Data

In [16]:
# Extracting titles
titles_list=[]
for link in links:
    link_page=Link_Pages(link)
    for page in link_page:
        titles_list.extend(Title_Extraction(page))
print('Number of Titles in List : {}'.format(len(titles_list)))
titles_list[-5:]

Number of Titles in List : 9244


['Perhaps too compostable?',
 'This gum is really great!',
 'You may need to do a little math to ensure the Superstore quantity is a good deal',
 'Who can resist?',
 'Best Coca Tea']

In [18]:
# Extracting Star Ratings
star_rating_list=[]
for link in links:
    link_page=Link_Pages(link)
    for page in link_page:
        star_rating_list.extend(Star_Ratings_Extract(page))
print('Number of Stars Ratings in List : {}'.format(len(star_rating_list)))
star_rating_list[-5:]

Number of Stars Ratings in List : 9244


['3-star', '5-star', '5-star', '4-star', '5-star']

In [19]:
# Extracting Review Helpfulness Info
review_help_list=[]
for link in links:
    link_page=Link_Pages(link)
    for page in link_page:
        review_help_list.extend(Review_HelpFulness_Extraction(page))
print('Number of Review HelpFulness Information in List : {}'.format(len(review_help_list)))
review_help_list[-5:]

Number of Review HelpFulness Information in List : 9244


['20 out of 21 users found this review helpful',
 '17 out of 17 users found this review helpful',
 '23 out of 25 users found this review helpful',
 '38 out of 40 users found this review helpful',
 '6 out of 17 users found this review helpful']

In [20]:
# Extracting Review Text
reviews_list=[]
for link in links:
    link_page=Link_Pages(link)
    for page in link_page:
        reviews_list.extend(Reviews_Extraction(page))
print('Number of Reviews in List : {}'.format(len(reviews_list)))
reviews_list[-5:]

Number of Reviews in List : 9244


["I bought these bags to go with Trading ECO-2000 2.4 Gallon Kitchen Compost Waste Collector that I purchased on Superstore. The bags fit perfectly, but they seem to start degrading within a few days. If I took out compost every day, it would be fine, but I empty my compost bucket about twice a week. When I lift a bag to take it outside, it is dripping, so I have to take the whole bucket out and scrub it thoroughly afterwards to get rid of the odor. Previously, when I used a regular plastic shopping bag, leakage was almost never a problem, but it is a regular occurance with these liners. I like that the bags are biodegradable, but rather than purchasing these again, I'll probably go back to re-using plastic bags when my supply runs out--unless I find a sturdier brand. Any suggestions? …",
 "If you have problems with Aspartame (which is in every gum out there...not just sugarfree gums) and you love to chew gum then SteviaDent is really the only option unless you want to eat TicTacs. It 

The needed data has been extracted as lists. We put these lists into a dictionary, using descriptive names as keys, and save this dictionary to disk.

In [21]:
# Saving List as Dictionary
data_dict=dict()
data_dict['Titles']=titles_list
data_dict['Star_Ratings']=star_rating_list
data_dict['Reviews Help']=review_help_list
data_dict['Reviews']=reviews_list
data_dict.keys()

dict_keys(['Titles', 'Star_Ratings', 'Reviews Help', 'Reviews'])
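Since the four lists have the same length and share the same order, the dictionary of lists can also be viewed as one record per review. The sketch below shows one way to do that; to_records is a hypothetical helper, and the review texts in the sample are placeholders rather than scraped data.

```python
def to_records(data_dict):
    """Turn a dict of equal-length lists into a list of per-review dicts."""
    keys = list(data_dict)
    return [dict(zip(keys, values)) for values in zip(*data_dict.values())]

# Small sample in the same shape as data_dict; the review texts are placeholders
sample = {
    'Titles': ['Who can resist?', 'Best Coca Tea'],
    'Star_Ratings': ['4-star', '5-star'],
    'Reviews Help': ['38 out of 40 users found this review helpful',
                     '6 out of 17 users found this review helpful'],
    'Reviews': ['placeholder review text A', 'placeholder review text B'],
}
records = to_records(sample)
print(records[0])
```

This layout relies on every list having the same length, which the counts of 9244 printed above confirm for the scraped data.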

#### Data Save

In [29]:
# Saving file path and creating file directory
path_to_save=r'C:\Users\DELL\Documents'
save_dir=os.path.join(path_to_save,'Product_Reviews')
os.makedirs(save_dir,exist_ok=True) # No error if the directory already exists
# Convert to json and save; the with block closes the file after writing
with open(os.path.join(save_dir,'Reviews_Data.json'),'w') as fp:
    json.dump(data_dict,fp)


The extracted data has been transformed into a dictionary and saved to disk.
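A quick way to check a saved file is to load it back and compare it with the original dictionary. The sketch below does this with a small sample in a temporary directory, so it does not touch the path used above; save_and_reload is a hypothetical helper.

```python
import json
import os
import tempfile

def save_and_reload(data, directory):
    """Write data as JSON into directory, then read it back."""
    path = os.path.join(directory, 'Reviews_Data.json')
    with open(path, 'w') as fp:
        json.dump(data, fp)
    with open(path) as fp:
        return json.load(fp)

sample = {'Titles': ['Best Coca Tea'], 'Star_Ratings': ['5-star']}
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(sample, tmp_dir)
print(reloaded == sample)
```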