# A Web Crawler Program

Name: Sau Yee Yiu

Python Version: Python 3

In this program, the customer reviews of Myer (an upmarket department store chain in Australia) collected from the Trustpilot website (https://au.trustpilot.com/) are used as an example to demonstrate how a Python program can be written to scrape all customer reviews from that website and prepare a dataset for later sentiment analysis.

## Step 1: Import the libraries needed for web scraping


In [None]:
#Import Libraries
import requests
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd

## Step 2: Fetch the content of the first page

First, use BeautifulSoup to parse the HTML returned by the requests get() function, and store the result in a BeautifulSoup object called 'soup'. Then use BeautifulSoup's prettify() method to examine the structure of the underlying HTML of the first page.

In [None]:
#Specify the web page for scraping
page = requests.get('https://au.trustpilot.com/review/www.myer.com.au')
print(page.status_code)
#Parse the HTML into the BeautifulSoup parse tree format
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

## Step 3: Get the title of the page


In [None]:
#To get the title of the page
title = soup.title
print(title)
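
Printing `soup.title` shows the whole tag, including the `<title>` markup. As a minimal offline sketch of the difference between the tag and its text, the example below parses a small hand-written HTML string; `sample_html` and the `sample_*` names are hypothetical and are not the real Trustpilot markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real Trustpilot HTML).
sample_html = (
    "<html><head><title>Myer Reviews</title></head>"
    "<body><p>Hello</p></body></html>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title          # the whole <title> tag
title_text = sample_soup.title.string  # only the text inside the tag

print(title_tag)
print(title_text)
```

Using `.string` (or `.get_text()`) is usually more convenient when the title is stored or compared as plain text.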


## Step 4: Retrieving the target content of the first web page

Use the find_all() method to find all customer reviews on the web page. Examining the HTML with Chrome DevTools shows that each customer review is stored in a 'div' container whose class attribute is "review-card".

In [None]:
#use find_all() to extract all the div containers that have a class attribute review-card
myer_reviews = soup.find_all('div', class_='review-card')
print(type(myer_reviews))
print(len(myer_reviews)) #print total number of customer reviews on first webpage
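
To see how `find_all()` selects elements by class without touching the network, the sketch below runs the same call on a small hand-written HTML string. The markup is hypothetical; only the class name `review-card` comes from the step above. It also shows that `class_` matches an element when any one of its classes equals the given name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two review containers and one unrelated div.
sample_html = """
<div class="review-card">First review</div>
<div class="review-card highlighted">Second review</div>
<div class="advert">Not a review</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ is used because "class" is a reserved word in Python.
cards = sample_soup.find_all("div", class_="review-card")
card_texts = [card.get_text(strip=True) for card in cards]

print(len(cards))
print(card_texts)
```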


## Step 5: Extract customer reviews data on the first webpage

The following shows how to extract the reviewer's name, the total number of reviews written by the reviewer, the 1-5 star rating, the review title, and the review text from each review in myer_reviews.

In [None]:
reviewer_name = []
reviewer_count = []
review_rating = []
review_title = []
review_details = []

#extract each review from the div containers (i.e., myer_reviews)
for r in myer_reviews:
    #get the name of the reviewer
    ID = r.find('div', class_='consumer-information__name').text
    reviewer_name.append(ID.strip())
    #get the number of reviews written by the reviewer
    r_count = r.find('span').text
    #remove the word 'review' from r_count
    index = r_count.find(' review')
    r_count_sub = int(r_count[:index])
    reviewer_count.append(r_count_sub)
    #get the title of the review
    r_title = r.find('h2', class_='review-content__title').text
    review_title.append(r_title.strip()) #strip method removes the '\n' characters in r_title
    #get the reviewer rating
    r_rate = r.find('img')['alt']
    #only extract the text that follows ': ' in r_rate
    index = r_rate.find(': ')
    review_rating.append(r_rate[index+2:])
    #get the details of the review
    r_text = r.find('p', class_='review-content__text').text
    review_details.append(r_text.strip()) #strip method removes the '\n' characters in r_text

print(len(review_details))
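
The slicing in this step relies on `str.find()`, which returns -1 when the separator is missing, so an unexpected string would be cut in the wrong place. As a rough sketch of a more defensive alternative, the two helper functions below use `str.partition()` instead. The function names and the sample strings are hypothetical; the exact wording Trustpilot uses for these fields is an assumption.

```python
def parse_review_count(text):
    """Return the leading integer from text such as '3 reviews', or None."""
    head, sep, _ = text.strip().partition(" review")
    if not sep:
        return None  # the word 'review' was not found
    try:
        return int(head)
    except ValueError:
        return None  # the part before 'review' was not a number


def parse_rating_label(alt_text):
    """Return the text after ': ', or the whole string if ': ' is absent."""
    _, sep, label = alt_text.partition(": ")
    return label.strip() if sep else alt_text.strip()


# Hypothetical sample strings, for illustration only.
print(parse_review_count("3 reviews"))
print(parse_review_count("no data"))
print(parse_rating_label("4 stars: Great"))
```

Returning None for a value that cannot be parsed keeps the lists the same length, which matters later when they are combined into a dataframe.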


## Step 6: Extend the method in Step 5 to extract all customer reviews of Myer (there are 19 webpages of customer reviews)

In [None]:
reviewer_name = []
reviewer_count=[]
review_rating=[]
review_title=[]
review_details=[]

page_no=[str(i) for i in range(1,20)] #pages 1 to 19 (19 pages in total)

for page in page_no:
    #Specify the web page to scrape; the first page has no page parameter
    if page=='1':
        url = 'https://au.trustpilot.com/review/www.myer.com.au'
    else:
        url = 'https://au.trustpilot.com/review/www.myer.com.au?page='+ page
    web = requests.get(url)
    if web.status_code == 200:
        print("Successfully downloaded page ", page)
    else:
        print("Failed to download page ", page)

    #Parse the HTML into the BeautifulSoup parse tree format
    soup = BeautifulSoup(web.text, 'html.parser')

    #use find_all() to extract all the div containers that have a class attribute review-card
    myer_reviews = soup.find_all('div', class_='review-card')
    #extract each review from the div container (i.e., myer_reviews)
    for r in myer_reviews:
        #get the name of a reviewer
        ID=r.find('div', class_='consumer-information__name').text
        reviewer_name.append(ID.strip())
        #get the number of reviews written by the reviewer
        r_count=r.find('span').text
        #remove the word 'review' from r_count
        index=r_count.find(' review')
        r_count_sub=int(r_count[:index])
        reviewer_count.append(r_count_sub)
        #get the title of the review
        r_title=r.find('h2', class_='review-content__title').text
        review_title.append(r_title.strip())
        #get the reviewer rating
        r_rate=r.find('img')['alt']
        #just extract rating text
        index=r_rate.find(': ')
        review_rating.append(r_rate[index+2:])
        #get the details of the review
        r_text = r.find('p', class_='review-content__text').text
        review_details.append(r_text.strip())
      
    #control the loop rate using the sleep() function
    sleep(3)
print('done')
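
The page URLs follow a simple pattern: the first page is the bare review URL, and later pages add a `?page=N` query string. One way to keep that rule in a single place is a small helper, sketched below. `BASE_URL` and `build_page_url` are hypothetical names introduced for illustration; the URL and the count of 19 pages come from this step.

```python
BASE_URL = "https://au.trustpilot.com/review/www.myer.com.au"


def build_page_url(page_number):
    """Return the review URL for a 1-based page number."""
    if page_number == 1:
        return BASE_URL  # the first page has no page parameter
    return f"{BASE_URL}?page={page_number}"


# Pages 1 to 19 inclusive; range() excludes its upper bound.
page_urls = [build_page_url(n) for n in range(1, 20)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

Building the list first makes it easy to check the number of pages before any request is sent.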

## Step 7: Convert the data collected in Step 6 to a pandas DataFrame

In [None]:
myer_reviews = pd.DataFrame({'reviewer name': reviewer_name, 'number of reviews by reviewer': reviewer_count, 
                              'review title': review_title, 'review rating': review_rating,
                               'review' : review_details})
myer_reviews.head()
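
Building a dataframe from a dictionary of lists requires every list to have the same length, with one entry per review. The sketch below illustrates this with two made-up rows; the values are hypothetical and are not real Trustpilot data.

```python
import pandas as pd

# Hypothetical values, for illustration only.
sample_names = ["Alex", "Sam"]
sample_counts = [3, 1]
sample_titles = ["Fast delivery", "Order never arrived"]
sample_ratings = ["Great", "Bad"]
sample_texts = ["Arrived in two days.", "Still waiting after a month."]

sample_reviews = pd.DataFrame({
    "reviewer name": sample_names,
    "number of reviews by reviewer": sample_counts,
    "review title": sample_titles,
    "review rating": sample_ratings,
    "review": sample_texts,
})

# shape is (number of rows, number of columns)
print(sample_reviews.shape)
print(list(sample_reviews.columns))
```

If the lists differ in length, pandas raises a ValueError, which is a useful early signal that a field was missed for some review.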

## Step 8: Save as CSV file

Save the dataframe content to a CSV file.


In [None]:
#to_csv() returns None when a file path is given, so the result is not assigned
myer_reviews.to_csv('myer_customer_reviews.csv', index=False, header=True)
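
As a quick check of what the saved file contains, the sketch below writes a small dataframe to CSV text in memory and reads it back. The data is hypothetical; calling `to_csv()` without a path returns the CSV as a string, and `index=False` leaves out the row index column.

```python
import io

import pandas as pd

# Hypothetical two-row dataframe, for illustration only.
sample = pd.DataFrame({
    "reviewer name": ["Alex", "Sam"],
    "review rating": ["Great", "Bad"],
})

# With no path argument, to_csv() returns the CSV content as a string.
csv_text = sample.to_csv(index=False)
header_line = csv_text.splitlines()[0]

# Read the text back to confirm that the columns and values survive.
restored = pd.read_csv(io.StringIO(csv_text))

print(header_line)
print(restored)
```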