# Scraping Trustpilot

[Trustpilot](https://www.trustpilot.com) is a consumer site containing product reviews for various brands out there. Each consumer review is associated with a star rating, this allows for labeling the data with respect to stars. Therefore, we want to scrape both text and star rating info from each review on the website.

The site is arranged with the following structure:
***
    Category Lv1
        |> Category Lv2
            |> Category Lv3
                |> Company
                    |> Review
***                   

For any webscraping task, we need to do the following:
 - Get URLs to scrape
 - Determine which element within each URL contain relevant information to be extracted

After some playing around on the website, it seems that the company URLs are rendered dynamically on the website front end through Ajax calls. This means we can't really extract the company content by accessing the page source html.

<img src="imgs/img1.png" alt="Drawing" style="width: 800px;"/>

To get around this, we can use `Selenium` which is a web automation platform that allows us to deploy **headless** browser sessions. Using this tool, we can then emulate an user clicking on each category, narrowing down to each sub-category on the website and going through all the companies one by one noting down (extract) their URLs. Once we have all of the URLs, then we can deploy a spider using `Scrapy` to crawl through `Trustpilot` scraping and storing all company reviews.

# 1. Collecting company URLs using Selenium

Starting from the [Categories](https://www.trustpilot.com/categories) page, the following can be observed:

 - The main body with all Lv1 and Lv2 categories are contained in the `<section class="rightColumn___17BWv">` element.
 
 <img src="imgs/img2.png" alt="Drawing" style="width: 800px;"/>
 
 
 - Each Lv1 category is contained in the `<h3 class="subCategoryHeader___36ykD">` element.
 
 <img src="imgs/img3.png" alt="Drawing" style="width: 800px;"/>
 
 
 - Each Lv2 category is contained in the `<div class="subCategoryItem____3ksKz">` element. And the URL link for each Lv2 category is contained in the `<a>` element within, and can be extracted from the its `href` attribute.
 
 <img src="imgs/img4.png" alt="Drawing" style="width: 800px;"/>

It seems that the second observtion above is the most useful, as we can use this info to loop through each Lv1 category, then use `Selenium` to extract URLs of companies within each Lv1 category.

 Since the Lv1 URL links can be easily extracted from the page source html, this can be achieve nicely with `BeautifulSoup` and `requests`.

## 1.1 Get Lv1 Company URLs

In [1]:
# Import relevant packages
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

### Functions for making requests to specified URLs using `requests`

In [2]:
# Set up functions for making requests to specified URLs
def is_good_response(resp):
    """
    Returns True if the response of a request seems to be HTML, 
    False otherwise.
    """
    # Extract the header content type from response
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200 # Status code being 200 means successful response
            and content_type is not None  # response header content type shouldn't be None
            and content_type.find('html') > -1) # Check the content type of the response, if the resulting string contains html then it's okay

def get_url(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    # Tries to get a response from the specified URL, log the error if it occurs
    try:
        # closing() is used as good practice to make sure that the network resource is closed outside the scope of the with block
        # setting stream=True allows access to Response.raw attribute which is the stream of Bytes
        with closing(get(url, stream=True)) as resp:
            # Check if response returned HTML/XML data
            if is_good_response(resp):
                return resp.content # Response.content returns a bytes object
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None
    
# Set up functions
def log_erro(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

### Functions for parsing response from the website using `BeautifulSoup`

In [17]:
def get_cat_urls(html_byte, base_url):
    '''
    Custom function for extracting Lv2 category URLs from
    Trustpilot web page.
    '''
    bs = BeautifulSoup(html_byte, 'html.parser')
    cat_divs = bs.select('h3.subCategoryHeader___36ykD') # Get all the div element that contains CatLv1 urls
    url_list = [div.select_one('a.navigation___2Efid')['href'] for div in cat_divs] # Get the URLs embeded in each a tag, within each div element
    url_list = [f"{base_url}/{url.split('/')[-1]}" for url in url_list] # Join CatLv1 urls to the base Trustpilot url
    
    return url_list

In [18]:
# Define base url
base_url = 'https://www.trustpilot.com/categories'
html_byte = get_url(base_url)
url_list = get_cat_urls(html_byte, base_url)

# Check first 5 URLs
url_list[:5]

['https://www.trustpilot.com/categories/animals_pets',
 'https://www.trustpilot.com/categories/beauty_wellbeing',
 'https://www.trustpilot.com/categories/business_services',
 'https://www.trustpilot.com/categories/construction_manufactoring',
 'https://www.trustpilot.com/categories/education_training']

In [41]:
print(f"Number of Lv1 Categories: {len(url_list)}")

Number of Lv1 Categories: 22


## 1.2 Use `Selenium` to extract company urls

Using the previously extracted Lv1 Category urls, we now can use `Selenium` to find urls for companies. This will be done by looping over the companies of each Lv1 Category and extracting their URLs.

If we take the category `animals_pets` as an example, company urls are embedded in an `<a class="wrapper___2rOTx>` as shown in the image below.

<img src="imgs/img5.png" alt="Drawing" style="width: 800px;"/>
 
However, each Lv1 category may have multiple pages, with each page containing many companies. Therefore, the basic workflow should consist of the following steps:
 - For each Lv1 category:
  - Loop through all pages
   - For each page:
       - Extract all company urls on the page
       
For each page within each Lv1 category, the `Next Page` button is located at the bottom of the bage and can be access with the `<a class="paginationLinkNormalize___scOgG paginationLinkNext___1LQ14">` element as shown below.

<img src="imgs/img6.png" alt="Drawing" style="width: 800px;"/>



#### **Note: It is important to be aware that if the website page source code changes, then the listed elements may result in failure to scrape the site.**

In [211]:
from selenium.webdriver import Firefox, FirefoxOptions
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# opts = FirefoxOptions()
# opts.headless = True

In [224]:
# Function for checking if next page exist
def next_page_check(browser, x_path):
    """
    Check if a page has the 'Next Page' button. If button doesn't exist,
    then indicate last page has been reached. The button page element is
    accessed by specifying a xpath to the element.
    """
    try:
        next_page_button = browser.find_element_by_xpath(x_path)
        return  True, next_page_button
    except NoSuchElementException as e:
        print(f'    last page reached.')
        return  False, None

In [223]:
def get_compurl_on_page(browser, x_path):
    """
    Extract all urls for companies displayed on a page.
    """
    company_elems = browser.find_elements_by_xpath(x_path)
    # Get all the company urls, filter out non company review related urls
    company_urls = [company_elem.get_attribute('href') 
                    for company_elem in company_elems
                    if 'review/' in company_elem.get_attribute('href')]
    # Filter out repeated company urls
    company_urls = list(set(company_urls))
    return company_urls

In [222]:
def wait_element_xpath(browser, xpath, timeout=10):
    """
    Wait for specified elements to load on a page. The maximum wait time is 
    specified by a timeout parameter (default to 10s) at which point an error 
    message will be logged. The elements are specified by means of an xpath.
    """
    try:
        element_present = EC.presence_of_all_elements_located((By.XPATH, xpath))
        element = WebDriverWait(browser, timeout).until(element_present)
        print(f'    Successfully loaded {browser.current_url}')
    except Exception as e:
        print(f'Error: {e}')

In [220]:
def extract_compurls_by_url(url, xpath_comp, xpath_next, timeout=10):
    """
    Given a specified url, extract all company urls within and follow 
    next page links repeating the same extraction process until last page 
    is reached.
    """
    # Initialising selenium with a headless Firefox driver.
    # This prevents Selenium from opening up an actual Firefox window
    # Increase speed for scraping
    opts = FirefoxOptions()
    opts.add_argument('--headless')
    opts.add_argument('--no-sandbox')
    opts.add_argument('start-maximized')
    opts.add_argument('disable-infobars')
    opts.add_argument("--disable-extensions")
    
    # Use context manager to ensure browser is closed after usage
    with Firefox(options = opts) as browser:
        browser.get(url)
        wait_element_xpath(browser, xpath_comp)
        
        # Check if next page exist
        # For each page extract company URLs and store them
        # When last page is reached, return the extracted company URLs
        next_page = True
        comp_urls = []
        while next_page:
            comp_urls += get_compurl_on_page(browser, xpath_comp)
            comp_urls = list(set(comp_urls))
            next_page, next_button = next_page_check(browser, xpath_next)
            if next_page:
                browser.get(next_button.get_attribute('href'))
                wait_element_xpath(browser, xpath_comp)
    return comp_urls

In [226]:
# Define the xpath for extracting company urls and finding next page button
# Note, these urls can change pending page source code updates
xpath_next_page = '//a[@Class="paginationLinkNormalize___scOgG paginationLinkNext___1LQ14"]'
xpath_companies = '//a[@Class="wrapper___2rOTx"]'

# Loop through all pages and extract all companies for each category
comp_urls_by_cat = {}
for url in url_list:
    category = url.split("/")[-1]
    category = " ".join(category.split("_"))
    print(f'Scraping company urls for category: {category}')
    comp_urls_by_cat[category] = extract_compurls_by_url(url, xpath_companies, xpath_next_page, timeout=15)
    print(f'    Found {len(comp_urls_by_cat[category])} company urls for {category}\n')

Scraping company urls for category: animals pets
    Successfully loaded https://www.trustpilot.com/categories/animals_pets
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=2
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=3
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=4
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=5
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=6
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=7
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=8
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=9
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=10
    Successfully loaded https://www.trustpilot.com/categories/animals_pets?page=11
    Successfully loaded https://www.trustpilot.com/cat

In [245]:
# Saving the data as csv for quick access later
import json
import pandas as pd

with open('company_urls.json', 'w') as file:
    json.dump(comp_urls_by_cat, file, indent=2)
    
comp_dict = {'category': None,
             'url': None}
for cat in comp_urls_by_cat:
    for url in comp_urls_by_cat[cat]:
        if not comp_dict['category']:
            comp_dict['category'] = [cat]
        if not comp_dict['url']:
            comp_dict['url'] = [url]
        comp_dict['category'].append(cat)
        comp_dict['url'].append(url)

df = pd.DataFrame(comp_dict)
df.to_csv('./data/company_urls.csv', index=False)

In [246]:
df.head() # Show subset of data sample

Unnamed: 0,category,url
0,animals pets,https://www.trustpilot.com/review/www.myollie.com
1,animals pets,https://www.trustpilot.com/review/www.myollie.com
2,animals pets,https://www.trustpilot.com/review/bingopetco.com
3,animals pets,https://www.trustpilot.com/review/a-z-reptiles...
4,animals pets,https://www.trustpilot.com/review/crittersandt...


## 1.3 Deploy spider and scrape review data per company using Scrapy

Before we start scraping, we will have to set up a new Scrapy project in the directory where we’d like to store our spider.

In [4]:
!scrapy startproject ScrapyTurstpilot

New Scrapy project 'ScrapyTurstpilot', using template directory '/home/mason/Mason/Learning/Projects/Web-Sentiment-ML-Model-Deploy/venv/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /home/mason/Mason/Learning/Projects/Web-Sentiment-ML-Model-Deploy/src/data_collection/ScrapyTurstpilot

You can start your first spider with:
    cd ScrapyTurstpilot
    scrapy genspider example example.com


The initial directory created for the new Scrapy project will have the following structure:

    ScrapyTrustpliot/
        scrapy.cfg            # deploy configuration file

        ScrapyTrustpilot/     # project's Python module, we import our code from here
            __init__.py

            items.py          # project items definition file

            middlewares.py    # project middlewares file

            pipelines.py      # project pipelines file

            settings.py       # project settings file

            spiders/          # directory where we put our spiders
                __init__.py

To build the scraper, a spider needs to be created in the `spiders` folder. In this case, `tp_spider.py` has been created which contains the starting urls (company urls) for the spider to crawl from as well as the scraping logic.

Before any scraping takes place, some settings for the Scrapy project should be adjusted. There are many mechanisms that can be used to set Scrapy settings each with different level of priority (more details can be found [here](https://docs.scrapy.org/en/latest/topics/settings.html)). The standard method is to provide project level settings, using the project settings module i.e. `ScrapyTrustpilot.settings.py`. The relevant settings that should be changed for this project are listed below:

 - `ROBOTSTXT_OBEY` default: `False` but set to `True` in `settings.py`. Change this to `False` so that spiders don't obey robots.txt policies which generally inhibits any form of scraping.
 - `CONCURRENT_REQUESTS` default: 16. Change this to 32 to increase maximum number of concurrent requests that will be performed by the Scrapy downloader.
 - `DOWNLOAD_DELAY` default: 0. Change this to 0.5 to increase the amount of time (s) that the downloader should wait before downloading consecutive pages from the same website. This is used to throttle crawling speed to avoid hitting servers too hard (being more 'polite').
 - `FEEDS` default: {}. Change this for custom export format for scraped data. Multiple exporters can be defined each as a dictionary within the `FEEDS` dictionary. Each exporter dictionary has a file path as its key, and should at least contain the `format` field in its values e.g. csv, json, xml etc.

Looking at a single company's page on trustpilot, there seems to be the following interesting information that can be extracted:

 - company_name: the company name being reviewed
 - company_logo: the URL of the logo of the company being reiewed
 - company_website: the website of the company being reviewed
 - url_website: the company URL on Trustpilot
 - comment: The text review from each reviewer.
 - rating: the number of start (1 to 5)
 
These variables are discussed individually where the xpath to their corresponding element is derived (**using experiments with the Scrapy shell**).

### **company_name**

The xpath can be specified as: `//span[@class="multi-size-header__big"]/text()`

<img src="imgs/img7.png" alt="Drawing" style="width: 800px;"/>

### **company_logo**

The xpath can be specified as: `//img[@class="business-unit-profile-summary__image"]/@src`. Note the url returned will be without the `https:` portion and this should be appended.

<img src="imgs/img8.png" alt="Drawing" style="width: 800px;"/>

### **company_website**

The xpath can be specified as: `//a[@class='badge-card__section badge-card__section--hoverable company_website']/@href`.

<img src="imgs/img9.png" alt="Drawing" style="width: 800px;"/>

### **url_website**

This can simply be found by using the Scrapy `response` object attribute `response.url`.

### **comment**

Each review has a heading and a body. There is a little nuance here, it seems that each review always have a heading but may be void of an actual comment body. Therefore, for each review, we need to extract any text available (heading or body or both).

The xpath to all the review blocks can be expressed by: `//div[@class='review-content__body']`

<img src="imgs/img10.png" alt="Drawing" style="width: 800px;"/>

The xpath to extract the text from an individual review block can be expressed by: `.//text()`

### **rating**

The xpath to all ratings on a page can be expressed by: `//div[@class='star-rating star-rating--medium']/img/@alt`

<img src="imgs/img11.png" alt="Drawing" style="width: 800px;"/>

## 1.4 Overall process for scraping review data

The overall process for scraping review data from trustpilot will have the following steps:
 - Loop through previously collected company URLs
     - For each company:
         - Extract company info
         - Extract reiview comments and ratings from all available pages

In order to extract all available reviews for a company, we will need to deal with pagination by following the next page link. Scrapy allows recursive follwing of links with `response.follow()`, as long as we can provide the relevant pagination element. The xpath to the next-page button element can be expressed by: `//a[@class='button button--primary next-page']/@href`.

<img src="imgs/img12.png" alt="Drawing" style="width: 800px;"/>

Based on the above process, the `tp_spider.py` script is set up to complete the afforementioned scraping procedure.

## 1.5 Launching the spider

In [None]:
!cd src/ScrapyTurstpilot/ScrapyTurstpilot
!scrapy crawl trustpilot

## 1.6 Investigate the data

In [1]:
import pandas as pd

First thing first, let's see how big the dataset is...

In [56]:
# The data is quite big (A few GBs), so we have to process it in chunks
df = pd.read_csv('ScrapyTurstpilot/scraped_data.csv', chunksize=100_000)

len_data = 0
memory_usage = pd.Series(dtype='object')

for chunk in df:
    
    # Determine length of dataset
    len_data += len(chunk)
    
    # Determine memory usage of dataframe
    if len(memory_usage) == 0:
        memory_usage = chunk.memory_usage(deep=True)/1e6
    else:
        memory_usage += chunk.memory_usage(deep=True)/1e6

print(f'Number of reviews scraped from Trustpilot: {len_data:,} reviews\n')
print(f'Memory usage per column (MB):\n{memory_usage}\n')
print(f'Total memory usage of dataframe: {memory_usage.sum()/1e3:.2f} GB')

Number of reviews scraped from Trustpilot: 5,980,003 reviews

Memory usage per column (MB):
Index                 0.007916
company_name        423.946037
company_logo        839.180596
company_website     926.550001
url_website         694.144113
comment            2006.976003
rating               47.840024
dtype: float64

Total memory usage of dataframe: 4.94 GB


So the dataset contains roughly ~6 million reviews scraped from trustpilot and will occupy around ~5 GB memory if loaded in its entirety. Out of all the variables, `comment` takes up most space as it should since it contains large quantity of text which actually includes emojis. Let's have a look at the last chunk read from the scraped data to see what the data looks like.

In [57]:
chunk.head()

Unnamed: 0,company_name,company_logo,company_website,url_website,comment,rating
5900000,PureFormulas,https://s3-eu-west-1.amazonaws.com/tpd/logos/4...,https://www.pureformulas.com?utm_medium=compan...,https://www.trustpilot.com/review/www.pureform...,"excellent, love being able to read the bottle ...",5
5900001,PureFormulas,https://s3-eu-west-1.amazonaws.com/tpd/logos/4...,https://www.pureformulas.com?utm_medium=compan...,https://www.trustpilot.com/review/www.pureform...,"Fast delivery,love the items I've ordered & fa...",5
5900002,PureFormulas,https://s3-eu-west-1.amazonaws.com/tpd/logos/4...,https://www.pureformulas.com?utm_medium=compan...,https://www.trustpilot.com/review/www.pureform...,"So simple and easy to shop on line,I will orde...",5
5900003,ThriftBooks,https://s3-eu-west-1.amazonaws.com/tpd/screens...,http://www.thriftbooks.com?utm_medium=company_...,https://www.trustpilot.com/review/www.thriftbo...,Great bargains and quick delivery.,5
5900004,ThriftBooks,https://s3-eu-west-1.amazonaws.com/tpd/screens...,http://www.thriftbooks.com?utm_medium=company_...,https://www.trustpilot.com/review/www.thriftbo...,Grandson loved the books.,4


A few things to do with the data to make it better, note any cleaning or processing really just need to be performed on `comment` and `rating`:

 - Remove any characters from `comment` that are not letters, numbers and normal punctuations.
 - Convert all letters in `comment` to lower case.
 - Remove any `comment` with less that 3 words.
 - Remove any rows with `rating` missing or not a integer out of 5.
 - Drop duplicate observations accross the entire dataset, could be accomplished using hashtables.
 - Investigate the distribution of `rating` to see if any inherent bias exist in the label data.
