Kelvin Indogun

## Review Website Scraper Script

#### Initial Setup for Web Scraping
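Before scraping, I imported the Python libraries the script depends on: requests and BeautifulSoup for fetching and parsing pages, pandas for storing the results, and standard-library modules for dates, delays, logging, randomization, and regular expressions.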

In [1]:
# Importing required libraries for web scraping and data processing
from bs4 import BeautifulSoup  # For HTML parsing
import requests  # For making HTTP requests to the website
import pandas as pd  # For data manipulation and storage
import datetime as dt  # For handling date and time
import time  # For managing delays in requests
import logging  # For logging information during the scraping process
import random  # For randomizing request intervals
import re  # For regular expression operations

#### Logging Setup for Web Scraping

To efficiently monitor and troubleshoot the web scraping process, I set up a logging system. Logging is essential in capturing real-time events, errors, and informational messages that occur during the scraping. This setup will help to keep track of the script's execution flow and debug any issues.

I configured a logger with two handlers:

File Handler: This handler writes logs to a file named 'scraping_logs.log'. It is useful for storing a permanent record of the script's execution, which I can review later for analysis or debugging.

Stream Handler: This handler outputs logs to the console. It's beneficial for real-time monitoring of the script's progress and immediate awareness of any issues or significant events.

Both handlers use a standard format to record the timestamp, log level, and message, providing a consistent and detailed log output.

In [2]:
# Setting up the logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# file handler to write logs to a file
file_handler = logging.FileHandler('scraping_logs.log')
file_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(file_handler)

# stream handler to print logs to the console
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(stream_handler)
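
One practical wrinkle in a notebook: re-running the cell above attaches a second pair of handlers to the same logger, so every message is then written twice. A minimal sketch of a guard against that follows. The `setup_logger` helper, the `scraper` logger name, and the demo log path are illustrative assumptions, not part of the original script.

```python
import logging
import os
import tempfile

LOG_FORMAT = '%(asctime)s - %(levelname)s - %(message)s'

def setup_logger(log_path, name="scraper"):
    """Return a logger with one file handler and one stream handler (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output if the root logger also has handlers
    if logger.handlers:       # already configured: do not attach the handlers again
        return logger
    formatter = logging.Formatter(LOG_FORMAT)
    for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger

# Demo path in the temp directory so the sketch does not touch scraping_logs.log
log_path = os.path.join(tempfile.gettempdir(), "scraping_logs_demo.log")
first = setup_logger(log_path)
second = setup_logger(log_path)  # simulates running the cell a second time
print(len(second.handlers))      # still 2, not 4
```

Using a named logger instead of the root logger is a design choice of this sketch; the same `if logger.handlers` check would also work on the root logger used above.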

##### Base code was obtained from https://github.com/evan-roberts/trustpilot-scrape/blob/main/trustpilot.py
##### However, many modifications have been made to the code base to make it more modular

#### Generating Random Headers for Web Requests

To make the web scraping activities more robust and less detectable by websites, I implemented a function to generate random HTTP headers for each request. This function, get_random_headers, selects a user agent at random from a predefined list and sets other HTTP header fields such as 'Accept-Encoding', 'Accept', and 'Accept-Language'.

The use of different user agents helps in mimicking the behavior of different browsers, making the scraping requests appear more like regular browser traffic. This strategy is often employed to prevent the scraper from being identified and blocked by the website's anti-bot mechanisms.

In [3]:
def get_random_headers():
    # List of user agents representing different browsers
    user_agent_list = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
                       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.60",
                       "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0"]
    # Randomly select a user agent and set other headers
    return {
        'User-Agent': random.choice(user_agent_list),
        'Accept-Encoding': 'gzip, deflate',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en',
    }

#### Fetching Web Pages with Error Handling

The fetch_page function is central to the web scraping script, as it handles the task of sending HTTP requests to retrieve web pages. It is designed to be resilient against common issues like request failures or non-200 HTTP responses.

The function takes several parameters, including the number of retries, delay between retries, and a backoff factor to increase the delay after each failed attempt. This approach helps manage situations where the server might be temporarily unresponsive or busy.

Each time the function sends a request, it checks the HTTP response status code. A status code of 200 indicates a successful request, and the function then returns the page content. If the status code is not 200, or if a request exception occurs (e.g., timeout), the function logs an error message and retries the request after a specified delay, increasing the delay with each attempt.

If the function fails to fetch the page after the designated number of retries, it logs an error and returns None, so the caller can skip that page and carry on instead of crashing. This error handling lets the scraping script recover from common network issues.

In [4]:
def fetch_page(session, url, retries=3, delay=10, backoff_factor=2):
    # Attempt to fetch the page content with specified number of retries
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=10)  # Send a GET request to the URL
            if response.status_code == 200:
                return response.text  # Return page content if status code is 200
            else:
                # Log error if status code is not 200
                logging.error(f"Error {response.status_code} for {url}. Attempt {attempt + 1}. Retrying after {delay} seconds...")
        except requests.RequestException as e:
            # Log error if request fails
            logging.error(f"Request failed for {url}. Error: {e}. Attempt {attempt + 1}. Retrying after {delay} seconds...")
        
        # Sleep before the next retry; no wait is needed after the final attempt
        if attempt < retries - 1:
            time.sleep(delay)
            delay *= backoff_factor  # Increase delay for the next retry
    
    # Log an error if unable to fetch after all retries
    logging.error(f"Failed to fetch {url} after {retries} attempts.")
    return None
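
To make the backoff concrete, the sketch below reproduces only the delay arithmetic of fetch_page, with no network calls. The `backoff_delays` helper is hypothetical and exists purely for illustration; it applies the same `delay *= backoff_factor` update with the same defaults.

```python
def backoff_delays(n_waits, delay=10, backoff_factor=2):
    """Return the successive wait times (in seconds) of an exponential backoff."""
    delays = []
    for _ in range(n_waits):
        delays.append(delay)
        delay *= backoff_factor  # same update rule as in fetch_page
    return delays

waits = backoff_delays(3)
print(waits, sum(waits))  # [10, 20, 40] 70
```

The waits grow geometrically, so raising `retries` by even one or two quickly adds minutes to the handling of a single failing page.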

#### Parsing Dates from Web Scraped Content

In [5]:
def parse_date(date_string):
    # List of expected date formats
    date_formats = ["%d %B %Y", "%d %b %Y"]  # %B for full month name, %b for abbreviated name

    # Try parsing the date string with each format
    for fmt in date_formats:
        try:
            return dt.datetime.strptime(date_string, fmt).date()  # Convert to date object if format matches
        except ValueError:
            # Continue to next format if current one doesn't match
            pass
    
    # Log a warning if the date string doesn't match any format
    logging.warning(f"Could not parse date: {date_string}")
    return "N/A"  # Return "N/A" if unable to parse
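
The two format strings cover dates written with a full or an abbreviated English month name. The sketch below is a quick check of which strings they accept, using made-up samples and assuming an English locale. The `try_parse` name is hypothetical, and unlike parse_date it returns None instead of "N/A" on failure.

```python
import datetime as dt

DATE_FORMATS = ["%d %B %Y", "%d %b %Y"]  # full and abbreviated month names

def try_parse(date_string):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return dt.datetime.strptime(date_string, fmt).date()
        except ValueError:
            continue  # try the next format
    return None

# Assumed sample strings, not captured from the site
for sample in ["11 October 2023", "11 Oct 2023", "Oct 11, 2023"]:
    print(repr(sample), "->", try_parse(sample))
```

The first two samples parse to the same date, while the third (month first) matches neither format. Returning None rather than a string is one way to keep the resulting column free of mixed types, at the cost of handling missing values later.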

#### Parsing Individual Reviews

The parse_review function plays a key role in the web scraping script by extracting detailed information from each review. Given a review element, the name of the business, and the page number from which the review is being scraped, it extracts the following information:

Review Title: The title of the review.

Review Date: The date when the review was posted, processed to handle relative dates (like "2 days ago") as well as specific dates.

Review Rating: Numerical rating given by the reviewer.

Review Text: The main text content of the review.

Date of Experience: When the reviewer experienced the service or product.

Page Number: The page number from which the review was scraped, useful for tracking and organizing data.

This function incorporates robust error handling and logging. In case any part of the parsing fails, the function logs the error with details about the review that caused the issue, making it easier to identify and fix parsing problems.
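
The relative-date handling mentioned above can be isolated into a small helper, which makes it testable without a live page. This is a minimal sketch under stated assumptions: `relative_to_date` and its `today` parameter are hypothetical, and the sample strings are illustrative, not captured from the site.

```python
import datetime as dt
import re

def relative_to_date(text, today):
    """Map strings such as '2 days ago' to a date; return None if not relative."""
    lowered = text.lower()
    # Seconds, minutes and hours all count as "today"
    if re.search(r"\b(second|minute|hour)s? ago", lowered):
        return today
    if "a day ago" in lowered:
        return today - dt.timedelta(days=1)
    match = re.search(r"(\d+) days ago", lowered)
    if match:
        return today - dt.timedelta(days=int(match.group(1)))
    return None  # not a relative date; an absolute-date parser can take over

today = dt.date(2023, 10, 11)  # fixed reference date so the result is reproducible
for text in ["3 hours ago", "A day ago", "2 days ago", "11 October 2023"]:
    print(repr(text), "->", relative_to_date(text, today))
```

Passing `today` in explicitly, instead of calling `dt.datetime.now()` inside the helper, is a design choice that keeps the output deterministic.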

In [6]:
def parse_review(review, business, page_number):
    try:
        # Review title
        review_title_element = review.find("h2", class_="typography_heading-s__f7029")
        review_title = review_title_element.get_text() if review_title_element else "N/A"

        # Review date
        review_date_element = review.find("time")
        review_date_text = review_date_element.get_text() if review_date_element else "N/A"
        review_date = "N/A"
        if "hours ago" in review_date_text.lower() or "hour ago" in review_date_text.lower():
            review_date = dt.datetime.now().date()
        elif "a day ago" in review_date_text.lower():
            review_date = dt.datetime.now().date() - dt.timedelta(days=1)
        elif "days ago" in review_date_text.lower():
            # Match against the lowercased text so the regex agrees with the check above
            days_ago = int(re.search(r'(\d+) days ago', review_date_text.lower()).group(1))
            review_date = dt.datetime.now().date() - dt.timedelta(days=days_ago)
        elif "minutes ago" in review_date_text.lower() or "minute ago" in review_date_text.lower():
            review_date = dt.datetime.now().date()
        elif "seconds ago" in review_date_text.lower() or "second ago" in review_date_text.lower():
            review_date = dt.datetime.now().date()
        elif review_date_text:
            review_date = parse_date(review_date_text)

        # Review rating
        review_rating_element = review.find("img", alt=True)
        review_rating = "N/A"
        if review_rating_element and review_rating_element["alt"]:
            match = re.search(r"(\d+)", review_rating_element["alt"])
            if match:
                review_rating = int(match.group(1))

        # Review text
        review_text_element = review.find("p", class_="typography_body-l__KUYFJ")
        review_text = review_text_element.get_text() if review_text_element else "N/A"

        # Date of experience
        date_of_experience_element = review.find("p", class_="typography_body-m__xgxZ_")
        date_of_experience = "N/A"
        if date_of_experience_element:
            match = re.search(r":\s*(.*)", date_of_experience_element.get_text())
            if match:
                date_of_experience = match.group(1)
    
        return {
            'business': business,
            'review_title': review_title,
            'review_date': review_date,
            'review_rating': review_rating,
            'review_text': review_text,
            'date_of_experience': date_of_experience,
            'page_number': page_number
        }
    except Exception as e:
        logging.error(f"Error parsing review. Error: {e}. Review content: {review}")
        return None
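
The rating and the date of experience are both pulled out of free text with short regular expressions. The sketch below runs the same two patterns on assumed sample strings; the exact wording of the image alt text and of the date label on the live site may differ.

```python
import re

# Assumed sample strings; the real alt text and label wording may differ
alt_text = "Rated 4 out of 5 stars"
experience_text = "Date of experience: 09 October 2023"

# First run of digits in the alt text, so "4" is taken and the later "5" is ignored
rating_match = re.search(r"(\d+)", alt_text)
rating = int(rating_match.group(1)) if rating_match else "N/A"

# Everything after the first colon, with leading whitespace skipped
experience_match = re.search(r":\s*(.*)", experience_text)
date_of_experience = experience_match.group(1) if experience_match else "N/A"

print(rating, "|", date_of_experience)
```

Because the rating pattern takes the first number it sees, it relies on the rating appearing before the scale in the alt text.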


#### Scraping Business Reviews Across Multiple Pages

The scrape_business_reviews function automates the process of extracting reviews for a given business across multiple pages. It iterates through pages from a specified starting page (from_page) to an ending page (to_page), scraping reviews from each page.

Key steps in this function include:

Generating Dynamic HTTP Headers: Before each request, it updates the session headers with random user-agent strings to mimic different browsers.

Rate Limiting: To avoid overwhelming the server and to mimic human browsing behavior, the function includes a delay between requests.

Review Extraction: For each page, the function parses the HTML content using BeautifulSoup and extracts individual reviews using the previously defined parse_review function.

Session Renewal: To prevent issues with long-lived sessions like being flagged as a bot, the function renews the HTTP session at regular intervals determined by SESSION_RENEW_FREQUENCY.

In [7]:
SESSION_RENEW_FREQUENCY = 100  # Renew the HTTP session every 100 pages

def scrape_business_reviews(session, business_url, from_page=1, to_page=1000):
    # Initializes a list to store scraped reviews
    reviews = []

    # Extracts the business name from the URL and strips a trailing .com or .co.uk
    business_name = business_url.split('/')[-1]
    business_name = re.sub(r'(www\.)?(\.com|\.co\.uk)$', '', business_name)

    # Loop through the specified range of pages
    for i in range(from_page, to_page + 1):
        logging.info(f"Scraping page {i} of {business_name}")
        
        # Updates session headers and pauses execution to mimic human behavior
        session.headers.update(get_random_headers())
        time.sleep(random.uniform(5, 20))

        # Fetches the content of the current page
        page_url = f"{business_url}?page={i}"
        page_html = fetch_page(session, page_url)
        if page_html is None:
            continue  # Skip to the next page if fetching failed

        # Parses the page content and extracts reviews
        soup = BeautifulSoup(page_html, "html.parser")
        for review_html in soup.find_all("article", {"data-service-review-card-paper": "true"}):
            try:
                review = parse_review(review_html, business_name, i)
                if review:
                    reviews.append(review)
                else:
                    logging.warning(f"No review data found on page {i} for {business_name}.")
            except Exception as e:
                logging.error(f"Error parsing review on page {i} for {business_name}. Error: {e}")

        # Renews the session after a set number of pages
        if i % SESSION_RENEW_FREQUENCY == 0:
            session = requests.Session()
            session.headers.update(get_random_headers())
            logging.info("Renewed session.")

    return reviews
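
A note on the name cleaning: as written, the pattern removes a trailing .com or .co.uk but leaves a leading www. in place, which is why the log output shows names such as "www.staysure". The sketch below compares it with a hypothetical alternative (`STRIP_BOTH_PATTERN`, not part of the original script) that strips both ends.

```python
import re

ORIGINAL_PATTERN = r'(www\.)?(\.com|\.co\.uk)$'  # pattern used in scrape_business_reviews
STRIP_BOTH_PATTERN = r'^www\.|\.(com|co\.uk)$'   # hypothetical alternative

for slug in ["www.staysure.co.uk", "wise.com", "purplebricks.co.uk"]:
    as_written = re.sub(ORIGINAL_PATTERN, '', slug)
    stripped = re.sub(STRIP_BOTH_PATTERN, '', slug)
    print(f"{slug!r}: {as_written!r} vs {stripped!r}")
```

Switching patterns would change the values stored in the business column, so any data already collected with the original pattern would need the same cleaning to stay comparable.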


#### Comprehensive Scraping Across Multiple Businesses
The scrape_all_businesses function orchestrates the scraping of review data from a list of business URLs (businesses_url). This function is designed to handle large-scale scraping tasks by iterating through each business URL, scraping reviews from a range of pages, and aggregating the results.

Key features of this function include:

Session Management: It initializes an HTTP session with randomized headers to start the scraping process.

Iterative Scraping: The function scrapes reviews for each business URL using the scrape_business_reviews function.

Data Aggregation: Successfully scraped reviews are added to a master list (all_reviews), which is used to compile the final dataset.

Checkpointing: To safeguard against data loss, the function saves interim results to CSV files after every few businesses (determined by CHECKPOINT_FREQUENCY). This checkpointing mechanism is especially useful for long-running scraping tasks.

Dataframe Creation: Once all businesses have been scraped, the aggregated review data is converted into a Pandas DataFrame for easy analysis and storage.

In [8]:
CHECKPOINT_FREQUENCY = 5

def scrape_all_businesses(businesses_url, from_page=1, to_page=500):
    # Initialize a list to store all reviews and a new HTTP session
    all_reviews = []
    session = requests.Session()  # Create an initial session
    session.headers.update(get_random_headers())
    
    # Iterate over each business URL to scrape reviews
    for idx, business_url in enumerate(businesses_url, 1):
        logging.info(f"Scraping reviews for {business_url}")
        reviews = scrape_business_reviews(session, business_url, from_page, to_page)
        # Add successfully scraped reviews to the master list
        if reviews:
            filtered_reviews = [review for review in reviews if review is not None]
            all_reviews.extend(filtered_reviews)
            logging.info(f"Added {len(filtered_reviews)} reviews to all_reviews. Total reviews: {len(all_reviews)}")
        else:
            logging.warning(f"No reviews found for {business_url}.")
         
        # Save data to a checkpoint file at regular intervals
        if idx % CHECKPOINT_FREQUENCY == 0:
            timestamp = dt.datetime.now().strftime("%Y%m%d_%H%M%S")
            checkpoint_filename = f'scraped_reviews_checkpoint_{timestamp}.csv'
            checkpoint_df = pd.DataFrame(all_reviews, columns=['business', 'review_title', 'date_of_experience', 'review_date', 'review_rating', 'review_text', 'page_number'])
            checkpoint_df.to_csv(checkpoint_filename, index=False)
            logging.info(f"Checkpoint: Saved data after scraping {idx} businesses to {checkpoint_filename}.")
    
    # Create a DataFrame from the aggregated reviews
    master_df = pd.DataFrame(all_reviews, columns=['business', 'review_title', 'date_of_experience', 'review_date', 'review_rating', 'review_text', 'page_number'])
    return master_df
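
To see when checkpoints actually fire, the sketch below applies the same `idx % CHECKPOINT_FREQUENCY == 0` test to a run of 26 businesses, the length of the URL list in the main block. The `checkpoint_indices` helper is hypothetical and only mirrors the modulo check.

```python
CHECKPOINT_FREQUENCY = 5

def checkpoint_indices(n_businesses, frequency=CHECKPOINT_FREQUENCY):
    """Return the 1-based business counts after which a checkpoint is written."""
    return [idx for idx in range(1, n_businesses + 1) if idx % frequency == 0]

print(checkpoint_indices(26))  # [5, 10, 15, 20, 25]
```

Each checkpoint file holds every review collected so far, so only the most recent file is needed to resume. Reviews from the 26th business are not covered by any checkpoint and reach disk only through the final CSV.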

#### Main Execution: Scraping Reviews from Multiple Businesses

In the main execution block of the script, I defined a list of URLs, each corresponding to a business page on Trustpilot. These URLs are the targets for scraping reviews. The script processes each URL in the list using the previously defined scrape_all_businesses function, which systematically extracts review data from each page and aggregates it.

After scraping the reviews from all the specified businesses, the script compiles the results into a Pandas DataFrame. This DataFrame, master_df, contains all the collected review information in a structured format, making it suitable for analysis or machine learning tasks.

Finally, I saved the DataFrame to a CSV file, scraped_reviews.csv, ensuring that the scraped data is stored in an accessible format.
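
Before launching a full run, it helps to estimate how long the deliberate pauses alone will take. The sketch below is back-of-the-envelope arithmetic, assuming the 26 URLs listed in the main block, the default to_page=500 of scrape_all_businesses, and the random.uniform(5, 20) pause per page. The `expected_sleep_seconds` helper is hypothetical, and the figure ignores request latency and retry delays.

```python
def expected_sleep_seconds(n_businesses, pages_per_business, low=5, high=20):
    """Expected total pause time, ignoring request latency and retry delays."""
    mean_pause = (low + high) / 2  # mean of random.uniform(low, high)
    return n_businesses * pages_per_business * mean_pause

seconds = expected_sleep_seconds(n_businesses=26, pages_per_business=500)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
```

Under these assumptions the pauses add up to roughly 45 hours. The page loop has no early exit when a business has fewer pages than to_page, so every page number in the range incurs the pause.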

In [9]:
# Main execution block of the web scraping script
if __name__ == "__main__":
    # List of business URLs to be scraped
    businesses_url = ["https://uk.trustpilot.com/review/www.staysure.co.uk", "https://uk.trustpilot.com/review/wise.com",
                      "https://uk.trustpilot.com/review/www.worldremit.com","https://uk.trustpilot.com/review/domesticandgeneral.com",
                      "https://uk.trustpilot.com/review/1stcentralinsurance.com","https://uk.trustpilot.com/review/purplebricks.co.uk",
                      "https://uk.trustpilot.com/review/www.rac.co.uk","https://uk.trustpilot.com/review/www.homeserve.com",
                      "https://uk.trustpilot.com/review/www.allcleartravel.co.uk","https://uk.trustpilot.com/review/www.quidco.com",
                      "https://uk.trustpilot.com/review/www.hastingsdirect.com","https://uk.trustpilot.com/review/www.topcashback.co.uk",
                      "https://uk.trustpilot.com/review/www.revolut.com","https://uk.trustpilot.com/review/www.monzo.com",
                      "https://uk.trustpilot.com/review/vitality.co.uk","https://uk.trustpilot.com/review/www.lowell.co.uk",
                      "https://uk.trustpilot.com/review/stormgain.com","https://uk.trustpilot.com/review/www.monese.com",
                      "https://uk.trustpilot.com/review/www.theaa.com","https://uk.trustpilot.com/review/www.travelex.co.uk",
                      "https://uk.trustpilot.com/review/payingtoomuch.com","https://uk.trustpilot.com/review/www.paysafecard.com",
                      "https://uk.trustpilot.com/review/premiumcredit.com","https://uk.trustpilot.com/review/acemoneytransfer.com",
                      "https://uk.trustpilot.com/review/www.esure.com","https://uk.trustpilot.com/review/amigoloans.co.uk"]
    # Scrape reviews from all listed businesses
    master_df = scrape_all_businesses(businesses_url)
    
    # Save the scraped data to a CSV file
    master_df.to_csv('scraped_reviews.csv', index=False)

2023-10-11 06:03:05,793 - INFO - Scraping reviews for https://uk.trustpilot.com/review/www.staysure.co.uk
2023-10-11 06:03:05,795 - INFO - Scraping page 1 of www.staysure
2023-10-11 06:03:23,500 - INFO - Scraping page 2 of www.staysure
2023-10-11 06:03:34,961 - INFO - Scraping page 3 of www.staysure
2023-10-11 06:03:47,210 - INFO - Scraping page 4 of www.staysure
2023-10-11 06:03:55,992 - INFO - Scraping page 5 of www.staysure
2023-10-11 06:04:04,186 - INFO - Scraping page 6 of www.staysure
2023-10-11 06:04:21,977 - INFO - Scraping page 7 of www.staysure
2023-10-11 06:04:38,772 - INFO - Scraping page 8 of www.staysure
2023-10-11 06:04:46,741 - INFO - Scraping page 9 of www.staysure
2023-10-11 06:05:01,076 - INFO - Scraping page 10 of www.staysure
2023-10-11 06:05:09,920 - INFO - Scraping page 11 of www.staysure
2023-10-11 06:05:20,521 - INFO - Scraping page 12 of www.staysure
2023-10-11 06:05:40,231 - INFO - Scraping page 13 of www.staysure
2023-10-11 06:05:50,162 - INFO - Scraping pag