# TrustPilot Scraper | Development

This notebook develops and explores the TrustPilot web scraper, based on the specific parameters laid out by Davies Hickman. Further analysis of the data is carried out in a separate notebook.

*The key objective for the project is to identify companies on TrustPilot that are receiving negative reviews because of their poor customer service using WhatsApp, Messaging, SMS, Text, Webchat, etc.*

Target Site: https://uk.trustpilot.com

#### Categories:
 - Money & Insurance
 - Travel & Vacation
 - Food, Beverages & Tobacco
 - Restaurants & Bars
 - Events & Entertainment
 - Beauty & Well-being
 - Shopping & Fashion
 - Home & Garden
 - Vehicle & Transportation
 - Electronics & Technology
 - Animals & Pets
 - Business services (for logistics)

#### Challenges:
 - TrustPilot's rating filter only offers Any, 3+, 4+ or 4.5+, so we can't filter directly to companies below 3 stars. This means we have to go through every page of a TrustPilot category and pick out the companies with 3 stars or less ourselves.
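
Because the site cannot do this filtering for us, it has to happen after scraping. A minimal sketch of that step, using plain Python and made-up company records (the names and numbers below are hypothetical, and `filter_companies` is an illustrative helper rather than part of the scraper):

```python
# Hypothetical records standing in for one scraped category page.
companies = [
    {"company_name": "Alpha Couriers", "company_score": 1.8, "num_reviews": 2400},
    {"company_name": "Beta Parcels", "company_score": 4.6, "num_reviews": 15000},
    {"company_name": "Gamma Freight", "company_score": 3.0, "num_reviews": 1000},
    {"company_name": "Delta Post", "company_score": 2.1, "num_reviews": 40},
]


def filter_companies(companies, max_score, min_num_reviews):
    """Keep companies rated at or below max_score with enough reviews."""
    return [
        company
        for company in companies
        if company["company_score"] <= max_score
        and company["num_reviews"] >= min_num_reviews
    ]


low_rated = filter_companies(companies, max_score=3.0, min_num_reviews=1000)
print([company["company_name"] for company in low_rated])
```

Both thresholds are inclusive here, which mirrors the "3 stars or less" wording above.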

## 0.0 Import Libraries

In [1]:
# Data manipulation & stats
import pandas as pd
import numpy as np
import re
from scipy.stats import truncnorm

# Data visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Standard libraries
import os
import datetime
import time
from tqdm import tqdm
import random

# Web scraping
import requests
from bs4 import BeautifulSoup

## 1.0 Setup Config

### 1.01 Local paths

In [2]:
notebooks_dir_path = os.getcwd()
repo_dir_path = notebooks_dir_path.replace("/notebooks", "")
data_dir_path = os.path.join(repo_dir_path, "data")

### 1.02 URLs

In [4]:
# Category urls
trustpilot_base_url = "https://uk.trustpilot.com"
categories_base_url = f"{trustpilot_base_url}/categories"  # URLs always use "/", unlike os.path.join on Windows

# Url params
# Page query
num_categories_pages = 500
categories_pages = range(1, num_categories_pages + 1)
num_categories_pages_base_query = f"?page={categories_pages[0]}"

# Other queries
args_url = ["sort=latest_review"] # Sort by latest reviews
args_url_query = f"&{'&'.join(args_url)}"

print(
    f"Example complete URL: {categories_base_url}/money_insurance{num_categories_pages_base_query}{args_url_query}"
)

Example complete URL: https://uk.trustpilot.com/categories/money_insurance?page=1&sort=latest_review
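
As an alternative to concatenating the query string by hand, the same URL can be assembled with the standard library's `urllib.parse.urlencode`. This is only a sketch: `build_category_url` and `BASE_URL` are illustrative names, not part of the notebook's config.

```python
from urllib.parse import urlencode

BASE_URL = "https://uk.trustpilot.com"


def build_category_url(category_suffix: str, page: int, sort: str = "latest_review") -> str:
    """Build a category listing URL with page and sort query parameters."""
    query = urlencode({"page": page, "sort": sort})
    return f"{BASE_URL}/categories/{category_suffix}?{query}"


print(build_category_url("money_insurance", 1))
```

`urlencode` also escapes any characters that are not URL-safe, which matters if more query parameters are added later.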


## 2.0 Scraper Development

In [5]:
def random_wait_time(min_seconds, max_seconds, mean, std):
    """
    Sleeps for a random number of seconds drawn from a normal distribution
    (mean, std) truncated to the range [min_seconds, max_seconds].
    """
    # truncnorm expects its bounds in units of standard deviations from the mean
    a = (min_seconds - mean) / std
    b = (max_seconds - mean) / std
    wait_seconds = truncnorm.rvs(a, b, loc=mean, scale=std)
    # Clamp as a safeguard so the wait never leaves the requested range
    wait_seconds = max(min_seconds, min(max_seconds, wait_seconds))
    time.sleep(wait_seconds)


def scrape_trustpilot(trustpilot_base_url: str, 
                      categories_base_url: str, 
                      category_suffix: str, 
                      num_categories_pages: int, 
                      args_url_query: str,
                      max_score: float,
                      min_num_reviews: int,
                      max_num_review_pages: int) -> pd.DataFrame:


    # Create the category and tmp directories if they don't exist
    data_category_dir_path = os.path.join(data_dir_path, category_suffix)
    data_tmp_dir_path = os.path.join(data_category_dir_path, "tmp")
    data_tmp_companies_dir_path = os.path.join(data_tmp_dir_path, "companies")
    data_tmp_reviews_dir_path = os.path.join(data_tmp_dir_path, "reviews")
    
    # makedirs also creates the parent directories, so the two leaf dirs are enough
    os.makedirs(data_tmp_companies_dir_path, exist_ok=True)
    os.makedirs(data_tmp_reviews_dir_path, exist_ok=True)

    category_target_url = f"{categories_base_url}/{category_suffix}"

    # Get categories page list
    categories_pages = range(1, min(num_categories_pages, 500) + 1) # Always a max of 500

    # Initialise dataframe of companies
    companies_df = pd.DataFrame()

    # Loop through category/company pages on TrustPilot
    print(f"Collecting company data for {category_suffix}...")
    for categories_page in tqdm(categories_pages):
        # Get page query
        page_query = f"?page={categories_page}"
        # Create complete url to request
        category_target_complete_url = f"{category_target_url}{page_query}{args_url_query}"
        response_companies = requests.get(category_target_complete_url)

        # If response code is OK, scrape html
        if response_companies.status_code == 200:
            soup_companies = BeautifulSoup(response_companies.text, 'html.parser')

            # Extract company names
            company_names = soup_companies.find_all(
                "p", 
                class_="typography_heading-xs__jSwUz typography_appearance-default__AAY17 styles_displayName__GOhL2"
            )
            company_names = [element.text for element in company_names]
            # Extract links to reviews
            review_links = soup_companies.find_all(
                "a", 
                class_="link_internal__7XN06 link_wrapper__5ZJEx styles_linkWrapper__UWs5j"
            )
            # hrefs are site-relative (they start with "/"), so prepend the base url
            review_links = [f"{trustpilot_base_url}{element.get('href')}" for element in review_links]
            # Extract ratings
            company_scores_num_reviews = soup_companies.find_all(
                "p", 
                class_="typography_body-m__xgxZ_ typography_appearance-subtle__8_H2l styles_ratingText__yQ5S7"
            )
            company_scores = [
                float(re.search(r"(\d+\.\d+)", element.text.split("|")[0]).group(1)) for element in company_scores_num_reviews
            ]
            # Extract num reviews
            num_reviews = [int("".join(filter(str.isdigit, element.text.split("|")[1]))) for element in company_scores_num_reviews]
    
            # Create dataframe of company names and links for this batch
            try:
                companies_df_temp = pd.DataFrame(
                    {
                        "company_name": company_names,
                        "review_link": review_links,
                        "company_score": company_scores,
                        "num_reviews": num_reviews,
                    }
                )
                companies_df_temp["categories_page"] = categories_page
                # Write unfiltered tmp file to disk
                companies_df_temp.to_csv(
                    os.path.join(data_tmp_companies_dir_path, f"companies_df_{category_suffix}_{categories_page}_tmp.csv"), 
                    index=False
                )
                # Filter to parameters set
                companies_df_temp = companies_df_temp.loc[
                    (companies_df_temp.company_score <= max_score) &
                    (companies_df_temp.num_reviews >= min_num_reviews)
                ]
                # Concatenate sub df with master df
                companies_df = pd.concat(
                    [companies_df, companies_df_temp],
                    ignore_index=True
                )
                # Write master df to disk - overwrite each iteration
                companies_df.to_csv(
                    os.path.join(data_category_dir_path, f"companies_df_{category_suffix}_raw.csv"),
                    index=False
                )
            except Exception as e:
                print(f"Error creating companies_df_temp for page {page_query}: {e}")
                pass

        else:
            # Keep the companies collected so far: resetting companies_df here would discard every earlier page
            print(f"Failed to fetch response_companies page. Status code: {response_companies.status_code}")

        random_wait_time(min_seconds=2, max_seconds=12, mean=3.5, std=0.75) # Random wait between looping through category companies
        
    # Add dummy columns for address
    companies_df["address"] = None
    companies_df["is_uk"] = None
    # Rewrite final companies df to disk
    companies_df = companies_df.drop_duplicates(
        subset=["company_name", "review_link", "company_score"]
    )
    companies_df.to_csv(
        os.path.join(data_category_dir_path, f"companies_df_{category_suffix}_raw.csv"),
        index=False
    )
    
    print("Company data collected.")
    print(f"Collecting review data for {category_suffix}")
    companies_df_full = pd.DataFrame()

    # Loop through scraped companies to get reviews
    for idx, row in tqdm(companies_df.iterrows(), total=companies_df.shape[0]):
        company_name = row["company_name"]
        company_reviews_base_url = row["review_link"]
        
        response_reviews = requests.get(company_reviews_base_url)
    
        # If response code is OK, scrape html
        if response_reviews.status_code == 200:
            soup_reviews = BeautifulSoup(response_reviews.text, "html.parser")
        
            try:
                # Get company address information
                address_list = soup_reviews.find(
                    "ul", class_="typography_body-m__xgxZ_ typography_appearance-default__AAY17 styles_contactInfoAddressList__RxiJI"
                )
                address_list = [element.text.lower() for element in address_list]
                address_concat = ", ".join(address_list) # Concatenate to one string
                # Update df
                companies_df.loc[idx, "address"] = address_concat
                # Match "uk" as a whole word only, so e.g. "ukraine" isn't flagged as UK
                if "united kingdom" in address_concat or re.search(r"\buk\b", address_concat):
                    companies_df.loc[idx, "is_uk"] = True
                else:
                    companies_df.loc[idx, "is_uk"] = False
            except Exception as e:
                print(f"Failed to get address data for {company_name}: {e}")
                pass

            # Extract number of review pages
            review_pages = soup_reviews.find(
                "nav", class_="pagination_pagination___F1qS"
            )
            try:
                review_pages = [element.text for element in review_pages]
                num_review_pages = int(review_pages[len(review_pages) - 2])
            except (TypeError, ValueError, IndexError):
                # No pagination nav (or an unexpected layout): assume a single page
                num_review_pages = 1
            # Cap the number of review pages to loop through
            num_review_pages = min(num_review_pages, max_num_review_pages)
            review_pages = range(1, num_review_pages + 1)

            # Initialise dataframe of reviews
            reviews_df = pd.DataFrame()

            # Loop through reviews and review pages
            for review_page in review_pages:
                company_reviews_page_url = f"{company_reviews_base_url}?page={review_page}"
                
                response_reviews_page = requests.get(company_reviews_page_url)
            
                # If response code is OK, scrape html
                if response_reviews_page.status_code == 200:
                    soup_reviews_page = BeautifulSoup(response_reviews_page.text, "html.parser")

                    try:
                        # Get review/experience dates
                        reviews_dates = soup_reviews_page.find_all(
                            "p", class_="typography_body-m__xgxZ_ typography_appearance-default__AAY17"
                        )    
                        # Clean dates
                        dates = []
                        for date_element in reviews_dates:
                            date_text = date_element.get_text(strip=True)
                            date_value = date_text.split(":")[-1].strip()
                            # Convert the date to a different format
                            try:
                                # Assuming the date is in the format "21 November 2023"
                                datetime_object = datetime.datetime.strptime(date_value, "%d %B %Y")
                                formatted_date = datetime_object.strftime("%Y-%m-%d")
                                dates.append(formatted_date)
                            except Exception as e:
                                pass # Pass on non-date elements

                        # Get review score
                        reviews_scores = soup_reviews_page.find_all("div", class_="styles_reviewHeader__iU9Px")
                        # Clean scores
                        scores = []
                        for score in reviews_scores:
                            try:
                                score = score["data-service-review-rating"]
                                score = int(score)
                                scores.append(score)
                            except Exception as e:
                                pass

                        # Get review text
                        reviews_reviews = soup_reviews_page.find_all(
                            'p', 
                            class_='typography_body-l__KUYFJ typography_appearance-default__AAY17 typography_color-black__5LYEn'
                        )
                        reviews = [element.text for element in reviews_reviews]

                        try:
                            # Create dataframe of reviews for this batch
                            reviews_df_temp = pd.DataFrame(
                                {
                                    "date": dates,
                                    "score": scores,
                                    "review": reviews,
                                }
                            )
                            reviews_df_temp["reviews_page"] = review_page
                            reviews_df_temp["company_name"] = company_name                        
                            # Write unfiltered tmp file to disk
                            reviews_df_temp.to_csv(
                                os.path.join(
                                    data_tmp_reviews_dir_path, f"reviews_df_{category_suffix}_{company_name}_{review_page}_tmp.csv"
                                ), 
                                index=False
                            )
                            # Concatenate sub df with master df
                            reviews_df = pd.concat(
                                [reviews_df, reviews_df_temp],
                                ignore_index=True
                            )
                            # Write master df to disk - overwrite each iteration
                            reviews_df.to_csv(
                                os.path.join(data_category_dir_path, f"reviews_df_{category_suffix}_{company_name}.csv"),
                                index=False
                            )
                        except Exception as e:
                            print(f"Error creating reviews_df_temp for {company_name}/{review_page}: {e}")
                            pass
                            
                    except Exception as e:
                        print(f"Failed to extract review data for {company_name}/{review_page}: {e}")
                        pass
                        
                random_wait_time(min_seconds=2, max_seconds=12, mean=3.5, std=0.75) # Random wait through individual review pages

            # Merge reviews_df with companies_df
            companies_df_full = pd.concat(
                [
                    companies_df_full,
                    pd.merge(companies_df, reviews_df, on="company_name", how="inner")
                ],
                ignore_index=True
            )
            companies_df_full.to_csv(os.path.join(data_category_dir_path, f"companies_df_{category_suffix}_full.csv"), index=False)
        else:
            print(f"Failed to fetch response_reviews page. Status code: {response_reviews.status_code}")
            pass
        
        random_wait_time(min_seconds=2, max_seconds=12, mean=3.5, std=0.75) # Random wait through getting company review pages
        

    return companies_df_full
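
The score and review-count extraction in `scrape_trustpilot` runs outside any `try` block and expects a decimal score, so a rating text without one would raise before the page's dataframe is built. A more forgiving parser might look like the sketch below. The example strings are assumptions about the rating text (inferred from the `split("|")` and regex in the function, not checked against the live site), and `parse_rating_text` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple


def parse_rating_text(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Parse text such as 'TrustScore 2.4|1,234 reviews' into (2.4, 1234).

    Returns None for any part that cannot be found instead of raising.
    """
    score_part, _, reviews_part = text.partition("|")
    # Accept both "2.4" and a whole-number score such as "5"
    score_match = re.search(r"\d+(?:\.\d+)?", score_part)
    score = float(score_match.group()) if score_match else None
    # Drop thousands separators and any words around the review count
    digits = "".join(ch for ch in reviews_part if ch.isdigit())
    num_reviews = int(digits) if digits else None
    return score, num_reviews


print(parse_rating_text("TrustScore 2.4|1,234 reviews"))
print(parse_rating_text("TrustScore 5|87 reviews"))
print(parse_rating_text("No reviews yet"))
```

Returning `None` keeps one entry per company, so the score and review lists stay the same length as the name list.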

In [8]:
# Run scraper for all categories
category_suffix_list = [
    #"money_insurance",
    #"travel_vacation",
    #"food_beverages_tobacco",
    #"restaurants_bars",
    #"events_entertainment",
    #"beauty_wellbeing",
    #"shopping_fashion",
    #"home_garden",
    #"vehicles_transportation",
    #"electronics_technology",
    #"animals_pets",
    "shipping_logistics"
]

for category_suffix in category_suffix_list:
    companies_df_full = scrape_trustpilot(
        trustpilot_base_url=trustpilot_base_url, 
        categories_base_url=categories_base_url, 
        category_suffix=category_suffix,
        num_categories_pages=100, 
        args_url_query=args_url_query,
        max_score=3.0,
        min_num_reviews=1000,
        max_num_review_pages=50
    )

Collecting company data for shipping_logistics...


Error creating companies_df_temp for page ?page=40: All arrays must be of the same length
Error creating companies_df_temp for page ?page=74: All arrays must be of the same length
Error creating companies_df_temp for page ?page=75: All arrays must be of the same length
Error creating companies_df_temp for page ?page=76: All arrays must be of the same length
Error creating companies_df_temp for page ?page=77: All arrays must be of the same length
Error creating companies_df_temp for page ?page=78: All arrays must be of the same length
Error creating companies_df_temp for page ?page=79: All arrays must be of the same length
Error creating companies_df_temp for page ?page=80: All arrays must be of the same length
Error creating companies_df_temp for page ?page=81: All arrays must be of the same length
Error creating companies_df_temp for page ?page=82: All arrays must be of the same length
Error creating companies_df_temp for page ?page=83: All arrays must be of the same length
Error creating companies_df_temp for page ?page=84: All arrays must be of the same length
Error creating companies_df_temp for page ?page=85: All arrays must be of the same length
Error creating companies_df_temp for page ?page=86: All arrays must be of the same length
Error creating companies_df_temp for page ?page=87: All arrays must be of the same length
Error creating companies_df_temp for page ?page=88: All arrays must be of the same length
Error creating companies_df_temp for page ?page=89: All arrays must be of the same length
Error creating companies_df_temp for page ?page=90: All arrays must be of the same length
Error creating companies_df_temp for page ?page=91: All arrays must be of the same length
Error creating companies_df_temp for page ?page=92: All arrays must be of the same length
Error creating companies_df_temp for page ?page=93: All arrays must be of the same length
Error creating companies_df_temp for page ?page=94: All arrays must be of the same length
Error creating companies_df_temp for page ?page=95: All arrays must be of the same length
Error creating companies_df_temp for page ?page=96: All arrays must be of the same length
Failed to fetch response_companies page. Status code: 404
Failed to fetch response_companies page. Status code: 404
Failed to fetch response_companies page. Status code: 404
Failed to fetch response_companies page. Status code: 404


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [06:29<00:00,  3.89s/it]


Company data collected.
Collecting review data for shipping_logistics


0it [00:00, ?it/s]
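
"All arrays must be of the same length" is the error pandas raises when the lists passed to `pd.DataFrame` differ in length. On these pages that probably means one of the selectors matched fewer elements than the others, though that is an assumption that the log alone does not confirm. A small diagnostic such as the sketch below would show which list is short (`check_equal_lengths` is a hypothetical helper and the sample values are made up):

```python
def check_equal_lengths(**columns):
    """Return the length of each named list, raising if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up page where one company has no rating text
company_names = ["Alpha Couriers", "Beta Parcels", "Gamma Freight"]
company_scores = [1.8, 4.6]

try:
    check_equal_lengths(company_name=company_names, company_score=company_scores)
except ValueError as e:
    print(f"Skipping page: {e}")
```

Calling this before building `companies_df_temp` would turn the generic pandas message into one that names the mismatched column. Extracting all fields from each company card in turn, rather than with page-wide `find_all` calls, is another option that avoids the mismatch altogether.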


In [None]:
# TODO
# Add more try/excepts
# shipping_logistics is not scraping (no companies reach the review stage above) - check the category suffix
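
One possible shape for the extra error handling is sketched below: each request is wrapped in a `try`/`except`, other failures are skipped, and paging stops at the first 404 instead of requesting every remaining page. The `fetch` callable is an assumption that stands in for `requests.get` so the example runs without network access; `collect_pages` and `fake_fetch` are hypothetical names.

```python
from typing import Callable, List, Tuple


def collect_pages(fetch: Callable[[int], Tuple[int, str]], max_pages: int) -> List[str]:
    """Collect page bodies for pages 1..max_pages, stopping at the first 404."""
    bodies = []
    for page in range(1, max_pages + 1):
        try:
            status_code, body = fetch(page)
        except Exception as e:  # e.g. a network error from the real HTTP client
            print(f"Page {page}: request failed ({e}); skipping")
            continue
        if status_code == 404:
            print(f"Page {page}: 404, assuming there are no further pages")
            break
        if status_code != 200:
            print(f"Page {page}: status {status_code}; skipping")
            continue
        bodies.append(body)
    return bodies


def fake_fetch(page: int) -> Tuple[int, str]:
    """Simulated site: page 2 times out, pages after 3 do not exist."""
    if page == 2:
        raise ConnectionError("simulated timeout")
    if page > 3:
        return 404, ""
    return 200, f"<html>page {page}</html>"


pages = collect_pages(fake_fetch, max_pages=10)
print(pages)
```

Results gathered before a failure are kept, so a missing page near the end of a category does not discard the earlier pages.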