# **Web Scraping and Automation**

The two scripts below automate interacting with websites and extracting data from them.

1. **TripAdvisor-Specific Script**:
   - This script focuses on scraping reviews and activity information from TripAdvisor for a given location. It navigates the website, handles dynamic content and pagination, and saves the extracted reviews into organized CSV files.
   - This script was tested previously and worked as intended, but recent changes to TripAdvisor's bot detection mean it cannot currently be re-tested.

2. **Generalised Web Scraping Framework**:
   - This script is a flexible framework designed to scrape content from any website by configuring key parameters like locators, navigation paths, and data extraction rules. It demonstrates reusable methods for automating web interactions and extracting data dynamically.
   - The generalised script has not been thoroughly tested, so treat it as a starting point. Its individual functions follow the same approach as the TripAdvisor script and offer useful insight into developing robust web scraping scripts.

Both scripts illustrate practical approaches to automating website interactions and provide useful foundations for building custom web scraping solutions tailored to different needs.


## **Ethics of Web Scraping: Things to Consider**

Web scraping can be a powerful tool for data collection, but it is essential to approach it responsibly and ethically. Below are key points to consider before starting:

### **1. Understand the Website's Terms of Service**
- Review the website's terms of service (ToS) to ensure scraping is allowed. Many websites explicitly prohibit automated data extraction.

### **2. Avoid Overloading the Server**
- Scrape responsibly by introducing delays between requests to avoid putting excessive load on the website's servers.
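
One way to enforce such delays is a small throttle that guarantees a minimum gap between consecutive requests. The sketch below is illustrative only: the `Throttle` class and its `min_interval` parameter are hypothetical names that do not appear in the scripts in this document, and a suitable interval depends on the site being scraped.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous wait(), or None before the first call

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

In a scraper, one `Throttle(min_interval=2.0)` instance could be shared, with `throttle.wait()` called immediately before each page load.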

### **3. Respect Copyright and Data Ownership**
- Data on websites may be subject to copyright or other legal protections. Ensure you have the right to use the data you collect.

### **4. Avoid Collecting Personal or Sensitive Information**
- Never scrape personal, private, or sensitive information without explicit permission, as this may violate privacy laws like GDPR or CCPA.

### **5. Identify Yourself Clearly**
- Use a user-agent string that identifies your scraper instead of disguising it as a regular browser. This promotes transparency and good practices.
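
As a rough sketch of how this could look with Selenium and Chrome, the helpers below build an identifying user-agent string and pass it to the browser through Chrome's `--user-agent` command-line switch. The function names, the project name and the contact address are all hypothetical placeholders, and the "name/version (+contact)" layout is only one common convention.

```python
def identifying_user_agent(project, contact):
    """Build a user-agent string that names the scraper and how to reach its operator."""
    return f"{project} (+{contact})"


def chrome_arguments(user_agent):
    """Return the Chrome command-line arguments that set the given user agent."""
    return [f"--user-agent={user_agent}"]


def build_driver(user_agent):
    """Create a Chrome WebDriver that announces itself with the given user agent."""
    # Imported here so the two helpers above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(user_agent):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Placeholder values: replace them with your own project name and contact details.
user_agent = identifying_user_agent("review-research-bot/0.1", "mailto:researcher@example.org")
print(chrome_arguments(user_agent))
```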

### **6. Follow Local Laws and Regulations**
- Web scraping laws vary by country. Ensure your activities comply with local legal frameworks to avoid legal repercussions.

### **7. Test on Your Own Content**
- Before scraping external websites, practice on your own or publicly available datasets to understand the techniques and minimise unintended consequences.

### **8. Monitor for Changes in Website Structure**
- Websites frequently update their structure, which can break your scraper or cause its requests to be flagged as suspicious. Check regularly that your locators still match the page.

### **9. Use APIs When Available**
- If a website offers an API, prefer using it over scraping. APIs are typically designed to provide structured access to data while respecting the provider's terms.

### **10. Seek Permission When Possible**
- Contact the website's administrator to request permission for scraping, especially if your use case involves large-scale or repetitive data collection.

By adhering to these principles, you can minimize ethical concerns and potential legal risks, promoting responsible web scraping practices.


# **TripAdvisor Scraper: What the Script Does**

This script automates the process of scraping reviews from TripAdvisor for a specified location. Below is a step-by-step explanation of its functionality:

---

## **Step 1: Setup**
1. **Imports Libraries**:
   - The script imports essential libraries:
     - `Selenium`: For web automation and interaction.
     - `BeautifulSoup`: For parsing HTML content.
     - `pandas`: For organizing and saving the scraped data.
     - `os`: For file and directory handling.
     - `random` and `time`: For introducing randomised delays.
     - `re`: For cleaning review text with regular expressions.
   
2. **Configuration**:
   - Defines the list of locations to scrape reviews for (e.g., `Sandwell`).
   - Sets up locators and parameters for interacting with TripAdvisor elements (e.g., buttons, search bar).

---

## **Step 2: Navigate to TripAdvisor**
1. **Open the Website**:
   - The script navigates to the TripAdvisor homepage (`https://www.tripadvisor.co.uk/`).

2. **Handle Cookie Consent**:
   - If a cookie consent popup appears, the script automatically clicks the "Accept" button.

3. **Search for a Location**:
   - Uses the search bar on the homepage to input the location (e.g., `Sandwell`).
   - Submits the search query.

4. **Select the Desired Location**:
   - Clicks on the first search result matching the location name, then switches to the new browser tab that the result opens in.

5. **Navigate to the "Things to Do" Section**:
   - Once on the location's main page, the script clicks the "Things to Do" link.

---

## **Step 3: Extract Activity Data**
1. **Load Activities**:
   - Collects all activity links (e.g., attractions, places to visit) listed under "Things to Do."
   
2. **Handle Pagination**:
   - If multiple pages of activities are available, the script clicks the "Next" button to navigate through all pages.
   - Continues until no more pages are available.
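
The pagination pattern can be sketched independently of Selenium: collect the items on the current page, try to advance, and stop when there is no next page. In the sketch below, `collect_all_pages`, `get_items` and `go_to_next_page` are hypothetical names; in a real scraper the two callables would wrap the element lookup and the click on the "Next" button. The `max_pages` cap and the de-duplication are added safeguards, not something the script itself does.

```python
def collect_all_pages(get_items, go_to_next_page, max_pages=100):
    """Gather items from every page until there is no next page.

    get_items:        callable returning the items on the current page.
    go_to_next_page:  callable returning True if it moved to a new page,
                      False if the current page was the last one.
    max_pages:        safety cap so a broken "Next" link cannot loop forever.
    """
    collected = []
    seen = set()
    for _ in range(max_pages):
        for item in get_items():
            if item not in seen:       # a page may list the same link more than once
                seen.add(item)
                collected.append(item)
        if not go_to_next_page():
            break
    return collected


# Stand-in for a paginated listing: two "pages" of made-up links.
pages = [["/a", "/b", "/a"], ["/c"]]
position = {"index": 0}


def get_items():
    return pages[position["index"]]


def go_to_next_page():
    if position["index"] + 1 < len(pages):
        position["index"] += 1
        return True
    return False


print(collect_all_pages(get_items, go_to_next_page))  # ['/a', '/b', '/c']
```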

---

## **Step 4: Scrape Reviews**
1. **Navigate to Each Activity**:
   - Visits each activity's page using the collected links.

2. **Extract Reviews**:
   - Scrapes reviews from the activity page using `BeautifulSoup` for parsing.
   - Handles dynamic content loading using Selenium.

3. **Handle Pagination for Reviews**:
   - If the activity has multiple review pages, the script clicks the "Next" button to navigate through all review pages.
   - Collects reviews from all pages before moving to the next activity.

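4. **Clean and Format Reviews**:
   - Removes TripAdvisor disclaimers and unnecessary content from reviews.
   - Uses regular expressions to insert missing spaces (for example between a number and a word, or after a full stop) so the text is more readable.
   - Discards reviews whose text contains a year between 2000 and 2021, so that only reviews from 2022 onwards are kept.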
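
To make the cleaning step concrete, the sketch below applies the same three regular expressions and the same disclaimer removal that the script's `process_reviews` function uses, but to a single string. The `clean_review` helper name and the sample text are made up for illustration.

```python
import re

DISCLAIMER = ("This review is the subjective opinion of a Tripadvisor member "
              "and not of Tripadvisor LLC.")


def clean_review(text):
    """Apply the script's spacing fixes and disclaimer removal to one review."""
    text = re.sub(r'(\d+)([A-Za-z]+)', r'\1 \2', text)   # "2023Great"  -> "2023 Great"
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)     # "parkLovely" -> "park Lovely"
    text = re.sub(r'\.([A-Za-z])', r'. \1', text)        # "walk.We"    -> "walk. We"
    return text.replace(DISCLAIMER, '').strip()


# Made-up text imitating how scraped review fragments run together.
sample = "Visited May 2023Great parkLovely walk.We will return." + DISCLAIMER
print(clean_review(sample))
# Visited May 2023 Great park Lovely walk. We will return.
```

The disclaimer is removed last because the third substitution is what separates it from the preceding sentence.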

---

## **Step 5: Save Data**
1. **Organize Reviews**:
   - Stores the scraped reviews for each activity in a list.

2. **Save to CSV**:
   - Writes the reviews into a CSV file.
   - Each location has its own directory, and each activity's reviews are saved in a separate file.

3. **Output Example**:
   - Creates a directory structure like:
     ```
     scraped_data/tripadvisor/
         Sandwell/
             activity1.csv
             activity2.csv
     ```
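
A minimal sketch of how a per-activity file path could be built for this layout is shown below. The `activity_csv_path` helper, the sanitising rule and the activity name are hypothetical; the script itself derives the file name from the activity URL instead.

```python
import re
from pathlib import Path


def activity_csv_path(root, location, activity_name):
    """Return root/location/<safe_activity_name>.csv without touching the disk."""
    # Collapse anything other than letters, digits, '-' and '_' into a single
    # underscore so the activity name is safe to use as a file name.
    safe_name = re.sub(r'[^A-Za-z0-9_-]+', '_', activity_name).strip('_')
    return Path(root) / location / f"{safe_name}.csv"


path = activity_csv_path("scraped_data/tripadvisor", "Sandwell", "Example Park & Gardens")
print(path.as_posix())
# scraped_data/tripadvisor/Sandwell/Example_Park_Gardens.csv
```

Before writing to such a path, `path.parent.mkdir(parents=True, exist_ok=True)` creates the location directory if it is missing.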

---

## **Step 6: Add Random Delays**
1. **Mimic Human Behavior**:
   - Introduces random delays between actions (e.g., navigating, clicking) to reduce the chance of being flagged as a bot.

---

## **Step 7: Error Handling**
1. **Retries for Actions**:
   - Implements retry logic for actions like clicking buttons or loading pages.
   - If retries fail, the script skips the problematic step and continues.

2. **Handle Failures Gracefully**:
   - Logs errors and skips activities or locations that fail after multiple retries.
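
The retry-then-skip behaviour can be sketched as follows. This is an illustration rather than the script's own code: the script's `retry_action` raises an exception after the final attempt and reports failures with `print`, whereas the hypothetical `run_with_retries` and `process_all` helpers below return a success flag and use the standard `logging` module.

```python
import logging
import time

logger = logging.getLogger("scraper")


def run_with_retries(action, retries=3, wait_time=0.0, sleep=time.sleep):
    """Call action() up to `retries` times; return (True, result) or (False, None)."""
    for attempt in range(1, retries + 1):
        try:
            return True, action()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            logger.warning("Attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt < retries:
                sleep(wait_time)
    return False, None


def process_all(items, handler, retries=3):
    """Apply handler to each item, logging and skipping the ones that keep failing."""
    results, skipped = {}, []
    for item in items:
        ok, value = run_with_retries(lambda: handler(item), retries=retries)
        if ok:
            results[item] = value
        else:
            logger.error("Skipping %r after %d failed attempts", item, retries)
            skipped.append(item)
    return results, skipped
```

Returning the skipped items alongside the results makes it easy to report, or re-run, exactly the activities that failed.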

---

## **Conclusion**
The script automates the process of navigating TripAdvisor, extracting reviews, and saving them to a structured format. It is designed to handle dynamic content, pagination, and potential errors gracefully, making it flexible and robust for scraping multiple locations.


In [None]:
import os
import re
import time
import random  # For random delays
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# List of locations to scrape; replace the empty string with location names, e.g. "Sandwell"
locations = [""]

def create_folder_if_not_exists(root_directory, folder_name):
    """
    Creates a folder in the specified root directory if it doesn't exist.

    Parameters:
        root_directory (str): Path to the root directory.
        folder_name (str): Name of the folder to create.

    Returns:
        str: Full path of the created or existing folder.
    """
    folder_path = os.path.join(root_directory, folder_name)
    os.makedirs(folder_path, exist_ok=True)  # no error if the folder already exists
    return folder_path

def initialize_driver():
    """
    Initialises a Selenium WebDriver for Chrome.

    Returns:
        WebDriver: A configured instance of the Chrome WebDriver.
    """
    return webdriver.Chrome()

def random_delay(min_time=2, max_time=5):
    """
    Introduces a random delay to mimic human browsing behavior.

    Parameters:
        min_time (int): Minimum delay time in seconds.
        max_time (int): Maximum delay time in seconds.
    """
    time.sleep(random.uniform(min_time, max_time))

def retry_action(action, retries=3, wait_time=3):
    """
    Retries a specified action if it fails, with a delay between retries.

    Parameters:
        action (function): The function to retry.
        retries (int): Maximum number of retry attempts.
        wait_time (int): Delay between retry attempts in seconds.

    Returns:
        The result of the successful action.

    Raises:
        Exception: If the action fails after the maximum number of retries.
    """
    for attempt in range(retries):
        try:
            return action()
        except Exception as e:
            print(f"Retry {attempt + 1}/{retries} failed: {e}")
            time.sleep(wait_time)  # honour the documented delay between attempts
    raise Exception(f"Action failed after {retries} retries.")

def navigate_to_things_to_do(driver, location, max_retries=2):
    """
    Navigates to the "Things to Do" page for a specified location.

    Parameters:
        driver (WebDriver): Selenium WebDriver instance.
        location (str): Name of the location to search for.
        max_retries (int): Maximum number of retries if navigation fails.

    Returns:
        str: URL of the "Things to Do" page.

    Raises:
        Exception: If navigation fails after the maximum number of retries.
    """
    for attempt in range(max_retries):
        try:
            # Open TripAdvisor homepage
            driver.get("https://www.tripadvisor.co.uk/")
            random_delay()

            # Accept cookies if prompted; carry on if no consent popup appears
            try:
                WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
                ).click()
            except Exception:
                pass  # the wait timed out, so there was nothing to accept
            random_delay()

            # Search for the location
            search_bar = retry_action(lambda: WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "input[placeholder='Where to?']"))
            ))
            search_bar.send_keys(location)
            search_bar.send_keys(Keys.RETURN)
            random_delay()

            # Click on the first search result
            location_link = retry_action(lambda: WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "//*[@id='BODY_BLOCK_JQUERY_REFLOW']/div[2]/div/div[2]/div/div/div/div/div[1]/div/div[1]/div/div[3]/div/div[1]/div/div[2]/div/div/div/div/div/div/div[2]/div/div[1]/span"))
            ))
            location_link.click()
            random_delay()

            # Switch to the new tab opened
            driver.switch_to.window(driver.window_handles[-1])
            random_delay()

            # Click the "Things to Do" button
            things_to_do_button = retry_action(lambda: WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "#lithium-root > main > div.cBOoN > span > div > div > div > div:nth-child(2) > a > span"))
            ))
            things_to_do_button.click()
            random_delay()

            return driver.current_url
        except Exception as e:
            print(f"Attempt {attempt + 1}/{max_retries} failed for location {location}: {e}")
            driver.refresh()
            random_delay()
    raise Exception(f"Failed to navigate to 'Things to Do' for {location} after {max_retries} retries.")

def scrape_reviews(driver, activity_url):
    """
    Scrapes all reviews from a specified activity page.

    Parameters:
        driver (WebDriver): Selenium WebDriver instance.
        activity_url (str): URL of the activity page.

    Returns:
        list: A list of reviews scraped from the activity page.
    """
    reviews = []
    retry_action(lambda: driver.get(activity_url))
    random_delay()

    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    reviews += process_reviews(soup)

    # Iterate through all review pages
    while True:
        try:
            next_button = driver.find_element(By.XPATH, "//a[@class='ui_button nav next primary']")
            retry_action(lambda: next_button.click())
            random_delay()
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            reviews += process_reviews(soup)
        except Exception:  # no "Next" button found (or the click failed): stop paginating
            break

    return reviews

def process_reviews(soup):
    """
    Processes reviews from the BeautifulSoup-parsed HTML content.

    Parameters:
        soup (BeautifulSoup): Parsed HTML content of the page.

    Returns:
        list: A cleaned list of reviews.
    """
    reviews = []
    review_container_classname = '_c'
    for review in soup.find_all('div', {'class': review_container_classname}):
        # Remove unnecessary elements
        class_to_drop = 'mwPje f M k'
        elements_to_drop = review.find_all(class_=class_to_drop)
        for element in elements_to_drop:
            element.extract()

        # Extract the text of this review container, after the cleanup above
        review_text = review.get_text()
        if review_text:
            reviews.append(review_text)

    # Clean up reviews with regex
    reviews = [re.sub(r'(\d+)([A-Za-z]+)', r'\1 \2', item) for item in reviews]
    reviews = [re.sub(r'([a-z])([A-Z])', r'\1 \2', item) for item in reviews]
    reviews = [re.sub(r'\.([A-Za-z])', r'. \1', item) for item in reviews]

    # Remove TripAdvisor's disclaimer text
    subset_to_remove = "This review is the subjective opinion of a Tripadvisor member and not of Tripadvisor LLC."
    reviews = [review.replace(subset_to_remove, '').strip() for review in reviews]

    # Filter out reviews before 2022
    return remove_strings_with_numbers(reviews, 2000, 2021)

def remove_strings_with_numbers(strings_list, start_range, end_range):
    """
    Removes strings containing numbers within a specified range.

    Parameters:
        strings_list (list): List of strings to filter.
        start_range (int): Start of the range.
        end_range (int): End of the range.

    Returns:
        list: Filtered list of strings.
    """
    filtered_strings = []
    for string in strings_list:
        contains_number = any(start_range <= int(word) <= end_range if word.isdigit() else False for word in string.split())
        if not contains_number:
            filtered_strings.append(string)
    return filtered_strings

def scrape_tripadvisor(locations):
    """
    Main function to scrape TripAdvisor for specified locations.

    Parameters:
        locations (list): List of location names to scrape.
    """
    root_directory = "scraped_data/tripadvisor"
    driver = initialize_driver()

    for location in locations:
        # Create a folder for the location
        location_folder = create_folder_if_not_exists(root_directory, location)
        try:
            # Navigate to "Things to Do" page
            things_to_do_url = navigate_to_things_to_do(driver, location)
        except Exception as e:
            print(f"Skipping {location} due to error: {e}")
            continue

        driver.get(things_to_do_url)
        random_delay()

        activity_urls = []
        # Collect activity URLs
        while True:
            try:
                activity_links = driver.find_elements(By.CSS_SELECTOR, "a[href*='/Attraction_Review']")
                for link in activity_links:
                    activity_urls.append(link.get_attribute('href'))

                next_button = retry_action(lambda: driver.find_element(By.CSS_SELECTOR, "a.next"))
                retry_action(lambda: next_button.click())
                random_delay()
            except Exception:  # no "Next" link found (or the click failed): stop paginating
                break

        # Scrape reviews for each activity
        for activity_url in activity_urls:
            try:
                reviews = scrape_reviews(driver, activity_url)
                reviews_df = pd.DataFrame(reviews, columns=["Review"])
                file_path = os.path.join(location_folder, f"{activity_url.split('-')[-1]}.csv")
                reviews_df.to_csv(file_path, index=False)
            except Exception as e:
                print(f"Failed to scrape reviews for {activity_url}: {e}")

    driver.quit()

# Run the script
scrape_tripadvisor(locations)


# **Generalised Web Scraper: What the Script Does**

This script is designed to scrape content from any website by using a configurable setup. Below is a step-by-step explanation of its functionality:

---

## **Step 1: Setup**
1. **Imports Libraries**:
   - The script imports essential libraries:
     - `Selenium`: For web automation and interaction.
     - `BeautifulSoup`: For parsing HTML content.
     - `pandas`: For organizing and saving the scraped data.
     - `os`: For file and directory handling.
     - `random` and `time`: For introducing randomised delays.
   
2. **Configuration**:
   - Uses a dictionary (`config`) to define the scraping details for a specific website.
   - The configuration includes:
     - Base URL of the website.
     - Element locators for cookies, search bars, target sections, and pagination.
     - Search terms or keywords (if applicable).
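
A configuration might look like the sketch below. The key names are the ones the script reads, and `base_url` and `output_name` are the two it accesses unconditionally. The `REQUIRED_KEYS` tuple and the `missing_keys` helper are hypothetical additions, and every value (including the example.com address and the selectors) is a placeholder.

```python
REQUIRED_KEYS = ("base_url", "output_name")


def missing_keys(config):
    """Return the required keys that are absent from config."""
    return [key for key in REQUIRED_KEYS if key not in config]


example_config = {
    "base_url": "https://example.com/",                        # placeholder site
    "accept_cookies": "//button[contains(text(), 'Accept')]",  # XPath, optional
    "search_box": "input[name='q']",                           # CSS selector, optional
    "search_term": "museums",                                  # optional
    "next_button": "//a[@rel='next']",                         # XPath, optional
    "content_container": "div",                                # tag name of each item
    "content_class": "review",                                 # class of each item
    "output_name": "example_reviews",
    "output_dir": "scraped_data/example",
}

print(missing_keys(example_config))                         # []
print(missing_keys({"base_url": "https://example.com/"}))   # ['output_name']
```

Checking for missing keys before launching the browser surfaces configuration mistakes early instead of part-way through a run.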

---

## **Step 2: Initialise and Configure WebDriver**
1. **Initialise WebDriver**:
   - Creates an instance of the Selenium WebDriver for browser automation.

2. **Dynamic Delays**:
   - Introduces random delays between actions to mimic human behavior and avoid detection as a bot.

3. **Retry Mechanism**:
   - Implements retry logic for critical actions like clicking buttons or loading pages.
   - Allows multiple retries with delays before failing.

---

## **Step 3: Navigate to Target Section**
1. **Open the Base URL**:
   - Opens the website's homepage or base URL as defined in the `config`.

2. **Handle Cookie Consent**:
   - Automatically clicks the "Accept" button for cookie consent if a locator is provided in the `config`.

3. **Perform Search (If Applicable)**:
   - Inputs a search term (e.g., location, product) into the search bar using the provided selector.
   - Submits the search query and waits for results to load.

4. **Navigate to Target Section**:
   - Clicks on the link or button leading to the desired section of the website (e.g., "Things to Do").
   - Uses locators provided in the `config`.

---

## **Step 4: Extract Data**
1. **Parse the Page**:
   - Extracts content from the page using `BeautifulSoup` for HTML parsing.
   - Finds elements based on the `content_container` and `content_class` specified in the `config`.

2. **Collect Data**:
   - Appends the extracted data (e.g., reviews, product details) to a list for further processing.

3. **Clean Data**:
   - Cleans the extracted content using regular expressions or other methods.

---

## **Step 5: Handle Pagination**
1. **Navigate Through Pages**:
   - Locates and clicks the "Next" button to navigate to subsequent pages.
   - Continues until no more pages are available.

2. **Extract Data from All Pages**:
   - Collects and appends data from each page before moving to the next.

---

## **Step 6: Save Data**
1. **Organize Data**:
   - Structures the collected data into a format suitable for saving (e.g., a list of dictionaries).

2. **Save to CSV**:
   - Writes the data into a CSV file using `pandas`.
   - The file name and directory are defined in the `config`.

3. **Output Example**:
   - Creates a directory structure like:
     ```
     scraped_data/
         website_name/
             target_section.csv
     ```
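
As a dependency-free illustration of this step, the sketch below writes a list of dictionaries to a CSV file in the layout shown above. It deliberately uses the standard-library `csv` module rather than `pandas`, which is what the script uses; the `save_rows` helper and the sample rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, output_dir, output_name):
    """Write a list of dictionaries to output_dir/output_name.csv and return the path."""
    folder = Path(output_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{output_name}.csv"
    # Use the keys of the first row as the header; fall back to a single column.
    fieldnames = list(rows[0].keys()) if rows else ["Content"]
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    rows = [{"Content": "First item"}, {"Content": "Second, with a comma"}]
    saved = save_rows(rows, Path(tmp) / "website_name", "target_section")
    print(saved.read_text(encoding="utf-8"))
```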

---

## **Step 7: Error Handling**
1. **Retries for Actions**:
   - Automatically retries actions like clicking buttons, loading elements, or navigating pages.

2. **Graceful Failure**:
   - Skips sections or pages that fail after multiple retries.
   - Logs errors for debugging.

---

## **Conclusion**
The generalised script provides a flexible framework for scraping data from any website. Once the `config` dictionary is filled in with site-specific details, the script can handle:
- Navigation flows.
- Content extraction.
- Pagination.
- Data storage.

It is designed to be robust and adaptable, making it suitable for a wide range of web scraping tasks.

In [None]:
import os
import time
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd


def create_folder(path):
    """
    Create a folder if it does not exist.
    """
    if not os.path.exists(path):
        os.makedirs(path)


def initialize_driver():
    """
    Initialise the Selenium WebDriver.
    """
    return webdriver.Chrome()


def random_delay(min_time=2, max_time=5):
    """
    Introduce random delay to mimic human behavior.
    """
    time.sleep(random.uniform(min_time, max_time))


def retry_action(action, retries=3, wait_time=3):
    """
    Retry an action with specified retries and delay.
    """
    for attempt in range(retries):
        try:
            return action()
        except Exception as e:
            print(f"Retry {attempt + 1}/{retries} failed: {e}")
            random_delay(wait_time, wait_time + 2)  # wait at least wait_time seconds, plus some jitter
    raise Exception(f"Action failed after {retries} retries.")


def navigate_to_section(driver, config):
    """
    Generic navigation function to handle searching and navigating to specific sections.

    Parameters:
        driver: Selenium WebDriver instance.
        config: Dictionary containing navigation details (e.g., search URL, search box locator).
    
    Returns:
        str: URL of the target section.
    """
    try:
        # Open the base URL
        driver.get(config["base_url"])
        random_delay()

        # Accept cookies if applicable
        if "accept_cookies" in config:
            retry_action(lambda: WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, config["accept_cookies"]))
            ).click())
            random_delay()

        # Perform search if applicable
        if "search_box" in config and "search_term" in config:
            search_box = retry_action(lambda: WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, config["search_box"]))
            ))
            search_box.send_keys(config["search_term"])
            search_box.send_keys(Keys.RETURN)
            random_delay()

        # Click on target section if applicable
        if "target_section" in config:
            retry_action(lambda: WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, config["target_section"]))
            ).click())
            random_delay()

        # Return the current URL
        return driver.current_url
    except Exception as e:
        print(f"Navigation failed: {e}")
        return None


def extract_data(driver, config):
    """
    Generic data extraction function for parsing and extracting content.

    Parameters:
        driver: Selenium WebDriver instance.
        config: Dictionary containing extraction details (e.g., content container).
    
    Returns:
        list: Extracted data.
    """
    data = []
    try:
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        container_selector = config.get("content_container", "div")
        content_class = config.get("content_class", None)

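        # Extract content based on the configuration. Only filter on class when
        # one is configured: {"class": None} would match just the elements that
        # have no class attribute at all.
        attrs = {"class": content_class} if content_class else {}
        for element in soup.find_all(container_selector, attrs):
            data.append(element.text.strip())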
        
        return data
    except Exception as e:
        print(f"Data extraction failed: {e}")
        return data


def handle_pagination(driver, config):
    """
    Handle pagination and collect data from multiple pages.

    Parameters:
        driver: Selenium WebDriver instance.
        config: Dictionary containing pagination details (e.g., next button locator).
    
    Returns:
        list: Data collected from all pages.
    """
    all_data = []
    while True:
        try:
            # Extract data from the current page
            page_data = extract_data(driver, config)
            all_data.extend(page_data)

            # Click on the next page button if applicable
            if "next_button" in config:
                next_button = retry_action(lambda: driver.find_element(By.XPATH, config["next_button"]))
                retry_action(lambda: next_button.click())
                random_delay()
            else:
                break
        except Exception:  # no "Next" button found (or the click failed): stop paginating
            break
    return all_data


def scrape_website(config):
    """
    Main function to scrape any website using the provided configuration.

    Parameters:
        config: Dictionary containing scraping details for the website.
    """
    driver = initialize_driver()
    output_dir = config.get("output_dir", "scraped_data")
    create_folder(output_dir)

    # Navigate to the target section
    target_url = navigate_to_section(driver, config)
    if not target_url:
        print("Failed to navigate to the target section.")
        driver.quit()
        return

    # Collect data from all pages
    driver.get(target_url)
    all_data = handle_pagination(driver, config)

    # Save the data to a CSV file
    output_file = os.path.join(output_dir, f"{config['output_name']}.csv")
    pd.DataFrame(all_data, columns=["Content"]).to_csv(output_file, index=False)
    print(f"Data saved to {output_file}")

    driver.quit()


# Example configuration for TripAdvisor
tripadvisor_config = {
    "base_url": "https://www.tripadvisor.co.uk/",
    "accept_cookies": "//button[contains(text(), 'Accept')]",
    "search_box": "input[placeholder='Where to?']",
    "search_term": "",  # fill in the term to type into the search box
    "target_section": "//*[@id='BODY_BLOCK_JQUERY_REFLOW']/div[2]/div/div[2]/div/div/div/div/div[1]/div/div[1]/div/div[3]/div/div[1]/div/div[2]/div/div/div/div/div/div/div[2]/div/div[1]/span",
    "content_container": "div",
    "content_class": "_c",
    "next_button": "//a[@class='ui_button nav next primary']",
    "output_name": "tripadvisor_sandwell",
    "output_dir": "scraped_data/tripadvisor"
}

# Run the scraper
scrape_website(tripadvisor_config)
