# Selenium Web Scraping

Selenium web scraping uses the Selenium browser automation tool to extract data from websites. Selenium lets you control a web browser programmatically, so you can interact with websites much as a human user would. While many tools and libraries are available for web scraping in Python, Selenium stands out as a robust option, especially for websites that rely heavily on JavaScript to render dynamic content.

## Overview of Selenium Web Scraping

1. **Setup**: Install Selenium and a web driver (like ChromeDriver for Chrome or GeckoDriver for Firefox).
2. **Automation**: Use Selenium to open a web browser and navigate to the target website.
3. **Interaction**: Perform actions like clicking buttons, filling out forms, and scrolling, to interact with the web page.
4. **Data Extraction**: Extract the desired data from the web page’s HTML content.
5. **Processing**: Process and store the extracted data as needed.

## Prerequisites

Before we begin, ensure that you have the following installed:

1. **Python**: Download the latest version of Python from the official website.
2. **Chrome Browser**: The examples in this tutorial use Chrome, so make sure it is installed on your machine. Selenium also supports other browsers, such as Firefox.
3. **ChromeDriver**: Selenium needs ChromeDriver to control Chrome. You can download it from the official website, making sure its version matches your installed Chrome browser. Selenium 4.6 and later can also locate or download a matching driver automatically through the bundled Selenium Manager.
4. **Selenium Library**: Install Selenium using pip with the following command:

   `pip install selenium`

You also need to have:
- Basic Python knowledge
- Basic knowledge of HTML
- An open mind to learn new things
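
Because recent Selenium releases can manage the browser driver themselves, it may help to check which version is installed before downloading ChromeDriver by hand. The sketch below uses only the standard library. The helper names `installed_version` and `supports_selenium_manager` are illustrative, and the 4.6 threshold is the release in which Selenium Manager was introduced.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def supports_selenium_manager(version_string):
    """True if the version is 4.6 or newer (the first release with Selenium Manager)."""
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) >= (4, 6)


selenium_version = installed_version("selenium")
if selenium_version is None:
    print("Selenium is not installed; run: pip install selenium")
elif supports_selenium_manager(selenium_version):
    print(f"Selenium {selenium_version}: the driver can be managed automatically")
else:
    print(f"Selenium {selenium_version}: download a matching ChromeDriver manually")
```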

## Getting Started with Selenium

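First, install the required packages. In a Jupyter notebook cell:

```python
!pip install selenium webdriver-manager
```

If the `webdriver-manager` installation is broken, uninstall and reinstall it:

```python
!pip uninstall -y webdriver-manager
!pip install webdriver-manager
```

Alternatively, create a virtual environment and install the packages from a Windows command prompt:

```shell
python -m venv myenv
myenv\Scripts\activate
pip install selenium webdriver-manager
```

Let’s start with a basic example to understand how Selenium works:
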
```python
from selenium import webdriver

# Start Chrome. With Selenium 4.6+, Selenium Manager finds or downloads
# a matching ChromeDriver automatically.
driver = webdriver.Chrome()

# Open a website
driver.get('https://neo4j.com/docs/')

# Print the browser's user-agent string
print("User-Agent:", driver.execute_script("return navigator.userAgent"))

# Extract the page title
page_title = driver.title
print("Page Title:", page_title)

# Close the browser
driver.quit()
```


Output:

```text
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36
Page Title: Neo4j documentation - Neo4j Documentation
```


In the above code, we imported the `webdriver` module from Selenium, initialized the ChromeDriver, opened a website (https://neo4j.com/docs/), extracted its page title, and then closed the browser using the `quit()` method.
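
If an exception is raised between opening the browser and calling `quit()`, the browser window can be left running. One possible way to guarantee cleanup is a small context manager, sketched below. `browser_session` is a hypothetical helper name, not part of Selenium, and it accepts any zero-argument callable that returns a driver.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the `with` block, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the body of the `with` block raises
        driver.quit()
```

With Selenium installed, this could be used as `with browser_session(webdriver.Chrome) as driver:` followed by the same `driver.get(...)` and `driver.title` calls as in the example above.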

# Web Scraping with Selenium

Now that we have a basic understanding of Selenium, let’s explore more advanced web scraping concepts. Many websites load data dynamically using JavaScript. Libraries like `requests` and `BeautifulSoup` only fetch and parse the HTML the server returns and do not execute JavaScript, so they miss content that is rendered in the browser. Because Selenium drives a real browser, it can access JavaScript-rendered content, which makes it a powerful choice for such scenarios.

## Locating Elements

To scrape data effectively, we need to locate elements on the web page. The `find_element()` method takes a locator strategy from the `By` class (imported with `from selenium.webdriver.common.by import By`) and a value to match:

- **By ID**: `find_element(By.ID, "your_id")`
- **By Name**: `find_element(By.NAME, "your_name")`
- **By Class Name**: `find_element(By.CLASS_NAME, "your_class_name")`
- **By CSS Selector**: `find_element(By.CSS_SELECTOR, "your_css_selector")`
- **By XPath**: `find_element(By.XPATH, "your_xpath")`
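
Note that `find_element()` raises `NoSuchElementException` when nothing matches, which is inconvenient for optional fields. One possible workaround, sketched below, is to call `find_elements()` (plural), which returns an empty list instead. The helper name `first_text` is hypothetical; `root` can be the driver or any element, and `by` and `value` are the same two arguments used in the list above.

```python
def first_text(root, by, value, default=None):
    """Return the stripped text of the first matching element, or `default`.

    `root` is anything with a Selenium-style find_elements(by, value) method,
    such as a WebDriver or a WebElement.
    """
    matches = root.find_elements(by, value)
    if not matches:
        return default
    return matches[0].text.strip()
```

For instance, `first_text(driver, By.ID, "content", default="")` would return the text of the element with `id="content"`, or an empty string if no such element exists.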

For example, to extract the content of a paragraph with `id="content"`, we can use:

```python
content = driver.find_element(By.ID, "content").text
print(content)
```


## Handling Dynamic Content

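Dynamic websites may take some time to load content using JavaScript. When scraping dynamic content, we should wait for the target elements to be present (or visible) before extracting data. We can achieve this using **Explicit Waits** provided by Selenium, which poll for a condition until it holds or a timeout expires. The example below waits up to 10 seconds for a results table to appear, then saves its rows to a CSV file.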


```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv

# List of URLs to scrape
BASE_URLS = [
    'http://www.hubertiming.com/results/2017GPTR10K'
]
race_results = []

# Start Chrome
driver = webdriver.Chrome()

# Loop through the URLs listed above
for url in BASE_URLS:
    driver.get(url)

    try:
        # Wait up to 10 seconds for the results table to appear in the DOM
        results_table = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'individualResults'))
        )

        # Find the table rows by tag name and loop through them
        for row in results_table.find_elements(By.TAG_NAME, 'tr'):
            cols = row.find_elements(By.TAG_NAME, 'td')
            if len(cols) == 9:  # Keep only rows that have all 9 columns
                race_results.append(
                    (cols[0].text.strip(),  # Place
                     cols[1].text.strip(),  # Bib
                     cols[2].text.strip(),  # Name
                     cols[3].text.strip(),  # Gender
                     cols[4].text.strip(),  # City
                     cols[5].text.strip(),  # State
                     cols[6].text.strip(),  # Time
                     cols[7].text.strip(),  # Gun Time
                     cols[8].text.strip())  # Team
                )
    except Exception as e:
        print(f"Error while scraping {url}: {e}")
        continue  # Skip to the next URL if there's an error

# Close the browser
driver.quit()

# Define CSV file name
csv_file = 'race_results.csv'

# Save the scraped race results to a CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write header
    writer.writerow(['Place', 'Bib', 'Name', 'Gender', 'City', 'State', 'Time', 'Gun Time', 'Team'])

    # Write data rows
    writer.writerows(race_results)

print(f"Race results saved to {csv_file}.")
```
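
Besides the built-in `expected_conditions`, `WebDriverWait.until()` accepts any callable that takes the driver and returns a truthy value once the page is ready. A minimal sketch of a custom condition follows. The factory name `at_least_n_elements` is hypothetical; a condition like this can be useful when a table exists immediately but its rows arrive later.

```python
def at_least_n_elements(by, value, n):
    """Build a wait condition that passes once at least `n` elements match."""
    def condition(driver):
        elements = driver.find_elements(by, value)
        # A falsy return tells WebDriverWait to keep polling;
        # a truthy return value is handed back by until().
        return elements if len(elements) >= n else False
    return condition
```

With the imports from the example above, `rows = WebDriverWait(driver, 10).until(at_least_n_elements(By.TAG_NAME, 'tr', 5))` would wait up to 10 seconds for five table rows and then return them.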


## Handling User Interactions

Some websites require user interactions (e.g., clicking buttons, filling forms) to load data dynamically. Selenium can simulate these interactions using methods like `click()` and `send_keys()`. The snippet below assumes an open `driver` and a page whose search field and search button have the IDs `search_input` and `search_button`:

```python
from selenium.webdriver.common.by import By

# Locate the search input field by ID
search_input = driver.find_element(By.ID, 'search_input')

# Simulate typing 'Web Scraping' into the input field
search_input.send_keys('Web Scraping')

# Locate the search button by ID
search_button = driver.find_element(By.ID, 'search_button')

# Simulate clicking the search button
search_button.click()
```
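
Scrolling is another common interaction, since many pages load more items only as the user scrolls down. A hedged sketch of one approach is to run a JavaScript scroll through `driver.execute_script()` until the page height stops growing. The helper name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are illustrative choices, not part of the Selenium API.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed (at most `max_rounds`).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # Give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # Nothing new was loaded, so we are at the bottom
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap keeps the loop from running forever on pages that never stop loading new content.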

# Real-World Use Case: Scraping Product Data from an E-commerce Website

To showcase Selenium’s capabilities, let’s consider a more realistic use case: scraping product data from an e-commerce website. The general workflow is:

1. Open an e-commerce website (e.g., `https://www.example-ecommerce.com`).
2. Search for or navigate to a specific product category (e.g., "Laptops").
3. Extract and print product names, prices, and ratings.

## Executable Code

The commented Python code below applies this workflow to Amazon’s "Best Sellers in Electronics" page, navigating directly to the category page instead of searching. The class names in the XPath expressions depend on the page’s markup and may need updating if the page has changed.


```python
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# Automatically install ChromeDriver and start Chrome.
# Selenium 4 expects the driver path to be wrapped in a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open Amazon's Best Sellers in Electronics page
driver.get('https://www.amazon.com/gp/bestsellers/electronics/')

# Wait up to 10 seconds for the product elements to be present in the DOM
product_list = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//div[@class="zg-item-immersion"]'))
)

# Initialize an empty list to store product data
product_data = []

# Loop through each product element and extract data
for product in product_list:
    # Extract product name
    product_name = product.find_element(By.XPATH, './/div[@class="p13n-sc-truncate p13n-sc-line-clamp-2"]').text.strip()

    # Extract product price (if available)
    try:
        product_price = product.find_element(By.XPATH, './/span[@class="p13n-sc-price"]').text.strip()
    except NoSuchElementException:
        product_price = "Price not available"

    # Extract product rating (if available)
    try:
        product_rating = product.find_element(By.XPATH, './/span[@class="a-icon-alt"]').get_attribute("innerHTML")
    except NoSuchElementException:
        product_rating = "Rating not available"

    # Append the product data to the list
    product_data.append({
        'Product Name': product_name,
        'Price': product_price,
        'Rating': product_rating
    })

# Print the scraped product data
print("Scraped Product Data:")
for idx, product in enumerate(product_data, start=1):
    print(f"{idx}. {product['Product Name']} - Price: {product['Price']}, Rating: {product['Rating']}")

# Close the browser
driver.quit()
```


# Explanation of the Code:

1. **Import necessary modules** from Selenium to interact with the web page and locate elements.

2. **Initialize the ChromeDriver** and open Amazon’s “Best Sellers in Electronics” page.

3. **Use an 'Explicit Wait'** to wait until the product elements are present in the DOM. This ensures that the product list has loaded and the elements are ready to be scraped.

4. **Initialize an empty list, `product_data`**, to store the scraped product data.

5. **Loop through each product element** and extract the product name, price, and rating (if available). We use `try-except` blocks to handle cases where the price or rating information is not available for a particular product.

6. **Append the extracted product data** to the `product_data` list as a dictionary with keys ‘Product Name’, ‘Price’, and ‘Rating’.

7. **After scraping all products**, print the scraped product data in a user-friendly format.

8. Finally, **close the browser using `driver.quit()`**.
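
The price and rating are scraped as display strings, so a post-processing step could convert them to numbers before analysis. The sketch below assumes US-style formats such as `$1,299.99` and `4.5 out of 5 stars`; the function names `parse_price` and `parse_rating` are hypothetical, and other locales or layouts would need different patterns.

```python
import re


def parse_price(text):
    """Convert a price string like '$1,299.99' to a float, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def parse_rating(text):
    """Convert a rating string like '4.5 out of 5 stars' to a float, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+5", text)
    if match is None:
        return None
    return float(match.group(1))
```

Both functions return `None` for the fallback strings used in the scraper (`"Price not available"` and `"Rating not available"`), so missing values stay distinguishable from zero.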


## Dealing with Anti-Scraping Measures

Many websites implement anti-scraping measures to prevent automated data extraction. Techniques like IP blocking, CAPTCHAs, and user-agent detection can hinder scraping efforts. Understanding these measures and handling them responsibly, for example by respecting the site’s terms of service and `robots.txt` and by keeping request rates low, is crucial for successful web scraping.
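
One responsible strategy is to slow down and back off when a request fails, rather than retrying immediately. The sketch below wraps any callable (for example `lambda: driver.get(url)`) with exponential backoff and random jitter. `retry_with_backoff` and its parameters are hypothetical names, and the helper does not solve CAPTCHAs or work around IP blocks.

```python
import random
import time


def retry_with_backoff(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds, waiting longer after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            # The delay doubles each time (1x, 2x, 4x, ...) plus up to 1s of jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

The `sleep` parameter defaults to `time.sleep` and exists mainly so the waiting behaviour can be replaced in tests.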

## Data Storage and Management

Once data is scraped, it needs to be stored and managed efficiently. Choosing the right data storage format (e.g., CSV, JSON, database) and organizing scraped data will facilitate further analysis and processing.
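
As a rough illustration of the three formats mentioned, the sketch below writes the same records to CSV, JSON, and a SQLite database using only the standard library. The function names and the `products` table are made up for the example; the dictionary keys mirror those used in `product_data` earlier.

```python
import csv
import json
import sqlite3

FIELDS = ["Product Name", "Price", "Rating"]


def save_csv(records, path):
    """Write one row per record; easy to open in a spreadsheet."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


def save_json(records, path):
    """Keep the list-of-dictionaries structure as-is."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


def save_sqlite(records, path):
    """Append records to a single-file database that can be queried with SQL."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, rating TEXT)"
        )
        conn.executemany(
            "INSERT INTO products VALUES (?, ?, ?)",
            [tuple(record[field] for field in FIELDS) for record in records],
        )
        conn.commit()
    finally:
        conn.close()
```

For example, `save_csv(product_data, "products.csv")` would store the scraped products; CSV and JSON suit one-off exports, while SQLite may be a better fit when results from repeated runs need to accumulate.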
