# Selenium Web Scraping

Selenium web scraping uses the Selenium browser automation tool to extract data from websites. Selenium lets you control a web browser programmatically, so a script can interact with a site much as a human user would. Among the Python tools available for web scraping, Selenium is a robust option for websites that rely heavily on JavaScript to render content dynamically.

## Overview of Selenium Web Scraping

1. **Setup**: Install Selenium and a web driver (like ChromeDriver for Chrome or GeckoDriver for Firefox).
2. **Automation**: Use Selenium to open a web browser and navigate to the target website.
3. **Interaction**: Perform actions like clicking buttons, filling out forms, and scrolling, to interact with the web page.
4. **Data Extraction**: Extract the desired data from the web page’s HTML content.
5. **Processing**: Process and store the extracted data as needed.

## Prerequisites

Before we begin, ensure that you have the following installed:

1. **Python**: Download the latest version of Python from the official website.
2. **Chrome Browser**: The examples in this tutorial use Chrome, so make sure it is installed on your machine. Selenium also supports other browsers such as Firefox and Edge.
3. **ChromeDriver**: Selenium controls Chrome through ChromeDriver. Selenium 4.6 and later ship with Selenium Manager, which downloads a matching driver automatically; on older versions, download ChromeDriver from the official website and make sure its version matches your installed Chrome browser.
4. **Selenium Library**: Install Selenium using pip with the following command:

   `pip install selenium`

You also need to have:
- Basic Python knowledge
- Basic HTML
- An open mind to learn new things

## Getting Started with Selenium

Let’s start with a basic example to understand how Selenium works:



In [1]:
!pip install selenium webdriver-manager



In [2]:
from selenium import webdriver

# Start a Chrome session. Selenium 4.6+ locates or downloads a matching
# ChromeDriver automatically, so no driver path or webdriver-manager call is needed.
driver = webdriver.Chrome()

# Open a website
driver.get('https://neo4j.com/docs/')

# Extract the page title
page_title = driver.title
print("Page Title:", page_title)

# Close the browser
driver.quit()


Page Title: Neo4j documentation - Neo4j Documentation


In the above code, we imported the `webdriver` module from Selenium, started a Chrome session, opened a website (https://neo4j.com/docs/), read its page title, and then closed the browser with the `quit()` method.

# Web Scraping with Selenium

Now that we have a basic understanding of Selenium, let’s explore more advanced web scraping concepts. Many websites load data dynamically using JavaScript. Libraries such as `requests` and `BeautifulSoup` only see the HTML the server sends and never execute scripts, so they miss that content. Selenium drives a real browser, which runs the JavaScript and exposes the rendered page, making it a strong choice for such sites.
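
To see why static parsing falls short, the sketch below feeds a small, made-up page source to Python's built-in `html.parser`. The `RAW_HTML` string and the `ResultsText` class are illustrative assumptions, not taken from any real site: the `results` div is empty in the source, and only a script running in a browser would fill it.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP client would receive it
RAW_HTML = """
<html><body>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML = "<p>Loaded by JavaScript</p>";
  </script>
</body></html>
"""


class ResultsText(HTMLParser):
    """Collect the text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the results div (0 = outside)
        self.chunks = []    # text fragments found inside the div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "results":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


parser = ResultsText()
parser.feed(RAW_HTML)
print(parser.chunks)  # prints [] because the script was never executed
```

A browser driven by Selenium would run the script first, so the same div would contain the paragraph by the time it is read.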

## Locating Elements

To scrape data effectively, we need to locate elements on the web page. Selenium offers two lookup methods: `find_element()` returns the first match and raises `NoSuchElementException` if there is none, while `find_elements()` returns a list of all matches (empty if there are none). Both take a locator strategy from `By` and a value:

- **By ID**: `find_element(By.ID, "your_id")`
- **By Name**: `find_element(By.NAME, "your_name")`
- **By Class Name**: `find_element(By.CLASS_NAME, "your_class_name")`
- **By CSS Selector**: `find_element(By.CSS_SELECTOR, "your_css_selector")`
- **By XPath**: `find_element(By.XPATH, "your_xpath")`
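
Because `find_elements()` returns an empty list rather than raising an exception when nothing matches, it is convenient for optional elements. The helper below is a minimal sketch: `first_text` is a hypothetical name, and it only assumes that `driver` offers Selenium's `find_elements(by, value)` method.

```python
def first_text(driver, by, value, default=""):
    """Return the stripped text of the first matching element, or `default`."""
    matches = driver.find_elements(by, value)  # a list, empty when nothing matches
    return matches[0].text.strip() if matches else default


# Usage with a live session (assumes `driver` is an open webdriver.Chrome()):
#   from selenium.webdriver.common.by import By
#   print(first_text(driver, By.ID, "content", default="(no content element)"))
```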

For example, to extract the content of a paragraph with `id="content"`, we can use:

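```python
from selenium.webdriver.common.by import By

# find_element returns a single WebElement, which has a .text attribute
content = driver.find_element(By.ID, "content").text
print(content)
```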


## Handling Dynamic Content

Dynamic websites may take some time to load content using JavaScript, so we should wait for the target elements to appear before extracting data. Selenium's **Explicit Waits** do this: `WebDriverWait` repeatedly checks a condition, such as an element being present in the DOM, until it holds or a timeout expires.
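
Conceptually, an explicit wait is a polling loop. The function below is a simplified, Selenium-free sketch of that idea, not Selenium's actual implementation; `wait_until`, `element_ready`, and the timing values are made up for illustration. The real `WebDriverWait` polls every 0.5 seconds by default and raises `TimeoutException` when the timeout expires.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "is the element on the page yet?": succeeds on the third check
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "ready" if attempts["count"] >= 3 else None


print(wait_until(element_ready, timeout=2.0, poll_interval=0.01))  # prints: ready
```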

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv

# List of URLs to scrape
BASE_URLS = [
    'http://www.hubertiming.com/results/2017GPTR10K',
]
results = []

# Start a Chrome session
driver = webdriver.Chrome()

try:
    for url in BASE_URLS:
        driver.get(url)
        try:
            # Wait up to 10 seconds for the results table to be present in the DOM
            results_table = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, 'individualResults'))
            )

            # Find the rows and cells by tag name
            for row in results_table.find_elements(By.TAG_NAME, 'tr'):
                cols = row.find_elements(By.TAG_NAME, 'td')
                if len(cols) == 9:  # Skip header rows and rows with an unexpected layout
                    # Place, Bib, Name, Gender, City, State, Time, Gun Time, Team
                    results.append(tuple(col.text.strip() for col in cols))
        except Exception as e:
            print(f"Error while scraping {url}: {e}")
            continue  # Skip to the next URL if there's an error
finally:
    # Close the browser even if scraping fails
    driver.quit()

# Define CSV file name
csv_file = 'race_results.csv'

# Save the scraped race results to a CSV file
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write header
    writer.writerow(['Place', 'Bib', 'Name', 'Gender', 'City', 'State', 'Time', 'Gun Time', 'Team'])

    # Write data rows
    writer.writerows(results)

print(f"Race results saved to {csv_file}.")
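
Once the rows are written, it can be handy to read the file back and confirm its shape. The sketch below round-trips a couple of made-up rows through a temporary file with the standard `csv` module; `sample_rows`, `write_rows`, and `read_rows` are hypothetical stand-ins for the scraped data and are not part of the cell above.

```python
import csv
import tempfile
from pathlib import Path

HEADER = ['Place', 'Bib', 'Name', 'Gender', 'City', 'State', 'Time', 'Gun Time', 'Team']

# Made-up rows standing in for scraped results
sample_rows = [
    ('1', '101', 'Runner A', 'M', 'Portland', 'OR', '36:21', '36:24', ''),
    ('2', '102', 'Runner B', 'F', 'Hillsboro', 'OR', '36:42', '36:45', 'Team X'),
]


def write_rows(path, rows):
    """Write the header followed by the data rows."""
    with open(path, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_rows(path):
    """Return each data row as a dict keyed by column name."""
    with open(path, newline='', encoding='utf-8') as file:
        return list(csv.DictReader(file))


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'sample_results.csv'
    write_rows(csv_path, sample_rows)
    loaded = read_rows(csv_path)

print(len(loaded), loaded[0]['Name'])  # prints: 2 Runner A
```

Reading rows as dictionaries makes later processing less error-prone than indexing columns by position.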


### Finding Elements by Class Name and XPath

In [3]:
# ---------------------------- Required packages ----------------------------
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
# ------------------------- End of required packages -------------------------


# ---------------------------- Function definitions ----------------------------
def extract_text_by_class(c):
    """Return the text of the first element with class `c`, or "" if none exists."""
    # `driver` is the module-level browser session created further down
    try:
        return driver.find_element(By.CLASS_NAME, c).text
    except NoSuchElementException:
        return ""


def extract_links_by_xpath(xpath):
    """Return de-duplicated, cleaned href values of the elements matching `xpath`."""
    # File types that are not worth following
    skipped_suffixes = (".png", ".json", ".txt", ".svg", ".ipynb", ".jpg", ".pdf", ".mp4")
    links = set()
    try:
        for elem in driver.find_elements(By.XPATH, xpath):
            link = elem.get_attribute("href")
            # Skip missing hrefs and JavaScript pseudo-links
            if not link or link == "javascript:void(0)":
                continue
            # Skip links to images and other files, mail links, and very long URLs
            if link.endswith(skipped_suffixes) or "mailto" in link or len(link) > 300:
                continue
            # Remove anchors
            link = link.split("#")[0]
            # Remove parameters
            link = link.split("?")[0]
            # Remove trailing forward slash
            link = link.rstrip("/")
            links.add(link)
        return list(links)
    except Exception:
        return []


# ------------------------- End of function definitions -------------------------

# Start the browser that will load the page
driver = webdriver.Chrome()

# URL of the website you want to scrape
# url = input("Enter URL of the website you want to scrape: \n ")
url = 'https://neo4j.com/docs'

# Open the website in the browser
driver.get(url)

# Extract text from the content div; if nothing is found, fall back to the
# article, page, and single-user-story classes in turn
text = ""
for class_name in ("content", "article", "page", "single-user-story"):
    text = extract_text_by_class(class_name)
    if text:
        break

# Check if 404
try:
    if "Sorry, page not found" in driver.find_element(By.TAG_NAME, "body").text:
        text = "404"
except Exception:
    pass

print(text)

# Extract links from the content div
links = extract_links_by_xpath("//div[@class='content']//a[@href]")
# If nothing is found, try the article element
if not links:
    links = extract_links_by_xpath("//article[@class='article']//a[@href]")
if not links:
    links = extract_links_by_xpath("//article//a[@href]")
for link in links:
    print(link)

# driver.close() closes only the current window or tab, which is useful when
# several are open and you want to close a specific one.
# driver.quit() closes all browser windows and ends the WebDriver session,
# releasing its resources. We are done with the whole session, so quit() is enough.
driver.quit()

Neo4j documentation
Deployment options
Choose from fully and self-managed local and cloud deployments. Run Neo4j on Docker or Kubernetes.
Get a Neo4j instance
Cypher
Learn how to write Cypher®, Neo4j’s declarative query language.
Query your data
Neo4j Tools
Use Neo4j’s tools to explore, visualize, manage, monitor, and import data to your graph.
Discover the products
Graph Data Science
Run graph algorithms and machine learning models to analyze your data at scale.
Get insights from data
Create applications
Discover the client libraries and APIs to develop applications with Neo4j and AuraDB.
Start developing
Connect data sources
Learn how to use connectors and other tools to connect Neo4j with other data sources.
Connect to Neo4j
Keep exploring
Developer
Choose your deployment
Learn Cypher
Start querying
Create applications
Connect data sources
Integrate GenAI functions
Improve app performance
Extend Neo4j
Database Admin
Manage your database
Deploy and manage a cluster
Database internals
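
The link clean-up in `extract_links_by_xpath` can also be expressed with the standard library's `urllib.parse`, which splits a URL into components instead of relying on string splitting. This is an alternative sketch rather than the code used above: `normalize_link` is a hypothetical helper, the example URLs are placeholders, and unlike the original it checks the file extension on the URL path only and skips every non-HTTP scheme.

```python
from urllib.parse import urlsplit, urlunsplit

SKIPPED_EXTENSIONS = (".png", ".json", ".txt", ".svg", ".ipynb", ".jpg", ".pdf", ".mp4")


def normalize_link(href, max_length=300):
    """Return a cleaned URL, or None if the link should be skipped."""
    if not href or len(href) > max_length:
        return None
    parts = urlsplit(href)
    if parts.scheme not in ("http", "https"):
        return None  # drops mailto:, javascript:, and similar schemes
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return None  # images and other files
    # Rebuild the URL without query string, fragment, or trailing slash
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


print(normalize_link("https://example.com/docs/guide/?page=2#intro"))  # prints: https://example.com/docs/guide
print(normalize_link("mailto:someone@example.com"))                    # prints: None
```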

## Handling User Interactions
Some websites require user interactions (e.g., clicking buttons, filling in forms) to load data dynamically. Selenium can simulate these interactions with methods such as `click()` and `send_keys()`.

`search_input = driver.find_element(By.ID, 'search_input')`

`search_input.send_keys('Web Scraping')`

`search_button = driver.find_element(By.ID, 'search_button')`

`search_button.click()`
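
The same steps can be wrapped in a small reusable function. This is a sketch under assumptions: `submit_search` is a hypothetical helper, the `search_input` and `search_button` IDs come from the snippet above and will differ on a real site, and `driver` is expected to be an open Selenium session.

```python
def submit_search(driver, input_locator, button_locator, query):
    """Type `query` into a search box and click the search button.

    Each locator is a (strategy, value) pair such as (By.ID, "search_input").
    """
    search_input = driver.find_element(*input_locator)
    search_input.clear()  # remove any pre-filled text first
    search_input.send_keys(query)
    driver.find_element(*button_locator).click()


# Usage with a live session (assumes `driver` is an open webdriver.Chrome()):
#   from selenium.webdriver.common.by import By
#   submit_search(driver, (By.ID, "search_input"), (By.ID, "search_button"), "Web Scraping")
```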

### Demo: Automating a LinkedIn Login with Selenium

In [9]:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
browser.get('https://www.linkedin.com/checkpoint/lg/sign-in-another-account?trk=guest_homepage-basic_nav-header-signin')

In [10]:
email = browser.find_element(By.ID, 'username')
email.send_keys('demo@gmail.com')

In [11]:
password = browser.find_element(By.ID, 'password')
password.send_keys('*******')


In [12]:
password.submit()

In [13]:
# Close the browser
browser.quit()
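
The demo types a placeholder email and password directly into the cells. One way to keep real credentials out of the notebook is to read them from environment variables. A minimal sketch, assuming two variable names (`SCRAPER_EMAIL`, `SCRAPER_PASSWORD`) and a helper (`load_credentials`) that are made up for this example:

```python
import os


def load_credentials(env=None):
    """Return (email, password) from environment variables, or raise if one is missing."""
    env = os.environ if env is None else env
    try:
        return env["SCRAPER_EMAIL"], env["SCRAPER_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first") from None


# Usage with the login cells above (assumes `email` and `password` are the located elements):
#   user, secret = load_credentials()
#   email.send_keys(user)
#   password.send_keys(secret)
```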

example: Ticket Booking Automation