# Selenium Web Scraping Guide for Python

## What is Selenium?

Selenium is a browser automation framework built primarily for testing web applications, and it is also widely used for web scraping. Tools such as requests and BeautifulSoup only see the HTML the server returns; Selenium controls a real web browser programmatically, so it can work with JavaScript-heavy websites whose content is rendered in the browser.

### Key Features:
- **Browser Automation**: Controls real browsers (Chrome, Firefox, Safari, Edge)
- **JavaScript Support**: Can scrape dynamic content loaded by JavaScript
- **User Interaction**: Can click buttons, fill forms, scroll pages, and simulate user behavior
- **Cross-Platform**: Works on Windows, macOS, and Linux
- **Multiple Language Support**: Available for Python, Java, C#, Ruby, and JavaScript

## Installation

### 1. Install Selenium Package
```bash
pip install selenium
```

### 2. Install WebDriver Manager (Recommended)
```bash
pip install webdriver-manager
```

### 3. Alternative: Manual WebDriver Installation
If you prefer manual installation, download the driver that matches your browser and its version:
- **Chrome**: [ChromeDriver](https://chromedriver.chromium.org/)
- **Firefox**: [GeckoDriver](https://github.com/mozilla/geckodriver/releases)
- **Edge**: [EdgeDriver](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)

## Imports and Basic Setup

### Essential Imports
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
```

### WebDriver Setup Options

#### Option 1: Using WebDriver Manager (Recommended)
```python
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

# Automatically downloads and manages ChromeDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
```

#### Option 2: Manual WebDriver Path
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Specify the path to your downloaded ChromeDriver
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)
```

#### Option 3: Chrome Options for Customization
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # Run without opening a browser window
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(options=chrome_options)
```

## Step-by-Step Web Scraping Process

### Step 1: Initialize WebDriver
```python
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

# Set up the driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
```

### Step 2: Navigate to Website
```python
# Open the target website
url = "https://example.com"
driver.get(url)

# Optional: Maximize window
driver.maximize_window()
```

### Step 3: Locate Elements
Selenium provides multiple ways to find elements:

```python
from selenium.webdriver.common.by import By

# By ID
element = driver.find_element(By.ID, "element-id")

# By Class Name
element = driver.find_element(By.CLASS_NAME, "class-name")

# By Tag Name
element = driver.find_element(By.TAG_NAME, "div")

# By XPath
element = driver.find_element(By.XPATH, "//div[@class='example']")

# By CSS Selector
element = driver.find_element(By.CSS_SELECTOR, ".class-name")

# Find multiple elements
elements = driver.find_elements(By.CLASS_NAME, "multiple-class")
```

### Step 4: Extract Data
```python
# Get text content
text = element.text

# Get attribute values
href = element.get_attribute("href")
src = element.get_attribute("src")

# Get HTML content
html = element.get_attribute("innerHTML")
```

### Step 5: Handle Dynamic Content
```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for element to be present
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "dynamic-content")))

# Wait for element to be clickable
clickable_element = wait.until(EC.element_to_be_clickable((By.ID, "button-id")))
```

### Step 6: Interact with Elements
```python
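from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Locate the elements to interact with (these locators are placeholders).
# Each snippet below is independent; pick the ones your page needs.
element = driver.find_element(By.ID, "button-id")
input_field = driver.find_element(By.NAME, "input-name")
form = driver.find_element(By.TAG_NAME, "form")

# Click elements
element.click()

# Send text to input fields
input_field.send_keys("your text here")

# Clear input fields
input_field.clear()

# Submit forms
form.submit()

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Press keyboard keys
input_field.send_keys(Keys.ENTER)
input_field.send_keys(Keys.TAB)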
```

### Step 7: Handle Multiple Pages/Pagination
```python
import time

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# Example: Scraping multiple pages
page_num = 1
all_data = []

while True:
    # Scrape current page
    elements = driver.find_elements(By.CLASS_NAME, "data-item")

    for element in elements:
        all_data.append(element.text)

    # Try to find the "Next" link; stop when the last page has none
    try:
        next_button = driver.find_element(By.XPATH, "//a[contains(text(), 'Next')]")
    except NoSuchElementException:
        print(f"Scraped {page_num} pages")
        break

    next_button.click()
    time.sleep(2)  # Wait for page to load
    page_num += 1
```

### Step 8: Clean Up
```python
# Always close the driver when done
driver.quit()
```

## Complete Example: Scraping a News Website

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import time
import csv

def scrape_news_website():
    # Setup
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    
    try:
        # Navigate to website
        driver.get("https://news.ycombinator.com")
        
        # Wait for content to load
        wait = WebDriverWait(driver, 10)
        articles = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "titleline")))
        
        # Extract data
        news_data = []
        for article in articles[:10]:  # Get first 10 articles
            try:
                title_element = article.find_element(By.TAG_NAME, "a")
                title = title_element.text
                link = title_element.get_attribute("href")
                
                news_data.append({
                    'title': title,
                    'link': link
                })
            except Exception as e:
                print(f"Error extracting article: {e}")
                continue
        
        # Save to CSV
        with open('news_data.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['title', 'link'])
            writer.writeheader()
            writer.writerows(news_data)
        
        print(f"Successfully scraped {len(news_data)} articles")
        return news_data
        
    except Exception as e:
        print(f"An error occurred: {e}")
    
    finally:
        driver.quit()

# Run the scraper
if __name__ == "__main__":
    scrape_news_website()
```

## Best Practices and Tips

### 1. Respect Robots.txt and Rate Limiting
```python
import time
import random

# Add delays between requests
time.sleep(random.uniform(1, 3))
```
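The delay above covers rate limiting; the robots.txt half of this heading can be checked with Python's standard `urllib.robotparser` module. The following is a minimal sketch: the helper name `is_allowed`, the user agent string `my-scraper`, and the sample rules are all hypothetical and only serve as illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already in memory (no network call)
    return parser.can_fetch(user_agent, url)


# Illustrative rules only. A real scraper would load the site's own file with
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    for path in ("/private/page", "/public/page"):
        url = "https://example.com" + path
        print(url, is_allowed(SAMPLE_RULES, "my-scraper", url))
```

Calling a check like this before `driver.get(url)` lets the scraper skip disallowed paths instead of relying on delays alone.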

### 2. Handle Exceptions
```python
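from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.by import By

try:
    # get() raises TimeoutException if the page load timeout is exceeded
    driver.get("https://example.com")
    # find_element() raises NoSuchElementException if nothing matches
    element = driver.find_element(By.ID, "some-id")
except NoSuchElementException:
    print("Element not found")
except TimeoutException:
    print("Page load timeout")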

### 3. Use Headless Mode for Production
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=options)
```

### 4. Implement Retry Logic
```python
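import time

from selenium.common.exceptions import NoSuchElementException


def retry_find_element(driver, by, value, max_attempts=3):
    """Look up an element, retrying once per second before giving up."""
    for attempt in range(max_attempts):
        try:
            return driver.find_element(by, value)
        except NoSuchElementException:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: re-raise the original exception
            time.sleep(1)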
```

### 5. Use Context Managers
```python
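from contextlib import contextmanager

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager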

@contextmanager
def get_driver():
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    try:
        yield driver
    finally:
        driver.quit()

# Usage
with get_driver() as driver:
    driver.get("https://example.com")
    # Your scraping code here
```

## Common Challenges and Solutions

### 1. CAPTCHA and Bot Detection
- Use random delays
- Rotate user agents
- Use proxy servers
- Implement human-like behavior patterns
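As a rough illustration of the user-agent and proxy points above, the sketch below builds Chrome command-line arguments with a randomly chosen user agent. The helper name `build_chrome_args` and the user agent strings are hypothetical placeholders; `--user-agent` and `--proxy-server` are Chrome command-line switches.

```python
import random


def build_chrome_args(user_agents, proxy=None, rng=random):
    """Build Chrome arguments with a random user agent and an optional proxy."""
    args = [f"--user-agent={rng.choice(user_agents)}"]
    if proxy:
        args.append(f"--proxy-server={proxy}")
    return args


# Placeholder values for illustration; substitute real user agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]

if __name__ == "__main__":
    for arg in build_chrome_args(USER_AGENTS, proxy="http://127.0.0.1:8080"):
        print(arg)
```

Each returned string can be passed to `chrome_options.add_argument(arg)` before the driver is created, so every new browser session starts with a different identity.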

### 2. Dynamic Content Loading
- Use WebDriverWait with expected conditions
- Implement scroll-based loading detection
- Monitor network activity
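Scroll-based loading detection can be sketched as a loop that scrolls to the bottom and compares the page height before and after. This is one possible approach, not the only one: the helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are assumptions for illustration, and it relies only on the driver's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loads = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended; assume the end was reached
        last_height = new_height
        loads += 1
    return loads
```

The `max_rounds` cap keeps the loop from running forever on pages with effectively endless feeds.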

### 3. Session Management
- Handle cookies and sessions
- Implement login flows
- Maintain session state across pages
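One way to keep a session across runs is to store the browser's cookies on disk and load them back later. The sketch below assumes hypothetical helper names `save_cookies` and `load_cookies` and a JSON file path of your choosing; it uses the driver's `get_cookies()` and `add_cookie()` methods.

```python
import json


def save_cookies(driver, path):
    """Write the browser's current cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Add cookies from a JSON file to the current browser session.

    The driver should already be on a page of the cookies' domain, because
    Selenium only accepts cookies for the domain that is currently loaded.
    Returns the number of cookies added.
    """
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

A typical flow would be to log in once, call `save_cookies`, and on later runs open the site, call `load_cookies`, and refresh the page.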

### 4. Performance Optimization
- Use headless mode
- Disable images and CSS when not needed
- Implement parallel processing with multiple drivers
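Parallel processing with multiple drivers might look like the sketch below, which gives every URL its own browser through a thread pool. The function name `scrape_in_parallel` and its `make_driver` and `extract` parameters are hypothetical; the caller supplies both callables.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_in_parallel(urls, make_driver, extract, max_workers=3):
    """Scrape each URL with its own driver and return results in input order."""

    def scrape_one(url):
        driver = make_driver()  # one browser per task; drivers are not shared between threads
        try:
            driver.get(url)
            return extract(driver)
        finally:
            driver.quit()  # always release the browser, even on errors

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_one, urls))
```

For example, `scrape_in_parallel(urls, lambda: webdriver.Chrome(options=options), lambda d: d.title)` would collect page titles. Starting a browser per URL is simple but costly, so keep `max_workers` small.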

## Conclusion

Selenium is a powerful tool for web scraping, especially for JavaScript-heavy websites. While it's slower than traditional HTTP-based scraping methods, its ability to interact with dynamic content makes it invaluable for many scraping tasks. Always remember to scrape responsibly and respect website terms of service and robots.txt files.

## Example: Loading Paginated Results on Myntra

The notebook cell below applies the techniques above to a Myntra listing page. It clicks the "Next" or "Load More" control until a target number of products has loaded, then saves the page HTML.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException, WebDriverException
import time
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Path to chromedriver.exe
driver_path = r'C:/Users/bhaut/Desktop/chromedriver.exe'
service = Service(driver_path)

try:
    # Initialize Chrome WebDriver
    logger.info("Initializing Chrome WebDriver")
    driver = webdriver.Chrome(service=service)

    try:
        # Navigate to the target page
        logger.info("Navigating to Myntra page")
        driver.get('https://www.myntra.com/women-jewellery?rf=Discount%20Range%3A10.0_100.0_10.0%20TO%20100.0')

        # Wait for initial page load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "results-base"))
        )

        total_items = 0
        target_items = 6192  # Desired number of items

        while total_items < target_items:
            try:
                # Count current number of items
                items = driver.find_elements(By.CLASS_NAME, "product-base")
                total_items = len(items)
                logger.info(f"Items loaded: {total_items}")

                if total_items >= target_items:
                    logger.info("Reached target number of items")
                    break

                # Find the "Next" or "Load More" button
                load_more_button = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, "//li[contains(@class, 'pagination-next')] | //button[contains(text(), 'Load More')]"))
                )

                # Scroll to the button
                driver.execute_script("arguments[0].scrollIntoView(true);", load_more_button)
                time.sleep(0.5)  # Brief pause for scrolling

                # Click the button
                logger.info("Clicking 'Next' or 'Load More' button")
                load_more_button.click()

                # Wait for new content to load
                WebDriverWait(driver, 10).until(
                    lambda d: len(d.find_elements(By.CLASS_NAME, "product-base")) > total_items
                )

            except TimeoutException:
                logger.error("Timeout waiting for new content or button")
                break
            except NoSuchElementException:
                logger.error("No 'Next' or 'Load More' button found")
                break
            except ElementClickInterceptedException:
                logger.warning("Button click intercepted, attempting to handle overlay")
                try:
                    # Hide the overlay, then retry the click
                    driver.execute_script("document.querySelector('.overlay').style.display='none';")
                    load_more_button.click()
                except WebDriverException:
                    # Covers script errors and a second intercepted click
                    logger.error("Failed to handle overlay")
                    break

        # Save the HTML
        logger.info("Saving HTML content")
        html = driver.page_source
        with open('myntra_women.html', 'w', encoding='utf-8') as f:
            f.write(html)
        logger.info(f"Scraping complete. Total items loaded: {total_items}")

    finally:
        # Clean up
        logger.info("Closing WebDriver")
        driver.quit()

except WebDriverException as e:
    logger.error(f"Failed to initialize WebDriver: {str(e)}")
    print("Please ensure the chromedriver.exe version matches your Chrome browser version and system architecture.")
    print("Download the correct version from https://chromedriver.chromium.org/downloads")
```

Log output from one run, which stopped after the first 50 items:

```
2025-06-05 00:41:39,419 - INFO - Initializing Chrome WebDriver
2025-06-05 00:41:40,992 - INFO - Navigating to Myntra page
2025-06-05 00:41:59,942 - INFO - Items loaded: 50
2025-06-05 00:42:00,668 - INFO - Clicking 'Next' or 'Load More' button
2025-06-05 00:42:01,785 - ERROR - Failed to handle overlay
2025-06-05 00:42:01,785 - INFO - Saving HTML content
2025-06-05 00:42:01,895 - INFO - Scraping complete. Total items loaded: 50
2025-06-05 00:42:01,897 - INFO - Closing WebDriver
```
