# Intro to Data Science: Automated Browser Control

## Topic: Dynamic Web Scraping with Selenium

### Why are we here?
So far, we have used `requests` and `BeautifulSoup`. This is like sending a letter to a server and getting a letter back.

However, many modern websites (Instagram, Twitter, infinite-scrolling blogs) are **dynamic**: the page is built by JavaScript running in the browser. They are not letters; they are applications.
* Buttons need to be clicked.
* Content loads only when you scroll down.
* Pop-ups need to be closed.
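
To make the difference concrete, here is a small sketch using only the standard library. `STATIC_HTML` is a made-up, simplified stand-in for what a server might send back for a JavaScript-driven page (it is not copied from any real site). Parsing it finds the empty container but no actual items, because those would only appear after the browser runs the script.

```python
from html.parser import HTMLParser

# Hypothetical, simplified response for a JavaScript-driven page:
# an empty container plus a script that would fill it in later.
STATIC_HTML = """
<html>
  <body>
    <div class="quotes"></div>
    <script src="app.js"></script>
  </body>
</html>
"""


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


print(count_class(STATIC_HTML, "quote"))   # 0: no items in the raw HTML
print(count_class(STATIC_HTML, "quotes"))  # 1: only the empty container
```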

**Selenium** allows us to write a script that acts like a "Ghost User" controlling the actual Chrome browser.

## Part 1: Installation
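We need the `selenium` library to control the browser, and `webdriver-manager` to download the matching browser driver automatically (so we don't have to fetch driver binaries such as `chromedriver` by hand). Recent Selenium releases (4.6 and later) can also locate a driver on their own, but we use `webdriver-manager` here so the setup step is explicit.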

In [9]:
# Run this once to install the libraries
%pip install selenium webdriver-manager pandas

Note: you may need to restart the kernel to use updated packages.
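
If you are not sure whether the install worked, a quick check like the one below can help. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is our own, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["selenium", "webdriver-manager", "pandas"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If a package shows up as not installed right after running `%pip install`, restarting the kernel is usually the first thing to try.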


## Part 2: Summoning the Browser
Let's launch a Chrome window controlled by Python.

In [10]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

# 1. Setup the Driver
# This installs the correct driver for your Chrome version automatically
service = Service(ChromeDriverManager().install())

# 2. Options (Optional)
# We can run it "Headless" (invisible) or "Headed" (visible).
# For learning, we want to SEE the browser working!
options = webdriver.ChromeOptions()
# options.add_argument("--headless") # Uncomment this to run invisibly later

# 3. Launch the Browser
driver = webdriver.Chrome(service=service, options=options)

print("Browser launched!")

Browser launched!
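
Once you are comfortable, you will probably want to switch between visible and headless runs without editing comments. One possible way to organise that is sketched below. The helper names `chrome_arguments` and `launch_chrome` are hypothetical, and `--headless=new` assumes a reasonably recent Chrome (older versions may only accept plain `--headless`).

```python
def chrome_arguments(headless=False, window_size=(1280, 900)):
    """Build the list of Chrome command-line switches for a session."""
    width, height = window_size
    args = [f"--window-size={width},{height}"]
    if headless:
        # Assumption: a recent Chrome that understands the newer headless mode.
        args.append("--headless=new")
    return args


def launch_chrome(service=None, headless=False):
    """Start Chrome with the switches above.

    Selenium is imported here so the helper above stays usable without it.
    Pass the `service` object from the setup cell to reuse its driver path.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(service=service, options=options)


print(chrome_arguments(headless=True))
# ['--window-size=1280,900', '--headless=new']
```

Setting a fixed window size is worth doing in headless mode, because some pages lay themselves out differently (or hide elements) in a very small window.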


## Part 3: The "Hello World" of Interaction
We will practice on a sandbox website designed for scraping. It uses infinite scrolling: the quotes are added by JavaScript as you scroll, so fetching this URL with `requests` and parsing the HTML would miss them.

**Target:** `http://quotes.toscrape.com/scroll`

In [3]:
url = "http://quotes.toscrape.com/scroll"

# Tell the driver to go to the URL
driver.get(url)

print(f"Successfully opened: {driver.title}")

# Let's pause to admire our work (and let the page load)
time.sleep(3)

Successfully opened: Quotes to Scrape


## Part 4: Locating Elements (The Strategy)
In Data Science, finding the data is often the hardest part of the job. Selenium finds elements with **Locators**:
* `By.ID` (Best when the element has a unique `id`)
* `By.CLASS_NAME` (Good for groups of items; takes a single class name)
* `By.CSS_SELECTOR` (Powerful and concise)
* `By.XPATH` (The nuclear option - allows complex logic)
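
The strategies overlap: the same elements can usually be reached in more than one way. The sketch below shows how a single class name translates into a CSS selector and an XPath expression. The two helper functions are our own illustration and are not part of Selenium.

```python
def css_for_class(class_name):
    """CSS selector matching any element that has `class_name`."""
    return f".{class_name}"


def xpath_for_class(class_name):
    """XPath matching any element whose class list contains `class_name`.

    The concat/normalize-space idiom avoids matching 'quotes' when we
    only want 'quote'.
    """
    return (
        "//*[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {class_name} ')]"
    )


print(css_for_class("quote"))
print(xpath_for_class("quote"))

# With a running driver, these three calls should return the same elements:
#   driver.find_elements(By.CLASS_NAME, "quote")
#   driver.find_elements(By.CSS_SELECTOR, css_for_class("quote"))
#   driver.find_elements(By.XPATH, xpath_for_class("quote"))
```

Notice how much longer the XPath version is for the same job. That is a good reason to reach for it only when the simpler locators cannot express what you need.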

In [4]:
from selenium.webdriver.common.by import By

# Let's try to find all the quotes currently visible on the screen
# Inspecting the page shows each quote has the class 'quote'
quotes = driver.find_elements(By.CLASS_NAME, "quote")

print(f"I found {len(quotes)} quotes on the screen.")

# Let's print the text of the first one
if quotes:
    first_quote = quotes[0]
    # An element is an object, not a string: .text returns its visible text
    print(f"First Quote: {first_quote.text}")

I found 10 quotes on the screen.
First Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
Tags: change deep-thoughts thinking world
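
Notice that `.text` on the whole card returns the quote, the author and the tags as one multi-line string. You could split that string yourself, as in the rough sketch below, which assumes the three-line layout shown in the output above (`parse_card_text` is a hypothetical helper). It is fragile, though: any change in the page layout breaks it, which is why locating sub-elements individually is usually the better choice.

```python
def parse_card_text(card_text):
    """Split a quote card's visible text into quote, author and tags.

    Assumes the layout: quote on the first line, then a 'by ...' line,
    then a 'Tags: ...' line.
    """
    lines = card_text.splitlines()
    quote = lines[0] if lines else ""
    author = ""
    tags = []
    for line in lines[1:]:
        if line.startswith("by "):
            author = line[len("by "):].strip()
        elif line.startswith("Tags:"):
            tags = line[len("Tags:"):].split()
    return {"quote": quote, "author": author, "tags": tags}


sample = (
    "“The world as we have created it is a process of our thinking. "
    "It cannot be changed without changing our thinking.”\n"
    "by Albert Einstein\n"
    "Tags: change deep-thoughts thinking world"
)

print(parse_card_text(sample))
```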


## Part 5: The Challenge - Infinite Scroll
If you scroll down on this page, new quotes appear. `requests` cannot see these, because they are added by JavaScript after the initial page load. We need to tell the browser to scroll.

In [5]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# A list to store our data
all_data = []

# Let's loop 3 times to simulate scrolling down 3 "pages"
for i in range(3):
    print(f"--- Scroll #{i+1} ---")

    # 1. Scroll down using JavaScript
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # 2. THE CRITICAL STEP: Wait for new data to load
    # We wait until the page catches up.
    # In a real project, we use WebDriverWait, but for simplicity here we will use sleep.
    time.sleep(2)

# 3. Now that we scrolled 3 times, let's grab EVERYTHING
final_quotes = driver.find_elements(By.CLASS_NAME, "quote")

from selenium.common.exceptions import NoSuchElementException

for card in final_quotes:
    # Extract sub-elements relative to the card
    # Note: we use card.find_element, not driver.find_element
    try:
        text = card.find_element(By.CLASS_NAME, "text").text
        author = card.find_element(By.CLASS_NAME, "author").text
    except NoSuchElementException:
        # Skip cards that are missing a text or author element
        continue
    all_data.append({"author": author, "quote": text})

print(f"Success! Collected {len(all_data)} quotes.")
print(all_data)

--- Scroll #1 ---
--- Scroll #2 ---
--- Scroll #3 ---
Success! Collected 40 quotes.
[{'author': 'Albert Einstein', 'quote': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}, {'author': 'J.K. Rowling', 'quote': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}, {'author': 'Albert Einstein', 'quote': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”'}, {'author': 'Jane Austen', 'quote': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}, {'author': 'Marilyn Monroe', 'quote': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"}, {'author': 'Albert Einstein', 'quote': '“Try not to become a man of success. Rather become a man of value.”'}, {'author': 'André Gide', 'quote': '“It is bett
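
Scrolling a fixed number of times works for a demo, but on a real page you rarely know how many scrolls you need. Two common alternatives are sketched below, with hypothetical helper names. `scroll_until_stable` keeps scrolling until the page height stops growing. `wait_for_more_quotes` shows how `WebDriverWait` could replace the fixed `time.sleep`, assuming the quote cards still use the class `quote`.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls after which the page grew.
    """
    grew = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared: assume we reached the end
        last_height = new_height
        grew += 1
    return grew


def wait_for_more_quotes(driver, previous_count, timeout=10):
    """Block until more than `previous_count` quote cards are present.

    Raises selenium's TimeoutException if that does not happen in time.
    Selenium is imported here so the function above works without it.
    """
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        lambda d: len(d.find_elements(By.CLASS_NAME, "quote")) > previous_count
    )


# Possible usage with the driver from Part 2:
#   scroll_until_stable(driver, pause=2)
#   final_quotes = driver.find_elements(By.CLASS_NAME, "quote")
```

The height check still relies on a pause, so a slow connection can make it stop too early. The `max_rounds` cap is there so that a truly endless feed cannot keep the loop running forever.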

## Part 6: Cleanup & Export
Always close the door behind you. Leaving browser processes running eats up RAM.

In [6]:
import pandas as pd

# Close the browser
driver.quit()

# Convert to DataFrame
df = pd.DataFrame(all_data)
print(df.head())

# df.to_csv("scraped_quotes.csv", index=False)

            author                                              quote
0  Albert Einstein  “The world as we have created it is a process ...
1     J.K. Rowling  “It is our choices, Harry, that show what we t...
2  Albert Einstein  “There are only two ways to live your life. On...
3      Jane Austen  “The person, be it gentleman or lady, who has ...
4   Marilyn Monroe  “Imperfection is beauty, madness is genius and...
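
If the scraping code crashes halfway through, `driver.quit()` never runs and the browser stays open. One way to guard against that is a small context manager, sketched below. `browser_session` is a hypothetical helper name; it works with any object that has a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver from make_driver() and always call quit() afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the scraping code raised an exception


# Possible usage with the objects from Part 2:
#   with browser_session(lambda: webdriver.Chrome(service=service, options=options)) as driver:
#       driver.get("http://quotes.toscrape.com/scroll")
#       ...
```

Keep in mind that once `quit()` has run, that driver object cannot be reused. To scrape again, create a new session.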


## Class Exercise (15 Mins)

**Task:** The "Wiki-Clicker"
1.  Navigate to `https://en.wikipedia.org/wiki/Main_Page`
2.  Find the Search Bar (Inspect it to find the ID or Name).
3.  Type "Data Science" into it.
4.  Find the Search Button and click it (or send the "Return" key).
5.  Print the title of the new page.

In [None]:
# YOUR CODE HERE

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# driver.quit() in Part 6 ended the previous session. Reusing that driver
# raises InvalidSessionIdException, so we start a fresh browser here.
# (Service, ChromeDriverManager and options come from Part 2.)
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()), options=options
)

url = "https://en.wikipedia.org/wiki/Main_Page"

driver.get(url)
print(f"Successfully opened: {driver.title}")
time.sleep(3)

# The search input has name="search"
search_bar = driver.find_element(By.NAME, "search")
search_bar.send_keys("Data Science")

# Click the search button next to the input
# Alternative: search_bar.send_keys(Keys.RETURN)
search_button = driver.find_element(
    By.CSS_SELECTOR, ".cdx-button.cdx-search-input__end-button"
)
search_button.click()
time.sleep(3)

print("New page title:", driver.title)

# Close the browser when we are done
driver.quit()
