## Web Scraping With BeautifulSoup And Selenium

### What is Web Scraping?
Web scraping is like being a detective on the internet! It's a way to collect information (like text, prices, or names) from websites by using a computer program. Instead of copying and pasting by hand, you let Python grab the data for you.

### BeautifulSoup

BeautifulSoup is a simple Python library that helps you pull data out of web pages. It parses the HTML of a website, which is often messy, and makes it easy to find things like titles, paragraphs, or lists.
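To make that concrete, here is a minimal sketch that parses a small hand-written HTML string (a made-up stand-in for a real page, so no internet connection is needed) and pulls out the title, a paragraph, and the list items.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
html = """
<html>
  <head><title>Toy Shop</title></head>
  <body>
    <h1>Today's Prices</h1>
    <p class="intro">Prices updated daily.</p>
    <ul>
      <li>Robot: $12</li>
      <li>Kite: $5</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                       # text inside <title>
intro = soup.find("p", class_="intro").get_text()    # first <p> with class "intro"
items = [li.get_text() for li in soup.find_all("li")]  # every <li> on the page

print(page_title)
print(intro)
print(items)
```

`find` returns the first match and `find_all` returns every match, which is usually all you need for simple pages.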

### Why Use It?

- Saves time: Grab lots of data fast, like product prices or news headlines.

- Fun for beginners: You can explore websites and collect cool info.

- Useful: Get data for projects, like tracking toy prices or weather updates.

In [1]:
from bs4 import BeautifulSoup
import requests
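
For pages that do not need JavaScript to render, `requests` and BeautifulSoup are often enough on their own. The sketch below is one possible way to combine them; the helper names `fetch_html` and `page_title` are made up for this example, and the download step is only defined, not called, so the cell runs offline.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def page_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


# Offline check with a tiny hand-written page
print(page_title("<html><head><title> Hello </title></head></html>"))
```

With a network connection, `page_title(fetch_html("https://example.com"))` would chain the two steps.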

### Scrape Data from CoinGecko Using BeautifulSoup

In [3]:
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Open the page in a real browser so JavaScript-rendered content is included
driver = webdriver.Chrome()
driver.get("https://www.coingecko.com/en")
time.sleep(5)  # Crude fixed wait for the page to load

### Reading the Page

In [4]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

print(soup.prettify()[:200])

driver.quit() # Close the browser

<html lang="en">
 <head>
  <iframe allow="conversion-measurement" frameborder="0" height="1" id="iframe_wrapper_analyticstrack" marginheight="0" marginwidth="0" scrolling="no" src="https://tpc.googles


In [5]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

# Optional: run headless (no UI)
options = Options()
# options.add_argument('--headless')  # Uncomment for headless
# Turn off the Blink "AutomationControlled" feature that flags automated browsers
options.add_argument('--disable-blink-features=AutomationControlled')

driver = webdriver.Chrome(options=options)
driver.get('https://www.coingecko.com')

time.sleep(5)  # Wait for full render

# Try to click the "Accept all" cookie button if it appears
try:
    accept_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Accept all')]")
    accept_button.click()
    time.sleep(2)  # Wait for page to update
except NoSuchElementException:
    # Catch only the "element not found" case instead of using a bare except
    print("No cookie modal found")

# Now scrape
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify()[:200])

driver.quit()

No cookie modal found
<html lang="en">
 <head>
  <iframe allow="conversion-measurement" frameborder="0" height="1" id="iframe_wrapper_analyticstrack" marginheight="0" marginwidth="0" scrolling="no" src="https://tpc.googles
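
Both cells above stop at printing the first 200 characters of the page. The usual next step is to pull specific values out of `soup`. Here is a minimal sketch of reading rows from an HTML table; the markup and the coin names and prices are placeholders invented for this example, and the real CoinGecko page may use different tags and classes, so inspect it in the browser first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and placeholder values, not real CoinGecko data
html = """
<table>
  <thead><tr><th>Coin</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Coin A</td><td>$1.00</td></tr>
    <tr><td>Coin B</td><td>$2.50</td></tr>
  </tbody>
</table>
"""


def table_rows(html):
    """Return each data row of the page's tables as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th>, so they are skipped
            rows.append(cells)
    return rows


rows = table_rows(html)
for name, price in rows:
    print(name, price)
```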


### Simple Reading from Wikipedia

In [16]:
import pandas as pd

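# read_html returns a list with one DataFrame per table found on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average')
df = tables[2]  # The components table was at position 2 when this was run
df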

|   | Company | Exchange | Symbol | Industry | Date added | Notes | Index weighting |
|---|---|---|---|---|---|---|---|
| 0 | 3M | NYSE | MMM | Conglomerate | 1976-08-09 | As Minnesota Mining and Manufacturing | 2.17% |
| 1 | American Express | NYSE | AXP | Financial services | 1982-08-30 |  | 4.31% |
| 2 | Amgen | NASDAQ | AMGN | Biopharmaceutical | 2020-08-31 |  | 4.14% |
| 3 | Amazon | NASDAQ | AMZN | Retailing | 2024-02-26 |  | 2.99% |
| 4 | Apple | NASDAQ | AAPL | Information technology | 2015-03-19 |  | 2.92% |
| 5 | Boeing | NYSE | BA | Aerospace and defense | 1987-03-12 |  | 3.03% |
| 6 | Caterpillar | NYSE | CAT | Construction and mining | 1991-05-06 |  | 5.13% |
| 7 | Chevron | NYSE | CVX | Petroleum industry | 2008-02-19 | Also 1930-07-18 to 1999-11-01 | 2.01% |
| 8 | Cisco | NASDAQ | CSCO | Information technology | 2009-06-08 |  | 0.92% |
| 9 | Coca-Cola | NYSE | KO | Drink industry | 1987-03-12 | Also 1932-05-26 to 1935-11-20 | 1.04% |
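
Picking a table by position breaks if the page gains or loses a table. `pd.read_html` also accepts a `match` argument that keeps only tables containing the given text. The sketch below shows this on a small inline HTML string with made-up filler rows, so it runs without a network call; it assumes a parser backend for pandas (`lxml`, or `bs4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# Two small tables; only the second contains the word "Symbol"
html = """
<table>
  <thead><tr><th>Key</th><th>Value</th></tr></thead>
  <tbody><tr><td>alpha</td><td>one</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Company</th><th>Symbol</th></tr></thead>
  <tbody>
    <tr><td>3M</td><td>MMM</td></tr>
    <tr><td>Apple</td><td>AAPL</td></tr>
  </tbody>
</table>
"""

# match keeps only the tables whose text matches the given string or regex
tables = pd.read_html(StringIO(html), match="Symbol")
components = tables[0]
print(components)
```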


In [17]:
# Round trip: moving 'Symbol' into the index and back out makes it the first column
df.set_index('Symbol', inplace=True)
df.reset_index(inplace=True)
df



|   | Symbol | Company | Exchange | Industry | Date added | Notes | Index weighting |
|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | NYSE | Conglomerate | 1976-08-09 | As Minnesota Mining and Manufacturing | 2.17% |
| 1 | AXP | American Express | NYSE | Financial services | 1982-08-30 |  | 4.31% |
| 2 | AMGN | Amgen | NASDAQ | Biopharmaceutical | 2020-08-31 |  | 4.14% |
| 3 | AMZN | Amazon | NASDAQ | Retailing | 2024-02-26 |  | 2.99% |
| 4 | AAPL | Apple | NASDAQ | Information technology | 2015-03-19 |  | 2.92% |
| 5 | BA | Boeing | NYSE | Aerospace and defense | 1987-03-12 |  | 3.03% |
| 6 | CAT | Caterpillar | NYSE | Construction and mining | 1991-05-06 |  | 5.13% |
| 7 | CVX | Chevron | NYSE | Petroleum industry | 2008-02-19 | Also 1930-07-18 to 1999-11-01 | 2.01% |
| 8 | CSCO | Cisco | NASDAQ | Information technology | 2009-06-08 |  | 0.92% |
| 9 | KO | Coca-Cola | NYSE | Drink industry | 1987-03-12 | Also 1932-05-26 to 1935-11-20 | 1.04% |
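
The `Index weighting` column arrives as text such as `2.17%`, so it cannot be sorted or summed as a number yet. Here is a minimal sketch of cleaning it up and looking a company up by symbol, using three rows copied from the output above. The new column name `Weight (%)` is just a choice made for this example.

```python
import pandas as pd

# Three rows copied from the table above
df = pd.DataFrame({
    "Symbol": ["MMM", "AXP", "AMGN"],
    "Company": ["3M", "American Express", "Amgen"],
    "Index weighting": ["2.17%", "4.31%", "4.14%"],
})

# Strip the trailing percent sign and convert the text to floats
df["Weight (%)"] = df["Index weighting"].str.rstrip("%").astype(float)

# With 'Symbol' as the index, rows can be looked up by ticker
by_symbol = df.set_index("Symbol")
print(by_symbol.loc["AXP", "Weight (%)"])

# Ticker with the largest weight among these three rows
heaviest = by_symbol["Weight (%)"].idxmax()
print(heaviest)
```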


### CoinDesk Headline Scraper with Selenium & BeautifulSoup
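
The scraper in this section needs a working browser driver, but its parsing step (find every `h3`, look for a parent `a` tag, resolve relative links) can be tried offline first. The sketch below does that on a hand-written HTML string. The markup and headline text are invented for illustration, the helper name `extract_headlines` is hypothetical, and `urljoin` from the standard library is swapped in for manual string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.coindesk.com/"

# Hypothetical markup for illustration; the live page's structure may differ
html = """
<a href="/markets/example-story"><h3> Example headline one </h3></a>
<a href="https://example.com/full-url"><h3>Example headline two</h3></a>
<h3>Headline without a link</h3>
"""


def extract_headlines(html, base_url, limit=5):
    """Return up to `limit` (title, absolute_link_or_None) pairs from h3 tags."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for h3 in soup.find_all("h3")[:limit]:
        title = h3.get_text(strip=True)
        parent_link = h3.find_parent("a")  # None when the h3 is not inside a link
        href = parent_link.get("href") if parent_link else None
        # urljoin turns "/path" into a full URL and leaves absolute URLs unchanged
        link = urljoin(base_url, href) if href else None
        results.append((title, link))
    return results


headlines = extract_headlines(html, BASE_URL)
for title, link in headlines:
    print(title, "->", link)
```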

In [20]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# Set up Selenium with headless Chrome for efficiency
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

try:
    # URL for CoinDesk news
    url = "https://www.coindesk.com/"
    driver.get(url)

    # Wait for headlines to load (adjust timeout as needed)
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, "h3"))
        )
    except Exception as e:
        print(f"Timeout waiting for headlines: {e}")
        raise  # Re-raise; the outer finally block closes the browser

    # Parse page source with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Find news headlines (try a broader selector first)
    headlines = soup.find_all("h3")  # Fallback: all h3 tags
    # Alternative: Try a class-based selector (update after inspection)
    # headlines = soup.find_all("h3", {"class": "your-class-here"})
    # Debugging: Print number of headlines found
    print(f"Found {len(headlines)} headlines.")

    if not headlines:
        print("No headlines found. Check HTML structure or class name.")
        # Optional: Print part of the page source for debugging
        print("Sample HTML:", soup.prettify()[:1000])
        # No early exit needed: the loop below does nothing for an empty list,
        # and the finally block closes the browser

    # Extract up to 5 headlines
    for i, headline in enumerate(headlines[:5], 1):
        try:
            title = headline.text.strip()
            # Find the parent link (if it exists)
            parent_link = headline.find_parent("a")
            link = parent_link["href"] if parent_link and parent_link.get("href") else "No link found"
            # Handle relative links; drop the trailing slash from url to avoid "//" in the result
            full_link = url.rstrip("/") + link if link.startswith("/") else link
            print(f"Headline {i}: {title}\nLink: {full_link}\n")
        except Exception as e:
            print(f"Error processing headline {i}: {str(e)}")

finally:
    # Clean up
    driver.quit()

WebDriverException: Message: unknown error: cannot find Chrome binary
Stacktrace: (chromedriver frames omitted)
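
The error above says chromedriver started but could not find a Chrome browser to launch. One way to check this before creating a driver is sketched below. The candidate names and install paths are assumptions about common setups, not an exhaustive list, and `find_chrome_binary` is a hypothetical helper name.

```python
import os
import shutil

# Assumed executable names and install locations; adjust for your system
CANDIDATE_NAMES = ["google-chrome", "google-chrome-stable", "chromium", "chromium-browser", "chrome"]
CANDIDATE_PATHS = [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",  # typical macOS install
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",        # typical Windows install
]


def find_chrome_binary(names=CANDIDATE_NAMES, paths=CANDIDATE_PATHS):
    """Return the first Chrome-like binary found on PATH or at a known path, else None."""
    for name in names:
        found = shutil.which(name)  # searches the PATH environment variable
        if found:
            return found
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


print(find_chrome_binary())
```

If this prints a path, setting `options.binary_location` to it before calling `webdriver.Chrome(options=options)` tells Selenium where the browser lives. If it prints `None`, Chrome probably needs to be installed first.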


In [22]:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options

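edge_options = Options()  # Selenium 4 drives Chromium-based Edge; no use_chromium flag is needed

# NOTE: "path/to/msedgedriver.exe" is a placeholder. Left unchanged, it most likely
# causes the NoSuchDriverException shown below. Replace it with the real driver path.
service = Service("path/to/msedgedriver.exe")
driver = webdriver.Edge(service=service, options=edge_options)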

driver.get("https://example.com")
print(driver.title)
driver.quit()


NoSuchDriverException: Message: Unable to obtain driver for MicrosoftEdge; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location
