# Introduction

- Web scraping is the automated collection of data from websites.
- It is widely used in data science, research, journalism, and industry.
- Scraping helps extract publicly available data that is not in structured formats.
- This notebook covers requesting web pages, parsing HTML, and extracting data.
- Ethical and responsible scraping practices are essential in real-world use.

# Ethics of Web Scraping

Although web scraping often involves publicly accessible data, it raises important ethical and legal considerations. Responsible scraping requires respecting both website owners and users.

Key ethical principles include:

- **Respect website policies**: Always review a website’s `robots.txt` file and terms of service to understand what is permitted.
- **Avoid excessive requests**: Sending too many requests in a short period can overload servers. Implement rate limiting and delays when scraping.
- **Do not scrape sensitive data**: Personal, private, or confidential information should never be collected without explicit permission.
- **Attribute data sources**: When using scraped data for research or publication, properly credit the original source.
- **Use data responsibly**: Scraped data should be used in ways that do not harm individuals, organizations, or communities.

Ethical web scraping balances technical capability with responsibility, ensuring that data collection practices are fair, transparent, and respectful.

In [1]:
!pip install requests beautifulsoup4 lxml


Collecting lxml
  Downloading lxml-6.0.2-cp311-cp311-win_amd64.whl (4.0 MB)
     ---------------------------------------- 4.0/4.0 MB 8.1 MB/s eta 0:00:00
Installing collected packages: lxml
Successfully installed lxml-6.0.2



[notice] A new release of pip available: 22.3.1 -> 26.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "http://books.toscrape.com/"

# 1. Fetch the web page content
headers = {
"User-Agent": "My Web Scraper 1.0 - for educational purposes"
}
# Make the request
response = requests.get(url, headers=headers)
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

# 2. Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [4]:
import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "http://books.toscrape.com/"

# 1. Fetch the web page content
try:
    response = requests.get(url)
    response.raise_for_status()

    # 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
   
    # 3. Extract the data
    # We want to get the title of every book on the page.
    # By inspecting the website, we find that book titles are in <h3> tags
    # which are inside <article class="product_pod"> tags.

   
    book_titles = []
    book_prices = []
    # Find all 'article' tags with the class 'product_pod'
    for book in soup.find_all('article', class_='product_pod'):
        # Inside each article, find the 'h3' tag, then the 'a' tag, and get its title attribute
        title = book.h3.a['title']
        href = book.h3.a['href']
        new_title = title + '  ' + href
       
       
        price = book.find('p', class_="price_color").text
        book_prices.append(price)
       
       
    print("--- Found Book Titles ---")
    for i, (title, price) in enumerate(zip(book_titles,book_prices),start=1): # Print first 5
        print(f"{i}. {title}, Price:{price}")

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

--- Found Book Titles ---
1. A Light in the Attic  catalogue/a-light-in-the-attic_1000/index.html, Price:Â£51.77
2. Tipping the Velvet  catalogue/tipping-the-velvet_999/index.html, Price:Â£53.74
3. Soumission  catalogue/soumission_998/index.html, Price:Â£50.10
4. Sharp Objects  catalogue/sharp-objects_997/index.html, Price:Â£47.82
5. Sapiens: A Brief History of Humankind  catalogue/sapiens-a-brief-history-of-humankind_996/index.html, Price:Â£54.23
6. The Requiem Red  catalogue/the-requiem-red_995/index.html, Price:Â£22.65
7. The Dirty Little Secrets of Getting Your Dream Job  catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html, Price:Â£33.34
8. The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull  catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html, Price:Â£17.93
9. The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics  catalogue

In [17]:
import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "http://books.toscrape.com/"

# 1. Fetch the web page content
try:
    response = requests.get(url)
    response.raise_for_status()

    # 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
   
    # 3. Extract the data
    # We want to get the title of every book on the page.
    # By inspecting the website, we find that book titles are in <h3> tags
    # which are inside <article class="product_pod"> tags.

   
    book_titles = []
    book_prices = []
    book_categories = []
    categories =soup.find('ul',class_='nav')
    # print(categories)
    new = categories.find('ul')
    # print(new.find_all('li')
    for category in new.find_all('li'):
        name=category.a.text.strip()
        # print(len(name))
        # print(name)
        
        # print(new)
        book_categories.append(name)
       
    print("--- Book Categories---")
    for i, (name) in enumerate(book_categories,start=1): # Print first 5
        print(f"{i}. {name}")

   

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

--- Book Categories---
1. Travel
2. Mystery
3. Historical Fiction
4. Sequential Art
5. Classics
6. Philosophy
7. Romance
8. Womens Fiction
9. Fiction
10. Childrens
11. Religion
12. Nonfiction
13. Music
14. Default
15. Science Fiction
16. Sports and Games
17. Add a comment
18. Fantasy
19. New Adult
20. Young Adult
21. Science
22. Poetry
23. Paranormal
24. Art
25. Psychology
26. Autobiography
27. Parenting
28. Adult Fiction
29. Humor
30. Horror
31. History
32. Food and Drink
33. Christian Fiction
34. Business
35. Biography
36. Thriller
37. Contemporary
38. Spirituality
39. Academic
40. Self Help
41. Historical
42. Christian
43. Suspense
44. Short Stories
45. Novels
46. Health
47. Politics
48. Cultural
49. Erotica
50. Crime


In [None]:
import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "https://realpython.com/"

# 1. Fetch the web page content
try:
    response = requests.get(url)
    response.raise_for_status()

    # 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
   
    # 3. Extract the data
    # We want to get the title of every book on the page.
    # By inspecting the website, we find that book titles are in <h3> tags
    # which are inside <article class="product_pod"> tags.

   
    
    
    for title in soup.find_all('div', class_='product_pod'):
        # Inside each article, find the 'h3' tag, then the 'a' tag, and get its title attribute
        title = book.h3.a['title']
        href = book.h3.a['href']
        new_title = title + '  ' + href
       
       
        price = book.find('p', class_="price_color").text
        book_prices.append(price)
       
       
    print("--- Found Book Titles ---")
    for i, (title, price) in enumerate(zip(book_titles,book_prices),start=1): # Print first 5
        print(f"{i}. {title}, Price:{price}")

   

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

In [3]:
import requests
from bs4 import BeautifulSoup
# from urllib.parse import urljoin

# The URL of a simple website to scrape
url = "http://books.toscrape.com/"
book_titles = []
book_prices = []
book_categories = []
current_url="http://books.toscrape.com/"
count =0
max_page=5

# 1. Fetch the web page content
try:
    while True:
        if count<= max_page:
            response = requests.get(current_url)
            response.raise_for_status()
        
            # 2. Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
           
            # 3. Extract the data
            # We want to get the title of every book on the page.
            # By inspecting the website, we find that book titles are in <h3> tags
            # which are inside <article class="product_pod"> tags.
        
           
            for book in soup.find_all('article', class_='product_pod'):
                # Inside each article, find the 'h3' tag, then the 'a' tag, and get its title attribute
                title = book.h3.a['title']
                href = book.h3.a['href']
                new_title = title + '  ' + href
                book_titles.append(new_title)
               
                price = book.find('p', class_="price_color").text
               
                book_prices.append(price)
               
            next_button =soup.find('li',class_='next') 
            
            if next_button:
                temp_url=next_button.a['href']
                if "catalogue/page-2.html" in temp_url:
                    current_url=url+temp_url
                    count +=1
                    # current_url=urljoin(current_url,next_page)
                else:
                    current_url =url + 'catalogue/' + temp_url
                    count+=1
        else:
            break
            
    print("--- Found Book Titles ---")
    for i, (title, price) in enumerate(zip(book_titles,book_prices),start=1): # Print first 5
        print(f"{i}. {title}, Price:{price}")

   

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

--- Found Book Titles ---
1. A Light in the Attic  catalogue/a-light-in-the-attic_1000/index.html, Price:Â£51.77
2. Tipping the Velvet  catalogue/tipping-the-velvet_999/index.html, Price:Â£53.74
3. Soumission  catalogue/soumission_998/index.html, Price:Â£50.10
4. Sharp Objects  catalogue/sharp-objects_997/index.html, Price:Â£47.82
5. Sapiens: A Brief History of Humankind  catalogue/sapiens-a-brief-history-of-humankind_996/index.html, Price:Â£54.23
6. The Requiem Red  catalogue/the-requiem-red_995/index.html, Price:Â£22.65
7. The Dirty Little Secrets of Getting Your Dream Job  catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html, Price:Â£33.34
8. The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull  catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html, Price:Â£17.93
9. The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics  catalogue

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# The URL of a simple website to scrape
url = "http://quotes.toscrape.com/js/"
book_titles = []
book_prices = []
book_categories = []
current_url="http://quotes.toscrape.com/js/"
count =0
max_page=5

# 1. Fetch the web page content
try:
    while True:
        if count<= max_page:
            response = requests.get(current_url)
            response.raise_for_status()
        
            # 2. Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
           
            # 3. Extract the data
            # We want to get the title of every book on the page.
            # By inspecting the website, we find that book titles are in <h3> tags
            # which are inside <article class="product_pod"> tags.
        
           
            a=soup.find_all('div', class_='quote')
            print(a)

            break
                # Inside each article, find the 'h3' tag, then the 'a' tag, and get its title attribute
                # print(quote)
                # title = quote.span.text
                # print(title)
                # break
    #             href = book.h3.a['href']
    #             new_title = title + '  ' + href
    #             book_titles.append(new_title)
               
    #             price = book.find('p', class_="price_color").text
               
    #             book_prices.append(price)
               
    #         next_button =soup.find('li',class_='next') 
            
    #         if next_button:
    #             temp_url=next_button.a['href']
    #             if "" in temp_url:
    #                 current_url=url+temp_url
    #                 count +=1
    #                 current_url=urljoin(current_url,next_page)
    #             else:
    #                 current_url =url + 'catalogue/' + temp_url
    #                 count+=1
    #     else:
    #         break
            
    # print("--- Found Book Titles ---")
    # for i, (title, price) in enumerate(zip(book_titles,book_prices),start=1): # Print first 5
    #     print(f"{i}. {title}, Price:{price}")

   

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

[]


In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# The URL of a simple website to scrape
url = "http://quotes.toscrape.com/js/"
book_titles = []
book_prices = []
book_categories = []
current_url="http://quotes.toscrape.com/js/"
count =0
max_page=5

# 1. Fetch the web page content
try:
    while True:
        if count<= max_page:
            response = requests.get(current_url)
            response.raise_for_status()
        
            # 2. Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
           
            # 3. Extract the data
            # We want to get the title of every book on the page.
            # By inspecting the website, we find that book titles are in <h3> tags
            # which are inside <article class="product_pod"> tags.
        
           
            for quote in soup.find_all('div', class_='quote'):
                # Inside each article, find the 'h3' tag, then the 'a' tag, and get its title attribute
                print(quote)
                title = quote.span.text
                print(title)
                break
    #             href = book.h3.a['href']
    #             new_title = title + '  ' + href
    #             book_titles.append(new_title)
               
    #             price = book.find('p', class_="price_color").text
               
    #             book_prices.append(price)
               
    #         next_button =soup.find('li',class_='next') 
            
    #         if next_button:
    #             temp_url=next_button.a['href']
    #             if "" in temp_url:
    #                 current_url=url+temp_url
    #                 count +=1
    #                 current_url=urljoin(current_url,next_page)
    #             else:
    #                 current_url =url + 'catalogue/' + temp_url
    #                 count+=1
    #     else:
    #         break
            
    # print("--- Found Book Titles ---")
    # for i, (title, price) in enumerate(zip(book_titles,book_prices),start=1): # Print first 5
    #     print(f"{i}. {title}, Price:{price}")

   

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

In [2]:
!pip install selenium

Collecting selenium
  Downloading selenium-4.40.0-py3-none-any.whl (9.6 MB)
     ---------------------------------------- 9.6/9.6 MB 12.0 MB/s eta 0:00:00
Collecting certifi>=2026.1.4
  Downloading certifi-2026.1.4-py3-none-any.whl (152 kB)
     ---------------------------------------- 152.9/152.9 kB ? eta 0:00:00
Collecting trio<1.0,>=0.31.0
  Downloading trio-0.32.0-py3-none-any.whl (512 kB)
     ------------------------------------- 512.0/512.0 kB 16.2 MB/s eta 0:00:00
Collecting trio-websocket<1.0,>=0.12.2
  Downloading trio_websocket-0.12.2-py3-none-any.whl (21 kB)
Collecting trio-typing>=0.10.0
  Downloading trio_typing-0.10.0-py3-none-any.whl (42 kB)
     ---------------------------------------- 42.2/42.2 kB 2.1 MB/s eta 0:00:00
Collecting types-certifi>=2021.10.8.3
  Downloading types_certifi-2021.10.8.3-py3-none-any.whl (2.1 kB)
Collecting types-urllib3>=1.26.25.14
  Downloading types_urllib3-1.26.25.14-py3-none-any.whl (15 kB)
Collecting urllib3[socks]<3.0,>=2.6.3
  Downloadi


[notice] A new release of pip available: 22.3.1 -> 26.0
[notice] To update, run: python.exe -m pip install --upgrade pip


* Handling Pagination: Scraping data across multiple pages.
* The Challenge of Dynamic Content: When requests isn't enough.
* Introduction to Selenium: Automating a real web browser.
* Hands-On with Selenium: Scraping a JavaScript-powered website.
* Best Practices: Error handling, waits, and putting it all together.

* Introduction to Selenium
* Selenium is a tool that automates a real web browser. It's like a robot sitting at your computer, opening Chrome or Firefox, and interacting with pages just like a human would.
* Because it uses a real browser, the browser will execute all the JavaScript, and Selenium can then access the final, rendered HTML.
* Setup
* You need two things:
* The selenium Python library.
* A WebDriver, which is a separate program that Selenium uses to control a specific browser. The most common is ChromeDriver for Google Chrome.
* Installation:
* !pip install selenium
* Download ChromeDriver: https://googlechromelabs.github.io/chrome-for-driver/. Make sure its version matches your installed Chrome browser version. Unzip it and place the chromedriver.exe (or chromedriver on Mac/Linux) in a known location or in the same folder as your notebook.

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

driver = webdriver.Chrome()

try:
    dynamic_url = "http://quotes.toscrape.com/js/"
    driver.get(dynamic_url)

    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "container")))

    print("Page loaded and quotes are present.")

    soup = BeautifulSoup(driver.page_source, "lxml")
    quotes = soup.find_all("div", class_="quote")

    records = []   # list of dictionaries

    for quote in quotes:
        tags = []   # initialize empty list first

        text = quote.find("span", class_="text").text.strip()
        author = quote.find("small", class_="author").text.strip()

        tag_elements = quote.find_all("a", class_="tag")
        for t in tag_elements:
            tags.append(t.text)

        record = {
            "text": text,
            "author": author,
            "tags": tags   # keep as list (pandas can store it)
        }

        records.append(record)

    print(f"Collected {len(records)} quotes")

    # Convert to DataFrame
    df = pd.DataFrame(records)

    # optional: join tags list into string for CSV readability
    df["tags"] = df["tags"].apply(lambda x: ", ".join(x))

    # Save CSV
    df.to_csv("quotes.csv", index=False, encoding="utf-8")

    print("CSV created using DataFrame → quotes.csv")

finally:
    driver.quit()
    print("Browser closed.")

Page loaded and quotes are present.
Collected 10 quotes
CSV created using DataFrame → quotes.csv
Browser closed.


In [2]:
df

Unnamed: 0,text,author,tags
0,“The world as we have created it is a process ...,Albert Einstein,"change, deep-thoughts, thinking, world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"abilities, choices"
2,“There are only two ways to live your life. On...,Albert Einstein,"inspirational, life, live, miracle, miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"aliteracy, books, classic, humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"be-yourself, inspirational"
5,“Try not to become a man of success. Rather be...,Albert Einstein,"adulthood, success, value"
6,“It is better to be hated for what you are tha...,André Gide,"life, love"
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison,"edison, failure, inspirational, paraphrased"
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt,misattributed-eleanor-roosevelt
9,"“A day without sunshine is like, you know, nig...",Steve Martin,"humor, obvious, simile"


In [3]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

# ------------------ SETUP ------------------
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

records = []

try:
    dynamic_url = "http://quotes.toscrape.com/js/"
    driver.get(dynamic_url)

    # wait until quotes load first time
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))
    print("Page loaded.")

    # ------------------ MULTI-PAGE SCRAPE ------------------
    while True:
        soup = BeautifulSoup(driver.page_source, "lxml")
        quotes = soup.find_all("div", class_="quote")

        print(f"Scraping page — {len(quotes)} quotes found")

        for quote in quotes:
            tags = []   # initialize empty list first (as required)

            text = quote.find("span", class_="text").text.strip()
            author = quote.find("small", class_="author").text.strip()

            tag_elements = quote.find_all("a", class_="tag")
            for t in tag_elements:
                tags.append(t.text)

            records.append({
                "text": text,
                "author": author,
                "tags": tags
            })

        # ---------- click NEXT if exists ----------
        try:
            next_btn = wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "li.next a"))
            )
            driver.execute_script("arguments[0].click();", next_btn)

            # wait for next page quotes to reload
            wait.until(EC.staleness_of(quotes[0]))

        except:
            print("No more pages.")
            break

    # ------------------ DATAFRAME + CSV ------------------
    df = pd.DataFrame(records)

    # convert tag list → readable string for CSV
    df["tags"] = df["tags"].apply(lambda x: ",".join(x))

    df.to_csv("quotes_all_pages.csv", index=False, encoding="utf-8")

    print(f"\nSaved {len(df)} total quotes to CSV")

# ------------------ CLEANUP ------------------
finally:
    driver.quit()
    print("Browser closed.")

Page loaded.
Scraping page — 10 quotes found
No more pages.

Saved 10 total quotes to CSV
Browser closed.


In [4]:
df

Unnamed: 0,text,author,tags
0,“The world as we have created it is a process ...,Albert Einstein,"change,deep-thoughts,thinking,world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"abilities,choices"
2,“There are only two ways to live your life. On...,Albert Einstein,"inspirational,life,live,miracle,miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"aliteracy,books,classic,humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"be-yourself,inspirational"
5,“Try not to become a man of success. Rather be...,Albert Einstein,"adulthood,success,value"
6,“It is better to be hated for what you are tha...,André Gide,"life,love"
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison,"edison,failure,inspirational,paraphrased"
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt,misattributed-eleanor-roosevelt
9,"“A day without sunshine is like, you know, nig...",Steve Martin,"humor,obvious,simile"


In [7]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

records = []

try:
    base_url = "http://quotes.toscrape.com/js/page/{}/"
    page = 1

    while True:
        url = base_url.format(page)
        driver.get(url)

        # wait for quotes to load
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))

        soup = BeautifulSoup(driver.page_source, "lxml")
        quotes = soup.find_all("div", class_="quote")

        # stop if no quotes found
        if not quotes:
            print("No more pages.")
            break

        print(f"Scraping page {page} — {len(quotes)} quotes")

        for quote in quotes:
            tags = []

            text = quote.find("span", class_="text").text.strip()
            author = quote.find("small", class_="author").text.strip()

            tag_elements = quote.find_all("a", class_="tag")
            for t in tag_elements:
                tags.append(t.text)

            records.append({
                "text": text,
                "author": author,
                "tags": tags
            })

        page += 1   # go to next page number

    # -------- DataFrame → CSV --------
    df = pd.DataFrame(records)
    df["tags"] = df["tags"].apply(lambda x: ",".join(x))
    df.to_csv("quotes_all_pages.csv", index=False)

    print(f"\nSaved {len(df)} quotes to CSV")

finally:
    driver.quit()
    print("Browser closed.")

Scraping page 1 — 10 quotes
Scraping page 2 — 10 quotes
Scraping page 3 — 10 quotes
Scraping page 4 — 10 quotes
Scraping page 5 — 10 quotes
Scraping page 6 — 10 quotes
Scraping page 7 — 10 quotes
Scraping page 8 — 10 quotes
Scraping page 9 — 10 quotes
Scraping page 10 — 10 quotes
Browser closed.


TimeoutException: Message: 
Stacktrace:
Symbols not available. Dumping unresolved backtrace:
	0x7ff7b34ef3d5
	0x7ff7b34ef430
	0x7ff7b32910bd
	0x7ff7b32ec2fe
	0x7ff7b32ec60c
	0x7ff7b333d187
	0x7ff7b3339d08
	0x7ff7b32dcb0c
	0x7ff7b32dda53
	0x7ff7b37cb470
	0x7ff7b37c586d
	0x7ff7b37e621a
	0x7ff7b350b235
	0x7ff7b3513a5c
	0x7ff7b34f8844
	0x7ff7b34f89f6
	0x7ff7b34deb87
	0x7ff9f0bfe8d7
	0x7ff9f1c6c40c


In [8]:
df

Unnamed: 0,text,author,tags
0,“The world as we have created it is a process ...,Albert Einstein,"change,deep-thoughts,thinking,world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"abilities,choices"
2,“There are only two ways to live your life. On...,Albert Einstein,"inspirational,life,live,miracle,miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"aliteracy,books,classic,humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"be-yourself,inspirational"
5,“Try not to become a man of success. Rather be...,Albert Einstein,"adulthood,success,value"
6,“It is better to be hated for what you are tha...,André Gide,"life,love"
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison,"edison,failure,inspirational,paraphrased"
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt,misattributed-eleanor-roosevelt
9,"“A day without sunshine is like, you know, nig...",Steve Martin,"humor,obvious,simile"


In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from urllib.parse import urljoin

url = "http://quotes.toscrape.com/js/"
current_url = url

quotes_list = []
authors_list = []

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

try:
    while True:
        # 1. Fetch the web page (Selenium instead of requests)
        driver.get(current_url)

        # Wait for quotes to load (because JS)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))

        soup = BeautifulSoup(driver.page_source, "html.parser")

        for quote in soup.find_all("div", class_="quote"):
            text = quote.find("span", class_="text").text
            author = quote.find("small", class_="author").text

            quotes_list.append(text)
            authors_list.append(author)

        next_button = soup.find("li", class_="next")

        if next_button:
            href = next_button.a["href"]
            current_url = urljoin(current_url, href)
            time.sleep(1)
        else:
            break

    print("--- Found Quotes ---")
    for i, (quote, author) in enumerate(zip(quotes_list, authors_list), start=1):
        print(f"{i}. {quote} — {author}")

finally:
    driver.quit()
    print("\nBrowser closed.")

--- Found Quotes ---
1. “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” — Albert Einstein
2. “It is our choices, Harry, that show what we truly are, far more than our abilities.” — J.K. Rowling
3. “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” — Albert Einstein
4. “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” — Jane Austen
5. “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” — Marilyn Monroe
6. “Try not to become a man of success. Rather become a man of value.” — Albert Einstein
7. “It is better to be hated for what you are than to be loved for what you are not.” — André Gide
8. “I have not failed. I've just found 10,000 ways that won't work.” — Thomas A. Edison
9. “A woman is like a tea bag; you never know how stron

In [None]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from urllib.parse import urljoin

url = "https://news.ycombinator.com/"
current_url = url

quotes_list = []
authors_list = []

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

try:
    while True:
        # 1. Fetch the web page (Selenium instead of requests)
        driver.get(current_url)

        # Wait for quotes to load (because JS)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))

        soup = BeautifulSoup(driver.page_source, "html.parser")

        for quote in soup.find_all("div", class_="quote"):
            text = quote.find("span", class_="text").text
            author = quote.find("small", class_="author").text

            quotes_list.append(text)
            authors_list.append(author)

        next_button = soup.find("li", class_="next")

        if next_button:
            href = next_button.a["href"]
            current_url = urljoin(current_url, href)
            time.sleep(1)
        else:
            break

    print("--- Found Quotes ---")
    for i, (quote, author) in enumerate(zip(quotes_list, authors_list), start=1):
        print(f"{i}. {quote} — {author}")

finally:
    driver.quit()
    print("\nBrowser closed.")

In [7]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from urllib.parse import urljoin
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("user-agent=Mozilla/5.0")


driver = webdriver.Chrome( options=chrome_options)
base_url ="https://news.ycombinator.com/"
all_news_data = []
page_count = 0
max_pages = 5
current_page_url="https://news.ycombinator.com/"

while current_page_url and page_count<=max_pages:
    print("page_count :", page_count)
    page_count+=1
   
    driver.get(current_page_url)
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "athing")))

    page_source = driver.page_source
    # print(page_source)
    soup = BeautifulSoup(page_source,'lxml')

    news= driver.find_elements(By.CLASS_NAME,"athing")
    for q in news:
        text = q.find_element(By.CLASS_NAME,"titleline").text
        all_news_data.append(text)
        
    more_button =soup.find('a',class_ ='morelink')
    if more_button:
        next_page_relative_url =more_button['href']
        current_page_url =urljoin(base_url,next_page_relative_url)
        time.sleep(1)
    else:
        current_page_url = None
        driver.quit()
for i in all_news_data:
    print(i)
print("\nBrowser closed.")

page_count : 0
page_count : 1
page_count : 2
page_count : 3
page_count : 4
page_count : 5
Vouch (github.com/mitchellh)
Art of Roads in Games (sandboxspirit.com)
LispE: Lisp Interpreter with Pattern Programming and Lazy Evaluation (github.com/naver)
Claude’s C Compiler vs. GCC (harshanu.space)
Nobody knows how the whole system works (surfingcomplexity.blog)
TSMC to make advanced AI semiconductors in Japan (apnews.com)
Show HN: A custom font that displays Cistercian numerals using ligatures (bobbiec.github.io)
Every book recommended on the Odd Lots Discord (odd-lots-books.netlify.app)
Apple XNU: Clutch Scheduler (github.com/apple-oss-distributions)
Custom Firmware for the MZ-RH1 – Ready for Testing (sir68k.re)
Reverse Engineering the Prom for the SGI O2 (mattst88.com)
Ask HN: What are you working on? (February 2026)
Quartz crystals (pa3fwm.nl)
More Mac malware from Google search (eclecticlight.co)
Show HN: I created a Mars colony RPG based on Kim Stanley Robinson’s Mars books (underhillg

In [23]:
import requests
from bs4 import BeautifulSoup

# The URL of a simple website to scrape
url = "https://techaxis.com.np/course"

# 1. Fetch the web page content
try:
    response = requests.get(url)
    response.raise_for_status()

    # 2. Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
   
    # 3. Extract the data
    # We want to get the title of every book on the page.
    # By inspecting the website, we find that book titles are in <h3> tags
    # which are inside <article class="product_pod"> tags.

   
    course_titles = []
    
    
    courses =soup.find_all('h3',class_='course__card-title')
    # print(courses)
    for course in courses:
        course_name=course.text
        print(course_name)
        # print(course)
    # new = courses.find('ul')
    # # print(new.find_all('li')
    # for category in new.find_all('li'):
    #     name=category.a.text.strip()
    #     # print(len(name))
    #     # print(name)
        
    #     # print(new)
    #     book_categories.append(name)
       
    # print("--- Courses List---")
    # for i, (courses) in enumerate(course_titles,start=1): # Print first 5
    #     print(f"{i}. {name}")

   

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

Data Analysis Training Using Python & Excel in Nepal
Full Stack Web Development Training 
Adobe InDesign Training Course in Nepal
Video Editing Training Course in Nepal
Prompt Engineering for Video Content Creation Training in Nepal
Prompt Engineering for Web Developers Training Course in Nepal
Prompt Engineering Training Course For Developers in Nepal
Motion Graphics Training Course in Nepal
Adobe Illustrator Training Course in Nepal
Social Media Marketing Training Course in Nepal
Deep Learning With Python Training in Nepal
Data Analysis Training in Nepal
AI with Python Training in Nepal
Master Machine Learning with Python Training in Nepal
Prompt Engineering Training Course in Nepal
ChatGPT Prompt Engineering Course in Nepal
Gemini Prompt Engineering Course in Nepal
Unity Game Design & Development
Google Ads Training Course in Nepal
Google Analytics Training Course in Nepal
Google Tag Manager Training in Nepal
Data Analytics Training in Nepal
Photoshop Training Course in Nepal
GIS wi

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://techaxis.com.np/course"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

try:
    driver.get(url)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "course__card")))

    soup = BeautifulSoup(driver.page_source, "html.parser")

    courses = soup.find_all("div", class_="course__card")

    print("--- Found Courses ---")

    for i, course in enumerate(courses, start=1):

        title_tag = course.find("h3", class_="course__card-title")
        detail_tag = course.find("span", class_="course__card-detail-text")

        title = title_tag.text.strip()
        detail = detail_tag.text.strip() if detail_tag else "Details not available"

        print(f"{i}. {title}")
        print(f"   Details: {detail}\n")

finally:
    driver.quit()

--- Found Courses ---
1. Data Analysis Training Using Python & Excel in Nepal
   Details: Details not available

2. Full Stack Web Development Training
   Details: Details not available

3. Adobe InDesign Training Course in Nepal
   Details: Details not available

4. Video Editing Training Course in Nepal
   Details: Details not available

5. Prompt Engineering for Video Content Creation Training in Nepal
   Details: Details not available

6. Prompt Engineering for Web Developers Training Course in Nepal
   Details: Details not available

7. Prompt Engineering Training Course For Developers in Nepal
   Details: Details not available

8. Motion Graphics Training Course in Nepal
   Details: Details not available

9. Adobe Illustrator Training Course in Nepal
   Details: Details not available

10. Social Media Marketing Training Course in Nepal
   Details: Details not available

11. Deep Learning With Python Training in Nepal
   Details: 1-1.5 Months

12. Data Analysis Training in Nepal
 