<a href="https://colab.research.google.com/github/AyeshaIjazTabassum/PythonAIBootcamp/blob/main/Day5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Session**


# 🌐 Web Scraping with BeautifulSoup

This notebook will guide you through:
- Understanding HTTP requests and HTML structure
- Learning BeautifulSoup basics
- Scraping a simple website
- Saving scraped data to CSV and JSON


## 📚 Theory: HTTP Requests & HTML Structure

**1. HTTP Requests:**  
- When you open a website, your browser sends an HTTP request (usually `GET`) to the server.  
- The server replies with a status code, headers, and a body; for a web page, the body is HTML.

**2. HTML Structure:**  
- Websites are built with HTML tags like `<h1>`, `<p>`, `<div>`, `<span>`.  
- Elements can contain text or other elements (nested).  

**3. BeautifulSoup Basics:**  
- BeautifulSoup helps us parse HTML and extract data.  
- Steps:
  1. Send HTTP request to get HTML (`requests.get(url)`)
  2. Parse HTML (`BeautifulSoup(html, 'html.parser')`)
  3. Select elements (`soup.find()`, `soup.find_all()`)
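
Steps 2 and 3 can be tried without a network connection. The sketch below parses a small hand-written HTML string (made up for illustration, not taken from any real site) and shows how `find()` and `find_all()` differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, written by hand for this example
sample_html = """
<div class="card">
  <h1>Fruit list</h1>
  <p class="item">Apple</p>
  <p class="item">Banana</p>
  <p class="note">Prices change daily.</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element
heading = sample_soup.find("h1").text

# find_all() returns a list with every matching element
items = [p.text for p in sample_soup.find_all("p", class_="item")]

# When nothing matches, find() returns None instead of raising an error
missing = sample_soup.find("table")

print(heading)
print(items)
print(missing)
```

Because `find()` can return `None`, it is worth checking the result before reading `.text` from it on pages you do not control.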


In [None]:
# Install required libraries
!pip install requests beautifulsoup4



In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import csv
import json

## 🖥️ Practical Demo: Scraping Quotes

**Website:** [Quotes to Scrape](http://quotes.toscrape.com)  

**Steps:**
1. Send HTTP GET request to fetch the page
2. Parse HTML using BeautifulSoup
3. Extract all quotes
4. Display quotes
5. Save to CSV & JSON


In [None]:
# Step 1: Define URL
url = "http://quotes.toscrape.com"

# Step 2: Send HTTP GET request (raise an error on a 4xx/5xx response)
response = requests.get(url, timeout=10)
response.raise_for_status()

# Step 3: Parse HTML content
soup = BeautifulSoup(response.text, 'html.parser')

print("HTML content fetched and parsed successfully!")


HTML content fetched and parsed successfully!


In [None]:
# Step 4: Extract all quotes
quotes = soup.find_all('span', class_='text')

# Step 5: Display quotes
print("Quotes found on the page:")
for q in quotes:
    print(q.text)

Quotes found on the page:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


In [None]:
# Step 6: Save quotes to CSV (UTF-8 keeps the curly quotation marks intact)
with open('quotes.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Quote"])  # Header
    for q in quotes:
        writer.writerow([q.text])

print("Quotes saved to quotes.csv")


Quotes saved to quotes.csv


In [None]:
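# Step 7: Save quotes to JSON
data = [q.text for q in quotes]
with open('quotes.json', 'w', encoding='utf-8') as file:
    # ensure_ascii=False writes the curly quotes as-is instead of \u escapes
    json.dump(data, file, ensure_ascii=False, indent=2)

print("Quotes saved to quotes.json")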

Quotes saved to quotes.json
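
A quick way to confirm that saved files are usable is to load them back and compare them with the data in memory. The sketch below does this round trip with placeholder strings (not the scraped quotes) and writes to a temporary folder, so it does not touch `quotes.csv` or `quotes.json`.

```python
import csv
import json
import tempfile
from pathlib import Path

# Placeholder rows standing in for the scraped quotes
sample_quotes = ["“First sample quote.”", "“Second sample quote.”"]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample_quotes.csv"
    json_path = Path(tmp) / "sample_quotes.json"

    # Write both files with an explicit encoding so curly quotes survive
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Quote"])
        writer.writerows([q] for q in sample_quotes)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(sample_quotes, f, ensure_ascii=False, indent=2)

    # Read them back: DictReader uses the header row as dictionary keys
    with open(csv_path, newline="", encoding="utf-8") as f:
        csv_rows = [row["Quote"] for row in csv.DictReader(f)]
    with open(json_path, encoding="utf-8") as f:
        json_rows = json.load(f)

print(csv_rows == sample_quotes, json_rows == sample_quotes)
```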


In [None]:
from google.colab import files

# Download CSV & JSON
files.download("quotes.csv")
files.download("quotes.json")


# **Self Practice Questions**

1. Status Code Check

In [14]:
import requests

r = requests.get("https://quotes.toscrape.com/")
print(r.status_code == 200)

True
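
Status codes other than 200 also carry useful information. As a rough sketch, the hypothetical helper `describe_status` below (not part of `requests`) uses the standard library's `http.HTTPStatus` to turn a numeric code into a readable label and its family.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    families = {
        1: "informational",
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    return f"{code} {phrase} ({families[code // 100]})"

for code in (200, 301, 404, 503):
    print(describe_status(code))
```

In a scraper, a call such as `describe_status(r.status_code)` could be logged before deciding whether to parse the response.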


2. Extract Page Title

In [3]:
from bs4 import BeautifulSoup
import requests

html = requests.get("https://quotes.toscrape.com/").text
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)

Quotes to Scrape


3. Extract All Quotes

In [5]:
quotes = soup.find_all("span", class_="text")
[q.text for q in quotes]

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

4. Extract Authors Only

In [4]:
authors = soup.find_all("small", class_="author")
[a.text for a in authors]

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']
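
Two separate lists of quotes and authors only line up if every quote has exactly one author tag. A more robust pattern is to loop over each quote's container and search inside it. The sketch below assumes each quote sits in a `<div class="quote">` wrapper. That wrapper name is an assumption made here, and the markup is hand-written, so inspect the real page before relying on it.

```python
from bs4 import BeautifulSoup

# Hand-written markup that mimics the assumed page structure
sample_html = """
<div class="quote">
  <span class="text">“Sample quote one.”</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">“Sample quote two.”</span>
  <small class="author">Author B</small>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for block in sample_soup.find_all("div", class_="quote"):
    # Searching inside each block keeps a quote matched with its own author
    text = block.find("span", class_="text").get_text(strip=True)
    author = block.find("small", class_="author").get_text(strip=True)
    pairs.append({"quote": text, "author": author})

print(pairs)
```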

5. Count Number of Quotes

In [6]:
len(quotes)

10

6. Extract Links (href)

In [7]:
links = soup.find_all("a")
[link.get("href") for link in links]

['/',
 '/login',
 '/author/Albert-Einstein',
 '/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/author/J-K-Rowling',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/author/Albert-Einstein',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/author/Jane-Austen',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/author/Marilyn-Monroe',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/author/Albert-Einstein',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/author/Andre-Gide',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/author/Thomas-A-Edison',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/author/Eleanor-Roosevelt',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/author/Steve-Martin',
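
The hrefs above are relative paths, so they cannot be requested directly. A minimal sketch of turning them into full URLs with the standard library's `urljoin` follows. The list is a hand-picked subset of the output, and the `None` entry stands in for an `<a>` tag that has no `href` attribute.

```python
from urllib.parse import urljoin

base_url = "https://quotes.toscrape.com/"

# A few relative hrefs from the output, plus None for a link without href
raw_hrefs = ["/", "/login", "/author/Albert-Einstein", "/tag/change/page/1/", None]

# Skip missing hrefs, then resolve each path against the base URL
absolute_links = [urljoin(base_url, href) for href in raw_hrefs if href]

for link in absolute_links:
    print(link)
```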
 

7. First Book Title (Books Site)

In [8]:
response = requests.get("https://books.toscrape.com/")
# Without this line the prices print as 'Â£51.77': requests appears to fall
# back to ISO-8859-1 for this page, although the page itself is UTF-8
response.encoding = "utf-8"
html = response.text
soup = BeautifulSoup(html, "html.parser")

soup.find("h3").a["title"]

'A Light in the Attic'

8. Prices of All Books with Title

In [15]:
books = soup.find_all("article", class_="product_pod")

data = [(book.h3.a["title"], book.find("p", class_="price_color").text)
        for book in books]
data

[('A Light in the Attic', '£51.77'),
 ('Tipping the Velvet', '£53.74'),
 ('Soumission', '£50.10'),
 ('Sharp Objects', '£47.82'),
 ('Sapiens: A Brief History of Humankind', '£54.23'),
 ('The Requiem Red', '£22.65'),
 ('The Dirty Little Secrets of Getting Your Dream Job', '£33.34'),
 ('The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  '£17.93'),
 ('The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  '£22.60'),
 ('The Black Maria', '£52.15'),
 ('Starving Hearts (Triangular Trade Trilogy, #1)', '£13.99'),
 ("Shakespeare's Sonnets", '£20.66'),
 ('Set Me Free', '£17.46'),
 ("Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '£52.29'),
 ('Rip it Up and Start Again', '£35.02'),
 ('Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
  '£57.25'),
 ('Olio', '£23.88'),
 ('Mesaerion: The Best Science Fiction Stories 1800

9. Save the data in csv file

In [18]:
import csv

# Each <h3> holds one book link; each price sits in <p class="price_color">
titles = soup.find_all("h3")
prices = soup.find_all("p", class_="price_color")

with open("books.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price"])

    # zip() pairs each title with the price at the same position
    for title_tag, price_tag in zip(titles, prices):
        writer.writerow([title_tag.a["title"], price_tag.text])

print("CSV file saved successfully!")

CSV file saved successfully!
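
The prices are saved as text, so they cannot be summed or sorted numerically yet. One possible approach, sketched below with a hypothetical helper `parse_price`, pulls the numeric part out of each string and converts it to `Decimal`. The sample values are the first three prices from the listing, and the regular expression ignores whatever symbol or stray character precedes the digits.

```python
import re
from decimal import Decimal

def parse_price(text: str) -> Decimal:
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())

# The last entry carries a stray character before the pound sign
sample_prices = ["£51.77", "£53.74", "Â£50.10"]

values = [parse_price(p) for p in sample_prices]
total = sum(values)

print(values)
print(total)
```

`Decimal` is used rather than `float` so that currency amounts keep their exact two-decimal values when added.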


10. Send Request with Headers

In [10]:
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get("https://quotes.toscrape.com/", headers=headers)
print(r.status_code)

200
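
When several requests share the same headers, a `requests.Session` can hold them in one place. The sketch below builds a request without sending it, so the final headers can be inspected offline. The User-Agent string is a made-up placeholder, not a value any site requires.

```python
import requests

session = requests.Session()

# Hypothetical identifying string; replace it with your own project name
session.headers.update({"User-Agent": "quotes-practice-bot/0.1 (learning project)"})

# Build the request but do not send it yet
request = requests.Request("GET", "https://quotes.toscrape.com/")
prepared = session.prepare_request(request)

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling `session.send(prepared, timeout=10)` would then perform the actual request with those headers.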


11. Check Robots.txt

In [11]:
requests.get("https://quotes.toscrape.com/robots.txt").text

'<!doctype html>\n<html lang=en>\n<title>404 Not Found</title>\n<h1>Not Found</h1>\n<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>\n'
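
The response above is a 404 page, which suggests this site publishes no robots.txt; crawlers conventionally treat that as "no restrictions declared". For sites that do publish one, the standard library's `urllib.robotparser` can interpret the rules. The sketch below feeds it a made-up rule set (the rules and the `example.com` URLs are assumptions for illustration) instead of downloading a file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(user_agent, url) reports whether the rules allow the URL
allowed_page = parser.can_fetch("*", "https://example.com/page/1/")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

print(allowed_page)
print(allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the file in one step.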

12. Web scraping using lxml for faster parsing

In [12]:
from lxml import html
import requests

tree = html.fromstring(requests.get("https://quotes.toscrape.com/").content)
tree.xpath('//span[@class="text"]/text()')

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']