Web scraping is the process of automatically extracting information from websites.
Instead of a human manually copy-pasting data, a script (or "bot") fetches the web page and pulls out the required information.

#### Why is it useful?
* **Market research** (e.g., product prices, reviews)
* **Lead generation:** finding potential customers ("leads") who may be interested in your product or service (e.g., contact info from directories)
* **Data analysis** (e.g., gathering sports statistics, stock prices)

#### The Golden Rule: Scraping Responsibly
This is the most important part of web scraping: always be a good web citizen.

* Check robots.txt: Most websites have a file at www.example.com/robots.txt that tells bots which pages they are allowed/disallowed to visit. Always respect this.

* Read the Terms of Service (ToS): The website's ToS may explicitly forbid scraping.

* Don't Overwhelm the Server: Send requests at a reasonable rate. A human clicks a link every few seconds; your script should do the same. Use time.sleep() between requests.

* Scrape Only Public Data: Only scrape data that is publicly visible and not behind a login wall, unless you have explicit permission.
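
The `robots.txt` check can be automated with Python's standard library. The sketch below is a minimal illustration, not a complete politeness policy: the robots.txt content, the `MyScraper` user agent, and the example URLs are all made up so that it runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "MyScraper"  # assumed name for this illustration

parser = RobotFileParser()
# parse() accepts the file's lines; for a live site you would instead call
# parser.set_url("http://www.example.com/robots.txt") followed by parser.read().
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

allowed = parser.can_fetch(USER_AGENT, "http://www.example.com/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "http://www.example.com/private/data.html")
delay = parser.crawl_delay(USER_AGENT)  # None when the file sets no Crawl-delay

print(f"catalogue allowed: {allowed}")
print(f"private allowed:   {blocked}")
print(f"crawl delay:       {delay}")
```

In a real scraper, the value returned by `crawl_delay()` (when present) is a sensible argument for `time.sleep()` between requests.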

Scraping follows a simple process:
* `Request`: Send an HTTP request to the website's server to get the page's content.
* `Parse`: Interpret the raw HTML content into a structured format.
* `Extract`: Find and pull the specific data you need from the parsed structure.
* `Store`: Save the extracted data in a useful format (like CSV or a database).
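
The four steps above can be sketched end to end on a tiny, made-up HTML snippet. Everything here is an assumption for illustration: the HTML, the class names `item`, `name`, and `price`, and the use of an in-memory buffer instead of a real file. The Request step is replaced by a hard-coded string so the example runs offline.

```python
import csv
import io

from bs4 import BeautifulSoup

# 1. Request: a hard-coded string stands in for response content.
html = """
<ul>
  <li class="item"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="item"><span class="name">Gadget</span><span class="price">14.50</span></li>
</ul>
"""

# 2. Parse: turn the raw HTML into a navigable tree.
soup = BeautifulSoup(html, "html.parser")

# 3. Extract: pull one dictionary per list item.
rows = [
    {
        "name": li.find("span", class_="name").text,
        "price": li.find("span", class_="price").text,
    }
    for li in soup.find_all("li", class_="item")
]

# 4. Store: write the rows as CSV (to an in-memory buffer here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```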

#### Our Tools:
* `requests`: A fantastic Python library for making HTTP requests.
* `beautifulsoup4`: The master tool for parsing messy HTML and XML.
* `lxml`: A very fast HTML/XML parser that BeautifulSoup can use as its backend when you ask for it, as in `BeautifulSoup(html, 'lxml')`.
* `pandas`: The go-to library for data analysis and manipulation, perfect for storing our data.

In [23]:

import requests
import time
from bs4 import BeautifulSoup
import pandas as pd
print("Libraries imported successfully")

Libraries imported successfully


In [24]:
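URL = "http://books.toscrape.com/"
# Identify the scraper honestly so site owners can see who is making requests
headers = {
    "User-Agent": "My Web Scraper 1.0 - for educational purposes"
}
# A timeout keeps the script from hanging indefinitely if the server never answers
response = requests.get(URL, headers=headers, timeout=10)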


In [25]:
response

<Response [200]>

In [26]:
if response.status_code == 200:
    print("Success! Request was successful.")
    print("\nFirst 500 characters of the page:")
    print(response.text[:500])
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Success! Request was successful.

First 500 characters of the page:
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" /
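
Besides comparing `status_code` by hand, `requests` offers `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx replies. The sketch below is only an illustration: the `describe` helper is a hypothetical name, and the `Response` objects are built by hand so the example runs without network access.

```python
import requests


def describe(response: requests.Response) -> str:
    """Return "ok" for a successful reply, or a short failure message."""
    try:
        response.raise_for_status()  # raises HTTPError on 4xx/5xx status codes
    except requests.HTTPError as exc:
        return f"failed: {exc}"
    return "ok"


# Hand-built responses stand in for real server replies (offline assumption).
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

ok_result = describe(ok_response)
missing_result = describe(missing_response)
print(ok_result)
print(missing_result)
```

Raising an exception is often handier than an `if`/`else` inside loops, because a failed page cannot be silently parsed as if it were valid content.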


In [40]:
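# Pass the raw bytes so the parser can read the UTF-8 charset declared in the page itself;
# response.text may be decoded with a fallback encoding, which turns "£" into "Â£"
soup = BeautifulSoup(response.content, 'lxml')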
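
A note on encodings: when a server does not declare a charset in its `Content-Type` header, `requests` can fall back to ISO-8859-1 when building `response.text` for text responses. UTF-8 bytes decoded that way produce the familiar `Â£` garbage in place of `£`. The snippet below is a small, offline illustration of that effect; the price string is just sample data.

```python
# UTF-8 encodes the pound sign as two bytes: 0xC2 0xA3.
raw_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong codec yields one character per byte.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the codec the page actually uses restores the original text.
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Two common remedies are to hand the parser `response.content` (bytes) so it can detect the encoding itself, or to set `response.encoding = "utf-8"` before reading `response.text`.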


In [41]:
print(soup)

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

In [42]:
print(soup.prettify())
page_title=soup.find('title')
print(f"Page Title: {page_title.text.strip()}")

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [43]:
main_header = soup.find('h1')
print(f"Main Header: {main_header.text}")

a_book_price = soup.find('p', class_='price_color')
print(f"Price of the first book: {a_book_price.text}")

Main Header: All products
Price of the first book: £51.77


In [44]:
all_h1_tags = soup.find_all('h1')
all_prices = soup.find_all('p',class_='price_color')

In [45]:
for h1_tag in all_h1_tags:
    print(h1_tag.text)

All products


In [46]:
for price in all_prices:
    print(price.text)

£51.77
£53.74
£50.10
£47.82
£54.23
£22.65
£33.34
£17.93
£22.60
£52.15
£13.99
£20.66
£17.46
£52.29
£35.02
£57.25
£23.88
£37.59
£51.33
£45.17
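
To make the difference between `find()` and `find_all()` concrete, here is a small offline sketch. The HTML is invented (it only imitates the class names used on the practice site), and `select()` is shown as an alternative that takes a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of a product listing.
SAMPLE_HTML = """
<h1>All products</h1>
<article class="product_pod">
  <h3><a title="Book A">Book A</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a title="Book B">Book B</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first_price = soup.find("p", class_="price_color")
missing = soup.find("p", class_="does-not-exist")

# find_all() returns a list-like collection of every match (possibly empty).
all_prices = soup.find_all("p", class_="price_color")

# select() does the same job using a CSS selector string.
css_prices = soup.select("article.product_pod p.price_color")

first_price_text = first_price.text
all_price_texts = [p.text for p in all_prices]
print(first_price_text)
print(all_price_texts)
print(missing)
```

Because `find()` can return `None`, accessing `.text` on its result without a check raises `AttributeError` when the element is absent.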


In [47]:
books=soup.find_all('article',class_='product_pod')
print(f"Found {len(books)} books on the page.")

Found 20 books on the page.


In [48]:
books_data=[]


In [49]:
for book in books:
    # The full title lives in the link's "title" attribute (the visible text can be shortened)
    title = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text
    # The rating is encoded as the second CSS class, e.g. class="star-rating Three"
    rating = book.find('p', class_='star-rating')['class'][1]
    # Read the visible availability text rather than the CSS class name
    stock = book.find('p', class_='instock').text.strip()

    book_info = {
        'Title': title,
        'Price': price,
        'Stock': stock,
        'Rating': f"{rating} out of Five"
    }

    books_data.append(book_info)


In [52]:
books_data

[{'Title': 'A Light in the Attic',
  'Price': '£51.77',
  'Stock': 'In stock',
  'Rating': 'Three out of Five'},
 {'Title': 'Tipping the Velvet',
  'Price': '£53.74',
  'Stock': 'In stock',
  'Rating': 'One out of Five'},
 {'Title': 'Soumission',
  'Price': '£50.10',
  'Stock': 'In stock',
  'Rating': 'One out of Five'},
 {'Title': 'Sharp Objects',
  'Price': '£47.82',
  'Stock': 'In stock',
  'Rating': 'Four out of Five'},
 {'Title': 'Sapiens: A Brief History of Humankind',
  'Price': '£54.23',
  'Stock': 'In stock',
  'Rating': 'Five out of Five'},
 {'Title': 'The Requiem Red',
  'Price': '£22.65',
  'Stock': 'In stock',
  'Rating': 'One out of Five'},
 {'Title': 'The Dirty Little Secrets of Getting Your Dream Job',
  'Price': '£33.34',
  'Stock': 'In stock',
  'Rating': 'Four out of Five'},
 {'Title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  'Price': '£17.93',
  'Stock': 'In stock',
  'Rating': 'Thr

In [53]:
for i in range(3):
    print(books_data[i])

{'Title': 'A Light in the Attic', 'Price': '£51.77', 'Stock': 'In stock', 'Rating': 'Three out of Five'}
{'Title': 'Tipping the Velvet', 'Price': '£53.74', 'Stock': 'In stock', 'Rating': 'One out of Five'}
{'Title': 'Soumission', 'Price': '£50.10', 'Stock': 'In stock', 'Rating': 'One out of Five'}


In [54]:
df = pd.DataFrame(books_data)

In [55]:
print("---Pandas DataFrame---")
display(df.head())

---Pandas DataFrame---


|   | Title | Price | Stock | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock | Three out of Five |
| 1 | Tipping the Velvet | £53.74 | In stock | One out of Five |
| 2 | Soumission | £50.10 | In stock | One out of Five |
| 3 | Sharp Objects | £47.82 | In stock | Four out of Five |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock | Five out of Five |


In [56]:
output_filename = 'book_day1.csv'
df.to_csv(output_filename, index=False)
print(f"\nData successfully saved to {output_filename}!")


Data successfully saved to book_day1.csv!
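
The saved `Price` and `Rating` columns are still plain strings, which limits the analysis you can do on them. As a hedged sketch, the helpers below (hypothetical names `parse_price` and `parse_rating`, not part of any library) convert such strings to numbers; they assume prices contain a single decimal number and ratings start with an English number word from One to Five.

```python
import re

# Mapping from the rating words used in the scraped data to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text: str) -> float:
    """Extract the first decimal number from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


def parse_rating(text: str) -> int:
    """Map a rating string that starts with 'One'..'Five' to an integer 1..5."""
    match = re.match(r"(One|Two|Three|Four|Five)", text)
    if match is None:
        raise ValueError(f"unrecognised rating {text!r}")
    return RATING_WORDS[match.group(1)]


price_value = parse_price("£51.77")
rating_value = parse_rating("Three out of Five")
print(price_value, rating_value)
```

With pandas, the same helpers can be applied column-wise, for example `df['Price'].map(parse_price)`.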


#### Advanced Techniques:

* Handling Pagination: Scraping data across multiple pages.
* The Challenge of Dynamic Content: When requests isn't enough.
* Introduction to Selenium: Automating a real web browser.
* Hands-On with Selenium: Scraping a JavaScript-powered website.
* Best Practices: Error handling, waits, and putting it all together.

#### Pagination Strategy:

* Scrape the current page.
* Find the "Next" button's link.
* If a "Next" button exists, form the URL for the next page and repeat the process.
* If there's no "Next" button, we've reached the last page, so we stop.
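
The strategy above can be tried out without any network traffic. In the sketch below, the `PAGES` dictionary is a made-up, in-memory stand-in for a website, and `next_page_url` is a hypothetical helper name. It resolves the relative `href` against the current page's URL with `urljoin`, which is one reasonable choice among several.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "site": each URL maps to that page's HTML.
PAGES = {
    "http://example.com/catalogue/page-1.html":
        '<ul><li class="next"><a href="page-2.html">next</a></li></ul>',
    "http://example.com/catalogue/page-2.html":
        '<ul><li class="next"><a href="page-3.html">next</a></li></ul>',
    "http://example.com/catalogue/page-3.html":
        "<ul></ul>",  # last page: no "next" item
}


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    next_item = soup.find("li", class_="next")
    if next_item is None:
        return None
    # Resolve the relative link against the page it was found on.
    return urljoin(current_url, next_item.find("a")["href"])


visited = []
url = "http://example.com/catalogue/page-1.html"
while url is not None:
    visited.append(url)                     # "scrape" the current page
    url = next_page_url(url, PAGES[url])    # follow "Next" or stop

print(visited)
```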

In [57]:

import time
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/catalogue/"
current_page_url = base_url + "page-1.html"
headers = {"User-Agent": "My Web Scraper 2.0 - for educational purposes"}
all_books_data = []
page_count = 0
max_pages = 10  # safety cap so the loop cannot run away

while current_page_url and page_count < max_pages:
    page_count += 1
    print(f"Scraping page {page_count}: {current_page_url}")
    response = requests.get(current_page_url, headers=headers, timeout=10)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    # Pass bytes so the parser can use the UTF-8 charset declared in the page
    soup = BeautifulSoup(response.content, 'lxml')

    books = soup.find_all('article', class_='product_pod')

    for book in books:
        title = book.find('h3').find('a')['title']
        price = book.find('p', class_='price_color').text
        rating = book.find('p', class_='star-rating')['class'][1]

        all_books_data.append({
            'Title': title,
            'Price': price,
            'Rating': f"{rating} out of five"
        })

    # Look for the "Next" link once per page, after all its books are collected
    next_button = soup.find('li', class_='next')
    if next_button:
        next_page_relative_url = next_button.find('a')['href']
        current_page_url = urljoin(base_url, next_page_relative_url)
        time.sleep(1)  # pause between pages to avoid overloading the server
    else:
        current_page_url = None

# Report the total once, after the loop has finished
print(f"\nFinished scraping. Total books found: {len(all_books_data)}")


Scraping page 1: http://books.toscrape.com/catalogue/page-1.html
Scraping page 2: http://books.toscrape.com/catalogue/page-2.html
Scraping page 3: http://books.toscrape.com/catalogue/page-3.html
Scraping page 4: http://books.toscrape.com/catalogue/page-4.html
Scraping page 5: http://books.toscrape.com/catalogue/page-5.html
Scraping page 6: http://books.toscrape.com/catalogue/page-6.html
Scraping page 7: http://books.toscrape.com/catalogue/page-7.html
Scraping page 8: http://books.toscrape.com/catalogue/page-8.html
Scraping page 9: http://books.toscrape.com/catalogue/page-9.html
Scraping page 10: http://books.toscrape.com

In [58]:
df_all_pages=pd.DataFrame(all_books_data)
df_all_pages.to_csv('books_all_pages.csv',index=False)
print("Data from all pages saved to books_all_pages.csv")
display(df_all_pages.head())
display(df_all_pages.tail())

Data from all pages saved to books_all_pages.csv


|   | Title | Price | Rating |
|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | Three out of five |
| 1 | Tipping the Velvet | £53.74 | One out of five |
| 2 | Soumission | £50.10 | One out of five |
| 3 | Sharp Objects | £47.82 | Four out of five |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | Five out of five |


|   | Title | Price | Rating |
|---|---|---|---|
| 195 | Eureka Trivia 6.0 | £54.59 | Four out of five |
| 196 | Drive: The Surprising Truth About What Motivat... | £34.95 | Four out of five |
| 197 | Done Rubbed Out (Reightman & Bailey #1) | £37.72 | Five out of five |
| 198 | Doing It Over (Most Likely To #1) | £35.61 | Three out of five |
| 199 | Deliciously Ella Every Day: Quick and Easy Rec... | £42.16 | Three out of five |
