#### Day 65 of Python Programming

## Introduction to Web Scraping

### What is Web Scraping?

Web scraping is the process of extracting data from websites. Instead of manually copying and pasting information, web scraping automates the task using a program. The extracted data can be stored for further analysis, visualization, or application development.

### Key Use Cases of Web Scraping:

Market Research: Extract product details and prices from e-commerce websites.

Content Aggregation: Collect news articles, blogs, or public data.

Sentiment Analysis: Gather user reviews from social media or forums.

### Legal and Ethical Considerations

Before you start web scraping, it is crucial to ensure you are doing it legally and ethically:

Check robots.txt: Websites use a robots.txt file to tell automated clients which paths they may or may not crawl. You can find this file by appending /robots.txt to a website’s root URL (e.g., https://example.com/robots.txt).

Respect Terms of Service: Review the website’s terms and conditions to understand its data usage policies.

Avoid Overloading the Server: Keep your request rate low, for example by pausing between requests, so that your scraper does not degrade the website for other users.
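
As a rough illustration of the robots.txt check, Python's standard library includes `urllib.robotparser`. The sketch below parses a made-up robots.txt and asks whether two URLs may be fetched. The rules, the URLs, and the user agent name `my-tutorial-bot` are assumptions for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())  # parse rules from a list of lines

# "my-tutorial-bot" is a placeholder user agent name
allowed_public = parser.can_fetch("my-tutorial-bot", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-tutorial-bot", "https://example.com/private/data.html")

print("Public page allowed:", allowed_public)
print("Private page allowed:", allowed_private)
```

Against a live site you would typically call `parser.set_url(...)` followed by `parser.read()` to download the real file, and add a pause such as `time.sleep(1)` between requests to keep the load low.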

### Setting Up the Environment

#### Required Libraries

For this tutorial, we will use two Python libraries:

requests: To send HTTP requests and fetch web pages.

Beautiful Soup: To parse and navigate the HTML structure.

#### Installation

To install the libraries, run the following command:

In [1]:
pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


### Step 1: Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup

### Step 2: Send a Request to the Webpage

Choose a webpage you want to scrape (e.g., https://example.com). Use the requests library to fetch its content:

In [2]:
url = "https://example.com"
response = requests.get(url)

# Check the status of the request
if response.status_code == 200:
    print("Page fetched successfully!")
else:
    print("Failed to fetch the page. Status code:", response.status_code)

Page fetched successfully!
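
The status check above covers responses that arrive with an error code, but a request can also fail before any response comes back (for example a timeout or a malformed URL). A minimal sketch of a more defensive fetch is shown below. The helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    Hypothetical helper for this tutorial, not part of the requests library.
    """
    try:
        # timeout stops the call from waiting indefinitely on a slow server
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises an HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for errors raised by requests
        print("Request failed:", error)
        return None
    return response.text
```

Calling `fetch_html("https://example.com")` would return the page's HTML on success, while a URL without a scheme such as `"not-a-valid-url"` returns `None` and prints the reason.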


### Step 3: Parse the HTML Content

Use Beautiful Soup to parse the content of the page:

In [4]:
soup = BeautifulSoup(response.content, "html.parser")
# print(soup.prettify())  # uncomment to view the formatted HTML

#### Explanation:

requests.get() fetches the web page's HTML.

BeautifulSoup(response.content, "html.parser") parses the HTML content.


.prettify() outputs neatly formatted HTML.
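
To see what parsing and `.prettify()` do without fetching anything, the sketch below runs them on a small inline HTML string. The HTML snippet is made up for this example.

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML document written on a single line
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with tags and text placed on separate, indented lines
pretty = soup.prettify()
print(pretty)
```

The single-line input comes back spread over several indented lines, which makes the nesting of tags easier to read when you are deciding what to extract.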

### Step 4: Extract Specific Elements

Extract the page title:

In [6]:
title = soup.title.string
print("Page Title:", title)

Page Title: Example Domain


#### Extract the first paragraph:

In [7]:
first_paragraph = soup.find("p").text
print("First Paragraph:", first_paragraph)

First Paragraph: This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
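
Both `soup.title` and `soup.find("p")` return `None` when the element is missing, so chaining `.string` or `.text` directly can raise an `AttributeError` on other pages. One possible way to guard against that is sketched below, using an invented HTML snippet that has no `<title>` tag. The fallback strings are arbitrary choices.

```python
from bs4 import BeautifulSoup

# Invented HTML without a <title> tag
html = "<html><body><p>Only a paragraph here.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title  # None, because the document has no <title>
title = title_tag.get_text(strip=True) if title_tag is not None else "(no title)"

paragraph_tag = soup.find("p")  # a Tag object, because a <p> exists
first_paragraph = (
    paragraph_tag.get_text(strip=True) if paragraph_tag is not None else "(no paragraph)"
)

print("Page Title:", title)
print("First Paragraph:", first_paragraph)
```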


#### Accessing attributes of a tag

In [6]:
link = soup.find('a')  # Finds the first <a> tag
print("Href attribute:", link['href'])

Href attribute: https://www.iana.org/domains/example


####  Find all elements of a specific tag

In [9]:

all_links = soup.find_all('a', href=True)  # Find all <a> tags that have an href attribute
for link in all_links:
    print(link['href'])


https://www.iana.org/domains/example
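
On many pages the `href` values are relative (for example `/about`), so they need to be combined with the page's URL before they can be requested. A small sketch using the standard library's `urljoin` follows. The base URL and the HTML snippet are assumptions made up for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and HTML, invented for illustration
base_url = "https://example.com/docs/"
html = (
    '<a href="/about">About</a>'
    '<a href="guide.html">Guide</a>'
    '<a href="https://www.iana.org/domains/example">IANA</a>'
    "<a>No href here</a>"
)
soup = BeautifulSoup(html, "html.parser")

# href=True skips <a> tags without an href; urljoin resolves relative paths
absolute_links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in absolute_links:
    print(link)
```

Links that are already absolute pass through `urljoin` unchanged, so the same line handles both kinds.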


### Saving Extracted Data

In [10]:
import csv

# Example: Save extracted links to a CSV
links = [a['href'] for a in soup.find_all('a', href=True)]

with open("links.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Links"])
    for link in links:
        writer.writerow([link])

print("Links saved to links.csv")


Links saved to links.csv
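
When you want more than one column per row (say, the link text alongside the URL), `csv.DictWriter` is one option. The sketch below writes to an in-memory buffer so it does not touch the file system. The rows are invented sample data, not scraped results.

```python
import csv
import io

# Made-up rows standing in for scraped link data
rows = [
    {"text": "Sample link A", "href": "https://example.com/a"},
    {"text": "Sample link B", "href": "https://example.com/b"},
]

buffer = io.StringIO()  # in-memory stand-in for an open file
writer = csv.DictWriter(buffer, fieldnames=["text", "href"])
writer.writeheader()    # writes the header row: text,href
writer.writerows(rows)  # writes one CSV line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with `open("links.csv", "w", newline="", encoding="utf-8")` inside a `with` block.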


### Combining requests and BeautifulSoup

To scrape data from a real web page, combine requests to fetch the page and BeautifulSoup to parse the HTML.

#### Example: Scraping a Real Web Page

In [11]:
import requests
from bs4 import BeautifulSoup

# URL of the page you want to scrape
url = 'https://www.example.com'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Create a BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Extract and print the title of the page
    title = soup.title.string
    print(f"Title of the page: {title}")
    
    # Extract and print all headings (h1, h2, h3)
    headings = soup.find_all(['h1', 'h2', 'h3'])
    for heading in headings:
        print(f"Heading text: {heading.get_text()}")
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")

Title of the page: Example Domain
Heading text: Example Domain
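
One way to make the parsing half of this example easier to reuse and test is to move it into a function that takes HTML text and returns plain Python data. The sketch below does that. The function name `extract_headings` and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def extract_headings(html):
    """Return the page title and the text of all h1, h2, and h3 tags.

    Hypothetical helper for this tutorial; it takes HTML text, not a URL.
    """
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> tag
    title = soup.title.get_text(strip=True) if soup.title is not None else None
    # Passing a list to find_all matches any of the listed tag names
    headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]
    return {"title": title, "headings": headings}


# Invented HTML used in place of a downloaded page
sample_html = """<html><head><title>Sample Page</title></head>
<body><h1>Main</h1><p>text</p><h2>Sub</h2><h4>Ignored</h4></body></html>"""

result = extract_headings(sample_html)
print(result)
```

With a live page, you would pass `response.text` from a successful request to the same function.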


### Understanding the Code

requests.get(url): Sends an HTTP GET request to fetch the webpage.

BeautifulSoup(response.text, "html.parser"): Parses the HTML content of the page. Step 3 passes response.content (the raw bytes) instead, which Beautiful Soup also accepts.

soup.title.string: Retrieves the title of the page.

soup.find("p"): Finds the first paragraph (<p> tag) on the page.



#### Assignment

Choose a website and:

Extract the title of the page.

Extract the first three headings (e.g., h1, h2 tags).

Extract the links (URLs) from the page.

