### Day 14 of programming

## Python Tutorial: Web Scraping
### Introduction to Web Scraping
Web scraping is the process of extracting data from websites. Python, with libraries such as requests and BeautifulSoup, makes it easy to fetch and parse web content. Typical applications include data analysis, research, and automation.

### Prerequisites
Python installed on your system.

Basic understanding of HTML and CSS.

The following Python libraries:

requests: To fetch web pages.

beautifulsoup4: To parse HTML.

You can install the necessary libraries using pip:

In [1]:
pip install requests beautifulsoup4




### Step 1: Fetching Web Pages with requests
The requests library allows you to send HTTP requests and receive responses from the web. Here's a basic example of how to fetch a web page.

Example: Fetching a Web Page

In [2]:
import requests

# URL of the page you want to scrape
url = 'https://www.bethmedia.com'

# Send a GET request to the URL (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    print("Page fetched successfully!")
    print(response.text)  # Print the HTML content of the page
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")


Page fetched successfully!
<!DOCTYPE html>
<html dir="ltr" lang="en-US" prefix="og: https://ogp.me/ns#">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
	 <link rel="profile" href="https://gmpg.org/xfn/11"> 
	 <title>Beth Media | Creative Solutions | Web Design, AI, Branding</title>

		<!-- All in One SEO 4.6.2 - aioseo.com -->
		<meta name="description" content="Beth Media: Frankfurt&#039;s top creative agency. Elevate your brand with our web design, AI integration, and branding services. Discover our innovative solutions!" />
		<meta name="robots" content="max-image-preview:large" />
		<meta name="google-site-verification" content="r3xlisudaoVJkmSahi51YSKWpmrQlGmAG8GpaWHLks0" />
		<link rel="canonical" href="https://bethmedia.com/" />
		<meta name="generator" content="All in One SEO (AIOSEO) 4.6.2" />
		<meta property="og:locale" content="en_US" />
		<meta property="og:site_name" content="Beth Media - bethmedia.com" />
		<meta proper

### Step 2: Parsing HTML with BeautifulSoup
Once you have fetched the HTML content, you can use BeautifulSoup to parse and extract data from it.

Example: Parsing HTML

In [4]:
from bs4 import BeautifulSoup

# Sample HTML content (for demonstration purposes)
html_content = """
<html>
<head><title>Test Page</title></head>
<body>
<h1>Welcome to the Test Page</h1>
<p>This is a paragraph.</p>
<p class="info">Another paragraph with a class.</p>
</body>
</html>
"""

# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the title of the page
title = soup.title.string
print(f"Title of the page: {title}")

# Extract all paragraph tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(f"Paragraph text: {p.get_text()}")

# Extract text of a paragraph with a specific class
info_paragraph = soup.find('p', class_='info')
print(f"Info paragraph text: {info_paragraph.get_text()}")


Title of the page: Test Page
Paragraph text: This is a paragraph.
Paragraph text: Another paragraph with a class.
Info paragraph text: Another paragraph with a class.
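
Besides find() and find_all(), BeautifulSoup also accepts CSS selectors through select(), and tag attributes can be read with get(). The following is a minimal sketch; the sample HTML, the class name, and the example.com URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this example
sample_html = """
<ul class="links">
  <li><a href="https://example.com/a">First link</a></li>
  <li><a href="https://example.com/b">Second link</a></li>
  <li><a>Link without href</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags
anchors = soup.select('ul.links a')

# get() returns None instead of raising an error when the attribute is missing
links = [(a.get_text(), a.get('href')) for a in anchors]
for text, href in links:
    print(f"{text}: {href}")
```

Using get('href') rather than a['href'] is a small design choice: real pages often contain anchors without an href, and get() lets the loop continue instead of raising a KeyError.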


### Step 3: Combining requests and BeautifulSoup
To scrape data from a real web page, combine requests to fetch the page and BeautifulSoup to parse the HTML.

Example: Scraping a Real Web Page

In [6]:
import requests
from bs4 import BeautifulSoup

# URL of the page you want to scrape
url = 'https://www.bethmedia.com'

# Send a GET request to the URL (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Create a BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Extract and print the title of the page
    title = soup.title.string
    print(f"Title of the page: {title}")
    
    # Extract and print all headings (h1, h2, h3)
    headings = soup.find_all(['h1', 'h2', 'h3'])
    for heading in headings:
        print(f"Heading text: {heading.get_text()}")
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")


Title of the page: Beth Media | Creative Solutions | Web Design, AI, Branding
Heading text: Creative Agency
Heading text: in Frankfurt
Heading text: optimum web design propellente in mundo
Heading text: About Us
Heading text: 

							Location:						

Heading text: 

							Email:						

Heading text: 

							Phone:						

Heading text: Visual Identity
Heading text: 
							Logo Design						
Heading text: SEO Optimizing
Heading text: 
							SEO Optimizing						
Heading text: Web Development
Heading text: 
							Web Development						
Heading text: Artificial Intelligence
Heading text: 
							Artificial Intelligence						
Heading text: Our Services
Heading text: Latest Insights
Heading text: Effective Strategies for Marketing and Advertising
Heading text: Crafting a Memorable Brand Identity
Heading text: The Art of Brand Experience Design
Heading text: Frequently Asked Questions
Heading text: Contact Us
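
Several headings in the output above are surrounded by blank lines and tabs, because get_text() returns the text exactly as it is laid out in the HTML source. One way to tidy this, sketched here on a made-up snippet rather than the live page, is to pass strip=True to get_text() and skip headings that end up empty.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating headings padded with whitespace
sample_html = """
<h2>
        Location:
</h2>
<h3>   </h3>
<h2>Our Services</h2>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

clean_headings = []
for heading in soup.find_all(['h1', 'h2', 'h3']):
    text = heading.get_text(strip=True)  # drop leading/trailing whitespace
    if text:                             # skip headings with no visible text
        clean_headings.append(text)

print(clean_headings)
```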


#### Explanation of the Code:
response.text: The HTML content of the page fetched with requests.get(). The requests.get(url) call sends a request to the server to retrieve the page, and response.text contains the raw HTML as a string.

BeautifulSoup(): The constructor from the BeautifulSoup library. It creates a BeautifulSoup object that parses the raw HTML content. It takes two main arguments: the HTML content (in this case, response.text) and the parser type ('html.parser' in this case), which specifies how to interpret and parse the HTML code.

'html.parser': The parser that BeautifulSoup should use. 'html.parser' ships with Python's standard library, so it needs no extra installation. It converts the HTML document into a nested data structure (like a tree), which makes it easier to search and navigate through the HTML elements (tags, attributes, etc.).
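
To make the tree idea concrete, the sketch below walks a small, made-up document: each tag knows its parent, and its children can be iterated over. The HTML string is an assumption chosen only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, invented for this example
sample_html = "<html><body><div><h1>Title</h1><p>Text</p></div></body></html>"
soup = BeautifulSoup(sample_html, 'html.parser')

div = soup.find('div')

# Direct children of <div>; keep only tags, since children can also be text nodes
child_names = [child.name for child in div.children if isinstance(child, Tag)]
print(child_names)

# Every tag also links back to its parent
parent_name = soup.h1.parent.name
print(parent_name)
```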

### Step 4: Handling Common Issues
Respecting robots.txt: Always check the robots.txt file of a website to ensure you're allowed to scrape it. This file specifies rules for web crawlers.
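
Python's standard library includes urllib.robotparser for reading these rules. The sketch below parses a made-up robots.txt (the rules and the example.com URLs are assumptions, not taken from any real site) and asks whether two paths may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules supplied as a list of lines

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed_home = parser.can_fetch("*", "https://example.com/index.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed_home, allowed_private)
```

For a live site, the same parser can load the file itself: call set_url() with the address of the site's robots.txt, then read(), before calling can_fetch().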

Handling HTTP Errors: Always check the status code of the response. Common codes include:

200: OK

404: Not Found

403: Forbidden
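
Instead of comparing status codes by hand, one option is to let requests raise an exception for any 4xx or 5xx response via raise_for_status(). The helper below is a minimal sketch; the name fetch_html and the 10-second timeout are assumptions made for this example.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error: {err}")
    except requests.exceptions.RequestException as err:
        # Covers connection problems, timeouts, malformed URLs, and similar
        print(f"Request failed: {err}")
    else:
        return response.text
    return None
```

Calling fetch_html('https://www.bethmedia.com') would return the HTML on success, while a 404 or 403 response would print the error and return None.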

Dealing with Dynamic Content: Some websites use JavaScript to load content dynamically. For these, consider using tools like Selenium to interact with the web page.

Rate Limiting: Avoid sending too many requests in a short time to prevent being blocked. Implement delays between requests if scraping multiple pages.
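
A simple way to add such delays is to pause with time.sleep() between consecutive requests. The generator below is one possible sketch; the name polite and the one-second default are assumptions, and a suitable delay depends on the site being scraped.

```python
import time

def polite(urls, delay_seconds=1.0):
    """Yield each URL in turn, sleeping between consecutive items."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        yield url

# Example use (not executed here; page_urls is a hypothetical list of URLs):
# for url in polite(page_urls, delay_seconds=2):
#     response = requests.get(url, timeout=10)
```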

### Summary
Fetching Web Pages: Use requests to send HTTP requests and retrieve web pages.

Parsing HTML: Use BeautifulSoup to parse and extract data from HTML content.

Handling Common Issues: Respect robots.txt, handle HTTP errors, manage dynamic content appropriately, and rate-limit your requests.

Practice: Implement and test your scraping skills through various practical exercises.

Web scraping is a valuable tool for data collection and analysis. With practice and careful handling of web content, you can effectively gather and use data from the web.

## Practice Questions

1. Write a Python script that logs into a website and scrapes data from a page that requires authentication. Use the requests library for handling login.
2. Write a Python script to scrape data from a website and perform basic data analysis (e.g., average price of products, most common categories).
3. Write a Python script to scrape data from a website (e.g., product information) and save the extracted data to a CSV file.
