# Self-study coding activity 8.2: Web scraping in action 

### 1. Set up and import libraries
First, ensure you have the necessary libraries installed. You need requests for fetching web pages and beautifulsoup4 for parsing HTML content. Install them using pip if you haven't already:

In [3]:
!pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup



### 2. Send an HTTP request 
Use the requests library to send an HTTP request to the target website and fetch the HTML content of the page.

In [13]:
url = 'http://example.com'
response = requests.get(url) 
# Check if the request was successful
if response.status_code == 200:
    html_content = response.content
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')
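
The request above can be made a little more defensive. The following is a minimal sketch, not part of the original activity: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. It relies on `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses, so network errors and HTTP errors can both be caught as `requests.RequestException`.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper; the 10-second timeout is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f'Failed to retrieve {url}: {exc}')
        return None
    return response.text


# Example call (requires network access):
# html_content = fetch_html('http://example.com')
```

Returning None on failure lets the calling code skip parsing instead of hitting an undefined html_content variable.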

### 3. Parse the HTML content
Create a Beautiful Soup object to parse the HTML content retrieved from the website. This object allows you to navigate and search through the HTML.

In [12]:
soup = BeautifulSoup(html_content, 'html.parser')

### 4. Navigate and search the HTML tree 
Use Beautiful Soup methods to find and extract the data you need. The methods you can use are listed below: 

#### A. find():
Finds the first occurrence of a tag or element in the HTML.

In [14]:
from bs4 import BeautifulSoup

html_content = """
<html>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is the first paragraph.</p>
    <h1>Another Heading</h1>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')
first_h1 = soup.find('h1')
print(first_h1.text)  # Output: Welcome to My Website

Welcome to My Website
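
One detail worth knowing about find() is that it returns None when nothing matches, so calling .text on the result without a check raises an AttributeError. A short sketch of the guard follows; the `main-title` id is a made-up attribute value used only for illustration.

```python
from bs4 import BeautifulSoup

html_content = "<html><body><h1>Welcome to My Website</h1></body></html>"
soup = BeautifulSoup(html_content, 'html.parser')

# find() returns None when no matching tag exists
h2_tag = soup.find('h2')
heading_text = h2_tag.text if h2_tag is not None else 'No <h2> tag found'
print(heading_text)

# find() also accepts attribute filters; 'main-title' is an illustrative id
missing = soup.find('h1', id='main-title')
print(missing)
```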


#### B. find_all(): 
Finds all occurrences of a specific tag or element in the HTML.

In [16]:
from bs4 import BeautifulSoup

html_content = """
<html>
  <body>  
    <p>This is the first paragraph.</p>  
    <p>This is the second paragraph.</p>  
    <p>This is the third paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')
all_paragraphs = soup.find_all('p')
for p in all_paragraphs:    
    print(p.text)
# Output:
# This is the first paragraph.
# This is the second paragraph.
# This is the third paragraph.

This is the first paragraph.
This is the second paragraph.
This is the third paragraph.
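
find_all() can also narrow its results with keyword filters. The sketch below uses a small made-up HTML snippet (the `intro` class and the `/about` link are assumptions for illustration) to show three common filters: `class_`, `href=True`, and `limit`.

```python
from bs4 import BeautifulSoup

html_content = """
<html>
  <body>
    <p class="intro">This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <a href="/about">About</a>
    <a>Placeholder link without href</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# class_ (with a trailing underscore) filters by CSS class
intro_paragraphs = soup.find_all('p', class_='intro')
print([p.text for p in intro_paragraphs])

# href=True keeps only <a> tags that have an href attribute
links = soup.find_all('a', href=True)
print([a['href'] for a in links])

# limit caps the number of results returned
first_only = soup.find_all('p', limit=1)
print(len(first_only))
```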


#### C. select():
Finds elements using CSS selectors. This method is more flexible than find() and find_all() because it accepts complex CSS rules. Suppose you want to find all elements with the class item and all <p> tags within a specific <div> with the ID content.

In [17]:
from bs4 import BeautifulSoup

html_content = """
<html>
  <body>
    <div id="content">
      <p>This is the first paragraph in the content.</p>
      <p>This is the second paragraph in the content.</p>
    </div>
    <div class="item">
      <p>This is an item paragraph.</p>
    </div>
    <div class="item">
      <p>This is another item paragraph.</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Find all elements with the class "item"
items = soup.select('.item')
for item in items:
    print(item.p.text)
# Output:
# This is an item paragraph.
# This is another item paragraph.

# Find all <p> tags within the div with id "content"
content_paragraphs = soup.select('#content p')
for p in content_paragraphs:
    print(p.text)
# Output:
# This is the first paragraph in the content.
# This is the second paragraph in the content.

This is an item paragraph.
This is another item paragraph.
This is the first paragraph in the content.
This is the second paragraph in the content.
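
A few other selector forms may be useful alongside the class and ID selectors above. This is a sketch on a made-up HTML snippet (the markup and the example.com link are assumptions): `select_one()` returns only the first match, `>` restricts matching to direct children, and `[href^="..."]` matches on an attribute prefix.

```python
from bs4 import BeautifulSoup

html_content = """
<div id="content">
  <p>Direct child paragraph.</p>
  <section><p>Nested paragraph.</p></section>
  <a href="https://example.com/docs">Docs</a>
</div>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# select_one() returns the first match, or None if nothing matches
first_p = soup.select_one('#content p')
print(first_p.text)

# '>' matches direct children only, so the nested <p> is skipped
direct = soup.select('#content > p')
print([p.text for p in direct])

# Attribute selector: <a> tags whose href starts with "https://"
secure_links = soup.select('a[href^="https://"]')
print([a['href'] for a in secure_links])
```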


### 5. Extract data
Once the tags are located, extract the data you need from them. For example, to extract the text of the first h1 tag and the href attribute of every link on the page at the URL, use the following code:

In [21]:
url = 'http://example.com'
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract text from the first <h1> tag
    h1_tag = soup.find('h1')
    if h1_tag:
        print(h1_tag.text)
    else:
        print('No <h1> tag found')
    # Extract the href attribute from all <a> tags
    a_tags = soup.find_all('a')
    for a in a_tags:
        href = a.get('href')
        print(href)
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')

Example Domain
https://iana.org/domains/example


### 6. Handle relative URLs
If you encounter relative URLs, convert them to absolute URLs to ensure they can be used to navigate the web.

In [24]:
from urllib.parse import urljoin
base_url = 'http://example.com'
relative_url = '/path/to/resource'
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)

http://example.com/path/to/resource
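
Scraped href values come in several forms, and urljoin() resolves each against the page it was found on. The sketch below assumes a hypothetical page at http://example.com/blog/ and a made-up list of href values; only the last one is taken from the output shown earlier.

```python
from urllib.parse import urljoin

# Hypothetical page URL that the links were scraped from
base_url = 'http://example.com/blog/'

# Illustrative href values in the forms commonly found in <a> tags
hrefs = [
    '/about',                            # root-relative
    'post-1.html',                       # relative to the current folder
    '../images/logo.png',                # relative to the parent folder
    'https://iana.org/domains/example',  # already absolute
]

absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
# Output:
# http://example.com/about
# http://example.com/blog/post-1.html
# http://example.com/images/logo.png
# https://iana.org/domains/example
```

Note that the trailing slash in the base URL matters: without it, urljoin() treats blog as a file name and resolves post-1.html against the site root.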


### 7. Store the extracted data
The extracted data can be stored in any suitable format: a text file, a CSV file, a JSON file, or directly in a database.

#### A. Storing data in a text file: 
Suppose you have an HTML page with several paragraphs, and you want to extract the text from each paragraph (<p> tag) and store it in a text file. By default, the file is created in the current working directory, which is typically the folder that contains the notebook. To store the file elsewhere, pass a full path to open() instead of just a file name.

In [25]:
import requests
from bs4 import BeautifulSoup

# For this example, we'll use the HTML content directly.
# In a real scenario, you would fetch it from a website.
html_content = """
<html>
<body>
  <h1>Sample Page</h1>
  <p>This is the first paragraph.</p>
  <p>This is the second paragraph.</p>
  <p>This is the third paragraph.</p>
</body>
</html>
"""

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')

# Find all <p> tags
paragraphs = soup.find_all('p')

# Open a text file in write mode
with open('extracted_data.txt', 'w') as file:
    # Loop through all <p> tags and write the text content to the file
    for p in paragraphs:
        file.write(p.get_text() + '\n')

print("Data has been written to extracted_data.txt")

Data has been written to extracted_data.txt
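
To write the file somewhere other than the current folder, one option is to build the path with pathlib. This is a sketch under assumptions: the folder name `scraped_output` is hypothetical, and the paragraph texts are hard-coded here in place of the scraped values.

```python
from pathlib import Path

# 'scraped_output' is a hypothetical folder name chosen for this sketch
output_dir = Path('scraped_output')
# Create the folder (and any missing parents) if it does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)

# Stand-in for the text extracted from the <p> tags
paragraph_texts = [
    'This is the first paragraph.',
    'This is the second paragraph.',
    'This is the third paragraph.',
]

output_path = output_dir / 'extracted_data.txt'
with open(output_path, 'w', encoding='utf-8') as file:
    for text in paragraph_texts:
        file.write(text + '\n')

print(f'Data has been written to {output_path}')
```

Passing encoding='utf-8' explicitly avoids platform-dependent defaults when the scraped text contains non-ASCII characters.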


#### B. Storing data in a CSV file: 
To store the output of web scraping with Beautiful Soup in a CSV file, perform the following steps:
- Create an HTML string: Define the HTML content with a table. This example uses a simple HTML document containing a table with three columns (Name, Age, City).
- Parse the HTML string: Use Beautiful Soup to parse the HTML content.
- Extract the data: Extract the data from the table. The headers are extracted from the <th> elements. The rows are extracted from the <tr> elements, skipping the header row. Each cell's text content is extracted from the <td> elements.
- Write to CSV: Export the extracted data to a CSV file. The csv.writer writes the headers and rows to a CSV file named output.csv.

In [29]:
from bs4 import BeautifulSoup
import csv

# Step 1: Create an HTML string with a table
html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Sample Table</title>
</head>
<body>
  <table border="1">
      <tr>
          <th>Name</th>
          <th>Age</th>
          <th>City</th>
      </tr>
      <tr>
          <td>John Doe</td>
          <td>30</td>
          <td>New York</td>
      </tr>
      <tr>
          <td>Jane Smith</td>
          <td>25</td>
          <td>Los Angeles</td>
      </tr>
      <tr>
          <td>Emily Jones</td>
          <td>35</td>
          <td>Chicago</td>
      </tr>
  </table>
</body>
</html>
"""

# Step 2: Parse the HTML string
soup = BeautifulSoup(html_content, 'html.parser')

# Step 3: Extract the desired data
headers = []  # Stays empty if no table is found
data = []
table = soup.find('table')  # Find the table
if table:
    headers = [header.text for header in table.find_all('th')]
    rows = table.find_all('tr')[1:]  # Skip the header row
    for row in rows:
        cells = row.find_all('td')
        data_row = [cell.text for cell in cells]
        data.append(data_row)

# Step 4: Write the data to a CSV file
csv_file = 'output.csv'
with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    if headers:
        writer.writerow(headers)
    writer.writerows(data)

print(f"Data has been written to {csv_file}")

Data has been written to output.csv
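
JSON is listed above as a storage format but has no example of its own. A minimal sketch follows, assuming the same headers and rows as the table example; the values are hard-coded here so the block runs on its own, and the file name output.json is hypothetical.

```python
import json

# Same headers and rows as the table example, hard-coded for this sketch
headers = ['Name', 'Age', 'City']
data = [
    ['John Doe', '30', 'New York'],
    ['Jane Smith', '25', 'Los Angeles'],
    ['Emily Jones', '35', 'Chicago'],
]

# Pair each row with the headers so every record is self-describing
records = [dict(zip(headers, row)) for row in data]

json_file = 'output.json'  # hypothetical file name
with open(json_file, 'w', encoding='utf-8') as file:
    json.dump(records, file, indent=2, ensure_ascii=False)

print(f'Data has been written to {json_file}')
```

Each record becomes an object such as {"Name": "John Doe", "Age": "30", "City": "New York"}. The ages stay as strings because that is how they come out of the HTML; convert them with int() first if numeric values are needed.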


#### C. Storing data in a database: 
To store scraped data in a database, perform the following steps:
- Fetch the webpage: Use the requests library to fetch the content of the webpage. This example uses a predefined HTML string instead of a live URL.
- Parse the HTML: Use Beautiful Soup to parse the HTML content.
- Extract the data: Identify and extract the specific data you need from the HTML. In this example, the script extracts headers and rows from the table.
- Store in a database: Use a database library such as sqlite3 or SQLAlchemy, or any other database connector, to store the extracted data.
- Download the database: Once the database file (example.db) is created, it can be downloaded and opened in SQLite.

In [32]:
import requests
from bs4 import BeautifulSoup
import sqlite3

# Step 1: Fetch the webpage (using a predefined HTML string for this example)
html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Sample Table</title>
</head>
<body>
  <table border="1">
      <tr>
          <th>Name</th>
          <th>Age</th>
          <th>City</th>
      </tr>
      <tr>
          <td>John Doe</td>
          <td>30</td>
          <td>New York</td>
      </tr>
      <tr>
          <td>Jane Smith</td>
          <td>25</td>
          <td>Los Angeles</td>
      </tr>
      <tr>
          <td>Emily Jones</td>
          <td>35</td>
          <td>Chicago</td>
      </tr>
  </table>
</body>
</html>
"""

# Step 2: Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Step 3: Extract the desired data
data = []
table = soup.find('table')  # Find the table
if table:
    headers = [header.text for header in table.find_all('th')]
    rows = table.find_all('tr')[1:]  # Skip the header row
    for row in rows:
        cells = row.find_all('td')
        data_row = [cell.text for cell in cells]
        data.append(data_row)

# Step 4: Store the data in a database
# Connect to SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('example.db')
cursor = conn.cursor()

# Create table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER,
        city TEXT
    )
''')

# Insert data into the table
for row in data:
    cursor.execute('INSERT INTO people (name, age, city) VALUES (?, ?, ?)', row)

# Commit the transaction and close the connection
conn.commit()
conn.close()

print("Data has been written to the database")

Data has been written to the database
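
Once rows are in the table, they can be read back with a SELECT query. The sketch below is self-contained and makes two assumptions: it uses an in-memory database (':memory:') instead of example.db so it leaves no file behind, and it hard-codes the three rows from the table example.

```python
import sqlite3

# Same three rows as the table example, hard-coded for this sketch
rows = [
    ('John Doe', 30, 'New York'),
    ('Jane Smith', 25, 'Los Angeles'),
    ('Emily Jones', 35, 'Chicago'),
]

# ':memory:' creates a temporary database that disappears when closed
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER,
        city TEXT
    )
''')

# executemany() inserts all rows with a single call
cursor.executemany('INSERT INTO people (name, age, city) VALUES (?, ?, ?)', rows)
conn.commit()

# Read back everyone aged 30 or older, youngest first
cursor.execute('SELECT name, age, city FROM people WHERE age >= ? ORDER BY age', (30,))
results = cursor.fetchall()
for name, age, city in results:
    print(name, age, city)
# Output:
# John Doe 30 New York
# Emily Jones 35 Chicago

conn.close()
```

The ? placeholders keep values separate from the SQL text, which matters for scraped data because it may contain quotes or other characters that would break a query built by string formatting.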