### Extracting data from HTML documents

- We will first create an HTML document so that we can extract text from it later.
- Basic knowledge of HTML will help you get the most out of this section.

In [15]:
# Create a sample HTML document.
html_content = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample HTML File</title>
</head>
<body>
    <header>
        <h1>Main Title</h1>
        <nav>
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>
    <section id="home">
        <h2>Home Section</h2>
        <p>Welcome home. Can we have a coffee?.</p>
    </section>
    <section id="about">
        <h2>About Section</h2>
        <p>This section writes about an overview of my website.</p>
    </section>
    <section id="contact">
        <h2>Contact Section</h2>
        <p>You can contact me using my personal email id<a href="mailto:my_name@my_email.com">info@example.com</a>.</p>
    </section>
    <footer>
        <p>&copy; I am writing a book with my copyright on it.</p>
    </footer>
</body>
</html>
'''

# Save the HTML content to a file, using UTF-8 to match the charset declared in <head>.
with open('my_sample.html', 'w', encoding='utf-8') as file:
    file.write(html_content)

- We will use the BeautifulSoup library to extract text from this HTML file.
- BeautifulSoup is a popular Python library for scraping data from web pages.

In [16]:
# Import the BeautifulSoup class from the bs4 library
from bs4 import BeautifulSoup 

# Read the HTML file.
# Open 'my_sample.html' in read mode.
with open('my_sample.html', 'r', encoding='utf-8') as file:
    html = file.read()  # Read the content of the file into the 'html' variable

# Parse the HTML content using BeautifulSoup.
soup = BeautifulSoup(html, 'html.parser')  

# Extract the text of the <title> element.
title = soup.title.string 
# Extract the text of the <h1> element within the <header>.
main_heading = soup.header.h1.string  
# Extract all href attributes from <a> elements within the <nav>.
nav_links = [a['href'] for a in soup.nav.find_all('a')]
# Extract the text of the <p> element within the section with id="home".
home_section = soup.find(id="home").p.string 
# Extract the text of the <p> element within the section with id="about".
about_section = soup.find(id="about").p.string 
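# Extract the text of the <p> element within the section with id="contact".
# Note: .string returns None here because this <p> contains both text and an <a> element.
contact_section = soup.find(id="contact").p.string
# Extract the href attribute of the <a> element within the section with id="contact".
contact_email = soup.find(id="contact").a['href']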
# Extract the text of the <p> element within the <footer>.
footer_text = soup.footer.p.string  

# Print the title.
print("Title:", title) 
# Print the main heading.
print("Main Heading:", main_heading)
# Print navigation links.
print("Navigation Links:", nav_links) 
# Print the text of the home section.
print("Home Section:", home_section)
# Print the contents of the about section.
print("About Section:", about_section)
# Print the contents of the contact section.
print("Contact Section:", contact_section) 
# Print the contact email link.
print("Contact Email:", contact_email)
# Print the footer text.
print("Footer Text:", footer_text)  

Title: Sample HTML File
Main Heading: Main Title
Navigation Links: ['#home', '#about', '#contact']
Home Section: Welcome home. Can we have a coffee?.
About Section: This section writes about an overview of my website.
Contact Section: None
Contact Email: mailto:my_name@my_email.com
Footer Text: © I am writing a book with my copyright on it.
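
- The `Contact Section: None` line in the output is not a parsing failure. `.string` returns text only when a tag has a single child, and the contact `<p>` contains text plus an `<a>` element.
- The following is a minimal sketch of one way to get the full text, using `get_text()`. It assumes a small inline fragment that mirrors the contact section of the sample file; the names `fragment`, `contact_p`, `string_value`, and `contact_text` are illustrative only.

```python
from bs4 import BeautifulSoup

# A small fragment that mirrors the contact section of the sample file.
fragment = '''
<section id="contact">
    <p>You can contact me using my personal email id<a href="mailto:my_name@my_email.com">info@example.com</a>.</p>
</section>
'''

soup = BeautifulSoup(fragment, 'html.parser')
contact_p = soup.find(id="contact").p

# .string is None because this <p> has several children (text, <a>, text).
string_value = contact_p.string

# get_text() joins the text of every descendant into a single string.
contact_text = contact_p.get_text()

print("With .string:", string_value)
print("With get_text():", contact_text)
```

- Because the sample markup has no space before the `<a>` tag, the combined text runs the two parts together. `get_text()` also accepts a separator argument if the pieces should be kept apart.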
