# COSC 482 – Data Science and Web Scraping
## Assignment 1: Web Scraping Practice

**Objective:**

Practice web scraping with BeautifulSoup on a single, self-contained HTML document. Tasks include:
- Extracting elements based on tags.
- Filtering elements based on attributes or text.
- Storing scraped data in pandas DataFrames.


In [1]:
# Install and import necessary libraries
!pip install beautifulsoup4 pandas

from bs4 import BeautifulSoup
import pandas as pd



### HTML Content
Below is the HTML content you will work with.

In [2]:
html_content = """<html>
    <head>
        <title>Web Scraping Practice</title>
    </head>
    <body>
        <h1>Welcome to the Web Scraping Assignment</h1>
        <p>This is an assignment to practice web scraping using BeautifulSoup.</p>

        <h2>Featured Articles</h2>
        <div class="articles">
            <div class="article">
                <h3>Introduction to Python</h3>
                <p>This article introduces Python programming for beginners.</p>
                <a href="https://example.com/python-intro" class="external-link">Read more</a>
            </div>
            <div class="article">
                <h3>Advanced Web Scraping</h3>
                <p>Learn advanced techniques for scraping dynamic content.</p>
                <a href="https://example.com/web-scraping" class="external-link">Read more</a>
            </div>
            <div class="article">
                <h3>Data Science with Python</h3>
                <p>Explore how Python is used in data science and machine learning.</p>
                <a href="https://example.com/data-science" class="external-link">Read more</a>
            </div>
        </div>

        <h2>Resources</h2>
        <ul>
            <li><a href="https://example.com/beginners" class="external-link">Python for Beginners</a></li>
            <li><a href="https://example.com/intermediate" class="external-link">Intermediate Python</a></li>
            <li><a href="https://example.com/html" class="internal-link">Learn HTML</a></li>
            <li><a href="https://example.com/advanced-python" class="external-link">Advanced Python</a></li>
        </ul>

        <h2>Comments</h2>
        <div class="comments">
            <div class="comment">
                <p class="author">Alice</p>
                <p class="content">This article on Python was very helpful!</p>
            </div>
            <div class="comment">
                <p class="author">Bob</p>
                <p class="content">I learned so much about web scraping!</p>
            </div>
            <div class="comment">
                <p class="author">Charlie</p>
                <p class="content">Great resource for beginners in Python.</p>
            </div>
        </div>
    </body>
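</html>"""

# Parse the HTML string once; every task below works on this `soup` object
soup = BeautifulSoup(html_content, 'html.parser')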

### Task 1: Extract the Title and Heading
**Objective:** Extract and print the title (`<title>`) and the main heading (`<h1>`) of the page.

In [16]:
title = soup.find('title').text
heading = soup.find('h1').text
print("Title:", title)
print("Heading:", heading)

Title: Web Scraping Practice
Heading: Welcome to the Web Scraping Assignment
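
`soup.find(...)` returns `None` when no matching tag exists, so calling `.text` on the result raises an `AttributeError` on pages that lack a `<title>` or `<h1>`. A minimal defensive sketch follows; the helper name `text_or_default` and the cut-down sample page are assumptions made for illustration and are not part of the assignment.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: it has a <title> but deliberately no <h1>
sample_html = "<html><head><title> Demo Page </title></head><body><p>Hi</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

def text_or_default(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or a default if absent."""
    tag = soup.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else default

page_title = text_or_default(sample_soup, "title")
page_heading = text_or_default(sample_soup, "h1", default="(no heading)")
print("Title:", page_title)
print("Heading:", page_heading)
```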


### Task 2: Extract All Links
**Objective:** Extract all `<a>` tags and print their text (link text) and `href` attributes.

In [4]:
links = soup.find_all('a')
for link in links:
    print("Link Text:", link.text)
    print("href:", link['href'])

Link Text: Read more
href: https://example.com/python-intro
Link Text: Read more
href: https://example.com/web-scraping
Link Text: Read more
href: https://example.com/data-science
Link Text: Python for Beginners
href: https://example.com/beginners
Link Text: Intermediate Python
href: https://example.com/intermediate
Link Text: Learn HTML
href: https://example.com/html
Link Text: Advanced Python
href: https://example.com/advanced-python
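
Indexing with `link['href']` raises a `KeyError` if an `<a>` tag has no `href` attribute. Every link in the assignment HTML has one, but real pages often do not. The sketch below, which uses a small hypothetical snippet rather than the assignment HTML, collects links into a list of dictionaries with `Tag.get`, which returns `None` for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href attribute
sample_html = """
<ul>
  <li><a href="https://example.com/beginners">Python for Beginners</a></li>
  <li><a>Placeholder without href</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tag.get returns None instead of raising when the attribute is missing
link_records = [
    {"text": a.get_text(strip=True), "href": a.get("href")}
    for a in sample_soup.find_all("a")
]

# Keep only the links that actually point somewhere
valid_links = [record for record in link_records if record["href"] is not None]
print(valid_links)
```

A list of dictionaries in this shape can also be passed directly to `pd.DataFrame`.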


### Task 3: Extract Links with Specific Class
**Objective:** Extract links that have the class `external-link` and print their URLs.


In [5]:
external_links = soup.find_all('a', class_='external-link')
for link in external_links:
    print(link['href'])

https://example.com/python-intro
https://example.com/web-scraping
https://example.com/data-science
https://example.com/beginners
https://example.com/intermediate
https://example.com/advanced-python
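
The same filter can be written as a CSS selector with `soup.select`. The sketch below compares both forms on a small hypothetical snippet; the extra `featured` class is an assumption added to show that both approaches match a tag that carries several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the last link carries two classes
sample_html = """
<ul>
  <li><a href="https://example.com/beginners" class="external-link">Python for Beginners</a></li>
  <li><a href="https://example.com/html" class="internal-link">Learn HTML</a></li>
  <li><a href="https://example.com/advanced-python" class="external-link featured">Advanced Python</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keyword filter: matches any <a> whose class list includes "external-link"
by_keyword = [a["href"] for a in sample_soup.find_all("a", class_="external-link")]

# CSS selector: "a.external-link" expresses the same condition
by_selector = [a["href"] for a in sample_soup.select("a.external-link")]

print(by_keyword == by_selector)
```

Selectors tend to be more compact when the condition involves nesting, for example `ul li a.external-link`.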


### Task 4: Extract Articles
**Objective:** Extract the titles (`<h3>`) and content (`<p>`) of the articles under the "Featured Articles" section.

In [6]:
articles_section = soup.find('div', class_='articles')
articles = articles_section.find_all('div', class_='article')
for article in articles:
    title = article.find('h3').text
    content = article.find('p').text
    print("Title:", title)
    print("Content:", content)

Title: Introduction to Python
Content: This article introduces Python programming for beginners.
Title: Advanced Web Scraping
Content: Learn advanced techniques for scraping dynamic content.
Title: Data Science with Python
Content: Explore how Python is used in data science and machine learning.


### Task 5: Extract Articles Containing "Python" in Title
**Objective:** Extract and print only the articles where the title contains the word `"Python"`.

In [7]:
for article in articles:
    title = article.find('h3').text
    if "Python" in title:
        content = article.find('p').text
        print("Title:", title)
        print("Content:", content)

Title: Introduction to Python
Content: This article introduces Python programming for beginners.
Title: Data Science with Python
Content: Explore how Python is used in data science and machine learning.
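
The `in` test above is case-sensitive, so a title such as "python tips" would be skipped. A minimal sketch of a case-insensitive variant, using hypothetical article titles that are not in the assignment HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical articles; the second title uses a lowercase "python"
sample_html = """
<div class="articles">
  <div class="article"><h3>Introduction to Python</h3><p>First.</p></div>
  <div class="article"><h3>python tips and tricks</h3><p>Second.</p></div>
  <div class="article"><h3>Advanced Web Scraping</h3><p>Third.</p></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

keyword = "python"
matching_titles = []
for article in sample_soup.find_all("div", class_="article"):
    title = article.find("h3").get_text(strip=True)
    # casefold() normalizes case on both sides before the substring test
    if keyword.casefold() in title.casefold():
        matching_titles.append(title)

print(matching_titles)
```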


### Task 6: Extract Comments by a Specific Author
**Objective:** Extract the comments made by “Alice” and print the comment content.

In [10]:
comments_section = soup.find('div', class_='comments')
comments = comments_section.find_all('div', class_='comment')
for comment in comments:
    author = comment.find('p', class_='author').text
    # Print inside the loop so every matching comment is shown,
    # and nothing fails if the author has no comments
    if author == "Alice":
        content = comment.find('p', class_='content').text
        print(content)

This article on Python was very helpful!
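
When an author has several comments, or none at all, collecting the matches in a list keeps every result and avoids relying on a variable that may never be assigned. The helper below is a sketch under that assumption; the name `comments_by` and the sample comments are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical comments section in which Alice appears twice
sample_html = """
<div class="comments">
  <div class="comment"><p class="author">Alice</p><p class="content">First comment.</p></div>
  <div class="comment"><p class="author">Bob</p><p class="content">Second comment.</p></div>
  <div class="comment"><p class="author">Alice</p><p class="content">Third comment.</p></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

def comments_by(soup, author_name):
    """Return the text of every comment written by author_name."""
    results = []
    for comment in soup.find_all("div", class_="comment"):
        author = comment.find("p", class_="author")
        content = comment.find("p", class_="content")
        # Skip malformed comments that lack an author or content paragraph
        if author is None or content is None:
            continue
        if author.get_text(strip=True) == author_name:
            results.append(content.get_text(strip=True))
    return results

print(comments_by(sample_soup, "Alice"))
print(comments_by(sample_soup, "Dana"))
```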

### Task 7: Storing Articles in a pandas DataFrame
**Objective:** Store the article titles and content in a pandas DataFrame and print it.

In [11]:
article_data = []
for article in articles:
    title = article.find('h3').text
    content = article.find('p').text
    article_data.append({'Title': title, 'Content': content})

df = pd.DataFrame(article_data)
df

|   | Title | Content |
|---|-------|---------|
| 0 | Introduction to Python | This article introduces Python programming for... |
| 1 | Advanced Web Scraping | Learn advanced techniques for scraping dynamic... |
| 2 | Data Science with Python | Explore how Python is used in data science and... |
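
The trailing `...` in the `Content` column is a display effect: pandas shortens long strings when it renders a DataFrame, while the stored values remain complete. A small sketch, assuming a one-row frame built by hand from the first article, shows one way to render the full text with the `display.max_colwidth` option.

```python
import pandas as pd

# One-row frame built by hand from the first article in the assignment HTML
full_text = "This article introduces Python programming for beginners."
article_df = pd.DataFrame([{"Title": "Introduction to Python", "Content": full_text}])

# The stored value is never shortened, only its rendering is
stored_value = article_df.loc[0, "Content"]

# Temporarily lift the column-width limit so the rendering shows the whole string
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(article_df)

print(full_repr)
```

Using `pd.option_context` keeps the change local to the `with` block, so other DataFrames in the notebook keep the default display.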


### Task 8: Storing Comments in a pandas DataFrame
**Objective:** Store the authors and their comments in a pandas DataFrame and print it.

In [12]:
comment_data = []
for comment in comments:
    author = comment.find('p', class_='author').text
    content = comment.find('p', class_='content').text
    comment_data.append({'Author': author, 'Comment': content})

df_comments = pd.DataFrame(comment_data)
df_comments

|   | Author | Comment |
|---|--------|---------|
| 0 | Alice | This article on Python was very helpful! |
| 1 | Bob | I learned so much about web scraping! |
| 2 | Charlie | Great resource for beginners in Python. |
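
Once the comments are in a DataFrame, filtering can be done with pandas instead of a Python loop. The sketch below rebuilds the three comments by hand (so it runs without the scraping step) and selects those that mention "python" in any letter case; the variable names are illustrative.

```python
import pandas as pd

# The three comments from the assignment HTML, entered by hand for this sketch
comments_df = pd.DataFrame([
    {"Author": "Alice", "Comment": "This article on Python was very helpful!"},
    {"Author": "Bob", "Comment": "I learned so much about web scraping!"},
    {"Author": "Charlie", "Comment": "Great resource for beginners in Python."},
])

# Boolean mask: True where the comment contains "python", ignoring case
mentions_python = comments_df["Comment"].str.contains("python", case=False, regex=False)
python_comments = comments_df[mentions_python]

authors = python_comments["Author"].tolist()
print(authors)
```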


### Task 9: Extract Links with Specific Words
**Objective:** Extract links (`<a>` tags) where the link text contains the word "Advanced" and print the text and URL.

In [17]:
for link in soup.find_all('a'):
    if "Advanced" in link.text:
        print("Text:", link.text)
        print("URL:", link['href'])

Text: Advanced Python
URL: https://example.com/advanced-python
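
`find_all` also accepts a function that receives each tag and returns `True` for the ones to keep, which moves the text test into the search itself. A minimal sketch on a hypothetical three-link snippet; the function name `is_advanced_link` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one link whose text contains "Advanced"
sample_html = """
<ul>
  <li><a href="https://example.com/beginners">Python for Beginners</a></li>
  <li><a href="https://example.com/html">Learn HTML</a></li>
  <li><a href="https://example.com/advanced-python">Advanced Python</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

def is_advanced_link(tag):
    """Keep only <a> tags whose visible text contains the word 'Advanced'."""
    return tag.name == "a" and "Advanced" in tag.get_text()

# Passing a function to find_all applies it to every tag in the document
advanced_links = [
    (a.get_text(strip=True), a.get("href"))
    for a in sample_soup.find_all(is_advanced_link)
]
print(advanced_links)
```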
