<a href="https://colab.research.google.com/github/Merhii/COSC482_DataScience_WebScraping/blob/main/Assignments/Assignment1/Assignment1_WebScrapingPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COSC 482 – Data Science and Web Scraping
## Assignment 1: Web Scraping Practice

**Objective:**

Practice web scraping with BeautifulSoup on a single, self-contained HTML document. Tasks include:
- Extracting elements based on tags.
- Filtering elements based on attributes or text.
- Storing scraped data in pandas DataFrames.


In [4]:
# Install and import necessary libraries
!pip install beautifulsoup4 pandas
from bs4 import BeautifulSoup
import pandas as pd



### HTML Content
Below is the HTML content you will work with.

In [5]:
html_content = """<html>
    <head>
        <title>Web Scraping Practice</title>
    </head>
    <body>
        <h1>Welcome to the Web Scraping Assignment</h1>
        <p>This is an assignment to practice web scraping using BeautifulSoup.</p>

        <h2>Featured Articles</h2>
        <div class="articles">
            <div class="article">
                <h3>Introduction to Python</h3>
                <p>This article introduces Python programming for beginners.</p>
                <a href="https://example.com/python-intro" class="external-link">Read more</a>
            </div>
            <div class="article">
                <h3>Advanced Web Scraping</h3>
                <p>Learn advanced techniques for scraping dynamic content.</p>
                <a href="https://example.com/web-scraping" class="external-link">Read more</a>
            </div>
            <div class="article">
                <h3>Data Science with Python</h3>
                <p>Explore how Python is used in data science and machine learning.</p>
                <a href="https://example.com/data-science" class="external-link">Read more</a>
            </div>
        </div>

        <h2>Resources</h2>
        <ul>
            <li><a href="https://example.com/beginners" class="external-link">Python for Beginners</a></li>
            <li><a href="https://example.com/intermediate" class="external-link">Intermediate Python</a></li>
            <li><a href="https://example.com/html" class="internal-link">Learn HTML</a></li>
            <li><a href="https://example.com/advanced-python" class="external-link">Advanced Python</a></li>
        </ul>

        <h2>Comments</h2>
        <div class="comments">
            <div class="comment">
                <p class="author">Alice</p>
                <p class="content">This article on Python was very helpful!</p>
            </div>
            <div class="comment">
                <p class="author">Bob</p>
                <p class="content">I learned so much about web scraping!</p>
            </div>
            <div class="comment">
                <p class="author">Charlie</p>
                <p class="content">Great resource for beginners in Python.</p>
            </div>
        </div>
    </body>
</html>"""

### Task 1: Extract the Title and Heading
**Objective:** Extract and print the title (`<title>`) and the main heading (`<h1>`) of the page.

In [6]:
# Code Here
soup = BeautifulSoup(html_content, 'html.parser')
title = soup.title.string
main_heading = soup.find('h1')
print(title)
print(main_heading.string)

Web Scraping Practice
Welcome to the Web Scraping Assignment
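
A note on `.string`: it works here because `<title>` and `<h1>` each contain a single piece of text. The sketch below uses a hypothetical heading with a nested `<em>` tag (not part of the assignment HTML) to show how `.string` and `.get_text()` differ in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (not from the assignment): the <h1> has a nested <em> tag.
snippet = "<h1>Welcome to <em>Web Scraping</em></h1>"
demo = BeautifulSoup(snippet, "html.parser")

heading = demo.find("h1")
string_value = heading.string    # None, because the tag has more than one child
text_value = heading.get_text()  # joins the text of all nested elements

print(string_value)
print(text_value)
```

When a tag may contain nested markup, `.get_text()` is usually the safer choice.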


### Task 2: Extract All Links
**Objective:** Extract all `<a>` tags and print their text (link text) and `href` attributes.

In [7]:
# Code Here
links = soup.find_all('a')
for link in links:
  print(link.text, link.attrs['href'])

Read more https://example.com/python-intro
Read more https://example.com/web-scraping
Read more https://example.com/data-science
Python for Beginners https://example.com/beginners
Intermediate Python https://example.com/intermediate
Learn HTML https://example.com/html
Advanced Python https://example.com/advanced-python
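
Every `<a>` tag in the assignment HTML has an `href`, so `link.attrs['href']` is safe here. On real pages some anchors lack one, and indexing would raise `KeyError`. A minimal sketch, assuming a hypothetical snippet with one such anchor, uses `Tag.get` instead:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second anchor has no href attribute.
snippet = '<a href="https://example.com/a">First</a><a name="top">No href</a>'
demo = BeautifulSoup(snippet, "html.parser")

pairs = []
for link in demo.find_all("a"):
    # Tag.get returns None instead of raising KeyError when the attribute is absent
    pairs.append((link.get_text(strip=True), link.get("href")))

print(pairs)
```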


### Task 3: Extract Links with Specific Class
**Objective:** Extract links that have the class `external-link` and print their URLs.


In [8]:
# Code Here
links = soup.find_all('a',class_='external-link')
for link in links:
  print(link.attrs['href'])

https://example.com/python-intro
https://example.com/web-scraping
https://example.com/data-science
https://example.com/beginners
https://example.com/intermediate
https://example.com/advanced-python
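
The same filter can also be written as a CSS selector with `select`. The sketch below is an alternative, not the required approach, and uses a shortened copy of the Resources list as its input:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the Resources list.
snippet = """
<ul>
  <li><a href="https://example.com/beginners" class="external-link">Python for Beginners</a></li>
  <li><a href="https://example.com/html" class="internal-link">Learn HTML</a></li>
</ul>
"""
demo = BeautifulSoup(snippet, "html.parser")

# "a.external-link" selects <a> tags whose class list includes external-link
external_urls = [a["href"] for a in demo.select("a.external-link")]
print(external_urls)
```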


### Task 4: Extract Articles
**Objective:** Extract the titles (`<h3>`) and content (`<p>`) of the articles under the "Featured Articles" section.

In [10]:
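# Each article is its own <div class="article">; the outer
# <div class="articles"> is only the container, so select the inner divs.
articles = soup.find_all('div', class_='article')
for article in articles:
  title = article.find('h3')
  content = article.find('p')
  print(title.string)
  print(content.string)

Introduction to Python
This article introduces Python programming for beginners.
Advanced Web Scraping
Learn advanced techniques for scraping dynamic content.
Data Science with Python
Explore how Python is used in data science and machine learning.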
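
The `class_` filter matches whole class names, so `articles` (the container) and `article` (each entry) select different elements. A minimal sketch on a reduced, hypothetical version of the section illustrates the difference:

```python
from bs4 import BeautifulSoup

# Reduced, illustrative version of the Featured Articles markup.
snippet = """
<div class="articles">
  <div class="article"><h3>First</h3><p>One.</p></div>
  <div class="article"><h3>Second</h3><p>Two.</p></div>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

containers = demo.find_all("div", class_="articles")  # the single wrapper div
entries = demo.find_all("div", class_="article")      # one div per article

# Calling find() on the wrapper only ever returns the first <h3> inside it
first_in_container = containers[0].find("h3").string

print(len(containers), len(entries), first_in_container)
```

Looping over the wrapper therefore visits one element and reports only the first article, while looping over the entries visits each article in turn.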


### Task 5: Extract Articles Containing "Python" in Title
**Objective:** Extract and print only the articles where the title contains the word `“Python”`.

In [11]:
# Code Here
articles = soup.find_all('div', class_='article')
for article in articles:
  title = article.find('h3').string
  if 'Python' in title:
    print(article)

<div class="article">
<h3>Introduction to Python</h3>
<p>This article introduces Python programming for beginners.</p>
<a class="external-link" href="https://example.com/python-intro">Read more</a>
</div>
<div class="article">
<h3>Data Science with Python</h3>
<p>Explore how Python is used in data science and machine learning.</p>
<a class="external-link" href="https://example.com/data-science">Read more</a>
</div>
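
As an alternative to checking each title in a loop, `find_all` accepts a `string` argument that can be a compiled regular expression. The sketch below assumes a reduced copy of the article headings and collects only the matching titles:

```python
import re

from bs4 import BeautifulSoup

# Reduced, illustrative copy of the article headings.
snippet = """
<div class="article"><h3>Introduction to Python</h3></div>
<div class="article"><h3>Advanced Web Scraping</h3></div>
<div class="article"><h3>Data Science with Python</h3></div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# string=re.compile(...) keeps only <h3> tags whose text matches the pattern
matching = demo.find_all("h3", string=re.compile("Python"))
python_titles = [str(h3.string) for h3 in matching]

print(python_titles)
```

From each matching `<h3>`, `h3.parent` gives back the enclosing article `<div>` if the full block is needed.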


### Task 6: Extract Comments by a Specific Author
**Objective:** Extract the comments made by “Alice” and print the comment content.

In [13]:
# Code Here
comments = soup.find_all('div',class_='comment')
for comment in comments:
  author = comment.find('p',class_='author').string
  if 'Alice' in author:
    print(comment.find('p',class_='content').string)


This article on Python was very helpful!
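
Another way to reach the same comment is to locate the author tag directly and step to its sibling. This is a sketch under the assumption that the content paragraph always follows the author paragraph inside the same `<div>`; the comment text is shortened for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative comments with shortened text.
snippet = """
<div class="comment"><p class="author">Alice</p><p class="content">Very helpful!</p></div>
<div class="comment"><p class="author">Bob</p><p class="content">Learned a lot!</p></div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Exact match on the author name, then move to the next <p class="content"> sibling
author_tag = demo.find("p", class_="author", string="Alice")
content_tag = author_tag.find_next_sibling("p", class_="content")
alice_comment = content_tag.get_text()

print(alice_comment)
```

Matching with `string="Alice"` is an exact comparison, so an author named "Alice B." would not be picked up by accident.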


### Task 7: Storing Articles in a pandas DataFrame
**Objective:** Store the article titles and content in a pandas DataFrame and print it.

In [14]:
# Code Here
data = []
articles = soup.find_all('div', class_='article')
for article in articles:
  title = article.find('h3').string
  content = article.find('p').string
  data.append([title,content])
data = pd.DataFrame(data,columns=['Title','Content'])
data

| | Title | Content |
|---|---|---|
| 0 | Introduction to Python | This article introduces Python programming for... |
| 1 | Advanced Web Scraping | Learn advanced techniques for scraping dynamic... |
| 2 | Data Science with Python | Explore how Python is used in data science and... |
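
A common variation is to collect each row as a dictionary, so the column names come from the keys rather than a separate `columns` list. The sketch below assumes a reduced, hypothetical set of articles; `rows` and `articles_df` are illustrative names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Reduced, illustrative version of the articles markup.
snippet = """
<div class="article"><h3>First</h3><p>One.</p></div>
<div class="article"><h3>Second</h3><p>Two.</p></div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# One dict per article; the keys become the DataFrame columns
rows = [
    {
        "Title": article.find("h3").get_text(strip=True),
        "Content": article.find("p").get_text(strip=True),
    }
    for article in demo.find_all("div", class_="article")
]
articles_df = pd.DataFrame(rows)

print(articles_df)
```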


### Task 8: Storing Comments in a pandas DataFrame
**Objective:** Store the authors and their comments in a pandas DataFrame and print it.

In [15]:
# Collect one [author, comment] row per comment block
data2 = []
cmnts = soup.find_all('div', class_='comment')
for cmnt in cmnts:
  author = cmnt.find('p', class_='author').string
  content = cmnt.find('p', class_='content').string
  data2.append([author, content])
# Build the DataFrame once, after the loop
data2 = pd.DataFrame(data2, columns=['Author', 'Comment'])
data2

| | Author | Comment |
|---|---|---|
| 0 | Alice | This article on Python was very helpful! |
| 1 | Bob | I learned so much about web scraping! |
| 2 | Charlie | Great resource for beginners in Python. |


### Task 9: Extract Links with Specific Words
**Objective:** Extract links (`<a>` tags) where the link text contains the word "Advanced" and print the text and URL.

In [16]:
# Code Here
links = soup.find_all('a')
for link in links:
  if 'Advanced' in link.text:
    print(link.text,link.attrs['href'])

Advanced Python https://example.com/advanced-python
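
The `in` check above is case-sensitive, so link text such as "advanced scraping" would be skipped. If case should not matter, one option is to lowercase the text before comparing. A minimal sketch, assuming a hypothetical set of links with mixed capitalization:

```python
from bs4 import BeautifulSoup

# Hypothetical links; the first uses a lowercase "advanced".
snippet = (
    '<a href="https://example.com/web-scraping">advanced scraping</a>'
    '<a href="https://example.com/advanced-python">Advanced Python</a>'
    '<a href="https://example.com/html">Learn HTML</a>'
)
demo = BeautifulSoup(snippet, "html.parser")

# Compare in lowercase so both "advanced" and "Advanced" match
matches = [
    (a.get_text(), a["href"])
    for a in demo.find_all("a")
    if "advanced" in a.get_text().lower()
]

print(matches)
```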
