# Web Scraping with BeautifulSoup4 🕷️

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NouamaneTazi/hackai-challenges/blob/main/new_notebooks/data_scraping_bs4_goudma.ipynb)

In this notebook, you'll learn how to:
- Extract data from websites using BeautifulSoup4
- Parse HTML content
- Save the scraped data to a structured format
- Share your dataset on HuggingFace 🤗

Time to complete: ~45 minutes

## What is Web Scraping? 🌐

Web scraping is like having a robot that can read websites and collect information for you. It's useful for:
- Gathering data for analysis
- Creating datasets for AI training
- Monitoring website changes
- Automating data collection

## Setup 🛠️

First, let's install the required packages:
- `beautifulsoup4`: Helps us parse and navigate HTML
- `requests`: Lets us download web pages
- `pandas`: For organizing our data
- `datasets`: For sharing on HuggingFace
- `tqdm`: For progress bars while scraping

In [None]:
# install required packages
!pip install beautifulsoup4 requests pandas datasets tqdm -q

## Import Libraries 📚

Let's import the tools we need:

In [None]:
from bs4 import BeautifulSoup  # For parsing HTML
import requests  # For downloading web pages
import pandas as pd  # For data organization
from tqdm import tqdm  # For progress bars

## Our Target 🎯

We'll scrape news articles from [Goud.ma](https://www.goud.ma), a Moroccan news website. We'll collect:
- Article titles
- Article content
- Article images

Before we start scraping, we need to inspect the HTML of the target page (for example with the browser's developer tools) to understand its structure.
On the main page, we'll target:
1. The `article` elements with class `card`
2. Inside each article, we'll find the `a` tag with class `stretched-link` to get the article URL
3. Then we'll extract the content from each article page
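
The first two steps above can be tried offline before touching the real site. The sketch below uses a small hand-written snippet: the `sample_html` string and its `example.com` URLs are made up for illustration and are not copied from Goud.ma. It only mimics the structure described, with `article` elements of class `card` that each hold an `a` tag of class `stretched-link`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above;
# the real page will contain many more elements and attributes.
sample_html = """
<main>
  <article class="card">
    <a class="stretched-link" href="https://example.com/article-1">First</a>
  </article>
  <article class="card featured">
    <a class="stretched-link" href="https://example.com/article-2">Second</a>
  </article>
  <div class="card">Not an article element, so it is ignored</div>
</main>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: article elements whose class list includes "card"
sample_cards = sample_soup.find_all("article", class_="card")

# Step 2: the href of the link inside each card
sample_links = [
    card.find("a", class_="stretched-link")["href"] for card in sample_cards
]

print(len(sample_cards), sample_links)
```

Note that `class_="card"` matches any element whose class list contains `card`, so the second article with `class="card featured"` is included as well.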

TODO: Add image showing the HTML structure of the main page with arrows pointing to:
- article elements with class "card"
- a tags with class "stretched-link"
- href attributes

In [None]:
# The URL we want to scrape
target_url = "https://www.goud.ma/topics/%d8%a7%d9%84%d8%b1%d8%a6%d9%8a%d8%b3%d9%8a%d8%a9/"

# Send a request to the website
# A browser-like User-Agent header makes the request less likely to be rejected
response = requests.get(target_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print("✅ Successfully connected to the website!")
else:
    print(f"❌ Failed to connect to the website (status code {response.status_code})")
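
The long `%d8%a7...` sequence in `target_url` is a percent-encoded Arabic path segment. As a quick check, the standard library's `urllib.parse.unquote` can decode it. This sketch only inspects the URL string and sends no request.

```python
from urllib.parse import quote, unquote, urlsplit

target_url = "https://www.goud.ma/topics/%d8%a7%d9%84%d8%b1%d8%a6%d9%8a%d8%b3%d9%8a%d8%a9/"

# Decode the percent-encoded UTF-8 bytes back into readable text
encoded_path = urlsplit(target_url).path
decoded_path = unquote(encoded_path)
print(decoded_path)

# Re-encoding the decoded path gives the same bytes (quote emits uppercase hex)
reencoded_path = quote(decoded_path)
print(reencoded_path.lower() == encoded_path)
```

The decoded segment is the Arabic word for "main", which suggests this URL points to the site's main topic listing.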

## Parsing the HTML 🧩

Now that we have the webpage, let's parse it with BeautifulSoup. We'll find all article elements with class "card":

TODO: Add image showing the BeautifulSoup object structure with:
- HTML tree visualization
- Highlighted article elements
- Class attributes

In [None]:
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Find all article elements with class "card"
# We'll get the first 6 articles
articles = soup.find_all("article", class_="card")[:6]
print(f"Found {len(articles)} articles!")
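
Slicing with `[:6]` first collects every match and then keeps six of them. As an alternative sketch, `find_all` accepts a `limit` argument, and `select` takes a CSS selector with its own `limit`. The `demo_html` string below is generated for illustration so the cell runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical page with ten cards, generated for illustration only
demo_html = "".join(
    f'<article class="card"><a class="stretched-link" href="/post-{i}">Post {i}</a></article>'
    for i in range(10)
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Same result as find_all(...)[:6], but the search stops after six matches
first_six = demo_soup.find_all("article", class_="card", limit=6)

# Equivalent CSS selector form
first_six_css = demo_soup.select("article.card", limit=6)

print(len(first_six), len(first_six_css))
```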

## Extracting Article Links 🔗

For each article, we need to:
1. Find the `a` tag with class `stretched-link` inside each article
2. Extract the `href` attribute to get the article URL

TODO: Add image showing:
- Article HTML structure
- Highlighted a tag with class "stretched-link"
- Arrow pointing to href attribute
- Example of extracted URL

In [None]:
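# Extract links from articles, skipping any card without a matching link
# (find() returns None when nothing matches, so calling .get() on it would fail)
article_links = []
for article in articles:
    link_tag = article.find("a", class_="stretched-link")
    if link_tag and link_tag.get("href"):
        article_links.append(link_tag["href"])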

print("Article links:")
for i, link in enumerate(article_links, 1):
    print(f"{i}. {link}")
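
The `href` values may not always be absolute URLs. A minimal sketch, assuming some links could be relative or repeated (the sample `href` values below are invented), uses `urllib.parse.urljoin` to resolve each one against the site URL and drops duplicates while keeping the original order:

```python
from urllib.parse import urljoin

base_url = "https://www.goud.ma/"

# Invented href values: one absolute, one relative, one duplicate
raw_hrefs = [
    "https://www.goud.ma/example-article-1/",
    "/example-article-2/",
    "https://www.goud.ma/example-article-1/",
]

# urljoin leaves absolute URLs untouched and resolves relative ones
absolute_links = [urljoin(base_url, href) for href in raw_hrefs]

# dict.fromkeys keeps the first occurrence of each link, in order
unique_links = list(dict.fromkeys(absolute_links))

print(unique_links)
```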

## Scraping Article Content 📝

Now let's scrape the content from each article. For each link, we'll:
1. Visit the article page
2. Find the title in the `h1` tag with class `entry-title`
3. Find the content in the `div` with class `post-content`
4. Find the image in the `img` tag with classes `img-fluid` and `wp-post-image`, and read its `src` attribute

TODO: Add image showing article page HTML structure with:
- Highlighted h1 tag with class "entry-title"
- Highlighted div with class "post-content"
- Highlighted img tag with class "img-fluid wp-post-image"
- Arrows pointing to text content and image source

In [None]:
# Create a dictionary to store our data
data = {
    "titles": [],
    "content": [],
    "images": []
}

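# Scrape each article
for link in tqdm(article_links, desc="Scraping articles"):
    # Get the article page
    article_response = requests.get(link, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    article_soup = BeautifulSoup(article_response.text, "html.parser")

    # Locate the elements; each lookup returns None when nothing matches
    title_tag = article_soup.find("h1", class_="entry-title")
    content_tag = article_soup.find("div", class_="post-content")
    # select_one matches both classes regardless of their order in the HTML
    image_tag = article_soup.select_one("img.img-fluid.wp-post-image")

    # Save data, storing None for a missing field so the three lists stay aligned
    data["titles"].append(title_tag.text.strip() if title_tag else None)
    data["content"].append(content_tag.text.strip() if content_tag else None)
    data["images"].append(image_tag.get("src") if image_tag else None)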
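
The extraction step can also be wrapped in a helper that is testable without hitting the site. The sketch below is one possible shape, not the notebook's method: `parse_article` is a hypothetical name, and `sample_article_html` is invented markup that reuses the class names listed above.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return title, content and image URL from one article page (None if missing)."""
    page = BeautifulSoup(html, "html.parser")
    title_tag = page.select_one("h1.entry-title")
    content_tag = page.select_one("div.post-content")
    image_tag = page.select_one("img.img-fluid.wp-post-image")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        # Join the text of nested tags with single spaces
        "content": content_tag.get_text(" ", strip=True) if content_tag else None,
        "image": image_tag.get("src") if image_tag else None,
    }


# Invented markup; note the img classes appear in a different order
sample_article_html = """
<h1 class="entry-title"> Example headline </h1>
<img class="wp-post-image img-fluid" src="https://example.com/photo.jpg">
<div class="post-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""

parsed = parse_article(sample_article_html)
empty = parse_article("<p>No article here</p>")
print(parsed)
print(empty)
```

Returning `None` for a missing field, instead of raising an error, keeps one malformed page from stopping the whole loop.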

## Organizing the Data 📊

Let's put our data in a pandas DataFrame for better organization:

TODO: Add image showing:
- Example of raw scraped data
- The resulting pandas DataFrame
- Highlighted columns and rows

In [None]:
# Create a DataFrame
df = pd.DataFrame(data)
df
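
Before sharing, it can help to tidy the DataFrame and keep a copy as CSV. A minimal sketch with invented rows (the `sample_data` values are placeholders, not scraped results):

```python
import pandas as pd

# Placeholder rows: one duplicate and one row with missing content
sample_data = {
    "titles": ["First headline", "Second headline", "First headline"],
    "content": ["Body one", None, "Body one"],
    "images": ["https://example.com/1.jpg", None, "https://example.com/1.jpg"],
}
sample_df = pd.DataFrame(sample_data)

# Drop exact duplicate rows, then rows that have no content
clean_df = (
    sample_df.drop_duplicates()
    .dropna(subset=["content"])
    .reset_index(drop=True)
)

# With no path, to_csv returns the CSV as a string;
# pass a file name such as "articles.csv" to write it to disk instead
csv_text = clean_df.to_csv(index=False)

print(clean_df.shape)
print(csv_text)
```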

## Saving to HuggingFace 🤗

Finally, let's share our dataset on HuggingFace. This makes it easy to:
- Share your data with others
- Use it in other AI projects
- Track changes to your dataset

To use this part, you'll need to:
1. Create a HuggingFace account
2. Get your write token from https://huggingface.co/settings/tokens
3. Set `HF_WRITE_TOKEN` to your token
4. Set `HF_DATASET_REPO` to your own `username/dataset_name`

TODO: Add image showing:
- HuggingFace dataset page
- Where to find the write token
- How to create a new dataset

In [None]:
from datasets import Dataset

# Convert pandas DataFrame to HuggingFace Dataset
dataset = Dataset.from_pandas(df)

# Uncomment and fill these to push to HuggingFace
# HF_WRITE_TOKEN = "your_token_here"  # Get from https://huggingface.co/settings/tokens
# HF_DATASET_REPO = "your_username/dataset_name"
# dataset.push_to_hub(HF_DATASET_REPO, token=HF_WRITE_TOKEN)
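
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One alternative sketch, assuming you have stored the token and repository name in environment variables, reads them at run time and only pushes when both are set. The variable names mirror the placeholders above, and `push_if_configured` is a hypothetical helper rather than part of the `datasets` library.

```python
import os


def push_if_configured(dataset, env=None):
    """Push `dataset` to the Hub only when a token and repo name are available."""
    env = os.environ if env is None else env
    token = env.get("HF_WRITE_TOKEN")
    repo = env.get("HF_DATASET_REPO")
    if not token or not repo:
        print("Skipping upload: HF_WRITE_TOKEN or HF_DATASET_REPO is not set")
        return False
    # Same call as in the cell above
    dataset.push_to_hub(repo, token=token)
    return True
```

With the variables set, `push_if_configured(dataset)` would upload the dataset built above; without them it prints a message and returns `False`.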

## Congratulations! 🎉

You've successfully:
- Scraped a website using BeautifulSoup4
- Extracted structured data
- Organized it in a pandas DataFrame
- Prepared it for sharing on HuggingFace

## Next Steps 🚀
- Try scraping a different website
- Add more data fields (like dates, authors, etc.)
- Clean the text data (remove extra spaces, special characters)
- Create visualizations of your data
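
As a starting point for the text-cleaning step, here is a minimal sketch using only the standard library. `clean_text` is a hypothetical helper, and the choice of NFKC normalization is an assumption you may want to adjust for your data.

```python
import re
import unicodedata


def clean_text(text):
    """Normalize Unicode forms and collapse runs of whitespace."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces) into standard ones
    text = unicodedata.normalize("NFKC", text)
    # Collapse spaces, tabs and newlines into single spaces, then trim the ends
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("  Hello\u00a0  world \n\n again\t! ")
print(cleaned)
```

It could be applied to the scraped column with something like `df["content"].map(clean_text)`, after handling any missing values.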