# Using Beautiful Soup for Data Collection

## 1. What is BeautifulSoup?

- BeautifulSoup is a Python library used to parse HTML and XML documents.
- It creates a parse tree from page content, making it easy to extract data.
- It does not download pages itself, so it is often paired with `requests`, which fetches the HTML to parse.

## 2. Installing BeautifulSoup

Install `beautifulsoup4`, the `requests` library used in the examples below, and a parser like `lxml`:

In [None]:
%pip install beautifulsoup4 lxml requests

## 3. Creating a BeautifulSoup Object

**Example:**

In [None]:
from bs4 import BeautifulSoup
import requests

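url = "https://example.com"
# Set a timeout so a stalled connection cannot hang the notebook
response = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()

# Parse the decoded HTML with the lxml parser
soup = BeautifulSoup(response.text, "lxml")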

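- `response.text`: The response body decoded to a string, here the page's HTML.
- `"lxml"`: A fast, lenient parser that must be installed separately. You can also use `"html.parser"`, which ships with Python and needs no extra install.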
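
Because `lxml` is an optional install, it can help to fall back to the built-in parser when it is missing. The following is a minimal sketch, not part of the original example: `make_soup` and `soup_demo` are hypothetical names, and it relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(html, "html.parser")


soup_demo = make_soup("<p>Hello, <b>world</b></p>")
print(soup_demo.p.get_text())
```

The two parsers can build slightly different trees from malformed HTML, so it is worth sticking to one parser within a project once the scraping logic is written.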

## 4. Understanding the HTML Structure

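BeautifulSoup represents the page as a tree of nested tag objects that mirrors the HTML.
You can search the tree by tag name, class, id, or any other attribute, and move between parents, children, and siblings.

**Example HTML** (the snippets in the following sections assume `soup` was built from markup like this):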
```html
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
```
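
To try the methods in the next sections without a network request, the markup above can be parsed directly from a string. This is a small sketch: `example_html` and `example_soup` are names chosen here for illustration, and `"html.parser"` is used so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

example_html = """
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
"""

# html.parser is built in, so this runs without lxml installed
example_soup = BeautifulSoup(example_html, "html.parser")

# Passing True to find_all() matches every tag, in document order
tag_names = [tag.name for tag in example_soup.find_all(True)]
print(tag_names)  # ['html', 'body', 'h1', 'p', 'a']
```

The nesting in the output order reflects the tree: `html` contains `body`, which contains the three leaf elements.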

## 5. Common Methods in BeautifulSoup

### 5.1 Accessing Elements

In [None]:
# Access the first occurrence of a tag
soup.h1

# Get the text inside a tag
soup.h1.text
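
Attribute-style access has two details worth knowing: `.text` keeps the surrounding whitespace, and a tag that does not exist evaluates to `None`. The sketch below illustrates both on a made-up heading; `heading_soup` and the markup are assumptions for the example.

```python
from bs4 import BeautifulSoup

heading_soup = BeautifulSoup("<h1>  Title <em>here</em> </h1>", "html.parser")

# .text concatenates all nested strings, whitespace included
raw_text = heading_soup.h1.text

# get_text() can strip each piece and join the pieces with a separator
clean_text = heading_soup.h1.get_text(" ", strip=True)

# Accessing a tag that is not in the document returns None
missing_tag = heading_soup.h2

print(repr(raw_text))
print(clean_text)
print(missing_tag)
```

Chaining onto a missing tag (for example `heading_soup.h2.text`) raises `AttributeError`, so check for `None` first when a tag might be absent.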

### 5.2 `find()` Method

In [None]:
# Finds the first matching element
soup.find("p")

# Filter by attribute; class_ has a trailing underscore because class is a Python keyword
soup.find("p", class_="description")

### 5.3 `find_all()` Method

In [None]:
# Returns a list of all matching elements (an empty list if none match)
soup.find_all("a")
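
`find_all()` accepts the same filters as `find()`, plus a `limit` argument that caps the number of results. A short sketch on a made-up navigation list (`nav_html` and its links are assumptions for illustration):

```python
from bs4 import BeautifulSoup

nav_html = """
<ul>
  <li><a class="internal" href="/home">Home</a></li>
  <li><a class="internal" href="/about">About</a></li>
  <li><a class="external" href="https://example.org">Partner</a></li>
</ul>
"""
nav_soup = BeautifulSoup(nav_html, "html.parser")

# Every <a> tag in document order
all_links = nav_soup.find_all("a")

# Only the <a> tags whose class attribute contains "internal"
internal_links = nav_soup.find_all("a", class_="internal")

# Stop after the first two matches
first_two = nav_soup.find_all("a", limit=2)

link_texts = [a.get_text() for a in all_links]
print(link_texts)  # ['Home', 'About', 'Partner']
print(len(internal_links), len(first_two))  # 2 2
```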

### 5.4 Using `select()` and `select_one()`

In [None]:
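# select_one() returns the first element matching a CSS selector, or None
soup.select_one("p.description")

# select() returns a list of every element matching the selector
soup.select("a")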
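
CSS selectors are most useful when a match depends on structure or on the presence of an attribute. The sketch below uses made-up markup (`page_html`) to show a child combinator, an attribute selector, and the `None` result of a selector that matches nothing.

```python
from bs4 import BeautifulSoup

page_html = """
<div id="content">
  <p class="description">First paragraph.</p>
  <p>Second paragraph.</p>
  <a href="/page">Read more</a>
  <a>No link target</a>
</div>
"""
page_soup = BeautifulSoup(page_html, "html.parser")

# Child combinator: <p> elements directly inside the element with id="content"
paragraphs = page_soup.select("#content > p")

# Attribute selector: only <a> tags that actually have an href
linked = page_soup.select("a[href]")

# select_one() returns None when nothing matches
missing = page_soup.select_one("p.summary")

print(len(paragraphs), len(linked), missing)  # 2 1 None
```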

## 6. Extracting Attributes

Get the value of an attribute, such as `href` from an `<a>` tag:

In [None]:
link = soup.find("a")
print(link["href"])

# Or using .get(), which returns None instead of raising KeyError if the attribute is missing:
print(link.get("href"))
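
Scraped `href` values are often relative, like `/page` in the example HTML. One common way to turn them into full URLs is `urljoin` from the standard library. In this sketch, `base_url` is an assumed page address and the markup is made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com"  # assumed URL of the page being scraped
link_soup = BeautifulSoup(
    '<a href="/page">Read more</a><a>No target</a>', "html.parser"
)

absolute_urls = []
for a in link_soup.find_all("a"):
    href = a.get("href")  # None for the second <a>, which has no href
    if href is not None:
        absolute_urls.append(urljoin(base_url, href))

print(absolute_urls)  # ['https://example.com/page']
```

Using `.get()` here avoids a `KeyError` on anchors that have no `href` at all.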

## 7. Traversing the Tree

In [None]:
# Access parent elements
soup.p.parent

# Access child nodes (this includes the whitespace text between tags, not only tags)
list(soup.body.children)

# Find the next sibling
soup.h1.find_next_sibling()
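
Whitespace between tags is stored in the tree as text nodes, which affects `.children` and `.next_sibling`. The following sketch shows one way to keep only element children; `tree_html` repeats the body of the example HTML, and the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup, Tag

tree_html = """<body>
  <h1>Title</h1>
  <p class="description">This is a paragraph.</p>
  <a href="/page">Read more</a>
</body>"""
tree_soup = BeautifulSoup(tree_html, "html.parser")

# Keep only Tag objects, dropping the whitespace strings between them
child_tags = [c.name for c in tree_soup.body.children if isinstance(c, Tag)]

# find_next_sibling() skips text nodes and returns the next tag
next_tag = tree_soup.h1.find_next_sibling()

# .next_sibling returns whatever node comes next, here a whitespace string
raw_next = tree_soup.h1.next_sibling

print(child_tags)     # ['h1', 'p', 'a']
print(next_tag.name)  # p
print(repr(raw_next))
```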

## 8. Handling Missing Elements Safely

In [None]:
title_tag = soup.find("h1")
if title_tag is not None:  # find() returns None when nothing matches
    print(title_tag.text)
else:
    print("Title not found")
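
When many fields need the same check, the pattern can be wrapped in a small function. This is one possible sketch, not a Beautiful Soup feature: `text_or_default` is a hypothetical helper written for this example.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default=""):
    """Hypothetical helper: text of the first match, or a default value."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


safe_soup = BeautifulSoup("<h1> Title </h1>", "html.parser")

found = text_or_default(safe_soup, "h1")
fallback = text_or_default(safe_soup, "h2", default="Subtitle not found")

print(found)     # Title
print(fallback)  # Subtitle not found
```

Returning a default keeps a scraping loop running when one page lacks an element, at the cost of hiding the gap, so choose a default that is easy to spot in the collected data.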

## 9. Summary

- BeautifulSoup helps parse and navigate HTML easily.
- Use `.find()`, `.find_all()`, `.select()`, and `.select_one()` to locate data.
- Always inspect the website's structure before writing scraping logic.
- Combine BeautifulSoup with `requests` for full scraping workflows.