---

# 🥣 Using Beautiful Soup for Data Collection

---

## 1️⃣ What is **BeautifulSoup**?

**BeautifulSoup** is a **Python** library for parsing **HTML** and **XML** documents.

* 🌳 It builds a **parse tree** from the page content, which makes it easy to **extract data**.
* 🔗 It is often paired with `requests` to **scrape websites**.

---

## 2️⃣ Installing **BeautifulSoup**

Install both `beautifulsoup4` and a parser like `lxml`:

```bash
pip install beautifulsoup4 lxml
```

---

## 3️⃣ Creating a BeautifulSoup Object

📌 **Example:**

```python
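from bs4 import BeautifulSoup
import requests

url = "https://example.com"
# A timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)
# Stop early on HTTP errors such as 404 or 500
response.raise_for_status()

# Parse the HTML text with the lxml parser
soup = BeautifulSoup(response.text, "lxml")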
```

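* 🔍 `response.text`: the response body decoded to a string, here the page's **HTML content**
* ⚡ `"lxml"`: a **fast and powerful parser** that is installed separately (Python's built-in `"html.parser"` also works and needs no extra install)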
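
📌 **Sketch: falling back when `lxml` is missing.** The parser choice above can be made tolerant of environments where `lxml` is not installed. The helper name `make_soup` is hypothetical and only for illustration; it relies on `bs4` raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<h1>Title</h1>")
print(soup.h1.text)
```

The two parsers can repair broken markup differently, so results on malformed pages may not be identical.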

---

## 4️⃣ Understanding the HTML Structure

BeautifulSoup treats the page like a **tree** 🌲.
You can **search and navigate** through **tags**, **classes**, **ids**, and **attributes**.

📄 **Example HTML:**

```html
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
```
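
📌 **Sketch: seeing the tree.** To make the tree idea concrete, the example HTML above can be parsed from a string and its tags listed in document order. This is a minimal illustration that uses the built-in `"html.parser"` so it runs without extra installs.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag, in the order it appears in the document
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)  # ['html', 'body', 'h1', 'p', 'a']
```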

---

## 5️⃣ Common Methods in BeautifulSoup

### 5.1 🔎 Accessing Elements

* Access the **first occurrence** of a tag:

  ```python
  soup.h1
  ```
* Get the **text inside** a tag:

  ```python
  soup.h1.text
  ```
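
📌 **Sketch: `.text` versus `get_text(strip=True)`.** The attribute-style access shown above keeps surrounding whitespace and returns `None` for a tag that is not on the page. The HTML string below is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<body><h1>  Title  </h1></body>"
soup = BeautifulSoup(html, "html.parser")

# .text returns the text exactly as written, including padding
raw = soup.h1.text

# get_text(strip=True) trims leading and trailing whitespace
clean = soup.h1.get_text(strip=True)

# A tag that does not exist yields None instead of raising an error
missing = soup.h2

print(repr(raw), repr(clean), missing)
```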

---

### 5.2 🔍 `find()` Method

* Find the **first matching element**:

  ```python
  soup.find("p")
  ```
* Find a tag with **specific attributes**:

  ```python
  soup.find("p", class_="description")
  ```
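
📌 **Sketch: other ways to filter with `find()`.** Besides `class_`, attributes can be passed as keyword arguments or through an `attrs` dictionary. The `id="intro"` attribute below is added to the sample HTML purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p id="intro" class="description">This is a paragraph.</p>
  <a href="/page">Read more</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by an attribute using the attrs dictionary
by_attrs = soup.find("a", attrs={"href": "/page"})

# The same filter written as a keyword argument
by_kwarg = soup.find("a", href="/page")

# Filter by id without naming the tag
by_id = soup.find(id="intro")

print(by_attrs, by_id.name)
```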

---

### 5.3 🔎 `find_all()` Method

* Find **all matching elements**:

  ```python
  soup.find_all("a")
  ```
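
📌 **Sketch: looping over `find_all()` results.** `find_all()` returns a list-like result, so it is usually combined with a loop or comprehension. The list of links below is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/page-1">First</a></li>
  <li><a href="/page-2">Second</a></li>
  <li><a>No link</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <a> tag, with or without an href
links = soup.find_all("a")

# href=True keeps only the tags that actually have the attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# limit stops the search after the given number of matches
first_only = soup.find_all("a", limit=1)

# A list of names matches any of the listed tags
mixed = soup.find_all(["li", "a"])

print(hrefs)  # ['/page-1', '/page-2']
```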

---

### 5.4 🎯 Using `select()` and `select_one()`

Select elements using **CSS selectors**:

```python
soup.select_one("p.description")
soup.select("a")
```
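
📌 **Sketch: a few more CSS selectors.** The selectors below are standard CSS and are run against the sample HTML from section 4. `select()` always returns a list, while `select_one()` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: <a> tags that carry an href
with_href = soup.select("a[href]")

# Child combinator: a p.description directly inside <body>
paragraph = soup.select_one("body > p.description")

# Selector list: matches either tag, returned in document order
headings_and_paragraphs = [tag.name for tag in soup.select("h1, p")]

# No match gives None rather than an exception
nothing = soup.select_one("div.missing")

print(paragraph.text, headings_and_paragraphs, nothing)
```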

---

## 6️⃣ Extracting Attributes

Get the value of an attribute, such as `href` from an `<a>` tag:

```python
link = soup.find("a")
print(link["href"])
```

Or using `.get()`:

```python
print(link.get("href"))
```
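
📌 **Sketch: how the two access styles differ.** Square brackets raise `KeyError` for a missing attribute, while `.get()` returns `None` or a default. Multi-valued attributes such as `class` come back as a list. The example reuses the sample HTML from section 4.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p class="description">This is a paragraph.</p>
  <a href="/page">Read more</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# .get() is safe for attributes that may be absent
title = link.get("title")
title_or_default = link.get("title", "no title")

# Square brackets raise KeyError when the attribute is missing
try:
    link["title"]
    raised = False
except KeyError:
    raised = True

# class is multi-valued, so it is returned as a list
classes = soup.p["class"]

# .attrs exposes all attributes of the tag as a dictionary
all_attrs = link.attrs

print(title, title_or_default, raised, classes, all_attrs)
```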

---

## 7️⃣ Traversing the Tree

* Access **parent elements**:

  ```python
  soup.p.parent
  ```
* Access **children elements**:

  ```python
  list(soup.body.children)
  ```
* Find the **next sibling**:

  ```python
  soup.h1.find_next_sibling()
  ```
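
📌 **Sketch: whitespace in the tree.** When the HTML is indented, `.children` and `.next_sibling` also return the whitespace between tags as text nodes. Filtering on `Tag`, or using `find_next_sibling()`, skips them. The example reuses the sample HTML from section 4.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only real tags, dropping the whitespace text nodes
tag_children = [child.name for child in soup.body.children if isinstance(child, Tag)]

# next_sibling returns the whitespace right after </h1>
raw_next = soup.h1.next_sibling

# find_next_sibling() jumps to the next tag instead
next_tag = soup.h1.find_next_sibling()

# parent climbs one level up the tree
parent_name = soup.p.parent.name

print(tag_children, repr(raw_next), next_tag.name, parent_name)
```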

---

## 8️⃣ Handling Missing Elements Safely

⚠️ `find()` and `select_one()` return `None` when nothing matches, so always **check that an element exists** before accessing its text or attributes:

```python
title_tag = soup.find("h1")
if title_tag:
    print(title_tag.text)
else:
    print("Title not found")
```
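
📌 **Sketch: wrapping the check in a helper.** When many fields need the same guard, the check can live in one small function. The name `text_or_default` is hypothetical and not part of BeautifulSoup; this is one possible shape, not the only one.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, selector, default=None):
    """Return the stripped text of the first match, or default if none."""
    element = parent.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)


soup = BeautifulSoup("<body><h1> Title </h1></body>", "html.parser")

print(text_or_default(soup, "h1"))                     # Title
print(text_or_default(soup, "h2", "Title not found"))  # Title not found
```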

---

## 9️⃣ Summary 📌

* ✅ **BeautifulSoup** helps **parse and navigate HTML** easily.
* ✅ Use `.find()`, `.find_all()`, `.select()`, and `.select_one()` to **locate data**.
* ✅ Always **inspect the website's structure** before writing scraping logic.
* ✅ Combine **BeautifulSoup** with `requests` for **full scraping workflows**. 🛠️

---

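## 🔟 Worked Example: Parsing a Saved Page

```python
from bs4 import BeautifulSoup

# Read the saved HTML page from disk
with open("htmls/page-1.html") as f:
    content = f.read()

soup = BeautifulSoup(content, "html.parser")

# Each product sits in an <article class="product_pod"> element
articles = soup.select("article.product_pod")

items = []
for article in articles:
    # The full title is stored in the link's title attribute
    title = article.find("h3").find("a")["title"]
    # Drop the currency symbol and keep the numeric part
    price = article.select_one("p.price_color").text.split('£')[1]
    # The second class name holds the rating word, e.g. "Three"
    rating_element = article.select_one("p.star-rating")
    rating = rating_element['class'][1]
    items.append([title, price, rating])

items
```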

📄 **Output** (truncated):

```text
[['A Light in the Attic', '51.77', 'Three'],
 ['Tipping the Velvet', '53.74', 'One'],
 ['Soumission', '50.10', 'One'],
 ['Sharp Objects', '47.82', 'Four'],
 ['Sapiens: A Brief History of Humankind', '54.23', 'Five'],
 ['The Requiem Red', '22.65', 'One'],
 ['The Dirty Little Secrets of Getting Your Dream Job', '33.34', 'Four'],
 ['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  '17.93',
  'Three'],
 ['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  '22.60',
  'Four'],
 ['The Black Maria', '52.15', 'One'],
 ['Starving Hearts (Triangular Trade Trilogy, #1)', '13.99', 'Two'],
 ["Shakespeare's Sonnets", '20.66', 'Four'],
 ['Set Me Free', '17.46', 'Five'],
 ["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '52.29', 'Five'],
 ['Rip it Up and Start Again', '35.02', 'Five'],
 ['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
  '57.25',
  'Three'],
 ['Olio
```

```python
import pandas as pd

# Build a table with one row per book
df = pd.DataFrame(items, columns=["Books", "Price", "Rating"])
df
```

|   | Books | Price | Rating |
|---|-------|-------|--------|
| 0 | A Light in the Attic | 51.77 | Three |
| 1 | Tipping the Velvet | 53.74 | One |
| 2 | Soumission | 50.10 | One |
| 3 | Sharp Objects | 47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | Five |
| 5 | The Requiem Red | 22.65 | One |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 33.34 | Four |
| 7 | The Coming Woman: A Novel Based on the Life of... | 17.93 | Three |
| 8 | The Boys in the Boat: Nine Americans and Their... | 22.60 | Four |
| 9 | The Black Maria | 52.15 | One |


```python
# Save the table without the index column
df.to_csv("Data.csv", index=False)
```
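
📌 **Sketch: converting the scraped strings.** The price and rating values collected above are still strings (`'51.77'`, `'Three'`). If numeric values are needed, one possible cleanup step before building the table looks like this. The `RATING_WORDS` mapping and the `cleaned` name are assumptions for illustration, and the two rows are copied from the output above.

```python
# Map the rating words found in the class attribute to integers
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Two rows in the same [title, price, rating] shape as the scraped items
items = [
    ["A Light in the Attic", "51.77", "Three"],
    ["Tipping the Velvet", "53.74", "One"],
]

# Convert price to float and rating to int, leaving the title unchanged
cleaned = [
    [title, float(price), RATING_WORDS[rating]]
    for title, price, rating in items
]

print(cleaned)
```

A rating word outside the mapping would raise `KeyError`, which makes unexpected page changes visible early.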