# Using Beautiful Soup for Data Collection

---

## 1. What is BeautifulSoup?

**BeautifulSoup** is a Python library used to parse HTML and XML documents.  
It creates a *parse tree* from page content, making it easy to extract data.  
It does not download pages itself, so it is usually paired with `requests`, which fetches the HTML to parse.

---

## 2. Installing BeautifulSoup

Install **beautifulsoup4**, a parser such as **lxml**, and **requests** (used to fetch pages in the examples that follow):

```bash
pip install beautifulsoup4 lxml requests
```
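To confirm the installation worked, a quick check like the following can help. It is only a sketch: it prints the installed Beautiful Soup version and reports whether the optional `lxml` parser can be imported.

```python
import importlib.util

import bs4

# The package installs as "beautifulsoup4" but is imported as "bs4".
print("beautifulsoup4 version:", bs4.__version__)

# lxml is optional; find_spec returns None when a package is not installed.
lxml_available = importlib.util.find_spec("lxml") is not None
print("lxml available:", lxml_available)
```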

---

## 3. Creating a BeautifulSoup Object

**Example:**

```python
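import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

# Parse the downloaded HTML into a navigable tree
soup = BeautifulSoup(response.text, "lxml")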
```

**Notes:**
- `response.text`: The response body decoded to a string, i.e. the page's HTML.  
- `"lxml"`: A fast parser that must be installed separately. You can also use `"html.parser"`, which ships with Python.
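If `lxml` might not be installed everywhere the code runs, one option is to fall back to the built-in parser. The sketch below assumes a hypothetical helper named `make_soup`; it relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not part of Beautiful Soup.
    """
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.text)
```

Different parsers can build slightly different trees from malformed HTML, so it is worth testing with the parser you intend to use in production.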

---

## 4. Understanding the HTML Structure

BeautifulSoup treats the page like a tree.  
You can search and navigate through **tags**, **classes**, **ids**, and **attributes**.

**Example HTML:**

```html
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
```
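To experiment without making a network request, the example HTML above can be parsed directly from a string. This is a minimal sketch; the variable names `html_doc` and `tag_names` are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# The example HTML, held in a string so no request is needed.
html_doc = """
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a paragraph.</p>
    <a href="/page">Read more</a>
  </body>
</html>
"""

# "html.parser" ships with Python, so this runs without lxml installed.
soup = BeautifulSoup(html_doc, "html.parser")

# Every tag in the markup becomes a node in the tree, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)
```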

---

## 5. Common Methods in BeautifulSoup

### 5.1 Accessing Elements

Access the first occurrence of a tag:

```python
soup.h1
```

Get the text inside a tag:

```python
soup.h1.text
```
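The following sketch shows attribute-style access on a small, made-up HTML string. It also contrasts `.text`, which keeps surrounding whitespace, with `get_text(strip=True)`, which trims it.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; note the padding inside <h1>.
html_doc = "<body><h1>  Title  </h1><p>First</p><p>Second</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns only the first matching tag.
first_p = soup.p

# .text keeps the whitespace exactly as it appears in the markup.
raw_title = soup.h1.text

# get_text(strip=True) removes leading and trailing whitespace.
clean_title = soup.h1.get_text(strip=True)

print(repr(first_p.text), repr(raw_title), repr(clean_title))
```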

---

### 5.2 `find()` Method

Returns the first matching element, or `None` if nothing matches:

```python
soup.find("p")
```

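Find a tag with specific attributes (the keyword is `class_`, with a trailing underscore, because `class` is a reserved word in Python):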

```python
soup.find("p", class_="description")
```
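Beyond `class_`, `find()` can also filter by `id` or by arbitrary attributes through the `attrs` dictionary. A short sketch on made-up markup (the `data-role` attribute is an assumption for the example):

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="intro">Intro text.</p>
<p class="description" id="main">This is a paragraph.</p>
<div data-role="note">A note.</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Filter by CSS class.
by_class = soup.find("p", class_="description")

# Filter by id, regardless of tag name.
by_id = soup.find(id="main")

# Attributes whose names are not valid Python keywords go in attrs.
by_attr = soup.find("div", attrs={"data-role": "note"})

# No <table> exists here, so find() returns None.
missing = soup.find("table")

print(by_class.text, by_id.text, by_attr.text, missing)
```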

---

### 5.3 `find_all()` Method

Returns a list of all matching elements (an empty list if nothing matches):

```python
soup.find_all("a")
```
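Because `find_all()` returns a list-like result, it is usually combined with a loop or a list comprehension. The sketch below uses made-up links and also shows the `limit` argument, which caps the number of results.

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="/one">One</a>
<a href="/two">Two</a>
<a href="/three">Three</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every link, in document order.
links = soup.find_all("a")
texts = [a.text for a in links]

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# When nothing matches, the result is empty rather than None.
no_tables = soup.find_all("table")

print(texts, len(first_two), len(no_tables))
```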

---

### 5.4 Using `select()` and `select_one()`

Select elements using **CSS selectors**. `select_one()` returns the first match (or `None`), and `select()` returns a list of all matches:

```python
soup.select_one("p.description")
```

```python
soup.select("a")
```
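CSS selectors are convenient when a match depends on nesting or on part of an attribute value. The following sketch runs a few common selector forms against made-up markup; the `#content` id and the URLs are assumptions for the example.

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="content">
  <p class="description">This is a paragraph.</p>
  <a href="/page">Read more</a>
  <a href="https://example.com/out">External</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# tag.class selects by tag name and class together.
desc = soup.select_one("p.description")

# "#content > a" selects <a> tags that are direct children of id="content".
links_in_content = soup.select("#content > a")

# [href^="/"] selects links whose href starts with a slash.
internal = soup.select('a[href^="/"]')

# select_one returns None when there is no match.
missing = soup.select_one("table")

print(desc.text, len(links_in_content), [a["href"] for a in internal], missing)
```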

---

## 6. Extracting Attributes

Get the value of an attribute, such as `href` from an `<a>` tag:

```python
link = soup.find("a")
print(link["href"])
```

Or use `.get()`, which returns `None` instead of raising `KeyError` when the attribute is missing:

```python
print(link.get("href"))
```
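The difference between the two access styles matters when an attribute may be absent. A small sketch, using a made-up link with two classes and no `title` attribute:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="/page" class="btn primary">Read more</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find("a")

# Square brackets work when the attribute is present.
href = link["href"]

# .get() returns None (or a supplied default) for a missing attribute.
title = link.get("title")
title_or_empty = link.get("title", "")

# Square brackets raise KeyError for a missing attribute.
try:
    link["title"]
    raised = False
except KeyError:
    raised = True

# class is multi-valued, so it comes back as a list of names.
classes = link["class"]

print(href, title, repr(title_or_empty), raised, classes)
```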

---

## 7. Traversing the Tree

Access **parent elements**:

```python
soup.p.parent
```

Access **child nodes** (this includes text nodes such as the whitespace between tags, not only tags):

```python
list(soup.body.children)
```

Find the **next sibling**:

```python
soup.h1.find_next_sibling()
```
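One detail that often surprises people is that whitespace between tags counts as a node. The sketch below, on made-up markup, filters children down to tags and compares `find_next_sibling()` with the plain `.next_sibling` attribute.

```python
from bs4 import BeautifulSoup, Tag

html_doc = """<body>
  <h1>Title</h1>
  <p class="description">This is a paragraph.</p>
</body>"""
soup = BeautifulSoup(html_doc, "html.parser")

# .parent moves one level up the tree.
parent_name = soup.p.parent.name

# .children yields tags and the whitespace strings between them.
all_children = list(soup.body.children)

# Keep only real tags by checking the node type.
tag_children = [c.name for c in soup.body.children if isinstance(c, Tag)]

# find_next_sibling() skips text nodes and returns the next tag.
next_tag = soup.h1.find_next_sibling()

# .next_sibling returns the very next node, here a whitespace string.
next_node = soup.h1.next_sibling

print(parent_name, tag_children, next_tag.name, repr(str(next_node)))
```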

---

## 8. Handling Missing Elements Safely

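`find()` and `select_one()` return `None` when nothing matches, so accessing `.text` on the result raises `AttributeError`. Check that the element exists first: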

```python
title_tag = soup.find("h1")
if title_tag:
    print(title_tag.text)
else:
    print("Title not found")
```
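When many fields need the same check, it can be tidier to wrap it in a small function. The helper below, `text_or_default`, is a hypothetical name introduced for this sketch and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, selector, default=None):
    """Return the stripped text of the first match, or default if none.

    Hypothetical helper for illustration only.
    """
    element = parent.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)


soup = BeautifulSoup("<body><h1> Title </h1></body>", "html.parser")

title = text_or_default(soup, "h1", default="Title not found")
subtitle = text_or_default(soup, "h2", default="Subtitle not found")

print(title)
print(subtitle)
```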

---

## 9. Summary

- **BeautifulSoup** helps parse and navigate HTML easily.  
- Use `.find()`, `.find_all()`, `.select()`, and `.select_one()` to locate data.  
- Always inspect the website's structure before writing scraping logic.  
- Combine **BeautifulSoup** with `requests` for full scraping workflows.


In [None]:
from bs4 import BeautifulSoup

In [None]:
import os
import re
import subprocess

In [None]:
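# Run the requests notebook first unless its flag file shows it has already run
if not os.path.exists("requests_ran.flag"):
    print("Running requests.ipynb first...")
    # check=True raises CalledProcessError if the notebook fails to execute
    subprocess.run(["jupyter", "nbconvert", "--to", "notebook", "--execute", "requests.ipynb", "--output", "requests.ipynb"], check=True)
else:
    print("requests.ipynb already ran. Continuing...")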

In [None]:
html_dir = os.path.join(os.getcwd(), "htmls")

In [None]:
# List to store (page_number, content)
html_contents = []

# Loop over all files in htmls directory
for file in os.listdir(html_dir):
    file_path = os.path.join(html_dir, file)

    # Check if it's a .html file
    if os.path.isfile(file_path) and file.endswith(".html"):
        # Extract the page number from filename, e.g., page23.html -> 23
        match = re.search(r'page(\d+)\.html', file)
        if match:
            page_number = int(match.group(1))
            # Read file content
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            html_contents.append((page_number, content))

# Sort by page number
html_contents.sort(key=lambda x: x[0])

# Extract only the content in order
html_list = [content for _, content in html_contents]

In [None]:
import pandas as pd

# One entry per page: the list of product <article> tags found on that page
articles_list = []
for content in html_list:
    soup = BeautifulSoup(content, "html.parser")
    articles = soup.select("article.product_pod")
    articles_list.append(articles)

In [None]:
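items = []
# Map the rating word found in the CSS class to its numeric value
rating_map = {
    "One": 1,
    "Two": 2,
    "Three": 3,
    "Four": 4,
    "Five": 5,
}
for articles in articles_list:
    for article in articles:
        title = article.find("h3").find("a")["title"]
        # Keep the part of the price text after the pound sign
        price = float(article.select_one("p.price_color").text.split("£")[1])
        rating_element = article.select_one("p.star-rating")
        # The second class name on the rating tag is the rating word
        rating = rating_map[rating_element["class"][1]]
        items.append([title, price, rating])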

In [None]:
items[0]
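The extraction loop above assumes every article contains a title, a price, and a rating; a single missing tag would raise an exception and stop the run. A more defensive variant might look like the sketch below. The `parse_article` function, the `RATING_WORDS` name, and the sample markup are assumptions for illustration, modelled on the selectors used in the loop.

```python
import re

from bs4 import BeautifulSoup

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_article(article):
    """Return [title, price, rating], or None if any field is missing."""
    link = article.select_one("h3 a[title]")
    price_tag = article.select_one("p.price_color")
    rating_tag = article.select_one("p.star-rating")
    if link is None or price_tag is None or rating_tag is None:
        return None

    # Pull the first number out of the price text, ignoring the currency sign.
    price_match = re.search(r"\d+(?:\.\d+)?", price_tag.get_text())

    # Look for a rating word among the classes instead of relying on position.
    rating = next(
        (RATING_WORDS[c] for c in rating_tag.get("class", []) if c in RATING_WORDS),
        None,
    )
    if price_match is None or rating is None:
        return None
    return [link["title"], float(price_match.group()), rating]


# Made-up sample: the second article has no price or rating.
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="Sample Book">Sample ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="#" title="No Price">No ...</a></h3>
</article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = [parse_article(a) for a in sample_soup.select("article.product_pod")]
parsed_items = [row for row in rows if row is not None]
print(parsed_items)
```

Skipping incomplete rows keeps the run going, at the cost of silently dropping data, so logging the skipped articles may be worthwhile in a real scrape.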

In [None]:
df = pd.DataFrame(items, columns=["Book Title", "Price", "Rating"])

In [None]:
df

In [None]:
df.to_csv("data.csv", index=False)