# 🎓 Lesson 9: Advanced Searching (Regex, Lambda, Attribute Filters)

🎯 Goal

In this lesson, you'll learn how to:
1. Use `attrs={}` to filter by HTML attributes
2. Use regex to match text patterns
3. Use lambda functions to apply custom filters
4. Search by text with `string=` (and its older alias `text=`)

💻 We'll Use:

📍 https://scrapethissite.com/pages/forms/

This page has a structured table of hockey teams, which makes it a good target for advanced searches.

## ✅ Section A: Filtering with `attrs={}`

In [None]:
from bs4 import BeautifulSoup
import requests

url = "https://scrapethissite.com/pages/forms/"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# 🔎 All elements with specific class using attrs
rows = soup.find_all("tr", attrs={"class": "team"})

print(f"Found {len(rows)} rows with class='team'")

The CSS selector version gives the same result:

In [None]:
rows = soup.select("tr.team")
print(f"Found {len(rows)} rows with class='team'")
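`attrs={}` is most useful for attributes that can't be written as keyword arguments, such as `data-*` attributes, because a hyphen is not valid in a Python argument name. The sketch below runs on a small hand-written HTML snippet rather than the live page, and the `data-conference` attribute is hypothetical, added only for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; data-conference is a made-up attribute
html = """
<table>
  <tr class="team" data-conference="east"><td class="name">Boston Bruins</td></tr>
  <tr class="team" data-conference="west"><td class="name">Calgary Flames</td></tr>
  <tr class="header"><th>Team Name</th></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

# class_ is a keyword shortcut for attrs={"class": ...}
by_keyword = demo.find_all("tr", class_="team")
by_attrs = demo.find_all("tr", attrs={"class": "team"})
print(len(by_keyword), len(by_attrs))

# Hyphenated attribute names have to go through attrs
east_rows = demo.find_all("tr", attrs={"class": "team", "data-conference": "east"})
east_names = [row.td.text for row in east_rows]
print(east_names)
```

Every key in the `attrs` dictionary must match, so adding more keys narrows the result.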

## ✅ Section B: Searching with `string=` or `text=`

In [None]:
# Look for <td> elements whose string is exactly "Boston Bruins"
team_elements = soup.find_all("td", string="Boston Bruins")  # returns [] on this page

for team in team_elements:
    print("Found exact match:", team.text)

print(team_elements)

### 💡 Why does `soup.find_all("td", string="Boston Bruins")` return an empty list?

Although this line looks correct:

In [17]:
team_elements = soup.find_all("td", string="Boston Bruins")

…it often returns nothing. Why?

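### 🔍 The Problem with `.string`

When you pass `string=` together with a tag name, Beautiful Soup compares your value against each tag's `.string`. That property is only set when the tag has exactly one child (a single text node, or a single tag that itself has one string). The comparison is exact, so every space and line break counts.

On the page https://scrapethissite.com/pages/forms/, the HTML structure looks like this: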

In [None]:
<td class="name"> Boston Bruins </td>

Or sometimes like:

In [None]:
<td class="name">
    Boston Bruins
</td>

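In these cases:

- ✅ If the `<td>` contains only a single text node (a `NavigableString`), `.string` returns that text, including any surrounding spaces and line breaks

- ❌ If the tag has more than one child, for example text mixed with nested tags, `.string` is `None`

Here each `<td>` does have a single text node, but it is padded with whitespace, so the exact comparison against `"Boston Bruins"` never succeeds. `find_all()` returns an empty list instead of raising an error, which is why your code fails silently.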
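To see what `.string` actually holds in each case, here is a minimal sketch on three hand-written `<td>` snippets (not the live page). `td_string` is a hypothetical helper defined only for this demo.

```python
from bs4 import BeautifulSoup

def td_string(html):
    """Return the .string of the first <td> in a small HTML snippet."""
    return BeautifulSoup(html, "html.parser").td.string

plain = td_string('<td class="name">Boston Bruins</td>')
padded = td_string('<td class="name">\n    Boston Bruins\n</td>')
mixed = td_string('<td class="name">Boston <b>Bruins</b></td>')

print(repr(plain))   # text only, no padding
print(repr(padded))  # same text, but the line breaks and spaces are kept
print(repr(mixed))   # None, because the tag has two children

# The exact comparison sees the padding, so nothing matches
padded_soup = BeautifulSoup('<td class="name">\n    Boston Bruins\n</td>', "html.parser")
exact_matches = padded_soup.find_all("td", string="Boston Bruins")
print(exact_matches)
```

The padded case is the one that applies to the hockey table: the text is there, it just isn't equal to `"Boston Bruins"` character for character.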

### ✅ Solution 1: Use a lambda with `.strip()`

The function passed to `string=` receives each tag's `.string`, so you can strip the whitespace yourself before comparing:

In [None]:
team_elements = soup.find_all(
    "td",
    class_="name",
    string=lambda text: text and text.strip() == "Boston Bruins"
)

for team in team_elements:
    print("✅ Match found:", team.text.strip())

This works more reliably because `strip()` removes leading and trailing spaces and line breaks before the comparison. The `text and ...` check skips tags whose `.string` is `None`.
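A function can also be passed as the first argument to `find_all()`, in which case it receives each `Tag` instead of a string. The sketch below filters whole rows by a numeric cell. It uses hand-written sample rows, and the win counts are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample rows; the win counts are made up for illustration
html = """
<table>
  <tr class="team"><td class="name"> Boston Bruins </td><td class="wins"> 44 </td></tr>
  <tr class="team"><td class="name"> Buffalo Sabres </td><td class="wins"> 31 </td></tr>
  <tr class="team"><td class="name"> Calgary Flames </td><td class="wins"> 46 </td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

# Keep only <tr> tags that have a wins cell holding a number >= 40
strong_rows = demo.find_all(
    lambda tag: tag.name == "tr"
    and tag.find("td", class_="wins") is not None
    and int(tag.find("td", class_="wins").text.strip()) >= 40
)

strong_names = [row.find("td", class_="name").text.strip() for row in strong_rows]
print(strong_names)
```

The lambda is called for every tag in the document, so the `tag.name == "tr"` check comes first to skip everything else cheaply.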

### ✅ Solution 2: Use Regex Matching

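This approach is more flexible, especially for case-insensitive or partial matches. Beautiful Soup calls the pattern's `search()` method on each tag's `.string`, so the surrounding whitespace does not get in the way: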

In [None]:
import re

team_elements = soup.find_all(
    "td",
    class_="name",
    string=re.compile(r"\bBoston Bruins\b")
)

for team in team_elements:
    print("🔎 Regex Match:", team.text.strip())
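For a case-insensitive or partial match, the same call works with a compiled pattern that carries `re.IGNORECASE`. This is a small sketch on hand-written markup that imitates the padded cells of the real table.

```python
import re
from bs4 import BeautifulSoup

# Hand-written sample markup with the same whitespace padding as the real page
html = """
<table>
  <tr><td class="name">
      Boston Bruins
  </td></tr>
  <tr><td class="name">
      Buffalo Sabres
  </td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

# A lowercase, partial pattern still finds "Boston Bruins"
pattern = re.compile(r"bruins", re.IGNORECASE)
matches = demo.find_all("td", class_="name", string=pattern)
matched_names = [td.text.strip() for td in matches]
print(matched_names)
```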

### 📌 About `.string`

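Remember how `tag.string` behaves:

| Behavior          | Description                                                                                             |
| ----------------- | ------------------------------------------------------------------------------------------------------- |
| ✅ Returns text    | When the tag has exactly one child: a text node (whitespace included) or a tag that itself has one string |
| ❌ Returns `None`  | When the tag has several children, such as text mixed with child tags                                   |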


### ✅ Summary

Instead of using:

In [None]:
soup.find_all("td", string="Boston Bruins")

It’s more reliable to use:

- A lambda function that calls `.strip()` on the string for exact cleaned matches

- A `re.compile()` regex for flexible text pattern matching

## Practice Tasks

1. Use regex to find all teams whose names start with `C`.

2. Use lambda to find teams with fewer than 20 losses.

3. Try filtering rows where team name contains “Leafs” (case-insensitive).

## 🔜 Next up: Lesson 10 – Pagination and Multi-page Scraping

We’ll follow “Next” buttons and scrape across multiple pages, an essential real-world skill!