# 🎓 Lesson 15: Regex in BeautifulSoup

🎯 Goal

In this lesson, you’ll learn how to:

- Use `re.compile()` to match flexible text and attribute patterns

- Search for elements with partially known or variable content

- Combine BeautifulSoup + Regex for more powerful scraping


## 💻 Practice Site:

📍 https://quotes.toscrape.com/

Perfect for testing flexible text patterns like matching authors or quote content.

## ✅ Step 1: Basic Regex Matching with string=re.compile(...)

```python
import re

import requests
from bs4 import BeautifulSoup

# Load the page
url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on an HTTP error
soup = BeautifulSoup(response.text, "lxml")  # Requires the lxml package

# Find all quotes containing "life" in any letter case
# (this also matches longer words such as "lifetime")
pattern = re.compile("life", re.IGNORECASE)

quotes = soup.find_all("span", class_="text", string=pattern)

for quote in quotes:
    print("Matched Quote:", quote.text.strip())
```

## Explanation

| Concept              | Description                                                          |
| -------------------- | -------------------------------------------------------------------- |
| `re.compile("life")` | Compiles a regex pattern that matches "life" anywhere in the text    |
| `re.IGNORECASE`      | Makes the search case-insensitive                                    |
| `string=pattern`     | Tests the regex against the tag’s `.string` (its single text child)  |
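
One detail of `string=` is easy to miss: a tag’s `.string` is `None` when the tag has more than one child, so such a tag never matches, even if its visible text contains the word. The sketch below illustrates this on a small invented HTML snippet (not taken from the practice site) and uses the built-in `html.parser` so it runs without `lxml`. The fallback of filtering on `get_text()` is one possible workaround, not the only one.

```python
import re

from bs4 import BeautifulSoup

# Invented markup: the second span wraps part of its text in <em>.
html = """
<span class="text">Life is what happens.</span>
<span class="text">A <em>good</em> life is enough.</span>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("life", re.IGNORECASE)

# string= tests the pattern against tag.string, which is None for the
# second span because it has several children.
by_string = soup.find_all("span", class_="text", string=pattern)

# Fallback: search the full text of each candidate tag instead.
by_text = [
    span
    for span in soup.find_all("span", class_="text")
    if pattern.search(span.get_text())
]

print(len(by_string))
print(len(by_text))
```

On quotes.toscrape.com the quote spans shown in Step 1 hold plain text only, which is why `string=pattern` works there.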


## ✅ Step 2: Match Elements with Part of a Class or Attribute

Let’s say you want to match tags with any class name that includes "author".

```python
# Match <small> tags where any class name contains "author"
author_tags = soup.find_all("small", class_=re.compile("author"))

for tag in author_tags:
    print("Author:", tag.text.strip())
```
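
BeautifulSoup applies a compiled pattern with `search()`, so an unanchored pattern matches anywhere inside a class name, and each class on a multi-class tag is tested separately. The following sketch compares a loose and an anchored pattern. The markup and class names such as `quote-author` and `byline` are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Invented markup; only the class names matter here.
html = """
<small class="author">Albert Einstein</small>
<small class="quote-author featured">Jane Austen</small>
<small class="byline">Unknown</small>
"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches when any single class contains "author".
loose = soup.find_all("small", class_=re.compile("author"))

# Anchored: matches only when one class is exactly "author".
strict = soup.find_all("small", class_=re.compile(r"^author$"))

loose_names = [tag.get_text() for tag in loose]
strict_names = [tag.get_text() for tag in strict]
print(loose_names)
print(strict_names)
```

Anchoring with `^` and `$` is worth considering whenever a loose pattern could pick up unrelated classes.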

You can do the same for IDs, href, src, or other attributes:

```python
# Find all <a> tags where href ends in "/login"
login_links = soup.find_all("a", href=re.compile("/login$"))

for link in login_links:
    print("Login link found:", link.get("href"))
```
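
The same keyword-argument style works for `src` and `id`. As a rough sketch, the snippet below runs three attribute patterns against a small invented page. The paths and IDs are assumptions made up for the example, not values guaranteed to exist on the practice site.

```python
import re

from bs4 import BeautifulSoup

# Invented markup with a mix of links and images.
html = """
<a href="/login">Login</a>
<a href="/author/Albert-Einstein">(about)</a>
<a href="/tag/life/page/1/">life</a>
<img src="/static/logo.png" id="site-logo">
<img src="/static/banner.jpg" id="banner">
"""
soup = BeautifulSoup(html, "html.parser")

# href starts with "/author/"
author_links = soup.find_all("a", href=re.compile(r"^/author/"))

# src ends with ".png" (the dot is escaped so it matches literally)
png_images = soup.find_all("img", src=re.compile(r"\.png$"))

# id contains "logo", on any tag
logo_tags = soup.find_all(id=re.compile("logo"))

print([a["href"] for a in author_links])
print([img["src"] for img in png_images])
print([tag["id"] for tag in logo_tags])
```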

## ✅ Step 3: Match Complex Quote Texts (Multiple Words)

```python
# Match quotes that contain either "truth" or "reality"
keywords = re.compile("truth|reality", re.IGNORECASE)

matches = soup.find_all("span", class_="text", string=keywords)

for quote in matches:
    print("Match:", quote.text.strip())
```
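
The `|` operator means "either one". If you need quotes that contain both words, one common approach is to chain lookaheads, which each check for a word without consuming any text. The sketch below contrasts the two patterns on invented sample sentences (not real quotes from the site).

```python
import re

from bs4 import BeautifulSoup

# Invented sample sentences for illustration.
html = """
<span class="text">The truth is rarely simple.</span>
<span class="text">Reality can be surprising.</span>
<span class="text">Truth and reality are not the same.</span>
<span class="text">Be yourself.</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Either word is enough.
either = re.compile(r"truth|reality", re.IGNORECASE)

# Both words required: each (?=.*word) lookahead must succeed from the start.
# re.DOTALL lets "." cross line breaks inside longer texts.
both = re.compile(r"^(?=.*truth)(?=.*reality)", re.IGNORECASE | re.DOTALL)

either_matches = soup.find_all("span", class_="text", string=either)
both_matches = soup.find_all("span", class_="text", string=both)

print(len(either_matches))
print([span.get_text() for span in both_matches])
```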

## 🔧 Common Regex Patterns

| Pattern  | Meaning                                       | Example         |
| -------- | --------------------------------------------- | --------------- |
| `^text`  | Starts with `text`                            | `^Albert`       |
| `text$`  | Ends with `text`                              | `Einstein$`     |
| `.*`     | Zero or more of any character except newline  | `.*quote.*`     |
| `a\|b`   | Match either `a` or `b`                       | `life\|truth`   |
| `[A-Z]`  | Any uppercase letter                          | `[A-Z][a-z]+`   |

Regex allows you to write dynamic scrapers that don’t rely on exact text.
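
To see how the example patterns behave before using them in a scraper, you can try them with the `re` module alone. This is a small sketch on invented sample strings. It uses `re.search()`, the same kind of matching BeautifulSoup applies to compiled patterns.

```python
import re

# Invented sample strings for illustration.
samples = ["Albert Einstein", "Jane Austen", "a quote about life", "the truth"]

patterns = [r"^Albert", r"Einstein$", r".*quote.*", r"life|truth", r"[A-Z][a-z]+"]

# For each pattern, collect the samples it matches somewhere in the string.
results = {
    pattern: [s for s in samples if re.search(pattern, s)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(f"{pattern!r:16} -> {matched}")
```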

## Practice Tasks

1. Find quotes where the author’s name starts with “A”.

2. Find quote texts that contain punctuation like “!” or “?”.

3. Match `<a>` tags where `href` contains the word `"tag"`.

## 🔜 Next up: Lesson 16: Scraping JavaScript-rendered Sites (Intro to Selenium)

Learn how to handle websites that generate content with JavaScript, which `requests` cannot see because it does not run scripts.