# 🎓 Lesson 8: Extracting Tables and Lists

🎯 Goal

In this lesson, you’ll learn how to:
- Scrape structured data from HTML &lt;table&gt;, &lt;ul&gt;, and &lt;ol&gt; elements
- Iterate through rows and cells
- Extract content into Python-friendly formats

## ⚠️ Note for Students
❗ The earlier scrapethissite.com/pages/forms/ example uses form inputs and JavaScript logic. Beautiful Soup can’t handle such interactions directly. We’ll explore advanced techniques like form emulation and Selenium in future lessons.

## 🗂️ Section A: Extracting Data from &lt;ul&gt; and &lt;ol&gt; Lists

The scrapethissite.com pages don’t include good teaching examples of &lt;ul&gt; or &lt;ol&gt; lists, so we’ll embed a small sample block directly in the code instead.

The sample uses the same list structure you’ll meet on real pages.

### 📘 Example HTML (Simulated or Offline)

Suppose we have the following HTML from a blog or product list:

```html
<ul class="frameworks">
    <li>Beautiful Soup</li>
    <li>Scrapy</li>
    <li>Selenium</li>
</ul>
```

### ✅ Code Example: Parsing an Unordered List

In [None]:
from bs4 import BeautifulSoup

# Simulated HTML (offline or embedded)
html = '''
<ul class="frameworks">
    <li>Beautiful Soup</li>
    <li>Scrapy</li>
    <li>Selenium</li>
</ul>
'''

soup = BeautifulSoup(html, "lxml")

# Find all <li> elements inside .frameworks list
items = soup.select("ul.frameworks li")

for item in items:
    print("🧰", item.text.strip())
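
To make the "Python-friendly formats" goal concrete, here is a minimal sketch that collects the same items into a plain Python list. It assumes the sample HTML above; the variable name `frameworks` and the choice of the built-in `html.parser` (instead of `lxml`) are illustrative choices, not requirements.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="frameworks">
    <li>Beautiful Soup</li>
    <li>Scrapy</li>
    <li>Selenium</li>
</ul>
'''

# "html.parser" ships with Python, so this sketch does not depend on lxml
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the text with surrounding whitespace removed
frameworks = [li.get_text(strip=True) for li in soup.select("ul.frameworks li")]

# enumerate() adds a position, which is handy when list order matters
for position, name in enumerate(frameworks, start=1):
    print(position, name)
```

Once the items are in a list, you can sort, filter, or save them like any other Python data.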

### Where &lt;ul&gt; and &lt;ol&gt; Appear in Real Scraping

You’ll often find lists in:

- News sites (list of headlines or authors)
- Product sites (features, categories)
- Reviews and comments
- Academic or government websites

## 🗂️ Section B: Extracting Data from &lt;table&gt; Elements

For this demo, we'll use another test page:

📍 https://scrapethissite.com/pages/forms/

This page contains a table of hockey teams.

### ✅ Example: Scraping a Real HTML Table (Hockey Teams)

In [None]:
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "https://www.scrapethissite.com/pages/forms/?page_num=1"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Select every table row; this works whether or not the markup includes <tbody>
rows = soup.select("table tr")

# Loop through each row and extract data
for row in rows:
    name_cell = row.select_one("td.name")
    if name_cell is None:
        continue  # skip rows without a team name cell, such as the header row

    team_name = name_cell.text.strip()
    year = row.select_one("td.year").text.strip()
    wins = row.select_one("td.wins").text.strip()
    losses = row.select_one("td.losses").text.strip()

    print(f"🏒 Team: {team_name} | Year: {year} | Wins: {wins} | Losses: {losses}")

📌 Example Output (first two rows; values come from the live page and may change)

🏒 Team: Boston Bruins | Year: 1990 | Wins: 44 | Losses: 24

🏒 Team: Buffalo Sabres | Year: 1990 | Wins: 31 | Losses: 30

### row.select_one("td.classname") – When and Why to Use It

If the `<td>` element you're targeting has a CSS class, you can use `select_one()` with a CSS selector for cleaner and more readable code.

✅ Example HTML:

In [None]:
<tr>
  <td class="team-name">Arsenal</td>
  <td class="year">1886</td>
</tr>

✅ Two Ways to Get the Year

1️⃣ Using .find() with class_:

In [None]:
year = row.find("td", class_="year").text.strip()

2️⃣ Using .select_one() with a CSS selector (recommended for clarity):

In [None]:
year = row.select_one("td.year").text.strip()

✅ Best for consistency if you're already using .select() elsewhere in your code
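
One thing to watch with both approaches: `find()` and `select_one()` return `None` when nothing matches, so chaining `.text` onto a missing cell raises an `AttributeError`. The sketch below shows one way to guard against that. The helper `cell_text` and the second row ("Example FC") are hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: the second one has no <td class="year"> cell
html = '''
<table>
  <tr><td class="team-name">Arsenal</td><td class="year">1886</td></tr>
  <tr><td class="team-name">Example FC</td></tr>
</table>
'''

soup = BeautifulSoup(html, "html.parser")


def cell_text(row, selector, default=""):
    """Return the stripped text of the first match, or default if none."""
    cell = row.select_one(selector)
    return cell.get_text(strip=True) if cell is not None else default


years = [cell_text(row, "td.year", default="unknown") for row in soup.select("tr")]
print(years)
```

A small helper like this keeps the loop body short and avoids a crash halfway through a long table.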

⚠️ What If There's No Class or ID?
Then you’ll need to fall back to `find_all()` and use indexing:

In [None]:
columns = row.find_all("td")
year = columns[1].text.strip()
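
Hard-coded indexes like `columns[1]` break if the column order changes. As a rough sketch, assuming the table has a header row of `<th>` cells, you can look up the index from the header text instead. The sample table below is hypothetical and has no classes or IDs.

```python
from bs4 import BeautifulSoup

# Hypothetical table with no class or id attributes on its cells
html = '''
<table>
  <tr><th>Team</th><th>Year</th></tr>
  <tr><td>Arsenal</td><td>1886</td></tr>
  <tr><td>Everton</td><td>1878</td></tr>
</table>
'''

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# The header row tells us which position each column occupies
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
year_index = headers.index("Year")

years = []
for row in rows[1:]:  # skip the header row itself
    columns = row.find_all("td")
    years.append(columns[year_index].get_text(strip=True))

print(years)
```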

💡 Pro Tip:

`select_one()` works much like `querySelector()` in JavaScript: it takes a CSS selector and returns the first match. If you know how to select elements in browser DevTools, most of the same selector syntax works here.

Summary Table

| Selector                | Purpose                                |
| ----------------------- | -------------------------------------- |
| `td.classname`          | A `<td>` tag with a specific class     |
| `td#idname`             | A `<td>` tag with a specific ID        |
| `tr > td.classname`     | A direct child `td` of `tr` with class |
| `div.container td.year` | Any `td.year` inside `.container` div  |
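
To see the four selectors from the table in action, here is a small sketch run against a made-up snippet. The `container` div and the `first-team` ID are assumptions added so that every selector has something to match.

```python
from bs4 import BeautifulSoup

# Made-up snippet that gives each selector in the table something to match
html = '''
<div class="container">
  <table>
    <tr>
      <td class="team-name" id="first-team">Arsenal</td>
      <td class="year">1886</td>
    </tr>
  </table>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

by_class = soup.select_one("td.year").get_text(strip=True)
by_id = soup.select_one("td#first-team").get_text(strip=True)
direct_child = soup.select_one("tr > td.team-name").get_text(strip=True)
descendant = soup.select_one("div.container td.year").get_text(strip=True)

print(by_class, by_id, direct_child, descendant)
```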


## Practice Tasks

1. Filter and print only rows where year == "1990".
2. Convert wins to integers and find the average.
3. Store the data in a list of dictionaries and print the first 5 entries.
4. Create your own HTML block with an `<ol>` list of programming languages and scrape them.
5. Visit a real site (e.g. https://quotes.toscrape.com) and try scraping the list of tags at the bottom.

## 🔜 Next up: Lesson 9 – Advanced Searching (Regex, Lambda, Attribute Filters)

We’ll learn `.find_all()` with `attrs`, `string=`, `text=`, regex, and even lambdas to filter data powerfully.