# Assignment 1: Thailand Yellow Pages Scraping
- Developer: **Patiparn Nualchan**
- Application Position: **Data Scientist/AI Engineer**
- Submission Date: 2025-12-13
---

**Objective:**
- To scrape business listings from the Thailand Yellow Pages website for any business category of your **choice**

**Design, Challenges, and Goals:**
- Chosen business category: **SPORT**
- Extract structured data from unstructured web content, following the given schema
- Handle dynamic pagination and variable data quality
- Build a scalable, maintainable extraction system
- Produce data for AI applications such as a RAG chatbot (Scraped Opportunity)
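
The pagination goal above is handled in the scraper further down by tolerating gaps: it keeps requesting pages until three consecutive pages come back empty. The stopping rule on its own can be sketched as follows. This is a minimal illustration, not the scraper itself; `fetch_page` is a hypothetical callable standing in for the real crawler call, and `fake_site` is made-up data.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[str]], max_empty: int = 3) -> List[str]:
    """Collect items page by page, stopping after `max_empty` consecutive empty pages.

    `fetch_page` is a hypothetical callable returning the items found on one page.
    """
    items: List[str] = []
    page = 1
    empty_streak = 0
    while empty_streak < max_empty:
        found = fetch_page(page)
        if found:
            empty_streak = 0  # a non-empty page resets the streak
            items.extend(found)
        else:
            empty_streak += 1  # tolerate a gap instead of stopping immediately
        page += 1
    return items


# Made-up site: page 2 is a gap, pages 1 and 3 have listings, the rest are empty.
fake_site = {1: ["a", "b"], 3: ["c"]}
result = collect_pages(lambda p: fake_site.get(p, []))
print(result)  # ['a', 'b', 'c']
```

With `max_empty=1` the loop would stop at the gap on page 2 and miss page 3, which is the case the tolerance is meant to cover.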

**Result Summary**
- ✅ Extracted ≥2,500 **SPORT** business records
- ✅ All fields in the given schema are filled, plus one extra field: **Business Description**
- ✅ Reproducible and maintainable code


---
**Architecture options considered**
- **Traditional**: Requests + BeautifulSoup ❌ Concern: HTML fetching failures
- **AI scraper**: Browser-Use or LLM extraction ❌ Concerns: hallucination, time, and cost
- **✅ Hybrid**: Crawl4AI + BeautifulSoup

    A balance between the two:
    - Crawl4AI provides a real browser environment (handles JavaScript and site security checks).
    - BeautifulSoup provides the extraction logic (rule-based parsing that is fast, deterministic, and has no per-call cost).

**Scraping Result**
- The scraped SPORT category data, structured as a pandas DataFrame, is available here:
- https://docs.google.com/spreadsheets/d/1UgECbhKtLRo1-wGDeJhzF9cLGcG-af8Q/edit?usp=drive_link&ouid=117972368303206174757&rtpof=true&sd=true

---
# Code Section

In [None]:
!pip install pandas beautifulsoup4 crawl4ai openai requests
!playwright install-deps

## Import libraries

In [6]:
# Only the modules used by the cells below are imported
import asyncio
import random

import pandas as pd
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

## List all main categories

In [None]:
BASE_URL = "https://www.yellowpages.co.th"

In [None]:
async def get_main_categories():
    """
    Fetch and print main categories from the Thailand Yellow Pages website.
    """
    url = f"{BASE_URL}/category"
    print(f"Fetching categories from: {url}")
    
    async with AsyncWebCrawler(verbose=False) as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)

    # Stop early if the page could not be fetched
    if not result.success:
        print(f"Error fetching page: {result.error_message}")
        return

    soup = BeautifulSoup(result.html, 'html.parser')
    links = soup.find_all('a', href=True)
    cats = []
    seen = set()

    for link in links:
        href = link['href']
        text = link.get_text(strip=True)

        # Keep only category links, skipping the "All Category" entry and duplicates
        if href.startswith('/category/') and text and "All Category" not in text:
            if text not in seen:
                cats.append(text)
                seen.add(text)
    
    print(f"\n--- Found {len(cats)} Categories ---")
    for i, name in enumerate(cats, 1):
        print(f"{i}. {name}")

await get_main_categories()

Fetching categories from: https://www.yellowpages.co.th/category



--- Found 17 Categories ---
1. บ้านและที่อยู่อาศัย (33,150)
2. สำนักงานและบริการทางธุรกิจ (43,272)
3. กีฬา (2,676)
4. ยานยนต์ (34,230)
5. บริการจัดเลี้ยงและงานพิธีต่างๆ (15,726)
6. การศึกษา (11,147)
7. แม่และเด็ก (2,435)
8. บันเทิงนันทนาการและงานอดิเรก (4,795)
9. ท่องเที่ยว (16,263)
10. อุตสาหกรรมและเกษตรกรรม (42,024)
11. สุขภาพและความงาม (20,095)
12. แฟชั่นและเครื่องสำอาง (17,776)
13. ไลฟ์สไตล์ (18,565)
14. เทคโนโลยีและไอที (9,438)
15. ก่อสร้างและงานตกแต่ง (52,537)
16. หน่วยงานราชการและองค์กร (19,202)
17. ธนาคารและสถาบันการเงิน (11,779)
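
Each label printed above combines the category name with its listing count, while the next cell needs the bare name. The small helper below splits the two; the count is also a convenient target to compare against the number of rows eventually scraped. It is a sketch that assumes every label ends in a `(1,234)`-style count, as in the output above; `split_category_label` is a hypothetical name, not part of the notebook.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "(2,676)"-style count at the end of a label.
_COUNT_RE = re.compile(r"^(?P<name>.*?)\s*\((?P<count>\d[\d,]*)\)\s*$")


def split_category_label(label: str) -> Tuple[str, Optional[int]]:
    """Split 'name (1,234)' into ('name', 1234); return (label, None) if no count is found."""
    match = _COUNT_RE.match(label)
    if not match:
        return label.strip(), None
    return match.group("name"), int(match.group("count").replace(",", ""))


print(split_category_label("กีฬา (2,676)"))  # ('กีฬา', 2676)
```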


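## Choose category to scrape

In [None]:
# Manually chosen main category: type the name exactly as listed above, without the count.
# Note: the variable is named `choosen_subcategory`, but it holds the main category.
choosen_subcategory = "กีฬา"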

## Get Subcategories

In [None]:
async def get_subcategories():
    """
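    Extract all subcategories of the chosen category (SPORT here) using BeautifulSoup.
    Returns a list of {"name", "url"} dicts, or an empty list if the page fails to load.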
    """
    url = f"{BASE_URL}/category/{choosen_subcategory}"
    
    print("Getting subcategories...")
    
    async with AsyncWebCrawler(verbose=False) as crawler:
        # get the HTML
        result = await crawler.arun(url=url, bypass_cache=True)
        
    if not result.success:
        print(f"Error fetching page: {result.error_message}")
        return []

    # BeautifulSoup, Parse HTML
    soup = BeautifulSoup(result.html, 'html.parser')
    
    # Find all links that contain '/heading/'
    # These are the subcategories, e.g. "ค่ายซ้อมมวย" (boxing camps) and "ลานสเกต" (skating rinks)
    links = soup.find_all('a', href=True)
    
    subcats = []
    seen_urls = set()
    
    for link in links:
        href = link['href']
        text = link.get_text(strip=True)
        
        # Check if it looks like a category link
        if '/heading/' in href and text:
            # Make relative links absolute
            if not href.startswith('http'):
                full_url = f"{BASE_URL}{href}"
            else:
                full_url = href
                
            if full_url not in seen_urls:
                subcats.append({"name": text, "url": full_url})
                seen_urls.add(full_url)
    
    print(f"Found {len(subcats)} subcategories")
    return subcats

# Run and show first 5 subcategories
subcats = await get_subcategories()

print(f"\nFirst 5 subcategories (Total {len(subcats)}):")
for i, sc in enumerate(subcats[:5], 1):
    print(f"{i}. {sc['name']}")
    print(f"   {sc['url']}")

Getting subcategories...


Found 58 subcategories

First 5 subcategories (Total 58):
1. ค่ายซ้อมมวย
   https://www.yellowpages.co.th/heading/%E0%B8%84%E0%B9%88%E0%B8%B2%E0%B8%A2%E0%B8%8B%E0%B9%89%E0%B8%AD%E0%B8%A1%E0%B8%A1%E0%B8%A7%E0%B8%A2
2. บิลเลียด สนุกเกอร์ พูล โกล์
   https://www.yellowpages.co.th/heading/%E0%B8%9A%E0%B8%B4%E0%B8%A5%E0%B9%80%E0%B8%A5%E0%B8%B5%E0%B8%A2%E0%B8%94%20%E0%B8%AA%E0%B8%99%E0%B8%B8%E0%B8%81%E0%B9%80%E0%B8%81%E0%B8%AD%E0%B8%A3%E0%B9%8C%20%E0%B8%9E%E0%B8%B9%E0%B8%A5%20%E0%B9%82%E0%B8%81%E0%B8%A5%E0%B9%8C
3. ผู้รับเหมาสร้างสระว่ายน้ำ
   https://www.yellowpages.co.th/heading/%E0%B8%9C%E0%B8%B9%E0%B9%89%E0%B8%A3%E0%B8%B1%E0%B8%9A%E0%B9%80%E0%B8%AB%E0%B8%A1%E0%B8%B2%E0%B8%AA%E0%B8%A3%E0%B9%89%E0%B8%B2%E0%B8%87%E0%B8%AA%E0%B8%A3%E0%B8%B0%E0%B8%A7%E0%B9%88%E0%B8%B2%E0%B8%A2%E0%B8%99%E0%B9%89%E0%B8%B3
4. ลานสเกต
   https://www.yellowpages.co.th/heading/%E0%B8%A5%E0%B8%B2%E0%B8%99%E0%B8%AA%E0%B9%80%E0%B8%81%E0%B8%95
5. ศูนย์ออกกำลังกาย
   https://www.yellowpages.co.th/heading/%E0%B8%A8%E0%B8
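
The long `%E0%B8...` sequences in these URLs are the Thai subcategory names in percent-encoded UTF-8. The standard library converts in both directions, which helps when logging readable names or rebuilding a URL from a name. The helper names below are hypothetical, and the sketch assumes the `/heading/<name>` pattern seen in the output holds for every subcategory.

```python
from urllib.parse import quote, unquote, urljoin

BASE_URL = "https://www.yellowpages.co.th"


def heading_url(name: str) -> str:
    """Build an absolute, percent-encoded /heading/ URL from a subcategory name."""
    return urljoin(BASE_URL, "/heading/" + quote(name))


def heading_name(url: str) -> str:
    """Recover the readable subcategory name from a /heading/ URL."""
    return unquote(url.rsplit("/heading/", 1)[-1])


url = heading_url("ลานสเกต")
print(url)
print(heading_name(url))  # ลานสเกต
```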

## Run Scraping

In [None]:
async def scraper():
    """
    Scrape every subcategory of the chosen category and extract structured records,
    following the given schema, from the unstructured profile pages.
    """
    # Reuse subcats
    val_subcats = subcats if 'subcats' in globals() else await get_subcategories()
        
    print(f"Starting V5 SUPER ROBUST Scrape for {len(val_subcats)} subcategories...")
    print("Features: Gap Tolerance (3 pages), Strict Verification (4 retries), Unlimited Pages.")
    
    all_data = []
    
    async with AsyncWebCrawler(verbose=False) as crawler:
        
        for i, sub in enumerate(val_subcats):
            sub_name = sub['name']
            sub_url = sub['url']
            print(f"\n[{i+1}/{len(val_subcats)}] Subcategory: {sub_name}")
            
            page = 1
            empty_streak = 0  # Count of consecutive empty pages
            
            while True:
                # Polite delay between pages
                await asyncio.sleep(random.uniform(1.0, 2.0))
                
                target_url = sub_url if page == 1 else f"{sub_url}?page={page}"
                
                # --- RETRY LOGIC for PAGE FETCH (3 attempts) ---
                page_soup = None
                for attempt in range(3):
                    try:
                        res = await crawler.arun(url=target_url, bypass_cache=True)
                        if not res.html: raise Exception("Empty HTML")
                        page_soup = BeautifulSoup(res.html, 'html.parser')
                        break
                    except Exception as e:
                        print(f"   [Page {page}] Load Retry {attempt+1}... ({e})")
                        await asyncio.sleep(3)
                
                # If page load failed completely
                if not page_soup:
                    print(f"   [Page {page}] FAILED to load. Skipping.")
                    empty_streak += 1
                    if empty_streak >= 3:
                        print(f"   Stopped at Page {page} (3 empty pages in a row).")
                        break
                    page += 1
                    continue

                # Check listings
                titles = page_soup.find_all('div', class_='yp-listing-title')
                
                if not titles:
                    # NO ITEMS FOUND
                    print(f"   [Page {page}] 0 items found.")
                    empty_streak += 1
                    if empty_streak >= 3:
                        print(f"   Stopped at Page {page} (3 empty pages in a row).")
                        break
                    page += 1
                    continue
                else:
                    # ITEMS FOUND - Reset Streak!
                    empty_streak = 0
                    print(f"   [Page {page}] Listing {len(titles)} items...")
                
                # Extract profile URLs from the listing titles
                p_urls = []
                for div in titles:
                    h3 = div.find('h3')
                    a = h3.find('a') if h3 else None  # guard against titles without an <h3>
                    if a and a.get('href'):
                        href = a['href'] if a['href'].startswith('http') else BASE_URL + a['href']
                        p_urls.append(href)
                
                # --- SCRAPE DETAILS for each item with STRICT RETRY ---
                for p_idx, p_url in enumerate(p_urls):
                    # Profile Retry Loop (4 attempts)
                    success = False
                    for p_attempt in range(4):
                        try:
                            # Delay to be polite and avoid blocks
                            await asyncio.sleep(random.uniform(0.5, 1.2))
                            
                            res_p = await crawler.arun(url=p_url, bypass_cache=True)
                            soup = BeautifulSoup(res_p.html, 'html.parser')
                            
                            # VALIDATION: Must have a Name (H1)
                            h1 = soup.find('h1')
                            if not h1:
                                raise Exception("Missing H1 Name")
                                
                            # If we get here, the page loaded correctly!
                            name = h1.get_text(strip=True)
                            
                            # B. Address
                            address = "Unknown"
                            addr_label = soup.find('strong', string=lambda t: t and "ที่อยู่" in t)
                            if addr_label and addr_label.parent:
                                val = addr_label.parent.find_next_sibling('div')
                                if val: address = val.get_text(strip=True)
                                
                            # C. Phone
                            phone = "No Phone"
                            l_ph = soup.find('a', href=lambda h: h and h.startswith('tel:'))
                            if l_ph: phone = l_ph.get_text(strip=True)
                            
                            # D. Map
                            map_link = "No Map"
                            l_map = soup.find('a', string=lambda t: t and "นำทาง" in t)
                            if not l_map: l_map = soup.find('a', href=lambda h: h and 'google.com/maps' in h)
                            # Use .get() so an anchor without an href does not raise KeyError
                            if l_map: map_link = l_map.get('href', map_link)
                            
                            # E. Description
                            desc = "No Description"
                            dh = soup.find(string=lambda t: t and "สินค้าและบริการ" in t)
                            if dh and dh.parent:
                                cont = dh.parent
                                for sib in cont.find_next_siblings():
                                    txt = sib.get_text(strip=True)
                                    if txt and "Share" not in txt and len(txt) > 10:
                                        desc = txt
                                        break
                                        
                            all_data.append([sub_name, name, address, phone, map_link, desc, p_url])
                            success = True
                            break # Break retry loop
                            
                        except Exception as e:
                            # Report only on the last attempt; otherwise wait and retry
                            if p_attempt == 3:
                                print(f"     x Skipping Item: {p_url} (failed 4 attempts: {e})")
                            else:
                                await asyncio.sleep(2) # Wait before retry
                    
                    # Optional: Progress dot
                    # if p_idx % 5 == 0: print(".", end="")
                
                # Next Page
                page += 1
        
    # Save
    cols = ["Subcategory", "Name", "Address", "Phone", "Map", "Description", "Profile URL"]
    df = pd.DataFrame(all_data, columns=cols)
    filename = f"yellowpages_{choosen_subcategory}.csv"
    df.to_csv(filename, index=False, encoding='utf-8-sig')
    
    print("\n==========================================")
    print(f"DONE! Saved {len(df)} rows to {filename}")
    print("==========================================")
    return df

# RUN scraper()
df = await scraper()
df.head()
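
The scraper repeats the same try / wait / retry pattern twice: three attempts for listing pages and four for profile pages. One way to factor that pattern out is a small generic helper, sketched below. `with_retries` and `flaky` are hypothetical names used only for illustration, and the demo does not touch the network.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def with_retries(
    action: Callable[[], Awaitable[T]],
    attempts: int = 4,
    delay: float = 2.0,
) -> Optional[T]:
    """Await `action()` up to `attempts` times; return None if every attempt raises."""
    for attempt in range(1, attempts + 1):
        try:
            return await action()
        except Exception as exc:
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                await asyncio.sleep(delay)  # wait before the next attempt
    return None


# Demo with a hypothetical flaky action that succeeds on its third call.
calls = {"n": 0}


async def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Missing H1 Name")
    return "ok"


outcome = asyncio.run(with_retries(flaky, attempts=4, delay=0))
print(outcome)  # ok
```

Inside a notebook cell, where an event loop is already running, the last call would be `await with_retries(flaky, attempts=4, delay=0)` rather than `asyncio.run(...)`.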

---
# Summary
- ✅ Implemented a **hybrid approach**: Crawl4AI + BeautifulSoup
- ✅ Extracted data from yellowpages.co.th for the chosen category **SPORT**
- ✅ Saved the results to a CSV file
- ⏩️ Data is ready for AI applications such as a RAG chatbot
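
As one possible next step toward the RAG use case, each row can be rendered as a short text chunk for embedding. The sketch below assumes the column names written by the scraper and drops the placeholder values it uses for missing fields (`Unknown`, `No Phone`, `No Map`, `No Description`). `row_to_document` is a hypothetical helper, and the example row is invented for illustration, not taken from the scraped data.

```python
from typing import Dict

# Placeholder values the scraper writes when a field is missing.
PLACEHOLDERS = {"Unknown", "No Phone", "No Map", "No Description"}


def row_to_document(row: Dict[str, str]) -> str:
    """Render one scraped row as a text chunk, omitting empty and placeholder fields."""
    fields = ["Name", "Subcategory", "Address", "Phone", "Description"]
    lines = [
        f"{field}: {row[field]}"
        for field in fields
        if row.get(field) and row[field] not in PLACEHOLDERS
    ]
    return "\n".join(lines)


# Invented row for illustration only.
example = {
    "Subcategory": "ลานสเกต",
    "Name": "Example Skate Rink",
    "Address": "Unknown",
    "Phone": "02-000-0000",
    "Map": "No Map",
    "Description": "No Description",
    "Profile URL": "https://example.com/profile",
}
document = row_to_document(example)
print(document)
```

With the DataFrame returned by `scraper()`, the same function could be applied to each record from `df.to_dict("records")`.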

# -------- THANK YOU -------- 
