# Core principle of crawling
Crawling is the process of analyzing the HTML structure of a page and automatically extracting only the data that matches a specific pattern.
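The idea can be sketched without a browser. The snippet below is a minimal illustration, assuming a made-up HTML fragment in which comments live in `<p class="comment">` tags; the markup, the class name, and the `CommentExtractor` helper are hypothetical. It uses the standard-library parser only so that it runs without extra installs, whereas the notebook itself relies on BeautifulSoup for this step.

```python
from html.parser import HTMLParser


class CommentExtractor(HTMLParser):
    """Collect the text inside <p class="comment"> tags (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # The "pattern": a <p> tag whose class attribute is exactly "comment".
        if tag == "p" and ("class", "comment") in attrs:
            self.in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_comment = False

    def handle_data(self, data):
        # Text outside a matching <p> (for example the date) is ignored.
        if self.in_comment:
            self.comments[-1] += data


SAMPLE_HTML = """
<ul>
  <li><p class="comment">Great episode!</p><span class="date">Jan 1</span></li>
  <li><p class="comment">Can't wait for next week.</p><span class="date">Jan 2</span></li>
</ul>
"""


def extract_comments(html):
    """Return the stripped text of every matching comment paragraph."""
    parser = CommentExtractor()
    parser.feed(html)
    return [c.strip() for c in parser.comments]


print(extract_comments(SAMPLE_HTML))
```

Only the text that matches the pattern is kept; the dates in the same list items are skipped.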

**Preparing the Edge driver file**
- Enter **edge://version** in the Edge address bar to check the browser’s major version number.
- Download the WebDriver that matches your version from the official page below:

  https://developer.microsoft.com/ko-kr/microsoft-edge/tools/webdriver/?form=MA13LH#downloads
  
- Create a C:\WebDriver folder and place the msedgedriver.exe file inside it.
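A quick pre-flight check can catch the two most common setup mistakes: a missing driver file and a driver whose major version differs from the browser's. The sketch below is only an illustration; the helper names and the version strings are hypothetical, and you would read the real versions from edge://version and from the driver download page.

```python
from pathlib import Path


def major_version(version: str) -> int:
    """Return the leading number of a dotted version string."""
    return int(version.split(".")[0])


def driver_is_ready(driver_path: str, browser_version: str, driver_version: str) -> bool:
    """True when the driver file exists and both major versions agree."""
    return Path(driver_path).is_file() and (
        major_version(browser_version) == major_version(driver_version)
    )


# Hypothetical version string, for illustration only.
print(major_version("120.0.2210.91"))
```

For example, `driver_is_ready(r"C:\WebDriver\msedgedriver.exe", browser_version, driver_version)` returns `False` if the file has not been copied into the folder yet.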

**Select a webtoon of interest**
- Open webtoons.com in your browser.
- Click any series you want.
- Copy the URL of the episode list page for that series.

https://www.webtoons.com/en/romance/the-mafia-nanny/list?title_no=5879
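Before crawling, it can help to confirm that the copied address really is an episode list page. The following sketch splits the URL with the standard library; the assumption that list pages end in `/list` and carry a `title_no` parameter is inferred from the example URL above, and `describe_series_url` is a hypothetical helper.

```python
from urllib.parse import urlparse, parse_qs


def describe_series_url(url: str) -> dict:
    """Split a series list URL into the parts the crawler relies on."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    return {
        "host": parts.netloc,
        "is_list_page": parts.path.endswith("/list"),
        # parse_qs returns a list per key; take the first value if present.
        "title_no": query.get("title_no", [None])[0],
    }


info = describe_series_url(
    "https://www.webtoons.com/en/romance/the-mafia-nanny/list?title_no=5879"
)
print(info)
```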

### Install and import libraries

In [None]:
pip install selenium

In [None]:
pip install beautifulsoup4

In [None]:
pip install tqdm

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options

from bs4 import BeautifulSoup
from tqdm import tqdm
import time
import pandas as pd

- selenium → browser control (clicking, opening URLs, locating elements, etc.)
- BeautifulSoup → reading HTML and extracting comment text
- tqdm → progress bar for loop progress
- time → simple delay handling
- pandas → saving results to a CSV file

### Define configuration variables

In [7]:
# ==========================================
# [Configuration Area] 
# ==========================================

# 1) Webtoon list page URL to crawl (excluding page number)
SERIES_URL = "https://www.webtoons.com/en/romance/the-mafia-nanny/list?title_no=5879"

# 2) Number of list pages to scan (how many pages to flip through)
MAX_EPISODE_PAGES = 1

# 3) [Important] Selectors found using Developer Tools (F12)
# (A) Episode link selector (the link that opens each episode from the list page)
# Tip: prefer class selectors (.) over id selectors (#).
EPISODE_LINK_SELECTOR = "li._episodeItem a"

# (B) Comment 'More' button selector
# Tip: the crawler keeps clicking this button until it disappears.
MORE_BUTTON_SELECTOR = "button.wcc_CommentMore__more"

# (C) Actual comment text selector
# Tip: follow the path comment item (li) -> paragraph (p) -> span.
COMMENT_TEXT_SELECTOR = "li.wcc_CommentItem__root p.wcc_TextContent__content span"

# 4) Path to the Edge driver file (modify to match your computer path)
EDGEDRIVER_PATH = r"C:\WebDriver\msedgedriver.exe"

# 5) Output file name to save results
OUTPUT_CSV = r"...\comments.csv"
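Because `SERIES_URL` is stored without a page number, the crawler has to add one for each list page. A plain `&page=N` suffix works as long as the URL already has a query string and no `page` parameter. The sketch below shows a slightly more defensive alternative that replaces any existing `page` value; `with_page` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode


def with_page(url: str, page: int) -> str:
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlparse(url)
    # Keep every existing parameter except a previous 'page' value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    query.append(("page", str(page)))
    return urlunparse(parts._replace(query=urlencode(query)))


base = "https://www.webtoons.com/en/romance/the-mafia-nanny/list?title_no=5879"
print(with_page(base, 2))
```

Calling `with_page` on a URL that already ends in `&page=2` swaps the number instead of appending a second `page` parameter.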

### Launch browser

In [3]:
# Set Edge driver options and launch
edge_options = Options()
# edge_options.add_argument("--headless")  # Uncomment if you do not want to see the browser window

edge_service = Service(executable_path=EDGEDRIVER_PATH)

# Open the browser
driver = webdriver.Edge(service=edge_service, options=edge_options)
wait = WebDriverWait(driver, 10)  # Wait up to 10 seconds for elements to appear

print("Edge browser has been launched.")

Edge browser has been launched.
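If a later cell raises an error, the browser window stays open until `driver.quit()` is called by hand. One way to guard against that, sketched here under the assumption that the whole crawl runs inside a single block, is a small context manager. `quitting` and `FakeDriver` are hypothetical names; the fake only stands in for a real Selenium driver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with quitting(fake) as d:
        raise RuntimeError("simulated crawl failure")
except RuntimeError:
    pass

# quit() ran even though the block raised.
print(fake.closed)
```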


### Collect episode list

In [8]:
episode_urls = []

print("Starting to collect episode URLs...")

# Repeat for the number of pages specified (from 1 to MAX_EPISODE_PAGES)
for page in range(1, MAX_EPISODE_PAGES + 1):
    # Append page number to the URL
    page_url = f"{SERIES_URL}&page={page}"
    print(f"  - Navigating to: {page_url}")
    
    driver.get(page_url)
    time.sleep(2)  # Wait for page to load (increase if internet is slow)
    
    # Wait until episode links appear
    try:
        wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, EPISODE_LINK_SELECTOR)))
        
        # Find link elements
        links = driver.find_elements(By.CSS_SELECTOR, EPISODE_LINK_SELECTOR)
        
        # Extract only the 'href' (URL) and store
        for a in links:
            href = a.get_attribute("href")
            if href and href not in episode_urls:  # Avoid duplicates
                episode_urls.append(href)
                
        print(f"    -> Found {len(links)} episodes")
        
    except Exception as e:
        print(f"    Error occurred or no links found: {e}")

print(f"\nA total of {len(episode_urls)} episode URLs collected.")


Starting to collect episode URLs...
  - Navigating to: https://www.webtoons.com/en/romance/the-mafia-nanny/list?title_no=5879&page=1
    -> Found 10 episodes

A total of 10 episode URLs collected.
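The loop above avoids duplicates by checking `href not in episode_urls`, which rescans the list for every link. For a handful of pages that is fine. As an alternative sketch, the same order-preserving de-duplication can be done in one pass after collection; the URLs below are placeholders, and `unique_in_order` is a hypothetical helper.

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict preserves insertion order, so the keys come back in first-seen order.
    return list(dict.fromkeys(urls))


sample = [
    "https://example.com/viewer?episode_no=3",
    "https://example.com/viewer?episode_no=2",
    "https://example.com/viewer?episode_no=3",
    "https://example.com/viewer?episode_no=1",
]
print(unique_in_order(sample))
```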


### Collect comments

In [9]:
comments_data = []

print("Starting comment collection (this may take some time)...")

# Use tqdm to display progress
for idx, ep_url in enumerate(tqdm(episode_urls, desc="Progress"), start=1):
    
    driver.get(ep_url)
    time.sleep(3)  # Wait for the comment section to load (generous wait)
    
    # 1) Keep clicking the 'More' button until it disappears
    click_count = 0
    while True:
        try:
            # Find the button
            more_btn = driver.find_element(By.CSS_SELECTOR, MORE_BUTTON_SELECTOR)
            
            driver.execute_script("arguments[0].click();", more_btn)
            
            click_count += 1
            time.sleep(0.5)  # Slight delay to avoid getting blocked
            
        except Exception:
            # Stop when the button is not found (no more comments to load).
            # Catching Exception rather than using a bare except keeps Ctrl+C working.
            break
            
    # 2) Get the current page HTML with BeautifulSoup
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    
    # 3) Extract comment text only
    comment_nodes = soup.select(COMMENT_TEXT_SELECTOR)
    
    # 4) Add to the data list
    for c in comment_nodes:
        text = c.get_text(strip=True)
        if text:  # Exclude empty comments
            comments_data.append({
                "episode_index": idx,   # 1-based position in episode_urls
                "episode_url": ep_url,  # Source URL
                "comment": text         # Comment text
            })

print(f"\nCollection complete! A total of {len(comments_data)} comments gathered.")

Starting comment collection (this may take some time)...


Progress: 100%|████████████████████████████████████████████████████████████████████████| 10/10 [08:21<00:00, 50.15s/it]


Collection complete! A total of 8759 comments gathered.
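The 'More' loop above runs until the button can no longer be found, so it never ends if the button stays on the page. A capped version is sketched below. It takes the "find" and "click" steps as plain callables so it can be tried without a browser; `click_until_gone`, `max_clicks`, and the fake page are all hypothetical.

```python
import time


def click_until_gone(find_button, click, max_clicks=200, pause=0.5):
    """Click the button returned by find_button() until it is missing.

    find_button: callable returning the button, or raising when it is absent.
    click: callable that clicks the given button.
    Returns the number of clicks performed (never more than max_clicks).
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
        except Exception:
            break  # button no longer on the page
        click(button)
        clicks += 1
        time.sleep(pause)
    return clicks


# Simulated page whose button disappears after three clicks.
remaining = {"pages": 3}


def fake_find():
    if remaining["pages"] == 0:
        raise LookupError("button not found")
    return "more-button"


def fake_click(_button):
    remaining["pages"] -= 1


clicks = click_until_gone(fake_find, fake_click, pause=0)
print(clicks)
```

With Selenium, the two callables would wrap the calls already used in the cell above: `lambda: driver.find_element(By.CSS_SELECTOR, MORE_BUTTON_SELECTOR)` and `lambda b: driver.execute_script("arguments[0].click();", b)`.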





### Save and exit

In [10]:
# Close the browser
driver.quit()
print("Browser has been closed.")

# Convert to DataFrame and preview
df = pd.DataFrame(comments_data)

print("\n[Preview of collected data]")
display(df.head())  # Show top 5 rows

# Save to CSV (utf-8-sig adds a BOM so that Excel displays non-ASCII text such as Korean correctly)
df.to_csv(OUTPUT_CSV, index=False, encoding="utf-8-sig")
print(f"\nFile saved: {OUTPUT_CSV}")

Browser has been closed.

[Preview of collected data]


| | episode_index | episode_url | comment |
|---|---|---|---|
| 0 | 1 | https://www.webtoons.com/en/romance/the-mafia-... | TOP |
| 1 | 1 | https://www.webtoons.com/en/romance/the-mafia-... | TOP |
| 2 | 1 | https://www.webtoons.com/en/romance/the-mafia-... | TOP |
| 3 | 1 | https://www.webtoons.com/en/romance/the-mafia-... | it's confirmed. Nico and Alina are married now |
| 4 | 1 | https://www.webtoons.com/en/romance/the-mafia-... | TOP |



File saved: C:\Users\dadab\Desktop\Text mining\Crawling\comments.csv
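Four of the five preview rows contain only the text `TOP`, which suggests that the comment selector also matches a badge label rather than comment text alone. That is an assumption based on the preview, not something verified against the page, and a reader could in principle post "TOP" as a real comment. Under that assumption, a minimal clean-up sketch on the collected list of dictionaries might look like this; the sample rows and helper names are hypothetical.

```python
from collections import Counter


def drop_labels(rows, labels=("TOP",)):
    """Remove rows whose comment is exactly one of the given label strings."""
    return [r for r in rows if r["comment"] not in labels]


def comments_per_episode(rows):
    """Count the remaining comments for each episode_index."""
    return Counter(r["episode_index"] for r in rows)


# Hypothetical rows shaped like the entries in comments_data.
sample_rows = [
    {"episode_index": 1, "episode_url": "https://example.com/ep1", "comment": "TOP"},
    {"episode_index": 1, "episode_url": "https://example.com/ep1",
     "comment": "it's confirmed. Nico and Alina are married now"},
    {"episode_index": 2, "episode_url": "https://example.com/ep2", "comment": "TOP"},
    {"episode_index": 2, "episode_url": "https://example.com/ep2", "comment": "First!"},
    {"episode_index": 2, "episode_url": "https://example.com/ep2", "comment": "Loved this one"},
]

cleaned = drop_labels(sample_rows)
counts = comments_per_episode(cleaned)
print(len(cleaned), dict(counts))
```

A more precise fix would be to narrow `COMMENT_TEXT_SELECTOR` in Developer Tools so the badge is never selected in the first place.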
