# WebPilot AI: A Multimodal Approach to Web Data Extraction
**Avnish Kardekar**  
*DA623 – Multimodal Data Analysis and Learning (Winter 2025)*  
*Indian Institute of Technology Guwahati*

## Motivation
Traditional scrapers break on websites that:
- Use JavaScript for dynamic content rendering
- Require scrolling or interaction
- Present data in visual formats like images or charts

**WebPilot AI** addresses these cases by combining:
- HTML parsing
- Vision-Language Models
- Automated browser interaction

## Multimodal Learning Context
This project reflects multimodal learning by integrating:
- **Textual data** (HTML parsing)
- **Visual data** (Vision-Language model reasoning)
- **Behavioral data** (user interaction simulation)

These modes complement each other: each one covers cases where the others fall short, so structured data can be extracted from a wider range of websites than any single mode handles.

## 1. HTML Scraping
Best for static, well-structured pages. Below is an example of scraping product data from a simple e-commerce page.

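In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://example.com/products'  # placeholder URL
response = requests.get(url, timeout=10)  # avoid hanging on a slow server
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
soup = BeautifulSoup(response.text, 'html.parser')

products = []
for div in soup.select('div.product'):
    title = div.select_one('h2.title')
    price = div.select_one('span.price')
    # Skip cards missing either field instead of raising AttributeError
    if title is None or price is None:
        continue
    products.append({'title': title.get_text(strip=True),
                     'price': price.get_text(strip=True)})

df = pd.DataFrame(products, columns=['title', 'price'])
df.head()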
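
The `price` column above still holds raw text. As a minimal sketch of turning such strings into numbers, the hypothetical helper `parse_price` below (not part of the original notebook) assumes that `,` is a thousands separator and `.` is the decimal point; sites using other conventions would need a different rule.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Return the first number found in a price string, or None if there is none.

    Assumption: ',' separates thousands and '.' marks the decimal part.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting to an exact decimal
    return Decimal(match.group().replace(",", ""))
```

Under these assumptions it could be applied to the scraped column with `df['price'].map(parse_price)`. `Decimal` is used instead of `float` so that currency values are not subject to binary rounding.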

## 2. Vision-Based Extraction with a Vision-Language Model
For websites whose content is laid out visually or rendered by JavaScript rather than present in the raw HTML, we capture a screenshot and prompt a Vision-Language Model with it.

**Prompt Example:**
> Extract all job titles, companies, and locations from this job board.

The model processes the image and returns structured JSON data.
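
The two ends of that exchange can be sketched without committing to a particular model. The payload layout, the helper names `build_vlm_request` and `parse_vlm_reply`, and the `title`/`company`/`location` keys are all assumptions made for illustration; each real VLM API defines its own request format, and the actual network call is deliberately left out.

```python
import base64
import json

PROMPT = "Extract all job titles, companies, and locations from this job board."
REQUIRED_KEYS = {"title", "company", "location"}


def build_vlm_request(image_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Package a screenshot and a prompt into a generic, hypothetical payload."""
    return {
        "prompt": prompt + " Respond with a JSON array of objects with the keys "
                           "title, company, and location.",
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
    }


def parse_vlm_reply(reply: str) -> list:
    """Pull the JSON array out of a reply that may include surrounding prose."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model reply")
    rows = json.loads(reply[start:end + 1])
    # Keep only records that carry every expected field
    return [r for r in rows if isinstance(r, dict) and REQUIRED_KEYS <= r.keys()]
```

The image bytes could come from Playwright's `page.screenshot(full_page=True)`, which returns the screenshot as `bytes`. Filtering out incomplete records is one possible design choice; logging them for review would be a reasonable alternative.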

## 3. Automated Browsing with Playwright
Used for dynamic pages that require interaction.
The snippet below submits a Google search and prints the first five result titles:

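In [None]:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    try:
        page = browser.new_page()
        page.goto('https://www.google.com')
        # Match on the name attribute only, so the selector works whether
        # the search box is rendered as an <input> or a <textarea>
        page.fill('[name="q"]', 'best AI tools 2025')
        page.keyboard.press('Enter')
        page.wait_for_selector('h3')  # result titles are rendered as <h3>
        results = page.query_selector_all('h3')
        for r in results[:5]:
            print(r.inner_text())
    finally:
        browser.close()  # release the browser even if a step fails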
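
Pages that load more content as the user scrolls need one extra step before extraction. A minimal sketch of that loop follows; it is written against two plain callables so it can be tried without a browser, and the function name `scroll_until_stable` and the `max_rounds` limit are assumptions rather than part of the original project.

```python
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_once: Callable[[], None],
                        max_rounds: int = 20) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = get_height()
    for round_no in range(1, max_rounds + 1):
        scroll_once()
        new_height = get_height()
        if new_height == last_height:
            return round_no  # this scroll loaded nothing new
        last_height = new_height
    return max_rounds  # gave up: the page was still growing
```

With Playwright, `get_height` could be `lambda: page.evaluate("document.body.scrollHeight")`, and `scroll_once` a small function that calls `page.mouse.wheel(0, 4000)` followed by `page.wait_for_timeout(500)` to give new content time to load. The scroll distance and wait time are illustrative values that would need tuning per site.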

## Comparative Analysis
| Mode               | Strengths                         | Limitations                         |
|--------------------|------------------------------------|--------------------------------------|
| HTML Scraping      | Fast, efficient for static pages   | Fails with JS / visual layouts       |
| Vision-based       | Works on images, charts, visuals   | Requires VLMs, slower                |
| Automated Browsing | Mimics real users, handles JS      | Slower, resource-heavy               |
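
One way to act on this trade-off is to try the cheapest mode first and escalate only when it returns nothing. The sketch below is an assumption about how the three modes might be chained, not a description of WebPilot AI's actual control flow; `extract_with_fallback` and the extractor callables are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

Rows = List[dict]
Extractor = Callable[[str], Optional[Rows]]


def extract_with_fallback(url: str,
                          extractors: List[Tuple[str, Extractor]]) -> Tuple[str, Rows]:
    """Try each (name, extractor) pair in order; return the first non-empty result."""
    for name, extractor in extractors:
        try:
            rows = extractor(url)
        except Exception:
            rows = None  # treat a failure in one mode as "no result" and move on
        if rows:
            return name, rows
    return "none", []
```

Called with the pairs ordered as HTML scraping, automated browsing, then vision, this would only pay the cost of the slower modes on pages where the faster ones come back empty.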

## Reflections
- **Improvements**: Speed optimization, error handling, and CAPTCHA bypassing

## References
- [Playwright Python Docs](https://playwright.dev/python/)
- [BeautifulSoup Docs](https://www.crummy.com/software/BeautifulSoup/)
- [GPT-4V](https://openai.com/gpt-4v)
- [LLaVA](https://llava-vl.github.io/)