A Streamlit-based web crawler that discovers all URLs and subdomains from a target website using Playwright.
✨ Core Capabilities:
- Enter any website URL and crawl it automatically
- Extract all discovered URLs and subdomains
- Configurable crawling depth (max pages)
- Adjustable timeout settings
- Option to include external links
- Download results as TXT files
- Real-time progress tracking
- Error handling and reporting
- Clone or download this repository
- Install dependencies:
pip install -r requirements.txt
playwright install chromium
- Run locally:
streamlit run streamlit_crawler.py
- Create a new GitHub repository
- Push these files:
- streamlit_crawler.py (main app)
- requirements.txt (dependencies)
- .streamlit/config.toml (Streamlit configuration)
- README.md (this file)
- Go to Streamlit Cloud
- Sign in with your GitHub account
- Click "New app"
- Select your repository
- Specify the main file: streamlit_crawler.py
- Click "Deploy"
Streamlit Cloud requires special handling for Playwright: Chromium depends on extra system libraries. Create a packages.txt file with the following contents:
libgconf-2-4
libx11-xcb1
libxcomposite1
libxcursor1
libxdamage1
libxext6
libxfixes3
libxi6
libxinerama1
libxrandr2
libxrender1
libxss1
libxtst6
fonts-liberation
libappindicator1
libgtk-3-0
xdg-utils
- Enter Website URL: Input your target website (e.g., example.com or https://example.com)
- Configure Settings (sidebar):
- Max Pages: How many pages to crawl (default: 50)
- Timeout: Max seconds per page (default: 30)
- Include External Links: Toggle for external URLs
- Click "Start Crawling": The app will begin crawling
- View Results:
- Summary cards showing statistics
- Discovered subdomains list
- All URLs found
- Download options for results
- Max pages to crawl (10-500): Limits the number of pages to visit. Lower values give faster but less comprehensive crawls.
- Timeout (seconds) (5-60): How long to wait for each page to load.
- Include external links: Whether to collect links pointing outside the target domain.
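As a rough sketch, the three settings above map onto a small configuration object. The names and the clamping behavior here are illustrative, not the app's actual code:

```python
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    """Illustrative container for the sidebar settings (hypothetical names)."""
    max_pages: int = 50             # 10-500: hard cap on pages visited
    timeout_s: int = 30             # 5-60: per-page load timeout in seconds
    include_external: bool = False  # collect links outside the target domain?

    def __post_init__(self):
        # Clamp values to the ranges exposed in the sidebar
        self.max_pages = max(10, min(500, self.max_pages))
        self.timeout_s = max(5, min(60, self.timeout_s))

cfg = CrawlConfig(max_pages=1000)   # clamped down to 500
```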
Web_Mapper/
├── streamlit_crawler.py # Main application
├── requirements.txt # Python dependencies
├── packages.txt # System packages (for Streamlit Cloud)
├── .streamlit/
│ └── config.toml # Streamlit configuration
└── README.md # This file
- URL Normalization: Converts input URL to proper format
- Domain Extraction: Identifies the base domain
- Web Crawling: Uses Playwright to:
- Load each page
- Extract all links via DOM query
- Resolve relative URLs to absolute URLs
- Filter by domain (stays on-domain by default)
- Subdomain Detection: Identifies URLs with subdomains
- Results Display: Shows statistics and allows downloading results
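The URL-handling steps above (normalization, resolving relative links, domain filtering, subdomain detection) can be sketched with the standard library alone. Function names are illustrative; in the real app, Playwright supplies the raw hrefs that these helpers process:

```python
from urllib.parse import urljoin, urlparse

def normalize_url(raw: str) -> str:
    """Ensure the input has a scheme ('example.com' -> 'https://example.com')."""
    return raw if raw.startswith(("http://", "https://")) else "https://" + raw

def base_domain(url: str) -> str:
    """Extract a simple base domain (last two host labels)."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def resolve_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve relative hrefs against the page URL into absolute URLs."""
    return [urljoin(page_url, h) for h in hrefs]

def on_domain(url: str, domain: str) -> bool:
    """True if url is on the target domain (including its subdomains)."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

def is_subdomain(url: str, domain: str) -> bool:
    """True if url's host is a proper subdomain of the base domain."""
    host = urlparse(url).hostname or ""
    return host != domain and host.endswith("." + domain)
```

Note that on_domain checks for a leading dot before the domain, so a lookalike host such as notexample.com is not mistaken for part of example.com.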
- Framework: Streamlit 1.36.0
- Browser Automation: Playwright 1.44.0 (Chromium)
- Execution: Asynchronous with asyncio
- URL Parsing: Python's urllib.parse
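A simplified sketch of the asynchronous crawl loop with a max-pages cap and per-page error handling. The page fetcher is injected here so the loop runs without a browser; in the real app it would load each page with Playwright and query the DOM for links. All names are illustrative:

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable

async def crawl(start: str,
                fetch: Callable[[str], Awaitable[list[str]]],
                max_pages: int = 50) -> set[str]:
    """Breadth-first crawl: fetch(url) returns the absolute links on that page."""
    seen, queue = {start}, deque([start])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        try:
            links = await fetch(url)
        except Exception:
            continue  # error handling: skip pages that fail to load
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Usage with a fake fetcher standing in for Playwright:
async def fake_fetch(url: str) -> list[str]:
    site = {"https://a.test/": ["https://a.test/1", "https://a.test/2"],
            "https://a.test/1": ["https://a.test/2"]}
    return site.get(url, [])

found = asyncio.run(crawl("https://a.test/", fake_fetch))
```

Injecting the fetcher keeps the traversal logic separate from browser automation, which also makes the cap and error paths easy to test.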
- Start with lower max pages (20-30) for quick tests
- Increase timeout for slow websites
- On Streamlit Cloud, initial load may take 30-60 seconds
"Browser error": Website may be blocking automated browsers
- Try increasing timeout
- Check if website has anti-bot protection
"Timeout errors": Website is slow to load
- Increase timeout setting
- Reduce max pages to crawl
"No results": Website may be JavaScript-heavy
- The crawler waits only for the domcontentloaded event, so content injected later by JavaScript may be missed
- Does not execute complex JavaScript (waits for DOM ready only)
- Does not handle sites requiring authentication
- Limits only crawl speed; it does not honor robots.txt (use responsibly!)
- Some websites may block automated access
MIT License - Feel free to modify and deploy!
For issues or questions, check the error messages in the app or review the logs in Streamlit Cloud deployment dashboard.