Website Crawler with Subdomain Finder

A Streamlit-based web crawler that discovers URLs and subdomains on a target website using Playwright.

Features

Core Capabilities:

  • Enter any website URL and crawl it automatically
  • Extract all discovered URLs and subdomains
  • Configurable crawling depth (max pages)
  • Adjustable timeout settings
  • Option to include external links
  • Download results as TXT files
  • Real-time progress tracking
  • Error handling and reporting

Installation

Local Development

  1. Clone or download this repository
  2. Install dependencies:
pip install -r requirements.txt
playwright install chromium
  3. Run locally:
streamlit run streamlit_crawler.py
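
For reference, a minimal requirements.txt matching the versions listed under Technical Details below would look like this (an assumption — the actual file may pin additional packages):

streamlit==1.36.0
playwright==1.44.0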

Deployment to Streamlit Cloud

Step 1: Prepare your GitHub repository

  1. Create a new GitHub repository
  2. Push these files:
    • streamlit_crawler.py (main app)
    • requirements.txt (dependencies)
    • .streamlit/config.toml (Streamlit configuration)
    • README.md (this file)

Step 2: Deploy to Streamlit Cloud

  1. Go to Streamlit Cloud (share.streamlit.io)
  2. Sign in with your GitHub account
  3. Click "New app"
  4. Select your repository
  5. Specify the file: streamlit_crawler.py
  6. Click "Deploy"

Step 3: Handle Playwright on Cloud (IMPORTANT!)

Streamlit Cloud needs extra system libraries for Playwright's Chromium. Create a packages.txt file in the repository root with the following contents:

libgconf-2-4
libx11-xcb1
libxcomposite1
libxcursor1
libxdamage1
libxext6
libxfixes3
libxi6
libxinerama1
libxrandr2
libxrender1
libxss1
libxtst6
fonts-liberation
libappindicator1
libgtk-3-0
xdg-utils
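
packages.txt covers system libraries only; the Chromium build itself still has to be downloaded. A common pattern for this (shown here as a sketch, not necessarily what streamlit_crawler.py does) is to trigger playwright install chromium once at app startup:

import subprocess

import streamlit as st

@st.cache_resource
def install_chromium():
    # Hypothetical startup hook: fetch the Chromium build Playwright expects.
    # cache_resource ensures this runs once per container, not on every rerun.
    subprocess.run(["playwright", "install", "chromium"], check=True)

install_chromium()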

Usage

  1. Enter Website URL: Input your target website (e.g., example.com or https://example.com)
  2. Configure Settings (sidebar):
    • Max Pages: How many pages to crawl (default: 50)
    • Timeout: Max seconds per page (default: 30)
    • Include External Links: Toggle for external URLs
  3. Click "Start Crawling": The app will begin crawling
  4. View Results:
    • Summary cards showing statistics
    • Discovered subdomains list
    • All URLs found
    • Download options for results

Configuration Options

Sidebar Settings

  • Max pages to crawl (10-500): Limits the number of pages to visit. Lower values crawl faster but less comprehensively.
  • Timeout (seconds) (5-60): How long to wait for each page to load.
  • Include external links: Whether to collect links pointing outside the target domain (see the widget sketch below).
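
A minimal sketch of how these settings could map to Streamlit sidebar widgets (labels and defaults follow the options above; the actual code may differ):

import streamlit as st

# Sidebar controls mirroring the options described above.
max_pages = st.sidebar.slider("Max pages to crawl", min_value=10, max_value=500, value=50)
timeout = st.sidebar.slider("Timeout (seconds)", min_value=5, max_value=60, value=30)
include_external = st.sidebar.checkbox("Include external links", value=False)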

File Structure

Web_Mapper/
├── streamlit_crawler.py      # Main application
├── requirements.txt          # Python dependencies
├── packages.txt              # System packages (for Streamlit Cloud)
├── .streamlit/
│   └── config.toml           # Streamlit configuration
└── README.md                 # This file

How It Works

  1. URL Normalization: Converts input URL to proper format
  2. Domain Extraction: Identifies the base domain
  3. Web Crawling: Uses Playwright to:
    • Load each page
    • Extract all links via DOM query
    • Resolve relative URLs to absolute URLs
    • Filter by domain (stays on-domain by default)
  4. Subdomain Detection: Identifies URLs with subdomains
  5. Results Display: Shows statistics and allows downloading results (a condensed sketch of this flow follows)
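
A condensed, illustrative version of that flow, assuming the async Playwright API (function and variable names here are hypothetical, not taken from streamlit_crawler.py):

import asyncio
from urllib.parse import urljoin, urlparse

from playwright.async_api import async_playwright

async def crawl(start_url, max_pages=50, timeout_s=30):
    # Step 1: URL normalization -- assume https:// when no scheme is given.
    if not start_url.startswith(("http://", "https://")):
        start_url = "https://" + start_url
    # Step 2: domain extraction.
    base_domain = urlparse(start_url).netloc.removeprefix("www.")
    seen = set()
    queue = [start_url]
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                # Step 3: load the page, waiting only for DOM readiness.
                await page.goto(url, timeout=timeout_s * 1000, wait_until="domcontentloaded")
            except Exception:
                continue  # skip pages that time out or block automation
            # Extract every link via a DOM query; e.href is already absolute.
            hrefs = await page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
            for href in hrefs:
                absolute = urljoin(url, href)
                # Stay on-domain by default (subdomains included).
                if urlparse(absolute).netloc.endswith(base_domain):
                    queue.append(absolute)
        await browser.close()
    return seen

Calling asyncio.run(crawl("example.com")) returns the set of discovered URLs; the real app layers progress tracking, error reporting, and the external-links toggle on top of this loop.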

Technical Details

  • Framework: Streamlit 1.36.0
  • Browser Automation: Playwright 1.44.0 (Chromium)
  • Execution: Asynchronous with asyncio
  • URL Parsing: Python's urllib.parse
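
As an example of the urllib.parse usage, subdomain detection (step 4 under How It Works) needs only the hostname; this helper is illustrative, not the app's actual code:

from urllib.parse import urlparse

def is_subdomain(url, base_domain):
    # "blog.example.com" counts as a subdomain of "example.com";
    # the bare base domain itself does not.
    host = urlparse(url).netloc
    return host != base_domain and host.endswith("." + base_domain)

print(is_subdomain("https://blog.example.com/post", "example.com"))  # True
print(is_subdomain("https://example.com/about", "example.com"))      # False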

Performance Tips

  • Start with lower max pages (20-30) for quick tests
  • Increase timeout for slow websites
  • On Streamlit Cloud, initial load may take 30-60 seconds

Troubleshooting

"Browser error": Website may be blocking automated browsers

  • Try increasing timeout
  • Check if website has anti-bot protection

"Timeout errors": Website is slow to load

  • Increase timeout setting
  • Reduce max pages to crawl

"No results": Website may be JavaScript-heavy

  • The crawler waits for domcontentloaded event

Limitations

  • Does not execute complex JavaScript (waits for DOM ready only)
  • Does not handle sites requiring authentication
  • Limits only crawl speed and does not honor robots.txt (use responsibly!)
  • Some websites may block automated access

License

MIT License - Feel free to modify and deploy!

Support

For issues or questions, check the error messages in the app or review the logs in Streamlit Cloud deployment dashboard.
