Web Crawler

A Python-based command-line web crawler that uses Selenium to navigate websites, interact with dynamic elements (forms, buttons, search bars), capture network requests, and extract valid endpoints. The crawler saves endpoint details (URL, method, body parameters, headers) to a file in JSON, plain text, or CSV format.

Features

Crawls websites starting from a given URL, up to a specified page limit.
Interacts with dynamic elements:
- Clicks buttons and submit-like elements.
- Fills forms (text inputs, dropdowns, checkboxes) without submitting.
- Enters test data in search bars.
- Triggers onchange and oninput events.
Captures network requests to identify HTTP endpoints.
Extracts endpoints from JavaScript files.
Validates URLs to ensure they belong to the base domain and exclude static assets (CSS, JS, images).
Supports custom HTTP headers (e.g., Authorization tokens).
Saves unique endpoints to a file in JSON, plain text, or CSV format.
Browser fallback: Uses Chrome and falls back to Firefox if Chrome is unavailable.
Accurate HTTP method detection (GET, POST, PUT, DELETE) for all endpoints.
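
The URL validation rule listed above (same base domain, no static assets) can be sketched roughly as follows. This is an illustrative sketch only, not the code in crawler.py: the function name `is_valid_endpoint` and the `STATIC_EXTENSIONS` list are assumptions made for this example.

```python
from urllib.parse import urlparse

# Assumed set of static-asset extensions; the real crawler may use a different list.
STATIC_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".woff", ".woff2",
)


def is_valid_endpoint(url: str, base_url: str) -> bool:
    """Return True if url is an HTTP(S) URL on the base domain and not a static asset."""
    parsed = urlparse(url)
    base = urlparse(base_url)
    if parsed.scheme not in ("http", "https"):
        return False
    # Compare hostnames only, so ports and embedded credentials are ignored.
    if parsed.hostname is None or parsed.hostname != base.hostname:
        return False
    # Check the path rather than the full URL, so a query string cannot hide the extension.
    return not parsed.path.lower().endswith(STATIC_EXTENSIONS)
```

With this sketch, `http://example.com/api/submit` is accepted for the base `http://example.com`, while `http://example.com/static/app.js?v=2` and `http://other.com/api` are rejected.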

Warning

Caution: This crawler sends real HTTP requests to the target website, interacting with forms, buttons, and search bars. Do not use this tool with real accounts or credentials, as it may trigger security measures, lock accounts, or result in bans. Always obtain permission from the website owner before crawling, and use test accounts or environments to avoid unintended consequences.

Installation

Install Python: Ensure Python 3.6+ is installed.
Install Dependencies:
```
pip install selenium requests
```
Install WebDriver:
- For Chrome: Install ChromeDriver matching your Chrome version.
- For Firefox: Install GeckoDriver.
- Ensure the WebDriver executable is in your system PATH.

Usage

Run the crawler from the command line using crawler.py.

python crawler.py -u http://example.com -m 10 -o endpoints.json --headless --header "Authorization: Bearer token"

Command-Line Arguments

-u/--url: Starting URL (required).
-m/--max-pages: Maximum pages to crawl (default: 10).
-o/--output: Output file (default: endpoints.json).
-f/--format: Output format (json, txt, or csv; if omitted, inferred from the output file extension, falling back to json).
--headless: Run in headless mode.
--header: Custom header (e.g., --header "Authorization: Bearer token"; can be used multiple times).
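
For readers who want to see how these arguments might fit together, here is a minimal `argparse` sketch. It is not taken from crawler.py: the helper names `build_parser`, `resolve_format`, and `parse_headers` are hypothetical, and the precedence (explicit `-f` first, then file extension, then json) is an assumption about how the default is resolved.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (sketch, not the real crawler)."""
    parser = argparse.ArgumentParser(description="Crawl a site and collect endpoints.")
    parser.add_argument("-u", "--url", required=True, help="Starting URL")
    parser.add_argument("-m", "--max-pages", type=int, default=10, help="Maximum pages to crawl")
    parser.add_argument("-o", "--output", default="endpoints.json", help="Output file")
    parser.add_argument("-f", "--format", choices=["json", "txt", "csv"], default=None,
                        help="Output format")
    parser.add_argument("--headless", action="store_true", help="Run in headless mode")
    # action="append" collects every --header occurrence into a list.
    parser.add_argument("--header", action="append", default=[],
                        help='Custom header, e.g. "Authorization: Bearer token"')
    return parser


def resolve_format(fmt, output_path):
    """Assumed precedence: explicit -f, then the file extension, then json."""
    if fmt:
        return fmt
    ext = os.path.splitext(output_path)[1].lstrip(".").lower()
    return ext if ext in ("json", "txt", "csv") else "json"


def parse_headers(raw_headers):
    """Turn ['Name: value', ...] into a dict, splitting on the first colon only."""
    headers = {}
    for raw in raw_headers:
        name, sep, value = raw.partition(":")
        if not sep:
            raise ValueError(f"Invalid header (expected 'Name: value'): {raw!r}")
        headers[name.strip()] = value.strip()
    return headers
```

Splitting on the first colon only keeps header values that themselves contain colons (such as URLs) intact.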

Output Formats

JSON (default):
- Array of objects with url, method, body_params, and extra_headers.
- Example: endpoints.json

[
  {
    "url": "http://example.com/api/submit",
    "method": "POST",
    "body_params": {"query": "test"},
    "extra_headers": {"Authorization": "Bearer token"}
  }
]

- Use case: General-purpose, suitable for scripts or tools that parse JSON.

Plain Text (txt):
- One URL per line.
- Example: endpoints.txt
```
http://example.com/api/submit
```
- Use case: Direct input to tools like nuclei, ffuf, or sqlmap.
CSV:
- Columns: URL, Method, Body Params (JSON-serialized), Extra Headers (JSON-serialized).
- Example: endpoints.csv
```
URL,Method,Body Params,Extra Headers
http://example.com/api/submit,POST,"{""query"": ""test""}","{""Authorization"": ""Bearer token""}"
```
- Use case: Structured data for analysis or tools that accept CSV.
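
The three formats above could be produced by a writer along these lines. This is a sketch under assumptions: the function name `write_endpoints` is hypothetical, and the endpoint dictionaries are assumed to carry the four keys shown in the JSON example.

```python
import csv
import json


def write_endpoints(endpoints, fmt, stream):
    """Write a list of endpoint dicts to a text stream as json, txt, or csv."""
    if fmt == "json":
        json.dump(endpoints, stream, indent=2)
    elif fmt == "txt":
        # One URL per line; method, body, and headers are not kept in this format.
        for ep in endpoints:
            stream.write(ep["url"] + "\n")
    elif fmt == "csv":
        writer = csv.writer(stream, lineterminator="\n")
        writer.writerow(["URL", "Method", "Body Params", "Extra Headers"])
        for ep in endpoints:
            # Nested dicts are JSON-serialized so each one fits in a single CSV cell.
            writer.writerow([
                ep["url"],
                ep["method"],
                json.dumps(ep["body_params"]),
                json.dumps(ep["extra_headers"]),
            ])
    else:
        raise ValueError(f"Unsupported format: {fmt}")
```

When writing CSV to a real file, open it with `newline=""` as the `csv` module documentation recommends, for example `open("endpoints.csv", "w", newline="")`.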

Examples

Basic Crawl with JSON Output:

python crawler.py -u http://example.com -m 10 -o endpoints.json --headless

Crawl with Authentication and Text Output:

python crawler.py -u http://example.com -m 20 -o endpoints.txt -f txt --headless --header "Authorization: Bearer eyJhbGciOiJIUzUxMiJ9..."

Crawl with CSV Output:

python crawler.py -u http://example.com -m 15 -o endpoints.csv -f csv --headless --header "User-Agent: Mozilla/5.0"

Pipe to Nuclei for Vulnerability Scanning:

python crawler.py -u http://example.com -o endpoints.txt -f txt --headless
cat endpoints.txt | nuclei -t /path/to/templates

Feed to FFUF for Fuzzing:

python crawler.py -u http://example.com -o endpoints.txt -f txt --headless
ffuf -w endpoints.txt -u FUZZ

Test SQL Injection with sqlmap:

python crawler.py -u http://example.com -o endpoints.txt -f txt --headless
sqlmap -m endpoints.txt --batch

Notes

Browser Support: The crawler tries Chrome first, falling back to Firefox if Chrome is unavailable.
HTTP Methods: Accurately detects GET, POST, PUT, DELETE methods, including for endpoints extracted from JavaScript.
Error Handling: Logs errors and warnings for debugging.
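Output: Use the txt format for easy integration with tools that take a list of URLs. It contains URLs only, so the method, body parameters, and headers are available only in the JSON and CSV formats.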
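
As an illustration of how an HTTP method might be inferred for an endpoint found in JavaScript, the sketch below scans source text for `fetch(...)` calls. It shows one plausible approach, not the crawler's actual logic: the function name `extract_fetch_endpoints` and both regular expressions are assumptions, and the pattern only handles string-literal URLs with a flat (non-nested) options object.

```python
import re
from urllib.parse import urljoin

# fetch("<url>") optionally followed by a flat options object: fetch("<url>", { ... })
FETCH_CALL = re.compile(r"""fetch\(\s*["']([^"']+)["']\s*(?:,\s*\{([^}]*)\})?""")
# method: "POST" (or single-quoted) inside the options object
METHOD_OPTION = re.compile(r"""method\s*:\s*["'](\w+)["']""", re.IGNORECASE)


def extract_fetch_endpoints(js_source, base_url):
    """Return (absolute_url, method) pairs for fetch() calls found in js_source."""
    found = []
    for match in FETCH_CALL.finditer(js_source):
        path = match.group(1)
        options = match.group(2) or ""
        method_match = METHOD_OPTION.search(options)
        # fetch() defaults to GET when no method option is given.
        method = method_match.group(1).upper() if method_match else "GET"
        found.append((urljoin(base_url, path), method))
    return found
```

A regex scan like this misses URLs built at runtime and calls made through other clients (for example XMLHttpRequest or axios), so captured network traffic remains the more reliable source for methods.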

Contributing

Contributions are welcome! Please:

Fork the repository.
Create a feature branch (git checkout -b feature/YourFeature).
Commit changes (git commit -m "Add YourFeature").
Push to the branch (git push origin feature/YourFeature).
Open a pull request.
