A simple Python tool to extract HTTP request headers that would be sent to web pages, with full support for comprehensive browser headers.
- Extract request headers that would be sent to any URL
- Send actual requests and capture both request and response headers
- Handle HTTP errors gracefully
- Support for comprehensive browser headers to bypass anti-bot measures
- Simple API for easy integration
- JSON output for easy parsing
- Clone or download this repository
- Create a virtual environment:
python -m venv .venv
- Activate the virtual environment:
- On Windows:
.venv\Scripts\activate
- On macOS/Linux:
source .venv/bin/activate
- On Windows:
- Install the required dependencies:
pip install -r requirements.txt
from header_extractor import HeaderExtractor
# Create extractor instance
extractor = HeaderExtractor()
# Extract request headers that would be sent
result = extractor.extract_request_headers("https://example.com")
# Access the request headers
request_headers = result['request_headers']
for key, value in request_headers.items():
print(f"{key}: {value}")from header_extractor import HeaderExtractor
# Create extractor instance
extractor = HeaderExtractor()
# Define comprehensive browser headers to bypass anti-bot measures
comprehensive_headers = {
# Content negotiation headers
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
# Cache control (anti-bot detection)
'cache-control': 'no-cache',
'pragma': 'no-cache',
# Priority headers (modern browser feature)
'priority': 'u=0, i',
# Client Hints (modern replacement for parts of User-Agent)
'sec-ch-ua': '"Not;A=Brand";v="99", "Google Chrome";v="139", "Chromium";v="139"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
# Security/Fetch metadata headers
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
# User-Agent (still important for compatibility)
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36'
}
# Extract request headers with comprehensive headers
result = extractor.extract_request_headers("https://example.com", comprehensive_headers)
# Print the request headers that would be sent
print("Request Headers That Would Be Sent:")
for key, value in result['request_headers'].items():
print(f"{key}: {value}")from header_extractor import HeaderExtractor
# Create extractor instance
extractor = HeaderExtractor()
# Send request and capture both request and response headers
result = extractor.send_request_and_capture_headers("https://example.com")
# Access both request and response headers
print("Request Headers Sent:")
for key, value in result['request_headers_sent'].items():
print(f" {key}: {value}")
print("\nResponse Headers Received:")
for key, value in result['response_headers_received'].items():
print(f" {key}: {value}")You can also use the tool from the command line (after installation, via the header-extractor entrypoint):
# Basic usage
header-extractor https://example.com
# Specify custom output directory
header-extractor https://example.com --output-dir data/headers
# Multiple URLs
header-extractor https://example.com https://httpbin.org/headers
# Custom timeout
header-extractor https://example.com --timeout 30
# Save to file
header-extractor https://example.com --output headers.json
# Inline format
header-extractor https://example.com --format inlineMany modern websites, especially those with anti-bot measures, analyze these headers to determine if the request is coming from a genuine browser or a script:
- accept/accept-language: Content negotiation headers that indicate browser capabilities
- sec-ch-ua*: Client Hints that provide structured browser information
- sec-fetch-dest/mode/site: Security headers that indicate request context
- upgrade-insecure-requests: Directive for HTTPS preference
- cache-control/pragma: Cache behavior indicators
A request with only basic headers (User-Agent, Accept-Encoding, Connection) looks very "programmatic" to a server and may be blocked by anti-bot systems.
The SequenceExtractor class allows you to define and execute sequences of HTTP requests where each step can depend on data from previous steps.
from header_extractor.sequence_extractor import SequenceExtractor
# Create a sequence
executor = SequenceExtractor()
# Step 1: Get initial data
executor.add_step(
name="get_initial_data",
url="https://httpbin.org/get",
extract={"origin_ip": "origin"} # Extract IP from response
)
# Step 2: Send data (depends on first step)
executor.add_step(
name="post_data",
method="POST",
url="https://httpbin.org/post",
depends_on=["get_initial_data"],
data=lambda ctx: {
"client_ip": ctx.get("origin_ip", ""),
"action": "test"
},
extract={"json_data": "json"}
)
# Step 3: Get headers with custom headers (depends on previous step)
executor.add_step(
name="get_headers",
url="https://httpbin.org/headers",
depends_on=["post_data"],
headers=lambda ctx: {
"X-Client-IP": ctx.get("origin_ip", ""),
"User-Agent": "SequenceExtractor/1.0"
}
)
# Execute the sequence
results = executor.execute()
# Print results
for name, result in results.items():
print(f"\nStep: {name}")
print(f"Success: {result.success}")
if result.error:
print(f"Error: {result.error}")
if result.data:
print("Extracted data:", result.data)- Step Dependencies: Steps can depend on the successful completion of previous steps
- Data Extraction: Extract data from responses and use it in subsequent requests
- Conditional Execution: Skip steps based on conditions
- Retry Logic: Automatic retries for failed requests
- Context Management: Share data between steps using a context dictionary
- Error Handling: Graceful handling of failures with detailed error information
Each step can be configured with the following parameters:
name: Unique identifier for the stepurl: The URL to requestmethod: HTTP method (default: "GET")headers: Headers to include in the request (can be a dict or callable)data: Request body data (can be a dict or callable)depends_on: List of step names that must complete successfully firstcondition: Callable that receives context and returns whether to executeextract: Dict of {name: path} to extract data from responsemax_retries: Maximum number of retry attempts (default: 1)delay: Delay in seconds before executing this step (default: 0)
extractor = HeaderExtractor(timeout=10, custom_headers=None, output_dir=None)Parameters:
timeout(int): Request timeout in seconds (default: 10)custom_headers(dict, optional): Custom headers to include in all requestsoutput_dir(str or Path, optional): Custom output directory for saved files. Uses value from config if not specified.
Extract the request headers that would be sent to a given URL.
Parameters:
url(str): The URL to send request tocustom_headers(dict, optional): Custom headers to include in the request
Returns:
dict: A dictionary containing the request headers that would be sent
Set a new default output directory for saving files.
Parameters:
output_dir(str or Path): New output directory path
Example:
extractor = HeaderExtractor()
extractor.set_output_dir('data/headers') # All future saves will go hereSave data to a JSON file in the output directory.
Parameters:
data(dict): Data to save as JSONfilename(str, optional): Output filename. If None, generates a name in format 'headers__.json'output_dir(str or Path, optional): Output directory for this save only. Uses instance output_dir if None.
Returns:
str: Path to the saved file
Example:
# Basic usage
extractor = HeaderExtractor()
extractor.save_to_file(data) # Saves to default output_dir
# Custom output directory for this save only
extractor.save_to_file(data, output_dir='temp/output')
# Change default output directory for all future saves
extractor.set_output_dir('data/headers')
extractor.save_to_file(data) # Saves to data/headers/Send a request and capture both request and response headers.
Parameters:
url(str): The URL to send request tocustom_headers(dict, optional): Custom headers to include in the request
Returns:
dict: A dictionary containing both request and response headers
- Python 3.6+
- See requirements.txt for Python package dependencies
This project is open source and available under the MIT License.