Skip to content

Catefishh/HTML-Scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 

Repository files navigation

HTML Scraper for Website Analysis

Description

This Python script is an HTML scraper designed for website analysis. It extracts various elements from a given website, including HTML, CSS, APIs, metadata, content, and links. The script also performs basic structural analysis and keyword frequency analysis.

Features

  • Extracts HTML, CSS, and potential API endpoints
  • Retrieves metadata (title, description, keywords)
  • Extracts text content and links (both internal and external)
  • Analyzes basic website structure
  • Performs simple keyword frequency analysis
  • Saves all extracted data to a CSV file
  • Provides key findings, recommendations, and an actionable plan based on the analysis

Requirements

  • Python 3.6+
  • requests library
  • beautifulsoup4 library

Installation

  1. Ensure you have Python 3.6 or higher installed on your system.
  2. Install the required libraries:

Usage

  1. Run the script:
  2. When prompted, enter the URL of the website you want to analyze.
  3. The script will scrape the website and save the results to a file named website_analysis.csv in the same directory.
  4. If needed, a console output with key findings, recommendations and an actionable plan will be provided.

Output

The script generates two types of output:

  1. A CSV file (website_analysis.csv) containing all extracted data.
  2. Console output with key findings, recommendations, and an actionable plan.

CSV File Structure

The CSV file contains the following information:

  • HTML content
  • CSS content
  • Detected API endpoints
  • Metadata (title, description, keywords)
  • Page content
  • Internal and external links
  • Website structure analysis
  • Top keywords

Console Output

The console output provides:

  • A summary of key findings from the analysis
  • Recommendations based on the findings
  • An actionable plan for website improvement

Limitations

  • The script performs basic scraping and analysis. For more complex websites, additional customization may be required.
  • The keyword analysis is based on simple frequency counting and may not account for context or importance.
  • The script does not handle JavaScript-rendered content.

Legal and Ethical Considerations

Ensure you have the right to scrape and analyze the target website. Always review and comply with the website's robots.txt file and terms of service.

Contributing

Contributions, issues, and feature requests are welcome. Feel free to check [issues page] if you want to contribute.

License

MIT License

About

HTML Scraper for Website Analysis

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages