A comprehensive web scraper application in Go with data persistence and a web interface.
- Multi-source scraping (Hacker News, Bookstore, and more)
- Data persistence using SQLite
- Web interface for browsing scraped data
- Configurable scraping options
- Export to JSON and CSV formats
- Ethical scraping with rate limiting and user agent rotation
- Detailed logging
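The rate limiting and user-agent rotation mentioned above can be sketched as a thin wrapper around `net/http`. This is an illustrative sketch only; the `politeClient` type and its fields are assumptions, not the repository's actual API:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// politeClient is a hypothetical wrapper that throttles requests and
// rotates User-Agent strings between them.
type politeClient struct {
	client     *http.Client
	userAgents []string
	next       int           // index of the next User-Agent to use
	delay      time.Duration // minimum gap between requests
	last       time.Time     // time of the previous request
}

// nextUserAgent returns the configured user agents in round-robin order.
func (p *politeClient) nextUserAgent() string {
	ua := p.userAgents[p.next%len(p.userAgents)]
	p.next++
	return ua
}

// Get waits out the rate limit, then issues a GET with a rotated User-Agent.
func (p *politeClient) Get(url string) (*http.Response, error) {
	if wait := p.delay - time.Since(p.last); wait > 0 {
		time.Sleep(wait)
	}
	p.last = time.Now()

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", p.nextUserAgent())
	return p.client.Do(req)
}

func main() {
	pc := &politeClient{
		client:     &http.Client{Timeout: 10 * time.Second},
		userAgents: []string{"scraper/1.0 (+contact)", "scraper/1.0 (alt)"},
		delay:      2 * time.Second,
	}
	// Demonstrate the rotation without touching the network.
	fmt.Println(pc.nextUserAgent())
	fmt.Println(pc.nextUserAgent())
}
```

Sleeping between requests and identifying the client honestly are the two basics of polite scraping; a production version would also respect `robots.txt` and per-host limits.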
- Go 1.20 or higher
- SQLite 3
1. Clone the repository:

   ```bash
   git clone https://github.com/GoEcosystem/go-web-scraper.git
   cd go-web-scraper
   ```

2. Install dependencies:

   ```bash
   go mod download
   ```

3. Build the application:

   ```bash
   go build -o scraper ./cmd/scraper
   go build -o server ./cmd/webserver
   ```
Run the scraper to collect data:

```bash
./scraper -target=hackernews -pages=5
./scraper -target=bookstore -pages=3
```
Available options:

- `-target`: The website to scrape (hackernews, bookstore)
- `-pages`: Number of pages to scrape
- `-output`: Output format (json, csv, db)
- `-file`: Output filename (when using json or csv)
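Options like these are typically wired up with Go's standard `flag` package. The sketch below mirrors the documented flags and their apparent defaults, but it is not the repository's actual source:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the command-line settings documented above.
type options struct {
	target string
	pages  int
	output string
	file   string
}

// parseFlags builds a flag set matching the documented options and
// parses the given arguments (everything after the program name).
func parseFlags(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("scraper", flag.ContinueOnError)
	fs.StringVar(&o.target, "target", "hackernews", "website to scrape (hackernews, bookstore)")
	fs.IntVar(&o.pages, "pages", 1, "number of pages to scrape")
	fs.StringVar(&o.output, "output", "db", "output format (json, csv, db)")
	fs.StringVar(&o.file, "file", "", "output filename (for json or csv)")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("scraping %d page(s) of %s -> %s\n", o.pages, o.target, o.output)
}
```

Using `flag.NewFlagSet` rather than the package-level flags keeps the parsing testable and lets the scraper and web server define independent flag sets.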
Start the web server:

```bash
./server -port=8080
```

Then open your browser at http://localhost:8080.
```
.
├── cmd/            # Command-line applications
│   ├── scraper/    # CLI scraper tool
│   └── webserver/  # Web interface server
├── db/             # Database management
├── models/         # Data models
├── scrapers/       # Website-specific scrapers
├── utils/          # Utility functions
└── web/            # Web interface
    ├── server.go   # HTTP server
    ├── static/     # Static assets (CSS, JS)
    └── templates/  # HTML templates
```
Comprehensive documentation is available in the `/docs` directory and can be viewed online once GitHub Pages is enabled.
To view the documentation locally:
1. Navigate to the docs directory:

   ```bash
   cd docs
   ```

2. Install Ruby dependencies:

   ```bash
   bundle install
   ```

3. Run the Jekyll server:

   ```bash
   bundle exec jekyll serve
   ```

4. Open your browser and visit http://localhost:4000
To publish the documentation on GitHub Pages:
1. Go to the repository settings: https://github.com/GoEcosystem/go-web-scraper/settings/pages
2. Under "Source", select "Deploy from a branch"
3. Choose the "main" branch and the "/docs" folder
4. Click "Save"
Once enabled, documentation will be available at: https://goecosystem.github.io/go-web-scraper/
The documentation follows the standardized GoEcosystem documentation approach with:
- API documentation
- Architecture reference
- User guides
- Examples
Contributions are welcome! Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.