GoEcosystem Web Scraper

A comprehensive web scraper application in Go with data persistence and a web interface.

Features

  • Multi-source scraping (Hacker News, Bookstore, and more)
  • Data persistence using SQLite
  • Web interface for browsing scraped data
  • Configurable scraping options
  • Export to JSON and CSV formats
  • Ethical scraping with rate limiting and user agent rotation (sketched after this list)
  • Detailed logging
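
The rate limiting and user agent rotation mentioned above work roughly as follows. This is a minimal, hypothetical sketch, not this repository's actual code: the real logic lives in the scrapers/ and utils/ packages, and the user agent strings here are placeholders.

    package main

    import (
        "fmt"
        "math/rand"
        "net/http"
        "time"
    )

    // Placeholder user agent strings; the real list is an assumption.
    var userAgents = []string{
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    }

    // politeGet waits a fixed delay before each request and rotates the
    // User-Agent header, the two techniques named in the feature list.
    func politeGet(url string, delay time.Duration) (*http.Response, error) {
        time.Sleep(delay)
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
        return http.DefaultClient.Do(req)
    }

    func main() {
        resp, err := politeGet("https://news.ycombinator.com/", 2*time.Second)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }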

Installation

Prerequisites

  • Go 1.20 or higher
  • SQLite 3

Getting Started

  1. Clone the repository:

    git clone https://github.com/GoEcosystem/go-web-scraper.git
    cd go-web-scraper
  2. Install dependencies:

    go mod download
  3. Build the application:

    go build -o scraper ./cmd/scraper
    go build -o server ./cmd/webserver

Usage

Command Line Scraper

Run the scraper to collect data:

./scraper -target=hackernews -pages=5
./scraper -target=bookstore -pages=3

Available options (a flag declaration sketch follows this list):

  • -target: The website to scrape (hackernews, bookstore)
  • -pages: Number of pages to scrape
  • -output: Output format (json, csv, db)
  • -file: Output filename (when using json or csv)
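
These options map naturally onto Go's standard flag package. The following is a hypothetical sketch of how cmd/scraper might declare them; the default values are invented for illustration and may not match the real entry point.

    package main

    import (
        "flag"
        "fmt"
    )

    func main() {
        // Flag names mirror the options above; defaults are assumptions.
        target := flag.String("target", "hackernews", "website to scrape (hackernews, bookstore)")
        pages := flag.Int("pages", 1, "number of pages to scrape")
        output := flag.String("output", "db", "output format (json, csv, db)")
        file := flag.String("file", "", "output filename (json or csv only)")
        flag.Parse()

        fmt.Printf("scraping %s, %d pages, output=%s file=%s\n", *target, *pages, *output, *file)
    }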

Web Interface

Start the web server:

./server -port=8080

Then open your browser at http://localhost:8080
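
Under the hood, a server like web/server.go typically wires up static assets and page handlers with net/http. A minimal sketch, assuming illustrative routes and handler bodies (the real handlers render the templates in web/templates and may differ):

    package main

    import (
        "flag"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        port := flag.Int("port", 8080, "port to listen on")
        flag.Parse()

        // Serve CSS/JS from web/static and a placeholder index page.
        http.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.Dir("web/static"))))
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "scraped items would be rendered from web/templates here")
        })

        addr := fmt.Sprintf(":%d", *port)
        log.Printf("listening on %s", addr)
        log.Fatal(http.ListenAndServe(addr, nil))
    }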

Project Structure

.
├── cmd/                  # Command-line applications
│   ├── scraper/          # CLI scraper tool
│   └── webserver/        # Web interface server
├── db/                   # Database management
├── models/               # Data models
├── scrapers/             # Website-specific scrapers
├── utils/                # Utility functions
└── web/                  # Web interface
    ├── server.go         # HTTP server
    ├── static/           # Static assets (CSS, JS)
    └── templates/        # HTML templates
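
As an illustration of how these pieces fit together, a data model in models/ might look like the struct below. The type and field names are hypothetical, not the repository's actual schema.

    package models

    import "time"

    // Article is a hypothetical example of one scraped item as it
    // might be persisted to SQLite.
    type Article struct {
        ID        int64     // primary key
        Title     string    // headline or book title
        URL       string    // source link
        Source    string    // "hackernews" or "bookstore"
        ScrapedAt time.Time // when the item was collected
    }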

Documentation

Comprehensive documentation is available in the /docs directory and can be viewed online once GitHub Pages is enabled.

Local Documentation Preview

To view the documentation locally:

  1. Navigate to the docs directory:

    cd docs
  2. Install Ruby dependencies:

    bundle install
  3. Run the Jekyll server:

    bundle exec jekyll serve
  4. Open your browser and visit:

    http://localhost:4000
    

Enabling GitHub Pages (Repository Admins)

To publish the documentation on GitHub Pages:

  1. Go to the repository settings: https://github.com/GoEcosystem/go-web-scraper/settings/pages
  2. Under "Source", select "Deploy from a branch"
  3. Choose the "main" branch and the "/docs" folder
  4. Click "Save"

Once enabled, documentation will be available at: https://goecosystem.github.io/go-web-scraper/

Documentation Structure

The documentation follows the standardized GoEcosystem documentation approach with:

  • API documentation
  • Architecture reference
  • User guides
  • Examples

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
