Euterpe

Euterpe is a web crawler that searches a website for internal and external broken links. The crawler is written in Python; the demo dashboard was bootstrapped with Create React App.

Available Scripts

Crawler

Requires Scrapy to be installed: pip install scrapy.

  • scrapy crawl check_anchor_tags -t <type> -o <filename>

    Note: this is only a prototype; it currently checks anchor tags only.

    Runs the crawler and logs its output to a file.
    -t json -o file.json logs output to file.json.
    -t csv -o file.csv logs output to file.csv.
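
    The spider's source isn't reproduced here, but as a rough sketch of the technique: a Scrapy spider that follows every anchor on a page and reports non-2xx responses could look like the following. The start URL is a placeholder (the real one is hardcoded, per the To Do list) and the output fields are illustrative.

        import scrapy


        class CheckAnchorTagsSpider(scrapy.Spider):
            """Follow every <a href> found on the start page and report broken links."""

            name = "check_anchor_tags"
            start_urls = ["https://example.com"]  # placeholder; the project hardcodes its target
            handle_httpstatus_list = list(range(400, 600))  # let error responses reach callbacks

            def parse(self, response):
                # Queue every anchor href, remembering which page it was found on.
                for href in response.css("a::attr(href)").getall():
                    yield response.follow(href, callback=self.check_link,
                                          meta={"found_on": response.url})

            def check_link(self, response):
                # Anything with a 4xx/5xx status is reported as a broken-link item.
                if response.status >= 400:
                    yield {
                        "url": response.url,
                        "status": response.status,
                        "found_on": response.meta["found_on"],
                    }

    Run inside a Scrapy project, scrapy crawl check_anchor_tags -t json -o broken.json would then write one record per broken link.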

Demo Dashboard

Go into the demo-dashboard folder and run any of the following commands:

  • npm start

    Runs the app in development mode.
    Open http://localhost:3000 to view it in the browser.

    The page will reload if you make edits.
    You will also see any lint errors in the console.

  • npm test

    Launches the test runner in interactive watch mode.
    See the Create React App documentation on running tests for more information.

  • npm run build

    Builds the app for production to the build folder.
    It correctly bundles React in production mode and optimizes the build for the best performance.

    The build is minified and the filenames include the hashes.
    Your app is ready to be deployed!

    See the Create React App documentation on deployment for more information.

  • npm run eject

    Note: this is a one-way operation. Once you eject, you can’t go back!

    If you aren’t satisfied with the build tool and configuration choices, you can eject at any time. This command will remove the single build dependency from your project.

    Instead, it will copy all the configuration files and the transitive dependencies (webpack, Babel, ESLint, etc.) right into your project so you have full control over them. All of the commands except eject will still work, but they will point to the copied scripts so you can tweak them. At this point you’re on your own.

    You don’t ever have to use eject. The curated feature set is suitable for small and mid-sized deployments, and you shouldn’t feel obligated to use this feature. However, we understand that this tool wouldn’t be useful if you couldn’t customize it when you are ready for it.

To Do

  1. Map data with clicks data.
  2. Allow input of the website URL (currently hardcoded).
  3. Allow checking of img src links.
  4. Allow checking of script source links, CSS stylesheet links, favicons, etc.
  5. Allow customisable settings for download timeout, DNS timeout, number of retries, etc. (see the sketch after this list).
  6. Improve error handling.
  7. Extract crawler statistics data (already collected, but currently only printed to the console).
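
Item 5 maps naturally onto Scrapy's built-in settings; a minimal sketch of what a configurable settings.py could expose (the setting names are standard Scrapy, the values are illustrative):

    # settings.py -- illustrative values; all names are standard Scrapy settings
    DOWNLOAD_TIMEOUT = 15   # seconds before a download attempt fails (Scrapy default: 180)
    DNS_TIMEOUT = 10        # seconds allowed for DNS resolution (default: 60)
    RETRY_ENABLED = True
    RETRY_TIMES = 3         # extra attempts after the first failure (default: 2)
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # statuses worth retrying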

Credits

This is a Government Digital Services (GDS) GovTech Hackweek Jan 2020 project. Team members involved are:

  • Cecilia Lim
  • Chan Win Hung
  • Cheong Jie Wei
  • Dave Quah
  • Lim Kim Yong
