Simple Java Web-Crawler for internal and external links and images on websites with XML-Output.

WebCrawler

This is a simple, recursive Java Web-Crawler that collects the internal and external links and the images of a specific website and writes an XML file listing every URL found together with its returned HTTP status code. It follows newly discovered links through the website, but it never crawls the same page twice and does not try to crawl downloadable files.
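The crawl behaviour described above (recurse into internal links, record external links without following them, never visit a page twice, skip downloadable files) can be sketched as follows. This is an illustration only, not the project's actual code: the class `CrawlSketch`, the extension list, and the in-memory `pages` map that stands in for real HTTP requests and HTML parsing are all assumptions.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    // Hypothetical list of extensions treated as downloadable files.
    static final List<String> FILE_EXTENSIONS = List.of(".pdf", ".zip", ".exe", ".jar");

    final Set<String> internal = new LinkedHashSet<>(); // links on the start host
    final Set<String> external = new LinkedHashSet<>(); // links on any other host
    final Set<String> crawled = new LinkedHashSet<>();  // pages whose links were read

    static boolean isDownload(String url) {
        String path = URI.create(url).getPath();
        if (path == null) return false;
        String lower = path.toLowerCase(Locale.ROOT);
        return FILE_EXTENSIONS.stream().anyMatch(lower::endsWith);
    }

    static boolean sameHost(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }

    // 'pages' maps a URL to the links found on it (assumption: a real crawler
    // would fetch and parse the page here instead of looking it up).
    void crawl(String root, Map<String, List<String>> pages) {
        internal.add(root);
        visit(root, root, pages);
    }

    private void visit(String root, String url, Map<String, List<String>> pages) {
        if (!crawled.add(url)) return; // already crawled: stop the recursion
        for (String link : pages.getOrDefault(url, List.of())) {
            if (!sameHost(root, link)) {
                external.add(link);    // recorded, but never followed
                continue;
            }
            internal.add(link);
            if (!isDownload(link)) {   // downloadable files are listed, not crawled
                visit(root, link, pages);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> pages = Map.of(
            "https://example.com", List.of("https://example.com/a", "https://other.org/x"),
            "https://example.com/a", List.of("https://example.com", "https://example.com/doc.pdf"));
        CrawlSketch sketch = new CrawlSketch();
        sketch.crawl("https://example.com", pages);
        System.out.println("internal: " + sketch.internal);
        System.out.println("external: " + sketch.external);
        System.out.println("crawled:  " + sketch.crawled);
    }
}
```

The `crawled` set is what prevents endless loops on pages that link back to each other, which almost every real website does.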

Important: Crawling may take some time and put considerable load on the server. Be careful!

Download

Run

GUI

Double-click the downloaded file or use the console:

java -jar WebCrawler.jar

Console

java -jar WebCrawler.jar http://my-website.com

Example Output

GUI

(Screenshot: GUI)

Console

INTERNAL LINKS:
[1] [200] https://menzerath.eu
[2] [200] https://menzerath.eu/rss/
[3] [200] https://menzerath.eu/tag/android/
[4] [200] https://menzerath.eu/tag/java/
[5] [XXX] ...

EXTERNAL LINKS:
[1] [200] https://facebook.com/menzerath.eu
[2] [200] https://twitter.com/MarvinMenzerath
[3] [200] https://github.com/MarvinMenzerath
[4] [200] http://blackphantom.de
[5] [XXX] ...

INTERNAL / EXTERNAL IMAGES:
[1] [200] https://raw.githubusercontent.com/MarvinMenzerath/IsMyWebsiteDown/master/doc/Screenshot1.png
[2] [200] https://raw.githubusercontent.com/MarvinMenzerath/IsMyWebsiteDown/master/doc/Screenshot2.png
[3] [200] https://raw.githubusercontent.com/MarvinMenzerath/IsMyWebsiteDown/master/doc/Screenshot3.png
[4] [200] https://menzerath.eu/content/images/2014/10/Screen.png
[5] [XXX] ...
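Each console line above follows the pattern `[index] [status code] URL`, grouped under a section title. A minimal sketch of producing that layout is shown below; the class `ConsoleReportSketch` and its method are hypothetical names chosen for illustration and are not taken from the project's source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ConsoleReportSketch {
    // Builds one report section: a title line followed by numbered entries.
    // 'links' maps each URL to its status code; insertion order is kept.
    static List<String> formatSection(String title, Map<String, Integer> links) {
        List<String> lines = new ArrayList<>();
        lines.add(title + ":");
        int index = 1;
        for (Map.Entry<String, Integer> entry : links.entrySet()) {
            lines.add("[" + index + "] [" + entry.getValue() + "] " + entry.getKey());
            index++;
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Integer> internal = new LinkedHashMap<>();
        internal.put("https://menzerath.eu", 200);
        internal.put("https://menzerath.eu/rss/", 200);
        formatSection("INTERNAL LINKS", internal).forEach(System.out::println);
    }
}
```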

XML-File

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<urlset>
    <internal>
        <link>
            <url>https://menzerath.eu</url>
            <code>200</code>
        </link>
        <link>
            <url>https://menzerath.eu/rss/</url>
            <code>200</code>
        </link>
        <link>
            <url>https://menzerath.eu/tag/android/</url>
            <code>200</code>
        </link>
        <link>
            <url>https://menzerath.eu/tag/java/</url>
            <code>200</code>
        </link>
        <link>
            <url>...</url>
            <code>XXX</code>
        </link>
    </internal>
    <external>
        <link>
            <url>https://facebook.com/menzerath.eu</url>
            <code>200</code>
        </link>
        <link>
            <url>https://twitter.com/MarvinMenzerath</url>
            <code>200</code>
        </link>
        <link>
            <url>https://github.com/MarvinMenzerath</url>
            <code>200</code>
        </link>
        <link>
            <url>http://blackphantom.de</url>
            <code>200</code>
        </link>
        <link>
            <url>...</url>
            <code>XXX</code>
        </link>
    </external>
    <images>
        <link>
            <url>https://raw.githubusercontent.com/MarvinMenzerath/IsMyWebsiteDown/master/doc/Screenshot1.png</url>
            <code>200</code>
        </link>
        <link>
            <url>https://raw.githubusercontent.com/MarvinMenzerath/IsMyWebsiteDown/master/doc/Screenshot2.png</url>
            <code>200</code>
        </link>
        <link>
            <url>https://raw.githubusercontent.com/MarvinMenzerath/IsMyWebsiteDown/master/doc/Screenshot3.png</url>
            <code>200</code>
        </link>
        <link>
            <url>https://menzerath.eu/content/images/2014/10/Screen.png</url>
            <code>200</code>
        </link>
        <link>
            <url>...</url>
            <code>XXX</code>
        </link>
    </images>
</urlset>
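A document with this structure (`urlset` containing `internal`, `external` and `images`, each holding `link` elements with a `url` and a `code`) could be generated with the DOM and Transformer APIs that ship with the JDK. The sketch below is one possible approach under that assumption; the class `XmlReportSketch` and its methods are hypothetical, and the exact whitespace of the output may differ between JDK versions.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XmlReportSketch {
    // Appends <groupName> with one <link><url/><code/></link> per map entry.
    static void appendGroup(Document doc, Element parent, String groupName,
                            Map<String, Integer> links) {
        Element group = doc.createElement(groupName);
        parent.appendChild(group);
        for (Map.Entry<String, Integer> entry : links.entrySet()) {
            Element link = doc.createElement("link");
            Element url = doc.createElement("url");
            url.setTextContent(entry.getKey());
            Element code = doc.createElement("code");
            code.setTextContent(String.valueOf(entry.getValue()));
            link.appendChild(url);
            link.appendChild(code);
            group.appendChild(link);
        }
    }

    // Serializes the three result groups into one XML string.
    static String toXml(Map<String, Integer> internal, Map<String, Integer> external,
                        Map<String, Integer> images) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("urlset");
            doc.appendChild(root);
            appendGroup(doc, root, "internal", internal);
            appendGroup(doc, root, "external", external);
            appendGroup(doc, root, "images", images);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked XML exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not build XML report", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> internal = new LinkedHashMap<>();
        internal.put("https://menzerath.eu", 200);
        Map<String, Integer> external = new LinkedHashMap<>();
        external.put("https://github.com/MarvinMenzerath", 200);
        System.out.println(toXml(internal, external, new LinkedHashMap<>()));
    }
}
```

Setting the text through `setTextContent` lets the serializer escape characters such as `&` in URLs, which string concatenation would not do.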

License

Copyright (C) 2014 Marvin Menzerath

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.