Python3 'Content Crawler': scrapes HTML and creates an easily navigable set of files from the result

Shardj/ccrawler

CCrawler

Based on the input parameters, CCrawler crawls through the given webpage(s), downloads their content, and builds a navigation tree so you can see which pages link to which and browse the result easily.
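As a rough illustration of the "which pages link to which" idea, here is a minimal sketch using only the standard library. It is not CCrawler's actual code: the names `LinkExtractor` and `build_link_tree` are hypothetical, and the `pages` dict is an assumed stand-in for downloaded content so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they appear on.
                self.links.append(urljoin(self.base_url, value))


def build_link_tree(pages, start_url):
    """Breadth-first walk from start_url, returning {url: [linked urls]}.

    `pages` is a hypothetical {url: html} mapping standing in for
    content that a real crawler would download.
    """
    tree = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in tree or url not in pages:
            continue  # already visited, or nothing downloaded for it
        parser = LinkExtractor(url)
        parser.feed(pages[url])
        tree[url] = parser.links
        queue.extend(parser.links)
    return tree


# Assumed sample data: three tiny pages on a placeholder host.
pages = {
    "http://example.test/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="b.html">B</a>',
    "http://example.test/b.html": "<p>no links</p>",
}
tree = build_link_tree(pages, "http://example.test/")
```

Each key in `tree` is a crawled page and each value lists the pages it links to, which is enough to render a simple navigation index.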

I don't plan on working on this any further. I made it specifically to download all of the ZF1 documentation before it was deleted from the web, and I've now done that.

WIP

To install, make sure you have Python and pip, then run pip install -r requirements.txt or pip3 install -r requirements.txt, depending on your setup.

If pip gets mixed up between Python 2.7 and Python 3.2, like it did for me, use sudo pip3 install MODULE_NAME for each module. curl is also required (sudo apt-get install curl).

Unit Tests

Run all tests with bootstrapping: python3 -m test OR python3 test.py

Run an individual test (tests requiring bootstrapping will fail): python3 -m unittest tests/test_{name}.py

Notes

git update-index --skip-worktree storage/local.ini was run so that future local changes to that file are not tracked. Undo it with git update-index --no-skip-worktree storage/local.ini if required in the future.

A new __builtins__ function was added in launcher.py to allow easy module imports from anywhere in the project to anywhere else. Its usage isn't ideal, however, and a viable alternative would be helpful.
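For readers unfamiliar with the technique, the sketch below shows one way a helper can be registered on the `builtins` module so that every module can call it without an import. It is an assumption about the general approach, not a copy of launcher.py: the helper name `include` and its behaviour are hypothetical.

```python
import builtins
import importlib


def _include(module_name):
    """Import a module by its dotted name and return the module object."""
    return importlib.import_module(module_name)


# Register the helper once, for example when the launcher starts.
# `include` is a hypothetical name chosen for this sketch.
builtins.include = _include

# From here on, any module in the process can call include() without
# importing it, because unresolved names fall back to builtins.
json_module = include("json")
encoded = json_module.dumps({"ok": True})
```

The drawback is that the name appears from nowhere as far as linters, IDEs and new readers are concerned. One commonly used alternative is to make the project an installable package and use absolute imports from the package root.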

TODO: handle items in database instead of in object

TODO: wrapper, indexing, navigation
