Exoskeleton

Machine learning and other applications often require downloading thousands, sometimes hundreds of thousands, of files.

Doing so over a high-speed connection risks running an involuntary denial-of-service attack on the servers that provide those files and webpages.

Exoskeleton is a Python framework that helps you build a crawler / scraper that avoids putting too much load on the connection. Instead, it runs permanently and fault-tolerantly until it has downloaded all files.
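One common way to keep the load low is to enforce a minimum pause between consecutive requests. The following is a minimal sketch of that idea only. It is not exoskeleton's actual implementation, and the class name `PoliteThrottle` and its parameters are hypothetical.

```python
import time


class PoliteThrottle:
    """Enforce a minimum pause between requests (hypothetical helper)."""

    def __init__(self, min_wait_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_wait_seconds = min_wait_seconds
        # clock and sleep are injectable so the logic can be tested
        # without actually waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the pause; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            remaining = self.min_wait_seconds - elapsed
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Remember when this request was released.
        self._last_request = self._clock()
        return slept
```

A crawler loop would call `wait()` right before each request, so a fast connection cannot issue requests faster than the configured pause allows.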

Its main functionalities are:

- Managing the download queue and document data within a MariaDB database.
- Avoiding processing the same URL more than once.
- Working through the queue by either
  - downloading files to disk,
  - storing the page source code into a database table,
  - storing the page text,
  - or making PDF copies of webpages.
- Managing already downloaded files:
  - Storing multiple versions of a specific file.
  - Assigning labels to downloads, so they can be found and grouped easily.
- Sending progress reports to the admin.
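To illustrate the duplicate handling and labeling described above, here is a minimal in-memory sketch. Exoskeleton itself keeps this state in MariaDB. The class `DownloadQueue` and its methods are hypothetical stand-ins that show the idea, not the library's API.

```python
class DownloadQueue:
    """In-memory stand-in for a database-backed queue (illustration only)."""

    def __init__(self):
        self._labels_by_url = {}  # URL -> set of labels
        self._pending = []        # URLs in the order they were first added

    def add(self, url, labels=None):
        """Queue the URL only once, but always merge the given labels.

        Returns True if the URL was newly queued, False for a duplicate.
        """
        is_new = url not in self._labels_by_url
        if is_new:
            self._labels_by_url[url] = set()
            self._pending.append(url)
        self._labels_by_url[url].update(labels or ())
        return is_new

    def labels(self, url):
        """Return a copy of the labels attached to the URL."""
        return set(self._labels_by_url.get(url, ()))

    def pending(self):
        """Return a copy of the queued URLs."""
        return list(self._pending)
```

Adding the same URL twice leaves a single queue entry, while labels passed with the second call are still attached to it.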

Documentation

How To Use Exoskeleton

Example Uses

Downloading an Archive: A fairly complex use case that requires some custom SQL. This is the actual project that triggered the development of exoskeleton.

Technical Documentation

Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import logging

import exoskeleton

logging.basicConfig(level=logging.DEBUG)

# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # Set to True to stop once the queue is empty. Otherwise the bot
    # keeps checking the queue for new tasks:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)

exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.


exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.

# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')

# Use Chromium or Google Chrome to generate a PDF of the website:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')

# work through the queue:
exo.process_queue()
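
The example's comments say that a downloaded file is named with the chosen prefix, followed by the database id and the file extension. A minimal sketch of such a naming rule might look as follows. The function `build_filename` is hypothetical, and taking the extension from the URL path is an assumption, not necessarily how exoskeleton determines it.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_filename(prefix, database_id, url):
    """Combine prefix, database id, and the extension found in the URL path.

    Hypothetical helper: the extension is read from the URL path only,
    so a URL without one yields a filename without an extension.
    """
    extension = PurePosixPath(urlparse(url).path).suffix
    return f"{prefix}{database_id}{extension}"


print(build_filename('bot_', 1, 'https://www.ruediger-voigt.eu/examplefile.txt'))
```

With the prefix `bot_` from the example and an assumed database id of 1, this prints `bot_1.txt`.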
