TinyCrawler


A highly customizable crawler that uses multiprocessing and proxies to download one or more websites, guided by user-provided filter, search, and save functions.

REMEMBER THAT DDOS IS ILLEGAL. DO NOT USE THIS SOFTWARE FOR ILLEGAL PURPOSES.

Installing TinyCrawler

pip install tinycrawler

TODOs for next version

  • Test proxies during normal downloads. - DONE
  • Parallelize downloads across different domains. - DONE
  • Drop proxies with a high failure rate, with parameters for that rate. - DONE, yet to be tested
  • Make the failure rate domain-specific, while also keeping a global mean.
  • Enable the failure rate for local (non-proxy) downloads as well.
  • Also check robots.txt before downloading URLs.
  • Reduce the default robots.txt timeout to 2 hours.
  • Make the wait timeout between download attempts exponential.
  • Detect binary files by checking whether more than 3/5 of the first 1000 characters are zeros (see the sketch after this list).
  • Add a user agent.
  • Stop downloads when all proxies are dead.
  • Try using active_children to test for active processes.
  • Add tests for proxies.
  • Add a way to save progress automatically at a given interval.
  • Add a way to automatically save tested proxies.
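
A minimal sketch of that binary-file heuristic, assuming the raw bytes of the file are already in hand (the function name, sample size, and threshold here are illustrative, not an implemented API):

def looks_binary(data: bytes, sample_size: int = 1000, threshold: float = 3 / 5) -> bool:
    """Guess whether content is binary: True when more than `threshold`
    of the first `sample_size` bytes are zeros."""
    sample = data[:sample_size]
    if not sample:
        return False
    return sample.count(0) / len(sample) > threshold

print(looks_binary(b"\x00" * 700 + b"a" * 300))  # True: 70% of the sample is zeros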

Preview (Test case)

This is a preview of the console output when running test_base.py.


Basic usage example

from tinycrawler import TinyCrawler, Log
from bs4 import BeautifulSoup


def url_validator(url: str, logger: Log) -> bool:
    """Return a boolean representing if the crawler should parse given url."""
    return url.startswith("https://www.example.com")


def file_parser(url: str, soup: BeautifulSoup, logger: Log):
    """Parse and elaborate the given soup."""
    # soup parsing...
    pass

TinyCrawler(
    file_parser=file_parser,
    url_validator=url_validator
).run("https://www.example.com/")
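
As an illustration only, a concrete file_parser could look like the following; it touches only the soup, since the Log API is not documented above, so print stands in for proper logging:

from bs4 import BeautifulSoup

from tinycrawler import Log


def file_parser(url: str, soup: BeautifulSoup, logger: Log):
    """Print the title of every parsed page."""
    title = soup.title.string if soup.title else "<untitled>"
    print("{}: {}".format(url, title))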

Example loading proxies

from tinycrawler import TinyCrawler, Log
from bs4 import BeautifulSoup


def url_validator(url: str, logger: Log) -> bool:
    """Return a boolean representing if the crawler should parse given url."""
    return url.startswith("https://www.example.com")


def file_parser(url: str, soup: BeautifulSoup, logger: Log):
    """Parse and elaborate the given soup."""
    # soup parsing...
    pass

crawler = TinyCrawler(
    file_parser=file_parser,
    url_validator=url_validator
)
crawler.load_proxies("http://myexampletestserver.com", "path/to/proxies.json")
crawler.run("https://www.example.com/")

Proxies are expected to be in the following format:

[
  {
    "ip": "89.236.17.108",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  },
  {
    "ip": "128.199.141.151",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  }
]
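
Such a file can be produced with the standard json module; the entries below are placeholder documentation IPs, not working proxies:

import json

# Placeholder proxies following the schema above: ip, port and the
# protocols each proxy supports.
proxies = [
    {"ip": "203.0.113.10", "port": 8080, "type": ["https", "http"]},
    {"ip": "198.51.100.7", "port": 3128, "type": ["http"]},
]

with open("path/to/proxies.json", "w") as f:
    json.dump(proxies, f, indent=2)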

License

The software is released under the MIT license.
