A highly customizable crawler that uses multiprocessing and proxies to download one or more websites, following given filter, search and save functions.
REMEMBER THAT DDOS IS ILLEGAL. DO NOT USE THIS SOFTWARE FOR ILLEGAL PURPOSES.
pip install tinycrawler
- Test proxies during normal downloading. - DONE
- Parallelize downloads of different domains. - DONE
- Drop proxies with a high failure rate, with parameters to tune that rate. - DONE, yet to be tested
- Make the failure rate domain-specific, while also keeping a global mean.
- Enable the failure rate also for local (proxy-less) downloads.
- Check robots.txt before downloading URLs as well.
- Reduce the default robots.txt timeout to 2 hours.
- Make the wait timeout between download attempts exponential (see the backoff sketch after this list).
- Detect binary files by checking whether more than 3/5 of the first 1000 characters are zeros (see the heuristic sketch after this list).
- Add a user agent.
- Stop downloads when all proxies are dead.
- Try using active_children to test for active processes.
- Add tests for proxies.
- Add a way to automatically save progress at a given interval.
- Add a way to automatically save tested proxies.
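As a sketch of the exponential backoff item above: the function name, the use of requests, and the base delay are illustrative assumptions, not the library's actual internals.

import random
import time

import requests


def download_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    """Retry a download, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: wait 1s, 2s, 4s, ... plus a little jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))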
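The binary-file heuristic above could be sketched as follows. This reads the item as counting zero bytes in the first 1000 bytes of content; that interpretation, and the function name, are assumptions.

def looks_binary(content: bytes, sample_size: int = 1000, threshold: float = 3 / 5) -> bool:
    """Treat content as binary when more than 3/5 of the first 1000 bytes are zeros."""
    sample = content[:sample_size]
    if not sample:
        return False
    # bytes.count(0) counts null bytes in the sample.
    return sample.count(0) / len(sample) > threshold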
This is a preview of the console output when running test_base.py.
from tinycrawler import TinyCrawler, Log
from bs4 import BeautifulSoup


def url_validator(url: str, logger: Log) -> bool:
    """Return a boolean representing whether the crawler should parse the given url."""
    return url.startswith("http://interestingurl.com")


def file_parser(url: str, soup: BeautifulSoup, logger: Log):
    """Parse and elaborate the given soup."""
    # soup parsing...
    pass


TinyCrawler(
    file_parser=file_parser,
    url_validator=url_validator
).run("https://www.example.com/")
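The same example, loading a list of proxies before running the crawler: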
from tinycrawler import TinyCrawler, Log
from bs4 import BeautifulSoup


def url_validator(url: str, logger: Log) -> bool:
    """Return a boolean representing whether the crawler should parse the given url."""
    return url.startswith("http://interestingurl.com")


def file_parser(url: str, soup: BeautifulSoup, logger: Log):
    """Parse and elaborate the given soup."""
    # soup parsing...
    pass


crawler = TinyCrawler(
    file_parser=file_parser,
    url_validator=url_validator
)
crawler.load_proxies("http://myexampletestserver.com", "path/to/proxies.json")
crawler.run("https://www.example.com/")
Proxies are expected to be in the following format:
[
    {
        "ip": "89.236.17.108",
        "port": 3128,
        "type": [
            "https",
            "http"
        ]
    },
    {
        "ip": "128.199.141.151",
        "port": 3128,
        "type": [
            "https",
            "http"
        ]
    }
]
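For instance, such a file could be written with the standard json module; the proxy addresses below are the sample values above, and the output path is only a placeholder.

import json

proxies = [
    {"ip": "89.236.17.108", "port": 3128, "type": ["https", "http"]},
    {"ip": "128.199.141.151", "port": 3128, "type": ["https", "http"]},
]

# Write the proxy list in the format expected by load_proxies.
with open("path/to/proxies.json", "w") as f:
    json.dump(proxies, f, indent=4)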
The software is released under the MIT license.