DoRePy (pronounced like doe-ray-pee) is your go-to script for automating the download of files from a webpage that match a specific regex pattern. Fed up with manually sifting through pages to download files? DoRePy has got your back!
- Regex Pattern Matching: Use the power of regular expressions to target exactly the files you need.
- Retry Logic: Network hiccup? No problem. DoRePy retries failed downloads, respecting rate limits like a well-mannered netizen.
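To make the retry behaviour concrete, here is a minimal sketch of how a polite retry loop could look. It is not DoRePy's actual code. The names `retry_delay` and `download_with_retries`, the `max_attempts` parameter, and the `fetch` callable (which stands in for a real HTTP request) are all assumptions made for illustration. The only details taken from this README are the use of the server's retry-after time and the 30-second fallback.

```python
import time


def retry_delay(headers, default=30.0):
    """Return how many seconds to wait before retrying a failed request.

    Uses the Retry-After header when it holds a whole number of seconds;
    otherwise falls back to `default`.
    """
    value = headers.get("Retry-After")
    try:
        return max(0, int(value))
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date, which this sketch ignores.
        return default


def download_with_retries(fetch, url, max_attempts=3, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    `fetch` is a hypothetical callable returning (status_code, headers, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch(url)
        if status == 200:
            return body
        if attempt < max_attempts:
            # Wait as long as the server asks, or the default if it does not say.
            sleep(retry_delay(headers))
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting. A real implementation would wrap a `requests.get` call in `fetch`.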
- Python 3
- Requests:
    pip install requests
- BeautifulSoup:
    pip install beautifulsoup4
Below are the two methods of installing DoRePy:
- Pip
    pip install DoRePy
- Clone the repo
  Clone this repository or simply download dorepy.py to your local machine:
    git clone https://github.com/CillySu/DoRePy/dorepy.git

Navigate to the directory in which you want the files downloaded. If you installed with pip, run:
    dorepy [URL] [PATTERN]
If you downloaded the .py file, run dorepy.py instead:
    python ./dorepy.py [URL] [PATTERN]
Where:
- [URL] is the webpage URL from which you want to download files.
- [PATTERN] is the regex pattern that matches the file names you want to download.
Example:
    python dorepy.py "http://example.com" "\.pdf$"
This command downloads all PDF files linked from http://example.com.
- \. matches a literal .
- pdf matches pdf (when following a literal .)
- $ matches the end of the filename
The end result is that files ending in .pdf are matched. See RegExr for help on building regex patterns.
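To see the pattern at work, the sketch below collects the links from a small HTML snippet and keeps those whose file name matches the regex. It illustrates the idea and is not DoRePy's own code. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies, and `LinkCollector`, `matching_links`, and the sample page are hypothetical. Matching ignores case, in line with the behaviour this README describes as current.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def matching_links(html, base_url, pattern):
    """Return absolute URLs of links whose file name matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    collector = LinkCollector()
    collector.feed(html)
    matches = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        # Match against the last path segment, i.e. the file name.
        filename = urlparse(absolute).path.rsplit("/", 1)[-1]
        if regex.search(filename):
            matches.append(absolute)
    return matches


page = """
<a href="/docs/report.pdf">Report</a>
<a href="slides.PDF">Slides</a>
<a href="/about.html">About</a>
<a href="pdf-guide.txt">Guide</a>
"""
print(matching_links(page, "http://example.com/files/", r"\.pdf$"))
```

Only `report.pdf` and `slides.PDF` are kept. `pdf-guide.txt` contains "pdf" but not as `.pdf` at the end of the name, so the `$` anchor rules it out.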
The following features are envisaged for DoRePy's second movement:
- Batch URL support, allowing one regex pattern to be matched against a list of URLs
- Combinatorial regex logic matching, such that users can supply multiple regex patterns and combine them with the logical operators AND/NOT/NOR/XOR/XNOR
- User-specified output directory for downloading instead of using the present working directory
- User-specified sleep time instead of the current behaviour (check for website-defined retry-after time, failing this default to 30s)
- Recursive downloading of links which are present in pages linked to in a URL, with CLI arguments to define depth of recursion such as -L in tree
- Ability to control whether regex is cAsE sEnSiTiVe (currently always case insensitive)
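As a rough illustration of the combinatorial matching idea, the sketch below evaluates several patterns against one file name and combines the results with a named operator. None of this exists in DoRePy yet. The `combine` function, the `OPERATORS` table, and the choice to treat XOR and XNOR over more than two patterns as odd and even parity are assumptions made for the example.

```python
import re

# Each operator receives a list of booleans, one per pattern.
OPERATORS = {
    "AND": all,
    "NOR": lambda hits: not any(hits),
    "XOR": lambda hits: sum(hits) % 2 == 1,   # an odd number of patterns match
    "XNOR": lambda hits: sum(hits) % 2 == 0,  # an even number of patterns match
}


def combine(filename, patterns, operator):
    """Return True if `filename` satisfies `operator` over `patterns`."""
    hits = [re.search(p, filename, re.IGNORECASE) is not None for p in patterns]
    if operator == "NOT":
        # NOT is unary, so this sketch accepts exactly one pattern.
        if len(hits) != 1:
            raise ValueError("NOT takes exactly one pattern")
        return not hits[0]
    return OPERATORS[operator](hits)
```

For instance, `combine("report_2023.pdf", [r"\.pdf$", r"2023"], "AND")` is true because both patterns match, while `combine("draft.pdf", [r"\.pdf$", r"^draft"], "XOR")` is false for the same reason.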
Feel like DoRePy hit the wrong note? Are we singing off a different hymn sheet? Fork the repo, perform your cover version, and submit a pull request.
Distributed under the MIT License. See LICENSE for more information.
Please use DoRePy wisely and respect website terms of service and your local laws as applicable.
