A multi-threaded Lofter crawler based on Python 3.6
New version: https://github.com/BYJRK/LofterCrawlerV2
For the Chinese version, please refer to the `advanced` branch.
```
python loftercrawler.py -h
usage: loftercrawler.py [-h] [-max MAX_PAGE] [-start START_PAGE]
                        [-dir DIRECTORY] [--max_threads MAX_THREADS]
                        [-r REPLACE] [--timeout TIMEOUT]
                        domain

positional arguments:
  domain                domain name or post link

optional arguments:
  -h, --help            show this help message and exit
  -max MAX_PAGE, --max_page MAX_PAGE
                        maximum page amount (default = 160)
  -start START_PAGE, --start_page START_PAGE
                        start searching from this page
  -dir DIRECTORY, --directory DIRECTORY
                        save the downloaded images to this local folder
  --max_threads MAX_THREADS
  -r REPLACE, --replace REPLACE
                        replace existing files with the same name (default
                        = False)
  --timeout TIMEOUT     request timeout (seconds, default = 8)
```
- If you want to download all images from yurisa123.lofter.com (Yurisa's Lofter homepage), you can type:

```
python loftercrawler.py yurisa123
```

The crawler will then start working, and by default the images will be saved to a folder named yurisa (taken from the page's html.head.title) in the same directory as the program itself. The program searches posts from page 1 up to at most page 160 (the default maximum page count is 160).
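For reference, the default folder name could be derived roughly as follows. This is a minimal sketch, not the actual implementation; the `get_title` helper and the regex-based parsing are illustrative assumptions:

```python
import re

import requests

def get_title(domain: str) -> str:
    """Derive a folder name from the homepage <title> (illustrative sketch)."""
    html = requests.get(f'http://{domain}.lofter.com', timeout=8).text
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    # Fall back to the domain itself if no title can be found
    return match.group(1).strip() if match else domain

print(get_title('yurisa123'))  # e.g. 'yurisa'
```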
- If you want to start from page 5 and crawl 10 pages in total (pages 5 to 14), then:

```
python loftercrawler.py yurisa123 -start 5 -max 10
```

- If you want to save the images to another folder, you can type:

```
python loftercrawler.py yurisa123 -dir my_favorite_images
```

- If you want to download all images in a specific post, you can type:

```
python loftercrawler.py http://yurisa123.lofter.com/post/1cf5f941_12bd7e63c
```
- `--max_threads 8` sets the number of worker threads used for downloading.
- `--timeout 8` sets the read timeout (in seconds) of HTTP requests.
- `--replace` forces the download process to overwrite existing files with the same name. Otherwise, the program skips files that appear to have already been downloaded.
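These two performance options map naturally onto `requests` and the standard `concurrent.futures` module. A minimal sketch of that mapping, assuming a plain `ThreadPoolExecutor` setup (the variable names and example URLs are illustrative, not the program's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

MAX_THREADS = 8  # --max_threads
TIMEOUT = 8      # --timeout (seconds)

def download(url: str) -> bytes:
    # The timeout applies to each individual HTTP request
    response = requests.get(url, timeout=TIMEOUT)
    response.raise_for_status()
    return response.content

urls = ['http://example.com/', 'http://example.org/']
with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    pages = list(pool.map(download, urls))
```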
The crawler works in the following steps (a rough sketch of the pipeline follows this list):
- Figure out the actual page range for further crawling;
- Collect all post links on the pages;
- Collect all image links from the posts;
- Download all images using multiple threads;
- Automatically retry failed downloads (only once).
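The whole pipeline could be sketched as below. This is a simplified illustration under assumed URL patterns and regexes (the pagination scheme and the `bigimgsrc` attribute are assumptions about Lofter's markup), not the program's real code:

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

TIMEOUT = 8

def fetch(url: str) -> str:
    return requests.get(url, timeout=TIMEOUT).text

def collect_post_links(domain: str, start: int, max_pages: int) -> list:
    """Steps 1-2: walk the page range and gather post links."""
    links = []
    for page in range(start, start + max_pages):
        html = fetch(f'http://{domain}.lofter.com/?page={page}')
        found = re.findall(rf'http://{domain}\.lofter\.com/post/\w+_\w+', html)
        if not found:  # no posts here: we ran past the actual last page
            break
        links.extend(found)
    return list(dict.fromkeys(links))  # deduplicate, keep order

def collect_image_links(post_url: str) -> list:
    """Step 3: pull image URLs out of a post page."""
    html = fetch(post_url)
    return re.findall(r'bigimgsrc="([^"]+)"', html)

def download(url: str, folder: Path, retry: bool = True) -> None:
    """Steps 4-5: download one image, retrying once on failure."""
    target = folder / url.split('/')[-1].split('?')[0]
    if target.exists():
        return  # skip files that appear to have been downloaded already
    try:
        target.write_bytes(requests.get(url, timeout=TIMEOUT).content)
    except requests.RequestException:
        if retry:
            download(url, folder, retry=False)

def crawl(domain: str, start: int = 1, max_pages: int = 160) -> None:
    folder = Path(domain)
    folder.mkdir(exist_ok=True)
    posts = collect_post_links(domain, start, max_pages)
    images = [img for post in posts for img in collect_image_links(post)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(lambda u: download(u, folder), images))

crawl('yurisa123', start=1, max_pages=2)
```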
This program still needs improvement. Looking forward to your comments!
Planned features:
- Separate images into different folders named after the titles of their posts;
- Design a GUI;
- Introduce CNNs to decide whether a downloaded image should be kept, based on a face score;
- To be continued...