GitHub - AlKou/sfzd.cn-Spider

This is a Scrapy spider that scrapes jpg pictures from www.sfzd.cn.

Besides the settings, pipelines, middlewares, and items files, chr.py is the spider file and ProcessMonitor.py is the spider control script, created to work around the no-response issue described below.

The source file of characters to be searched is not included here. The output consists of a csv file matching URLs to characters, a folder containing all jpg files downloaded from the site, a txt file listing every successfully scraped character, and a txt file listing every character that returned no results.
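A minimal sketch of how the csv and the two txt files could be written is shown here. It is not the repository's pipelines.py: the class name, the file names, and the item fields `character` and `image_urls` are all assumptions made for illustration. The method names `open_spider`, `process_item`, and `close_spider` follow the interface Scrapy expects from an item pipeline.

```python
import csv


class ResultLogPipeline:
    """Hypothetical pipeline: log each character to the csv or the no-result txt."""

    def __init__(self, csv_path="urls.csv", done_path="scraped.txt",
                 empty_path="no_result.txt"):
        # All three file names are placeholders, not taken from the repository.
        self.csv_path = csv_path
        self.done_path = done_path
        self.empty_path = empty_path

    def open_spider(self, spider):
        # Append mode keeps earlier results when the spider is restarted.
        self._csv_file = open(self.csv_path, "a", newline="", encoding="utf-8")
        self._writer = csv.writer(self._csv_file)
        self._done = open(self.done_path, "a", encoding="utf-8")
        self._empty = open(self.empty_path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        character = item["character"]          # assumed field name
        urls = item.get("image_urls") or []    # assumed field name
        if urls:
            # One csv row per (character, url) pair.
            for url in urls:
                self._writer.writerow([character, url])
            self._done.write(character + "\n")
        else:
            self._empty.write(character + "\n")
        return item

    def close_spider(self, spider):
        for handle in (self._csv_file, self._done, self._empty):
            handle.close()
```

Because the two txt files record which characters have already been handled, a restarted run could skip them when reading the source file.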

The no-response issue: after roughly 10-20 minutes the spider stops parsing data even though it keeps sending requests. The process monitor script restarts the spider when this happens so that scraping can continue.
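The restart logic can be sketched roughly as follows. This is not the code in ProcessMonitor.py: the README does not say how a stall is detected, so the sketch assumes that an output file which has not been modified for a while means the spider is stuck. The function names, parameters, and default timeouts are hypothetical.

```python
import os
import subprocess
import time


def last_activity(path, started_at):
    """Return the later of the process start time and the file's mtime."""
    try:
        return max(started_at, os.path.getmtime(path))
    except OSError:
        # The output file has not been created yet.
        return started_at


def monitor(cmd, output_path, stall_seconds=300, poll_seconds=10,
            max_restarts=None):
    """Run cmd and restart it whenever output_path stops being updated.

    Returns the number of restarts performed. The loop ends when the
    command exits on its own or when max_restarts has been reached.
    """
    restarts = 0
    while True:
        started_at = time.time()
        proc = subprocess.Popen(cmd)
        stalled = False
        while proc.poll() is None:
            idle = time.time() - last_activity(output_path, started_at)
            if idle > stall_seconds:
                stalled = True
                proc.terminate()
                try:
                    proc.wait(timeout=30)
                except subprocess.TimeoutExpired:
                    proc.kill()  # force-stop if it ignores terminate
                    proc.wait()
                break
            time.sleep(poll_seconds)
        if not stalled:
            return restarts  # the command finished by itself
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
```

A call such as `monitor(["scrapy", "crawl", "chr"], "urls.csv")` would then keep the crawl going; the spider name `chr` and the file name `urls.csv` are guesses here, not values confirmed by the repository.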

Files in the repository:

.gitattributes
ProcessMonitor.py
README.md
chr.py
items.py
middlewares.py
pipelines.py
settings.py
