pull merge (#1)

* fix test

1.1.1.1 taken by Cloudflare

* Fixed db inconsistency (binux#779)

* Fixed db creation inconsistency in taskdb, projectdb and resultdb

* Fixed typo

* using reserved ip address for testing

rolling out version 0.3.10

* Fix mysql return bytes as field name type (binux#787)

* use pip version of mysql-connector-python for testing

* fix mysql return bytes as field names type

* fix "Unread result found" error being raised

This error is raised by mysql-connector-python with the C extension

* fix test

The pure-Python version raises InterfaceError,
  but the C extension version raises DatabaseError
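
A minimal sketch of the two fixes described above (helper names are hypothetical, not the actual patch): decode bytes field names from `cursor.description`, and have tests accept either exception, since the pure-Python connector raises `InterfaceError` where the C extension raises `DatabaseError` for the same unread-result condition:

```python
import mysql.connector

def field_names(cursor):
    # mysql-connector-python can return column names as bytes; decode
    # them so they can be used as ordinary str dict keys.
    return [
        d[0].decode('utf8') if isinstance(d[0], bytes) else d[0]
        for d in cursor.description
    ]

def drain(cursor):
    # the "Unread result found" condition surfaces as a different
    # exception type depending on the connector build, so accept both
    try:
        cursor.fetchall()
    except (mysql.connector.InterfaceError, mysql.connector.DatabaseError):
        pass
```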

* fix binux#799

* optimise scheduler dynamic select limit and improve task queue (binux#796)

* optimise scheduler select-limit and task queue

* fix test case in python2.6

* fix: time priority queue only compares exetime

* update: add test case for time priority queue

* optimise: add a globally auto-increasing value per task to keep the priority queue in order
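
A hedged sketch of the tie-breaker technique (names are illustrative, not the actual scheduler code): `heapq` needs a total order, and two tasks with equal `exetime` are not themselves comparable, so a globally increasing counter keeps insertion order and prevents the heap from ever comparing two task dicts:

```python
import heapq
import itertools

_counter = itertools.count()  # globally auto-increasing sequence number

class TimeQueue(object):
    """Order tasks by exetime; break ties by insertion order so that
    heapq never tries to compare two task objects directly."""

    def __init__(self):
        self._heap = []

    def put(self, exetime, task):
        heapq.heappush(self._heap, (exetime, next(_counter), task))

    def get(self):
        exetime, _, task = heapq.heappop(self._heap)
        return task
```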

* change async to async_ (binux#803)

* change async to async_

* change async to async_ in tests

* change async_ to async_mode

* modify async to async_mode to support python3.7
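
A minimal illustration of why the rename was forced (the class and signature here are illustrative, not pyspider's exact API): `async` became a reserved keyword in Python 3.7, so any parameter or attribute with that name is now a SyntaxError.

```python
class Fetcher(object):
    # was: def __init__(self, async=True) -- a SyntaxError on Python 3.7+
    def __init__(self, async_mode=True):
        self.async_mode = async_mode
```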

* add python3.7 CI test

* add python3.7 CI test

* add python3.7 CI test

* add python3.7 CI test

* remove python3.7 CI test

* add py3.7-dev CI test

* add support py3.7-dev CI test

* removed 2.6 due to lack of support, changed pip install for 3.5 due to pip versioning

* feature: puppeteer js engine

* feature: add a maximum limit on opened pages, default 5

* fix: python3.5 install lxml error

* add puppeteer fetcher

* update

* fix bugs
1. some "async" args haven't been replaced completely yet
2. delete Python 3.3 from .travis.yml because the current version of lxml does not support Python 3.3

* use suggested python3.7 build

* fix build for 3.3

1. python2.7 image is different when using the build matrix
2. pip install just works nowadays

* sudo not required any more?

* try not to specify a version for apt-get

* fix setup.py test for py3.3

* try manually install

* try again

* fix for 3.7

* try install librt

* try again

* allow fail

* updated requirements.txt to fixed package versions

* port to python 3.6

* upgrade python-six

* updated travis.yml

* fixed "connect to scheduler rpc error: error(111, Connection refused)" error

* fixed phantomjs libssl_conf.so error

* travis test

* another Travis test

* trying to trace "cannot find module express" error in Travis

* using NODE_PATH env var

* moved NODE_PATH assignment after install

* making symlink to node_modules

* travis test

* node modules are currently missing from travis

* added npm install to travis.yml

* fixed travis node dependency issues

* using run_in_thread for scheduler and fetcher dispatch again
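
A sketch of the `run_in_thread` helper pattern (pyspider's actual helper may differ in detail): start a component in a daemon thread and hand the thread back to the caller.

```python
import threading

def run_in_thread(func, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread and return the thread."""
    thread = threading.Thread(target=func, args=args, kwargs=kwargs)
    thread.daemon = True
    thread.start()
    return thread

# e.g. dispatching both components without blocking the main thread:
# threads = [run_in_thread(scheduler.run), run_in_thread(fetcher.run)]
# for t in threads: t.join()
```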

* accommodate changes made in run.py to tests

* changed test_90_docker_scheduler

* added extra asserts to tests

* test

* upgraded sqlAlchemy

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade fix

* sqlalchemy upgrade

* added extra assertions

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* undo previous

* tracing errors

* fix sqlalchemy data encoding

* sqlalchemy changed dict encoding to pure json string
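
A hedged sketch of the change described here (helper names are hypothetical): serialize dict values to a plain JSON string before they reach SQLAlchemy, and decode them on the way back out, instead of relying on driver-level dict encoding.

```python
import json

def encode_field(value):
    # store dicts as plain JSON strings rather than driver-encoded objects
    return json.dumps(value) if isinstance(value, dict) else value

def decode_field(value):
    try:
        return json.loads(value)
    except (TypeError, ValueError):
        return value  # not a JSON string; pass through unchanged
```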

* test_10_save mongodb fix

* undo previous

* tracing test_10_save mongodb bug

* tracing test_10_save mongodb bug

* upgraded pymongo

* mongo tests now passing

* fixed test_a110_one failing with "fetcher() got an unexpected keyword argument xmlrpc"

* upgraded pika

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* fixed typo

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* switching to Pika for Rabbitmq

* skip TestAmqpRabbitMQ

* travis test

* travis build failing with 0 errors and 0 failures, 40 "unexpected successes"

* added updated docker-compose.yaml

* cleanup

* initial couchdb projectdb implementation

* test url parser

* fix couchdb connect url

* fix couchdb connect url

* fix couchdb json encoding

* fix couchdb json encoding

* fix couchdb url encoding

* fix couchdb urls

* fixed couchdb request headers

* travis upgrade couchdb

* travis upgrade couchdb

* travis upgrade couchdb

* travis upgrade couchdb

* travis upgrade couchdb

* fixed "Fields must be an array of strings, not: null" eroor

* fixed responses

* fixed drop database

* tracing insertion issue

* fixed default values

* tracing update bug

* fixed update bug

* fixed drop bug

* changed default fields

* fixed drop bug

* fixed _default_fields usage

* fixed update bug

* fixed update bug

* fixed drop bug

* tracing update bug

* fixed drop bug

* tracing drop bug

* fixed drop bug

* fixed db naming issue

* fixed drop bug

* initial resultdb implementation

* added resultdb tests

* fix resultdb tests

* fix resultdb init

* fix resultdb init

* fix missing class var

* fixed get_docs

* fixed db naming

* fixed db naming

* fixed db naming

* fixed get_docs

* minor fixes

* fixed update_doc

* fixed update_doc

* fixed get_doc

* fixed get_docs

* fixed get_docs

* fixed parse

* fixed get_all_docs

* fixed get_doc

* fixed update_doc

* minor fixes

* fixed select

* initial taskdb implementation

* added debug prints

* added collection_prefix

* minor fixes

* minor fixes

* fixed update

* fixed test_25_get_task

* fixed status_count selector

* fixed update

* tracing test_create_project bug

* fixed collection naming

* Revert "fixed collection naming"

This reverts commit 0d89a0d.

* fixed collection naming

* minor fixes

* minor fixes

* fixed test_z10_drop

* fixed test_50_load_tasks

* fixed get_docs

* fixed get methods

* cleanup

* removed python 3.3 and added 3.7 and 3.8

* added index

* tracing index create bug

* fixed index create bug

* fixed index create bug

* fixed index create bug

* minor test fixes

* added couchdb test run

* added couchdb test run

* full working example

* fixed test setup

* fixed test setup

* updated travis file for couchdb auth

* updated travis file for couchdb auth

* added credentials exception

* fixed credentials

* fixed test auth

* fixed test auth

* tracing auth issue

* tracing auth issue

* fixed test auth issue

* fixed test test_60a_docker_couchdb

* fixed test test_60a_docker_couchdb

* cleanup

* attempting to remove "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* Revert "tracing "unexpected successes""

This reverts commit 829da8c.

* tracing "unexpected successes"

* tracing "unexpected successes" in crawl

* tracing "unexpected successes" in crawl

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* fixed "unexpected successes"

* fixed TestFetcherProcessor

* fixed TestFetcherProcessor

* fixed TestFetcherProcessor

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* removed beanstalkc

* cleanup

* removed 3.8 from travis

* removed python 3.8 from setup.py

* fixed test_60_relist_projects change

* fixed .travis

* added https to couchdb + cleanup + added couchdb to docs

* added extra comment on top of docker-compose example

* fixed docker-compose issue

* improve docker-compose sample

* remove demo link

* fix test break because couchdb failing to start

* try to use non-auth for CouchDB test

* more couchdb_password

* improve couchdb allow empty username password

* drop support for couchdb

Co-authored-by: Roy Binux <root@binux.me>
Co-authored-by: jxltom <jxltom@users.noreply.github.com>
Co-authored-by: binux <roy@binux.me>
Co-authored-by: sdvcrx <memory.silentvoyage@gmail.com>
Co-authored-by: Lucas <comeson@126.com>
Co-authored-by: vibiu <540650312@qq.com>
Co-authored-by: farmercode <wangchangchun120@gmail.com>
Co-authored-by: Phillip <phillip1.peterson@umontana.edu>
Co-authored-by: feiyang <feiyang@ibantang.com>
Co-authored-by: clchen <ccl0326@163.com>
Co-authored-by: v1nc3nt <vinsechsz@gmail.com>
Co-authored-by: Keith Tunstead <tunstek@tcd.ie>
13 people committed Mar 16, 2021
1 parent c8d4558 commit c4482fd
Showing 128 changed files with 4,991 additions and 2,460 deletions.
28 changes: 28 additions & 0 deletions .github/ISSUE_TEMPLATE.md
@@ -0,0 +1,28 @@
<!--
Thanks for using pyspider!
If you need to ask questions in Chinese, please submit them to https://segmentfault.com/t/pyspider
-->

* pyspider version:
* Operating system:
* Start up command:

### Expected behavior

<!-- What do you think should happen? -->

### Actual behavior

<!-- What actually happens? -->

### How to reproduce

<!--
The best chance of getting help is to provide enough information to reproduce the issue you have.
If it's related to API or extraction behavior, please paste your project's script.
If it's related to scheduling of the whole project, please paste a screenshot of the queue status at the top of the dashboard.
-->
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,6 +1,7 @@
*.py[cod]
data/*

.venv
.idea
# C extensions
*.so

36 changes: 24 additions & 12 deletions .travis.yml
@@ -1,29 +1,41 @@
language: python
cache: pip
python:
- "2.6"
- "2.7"
- "3.3"
- "3.4"
- 3.5
- 3.6
- 3.7
#- 3.8
services:
- docker
- mongodb
- rabbitmq
- redis-server
- elasticsearch
- redis
- mysql
# - elasticsearch
- postgresql
addons:
postgresql: "9.4"
postgresql: "9.4"
apt:
packages:
- rabbitmq-server
env:
- IGNORE_COUCHDB=1

before_install:
- sudo apt-get update -qq
- sudo apt-get install -y beanstalkd
- echo "START=yes" | sudo tee -a /etc/default/beanstalkd > /dev/null
- sudo service beanstalkd start
- curl -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.4.0/elasticsearch-2.4.0.deb && sudo dpkg -i --force-confnew elasticsearch-2.4.0.deb && sudo service elasticsearch restart
- npm install express puppeteer
- sudo docker pull scrapinghub/splash
- sudo docker run -d --net=host scrapinghub/splash
before_script:
- psql -c "CREATE DATABASE pyspider_test_taskdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
- psql -c "CREATE DATABASE pyspider_test_projectdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
- psql -c "CREATE DATABASE pyspider_test_resultdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
- sleep 10
install:
- pip install http://cdn.mysql.com/Downloads/Connector-Python/mysql-connector-python-2.0.4.zip#md5=3df394d89300db95163f17c843ef49df
- pip install --allow-all-external -e .[all,test]
- pip install https://github.com/marcus67/easywebdav/archive/master.zip
- sudo apt-get install libgnutls28-dev
- pip install -e .[all,test]
- pip install coveralls
script:
- coverage run setup.py test
35 changes: 25 additions & 10 deletions Dockerfile
@@ -1,16 +1,28 @@
FROM cmfatih/phantomjs
FROM python:3.6
MAINTAINER binux <roy@binux.me>

# install python
RUN apt-get update && \
apt-get install -y python python-dev python-distribute python-pip && \
apt-get install -y libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml python-mysqldb libpq-dev
# install phantomjs
RUN mkdir -p /opt/phantomjs \
&& cd /opt/phantomjs \
&& wget -O phantomjs.tar.bz2 https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 \
&& tar xavf phantomjs.tar.bz2 --strip-components 1 \
&& ln -s /opt/phantomjs/bin/phantomjs /usr/local/bin/phantomjs \
&& rm phantomjs.tar.bz2
# Fix Error: libssl_conf.so: cannot open shared object file: No such file or directory
ENV OPENSSL_CONF=/etc/ssl/

# install nodejs
ENV NODEJS_VERSION=8.15.0 \
PATH=$PATH:/opt/node/bin
WORKDIR "/opt/node"
RUN apt-get -qq update && apt-get -qq install -y curl ca-certificates libx11-xcb1 libxtst6 libnss3 libasound2 libatk-bridge2.0-0 libgtk-3-0 --no-install-recommends && \
curl -sL https://nodejs.org/dist/v${NODEJS_VERSION}/node-v${NODEJS_VERSION}-linux-x64.tar.gz | tar xz --strip-components=1 && \
rm -rf /var/lib/apt/lists/*
RUN npm install puppeteer express

# install requirements
RUN pip install http://cdn.mysql.com/Downloads/Connector-Python/mysql-connector-python-2.0.4.zip#md5=3df394d89300db95163f17c843ef49df
ADD requirements.txt /opt/pyspider/requirements.txt
COPY requirements.txt /opt/pyspider/requirements.txt
RUN pip install -r /opt/pyspider/requirements.txt
RUN pip install -U pip

# add all repo
ADD ./ /opt/pyspider
@@ -19,7 +31,10 @@ ADD ./ /opt/pyspider
WORKDIR /opt/pyspider
RUN pip install -e .[all]

VOLUME ["/opt/pyspider"]
# Create a symbolic link to node_modules
RUN ln -s /opt/node/node_modules ./node_modules

#VOLUME ["/opt/pyspider"]
ENTRYPOINT ["pyspider"]

EXPOSE 5000 23333 24444 25555
EXPOSE 5000 23333 24444 25555 22222
23 changes: 6 additions & 17 deletions README.md
@@ -1,14 +1,14 @@
pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage] [![Try]][Demo]
pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage]
========

A Powerful Spider(Web Crawler) System in Python. **[TRY IT NOW!][Demo]**
A Powerful Spider(Web Crawler) System in Python.

- Write script in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- [MySQL](https://www.mysql.com/), [MongoDB](https://www.mongodb.org/), [Redis](http://redis.io/), [SQLite](https://www.sqlite.org/), [Elasticsearch](https://www.elastic.co/products/elasticsearch); [PostgreSQL](http://www.postgresql.org/) with [SQLAlchemy](http://www.sqlalchemy.org/) as database backend
- [RabbitMQ](http://www.rabbitmq.com/), [Beanstalk](http://kr.github.com/beanstalkd/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- [RabbitMQ](http://www.rabbitmq.com/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- Task priority, retry, periodical, recrawl by age, etc...
- Distributed architecture, Crawl Javascript pages, Python 2&3, etc...
- Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...

Tutorial: [http://docs.pyspider.org/en/latest/tutorial/](http://docs.pyspider.org/en/latest/tutorial/)
Documentation: [http://docs.pyspider.org/](http://docs.pyspider.org/)
@@ -41,15 +41,15 @@ class Handler(BaseHandler):
}
```

[![Demo][Demo Img]][Demo]


Installation
------------

* `pip install pyspider`
* run command `pyspider`, visit [http://localhost:5000/](http://localhost:5000/)

**WARNING:** WebUI is open to the public by default, it can be used to execute any command which may harm your system. Please use it in an internal network or [enable `need-auth` for webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).

Quickstart: [http://docs.pyspider.org/en/latest/Quickstart/](http://docs.pyspider.org/en/latest/Quickstart/)

Contribute
@@ -66,18 +66,9 @@ TODO

### v0.4.0

- [x] local mode, load script from file.
- [x] works as a framework (all components running in one process, no threads)
- [x] redis
- [x] shell mode like `scrapy shell`
- [ ] a visual scraping interface like [portia](https://github.com/scrapinghub/portia)


### more

- [x] edit script with vim via [WebDAV](http://en.wikipedia.org/wiki/WebDAV)


License
-------
Licensed under the Apache License, Version 2.0
@@ -88,7 +79,5 @@ Licensed under the Apache License, Version 2.0
[Coverage Status]: https://img.shields.io/coveralls/binux/pyspider.svg?branch=master&style=flat
[Coverage]: https://coveralls.io/r/binux/pyspider
[Try]: https://img.shields.io/badge/try-pyspider-blue.svg?style=flat
[Demo]: http://demo.pyspider.org/
[Demo Img]: https://github.com/binux/pyspider/blob/master/docs/imgs/demo.png
[Issue]: https://github.com/binux/pyspider/issues
[User Group]: https://groups.google.com/group/pyspider-users
13 changes: 13 additions & 0 deletions config_example.json
@@ -0,0 +1,13 @@
{
"taskdb": "couchdb+taskdb://user:password@couchdb:5984",
"projectdb": "couchdb+projectdb://user:password@couchdb:5984",
"resultdb": "couchdb+resultdb://user:password@couchdb:5984",
"message_queue": "amqp://rabbitmq:5672/%2F",
"webui": {
"username": "username",
"password": "password",
"need-auth": true,
"scheduler-rpc": "http://scheduler:23333",
"fetcher-rpc": "http://fetcher:24444"
}
}
105 changes: 105 additions & 0 deletions docker-compose.yaml
@@ -0,0 +1,105 @@
version: "3.7"

# replace /path/to/dir/ to point to config.json

# The RabbitMQ and CouchDB services can take some time to startup.
# During this time most of the pyspider services will exit and restart.
# Once RabbitMQ and CouchDB are fully up and running everything should run as normal.

services:
rabbitmq:
image: rabbitmq:alpine
container_name: rabbitmq
networks:
- pyspider
command: rabbitmq-server
mysql:
image: mysql:latest
container_name: mysql
volumes:
- /tmp:/var/lib/mysql
environment:
- MYSQL_ALLOW_EMPTY_PASSWORD=yes
networks:
- pyspider
phantomjs:
image: pyspider:latest
container_name: phantomjs
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json phantomjs
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
result:
image: pyspider:latest
container_name: result
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json result_worker
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped # Sometimes we'll get a connection refused error because couchdb has yet to fully start
processor:
container_name: processor
image: pyspider:latest
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json processor
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
fetcher:
image: pyspider:latest
container_name: fetcher
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command : -c config.json fetcher
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
scheduler:
image: pyspider:latest
container_name: scheduler
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json scheduler
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
webui:
image: pyspider:latest
container_name: webui
ports:
- "5050:5000"
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json webui
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped

networks:
pyspider:
external:
name: pyspider
default:
driver: bridge
26 changes: 14 additions & 12 deletions docs/About-Projects.md
@@ -1,24 +1,26 @@
About Projects
==============

In most case, a project is one script you write for one website.
In most cases, a project is one script you write for one website.

* Projects are independent, but you can import another project as module with `from projects import other_project`
* project has 5 status: `TODO`, `STOP`, `CHECKING`, `DEBUG`, `RUNNING`
* Projects are independent, but you can import another project as a module with `from projects import other_project`
* A project has 5 status: `TODO`, `STOP`, `CHECKING`, `DEBUG` and `RUNNING`
- `TODO` - a script is just created to be written
- `STOP` - you can mark a project `STOP` if you want it STOP (= =).
- `CHECKING` - when a running project is modified, to prevent incomplete modification, project status will set as `CHECKING` automatically.
- `DEBUG`/`RUNNING` - these two status have on difference to spider. But it's good to mark as `DEBUG` when it's running the first time then change to `RUNNING` after checked.
- `STOP` - you can mark a project as `STOP` if you want it to STOP (= =).
- `CHECKING` - when a running project is modified, to prevent incomplete modification, project status will be set as `CHECKING` automatically.
- `DEBUG`/`RUNNING` - these two status have no difference to spider. But it's good to mark it as `DEBUG` when it's running the first time then change it to `RUNNING` after being checked.
* The crawl rate is controlled by `rate` and `burst` with [token-bucket](http://en.wikipedia.org/wiki/Token_bucket) algorithm.
- `rate` - how many requests in one seconds
- `burst` - consider this situation, `rate/burst = 0.1/3`, it means spider scrawl 1 page every 10 seconds. All tasks are finished, project is checking last updated items every minute. Assume that 3 new items are found, pyspider will "burst" and crawl 3 tasks without waiting 3*10 seconds. However, the fourth task needs wait 10 seconds.
* to delete a project, set `group` to `delete` and status to `STOP`, wait 24 hours.
- `rate` - how many requests in one second
- `burst` - consider this situation, `rate/burst = 0.1/3`, it means that the spider crawls 1 page every 10 seconds. All tasks are finished, project is checking last updated items every minute. Assume that 3 new items are found, pyspider will "burst" and crawl 3 tasks without waiting 3*10 seconds. However, the fourth task needs to wait 10 seconds.
* To delete a project, set `group` to `delete` and status to `STOP`, wait 24 hours.


`on_finished` callback
--------------------
You can override `on_finished` method in the project, the method would be triggered when the task_queue goes to 0.

Example 1: when you starts a project to crawl a website with 100 pages, the `on_finished` callback will be fired when 100 pages success crawled or failed after retries.
Example 2: A project with `auto_recrawl` tasks will **NEVER** trigger the `on_finished` callback, because time queue will never become 0 when auto_recrawl tasks in it.
Example 3: A project with `@every` decorated method will trigger the `on_finished` callback every time when the new submitted tasks finished.
Example 1: When you start a project to crawl a website with 100 pages, the `on_finished` callback will be fired when 100 pages are successfully crawled or failed after retries.

Example 2: A project with `auto_recrawl` tasks will **NEVER** trigger the `on_finished` callback, because time queue will never become 0 when there are auto_recrawl tasks in it.

Example 3: A project with `@every` decorated method will trigger the `on_finished` callback every time when the newly submitted tasks are finished.
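
The `rate`/`burst` behaviour documented in this file follows the classic token-bucket shape; a minimal sketch (not pyspider's actual implementation):

```python
import time

class TokenBucket(object):
    """Tokens refill at `rate` per second up to `burst`; each crawl
    consumes one token. rate=0.1, burst=3 means one page every 10s,
    with bursts of up to 3 pages when tokens have accumulated."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.last = time.time()

    def consume(self, n=1):
        now = time.time()
        # refill proportionally to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```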
