pull merge (#1)

* fix test

1.1.1.1 taken by Cloudflare

* Fixed db inconsistency (binux#779)

* Fixed db creation inconsistency in taskdb, projectdb and resultdb

* Fixed typo

* using reserved ip address for testing

rolling out version 0.3.10

* Fix mysql return bytes as field name type (binux#787)

* use pip version of mysql-connector-python for testing

* fix mysql return bytes as field names type

* fix "Unread result found" error being raised

This error is raised by mysql-connector-python with the C extension

* fix test

The pure-Python version raises InterfaceError,
  but the C extension version raises DatabaseError
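
A minimal sketch of the two fixes described above (helper names are hypothetical, not the actual patch): decode bytes field names from `cursor.description`, and have tests accept either exception, since the pure-Python connector raises `InterfaceError` where the C extension raises `DatabaseError` for the same unread-result condition:

```python
import mysql.connector

def field_names(cursor):
    # mysql-connector-python can return column names as bytes; decode
    # them so they can be used as ordinary str dict keys.
    return [
        d[0].decode('utf8') if isinstance(d[0], bytes) else d[0]
        for d in cursor.description
    ]

def drain(cursor):
    # the "Unread result found" condition surfaces as a different
    # exception type depending on the connector build, so accept both
    try:
        cursor.fetchall()
    except (mysql.connector.InterfaceError, mysql.connector.DatabaseError):
        pass
```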

* fix binux#799

* optimise scheduler dynamic select limit and improve task queue (binux#796)

* optimise scheduler select-limit and task queue

* fix test case in python2.6

* fix: time priority queue only compares exetime

* update: add test case for time priority queue

* optimise: add a globally auto-increasing value per task to keep the priority queue in order
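
A hedged sketch of the tie-breaker technique (names are illustrative, not the actual scheduler code): `heapq` needs a total order, and two tasks with equal `exetime` are not themselves comparable, so a globally increasing counter keeps insertion order and prevents the heap from ever comparing two task dicts:

```python
import heapq
import itertools

_counter = itertools.count()  # globally auto-increasing sequence number

class TimeQueue(object):
    """Order tasks by exetime; break ties by insertion order so that
    heapq never tries to compare two task objects directly."""

    def __init__(self):
        self._heap = []

    def put(self, exetime, task):
        heapq.heappush(self._heap, (exetime, next(_counter), task))

    def get(self):
        exetime, _, task = heapq.heappop(self._heap)
        return task
```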

* change async to async_ (binux#803)

* change async to async_

* change async to async_ in tests

* change async_ to async_mode

* modify async to async_mode to support python3.7
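
A minimal illustration of why the rename was forced (the class and signature here are illustrative, not pyspider's exact API): `async` became a reserved keyword in Python 3.7, so any parameter or attribute with that name is now a SyntaxError.

```python
class Fetcher(object):
    # was: def __init__(self, async=True) -- a SyntaxError on Python 3.7+
    def __init__(self, async_mode=True):
        self.async_mode = async_mode
```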

* add python3.7 CI test

* add python3.7 CI test

* add python3.7 CI test

* add python3.7 CI test

* remove python3.7 CI test

* add py3.7-dev CI test

* add support py3.7-dev CI test

* removed 2.6 due to lack of support, changed pip install for 3.5 due to pip versioning

* feature: puppeteer js engine

* feature: add a maximum limit on opened pages, default 5

* fix: python3.5 install lxml error

* add puppeteer fetcher

* update

* fix bugs
1. some "async" args haven't been replaced completely yet
2. delete Python 3.3 from .travis.yml because the current version of lxml does not support Python 3.3

* use suggested python3.7 build

* fix build for 3.3

1. python2.7 image is different when using the build matrix
2. pip install just works nowadays

* sudo not required any more?

* try not to specify a version for apt-get

* fix setup.py test for py3.3

* try manually install

* try again

* fix for 3.7

* try install librt

* try again

* allow fail

* updated requirements.txt to fixed package versions

* port to python 3.6

* upgrade python-six

* updated travis.yml

* fixed "connect to scheduler rpc error: error(111, Connection refused)" error

* fixed phantomjs libssl_conf.so error

* travis test

* another Travis test

* trying to trace "cannot find module express" error in Travis

* using NODE_PATH env var

* moved NODE_PATH assignment after install

* making symlink to node_modules

* travis test

* node modules are currently missing from travis

* added npm install to travis.yml

* fixed travis node dependency issues

* using run_in_thread for scheduler and fetcher dispatch again
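
A sketch of the `run_in_thread` helper pattern (pyspider's actual helper may differ in detail): start a component in a daemon thread and hand the thread back to the caller.

```python
import threading

def run_in_thread(func, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread and return the thread."""
    thread = threading.Thread(target=func, args=args, kwargs=kwargs)
    thread.daemon = True
    thread.start()
    return thread

# e.g. dispatching both components without blocking the main thread:
# threads = [run_in_thread(scheduler.run), run_in_thread(fetcher.run)]
# for t in threads: t.join()
```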

* accommodate changes made in run.py to tests

* changed test_90_docker_scheduler

* added extra asserts to tests

* test

* upgraded sqlAlchemy

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade fix

* sqlalchemy upgrade

* added extra assertions

* sqlalchemy upgrade

* sqlalchemy upgrade

* sqlalchemy upgrade

* undo previous

* tracing errors

* fix sqlalchemy data encoding

* sqlalchemy changed dict encoding to pure json string
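
A hedged sketch of the change described here (helper names are hypothetical): serialize dict values to a plain JSON string before they reach SQLAlchemy, and decode them on the way back out, instead of relying on driver-level dict encoding.

```python
import json

def encode_field(value):
    # store dicts as plain JSON strings rather than driver-encoded objects
    return json.dumps(value) if isinstance(value, dict) else value

def decode_field(value):
    try:
        return json.loads(value)
    except (TypeError, ValueError):
        return value  # not a JSON string; pass through unchanged
```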

* test_10_save mongodb fix

* undo previous

* tracing test_10_save mongodb bug

* tracing test_10_save mongodb bug

* upgraded pymongo

* mongo tests now passing

* fixed test_a110_one failing with "fetcher() got an unexpected keyword argument xmlrpc"

* upgraded pika

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* fixed typo

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* tracing RabbitMQ ConnectionRefusedError: [Errno 111] Connection refused

* switching to Pika for Rabbitmq

* skip TestAmqpRabbitMQ

* travis test

* travis build failing with 0 errors and 0 failures, 40 "unexpected successes"

* added updated docker-compose.yaml

* cleanup

* initial couchdb projectdb implementation

* test url parser

* fix couchdb connect url

* fix couchdb connect url

* fix couchdb json encoding

* fix couchdb json encoding

* fix couchdb url encoding

* fix couchdb urls

* fixed couchdb request headers

* travis upgrade couchdb

* travis upgrade couchdb

* travis upgrade couchdb

* travis upgrade couchdb

* travis upgrade couchdb

* fixed "Fields must be an array of strings, not: null" eroor

* fixed responses

* fixed drop database

* tracing insertion issue

* fixed default values

* tracing update bug

* fixed update bug

* fixed drop bug

* changed default fields

* fixed drop bug

* fixed _default_fields usage

* fixed update bug

* fixed update bug

* fixed drop bug

* tracing update bug

* fixed drop bug

* tracing drop bug

* fixed drop bug

* fixed db naming issue

* fixed drop bug

* initial resultdb implementation

* added resultdb tests

* fix resultdb tests

* fix resultdb init

* fix resultdb init

* fix missing class var

* fixed get_docs

* fixed db naming

* fixed db naming

* fixed db naming

* fixed get_docs

* minor fixes

* fixed update_doc

* fixed update_doc

* fixed get_doc

* fixed get_docs

* fixed get_docs

* fixed parse

* fixed get_all_docs

* fixed get_doc

* fixed update_doc

* minor fixes

* fixed select

* initial taskdb implementation

* added debug prints

* added collection_prefix

* minor fixes

* minor fixes

* fixed update

* fixed test_25_get_task

* fixed status_count selector

* fixed update

* tracing test_create_project bug

* fixed collection naming

* Revert "fixed collection naming"

This reverts commit 0d89a0d.

* fixed collection naming

* minor fixes

* minor fixes

* fixed test_z10_drop

* fixed test_50_load_tasks

* fixed get_docs

* fixed get methods

* cleanup

* removed python 3.3 and added 3.7 and 3.8

* added index

* tracing index create bug

* fixed index create bug

* fixed index create bug

* fixed index create bug

* minor test fixes

* added couchdb test run

* added couchdb test run

* full working example

* fixed test setup

* fixed test setup

* updated travis file for couchdb auth

* updated travis file for couchdb auth

* added credentials exception

* fixed credentials

* fixed test auth

* fixed test auth

* tracing auth issue

* tracing auth issue

* fixed test auth issue

* fixed test test_60a_docker_couchdb

* fixed test test_60a_docker_couchdb

* cleanup

* attempting to remove "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* Revert "tracing "unexpected successes""

This reverts commit 829da8c.

* tracing "unexpected successes"

* tracing "unexpected successes" in crawl

* tracing "unexpected successes" in crawl

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* tracing "unexpected successes"

* fixed "unexpected successes"

* fixed TestFetcherProcessor

* fixed TestFetcherProcessor

* fixed TestFetcherProcessor

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* fix BaseHandler

* removed beanstalkc

* cleanup

* removed 3.8 from travis

* removed python 3.8 from setup.py

* fixed test_60_relist_projects change

* fixed .travis

* added https to couchdb + cleanup + added couchdb to docs

* added extra comment on top of docker-compose example

* fixed docker-compose issue

* improve docker-compose sample

* remove demo link

* fix test break because couchdb failing to start

* try to use non-auth for CouchDB test

* more couchdb_password

* improve couchdb allow empty username password

* drop support for couchdb

Co-authored-by: Roy Binux <root@binux.me>
Co-authored-by: jxltom <jxltom@users.noreply.github.com>
Co-authored-by: binux <roy@binux.me>
Co-authored-by: sdvcrx <memory.silentvoyage@gmail.com>
Co-authored-by: Lucas <comeson@126.com>
Co-authored-by: vibiu <540650312@qq.com>
Co-authored-by: farmercode <wangchangchun120@gmail.com>
Co-authored-by: Phillip <phillip1.peterson@umontana.edu>
Co-authored-by: feiyang <feiyang@ibantang.com>
Co-authored-by: clchen <ccl0326@163.com>
Co-authored-by: v1nc3nt <vinsechsz@gmail.com>
Co-authored-by: Keith Tunstead <tunstek@tcd.ie>
13 people committed Mar 16, 2021
1 parent c8d4558 commit c4482fd
Showing 128 changed files with 4,991 additions and 2,460 deletions.
28 changes: 28 additions & 0 deletions .github/ISSUE_TEMPLATE.md
@@ -0,0 +1,28 @@
<!--
Thanks for using pyspider!
If you need to ask questions in Chinese, please submit them to https://segmentfault.com/t/pyspider
-->

* pyspider version:
* Operating system:
* Start up command:

### Expected behavior

<!-- What do you think should happen? -->

### Actual behavior

<!-- What actually happens? -->

### How to reproduce

<!--
The best chance of getting help is to provide enough information to reproduce the issue you have.
If it's related to API or extraction behavior, please paste your project's script.
If it's related to scheduling of the whole project, please paste a screenshot of the queue status at the top of the dashboard.
-->
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,6 +1,7 @@
*.py[cod]
data/*

.venv
.idea
# C extensions
*.so

36 changes: 24 additions & 12 deletions .travis.yml
@@ -1,29 +1,41 @@
language: python
cache: pip
python:
- "2.6"
- "2.7"
- "3.3"
- "3.4"
- 3.5
- 3.6
- 3.7
#- 3.8
services:
- docker
- mongodb
- rabbitmq
- redis-server
- elasticsearch
- redis
- mysql
# - elasticsearch
- postgresql
addons:
postgresql: "9.4"
postgresql: "9.4"
apt:
packages:
- rabbitmq-server
env:
- IGNORE_COUCHDB=1

before_install:
- sudo apt-get update -qq
- sudo apt-get install -y beanstalkd
- echo "START=yes" | sudo tee -a /etc/default/beanstalkd > /dev/null
- sudo service beanstalkd start
- curl -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.4.0/elasticsearch-2.4.0.deb && sudo dpkg -i --force-confnew elasticsearch-2.4.0.deb && sudo service elasticsearch restart
- npm install express puppeteer
- sudo docker pull scrapinghub/splash
- sudo docker run -d --net=host scrapinghub/splash
before_script:
- psql -c "CREATE DATABASE pyspider_test_taskdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
- psql -c "CREATE DATABASE pyspider_test_projectdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
- psql -c "CREATE DATABASE pyspider_test_resultdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
- sleep 10
install:
- pip install http://cdn.mysql.com/Downloads/Connector-Python/mysql-connector-python-2.0.4.zip#md5=3df394d89300db95163f17c843ef49df
- pip install --allow-all-external -e .[all,test]
- pip install https://github.com/marcus67/easywebdav/archive/master.zip
- sudo apt-get install libgnutls28-dev
- pip install -e .[all,test]
- pip install coveralls
script:
- coverage run setup.py test
35 changes: 25 additions & 10 deletions Dockerfile
@@ -1,16 +1,28 @@
FROM cmfatih/phantomjs
FROM python:3.6
MAINTAINER binux <roy@binux.me>

# install python
RUN apt-get update && \
apt-get install -y python python-dev python-distribute python-pip && \
apt-get install -y libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml python-mysqldb libpq-dev
# install phantomjs
RUN mkdir -p /opt/phantomjs \
&& cd /opt/phantomjs \
&& wget -O phantomjs.tar.bz2 https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 \
&& tar xavf phantomjs.tar.bz2 --strip-components 1 \
&& ln -s /opt/phantomjs/bin/phantomjs /usr/local/bin/phantomjs \
&& rm phantomjs.tar.bz2
# Fix Error: libssl_conf.so: cannot open shared object file: No such file or directory
ENV OPENSSL_CONF=/etc/ssl/

# install nodejs
ENV NODEJS_VERSION=8.15.0 \
PATH=$PATH:/opt/node/bin
WORKDIR "/opt/node"
RUN apt-get -qq update && apt-get -qq install -y curl ca-certificates libx11-xcb1 libxtst6 libnss3 libasound2 libatk-bridge2.0-0 libgtk-3-0 --no-install-recommends && \
curl -sL https://nodejs.org/dist/v${NODEJS_VERSION}/node-v${NODEJS_VERSION}-linux-x64.tar.gz | tar xz --strip-components=1 && \
rm -rf /var/lib/apt/lists/*
RUN npm install puppeteer express

# install requirements
RUN pip install http://cdn.mysql.com/Downloads/Connector-Python/mysql-connector-python-2.0.4.zip#md5=3df394d89300db95163f17c843ef49df
ADD requirements.txt /opt/pyspider/requirements.txt
COPY requirements.txt /opt/pyspider/requirements.txt
RUN pip install -r /opt/pyspider/requirements.txt
RUN pip install -U pip

# add all repo
ADD ./ /opt/pyspider
@@ -19,7 +31,10 @@ ADD ./ /opt/pyspider
WORKDIR /opt/pyspider
RUN pip install -e .[all]

VOLUME ["/opt/pyspider"]
# Create a symbolic link to node_modules
RUN ln -s /opt/node/node_modules ./node_modules

#VOLUME ["/opt/pyspider"]
ENTRYPOINT ["pyspider"]

EXPOSE 5000 23333 24444 25555
EXPOSE 5000 23333 24444 25555 22222
23 changes: 6 additions & 17 deletions README.md
@@ -1,14 +1,14 @@
pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage] [![Try]][Demo]
pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage]
========

A Powerful Spider(Web Crawler) System in Python. **[TRY IT NOW!][Demo]**
A Powerful Spider(Web Crawler) System in Python.

- Write script in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- [MySQL](https://www.mysql.com/), [MongoDB](https://www.mongodb.org/), [Redis](http://redis.io/), [SQLite](https://www.sqlite.org/), [Elasticsearch](https://www.elastic.co/products/elasticsearch); [PostgreSQL](http://www.postgresql.org/) with [SQLAlchemy](http://www.sqlalchemy.org/) as database backend
- [RabbitMQ](http://www.rabbitmq.com/), [Beanstalk](http://kr.github.com/beanstalkd/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- [RabbitMQ](http://www.rabbitmq.com/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- Task priority, retry, periodical, recrawl by age, etc...
- Distributed architecture, Crawl Javascript pages, Python 2&3, etc...
- Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...

Tutorial: [http://docs.pyspider.org/en/latest/tutorial/](http://docs.pyspider.org/en/latest/tutorial/)
Documentation: [http://docs.pyspider.org/](http://docs.pyspider.org/)
@@ -41,15 +41,15 @@ class Handler(BaseHandler):
}
```

[![Demo][Demo Img]][Demo]


Installation
------------

* `pip install pyspider`
* run command `pyspider`, visit [http://localhost:5000/](http://localhost:5000/)

**WARNING:** WebUI is open to the public by default, it can be used to execute any command which may harm your system. Please use it in an internal network or [enable `need-auth` for webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).

Quickstart: [http://docs.pyspider.org/en/latest/Quickstart/](http://docs.pyspider.org/en/latest/Quickstart/)

Contribute
@@ -66,18 +66,9 @@ TODO

### v0.4.0

- [x] local mode, load script from file.
- [x] works as a framework (all components running in one process, no threads)
- [x] redis
- [x] shell mode like `scrapy shell`
- [ ] a visual scraping interface like [portia](https://github.com/scrapinghub/portia)


### more

- [x] edit script with vim via [WebDAV](http://en.wikipedia.org/wiki/WebDAV)


License
-------
Licensed under the Apache License, Version 2.0
@@ -88,7 +79,5 @@ Licensed under the Apache License, Version 2.0
[Coverage Status]: https://img.shields.io/coveralls/binux/pyspider.svg?branch=master&style=flat
[Coverage]: https://coveralls.io/r/binux/pyspider
[Try]: https://img.shields.io/badge/try-pyspider-blue.svg?style=flat
[Demo]: http://demo.pyspider.org/
[Demo Img]: https://github.com/binux/pyspider/blob/master/docs/imgs/demo.png
[Issue]: https://github.com/binux/pyspider/issues
[User Group]: https://groups.google.com/group/pyspider-users
13 changes: 13 additions & 0 deletions config_example.json
@@ -0,0 +1,13 @@
{
"taskdb": "couchdb+taskdb://user:password@couchdb:5984",
"projectdb": "couchdb+projectdb://user:password@couchdb:5984",
"resultdb": "couchdb+resultdb://user:password@couchdb:5984",
"message_queue": "amqp://rabbitmq:5672/%2F",
"webui": {
"username": "username",
"password": "password",
"need-auth": true,
"scheduler-rpc": "http://scheduler:23333",
"fetcher-rpc": "http://fetcher:24444"
}
}
105 changes: 105 additions & 0 deletions docker-compose.yaml
@@ -0,0 +1,105 @@
version: "3.7"

# replace /path/to/dir/ to point to config.json

# The RabbitMQ and CouchDB services can take some time to startup.
# During this time most of the pyspider services will exit and restart.
# Once RabbitMQ and CouchDB are fully up and running everything should run as normal.

services:
rabbitmq:
image: rabbitmq:alpine
container_name: rabbitmq
networks:
- pyspider
command: rabbitmq-server
mysql:
image: mysql:latest
container_name: mysql
volumes:
- /tmp:/var/lib/mysql
environment:
- MYSQL_ALLOW_EMPTY_PASSWORD=yes
networks:
- pyspider
phantomjs:
image: pyspider:latest
container_name: phantomjs
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json phantomjs
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
result:
image: pyspider:latest
container_name: result
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json result_worker
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped # Sometimes we'll get a connection refused error because couchdb has yet to fully start
processor:
container_name: processor
image: pyspider:latest
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json processor
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
fetcher:
image: pyspider:latest
container_name: fetcher
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command : -c config.json fetcher
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
scheduler:
image: pyspider:latest
container_name: scheduler
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json scheduler
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped
webui:
image: pyspider:latest
container_name: webui
ports:
- "5050:5000"
networks:
- pyspider
volumes:
- ./config_example.json:/opt/pyspider/config.json
command: -c config.json webui
depends_on:
- couchdb
- rabbitmq
restart: unless-stopped

networks:
pyspider:
external:
name: pyspider
default:
driver: bridge
26 changes: 14 additions & 12 deletions docs/About-Projects.md
@@ -1,24 +1,26 @@
About Projects
==============

In most case, a project is one script you write for one website.
In most cases, a project is one script you write for one website.

* Projects are independent, but you can import another project as module with `from projects import other_project`
* project has 5 status: `TODO`, `STOP`, `CHECKING`, `DEBUG`, `RUNNING`
* Projects are independent, but you can import another project as a module with `from projects import other_project`
* A project has 5 status: `TODO`, `STOP`, `CHECKING`, `DEBUG` and `RUNNING`
- `TODO` - a script is just created to be written
- `STOP` - you can mark a project `STOP` if you want it STOP (= =).
- `CHECKING` - when a running project is modified, to prevent incomplete modification, project status will set as `CHECKING` automatically.
- `DEBUG`/`RUNNING` - these two status have on difference to spider. But it's good to mark as `DEBUG` when it's running the first time then change to `RUNNING` after checked.
- `STOP` - you can mark a project as `STOP` if you want it to STOP (= =).
- `CHECKING` - when a running project is modified, to prevent incomplete modification, project status will be set as `CHECKING` automatically.
- `DEBUG`/`RUNNING` - these two status have no difference to spider. But it's good to mark it as `DEBUG` when it's running the first time then change it to `RUNNING` after being checked.
* The crawl rate is controlled by `rate` and `burst` with [token-bucket](http://en.wikipedia.org/wiki/Token_bucket) algorithm.
- `rate` - how many requests in one seconds
- `burst` - consider this situation, `rate/burst = 0.1/3`, it means spider scrawl 1 page every 10 seconds. All tasks are finished, project is checking last updated items every minute. Assume that 3 new items are found, pyspider will "burst" and crawl 3 tasks without waiting 3*10 seconds. However, the fourth task needs wait 10 seconds.
* to delete a project, set `group` to `delete` and status to `STOP`, wait 24 hours.
- `rate` - how many requests in one second
- `burst` - consider this situation, `rate/burst = 0.1/3`, it means that the spider crawls 1 page every 10 seconds. All tasks are finished, project is checking last updated items every minute. Assume that 3 new items are found, pyspider will "burst" and crawl 3 tasks without waiting 3*10 seconds. However, the fourth task needs to wait 10 seconds.
* To delete a project, set `group` to `delete` and status to `STOP`, wait 24 hours.


`on_finished` callback
--------------------
You can override `on_finished` method in the project, the method would be triggered when the task_queue goes to 0.

Example 1: when you starts a project to crawl a website with 100 pages, the `on_finished` callback will be fired when 100 pages success crawled or failed after retries.
Example 2: A project with `auto_recrawl` tasks will **NEVER** trigger the `on_finished` callback, because time queue will never become 0 when auto_recrawl tasks in it.
Example 3: A project with `@every` decorated method will trigger the `on_finished` callback every time when the new submitted tasks finished.
Example 1: When you start a project to crawl a website with 100 pages, the `on_finished` callback will be fired when 100 pages are successfully crawled or failed after retries.

Example 2: A project with `auto_recrawl` tasks will **NEVER** trigger the `on_finished` callback, because time queue will never become 0 when there are auto_recrawl tasks in it.

Example 3: A project with `@every` decorated method will trigger the `on_finished` callback every time when the newly submitted tasks are finished.
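
The `rate`/`burst` behaviour documented in this file follows the classic token-bucket shape; a minimal sketch (not pyspider's actual implementation):

```python
import time

class TokenBucket(object):
    """Tokens refill at `rate` per second up to `burst`; each crawl
    consumes one token. rate=0.1, burst=3 means one page every 10s,
    with bursts of up to 3 pages when tokens have accumulated."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.last = time.time()

    def consume(self, n=1):
        now = time.time()
        # refill proportionally to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```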
