Spider Works in Terminal Not in Gerapy #255

Open
wmullaney opened this issue Dec 15, 2022 · 0 comments
wmullaney commented Dec 15, 2022

Before I start, I just want to say that you have all done a great job developing this project. I love Gerapy, and I will probably start contributing to it. I will try to document this as well as I can so it can be helpful to others.

Describe the bug
I have a Scrapy project which runs perfectly fine in the terminal using the following command:

scrapy crawl examplespider

However, when I schedule it as a task and run it on my local scrapyd client, the spider starts but immediately closes without doing anything, and it throws no errors. I think it's a config file issue. When I view the results of the job, it shows the following:

y.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.002359,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 63709184,
 'memusage/startup': 63709184,
 'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)}
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)

The log file shows the following:

/home/ubuntu/env/scrape/bin/logs/examplescraper/examplespider

2022-12-15 07:03:21 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: examplescraper)
2022-12-15 07:03:21 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (default, Nov 14 2022, 12:59:47) - [GCC 9.4.0], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
2022-12-15 07:03:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'examplescraper', 
 'DOWNLOAD_DELAY': 0.1, 
 'LOG_FILE': 'logs/examplescraper/examplespider/8d623d447c4611edad0641137877ddff.log', 
 'NEWSPIDER_MODULE': 'examplespider.spiders', 
 'SPIDER_MODULES': ['examplespider.spiders'], 
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}

2022-12-15 07:03:21 [py.warnings] WARNING: /home/ubuntu/env/scrape/lib/python3.8/site-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)
    
2022-12-15 07:03:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet Password: b11a24faee23f82c
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage', 
 'scrapy.extensions.logstats.LogStats']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
 'scrapy.spidermiddlewares.referer.RefererMiddleware', 
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

{'elapsed_time_seconds': 0.002359, 
 'finish_reason': 'finished', 
 'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439), 
 'log_count/DEBUG': 1, 
 'log_count/INFO': 10, 
 'log_count/WARNING': 1, 
 'memusage/max': 63709184, 
 'memusage/startup': 63709184, 
 'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)
}
 2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)

/home/ubuntu/gerapy/logs

ubuntu@ip-172-26-13-235:~/gerapy/logs$ cat 20221215065310.log 
 INFO - 2022-12-15 14:53:18,043 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 105 - scheduler - successfully synced task with jobs with force
 INFO - 2022-12-15 14:54:15,011 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client LOCAL, project examplescraper, spider examplespider
 ubuntu@ip-172-26-13-235:~/gerapy/logs$ 
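To help narrow down whether this is a Gerapy issue or a scrapyd/deployment issue, the same job can be scheduled directly against the local scrapyd HTTP API, bypassing Gerapy entirely (a minimal sketch, assuming scrapyd is listening on the default port 6800 and the project was deployed under the name examplescraper):

      curl http://127.0.0.1:6800/schedule.json -d project=examplescraper -d spider=examplespider

If a job started this way also opens and closes immediately, the problem is likely in the deployed project rather than in Gerapy's scheduler; if it runs normally, the issue is more likely on the Gerapy side.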

To Reproduce
Steps to reproduce the behavior:

  1. Launch an AWS Ubuntu 20.04 instance.
  2. Use a python3 virtual environment and follow the installation instructions.
  3. Create a systemd service for scrapyd (gerapy follows in step 4) by doing the following:
    cd /lib/systemd/system
    sudo nano scrapyd.service

Paste the following:

     [Unit]
     Description=Scrapyd service
     After=network.target

     [Service]
     User=ubuntu
     Group=ubuntu
     WorkingDirectory=/home/ubuntu/env/scrape/bin
     ExecStart=/home/ubuntu/env/scrape/bin/scrapyd

     [Install]
     WantedBy=multi-user.target
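
If the unit file is created or edited after systemd has already started, systemd has to re-read its unit files before the new service can be enabled (standard systemd behavior, not specific to scrapyd):

      sudo systemctl daemon-reload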

Issue the following commands:

      sudo systemctl enable scrapyd.service
      sudo systemctl start scrapyd.service
      sudo systemctl status scrapyd.service

It should say: active (running)
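To confirm that scrapyd is not only running but also answering requests, its status endpoint can be queried (assuming the default bind address 127.0.0.1 and port 6800):

      curl http://127.0.0.1:6800/daemonstatus.json

It should return a small JSON object with "status": "ok" and counts of pending, running, and finished jobs.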
  4. Create a script to run gerapy as a systemd service:

     cd ~/virtualenv/exampleproject/bin/
     nano runserver-gerapy.sh

Paste the following:

     #!/bin/bash
     cd /home/ubuntu/virtualenv
     source exampleproject/bin/activate
     cd /home/ubuntu/gerapy
     gerapy runserver 0.0.0.0:8000

Give this file execute permissions:

     sudo chmod +x runserver-gerapy.sh

Navigate back to systemd and create a service to run runserver-gerapy.sh:

     cd /lib/systemd/system
     sudo nano gerapy-web.service

Paste the following:

     [Unit]
     Description=Gerapy Webserver Service
     After=network.target

     [Service]
     User=ubuntu
     Group=ubuntu
     WorkingDirectory=/home/ubuntu/virtualenv/exampleproject/bin
     ExecStart=/bin/bash /home/ubuntu/virtualenv/exampleproject/bin/runserver-gerapy.sh

     [Install]
     WantedBy=multi-user.target

Again issue the following:

     sudo systemctl enable gerapy-web.service
     sudo systemctl start gerapy-web.service
     sudo systemctl status gerapy-web.service

Look for active (running) and navigate to http://your.pub.ip.add:8000 or http://localhost:8000 or http://127.0.0.1:8000 to verify that it is running. Reboot the instance to verify that the services are running on system startup.
  5. Log in and create a client for the local scrapyd service. Use IP 127.0.0.1 and Port 6800. No Auth. Save it as "Local" or "Scrapyd".
  6. Create a project. Select Clone. For testing I used the following GitHub Scrapy project: https://github.com/eneiromatos/NebulaEmailScraper (actually a pretty nice starter project). Save the project. Build the project. Deploy the project. (If you get an error when deploying, make sure you are running in the virtual env; you might need to reboot.)
  7. Create a task. Make sure the project name and spider name match what is in the scrapy.cfg and examplespider.py files, and save the task. Schedule the task. Run the task.
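
After running the task, the state of the job on the scrapyd side can also be inspected directly (a sketch, assuming the project was deployed under the name examplescraper):

      curl "http://127.0.0.1:6800/listjobs.json?project=examplescraper"

The response lists pending, running, and finished jobs, including each job's id and start/end times, which should match the log shown above.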

Traceback
See logs above ^^^

Expected behavior
It should run for at least 5 minutes and output to a file called emails.json in the project root folder (the folder with the scrapy.cfg file).
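
For comparison, this is how the same output can be produced from the terminal. The -O option (a standard Scrapy CLI flag that overwrites the output file) is only illustrative here, since the NebulaEmailScraper project may already configure its own feed export that writes emails.json relative to the working directory:

      cd /path/to/examplescraper   # the folder containing scrapy.cfg
      scrapy crawl examplespider -O emails.json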

Screenshots
I can upload screenshots if requested.

Environment (please complete the following information):

  • OS: AWS Ubuntu 20.04
  • Browser: Firefox
  • Python Version: 3.8
  • Gerapy Version: 0.9.11 (latest)

