Spider Works in Terminal Not in Gerapy #255

Open
wmullaney opened this issue Dec 15, 2022 · 0 comments
wmullaney commented Dec 15, 2022

Before I start, I just want to say that you have all done a great job developing this project. I love Gerapy, and I will probably start contributing to it. I will try to document this as well as I can so it can be helpful to others.

Describe the bug
I have a Scrapy project which runs perfectly fine in the terminal using the following command:

scrapy crawl examplespider

However, when I schedule it as a task and run it on my local scrapyd client, the spider starts but immediately closes without doing anything, and it throws no errors. I think it's a config file issue. When I view the results of the job, it shows the following:

y.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.002359,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 63709184,
 'memusage/startup': 63709184,
 'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)}
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)

The log file shows the following:

/home/ubuntu/env/scrape/bin/logs/examplescraper/examplespider

2022-12-15 07:03:21 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: examplescraper)
2022-12-15 07:03:21 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (default, Nov 14 2022, 12:59:47) - [GCC 9.4.0], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
2022-12-15 07:03:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'examplescraper', 
 'DOWNLOAD_DELAY': 0.1, 
 'LOG_FILE': 'logs/examplescraper/examplespider/8d623d447c4611edad0641137877ddff.log', 
 'NEWSPIDER_MODULE': 'examplespider.spiders', 
 'SPIDER_MODULES': ['examplespider.spiders'], 
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}

2022-12-15 07:03:21 [py.warnings] WARNING: /home/ubuntu/env/scrape/lib/python3.8/site-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)
    
2022-12-15 07:03:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet Password: b11a24faee23f82c
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage', 
 'scrapy.extensions.logstats.LogStats']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
 'scrapy.spidermiddlewares.referer.RefererMiddleware', 
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

{'elapsed_time_seconds': 0.002359, 
 'finish_reason': 'finished', 
 'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439), 
 'log_count/DEBUG': 1, 
 'log_count/INFO': 10, 
 'log_count/WARNING': 1, 
 'memusage/max': 63709184, 
 'memusage/startup': 63709184, 
 'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)
}
 2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)

/home/ubuntu/gerapy/logs

ubuntu@ip-172-26-13-235:~/gerapy/logs$ cat 20221215065310.log 
 INFO - 2022-12-15 14:53:18,043 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 105 - scheduler - successfully synced task with jobs with force
 INFO - 2022-12-15 14:54:15,011 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client LOCAL, project examplescraper, spider examplespider
 ubuntu@ip-172-26-13-235:~/gerapy/logs$ 
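To help narrow down whether this is a Gerapy issue or a scrapyd/deployment issue, the same job can be scheduled directly against the local scrapyd HTTP API, bypassing Gerapy entirely (a minimal sketch, assuming scrapyd is listening on the default port 6800 and the project was deployed under the name examplescraper):

      curl http://127.0.0.1:6800/schedule.json -d project=examplescraper -d spider=examplespider

If a job started this way also opens and closes immediately, the problem is likely in the deployed project rather than in Gerapy's scheduler; if it runs normally, the issue is more likely on the Gerapy side.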

To Reproduce
Steps to reproduce the behavior:

  1. Launch an AWS Ubuntu 20.04 instance.
  2. Use a python3 virtual environment and follow the installation instructions.
  3. Create a systemd service for scrapyd (gerapy follows in step 4) by doing the following:
    cd /lib/systemd/system
    sudo nano scrapyd.service

Paste the following:

     [Unit]
     Description=Scrapyd service
     After=network.target

     [Service]
     User=ubuntu
     Group=ubuntu
     WorkingDirectory=/home/ubuntu/env/scrape/bin
     ExecStart=/home/ubuntu/env/scrape/bin/scrapyd

     [Install]
     WantedBy=multi-user.target
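
If the unit file is created or edited after systemd has already started, systemd has to re-read its unit files before the new service can be enabled (standard systemd behavior, not specific to scrapyd):

      sudo systemctl daemon-reload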

Issue the following commands:

      sudo systemctl enable scrapyd.service
      sudo systemctl start scrapyd.service
      sudo systemctl status scrapyd.service

It should say: active (running)
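To confirm that scrapyd is not only running but also answering requests, its status endpoint can be queried (assuming the default bind address 127.0.0.1 and port 6800):

      curl http://127.0.0.1:6800/daemonstatus.json

It should return a small JSON object with "status": "ok" and counts of pending, running, and finished jobs.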
  4. Create a script to run gerapy as a systemd service:

     cd ~/virtualenv/exampleproject/bin/
     nano runserver-gerapy.sh

Paste the following:

     #!/bin/bash
     cd /home/ubuntu/virtualenv
     source exampleproject/bin/activate
     cd /home/ubuntu/gerapy
     gerapy runserver 0.0.0.0:8000

Give this file execute permissions:

     sudo chmod +x runserver-gerapy.sh

Navigate back to systemd and create a service to run runserver-gerapy.sh:

     cd /lib/systemd/system
     sudo nano gerapy-web.service

Paste the following:

     [Unit]
     Description=Gerapy Webserver Service
     After=network.target

     [Service]
     User=ubuntu
     Group=ubuntu
     WorkingDirectory=/home/ubuntu/virtualenv/exampleproject/bin
     ExecStart=/bin/bash /home/ubuntu/virtualenv/exampleproject/bin/runserver-gerapy.sh

     [Install]
     WantedBy=multi-user.target

Again issue the following:

     sudo systemctl enable gerapy-web.service
     sudo systemctl start gerapy-web.service
     sudo systemctl status gerapy-web.service

Look for active (running) and navigate to http://your.pub.ip.add:8000 or http://localhost:8000 or http://127.0.0.1:8000 to verify that it is running. Reboot the instance to verify that the services are running on system startup.
  5. Log in and create a client for the local scrapyd service. Use IP 127.0.0.1 and Port 6800. No Auth. Save it as "Local" or "Scrapyd".
  6. Create a project. Select Clone. For testing I used the following GitHub Scrapy project: https://github.com/eneiromatos/NebulaEmailScraper (actually a pretty nice starter project). Save the project. Build the project. Deploy the project. (If you get an error when deploying, make sure you are running in the virtual env; you might need to reboot.)
  7. Create a task. Make sure the project name and spider name match what is in the scrapy.cfg and examplespider.py files, and save the task. Schedule the task. Run the task.
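
After running the task, the state of the job on the scrapyd side can also be inspected directly (a sketch, assuming the project was deployed under the name examplescraper):

      curl "http://127.0.0.1:6800/listjobs.json?project=examplescraper"

The response lists pending, running, and finished jobs, including each job's id and start/end times, which should match the log shown above.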

Traceback
See logs above ^^^

Expected behavior
It should run for at least 5 minutes and output to a file called emails.json in the project root folder (the folder with the scrapy.cfg file).
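
For comparison, this is how the same output can be produced from the terminal. The -O option (a standard Scrapy CLI flag that overwrites the output file) is only illustrative here, since the NebulaEmailScraper project may already configure its own feed export that writes emails.json relative to the working directory:

      cd /path/to/examplescraper   # the folder containing scrapy.cfg
      scrapy crawl examplespider -O emails.json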

Screenshots
I can upload screenshots if requested.

Environment (please complete the following information):

  • OS: AWS Ubuntu 20.04
  • Browser: Firefox
  • Python Version: 3.8
  • Gerapy Version: 0.9.11 (latest)

