
Keyword-search crawling fails #196

Closed
xwt0016 opened this issue Mar 7, 2020 · 8 comments

@xwt0016 commented Mar 7, 2020

My IP was banned by Weibo, so I added an IP proxy. `login` ran successfully, but after running `python first_task_execution/search` the output is as follows, and no data shows up in weibo_data at all. I have already set `need_proxy=True` for `get_page` in page_get/basic.py.
[2020-03-07 21:34:12,974: INFO/MainProcess] Received task: tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363]
[2020-03-07 21:34:12,976: INFO/ForkPoolWorker-1] We are searching keyword "武汉红十字会"
[2020-03-07 21:34:12,979: INFO/ForkPoolWorker-1] the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
[2020-03-07 21:37:08,589: WARNING/ForkPoolWorker-1] Excepitons are raised when crawling http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1.Here are details:HTTPConnectionPool(host='183.164.228.73', port=49691): Max retries exceeded with url: http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd699739ba8>: Failed to establish a new connection: [Errno 110] Connection timed out',)))
[2020-03-07 21:37:08,590: ERROR/ForkPoolWorker-1] failed to crawl http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1,here are details:an integer is required (got type str), stack is File "/home/xwt/Desktop/weibospider-temp_verification/decorators/decorators.py", line 17, in time_limit
return func(*args, **kargs)

[2020-03-07 21:37:08,591: WARNING/ForkPoolWorker-1] No search result for keyword 武汉红十字会, the source page is
[2020-03-07 21:37:08,592: INFO/ForkPoolWorker-1] Task tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363] succeeded in 175.61601991499992s: None

Is "Max retries exceeded with url" because the proxy IPs expire too quickly?
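The ProxyError in the log can be reproduced outside the crawler to tell a dead proxy from a Weibo-side block. A minimal sketch, assuming only the requests library; the commented-out proxy address and URL are placeholders, not values taken from the project:

```python
import requests

def check_proxy(proxy, url, timeout=10):
    """Return True if `proxy` can complete a GET to `url` within `timeout` seconds."""
    try:
        resp = requests.get(url,
                            proxies={"http": proxy, "https": proxy},
                            timeout=timeout)
        return resp.status_code == 200
    except requests.exceptions.RequestException:
        # Covers ProxyError / ConnectTimeout, the failure mode in the log above.
        return False

# Placeholder values; substitute an address returned by your proxy API:
# check_proxy("http://183.164.228.73:49691", "http://s.weibo.com")
```

Running this against each address your proxy API returns, right before handing it to the crawler, filters out proxies that have already expired.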

@thekingofcity (Member)

ProxyError('Cannot connect to proxy.',

Check your proxy setup.

@xwt0016 (Author) commented Mar 7, 2020

Got it, thanks. One more question: can the worker not be started as the root user?

@thekingofcity (Member)

It can; Celery will just print a warning that running as root is not recommended.

@xwt0016 (Author) commented Mar 8, 2020

The proxies can be fetched, and using them for login works fine, so why do proxy errors only show up during search? If I turn the proxy off in page_get I also get nothing; it goes straight to "has been crawled". In theory, even without logging in I should be able to get the first page of results.

2020-03-08 15:07:05 - other - INFO - Login successful! The login account is 17507424089
2020-03-08 15:07:16 - other - INFO - Login successful! The login account is 18574774032
2020-03-08 15:07:42 - crawler - INFO - the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&scope=ori&suball=1&page=1
2020-03-08 15:07:56 - crawler - INFO - keyword 武汉红十字会 has been crawled in this turn
2020-03-08 15:08:16 - other - INFO - Login successful! The login account is qthku6x0@duoduo.cafe
2020-03-08 15:08:37 - other - INFO - Login successful! The login account is tm5vdlac@anjing.cool
2020-03-08 15:08:55 - other - INFO - Login successful! The login account is 73iadkvm@duoduo.cafe

@OneCodeMonkey (Member)

Check whether your accounts can log in normally by trying them manually on your own machine.

@xwt0016 (Author) commented Mar 9, 2020

They can log in normally. The earlier batch I bought required phone verification, so I specifically bought a new batch. I'm wondering whether the cookie for Weibo search is different from the regular Weibo cookie. I was on the temp_verification branch before; today I couldn't log in to Chaojiying (超级鹰), but Yundama (云打码) works again, so I tried version 1.7.2 and got the following errors:
[2020-03-09 09:41:32,622: INFO/ForkPoolWorker-1] Login successful! The login account is 17680282715
[2020-03-09 09:41:32,897: INFO/ForkPoolWorker-1] the crawling url is https://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
[2020-03-09 09:41:33,013: WARNING/ForkPoolWorker-1] /home/xwt/miniconda3/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
[2020-03-09 09:41:36,520: INFO/MainProcess] Received task: tasks.login.login_task[b8fe2c55-0064-4305-bd92-fcd086c8f2a8]
[2020-03-09 09:41:40,106: INFO/ForkPoolWorker-2] Login successful! The login account is qthku6x0@duoduo.cafe
[2020-03-09 09:41:48,453: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,454: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,456: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,458: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,460: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,461: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,461: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'get'
[2020-03-09 09:41:48,463: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,463: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'get'
[2020-03-09 09:41:48,464: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,465: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,466: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,467: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,469: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,469: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,471: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,474: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,475: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,476: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,477: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,477: INFO/ForkPoolWorker-1] keyword 武汉红十字会 has been crawled in this turn
2020-03-09 09:41:54 - other - INFO - Login successful! The login account is yax9gheb@anjing.cool
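The repeated `'NoneType' object has no attribute 'find'` errors are the typical symptom of parsing a page that lacks the expected markup (an anti-crawler or login page served instead of real search results). A defensive-parsing sketch, assuming BeautifulSoup; the `div.card` / `p.txt` selectors are illustrative guesses, not the project's actual ones:

```python
from bs4 import BeautifulSoup

def extract_card_texts(html):
    """Extract the text of each result card, skipping malformed cards
    instead of calling .find() on None and crashing."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for card in soup.select("div.card"):      # hypothetical result-card selector
        txt = card.find("p", class_="txt")    # hypothetical text node
        if txt is None:
            continue                          # markup differs: skip, don't crash
        texts.append(txt.get_text(strip=True))
    return texts
```

If a function like this returns an empty list for a page that visibly has results in a browser, the account or IP is being served a different page than the browser sees.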

@OneCodeMonkey (Member)

I've run into this before. First, if trying the account manually shows it needs phone-number unblocking, then even a successful login cannot fetch the search-page content. If the account is fine, needs no phone unblocking, and login succeeds but you still can't get the search page, then the IP is most likely restricted. Both cases occur.
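An account-level block and an IP-level restriction can often be told apart programmatically after fetching the search page with the account's cookie. A rough heuristic sketch (pure assumption, not the project's logic): an account block usually redirects to the passport login portal, while an IP restriction tends to answer with a non-200 status:

```python
def classify_search_failure(status_code, final_url):
    """Classify a failed search-page fetch: 'account' for a redirect to the
    login/verification portal, 'ip' for a blocked status code, else 'ok'."""
    if "passport.weibo.com" in final_url:
        return "account"  # pushed to login/phone verification: account block
    if status_code != 200:
        return "ip"       # e.g. 403/418-style answers: likely IP restriction
    return "ok"
```

Feed it `resp.status_code` and `resp.url` from a requests call made with `allow_redirects=True`.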

@xwt0016 (Author) commented Mar 9, 2020

I just switched to a proxy API rate-limited to 5 requests per second, crawled 30 items, and then hit proxy errors again, so I'm giving up on search. What I mainly want are the reposts and comments of the matching weibos, so I crawled the search results another way, imported them into weibo_data, and then crawled the comments and reposts, and that works. Thanks for the replies, everyone.

xwt0016 closed this as completed Mar 9, 2020