
Keyword-search crawling fails #196

Closed
xwt0016 opened this issue Mar 7, 2020 · 8 comments

@xwt0016 commented Mar 7, 2020

My IP was banned by Weibo, so I added an IP proxy. `login` ran successfully, but after running `python first_task_execution/search` the output is as follows, and no data shows up in weibo_data at all. I have already set `need_proxy=True` for `get_page` in page_get/basic.py.
[2020-03-07 21:34:12,974: INFO/MainProcess] Received task: tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363]
[2020-03-07 21:34:12,976: INFO/ForkPoolWorker-1] We are searching keyword "武汉红十字会"
[2020-03-07 21:34:12,979: INFO/ForkPoolWorker-1] the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
[2020-03-07 21:37:08,589: WARNING/ForkPoolWorker-1] Excepitons are raised when crawling http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1.Here are details:HTTPConnectionPool(host='183.164.228.73', port=49691): Max retries exceeded with url: http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd699739ba8>: Failed to establish a new connection: [Errno 110] Connection timed out',)))
[2020-03-07 21:37:08,590: ERROR/ForkPoolWorker-1] failed to crawl http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1,here are details:an integer is required (got type str), stack is File "/home/xwt/Desktop/weibospider-temp_verification/decorators/decorators.py", line 17, in time_limit
return func(*args, **kargs)

[2020-03-07 21:37:08,591: WARNING/ForkPoolWorker-1] No search result for keyword 武汉红十字会, the source page is
[2020-03-07 21:37:08,592: INFO/ForkPoolWorker-1] Task tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363] succeeded in 175.61601991499992s: None

Is "Max retries exceeded with url" because the proxy IPs expire too quickly?
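The ProxyError in the log can be reproduced outside the crawler to tell a dead proxy from a Weibo-side block. A minimal sketch, assuming only the requests library; the commented-out proxy address and URL are placeholders, not values taken from the project:

```python
import requests

def check_proxy(proxy, url, timeout=10):
    """Return True if `proxy` can complete a GET to `url` within `timeout` seconds."""
    try:
        resp = requests.get(url,
                            proxies={"http": proxy, "https": proxy},
                            timeout=timeout)
        return resp.status_code == 200
    except requests.exceptions.RequestException:
        # Covers ProxyError / ConnectTimeout, the failure mode in the log above.
        return False

# Placeholder values; substitute an address returned by your proxy API:
# check_proxy("http://183.164.228.73:49691", "http://s.weibo.com")
```

Running this against each address your proxy API returns, right before handing it to the crawler, filters out proxies that have already expired.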

@thekingofcity (Member)

ProxyError('Cannot connect to proxy.',

Check your proxy setup.

@xwt0016 (Author) commented Mar 7, 2020

Got it, thanks. One more question: can the worker not be started as the root user?

@thekingofcity (Member)

It can; Celery will just print a warning that running as root is not recommended.

@xwt0016 (Author) commented Mar 8, 2020

The proxies can be fetched, and using them for login works fine, so why do proxy errors only show up during search? If I turn the proxy off in page_get I also get nothing; it goes straight to "has been crawled". In theory, even without logging in I should be able to get the first page of results.

2020-03-08 15:07:05 - other - INFO - Login successful! The login account is 17507424089
2020-03-08 15:07:16 - other - INFO - Login successful! The login account is 18574774032
2020-03-08 15:07:42 - crawler - INFO - the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&scope=ori&suball=1&page=1
2020-03-08 15:07:56 - crawler - INFO - keyword 武汉红十字会 has been crawled in this turn
2020-03-08 15:08:16 - other - INFO - Login successful! The login account is qthku6x0@duoduo.cafe
2020-03-08 15:08:37 - other - INFO - Login successful! The login account is tm5vdlac@anjing.cool
2020-03-08 15:08:55 - other - INFO - Login successful! The login account is 73iadkvm@duoduo.cafe

@OneCodeMonkey (Member)

Check whether your accounts can log in normally by trying them manually on your own machine.

@xwt0016 (Author) commented Mar 9, 2020

They can log in normally. The earlier batch I bought required phone verification, so I specifically bought a new batch. I'm wondering whether the cookie for Weibo search is different from the regular Weibo cookie. I was on the temp_verification branch before; today I couldn't log in to Chaojiying (超级鹰), but Yundama (云打码) works again, so I tried version 1.7.2 and got the following errors:
[2020-03-09 09:41:32,622: INFO/ForkPoolWorker-1] Login successful! The login account is 17680282715
[2020-03-09 09:41:32,897: INFO/ForkPoolWorker-1] the crawling url is https://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
[2020-03-09 09:41:33,013: WARNING/ForkPoolWorker-1] /home/xwt/miniconda3/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
[2020-03-09 09:41:36,520: INFO/MainProcess] Received task: tasks.login.login_task[b8fe2c55-0064-4305-bd92-fcd086c8f2a8]
[2020-03-09 09:41:40,106: INFO/ForkPoolWorker-2] Login successful! The login account is qthku6x0@duoduo.cafe
[2020-03-09 09:41:48,453: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,454: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,456: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,458: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,460: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,461: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,461: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'get'
[2020-03-09 09:41:48,463: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,463: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'get'
[2020-03-09 09:41:48,464: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,465: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,466: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,467: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,469: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,469: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,471: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,474: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,475: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,476: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,477: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,477: INFO/ForkPoolWorker-1] keyword 武汉红十字会 has been crawled in this turn
2020-03-09 09:41:54 - other - INFO - Login successful! The login account is yax9gheb@anjing.cool
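The repeated `'NoneType' object has no attribute 'find'` errors are the typical symptom of parsing a page that lacks the expected markup (an anti-crawler or login page served instead of real search results). A defensive-parsing sketch, assuming BeautifulSoup; the `div.card` / `p.txt` selectors are illustrative guesses, not the project's actual ones:

```python
from bs4 import BeautifulSoup

def extract_card_texts(html):
    """Extract the text of each result card, skipping malformed cards
    instead of calling .find() on None and crashing."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for card in soup.select("div.card"):      # hypothetical result-card selector
        txt = card.find("p", class_="txt")    # hypothetical text node
        if txt is None:
            continue                          # markup differs: skip, don't crash
        texts.append(txt.get_text(strip=True))
    return texts
```

If a function like this returns an empty list for a page that visibly has results in a browser, the account or IP is being served a different page than the browser sees.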

@OneCodeMonkey (Member)

I've run into this before. First, if trying the account manually shows it needs phone-number unblocking, then even a successful login cannot fetch the search-page content. If the account is fine, needs no phone unblocking, and login succeeds but you still can't get the search page, then the IP is most likely restricted. Both cases occur.
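An account-level block and an IP-level restriction can often be told apart programmatically after fetching the search page with the account's cookie. A rough heuristic sketch (pure assumption, not the project's logic): an account block usually redirects to the passport login portal, while an IP restriction tends to answer with a non-200 status:

```python
def classify_search_failure(status_code, final_url):
    """Classify a failed search-page fetch: 'account' for a redirect to the
    login/verification portal, 'ip' for a blocked status code, else 'ok'."""
    if "passport.weibo.com" in final_url:
        return "account"  # pushed to login/phone verification: account block
    if status_code != 200:
        return "ip"       # e.g. 403/418-style answers: likely IP restriction
    return "ok"
```

Feed it `resp.status_code` and `resp.url` from a requests call made with `allow_redirects=True`.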

@xwt0016 (Author) commented Mar 9, 2020

I just switched to a proxy API rate-limited to 5 requests per second, crawled 30 items, and then hit proxy errors again, so I'm giving up on search. What I mainly want are the reposts and comments of the matching weibos, so I crawled the search results another way, imported them into weibo_data, and then crawled the comments and reposts, and that works. Thanks for the replies, everyone.

xwt0016 closed this as completed Mar 9, 2020