
System email exception #104

Closed
lizhuquan opened this issue Jul 4, 2018 · 9 comments

Comments

@lizhuquan

lizhuquan commented Jul 4, 2018

Problem 1:
After crawling the personal information of a few hundred Weibo accounts, using a 163.com mailbox, the following exception occurred. I suggest adding exception handling to the email-sending part.

######################
[2018-07-05 08:10:09,864: ERROR/ForkPoolWorker-1] Failed to send emails, (535, b'Error: authentication failed') is raised, here are details: File "/root/softs/weibospider-master/utils/email_warning.py", line 48, in send_email
server.login(email_from, email_pass)

worker: Warm shutdown (MainProcess)
2018-07-05 08:10:09 - crawler - ERROR - failed to crawl http://weibo.com/p/1035051742566624/info?mod=pedit_more,here are details:'NoneType' object is not subscriptable, stack is File "/root/softs/weibospider-master/decorators/decorators.py", line 17, in time_limit
return func(*args, **kargs)

[2018-07-05 08:10:09,878: ERROR/ForkPoolWorker-1] failed to crawl http://weibo.com/p/1035051742566624/info?mod=pedit_more,here are details:'NoneType' object is not subscriptable, stack is File "/root/softs/weibospider-master/decorators/decorators.py", line 17, in time_limit
return func(*args, **kargs)

[2018-07-05 08:10:09,896: INFO/ForkPoolWorker-1] Task tasks.user.crawl_person_infos[3f4c447b-f4cf-41f1-89a6-3d929aed6bfd] succeeded in 2.4491456080004355s: None
[2018-07-05 08:10:10,906: WARNING/MainProcess] Restoring 4 unacknowledged message(s)
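The fix the reporter suggests, wrapping the email send in exception handling so a 535 authentication failure cannot take down the worker, could look like this minimal sketch. The function and parameter names here are hypothetical, not the actual API of the project's utils/email_warning.py:

```python
import logging
import smtplib

logger = logging.getLogger("crawler")

def send_warning_email(host, port, email_from, email_pass, email_to, message):
    """Send a warning email, swallowing SMTP failures so that one bad
    mailbox config (e.g. a 535 auth error from 163.com) cannot kill the
    worker.  Returns True on success, False on any send failure."""
    try:
        server = smtplib.SMTP(host, port, timeout=10)
        server.login(email_from, email_pass)          # raises on 535 auth errors
        server.sendmail(email_from, [email_to], message)
        server.quit()
        return True
    except smtplib.SMTPException as e:
        # e.g. (535, b'Error: authentication failed')
        logger.error("Failed to send emails, %s is raised", e)
        return False
    except OSError as e:
        # connection refused, DNS failure, timeout, ...
        logger.error("Failed to reach the SMTP server: %s", e)
        return False
```

With this shape, an authentication failure is logged and the crawl task continues instead of propagating the exception into the Celery worker.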

@lizhuquan
Author

lizhuquan commented Jul 4, 2018

#################################
Problem 2:
I commented out the handling for the broken mailbox, so no emails are sent for now. But after importing about 2,000 Weibo accounts and 3,000 keywords,
I ran the command:
celery -A tasks.workers -Q login_queue,user_crawler,search_crawler,home_crawler worker -l info -c 1

After crawling some data, the following exception appeared:
[2018-07-05 08:41:48,785: INFO/MainProcess] Received task: tasks.user.crawl_person_infos[44ed1fa0-953e-49a5-b32e-66a6f2e8a94a]
2018-07-05 08:41:49 - crawler - INFO - the crawling url is http://weibo.com/p/1003061497124480/info?mod=pedit_more
[2018-07-05 08:41:49,822: INFO/ForkPoolWorker-1] the crawling url is http://weibo.com/p/1003061497124480/info?mod=pedit_more
2018-07-05 08:41:49 - crawler - WARNING - No cookie in cookies pool. Maybe all accounts are banned, or all cookies are expired
[2018-07-05 08:41:49,825: WARNING/ForkPoolWorker-1] No cookie in cookies pool. Maybe all accounts are banned, or all cookies are expired

worker: Warm shutdown (MainProcess)
2018-07-05 08:41:49 - crawler - ERROR - failed to crawl http://weibo.com/p/1003061497124480/info?mod=pedit_more,here are details:'NoneType' object is not subscriptable, stack is File "/root/softs/weibospider-master/decorators/decorators.py", line 17, in time_limit
return func(*args, **kargs)

[2018-07-05 08:41:49,835: ERROR/ForkPoolWorker-1] failed to crawl http://weibo.com/p/1003061497124480/info?mod=pedit_more,here are details:'NoneType' object is not subscriptable, stack is File "/root/softs/weibospider-master/decorators/decorators.py", line 17, in time_limit
return func(*args, **kargs)

[2018-07-05 08:41:49,850: INFO/ForkPoolWorker-1] Task tasks.user.crawl_person_infos[e73340af-df06-4745-a349-060112884736] succeeded in 1.632628456999555s: None
[2018-07-05 08:41:50,857: WARNING/MainProcess] Restoring 4 unacknowledged message(s)

@ResolveWang
Member

The exception in the first message is actually caught:

Failed to send emails, (535, b'Error: authentication failed') is raised, here are details: File "/root/softs/weibospider-master/utils/email_warning.py", line 48, in send_email

That message comes from the project's own exception-handling module.
As for the second message, your accounts may have been banned, or your cookies may have expired; please verify.
First log in manually to check whether the accounts are still usable, then check whether you have set up scheduled simulated logins.

@lizhuquan
Author

The first log has already been deleted. The issue is reproducible; I will post detailed logs later.
As for the second problem: the cookies had expired.

@lizhuquan
Author

Problem 3:
Large batches of data fail to be written, and the exact cause is unclear. Why do duplicate weibo_id values occur?
Also, how can I crawl the content hidden behind the "展开全文" (expand full text) link?

Thanks for the prompt replies!

2018-07-05 19:11:30 - storage - ERROR - DB operation error,here are details:(pymysql.err.IntegrityError) (1062, "Duplicate entry '4255964637059647' for key 'weibo_id'") [SQL: 'INSERT INTO weibo_data (weibo_id, weibo_cont, weibo_img, weibo_video, repost_num, comment_num, praise_num, uid, is_origin, device, weibo_url, create_time, comment_crawled, repost_crawled, dialogue_crawled, praise_crawled) VALUES (%(weibo_id)s, %(weibo_cont)s, %(weibo_img)s, %(weibo_video)s, %(repost_num)s, %(comment_num)s, %(praise_num)s, %(uid)s, %(is_origin)s, %(device)s, %(weibo_url)s, %(create_time)s, %(comment_crawled)s, %(repost_crawled)s, %(dialogue_crawled)s, %(praise_crawled)s)'] [parameters: {'weibo_img': 'https://tc.sinaimg.cn/images/tc.service.png', 'praise_crawled': 0, 'weibo_cont': '#上证快讯# 【进一
步扩大保险业对外开放 外资经营保险代理及公估业务文件出炉】我国金融业加速对外开放的举措正在相继落地。上海证券报记者获悉,中国银保监会今
日下发《关于允许境外投资者来华经营保险代理业务的通知》、《关于允许境外投资者来华经营保险公估业务的通知》。上述通知自发布之日起执行 \u200b\u200b\u200b\u200b...展开全文c', 'weibo_video': '', 'comment_crawled': 0, 'dialogue_crawled': 0, 'repost_crawled': 0, 'weibo_url': 'https://weibo.com/1076684233/GntTxtCxh?from=page_1002061076684233_profile&wvr=6&mod=weibotime', 'is_origin': 1, 'device': '搜狗高速浏览器', 'comment_num': 0, 'weibo_id': '4255964637059647', 'repost_num': 2, 'uid': '1076684233', 'praise_num': 3, 'create_time': '2018-06-28 19:15'}]

@ResolveWang
Member

Duplicate IDs are perfectly normal. For example, if you search two similar keywords, you are bound to hit some of the same Weibo posts. Duplicate IDs are already covered by the exception handling. What you pasted is a log entry, not an error; in other words, everything recorded there was caught normally by the program, and the details are only written to the log to make debugging and improvement easier.

As for "展开全文" (expand full text), I'm not sure whether it made it into the 1.7.2 release; dev_wpmmaster has it, you can refer to that.

crawling_mode: normal

With this config option, replacing normal with accurate will expand the full text.

@lizhuquan
Author

Thanks a lot.

But is the problem below similar as well? From the log it looks like a rollback occurred?

2018-07-05 19:10:50 - crawler - INFO - the crawling url is http://weibo.com/u/1076684233?is_ori=1&is_tag=0&profile_ftype=1&page=4
2018-07-05 19:11:10 - storage - ERROR - DB operation error,here are details:This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (pymysql.err.IntegrityError) (1062, "Duplicate entry '4256272704081656' for key 'weibo_id'") [SQL: 'INSERT INTO weibo_data (weibo_id, weibo_cont, weibo_img, weibo_video, repost_num, comment_num, praise_num, uid, is_origin, device, weibo_url, create_time, comment_crawled, repost_crawled, dialogue_crawled, praise_crawled) VALUES (%(weibo_id)s, %(weibo_cont)s, %(weibo_img)s, %(weibo_video)s, %(repost_num)s, %(comment_num)s, %(praise_num)s, %(uid)s, %(is_origin)s, %(device)s, %(weibo_url)s, %(create_time)s, %(comment_crawled)s, %(repost_crawled)s, %(dialogue_crawled)s, %(praise_crawled)s)'] [parameters: {'weibo_img': 'https://r.sinaimg.cn/large/tc/news_cnstock_com/1b120f3c78964d560728fcd64f321857.jpg', 'praise_crawled': 0, 'weibo_cont': '#上证快讯# 【万科总裁祝九胜:坚持为普通人盖好房子,
大力发展租赁住房】29日下午,万科在深圳召开2017年度股东大会,董事会主席郁亮主持会议。会上将审议《2017年度董事会报告》、《2017年度监事会报告》、《2017年度报告及摘要》、2017年度利润分配方案等7个事项。O万科总裁祝九胜:坚持为普通人盖好房子,大力... \u200b\u200b\u200b\u200b', 'weibo_video': '', 'comment_crawled': 0, 'dialogue_crawled': 0, 'repost_crawled': 0, 'weibo_url': 'https://weibo.com/1076684233/GnBUqh7Pa?from=page_1002061076684233_profile&wvr=6&mod=weibotime', 'is_origin': 1, 'device': '搜狗高速浏览器', 'comment_num': 0, 'weibo_id': '4256272704081656', 'repost_num': 0, 'uid': '1076684233', 'praise_num': 3, 'create_time': '2018-06-29 15:39'}]

@ResolveWang
Member

Yes, it's similar. It seems that in this release some errors are not followed by a rollback, which causes this problem. I believe it has been fixed in dev_wpm, but I've been busy lately and have had less time to maintain the open-source project, so I forget whether the fix actually landed...
You can check whether data can still be inserted after this message appears; if it can, there should be no real problem.
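The point above — a duplicate-key failure must be followed by a rollback, or the session becomes unusable for later inserts — can be sketched with sqlite3. The project itself uses SQLAlchemy over pymysql, so the table and function names here are only illustrative:

```python
import sqlite3

def insert_weibo(conn, weibo_id, content):
    """Insert one post; on a duplicate weibo_id, roll back and carry on.
    The duplicate is expected (two similar keyword searches can return
    the same post), so it is ignored rather than treated as fatal."""
    try:
        with conn:  # commits on success, rolls back automatically on error
            conn.execute(
                "INSERT INTO weibo_data (weibo_id, weibo_cont) VALUES (?, ?)",
                (weibo_id, content),
            )
        return True
    except sqlite3.IntegrityError:
        # Equivalent of "Duplicate entry ... for key 'weibo_id'"; the
        # rollback above leaves the connection usable for later inserts.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weibo_data (weibo_id TEXT PRIMARY KEY, weibo_cont TEXT)"
)
```

After a duplicate is rejected and rolled back, subsequent inserts on the same connection still succeed, which matches the maintainer's advice to check whether data keeps flowing in after the error message.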

@lizhuquan lizhuquan reopened this Jul 6, 2018
@lizhuquan
Author

Solved, thanks!

@lizhuquan
Author

Regarding the rollback issue on the master branch: although an error is logged, the data is still written to the database.
