Crawling and text analysis were carried out on data from the People's Daily Online (人民网) "Leaders' Message Board" (领导留言板) covering May 2023 to November 2023. The dataset contains about 390,000 messages; due to time constraints only part of the board was crawled, and a full crawl would yield roughly 2 million messages. The crawler handles the following issues (see the sketch after this list):

- keeping the crawl stable when a poor connection returns empty content;
- dealing with the site blocking access after too many requests;
- handling boards that have only a single page and no "next page";
- handling messages that have not received a reply;
- handling pages for non-existent fids, so there is no need to compile a list of valid fids: once the minimum and maximum fid are known, the crawler can simply iterate over the whole range;
- handling pages whose messages fall outside the target date range;
- handling pages that contain no messages.

Note: this data cannot be crawled through the site's API, because a random "signature" parameter is appended to the API link of every message detail page as an anti-crawling measure. Text analysis uses the "cntext" library with several different sentiment dictionaries.
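The sketch below illustrates how several of these cases can fit into one fid-range loop: retrying when a flaky connection returns an empty body, backing off when the site blocks access, skipping non-existent fids, skipping messages outside the date range, and storing an empty reply for messages without one. The URL pattern, CSS selectors, date format, and fid range are placeholders (assumptions, not the project's actual code), and the pagination and "signature" details are left out.

```python
# Minimal sketch of the fid-range crawling loop described above.
# The URL pattern, selectors, date format, and FID_MIN/FID_MAX are
# placeholders; retry counts and sleep intervals are illustrative only.
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup

FID_MIN, FID_MAX = 1, 1000              # placeholder fid range
DATE_FROM = datetime(2023, 5, 1)        # target window: 2023.5-2023.11
DATE_TO = datetime(2023, 11, 30)
DETAIL_URL = "https://example.com/threads/content?fid={fid}"  # placeholder pattern


def fetch(url, max_retries=5):
    """Fetch a page, retrying when the network hiccups or the body comes back empty."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 404:
                return None                  # non-existent fid: skip silently
            if resp.status_code in (403, 429):
                time.sleep(60)               # blocked after too many visits: back off
                continue
            if resp.text.strip():            # guard against empty bodies on flaky links
                return resp.text
        except requests.RequestException:
            pass
        time.sleep(2 * (attempt + 1))        # simple linear backoff before retrying
    return None


def parse(html):
    """Extract one message record; selectors and date format are placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    date_node = soup.select_one(".message-date")        # placeholder selector
    content_node = soup.select_one(".message-content")  # placeholder selector
    reply_node = soup.select_one(".reply-content")      # placeholder selector
    if date_node is None or content_node is None:
        return None                                      # page without a message
    posted = datetime.strptime(date_node.get_text(strip=True), "%Y-%m-%d")
    if not (DATE_FROM <= posted <= DATE_TO):
        return None                                      # message outside the date range
    return {
        "date": posted.date().isoformat(),
        "content": content_node.get_text(strip=True),
        "reply": reply_node.get_text(strip=True) if reply_node else "",  # no reply yet
    }


records = []
for fid in range(FID_MIN, FID_MAX + 1):
    html = fetch(DETAIL_URL.format(fid=fid))
    if html is None:
        continue
    record = parse(html)
    if record is not None:
        records.append(record)
```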
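For the text-analysis side, a minimal sketch of dictionary-based sentiment scoring with cntext follows. The call names (`load_pkl_dict`, `sentiment`) and the bundled DUTIR dictionary are an assumption based on cntext 1.x and may differ in other versions; switching to a different sentiment dictionary only changes the `diction` argument. The example messages are made up, not data from the crawl.

```python
# Minimal sketch of dictionary-based sentiment scoring with cntext.
# Assumes the cntext 1.x API (load_pkl_dict / sentiment) and the bundled
# DUTIR dictionary; names may differ in other cntext versions.
import cntext as ct

# Example messages standing in for crawled message texts.
messages = [
    "小区供暖问题迟迟没有解决,希望有关部门尽快处理。",
    "感谢街道办的及时回复,问题已经解决,非常满意。",
]

# Load one of the bundled Chinese sentiment dictionaries (here: DUTIR).
dutir = ct.load_pkl_dict("DUTIR.pkl")["DUTIR"]

# Count how many words from each emotion category appear in every message.
for text in messages:
    scores = ct.sentiment(text=text, diction=dutir, lang="chinese")
    print(scores)
```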