hnliu-git/baiduindex-crawl

BaiduIndexCrawl

Collects the Baidu Index of a given keyword (for example, a person's name) over a given time period.

Main Code

  • BaiduIndex.py
    Main crawler script
  • SQLTools.py
    Database access helpers
  • ReadXml.py
    Helper for reading XML

Requirements

  • selenium
  • MySQLdb
  • pytesseract

Data Structure (MySQL)

CREATE TABLE `baidu_index` (
  `input_id` int(11) NOT NULL AUTO_INCREMENT,
  `status` int(11) NOT NULL,
  `keyword` varchar(50) DEFAULT NULL,
  `time` varchar(45) CHARACTER SET latin1 DEFAULT NULL,
  `index` longtext,
  PRIMARY KEY (`input_id`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8;

Example row:

input_id | status | keyword | time       | index
1        | 0      | GitHub  | 2016-03-01 | ....

Usage

Insert the keywords and dates you want to query into the baidu_index table, then run:
python BaiduIndex.py
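
Preparing a row might look like the sketch below. It only builds a parameterized INSERT for the baidu_index table shown above; the helper name build_insert is hypothetical and is not part of SQLTools.py, and the connection settings in the comment are placeholders.

```python
def build_insert(keyword, date):
    """Return (sql, params) that queue one keyword for crawling.

    status 0 marks the row as not yet processed.
    """
    sql = (
        "INSERT INTO `baidu_index` (`status`, `keyword`, `time`) "
        "VALUES (%s, %s, %s)"
    )
    return sql, (0, keyword, date)


sql, params = build_insert("战狼2", "2017-12-12")

# Running it with MySQLdb (connection settings are assumptions):
#   import MySQLdb
#   conn = MySQLdb.connect(host="localhost", user="user", passwd="password",
#                          db="mydb", charset="utf8")
#   conn.cursor().execute(sql, params)
#   conn.commit()
```

Passing the values as parameters instead of formatting them into the SQL string lets the driver handle quoting of non-ASCII keywords.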

Sample

Take "战狼2" (Wolf Warrior 2) as an example. The program reads one row from the database like this:
[1,战狼2,2017-12-12]
The program requests the Baidu Index website, then logs in with the accounts listed in the AccountList variable in BaiduIndex.py.
It then collects the Baidu Index from 2017-11-12 to 2018-01-12, that is, from one month before to one month after the given date.
The result looks like this:
[2017-11-12:3930,2017-11-13:4040……]
The result is saved to your local database.
Once this is done, the status of the row with input_id=1 is set to 1 (the default value is 0).
If the keyword has no Baidu Index data, the status is set to -1.
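
The date window, the result format, and the status values described above can be sketched as follows. This is a minimal illustration inferred from the sample, not the code in BaiduIndex.py: the function names are hypothetical, the one-month window is read off the example dates, and the result format is assumed to be a bracketed, comma-separated list of date:value pairs.

```python
import calendar
from datetime import date


def shift_month(d, months):
    """Move a date by whole months, clamping the day to the month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def query_window(center):
    """Return (start, end): one month before and one month after center."""
    return shift_month(center, -1), shift_month(center, 1)


def parse_index(text):
    """Parse '[2017-11-12:3930,2017-11-13:4040]' into {date string: int}."""
    body = text.strip().strip("[]")
    points = {}
    for item in body.split(","):
        if not item:
            continue
        day, _, value = item.rpartition(":")
        points[day] = int(value)
    return points


def next_status(points):
    """1 when index data was collected, -1 when the keyword has none."""
    return 1 if points else -1


start, end = query_window(date(2017, 12, 12))
points = parse_index("[2017-11-12:3930,2017-11-13:4040]")
print(start, end, points, next_status(points))
```

Clamping the day in shift_month is a design choice for dates such as the 31st, where the neighbouring month may be shorter; the original script may handle that case differently.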

More Details

For more details about this code, see my CSDN blog post "A Baidu Index Crawler Based on Selenium and Image Recognition" (基于Selenium与图像识别的百度指数爬虫), or download the code.
