Eastmoney Guba (东方财富股吧) Data Scraper

This project is maintained long-term. Issues are welcome and help improve the code.

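There are two versions, new and old. The new version has more prerequisites but is free to run. The old version has fewer prerequisites but requires a paid proxy (about 1 RMB per hour).

If you only know basic Python and are not familiar with databases, use the old version. For the download link, see the introduction.
If you have a database background (Redis and MongoDB are required), use the new version and keep reading.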

Features

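Scrapes either hot posts or all posts; set the url in main_class.get_data()
- Hot posts: https://guba.eastmoney.com/list,600519,99_1.html
- All posts: https://guba.eastmoney.com/list,600519_1.html
Uses free proxies; tested by the author and able to complete scraping tasks
Very fast when scraping post titles only
Fetches full post content asynchronously with Redis and multiple threads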
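
The two example URLs above differ only in an extra `,99` segment for hot posts. The sketch below builds both forms from a stock code. The helper name `build_list_url` is hypothetical and not part of the project, and treating the trailing `_1` as a page number is an assumption based on the URL shape.

```python
BASE = "https://guba.eastmoney.com/list"


def build_list_url(stock_code: str, page: int = 1, hot_only: bool = False) -> str:
    """Return a Guba list-page URL for the given stock code.

    Hypothetical helper; the page-number meaning of the suffix is assumed.
    """
    # Hot-post URLs carry an extra ",99" segment before the "_<page>" suffix.
    segment = f"{stock_code},99" if hot_only else stock_code
    return f"{BASE},{segment}_{page}.html"


if __name__ == "__main__":
    print(build_list_url("600519", hot_only=True))
    print(build_list_url("600519"))
```

The resulting string is what would be assigned to the url in main_class.get_data().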

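Getting started

1. Get the code

Option 1: if you know how to use git, clone the repository directly
Option 2: download the source. As shown in the figure below, click Download ZIP, then unzip the archive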

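2. Set up the environment

Prerequisites: Redis and MongoDB must be installed and running, and the Redis password must be set to 123456. Instructions for this part will be added later.

Install the proxy pool module (thanks again to its author)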
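
Before continuing, it can help to confirm that both services are actually listening. The following is a minimal sketch using only the standard library. It assumes both services run on localhost with their default ports (6379 for Redis, 27017 for MongoDB), and it only checks that the ports accept connections; it does not verify the Redis password.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed defaults: adjust if your services use other hosts or ports.
    for name, port in (("redis", 6379), ("mongo", 27017)):
        status = "reachable" if port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```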
```
git submodule update --init
```
A virtual environment is recommended. Install the dependencies with
```
pip install -r requirements.txt
```

3. Start the program

Start the proxy pool

Open two new terminals. In the first, run

cd .\proxy_pool\
python proxyPool.py schedule

In the second, run

cd .\proxy_pool\
python proxyPool.py server
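
Once both processes are running, a quick check that the pool hands out proxies can save debugging time later. The sketch below rests on assumptions not stated in this README: that the proxy pool server exposes an HTTP API on 127.0.0.1:5010 and that `GET /get/` returns JSON with a `proxy` field. Check the proxy pool module's own documentation and settings if your setup differs. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the pool API address and the "/get/" endpoint; not confirmed here.
POOL_URL = "http://127.0.0.1:5010/get/"


def parse_proxy(body: str):
    """Extract a 'host:port' string from a pool response, or return None."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    proxy = data.get("proxy")  # assumed field name
    return proxy if isinstance(proxy, str) and proxy else None


def fetch_proxy(url: str = POOL_URL, timeout: float = 5.0):
    """Ask the pool for one proxy; returns None if none is available."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_proxy(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(fetch_proxy())
```

If this prints None for a while after startup, the scheduler may simply not have collected any working proxies yet.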

Start FullTextCrawler

Open a new terminal and run
```
python -m full_text_Crawler
```
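
The feature list describes full post content as being fetched with Redis and multiple threads. To illustrate the general pattern only, here is a minimal worker-pool sketch in which an in-process `queue.Queue` stands in for the Redis queue. It is not the project's implementation; `run_workers` and the `fetch` callable are hypothetical names.

```python
import queue
import threading


def run_workers(urls, fetch, num_workers=4):
    """Fetch every URL using a pool of threads; return {url: result}."""
    tasks = queue.Queue()  # stand-in for a Redis-backed task queue
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread exit
            try:
                value = fetch(url)
            except Exception as exc:
                value = exc  # record the failure instead of killing the worker
            with lock:
                results[url] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Dummy fetch function so the sketch runs without network access.
    print(run_workers(["a", "b", "c"], lambda u: u.upper(), num_workers=2))
```

With a real Redis queue, the producer and the workers can live in separate processes, which fits running the crawler in its own terminal.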
Start the main program

Set the parameters in main_class.py, then open a new terminal and run
```
python -m main_class
```

Successfully scraped data is stored in MongoDB.guba. If you run into problems, please open an issue.
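
To confirm that data arrived, you can list what is stored. The sketch below assumes that `guba` is a database name, that MongoDB runs on localhost with the default port 27017 and no authentication, and that the `pymongo` driver is installed. The `summarize` helper is hypothetical and not part of the project.

```python
def summarize(db):
    """Return {collection_name: approximate document count} for a database."""
    return {
        name: db[name].estimated_document_count()
        for name in db.list_collection_names()
    }


if __name__ == "__main__":
    # Third-party driver; assumed to be installed in your environment.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed address
    print(summarize(client["guba"]))  # assumes "guba" is the database name
```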

Appendix

Screenshot of successfully scraped data
Screenshot of a Guba page
