# Experiment: Scraping and Mining 2023 Stock Data

## Objective: Analyzing the A-Share Market

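- Using the **A-share** market as the setting, select popular industries and scrape and organize stock data from **Eastmoney (东方财富网)** for the period **2018 to the present**;
- Following the K-line (candlestick) charts shown on stock websites, produce **box plots and line charts** so that analysts can read industry trends more clearly and accurately;
- **Analyze the industry data for 2018 and 2022** to understand how each industry has developed, relate stock movements to domestic and international events, and give investors a basis for decisions;
- Build a **clustering model** that groups stocks into clusters based on features such as **price change (%) and amplitude (%)**, to help investors better understand the different types of stocks.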
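
To make the clustering objective concrete, here is a minimal sketch of plain k-means on two features per stock. It is not the model built later in the experiment: the feature values are hypothetical, the `kmeans` helper is an illustrative name, and the first `k` points are used as initial centroids only to keep the run deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster 2-D points with plain k-means; returns (centroids, labels)."""
    # Deterministic start: use the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical per-stock features: (mean daily change %, mean daily amplitude %).
features = [
    (0.05, 1.5), (0.02, 1.8), (0.04, 1.6),   # calmer stocks
    (0.30, 6.0), (-0.25, 6.5), (0.10, 5.8),  # more volatile stocks
]
centroids, labels = kmeans(features, k=2)
print(labels)
```

With these made-up values the split is driven almost entirely by amplitude, because its range is much larger than that of the change column. With real data it is usually worth standardizing the features before clustering.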

## Experiment Content

### 1. Data Scraping

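A crawler collects stock and industry data from [Eastmoney (东方财富网)](https://www.eastmoney.com/).
Note: fetch daily K-line data for individual stocks and weekly K-line data for industries, so that both are ready for the visualization and modeling steps that follow.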
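
The daily/weekly distinction comes down to a single query parameter. The sketch below builds the request URL with `urllib.parse.urlencode`, using the `klt` values documented in the comments of `fetch_stock_daily_k_data` (101 daily, 102 weekly, 103 monthly). `build_kline_url` and the `KLT` mapping are hypothetical names, the default dates are placeholders, and the code only builds a string without touching the network. The source does not show which `secid` an industry board uses, so the example uses a stock.

```python
from urllib.parse import urlencode

# klt values taken from the request-building comments in fetch_stock_daily_k_data.
KLT = {"daily": 101, "weekly": 102, "monthly": 103}


def build_kline_url(secid, period="daily", fq=1, beg="20180101", end="20240527"):
    """Build a K-line request URL (hypothetical helper, no network access)."""
    params = {
        "secid": secid,
        "fields1": "f1,f2,f3,f4,f5,f6",
        "fields2": "f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,f61",
        "klt": KLT[period],
        "fqt": fq,
        "beg": beg,
        "end": end,
    }
    # safe="," keeps the comma-separated field lists unescaped.
    query = urlencode(params, safe=",")
    return "https://push2his.eastmoney.com/api/qt/stock/kline/get?" + query


weekly_url = build_kline_url("1.600070", period="weekly")
print(weekly_url)
```

Building the query from a dictionary avoids the missing `&` and stray `?` mistakes that are easy to make with manual string concatenation.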

In [70]:
import requests
from datetime import datetime
import pandas as pd
import os
from pprint import pprint


def fetch_stock_daily_k_data(
        stock_code, 
        stock_exchange='0',
        start_date='2018-01-01', 
        end_date=None, 
        fq=1, 
        other_info=None):
    """
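    Fetch historical daily K-line data for the given stock code.
    :param stock_code: stock code, e.g. '000001' (平安银行, Ping An Bank)
    :param stock_exchange: exchange flag used in secid, '0' for Shenzhen and '1' for Shanghai; defaults to '0'
    :param start_date: start date as a 'YYYY-MM-DD' string; defaults to '2018-01-01'
    :param end_date: end date as a 'YYYY-MM-DD' string; defaults to today
    :param fq: price adjustment type, 1 for forward-adjusted, 2 for backward-adjusted; defaults to 1
    :param other_info: optional [industry, region, concepts] list; if given, it is saved to the Excel file as extra columns
    :return: DataFrame of historical K-line data, or None if the response contains no K-line data
    """
    start_date = datetime.strptime(start_date, "%Y-%m-%d").strftime("%Y%m%d")
    if end_date is None:  # default to today when no end date is given
        end_date = datetime.now().strftime('%Y%m%d')
    else:  # accept the same 'YYYY-MM-DD' format as start_date
        end_date = datetime.strptime(end_date, "%Y-%m-%d").strftime("%Y%m%d")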
    print(f"Fetching data from {start_date} to {end_date} for stock {stock_code}...")
    
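    # Output columns: stock code, stock name, date, open, close, high, low,
    # volume, turnover, amplitude (%), change (%), change amount, turnover rate (%)
    columns = [
        '股票代码', '股票名称', '日期', '开盘价', '收盘价', '最高价', '最低价',
        '成交量', '成交额', '振幅(%)', '涨跌幅(%)', '涨跌额', '换手率(%)',
    ]
    if other_info is not None:
        columns += ['行业', '地域', '概念', ]  # industry, region, concepts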

    # Build the API request
    base_url = "https://push2his.eastmoney.com/api/qt/stock/kline/get?"
    # secid: "0." for Shenzhen or "1." for Shanghai, followed by the stock code
    secid_fields1 = f"secid={stock_exchange}.{stock_code}&fields1=f1,f2,f3,f4,f5,f6&"
    # fields2: f51 date; f52 open; f53 close; f54 high; f55 low;
    #   f56 volume; f57 turnover; f58 amplitude; f59 change (%); f60 change amount; f61 turnover rate
    fields2 = "fields2=f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,f61&"
    # klt: 101 daily, 102 weekly, 103 monthly K-line; fqt: 1 forward-adjusted, 2 backward-adjusted; beg/end: start and end dates
    klt_fqt_beg_end = f"klt=101&fqt={fq}&beg={start_date}&end={end_date}"
    url_api = base_url + secid_fields1 + fields2 + klt_fqt_beg_end
    response = requests.get(url_api, timeout=10)  # fail fast instead of hanging on a stalled connection
    response.raise_for_status()
    data = response.json()
    # pprint(data)
    
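    # Extract the required fields from the API response
    try:
        klines_data = data['data']['klines']
    except (KeyError, TypeError) as e:  # 'data' or 'klines' is missing, or 'data' is null
        print(f"Error: {e}\nPlease check stock exchange or any other parameters.")
        return None
    stock_name = data['data']['name'].replace('*', '')  # strip asterisks, which are not allowed in Windows file names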
    klines_table = []
    for kline in klines_data:
        kline = kline.split(',')
        kline = [stock_code, stock_name] + kline
        if other_info is not None:
            kline += other_info
        klines_table.append(kline)
    # print(klines_table[:5])
    # Convert the data to a DataFrame
    klines_table = pd.DataFrame(klines_table, columns=columns)

    # Save the data to an Excel file
    data_folder = 'kline_data'
    os.makedirs(data_folder, exist_ok=True)
    try:
        klines_table.to_excel(f"{data_folder}/{stock_code}_{stock_name}_daily_k_data.xlsx", index=False)
        print(f"Data saved to {data_folder}/{stock_code}_{stock_name}_daily_k_data.xlsx successfully.\n")  
    except Exception as e:
        print(f"Error: {e}\nPlease check if the file is open and close it before running the script again.")
    return klines_table


fetch_stock_daily_k_data('600070', stock_exchange='1')

Fetching data from 20180101 to 20240527 for stock 600070...
Data saved to kline_data/600070_ST富润_daily_k_data.xlsx successfully.



| | 股票代码 | 股票名称 | 日期 | 开盘价 | 收盘价 | 最高价 | 最低价 | 成交量 | 成交额 | 振幅(%) | 涨跌幅(%) | 涨跌额 | 换手率(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600070 | ST富润 | 2018-01-02 | 9.39 | 9.19 | 9.39 | 9.18 | 38441 | 36855170.00 | 2.25 | -1.61 | -0.15 | 1.08 |
| 1 | 600070 | ST富润 | 2018-01-03 | 9.27 | 9.21 | 9.27 | 9.14 | 19653 | 18652422.00 | 1.41 | 0.22 | 0.02 | 0.55 |
| 2 | 600070 | ST富润 | 2018-01-04 | 9.24 | 9.27 | 9.29 | 9.11 | 16317 | 15509115.00 | 1.95 | 0.65 | 0.06 | 0.46 |
| 3 | 600070 | ST富润 | 2018-01-05 | 9.29 | 9.19 | 9.29 | 9.13 | 17362 | 16508281.00 | 1.73 | -0.86 | -0.08 | 0.49 |
| 4 | 600070 | ST富润 | 2018-01-08 | 9.17 | 9.07 | 9.17 | 9.03 | 13355 | 12539461.00 | 1.52 | -1.31 | -0.12 | 0.37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1543 | 600070 | ST富润 | 2024-05-21 | 1.14 | 1.08 | 1.14 | 1.06 | 71658 | 7871500.00 | 7.27 | -1.82 | -0.02 | 1.41 |
| 1544 | 600070 | ST富润 | 2024-05-22 | 1.09 | 1.12 | 1.13 | 1.07 | 67656 | 7522146.00 | 5.56 | 3.70 | 0.04 | 1.34 |
| 1545 | 600070 | ST富润 | 2024-05-23 | 1.13 | 1.09 | 1.15 | 1.08 | 61764 | 6886190.00 | 6.25 | -2.68 | -0.03 | 1.22 |
| 1546 | 600070 | ST富润 | 2024-05-24 | 1.08 | 1.10 | 1.11 | 1.05 | 59412 | 6405451.00 | 5.50 | 0.92 | 0.01 | 1.17 |
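
The `涨跌额` (change amount), `涨跌幅(%)` (change) and `振幅(%)` (amplitude) columns can be cross-checked against the prices. The sketch below assumes the conventional definitions, with change and amplitude both measured relative to the previous day's close, and applies them to the first two rows of the output above. `parse_kline` is a hypothetical helper. It also converts the comma-separated fields to `float`, since the function above stores every field as a string.

```python
# Two consecutive records in API field order (f51..f61):
# date, open, close, high, low, volume, turnover,
# amplitude %, change %, change amount, turnover rate %
rows = [
    "2018-01-02,9.39,9.19,9.39,9.18,38441,36855170.00,2.25,-1.61,-0.15,1.08",
    "2018-01-03,9.27,9.21,9.27,9.14,19653,18652422.00,1.41,0.22,0.02,0.55",
]


def parse_kline(line):
    """Split one K-line record and convert its numeric fields to float."""
    date, *numbers = line.split(",")
    keys = ["open", "close", "high", "low", "volume", "turnover",
            "amplitude_pct", "change_pct", "change", "turnover_rate_pct"]
    return {"date": date, **dict(zip(keys, map(float, numbers)))}


prev, curr = (parse_kline(row) for row in rows)

# Assumed definitions, both relative to the previous close.
change = curr["close"] - prev["close"]
change_pct = change / prev["close"] * 100
amplitude_pct = (curr["high"] - curr["low"]) / prev["close"] * 100

print(round(change, 2), round(change_pct, 2), round(amplitude_pct, 2))
```

For 2018-01-03 the recomputed values round to 0.02, 0.22 and 1.41, matching the three columns in the table.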


In [71]:
Industry = "BK0447"  # Internet services (互联网服务)

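def get_industry_stocks(industry):
    """Fetch and save daily K-line data for every A-share stock in an industry.
    :param industry: industry board code, e.g. 'BK0447' (互联网服务, Internet services)
    :return: list of the codes of all stocks in the industry
    """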
    # pn: page number; pz: page size; po: sort order (0: ascending, 1: descending); fid: sort field
    base_url = "http://push2.eastmoney.com/api/qt/clist/get?pn=1&pz=1000&po=0&np=1&fltt=2&invt=2&fid=f12&"
    # fs: stock filter; "b:" followed by the industry board code
    fs = f"fs=b:{industry}&"
    # fields: returned fields; f12: stock code; f13: exchange (0 Shenzhen, 1 Shanghai); f14: stock name; f100: industry; f102: region; f103: concepts
    fields = "fields=f12,f13,f14,f100,f102,f103"
    url = base_url + fs + fields
    response = requests.get(url, timeout=10)  # fail fast instead of hanging on a stalled connection
    response.raise_for_status()
    data = response.json()
    # pprint(data)
    
    stocks = data['data']['diff']
    # Exclude B shares (codes starting with 900 or 200) and H shares (5-digit codes).
    # Fields are read by key rather than by position, so the result does not depend on the order of keys in the response.
    stocks = [
        stock for stock in stocks
        if not stock['f12'].startswith(('900', '200')) and len(stock['f12']) == 6
    ]
    # print(stocks[0])
    stock_codes = [stock['f12'] for stock in stocks]  # stock codes
    stock_exchanges = [stock['f13'] for stock in stocks]  # exchange flags
    stock_info = [[stock['f100'], stock['f102'], stock['f103']] for stock in stocks]  # industry, region, concepts
    len_stock_codes = len(stock_codes)
    for i in range(len_stock_codes):
        print(f"[{i+1}/{len_stock_codes}]", end=" ")
        fetch_stock_daily_k_data(
            stock_codes[i], stock_exchange=stock_exchanges[i], other_info=stock_info[i]
        )
    return stock_codes

get_industry_stocks(Industry)

[1/146] Fetching data from 20180101 to 20240527 for stock 000409...
Data saved to kline_data/000409_云鼎科技_daily_k_data.xlsx successfully.

[2/146] Fetching data from 20180101 to 20240527 for stock 000555...
Data saved to kline_data/000555_神州信息_daily_k_data.xlsx successfully.

[3/146] Fetching data from 20180101 to 20240527 for stock 000676...
Data saved to kline_data/000676_智度股份_daily_k_data.xlsx successfully.

[4/146] Fetching data from 20180101 to 20240527 for stock 000938...
Data saved to kline_data/000938_紫光股份_daily_k_data.xlsx successfully.

[5/146] Fetching data from 20180101 to 20240527 for stock 000997...
Data saved to kline_data/000997_新 大 陆_daily_k_data.xlsx successfully.

[6/146] Fetching data from 20180101 to 20240527 for stock 002095...
Data saved to kline_data/002095_生 意 宝_daily_k_data.xlsx successfully.

[7/146] Fetching data from 20180101 to 20240527 for stock 002115...
Data saved to kline_data/002115_三维通信_daily_k_data.xlsx successfully.

[8/146] Fetching data from 20180
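
The loop above sends one request per stock (146 in this run), so a single transient network error stops the whole crawl. One possible mitigation, sketched here and not part of the original notebook, is a small retry wrapper with a pause between attempts. `call_with_retry` and `flaky_fetch` are hypothetical names; `flaky_fetch` stands in for `fetch_stock_daily_k_data` so the example runs without network access.

```python
import time


def call_with_retry(func, *args, attempts=3, delay=1.0, **kwargs):
    """Call func, retrying on any exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # in real use, narrow this to requests.RequestException
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s...")
            time.sleep(delay)


# Stand-in for fetch_stock_daily_k_data that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch(stock_code):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return f"data for {stock_code}"


result = call_with_retry(flaky_fetch, "000409", attempts=3, delay=0)
print(result)
```

In the crawl loop, the direct call would become `call_with_retry(fetch_stock_daily_k_data, code, stock_exchange=..., other_info=...)`. A nonzero `delay` also spaces out requests to the server.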

['000409',
 '000555',
 '000676',
 '000938',
 '000997',
 '002095',
 '002115',
 '002131',
 '002195',
 '002232',
 '002264',
 '002291',
 '002315',
 '002316',
 '002331',
 '002354',
 '002368',
 '002373',
 '002380',
 '002401',
 '002474',
 '002530',
 '002609',
 '002642',
 '002649',
 '002657',
 '002766',
 '002771',
 '002777',
 '002912',
 '002990',
 '003005',
 '003010',
 '300017',
 '300020',
 '300044',
 '300059',
 '300078',
 '300079',
 '300096',
 '300150',
 '300166',
 '300167',
 '300168',
 '300170',
 '300212',
 '300226',
 '300231',
 '300245',
 '300248',
 '300250',
 '300264',
 '300269',
 '300271',
 '300277',
 '300287',
 '300288',
 '300290',
 '300295',
 '300300',
 '300324',
 '300383',
 '300399',
 '300418',
 '300419',
 '300440',
 '300442',
 '300448',
 '300496',
 '300508',
 '300518',
 '300523',
 '300532',
 '300541',
 '300592',
 '300609',
 '300634',
 '300645',
 '300674',
 '300678',
 '300682',
 '300687',
 '300738',
 '300766',
 '300785',
 '300792',
 '300846',
 '300872',
 '300895',
 '300941',
 '301001',