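## Data Processing Module
### Approach
1. Stratified sampling plus voting across several small models is a model-ensembling technique. It can improve accuracy metrics and is typically used when chasing leaderboard scores; the drawback is higher compute cost.
2. Stratified sampling currently uses 5 folds: in each split, 4 folds are used for training and the remaining 1 for testing, which yields 5 models. (Different model architectures could be used instead, but stratified sampling is more useful.)
3. Each model is trained on only part of the data, so the models' predictions are expected to disagree on some samples. Those samples are reviewed manually, and the corrected samples are then used for training.
4. For each day's batch of data, the k-fold models can be used for labeling, with inconsistent labels checked manually.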
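
Steps 3 and 4 come down to comparing the per-model predictions for each sample. The following is a minimal sketch, assuming each fold model has already produced one predicted label per sample. `vote_and_flag` and the example predictions are hypothetical and not part of this notebook.

```python
from collections import Counter

def vote_and_flag(predictions):
    """Majority-vote over per-model predictions and flag disagreements.

    predictions: list of lists, one inner list per model, all of equal length.
    Returns (voted_labels, disagreement_indices).
    """
    voted, disagreements = [], []
    # zip(*predictions) groups the labels that all models gave to one sample
    for idx, labels in enumerate(zip(*predictions)):
        counts = Counter(labels)
        # Take the label predicted by the most models
        voted.append(counts.most_common(1)[0][0])
        if len(counts) > 1:
            # At least two models disagree: send this sample to manual review
            disagreements.append(idx)
    return voted, disagreements

# Hypothetical predictions from 3 models on 4 samples
preds = [
    ['sports', 'finance', 'tech', 'tech'],
    ['sports', 'finance', 'finance', 'tech'],
    ['sports', 'tech', 'finance', 'tech'],
]
voted, to_review = vote_and_flag(preds)
```

With an odd number of models and two candidate labels a strict majority always exists. With more labels, ties are possible and are themselves a good signal for manual review.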

In [None]:
import os

import pandas as pd
from torch4keras.snippets import YamlConfig

# Load file paths from the project config
config = YamlConfig('./config.yaml')
data_dir = config['data_dir']
data_path = os.path.join(data_dir, 'cls.xlsx')
map_path = os.path.join(data_dir, 'category_map.xlsx')
data = pd.read_excel(data_path)
map_data = pd.read_excel(map_path)

# Drop any existing class_id so the mapping file is the only source of ids
if 'class_id' in data.columns:
    data.drop('class_id', axis=1, inplace=True)

# Attach class_id by class_name (inner join: rows whose class_name is
# missing from the mapping file are dropped)
data = pd.merge(data, map_data[['class_id', 'class_name']], on='class_name')
print('Number of samples:', len(data))

data.head()

In [None]:
# Basic data statistics: number of samples per class
data['class_name'].value_counts()
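
The class counts matter for the next step: scikit-learn's `StratifiedKFold` emits a warning when the least populated class has fewer members than `n_splits`. The following is a small sketch for spotting such classes before splitting. `rare_classes` is a hypothetical helper. It accepts any iterable of labels, so `data['class_name']` could be passed directly.

```python
from collections import Counter

def rare_classes(labels, n_splits=5):
    """Return {label: count} for classes with fewer than n_splits samples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < n_splits}

# Hypothetical labels for illustration: class 'b' is too small for 5 folds
example = ['a'] * 6 + ['b'] * 2
print(rare_classes(example))
```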

In [None]:
# Stratified 5-fold split, saved as one train/test JSONL pair per fold
from sklearn.model_selection import StratifiedKFold
import json

X, y = data[['content', 'class_id']], data['class_name']
kf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
for i, (train_index, test_index) in enumerate(kf.split(X, y)):
    # kf.split yields positional indices, so use .iloc for both X and y
    train_data = []
    content, class_id = X.iloc[train_index]['content'], X.iloc[train_index]['class_id']
    for c, cid, l in zip(content, class_id, y.iloc[train_index]):
        train_data.append(json.dumps({'content': c, 'class_id': cid, 'class_name': l}, ensure_ascii=False) + '\n')

    test_data = []
    content, class_id = X.iloc[test_index]['content'], X.iloc[test_index]['class_id']
    # Labels must come from test_index here, not train_index
    for c, cid, l in zip(content, class_id, y.iloc[test_index]):
        test_data.append(json.dumps({'content': c, 'class_id': cid, 'class_name': l}, ensure_ascii=False) + '\n')

    with open(f'./data/fold_{i}_train.jsonl', 'w', encoding='utf-8') as f:
        f.writelines(train_data)

    with open(f'./data/fold_{i}_test.jsonl', 'w', encoding='utf-8') as f:
        f.writelines(test_data)
    # Number of distinct labels in this fold's test split
    print('Number of labels: ', len(set(class_id)))
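
To confirm that the stratification carried over to the saved files, the class proportions of each file can be read back and compared. The following is a sketch assuming the `fold_{i}_train.jsonl` / `fold_{i}_test.jsonl` layout written above. `label_proportions` is a hypothetical helper, not part of the original notebook.

```python
import json
from collections import Counter

def label_proportions(path):
    """Read a JSONL file and return {class_name: fraction of rows}."""
    with open(path, encoding='utf-8') as f:
        # Skip blank lines so a trailing newline does not break parsing
        labels = [json.loads(line)['class_name'] for line in f if line.strip()]
    total = len(labels)
    if total == 0:
        return {}
    return {label: n / total for label, n in Counter(labels).items()}
```

For example, `label_proportions('./data/fold_0_train.jsonl')` and `label_proportions('./data/fold_0_test.jsonl')` should return roughly the same fractions per class if the split is stratified as intended.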