# 240403更新

将notebook配置为读取配置文件之后决定特征列表名称以及标签名称，以便于响应变化快速的数据情况

# 机器学习拟合并测试数据

1. 读取数据目录：

训练集：`../data/train.csv`

测试集：`../data/test.csv`

2. 数据格式（列名）

```
index,q,powerloss,f,vout,iout,pa,prx,bty_temp,chnl,ce_pkg,rpp_pkg,ss_pkg,location
```

其中：`index`为数据id，唯一；`location`为数据标签；其余部分为特征值字段。

3. 测试模型

- [X] 线性回归（linear regression）
- [X] 逻辑回归（logistic regression）

In [2]:
import sklearn
import pandas as pd
import numpy as np
import time
import random
import pickle

In [1]:
import json

# 0. 读取数据

读取数据并转换为机器学习可用的矩阵（numpy格式）

**240403更新，现在机器学习部分会读取`data/settings`文件夹中的`feature.json`和`label.json`文件作为特征以及标签的字段名**

二者与读取进来的文件中相关字段应该是匹配的。

In [42]:
feature_settings_fpath = '../data/settings/feature.json'
label_settings_fpath = '../data/settings/label.json'

In [43]:
with open(feature_settings_fpath, 'r',encoding='utf-8') as file:
    feature_dict = json.load(file)

In [44]:
with open(label_settings_fpath, 'r',encoding='utf-8') as file:
    label_dict = json.load(file)

In [45]:
label_dict

{'index': [0, 1]}

In [46]:
feature_dict

{'q/qm': [0, 200],
 'ploss': [-10000, 10000],
 'fre': [130, 146],
 'vpa': [0, 400000],
 'papower': [0, 65000],
 'ch': [0, 8],
 'ce': [-127, 128],
 'rppower': [0, 65000],
 'ss': [0, 255]}

In [47]:
feature_columns = list(feature_dict.keys())

In [48]:
# 训练集数据路径
train_fpath = '../data/test_data/train_240403.xlsx'
# 测试集数据路径
test_fpath = '../data/test_data/test_240403.xlsx'

In [49]:
train_df = pd.read_excel(train_fpath)
test_df = pd.read_excel(test_fpath)

#### 将所有columns名转换为小写

In [50]:
train_df.columns = [i.lower() for i in train_df.columns]
test_df.columns = [i.lower() for i in test_df.columns]

#### 兼容之前的代码写法

In [51]:
feature_key_lst = feature_columns

In [52]:
feature_range_dict = feature_dict

In [53]:
label_key = list(label_dict.keys())[0]

# 1. 缺失数据补全（todo）

如果存在部分列数据缺失，在这一步骤进行数据的补全以便于减少这部分缺失数据的影响。

# 2. 特征处理

这一步骤针对每列特征进行处理，处理手段有很多种，例如数据分桶，数据扩展，数据统计等等，此处只对数据进行归一化放缩，使得所有特征值范围均在`0,1`之间，以便于机器学习模型学习的稳定性。

后续根据问题的深入程度不同，可以将更多的特征处理手段应用在这一步骤，来增强整体机器学习系统的表现。

## 2.1 数据归一化

根据特征值范围，将现有特征放缩到`(0,1)`之间

In [54]:
for k in feature_key_lst:
    # 数据范围上界
    ub = feature_range_dict[k][0]
    # 数据范围下界
    lb = feature_range_dict[k][1]
    
    train_df[k] = train_df[k].apply(lambda x:(x-lb)/(ub-lb))
    test_df[k] = test_df[k].apply(lambda x:(x-lb)/(ub-lb))

In [55]:
train_df

Unnamed: 0,index,q/qm,ploss,fre,vpa,papower,ch,ce,rppower,ss
0,0,0.44,0.55855,-8928.375,0.970137,0.958108,-0.000,0.533333,0.961538,0.431373
1,0,0.44,0.56235,-9053.375,0.953228,0.888615,0.125,0.501961,0.902308,0.431373
2,0,0.44,0.55915,-9053.375,0.940720,0.550938,0.125,0.498039,0.604969,0.431373
3,0,0.44,0.58175,-9053.375,0.938763,0.449062,-0.000,0.501961,0.506231,0.431373
4,0,0.44,0.57265,-9053.375,0.936492,0.320385,-0.000,0.494118,0.379554,0.431373
...,...,...,...,...,...,...,...,...,...,...
247,1,0.48,0.52910,-7309.625,0.970137,0.950938,-0.000,0.494118,0.963354,0.517647
248,1,0.48,0.52110,-9053.375,0.944815,0.882508,0.625,0.364706,0.936677,0.517647
249,1,0.48,0.48840,-9053.375,0.929283,0.755862,0.125,0.501961,0.903031,0.517647
250,1,0.48,0.48395,-9053.375,0.928170,0.746723,0.375,0.435294,0.903831,0.517647


## 2.2 输出特征矩阵

输出训练特征矩阵分别为：`train_X`和`test_X`，测试标签为`train_y`和`test_y`。

In [56]:
train_X = train_df[feature_key_lst].to_numpy()

In [57]:
train_X.shape

(252, 9)

In [58]:
test_X = test_df[feature_key_lst].to_numpy()

In [59]:
test_X.shape

(8, 9)

In [60]:
train_y = train_df[label_key].to_numpy()
test_y = test_df[label_key].to_numpy()

# 3. 模型学习

使用`sklearn`工具库自带的机器学习模型对训练数据进行拟合，并在测试数据上测试，评价指标为`precision`,`recall`,`f1_score`,`accuracy`。

不同模型的调参方法不同，可以根据需要进行参数调整。

## 3.0 引入评价指标

In [61]:
from sklearn.metrics import precision_score,recall_score,f1_score,accuracy_score

## 3.1 线性回归模型

In [121]:
from sklearn.linear_model import LinearRegression

In [122]:
# 设置分类阈值 超过该值的分类为1
threshold = 0.5

In [123]:
model = LinearRegression()

In [124]:
model.fit(train_X,train_y)

LinearRegression()

In [125]:
pred_y = model.predict(test_X)

In [126]:
pred_test_y = np.array([1 if i>threshold else 0 for i in pred_y])

In [127]:
pred_train = model.predict(train_X)

In [128]:
pred_train_y = np.array([1 if i>threshold else 0 for i in pred_train])

### 训练集分数

In [129]:
p_score = precision_score(train_y, pred_train_y)
r_score = recall_score(train_y, pred_train_y)
f_score = f1_score(train_y, pred_train_y)
acc_score = accuracy_score(train_y, pred_train_y)

### 训练集结果

In [130]:
print(f'Model:{model.__str__()}, Recall:{r_score:.4}, Precision:{p_score:.4} F1:{f_score:.4} Accuracy:{acc_score:.4}')

Model:LinearRegression(), Recall:0.947, Precision:0.9398 F1:0.9434 Accuracy:0.9405


In [131]:
train_y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

In [132]:
pred_train_y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

### 测试集分数

In [133]:
p_score = precision_score(test_y, pred_test_y)
r_score = recall_score(test_y, pred_test_y)
f_score = f1_score(test_y, pred_test_y)
acc_score = accuracy_score(test_y, pred_test_y)

### 测试集结果

In [134]:
print(f'Model:{model.__str__()}, Recall:{r_score:.4}, Precision:{p_score:.4} F1:{f_score:.4} Accuracy:{acc_score:.4}')

Model:LinearRegression(), Recall:0.75, Precision:0.6 F1:0.6667 Accuracy:0.625


## 3.1 逻辑回归模型

In [135]:
from sklearn.linear_model import LogisticRegression

In [136]:
# 设置分类阈值 超过该值的分类为1
threshold = 0.5

In [137]:
model = LogisticRegression()

In [138]:
model.fit(train_X,train_y)

LogisticRegression()

In [139]:
pred_y = model.predict(test_X)

In [140]:
pred_test_y = np.array([1 if i>threshold else 0 for i in pred_y])

In [141]:
pred_train = model.predict(train_X)

In [142]:
pred_train_y = np.array([1 if i>threshold else 0 for i in pred_train])

### 训练集分数

In [143]:
p_score = precision_score(train_y, pred_train_y)
r_score = recall_score(train_y, pred_train_y)
f_score = f1_score(train_y, pred_train_y)
acc_score = accuracy_score(train_y, pred_train_y)

### 训练集结果

In [144]:
print(f'Model:{model.__str__()}, Recall:{r_score:.4}, Precision:{p_score:.4} F1:{f_score:.4} Accuracy:{acc_score:.4}')

Model:LogisticRegression(), Recall:1.0, Precision:0.5238 F1:0.6875 Accuracy:0.5238


In [145]:
pred_train_y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

### 测试集分数

In [146]:
p_score = precision_score(test_y, pred_test_y)
r_score = recall_score(test_y, pred_test_y)
f_score = f1_score(test_y, pred_test_y)
acc_score = accuracy_score(test_y, pred_test_y)

### 测试集结果

In [147]:
print(f'Model:{model.__str__()}, Recall:{r_score:.4}, Precision:{p_score:.4} F1:{f_score:.4} Accuracy:{acc_score:.4}')

Model:LogisticRegression(), Recall:1.0, Precision:0.5 F1:0.6667 Accuracy:0.5
