# 机器学习拟合并测试数据

1. 读取数据目录：

训练集：`../data/train.csv`

测试集：`../data/test.csv`

2. 数据格式（列名）

```
index,q,powerloss,f,vout,iout,pa,prx,bty_temp,chnl,ce_pkg,rpp_pkg,ss_pkg,location
```

其中：`index`为数据id，唯一；`location`为数据标签；其余部分为特征值字段。

3. 测试模型

- [X] 线性回归（linear regression）
- [X] 逻辑回归（logistic regression）

In [1]:
import sklearn
import pandas as pd
import numpy as np
import time
import random
import pickle

# 0. 读取数据

读取数据并转换为机器学习可用的矩阵（numpy格式）

In [5]:
# 数据范围
feature_range_dict = {
    'q':(0,2),
    'powerloss':(0,10_000),
    'f':(130,146),
    'vout':(15,19),
    'iout':(0,2500),
    'pa':(0,65),
    'prx':(0,65),
    'bty_temp':(0,50),
    'chnl':(0,8),
    'ce_pkg':(-127,128),
    'rpp_pkg':(0,65),
    'ss_pkg':(0,255)
}

In [4]:
# 特征字段名 顺序任意
feature_key_lst = ['q',
                   'powerloss',
                   'f',
                   'vout',
                   'iout',
                   'pa',
                   'prx',
                   'bty_temp',
                   'chnl',
                   'ce_pkg',
                   'rpp_pkg',
                   'ss_pkg']
# 标签字段名
label_key = 'location'

In [2]:
# 训练集数据路径
train_csv_path = '../data/train.csv'
# 测试集数据路径
test_csv_path = '../data/test.csv'

In [3]:
train_df = pd.read_csv(train_csv_path)
test_df = pd.read_csv(test_csv_path)

# 1. 缺失数据补全（todo）

如果存在部分列数据缺失，在这一步骤进行数据的补全以便于减少这部分缺失数据的影响。

# 2. 特征处理

这一步骤针对每列特征进行处理，处理手段有很多种，例如数据分桶，数据扩展，数据统计等等，此处只对数据进行归一化放缩，使得所有特征值范围均在`0,1`之间，以便于机器学习模型学习的稳定性。

后续根据问题的深入程度不同，可以将更多的特征处理手段应用在这一步骤，来增强整体机器学习系统的表现。

## 2.1 数据归一化

根据特征值范围，将现有特征放缩到`(0,1)`之间

In [6]:
for k in feature_key_lst:
    # 数据范围上界
    ub = feature_range_dict[k][0]
    # 数据范围下界
    lb = feature_range_dict[k][1]
    
    train_df[k] = train_df[k].apply(lambda x:(x-lb)/(ub-lb))
    test_df[k] = test_df[k].apply(lambda x:(x-lb)/(ub-lb))

In [7]:
train_df

Unnamed: 0,index,q,powerloss,f,vout,iout,pa,prx,bty_temp,chnl,ce_pkg,rpp_pkg,ss_pkg,location
0,0,0.655,0.230471,0.815000,0.4375,0.428064,0.325077,0.555846,0.5664,0.33625,0.488196,0.771692,0.800235,0
1,2,0.475,0.449208,0.294375,0.5050,0.502956,0.089385,0.341538,0.7860,0.49875,0.610157,0.677385,0.031922,0
2,3,0.205,0.952435,0.506875,0.6550,0.578700,0.830308,0.555846,0.3628,0.85125,0.752588,0.545385,0.610078,1
3,4,0.560,0.065574,0.243125,0.7025,0.652048,0.751385,0.404769,0.0940,0.50250,0.879333,0.341538,0.299333,1
4,5,0.620,0.127662,0.681250,0.6800,0.987036,0.118769,0.949231,0.5948,0.20125,0.724549,0.620462,0.146510,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1595,1994,0.405,0.978006,0.770000,0.0100,0.244508,0.703846,0.040154,0.2202,0.60875,0.016353,0.303538,0.228196,0
1596,1995,0.970,0.011367,0.008750,0.3075,0.013540,0.962615,0.341692,0.6446,0.91750,0.754941,0.846923,0.749961,1
1597,1996,0.650,0.811528,0.893750,0.0725,0.160432,0.630615,0.493231,0.1374,0.26875,0.330196,0.103692,0.776667,1
1598,1998,0.030,0.780665,0.053750,0.1925,0.432216,0.292769,0.190462,0.0680,0.05375,0.527647,0.141077,0.065098,1


## 2.2 输出特征矩阵

输出训练特征矩阵分别为：`train_X`和`test_X`，测试标签为`train_y`和`test_y`。

In [8]:
train_X = train_df[feature_key_lst].to_numpy()

In [10]:
train_X.shape

(1600, 12)

In [11]:
test_X = test_df[feature_key_lst].to_numpy()

In [12]:
test_X.shape

(400, 12)

In [13]:
train_y = train_df[label_key].to_numpy()
test_y = test_df[label_key].to_numpy()

# 3. 模型学习

使用`sklearn`工具库自带的机器学习模型对训练数据进行拟合，并在测试数据上测试，评价指标为`precision`,`recall`,`f1_score`,`accuracy`。

不同模型的调参方法不同，可以根据需要进行参数调整。

## 3.0 引入评价指标

In [59]:
from sklearn.metrics import precision_score,recall_score,f1_score,accuracy_score

## 3.1 线性回归模型

In [46]:
from sklearn.linear_model import LinearRegression

In [47]:
# 设置分类阈值 超过该值的分类为1
threshold = 0.5

In [48]:
model = LinearRegression()

In [49]:
model.fit(train_X,train_y)

In [50]:
pred_y = model.predict(test_X)

In [51]:
pred_test_y = np.array([1 if i>threshold else 0 for i in pred_y])

In [60]:
p_score = precision_score(test_y, pred_test_y)
r_score = recall_score(test_y, pred_test_y)
f_score = f1_score(test_y, pred_test_y)
acc_score = accuracy_score(test_y, pred_test_y)

### 测试集结果

In [70]:
print(f'Model:{model.__str__()}, Recall:{r_score:.4}, Precision:{p_score:.4} F1:{f_score:.4} Accuracy:{acc_score:.4}')

Model:LogisticRegression(), Recall:0.6133, Precision:0.4302 F1:0.5057 Accuracy:0.4575


## 3.1 逻辑回归模型

In [62]:
from sklearn.linear_model import LogisticRegression

In [63]:
# 设置分类阈值 超过该值的分类为1
threshold = 0.5

In [64]:
model = LogisticRegression()

In [65]:
model.fit(train_X,train_y)

In [66]:
pred_y = model.predict(test_X)

In [67]:
pred_test_y = np.array([1 if i>threshold else 0 for i in pred_y])

In [68]:
p_score = precision_score(test_y, pred_test_y)
r_score = recall_score(test_y, pred_test_y)
f_score = f1_score(test_y, pred_test_y)
acc_score = accuracy_score(test_y, pred_test_y)

### 测试集结果

In [71]:
print(f'Model:{model.__str__()}, Recall:{r_score:.4}, Precision:{p_score:.4} F1:{f_score:.4} Accuracy:{acc_score:.4}')

Model:LogisticRegression(), Recall:0.6133, Precision:0.4302 F1:0.5057 Accuracy:0.4575
