# Week 5 - Machine Learning - Classification Algorithms
- sklearn Chinese community: https://scikit-learn.org.cn/
- sklearn.linear_model.LinearRegression: https://scikit-learn.org.cn/view/394.html
- Official English site: https://scikit-learn.org/stable/

## Lab objectives
### 1. Master the logistic regression algorithm
### 2. Master the evaluation metrics for classification tasks

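## Case 1: Logistic Regression (Smart Home)
- Scenario: Xiao Ming wants to monitor a room with sensors and, from the indoor environmental readings, determine whether the room is occupied.
- The dataset contains temperature, humidity, light, and CO2 readings, plus the room occupancy label (Bool: 0 means unoccupied, 1 means occupied).

- Logistic regression classifies a sample by the probability that it belongs to a class (the output of the sigmoid function): **if the probability is ≥ 0.5, the sample is assigned to class 1; if it is < 0.5, to class 0**.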
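
The decision rule can be illustrated with a few lines of NumPy. This is a minimal sketch, not the scikit-learn implementation: the scores in `z` are made-up values standing in for the linear combination of features, and the helper name `sigmoid` is chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for four samples (illustrative values only)
z = np.array([-2.0, -0.1, 0.0, 3.0])
proba = sigmoid(z)

# Decision rule from the text: probability >= 0.5 -> class 1, otherwise class 0
labels = (proba >= 0.5).astype(int)
print(proba)
print(labels)
```

A score of exactly 0 maps to a probability of 0.5, so under this rule it falls into class 1.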

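### Lab steps
1. Load the data
1. Basic data processing (already done for this case; skipped)
1. Split the dataset
1. Standardize the data
1. Build the model
1. Model prediction and evaluation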

### 1. Load the data

In [2]:
import pandas as pd  # data loading and manipulation
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.linear_model import LogisticRegression  # logistic regression

In [3]:
# Load the data from './data/datatest.txt'
data = pd.read_csv('./data/datatest.txt')
data.head()

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
140,2015-02-02 14:19:00,23.7,26.272,585.2,749.2,0.004764,1
141,2015-02-02 14:19:59,23.718,26.29,578.4,760.4,0.004773,1
142,2015-02-02 14:21:00,23.73,26.23,572.666667,769.666667,0.004765,1
143,2015-02-02 14:22:00,23.7225,26.125,493.75,774.75,0.004744,1
144,2015-02-02 14:23:00,23.754,26.2,488.6,779.0,0.004767,1
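
The unnamed first column in the output is the row index. A plausible explanation, assuming the file's header names seven columns while each data row carries eight fields, is that `pd.read_csv` then treats the first field as the index. The sketch below reproduces that behavior on two rows copied from the output above; the in-memory text and the name `sample` are stand-ins, not the real file.

```python
import io
import pandas as pd

# Two rows in the assumed layout of datatest.txt:
# 7 column names in the header, 8 fields per data row.
raw = io.StringIO(
    "date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy\n"
    "140,2015-02-02 14:19:00,23.7,26.272,585.2,749.2,0.004764,1\n"
    "141,2015-02-02 14:19:59,23.718,26.29,578.4,760.4,0.004773,1\n"
)
sample = pd.read_csv(raw)  # the extra leading field becomes the row index

# The date column is read as plain strings (dtype object); convert it if needed
sample["date"] = pd.to_datetime(sample["date"])

print(sample.index.tolist())
print(sample.dtypes)
```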


In [4]:
# Basic information about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2665 entries, 140 to 2804
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           2665 non-null   object 
 1   Temperature    2665 non-null   float64
 2   Humidity       2665 non-null   float64
 3   Light          2665 non-null   float64
 4   CO2            2665 non-null   float64
 5   HumidityRatio  2665 non-null   float64
 6   Occupancy      2665 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 166.6+ KB


### 2. Basic data processing (already done for this case; skipped)
### 3. Split the dataset

In [5]:
# Split into features X and target y
# Equivalent positional form: X = data.iloc[:, 1:6]
X = data.drop(['date', 'Occupancy'], axis=1)
X.head()

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio
140,23.7,26.272,585.2,749.2,0.004764
141,23.718,26.29,578.4,760.4,0.004773
142,23.73,26.23,572.666667,769.666667,0.004765
143,23.7225,26.125,493.75,774.75,0.004744
144,23.754,26.2,488.6,779.0,0.004767


In [6]:
# Check the class distribution of y
data['Occupancy'].value_counts()

0    1693
1     972
Name: Occupancy, dtype: int64

In [7]:
# y: the last column, Occupancy
y = data['Occupancy']
y.head()

140    1
141    1
142    1
143    1
144    1
Name: Occupancy, dtype: int64

In [8]:
# Split into training and test sets (75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=666)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1998, 5), (667, 5), (1998,), (667,))
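
Because the two classes are not balanced (1693 vs. 972), one optional variation is to pass `stratify=y` so that both splits keep roughly the same class ratio. The sketch below is an illustration on synthetic labels with the same class counts; `X_demo` is a placeholder feature, not the sensor data, and this is not what the notebook itself does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the Occupancy column
y_demo = np.array([0] * 1693 + [1] * 972)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=666, stratify=y_demo
)

# With stratify, the share of class 1 is nearly identical in every subset
print("positive share overall:", y_demo.mean())
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```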

### 4. Standardize the data

In [9]:
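# Standardize the features with z-score scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  # create the scaler
# sc.fit(X_train)  # fit: compute the mean and standard deviation of X_train
X_train_std = sc.fit_transform(X_train)  # fit on the training set, then transform it
X_test_std = sc.transform(X_test)  # transform the test set with the training statistics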

In [10]:
# Inspect one row of the transformed data
X_train_std[3]

array([-0.17919888,  1.00638653, -0.77409467,  0.19776804,  0.51200572])

In [11]:
# Finally, after preprocessing, confirm the array shapes once more
X_train_std.shape, X_test_std.shape, y_train.shape, y_test.shape

((1998, 5), (667, 5), (1998,), (667,))
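
To make the z-score step concrete, the sketch below checks on small made-up arrays that `StandardScaler` computes `(x - mean) / std` using the training-set statistics, and that the test set is scaled with those same statistics. The arrays and names ending in `_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrices standing in for X_train and X_test
X_train_demo = np.array([[20.0, 400.0], [22.0, 600.0], [24.0, 800.0]])
X_test_demo = np.array([[23.0, 500.0]])

sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)  # learns mean/std from the training data
X_test_scaled = sc_demo.transform(X_test_demo)        # reuses the training mean/std

# Manual z-score with the training statistics (population std, ddof=0)
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
manual_test = (X_test_demo - mean) / std

print(X_test_scaled)
print(manual_test)
```

Fitting the scaler only on the training set keeps information about the test set out of the preprocessing step.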

### 5. Build the model
- Logistic regression: https://scikit-learn.org.cn/view/4.html#1.1.11%20Logistic%E5%9B%9E%E5%BD%92
- Official English documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [12]:
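# Logistic regression

# Build a classifier with the default parameters
lgr = LogisticRegression()

# Train the model
lgr.fit(X_train_std, y_train)

# Learned parameters
print('Logistic regression weights w and intercept b:', lgr.coef_, lgr.intercept_)

Logistic regression weights w and intercept b: [[-1.07021251  1.23027738  4.68031157 -0.28156325  0.92523067]] [-2.26199545]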


### Model: $y = \mathrm{sigmoid}(w_1 x_1 + w_2 x_2 + \dots + b)$
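
The formula can be checked against a fitted model. The sketch below assumes synthetic data from `make_classification` (not the occupancy dataset) and rebuilds the class-1 probability by hand from `coef_` and `intercept_`, then compares it with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data with 5 features, mirroring the shape of the case study
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w1*x1 + ... + w5*x5 + b, followed by the sigmoid
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = clf.predict_proba(X_demo)[:, 1]  # column 1 holds the probability of class 1
manual_labels = (p_manual >= 0.5).astype(int)

print(np.allclose(p_manual, p_sklearn))
print((manual_labels == clf.predict(X_demo)).mean())
```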

### 6. Model prediction and evaluation

![](img/2.png)

#### sklearn tools used for model evaluation:
- Confusion matrix (confusion_matrix): https://scikit-learn.org.cn/view/485.html
- Classification report (classification_report): https://scikit-learn.org.cn/view/482.html

In [13]:
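# Imports
from sklearn.metrics import confusion_matrix  # confusion matrix
from sklearn.metrics import classification_report  # classification report: precision, recall, and F1 score

# Performance on the training set
y_train_pred = lgr.predict(X_train_std)

print('Confusion matrix on the training set:\n', confusion_matrix(y_train, y_train_pred))  # arguments: true y, predicted y
print('\nClassification report on the training set:\n', classification_report(y_train, y_train_pred))

Confusion matrix on the training set:
 [[1225   40]
 [   2  731]]

Classification report on the training set: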
               precision    recall  f1-score   support

           0       1.00      0.97      0.98      1265
           1       0.95      1.00      0.97       733

    accuracy                           0.98      1998
   macro avg       0.97      0.98      0.98      1998
weighted avg       0.98      0.98      0.98      1998



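**Notes on the classification report above:**

| Confusion matrix | Predicted 0 | Predicted 1 |
| ---- | ---- | ---- |
| Actual 0 | 1225 | 40 |
| Actual 1 | 2 | 731 |

- Precision (class 1): 0.95 = 731/(731+40)
- Recall (class 1): 1.00 = 731/(731+2)
- Support: the actual number of samples in class 0 and class 1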

In [14]:
731/(731+40), 731/(731+2)

(0.9481193255512321, 0.9972714870395635)
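
The same arithmetic extends to both classes and to the F1 score and accuracy. The sketch below recomputes every figure of the training-set report directly from the confusion matrix; the variable names are chosen here for illustration.

```python
import numpy as np

# Training-set confusion matrix from the report: rows = actual class, columns = predicted class
cm = np.array([[1225, 40],
               [2, 731]])

precision = cm.diagonal() / cm.sum(axis=0)  # correct / all samples predicted as that class
recall = cm.diagonal() / cm.sum(axis=1)     # correct / all samples actually in that class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)                    # actual number of samples per class
accuracy = cm.diagonal().sum() / cm.sum()

for cls in (0, 1):
    print(cls, round(precision[cls], 2), round(recall[cls], 2), round(f1[cls], 2), support[cls])
print("accuracy", round(accuracy, 2))
```

Rounded to two decimals, these values agree with the rows of the classification report.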

In [15]:
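# Performance on the test set
y_test_pred = lgr.predict(X_test_std)
print('Confusion matrix on the test set:\n', confusion_matrix(y_test, y_test_pred))
print('\nClassification report on the test set:\n', classification_report(y_test, y_test_pred))

Confusion matrix on the test set:
 [[414  14]
 [  1 238]]

Classification report on the test set: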
               precision    recall  f1-score   support

           0       1.00      0.97      0.98       428
           1       0.94      1.00      0.97       239

    accuracy                           0.98       667
   macro avg       0.97      0.98      0.98       667
weighted avg       0.98      0.98      0.98       667



---