# Project: Walmart Sales Data, Logistic Model Building and Evaluation (Machine Learning)

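## 1. Introduction

### Analysis goals

This hands-on project practises assessing structural and content problems in the data and, based on that assessment, cleaning and tidying Walmart's sales data. The resulting clean, tidy dataset is then used to explore which factors drive sales at Walmart stores in different regions. Building predictive models on top of it makes it possible to forecast sales for the coming X months or years and to make recommendations on store inventory management, so that supply matches demand more closely.

#### Meaning of each column:

- `Store`: store number
- `Date`: sales week
- `Weekly_Sales`: the store's sales for that week
- `Holiday_Flag`: whether the week is a holiday week
- `Temperature`: temperature on the day of sale
- `Fuel_Price`: cost of fuel in the region
- `CPI`: Consumer Price Index
- `Unemployment`: unemployment rate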

## 2. Loading libraries and datasets

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

%matplotlib inline

In [2]:
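plt.rcParams['font.sans-serif'] = ['SimHei'] # font that renders Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False # render the minus sign correctly
plt.rcParams['figure.figsize'] = (10,6) # default figure size

In [3]:
# Load the dataset that includes the sales band variable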
train = pd.read_csv('cleaned_data.csv', index_col=['Unnamed: 0'])
train.shape

(6435, 14)

In [4]:
train.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Quarter,TemperatureBand,CPIBand,UnemploymentBand,Fuel_PriceBand,WeeklySalesBand
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106,1,1,3,3,1,9
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106,1,1,3,3,1,9
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,1,1,3,3,1,9
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,1,1,3,3,1,8
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106,1,1,3,3,1,9


In [5]:
# Load the dataset without the sales band variable (features only)
data = pd.read_csv('LogisticR_data.csv', index_col=['Unnamed: 0'])
data.head()

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Quarter_1,Quarter_2,Quarter_3,Quarter_4
0,1,0,42.31,2.572,211.096358,8.106,True,False,False,False
1,1,1,38.51,2.548,211.24217,8.106,True,False,False,False
2,1,0,39.93,2.514,211.289143,8.106,True,False,False,False
3,1,0,46.63,2.561,211.319643,8.106,True,False,False,False
4,1,0,46.5,2.625,211.350143,8.106,True,False,False,False


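## 3. Building the models

- Before choosing a model, we need to know whether the task is supervised or unsupervised learning
- The choice of model is determined partly by the task
- Besides the task, the sample size and the sparsity of the features also guide the choice
- We always start with a simple model as the baseline, then train other models for comparison, and finally pick the one with the better generalization or performance

Here we use scikit-learn (sklearn), one of the most widely used machine learning libraries, to build the models.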
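
To make the idea of a baseline concrete, the following is a minimal sketch of a chance-level reference score. It runs on synthetic data, not on the Walmart files: `X_demo` and `y_demo` are hypothetical stand-ins with 10 equally frequent classes, and `DummyClassifier` is used here as one possible baseline, not as the method of this project.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 10 equally frequent classes labelled 1..10
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.repeat(np.arange(1, 11), 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# A baseline that ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print("Baseline accuracy: {:.2f}".format(baseline_acc))
```

With 10 balanced classes such a baseline scores about 0.10, which gives a reference point for judging the accuracy of any real model on a 10-band target.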

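### 3.1 Splitting into training and test sets

The dataset is split with the hold-out method:

- Separate the data into independent variables (features) and the dependent variable (target)
- Split into training and test sets by a fixed ratio (common test-set shares are 30%, 25%, 20%, 15% and 10%)
- Use stratified sampling
- Set a random seed so that the results can be reproduced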

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
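# Extract X and y before splitting, because the unsplit versions are sometimes needed later.
# X is the cleaned feature data; y is the target to predict, the sales band 'WeeklySalesBand'.
X = data
y = train['WeeklySalesBand']

In [8]:
# Split the dataset (test_size defaults to 0.25), stratified on y
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

In [9]:
# Check the shapes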
X_train.shape, X_test.shape

((4826, 10), (1609, 10))
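
The effect of `stratify` is easiest to see on imbalanced labels. The sketch below uses made-up data (`X_demo`, `y_demo` and the 70/20/10 class mix are assumptions for illustration only) and compares the class proportions in the two parts of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
# Imbalanced labels: 70% class 0, 20% class 1, 10% class 2
y_demo = pd.Series(np.repeat([0, 1, 2], [700, 200, 100]))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# With stratification, both parts keep (almost exactly) the original class mix
train_props = y_tr.value_counts(normalize=True).sort_index()
test_props = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_props, "test": test_props}))
```

Without `stratify`, small classes can end up over- or under-represented in the test set purely by chance, which makes the test score noisier.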

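### 3.2 Creating the models

- Create a classification model based on a linear model (logistic regression)
- Create a tree-based classification model (a random forest, which is an ensemble of decision trees)
- Train each model and record its score on the training set and on the test set
- Inspect the model parameters, change their values, and observe how the model changes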

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [11]:
# Logistic regression with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [12]:
# Score (accuracy) on the training and test sets
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.20
Testing set score: 0.18


In [13]:
# Logistic regression with adjusted parameters (C=100 means weaker regularization)
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [14]:
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))

Training set score: 0.20
Testing set score: 0.19
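
Both logistic regression fits above stop at the iteration limit, and the warning itself suggests scaling the data or raising `max_iter`. The following is a minimal sketch of that suggestion on synthetic data. The dataset, the per-column scale factors and the value `max_iter=1000` are assumptions chosen for illustration; no claim is made about the score this would reach on the Walmart features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
# Put the columns on very different scales, similar in spirit to
# CPI (around 200) sitting next to Fuel_Price (around 2.5)
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Standardize each feature, then fit logistic regression with a higher iteration cap
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

print("Training set score: {:.2f}".format(scaled_lr.score(X_tr, y_tr)))
print("Testing set score: {:.2f}".format(scaled_lr.score(X_te, y_te)))
print("Iterations used:", scaled_lr[-1].n_iter_.max())
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into the preprocessing step.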


In [15]:
# Random forest classifier with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [16]:
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

Training set score: 1.00
Testing set score: 0.71


In [33]:
# Random forest classifier with adjusted parameters: shallow trees (max_depth=5)
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)

In [34]:
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))

Training set score: 0.57
Testing set score: 0.55


In [47]:
# Random forest classifier with adjusted parameters: deeper trees (max_depth=18)
rfc3 = RandomForestClassifier(n_estimators=100, max_depth=18)
rfc3.fit(X_train, y_train)

In [48]:
print("Training set score: {:.2f}".format(rfc3.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc3.score(X_test, y_test)))

Training set score: 1.00
Testing set score: 0.72
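
The three random forests above differ mainly in `max_depth`. A compact way to study such a parameter is to sweep it in a loop and compare training and test scores side by side. The sketch below does this on synthetic data; the dataset and the depth values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# max_depth=None lets every tree grow until its leaves are pure
results = {}
for depth in (2, 5, 10, None):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print("max_depth={!s:>4}  train={:.2f}  test={:.2f}".format(depth, *results[depth]))
```

A large gap between the training and test scores, as with the 1.00 versus 0.72 above, is the usual sign of overfitting. Choosing the depth by its test score would leak test information into the model, so in practice the comparison is better done with cross-validation on the training set.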


### 3.3 Model predictions

- Output the predicted class labels
- Output the predicted probability of each class label

In [49]:
# Predicted labels (for the training set)
pred = rfc3.predict(X_train)

In [50]:
# The array holds labels from 1 to 10: the predicted sales band (level) of each sample
pred[:20]

array([ 9,  8,  7,  1,  5,  8,  3,  4,  9,  1,  4,  3,  3,  3,  1,  1,  6,
        9, 10,  8], dtype=int64)

In [51]:
# Predicted probability of each label
pred_proba = rfc3.predict_proba(X_train)
pred_proba[:3]

array([[0.00333333, 0.0025    , 0.01      , 0.00173913, 0.        ,
        0.        , 0.01395257, 0.06885705, 0.89961792, 0.        ],
       [0.        , 0.        , 0.        , 0.02      , 0.0011885 ,
        0.02945082, 0.08592908, 0.8007207 , 0.06271089, 0.        ],
       [0.01833333, 0.        , 0.        , 0.        , 0.14      ,
        0.17      , 0.66166667, 0.01      , 0.        , 0.        ]])
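
The two outputs are directly related: each row of `predict_proba` sums to 1, and the predicted label is the class with the highest probability in that row. A minimal sketch on synthetic data (the dataset and the 1-based labels are assumptions made to mirror the bands above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
y_demo = y_demo + 1  # labels 1..3, so column index and label differ by one

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
proba = forest.predict_proba(X_demo)

# Column j of proba belongs to forest.classes_[j], not to label j
labels_from_proba = forest.classes_[np.argmax(proba, axis=1)]

print(forest.classes_)
print(proba[:3].round(2))
print(labels_from_proba[:10])
```

Mapping through `classes_` matters whenever the labels do not start at 0: in the 10-band case, column 0 of the probability matrix corresponds to band 1.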

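## 4. Model evaluation

- The purpose of model evaluation is to learn how well a model generalizes
- Cross-validation is a statistical method for estimating generalization performance; it is more stable and more thorough than a single split into training and test sets
- In cross-validation the data is split several times, and several models have to be trained
- The most common variant is k-fold cross-validation, where k is chosen by the user and is usually 5 or 10
- Precision measures how many of the samples predicted as positive are truly positive
- Recall measures how many of the truly positive samples are predicted as positive
- The F-score (F1) is the harmonic mean of precision and recall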
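
The three definitions can be checked by hand on a tiny binary example. The labels below are made up for illustration; the counts are computed directly and then compared with the corresponding scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 true positives in the data, 3 samples predicted positive
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # predicted positive and truly positive
fp = np.sum((y_true == 0) & (y_pred == 1))  # predicted positive but truly negative
fn = np.sum((y_true == 1) & (y_pred == 0))  # truly positive but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("precision={:.3f} recall={:.3f} f1={:.3f}".format(precision, recall, f1))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here precision is 2/3, recall is 1/2 and F1 is 4/7. The harmonic mean stays closer to the smaller of the two values than the arithmetic mean would.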

### 4.1 Cross-validation

- Evaluate the earlier logistic regression model with 10-fold cross-validation
- Compute the mean cross-validation accuracy

In [52]:
from sklearn.model_selection import cross_val_score

In [53]:
# Cross-validation of the logistic regression model
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [54]:
# Score of each of the k folds
scores

array([0.19254658, 0.18012422, 0.17391304, 0.1863354 , 0.20289855,
       0.2173913 , 0.19502075, 0.20539419, 0.20539419, 0.18879668])

In [55]:
# Mean cross-validation score
print('Average Cross-Validation score: {:.2f}'.format(scores.mean()))

Average Cross-Validation score: 0.19


In [58]:
# Cross-validation scores of the random forest model
rfc = RandomForestClassifier(max_depth=18)
scores_rfc = cross_val_score(rfc, X_train, y_train, cv=10)
scores_rfc

array([0.68322981, 0.68322981, 0.7184265 , 0.70393375, 0.68530021,
       0.68322981, 0.69294606, 0.70539419, 0.71576763, 0.68879668])

In [59]:
# Mean cross-validation score
print('Average Cross-Validation score: {:.2f}'.format(scores_rfc.mean()))

Average Cross-Validation score: 0.70


In [63]:
# Use fewer folds
rfc = RandomForestClassifier(max_depth=18)
scores_rfc1 = cross_val_score(rfc, X_train, y_train, cv=3)
scores_rfc1

array([0.68551896, 0.68241144, 0.68967662])

In [64]:
# Mean cross-validation score
print('Average Cross-Validation score: {:.2f}'.format(scores_rfc1.mean()))

Average Cross-Validation score: 0.69


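With 10 folds the mean score (0.70) is slightly higher than with 3 folds (0.69). With more folds, each model is trained on a larger share of the data (90% instead of about 67%), so the estimate tends to be less pessimistic, at the cost of more training runs. A larger k does not by itself make the model more accurate, and a difference of 0.01 is small compared with the spread between individual folds (roughly 0.68 to 0.72 in the 10-fold run).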
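
To see what `cross_val_score` does with an integer `cv`, the following sketch reproduces it with an explicit loop. For a classifier, scikit-learn uses stratified folds without shuffling in that case. The data is synthetic and the model is a plain logistic regression, both chosen only to keep the example small and deterministic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
model = LogisticRegression(max_iter=1000)

# Manual k-fold: train on k-1 folds, score on the held-out fold, repeat k times
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("manual :", manual_scores.round(3))
print("library:", library_scores.round(3))
print("mean = {:.3f}, std = {:.3f}".format(library_scores.mean(), library_scores.std()))
```

Reporting the standard deviation next to the mean helps to judge whether a difference between two settings is larger than the fold-to-fold noise.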

### 4.2 Confusion matrix

- Compute the confusion matrix of the multiclass (10-class) problem
- Compute precision, recall and the F-score

In [65]:
from sklearn.metrics import confusion_matrix 

In [71]:
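# Train the model
rfc = RandomForestClassifier(max_depth=15)
rfc.fit(X_train, y_train)

In [72]:
# Predictions on the training set, so the metrics below describe the fit, not generalization
pred = rfc.predict(X_train)

In [73]:
# Confusion matrix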
confusion_matrix(y_train, pred)

array([[483,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  1, 477,   5,   0,   0,   0,   0,   0,   0,   0],
       [  0,   3, 477,   2,   1,   0,   0,   0,   0,   0],
       [  0,   0,   1, 476,   5,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0, 475,   8,   0,   0,   0,   0],
       [  0,   0,   0,   0,   7, 472,   3,   0,   0,   0],
       [  0,   0,   0,   1,   0,   5, 467,   9,   0,   0],
       [  0,   0,   0,   0,   0,   0,   7, 475,   1,   0],
       [  0,   0,   0,   0,   0,   0,   0,  12, 466,   4],
       [  0,   0,   0,   0,   0,   0,   0,   1,   2, 480]], dtype=int64)

In [74]:
from sklearn.metrics import classification_report

In [75]:
# Precision, recall and f1-score
print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00       483
           2       0.99      0.99      0.99       483
           3       0.99      0.99      0.99       483
           4       0.99      0.99      0.99       482
           5       0.97      0.98      0.98       483
           6       0.97      0.98      0.98       482
           7       0.98      0.97      0.97       482
           8       0.96      0.98      0.97       483
           9       0.99      0.97      0.98       482
          10       0.99      0.99      0.99       483

    accuracy                           0.98      4826
   macro avg       0.98      0.98      0.98      4826
weighted avg       0.98      0.98      0.98      4826
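
The numbers in the classification report can be derived from the confusion matrix itself. The sketch below copies the matrix printed above (rows are true bands, columns are predicted bands) and recomputes recall, precision and accuracy with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([
    [483,   0,   0,   0,   0,   0,   0,   0,   0,   0],
    [  1, 477,   5,   0,   0,   0,   0,   0,   0,   0],
    [  0,   3, 477,   2,   1,   0,   0,   0,   0,   0],
    [  0,   0,   1, 476,   5,   0,   0,   0,   0,   0],
    [  0,   0,   0,   0, 475,   8,   0,   0,   0,   0],
    [  0,   0,   0,   0,   7, 472,   3,   0,   0,   0],
    [  0,   0,   0,   1,   0,   5, 467,   9,   0,   0],
    [  0,   0,   0,   0,   0,   0,   7, 475,   1,   0],
    [  0,   0,   0,   0,   0,   0,   0,  12, 466,   4],
    [  0,   0,   0,   0,   0,   0,   0,   1,   2, 480],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of the true band (row)
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all samples predicted as the band (column)
accuracy = np.trace(cm) / cm.sum()        # all correct / all samples

for band, (p, r) in enumerate(zip(precision, recall), start=1):
    print("band {:>2}: precision={:.2f} recall={:.2f}".format(band, p, r))
print("accuracy={:.2f}".format(accuracy))
```

Almost all errors sit directly next to the diagonal, which means that misclassified weeks are nearly always assigned to a neighbouring sales band.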



### 4.3 ROC curve

- The larger the area under the ROC curve (AUC), the better

#### 4.3.1 One-hot encoding the class labels

In [80]:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
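# The bands are labelled 1 to 10, so the classes passed to label_binarize must match them
y_test_bin = label_binarize(y_test, classes=list(range(1, 11)))
n_classes = y_test_bin.shape[1] # number of classes (10 here)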

#### 4.3.2 Building the model

In [82]:
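# Wrap a random forest in a one-vs-rest classifier and fit it on the training set
from sklearn.multiclass import OneVsRestClassifier

rfc = OneVsRestClassifier(RandomForestClassifier(max_depth=15))
# clf is the fitted model
clf = rfc.fit(X=X_train, y=y_train)

# y_score holds the class probabilities; its shape is (number of test samples, number of classes).
# Here, with 1609 test samples and 10 classes, the matrix has shape (1609, 10).
# Element (i, j) is the probability that the i-th sample belongs to the j-th class.
y_score = clf.predict_proba(X_test)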
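
One common way to continue from here is to compute a one-vs-rest ROC curve and its AUC for each class. The following is a sketch on synthetic data, not the notebook's own continuation: the dataset, the three 1-based classes and the names `y_te_bin` and `roc_auc` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
y_demo = y_demo + 1  # labels 1..3, mirroring the 1-based sales bands
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

classes = [1, 2, 3]
clf_demo = OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
clf_demo.fit(X_tr, y_tr)

y_score_demo = clf_demo.predict_proba(X_te)        # shape (n_test, n_classes)
y_te_bin = label_binarize(y_te, classes=classes)   # one indicator column per class

# One ROC curve per class: that class against all the others
roc_auc = {}
for i, label in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_te_bin[:, i], y_score_demo[:, i])
    roc_auc[label] = auc(fpr, tpr)
    print("class {}: AUC = {:.3f}".format(label, roc_auc[label]))
```

The arrays `fpr` and `tpr` can be passed to `plt.plot` to draw each curve. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of that class from the rest.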

To be continued...