# Project: Walmart Sales Data, Model Building and Evaluation

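## 1. Introduction

### Analysis Goal

The goal of this hands-on project is to practice assessing structural and content problems in data and, based on that assessment, to clean and tidy Walmart's sales data. The resulting clean, tidy data is then used to explore the factors that influence sales at Walmart stores in different regions. A predictive model built on top of it can forecast sales for the next X months/years and support recommendations on store inventory management, so that supply matches demand more closely.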

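#### The columns in the data have the following meanings:


- `Store`: store number
- `Date`: week of sales
- `Weekly_Sales`: sales of the store in that week
- `Holiday_Flag`: whether the week is a holiday week
- `Temperature`: temperature on the day of sale
- `Fuel_Price`: cost of fuel in the region
- `CPI`: consumer price index
- `Unemployment`: unemployment rate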

## 2. Loading Libraries and the Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

%matplotlib inline

In [2]:
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
plt.rcParams['figure.figsize'] = (10, 6)  # default size of output figures

In [3]:
# Load the dataset that includes the binned sales variable
train = pd.read_csv('cleaned_data.csv', index_col=['Unnamed: 0'])
train.shape

(6435, 14)

In [4]:
train.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Quarter,TemperatureBand,CPIBand,UnemploymentBand,Fuel_PriceBand,WeeklySalesBand
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106,1,1,3,3,1,9
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106,1,1,3,3,1,9
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,1,1,3,3,1,9
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,1,1,3,3,1,8
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106,1,1,3,3,1,9


In [5]:
# Load the dataset without the binned sales variable
data = pd.read_csv('clear_cleaned_data.csv', index_col=['Unnamed: 0'])
data.head()

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Quarter
0,1,0,42.31,2.572,211.096358,8.106,1
1,1,1,38.51,2.548,211.24217,8.106,1
2,1,0,39.93,2.514,211.289143,8.106,1
3,1,0,46.63,2.561,211.319643,8.106,1
4,1,0,46.5,2.625,211.350143,8.106,1


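## 3. Model Building

- Before choosing a model, we need to know whether the task is supervised or unsupervised learning  
- The choice of model is determined partly by the task  
- Beyond the task, the sample size and the sparsity of the features also guide the choice  
- We usually start with a simple model as a baseline, then train other models for comparison, and finally pick the one with the better generalization or performance  
  
Here we use scikit-learn (sklearn), one of the most widely used machine learning libraries, to build the models.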

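### 3.1 Splitting into Training and Test Sets

Here we split the dataset with the hold-out method:  
  
- Separate the dataset into independent variables (features) and the dependent variable (target)  
- Split into training and test sets by proportion (common test-set shares are 30%, 25%, 20%, 15%, and 10%)  
- Use stratified sampling  
- Set a random seed so the results are reproducible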
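
To illustrate what stratified sampling changes, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the 60/30/10 class mix are assumptions made up for illustration, not the Walmart data. It passes `test_size` explicitly and prints the class shares before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 1000 rows, 3 imbalanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.repeat([1, 2, 3], [600, 300, 100])

# Hold out 25% of the rows; stratify=y_demo keeps the class mix in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Print the share of each class in the full, training, and test sets.
for name, labels in [("full", y_demo), ("train", y_tr), ("test", y_te)]:
    classes, counts = np.unique(labels, return_counts=True)
    shares = (counts / counts.sum()).round(2)
    print(name, dict(zip(classes.tolist(), shares.tolist())))
```

Without `stratify`, a rare class could by chance end up over- or under-represented in the test set, which makes the test score less representative.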

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
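# Usually X and y are extracted first and split afterwards, because the unsplit X and y are sometimes needed later.
# X is the cleaned feature data; y is the target to predict, the sales band 'WeeklySalesBand'.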
X = data
y = train['WeeklySalesBand']

In [9]:
# Split the dataset (test_size is left at its default of 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

In [10]:
# Check the shapes of the resulting sets
X_train.shape, X_test.shape

((4826, 7), (1609, 7))

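### 3.2 Creating the Models

- Create a classifier based on a linear model (logistic regression)
- Create classifiers based on trees (decision tree, random forest)
- Train each model and get its scores on the training and test sets
- Inspect the model parameters, change their values, and observe how the model changes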

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [13]:
# Logistic regression with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [14]:
# Check the scores on the training and test sets
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.19
Testing set score: 0.16


In [33]:
# Logistic regression with adjusted parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [34]:
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))

Training set score: 0.19
Testing set score: 0.16
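
The convergence warning above suggests two remedies: raising `max_iter` or scaling the data. The following is a minimal sketch of both, run on synthetic data rather than the Walmart features. The dataset, the per-column scale factors, and the name `scaled_lr` are assumptions for illustration, so the scores it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Walmart dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
# Put the columns on very different scales, as happens with raw features.
X_demo = X_demo * [1, 1, 100, 1, 1000, 1, 10]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Standardize each feature, then fit logistic regression with a higher
# iteration cap than the default.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

print("Training set score: {:.2f}".format(scaled_lr.score(X_tr, y_tr)))
print("Testing set score: {:.2f}".format(scaled_lr.score(X_te, y_te)))
```

Putting the scaler inside a pipeline means it is fitted on the training data only, so no information from the test set leaks into the preprocessing step.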


In [17]:
# Random forest classifier with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [18]:
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

Training set score: 1.00
Testing set score: 0.71


In [36]:
# Random forest classifier with adjusted parameters (shallow trees, max_depth=5)
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)

In [37]:
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))

Training set score: 0.55
Testing set score: 0.53


In [40]:
# Random forest classifier with adjusted parameters (deeper trees, max_depth=20)
rfc3 = RandomForestClassifier(n_estimators=100, max_depth=20)
rfc3.fit(X_train, y_train)

In [41]:
print("Training set score: {:.2f}".format(rfc3.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc3.score(X_test, y_test)))

Training set score: 1.00
Testing set score: 0.72


### 3.3 Outputting Model Predictions

- Output the predicted class labels
- Output the predicted probability of each class label

In [42]:
# Predicted labels (here for the training set)
pred = rfc3.predict(X_train)

In [44]:
# The array holds values from 1 to 10: the predicted sales band (level) of each sample
pred[:20]

array([ 9,  8,  7,  1,  5,  8,  3,  4,  9,  1,  4,  3,  3,  3,  1,  1,  6,
        9, 10,  8], dtype=int64)

In [47]:
# Predicted probability of each label
pred_proba = rfc3.predict_proba(X_train)
pred_proba[:3]

array([[0.00579365, 0.01333333, 0.        , 0.        , 0.00111111,
        0.        , 0.01850649, 0.05936508, 0.90189033, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.0465    , 0.85433333, 0.09916667, 0.        ],
       [0.03      , 0.        , 0.        , 0.        , 0.05      ,
        0.205     , 0.675     , 0.04      , 0.        , 0.        ]])
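
The two outputs are linked: for a scikit-learn random forest, the predicted label is the class with the highest averaged probability. The sketch below checks this on the three probability rows printed above. It assumes the ten columns correspond to sales bands 1 to 10 in sorted order, which is how a fitted classifier orders its `classes_` attribute.

```python
import numpy as np

# Class labels in sorted order. Assumed here to be the ten sales bands 1..10.
classes = np.arange(1, 11)

# The three probability rows printed above (one row per sample).
proba = np.array([
    [0.00579365, 0.01333333, 0.0, 0.0, 0.00111111,
     0.0, 0.01850649, 0.05936508, 0.90189033, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0,
     0.0, 0.0465, 0.85433333, 0.09916667, 0.0],
    [0.03, 0.0, 0.0, 0.0, 0.05,
     0.205, 0.675, 0.04, 0.0, 0.0],
])

# Each row is a probability distribution over the classes.
print(proba.sum(axis=1))

# The column with the largest probability gives the predicted label.
labels = classes[proba.argmax(axis=1)]
print(labels)  # [9 8 7], matching the first three entries of pred
```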

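## 4. Model Evaluation

- The purpose of model evaluation is to find out how well the model generalizes
- Cross-validation is a statistical method for evaluating generalization performance; it is more stable and thorough than a single split into training and test sets
- In cross-validation, the data is split multiple times and multiple models are trained
- The most commonly used form is k-fold cross-validation, where k is a user-specified number, usually 5 or 10
- Precision measures how many of the samples predicted as positive are truly positive
- Recall measures how many of the positive samples are predicted as positive
- The F-score is the harmonic mean of precision and recall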
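
The definitions of precision, recall, and the F-score can be worked through by hand on a small example. The labels below are invented for illustration (1 is the positive class); the code counts true positives, false positives, and false negatives and applies the three formulas directly.

```python
# Toy binary labels (an assumption for illustration): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted positive, truly positive
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted positive, truly negative
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted negative, truly positive

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("precision = {:.2f}".format(precision))  # 0.60
print("recall    = {:.2f}".format(recall))     # 0.75
print("f1        = {:.2f}".format(f1))         # 0.67
```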

### 4.1 Cross-Validation
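
A minimal sketch of k-fold cross-validation with scikit-learn's `cross_val_score`, assuming k = 5 and a random forest similar to the one trained earlier. It runs on synthetic data (`X_demo` and `y_demo` are stand-ins, not the Walmart dataset), so the scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the Walmart dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)

# cv=5 trains and scores five models, each on a different train/validation
# split; for classifiers an integer cv uses stratified folds.
scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("Fold scores:", scores.round(2))
print("Mean score: {:.2f}".format(scores.mean()))
```

The spread of the fold scores gives a rough sense of how much a single train/test split could mislead.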

### 4.2 Confusion Matrix
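
A confusion matrix counts, for every true class, how often each class was predicted. The sketch below uses scikit-learn's `confusion_matrix` on ten hand-made labels with three classes; the labels are assumptions for illustration, not model output from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for three classes (an assumption for illustration).
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 3, 3, 1]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)
# [[2 1 0]
#  [0 2 1]
#  [1 0 3]]

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = cm.trace() / cm.sum()
print("accuracy = {:.2f}".format(accuracy))  # 0.70
```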

### 4.3 ROC Curve
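
A ROC curve plots the true positive rate against the false positive rate as the decision threshold varies, and is defined for a binary problem. The sketch below uses scikit-learn's `roc_curve` and `roc_auc_score` on four hand-made scores; the labels and scores are assumptions for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hand-made binary labels and predicted scores (an assumption for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold; the curve runs from (0, 0) to (1, 1).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print("fpr = {:.2f}, tpr = {:.2f}".format(f, t))

# Area under the curve: 3 of the 4 (positive, negative) pairs are ranked
# correctly, so the AUC is 0.75.
auc = roc_auc_score(y_true, y_score)
print("AUC = {:.2f}".format(auc))
```

For a multiclass target such as the ten sales bands, one option is a one-vs-rest treatment, for example `roc_auc_score(y_true, proba, multi_class="ovr")` applied to the output of `predict_proba`.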