# Walmart Sales Data

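Introduction: This dataset contains Walmart sales data. Walmart operates many retail stores nationwide and faces inventory-management problems: how can supply be matched to demand? As a data scientist, you can use the data to provide useful insights and build predictive models that forecast sales for the next X months or years.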

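Variables:
- Store: store number
- Date: the sales week
- Weekly_Sales: the store's sales for that week
- Holiday_Flag: whether the week is a holiday week
- Temperature: temperature on the day of sale
- Fuel_Price: cost of fuel in the region
- CPI: consumer price index
- Unemployment: unemployment rate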

In [1]:
import pandas as pd

In [2]:
original_data = pd.read_csv(r"C:\Users\35731\沃尔玛销售数据清洗与分析\walmart_stores_data.csv")

In [3]:
original_data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.5,2.625,211.350143,8.106


## Assessing data tidiness

In [4]:
original_data.sample(10)

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
4432,31,26-10-2012,1340232.55,0,70.5,3.506,223.078337,6.17
6237,44,14-10-2011,293031.78,0,51.74,3.567,129.770645,6.078
5918,42,25-02-2011,526904.08,0,53.59,3.398,128.13,8.744
3248,23,20-01-2012,1146992.13,0,15.33,3.542,136.856419,4.261
1649,12,22-07-2011,922231.92,0,91.17,3.794,129.150774,13.503
3956,28,02-12-2011,1368130.35,0,52.5,3.701,129.845967,12.89
5646,40,03-06-2011,1075687.74,0,66.16,3.973,134.855161,4.781
1285,9,19-10-2012,542009.46,0,68.01,3.594,227.214288,4.954
1395,10,02-03-2012,1990371.02,0,57.62,3.882,130.645793,7.545
5975,42,30-03-2012,544408.14,0,67.92,4.294,130.967097,7.545


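The table satisfies "each column is a variable, each row is an observation, each cell holds one value". Specifically, each row describes one store in one week, and each column holds one attribute of that store-week, so there are no structural problems.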

## Assessing data cleanliness

In [5]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         6435 non-null   int64  
 1   Date          6435 non-null   object 
 2   Weekly_Sales  6435 non-null   float64
 3   Holiday_Flag  6435 non-null   int64  
 4   Temperature   6435 non-null   float64
 5   Fuel_Price    6435 non-null   float64
 6   CPI           6435 non-null   float64
 7   Unemployment  6435 non-null   float64
dtypes: float64(5), int64(2), object(1)
memory usage: 402.3+ KB


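There are 6435 rows and no missing values. The data types of Date and Holiday_Flag are wrong: Date is stored as object (text) and Holiday_Flag as int64.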
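The two findings above can also be checked programmatically instead of being read off the `info()` printout. The following is a minimal sketch on a small hand-built frame that copies the first three rows of the dataset; the name `sample` is illustrative and does not appear in the original notebook.

```python
import pandas as pd

# A tiny stand-in for original_data (first three rows, four columns).
sample = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": ["05-02-2010", "12-02-2010", "19-02-2010"],
    "Weekly_Sales": [1643690.90, 1641957.44, 1611968.17],
    "Holiday_Flag": [0, 1, 0],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Columns stored as text are candidates for a dtype fix (here: Date).
text_columns = [
    col for col in sample.columns
    if pd.api.types.is_string_dtype(sample[col])
]
print(text_columns)
```

Running the same two checks on the full `original_data` frame should agree with the `info()` output, assuming the file loads as shown above.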

In [6]:
original_data["Date"]

0       05-02-2010
1       12-02-2010
2       19-02-2010
3       26-02-2010
4       05-03-2010
           ...    
6430    28-09-2012
6431    05-10-2012
6432    12-10-2012
6433    19-10-2012
6434    26-10-2012
Name: Date, Length: 6435, dtype: object

In [7]:
original_data["Holiday_Flag"]

0       0
1       1
2       0
3       0
4       0
       ..
6430    0
6431    0
6432    0
6433    0
6434    0
Name: Holiday_Flag, Length: 6435, dtype: int64

Values in every column may legitimately repeat, so repeated values are not an error here.
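One thing that arguably should not repeat is a (Store, Date) pair, since each row is meant to describe one store in one week. The notebook does not run this check; the sketch below shows one way it could be done. The frame `sample` and its sales figures are made up for illustration, with one pair deliberately duplicated.

```python
import pandas as pd

# Hypothetical data: store 2 appears twice for the same week.
sample = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["05-02-2010", "12-02-2010", "05-02-2010", "05-02-2010"],
    "Weekly_Sales": [100.0, 110.0, 200.0, 200.0],
})

# keep=False flags every member of a duplicated (Store, Date) group.
dup_mask = sample.duplicated(subset=["Store", "Date"], keep=False)
n_duplicated_rows = int(dup_mask.sum())
print(n_duplicated_rows)
```

A result of zero on the real data would support the assumption that each store-week appears exactly once.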

Is the data consistent?

In [8]:
original_data["Holiday_Flag"].value_counts()

Holiday_Flag
0    5985
1     450
Name: count, dtype: int64

The data is consistent: Holiday_Flag takes only the values 0 and 1.
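As a sketch of how this consistency check could be automated, the snippet below tests that a flag column contains nothing outside {0, 1}. The series `flags` holds illustrative values, not the real column. The last lines cross-check the counts printed above against the `Holiday_Flag` mean that `describe()` reports.

```python
import pandas as pd

flags = pd.Series([0, 1, 0, 0, 1], name="Holiday_Flag")  # illustrative values

# Any value outside {0, 1} would indicate an inconsistent encoding.
unexpected = flags[~flags.isin([0, 1])]
is_consistent = unexpected.empty
print(is_consistent)

# Cross-check with the value_counts() output: 450 holiday rows out of
# 5985 + 450 = 6435 rows in total.
holiday_share = 450 / (5985 + 450)
print(round(holiday_share, 5))
```

The share is about 0.0699, which matches the mean of `Holiday_Flag` in the `describe()` table.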

Is the data plausible?

In [9]:
original_data.describe()

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,6435.0,6435.0,6435.0,6435.0,6435.0,6435.0,6435.0
mean,23.0,1046965.0,0.06993,60.663782,3.358607,171.578394,7.999151
std,12.988182,564366.6,0.255049,18.444933,0.45902,39.356712,1.875885
min,1.0,209986.2,0.0,-2.06,2.472,126.064,3.879
25%,12.0,553350.1,0.0,47.46,2.933,131.735,6.891
50%,23.0,960746.0,0.0,62.67,3.445,182.616521,7.874
75%,34.0,1420159.0,0.0,74.94,3.735,212.743293,8.622
max,45.0,3818686.0,1.0,100.14,4.468,227.232807,14.313


The temperature values look implausible: the maximum is 100.14.

In [10]:
original_data[original_data["Temperature"]>75]

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
15,1,21-05-2010,1399662.07,0,76.44,2.826,210.617093,7.808
16,1,28-05-2010,1432069.95,0,80.44,2.759,210.896761,7.808
17,1,04-06-2010,1615524.71,0,80.69,2.705,211.176428,7.808
18,1,11-06-2010,1542561.09,0,80.43,2.668,211.456095,7.808
19,1,18-06-2010,1503284.06,0,84.11,2.637,211.453772,7.808
...,...,...,...,...,...,...,...,...
6422,45,03-08-2012,725729.51,0,76.58,3.654,191.164090,8.684
6423,45,10-08-2012,733037.32,0,78.65,3.722,191.162613,8.684
6424,45,17-08-2012,722496.93,0,75.71,3.807,191.228492,8.684
6426,45,31-08-2012,734297.87,0,75.09,3.867,191.461281,8.684


In [11]:
original_data[original_data["Temperature"]>75]["Date"].str[3:5].astype(int).describe()

count    1598.000000
mean        7.228411
std         1.315966
min         4.000000
25%         6.000000
50%         7.000000
75%         8.000000
max        10.000000
Name: Date, dtype: float64

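The temperatures are in Fahrenheit: readings above 75 all fall in months 4 to 10 (April to October), so the data is fine.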
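To make the Fahrenheit reading concrete, the following sketch converts the minimum and maximum of the `Temperature` column (taken from the `describe()` output) to Celsius. The helper name `fahrenheit_to_celsius` is introduced here for illustration only.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Minimum and maximum of the Temperature column from describe().
min_c = fahrenheit_to_celsius(-2.06)
max_c = fahrenheit_to_celsius(100.14)
print(round(min_c, 1), round(max_c, 1))
```

The range works out to roughly -18.9 °C to 37.9 °C, which is plausible for outdoor temperatures across a year.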

## Cleaning the data

Only Holiday_Flag and Date have incorrect data types.

In [12]:
cleaned_data = original_data.copy()

In [13]:
cleaned_data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.5,2.625,211.350143,8.106


In [14]:
cleaned_data["Holiday_Flag"] = cleaned_data["Holiday_Flag"].astype(bool)

In [15]:
cleaned_data["Date"] = pd.to_datetime(cleaned_data["Date"], format="%d-%m-%Y")

In [16]:
cleaned_data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,2010-02-05,1643690.9,False,42.31,2.572,211.096358,8.106
1,1,2010-02-12,1641957.44,True,38.51,2.548,211.24217,8.106
2,1,2010-02-19,1611968.17,False,39.93,2.514,211.289143,8.106
3,1,2010-02-26,1409727.59,False,46.63,2.561,211.319643,8.106
4,1,2010-03-05,1554806.68,False,46.5,2.625,211.350143,8.106
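The explicit `format="%d-%m-%Y"` matters because the raw strings are day-first: `05-02-2010` is 5 February 2010, not 2 May 2010. A minimal sketch on two of the raw values, plus the flag conversion, is shown below; the variable names are illustrative.

```python
import pandas as pd

raw = pd.Series(["05-02-2010", "12-02-2010"], name="Date")

# Explicit day-month-year format, as used in the notebook.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")
parsed_iso = parsed.dt.strftime("%Y-%m-%d").tolist()

# The first value must land in February (month 2), not May.
first_month = int(parsed.dt.month.iloc[0])
print(parsed_iso, first_month)

# Integer 0/1 flags become False/True after astype(bool).
flags = pd.Series([0, 1, 0]).astype(bool)
print(flags.tolist())
```

Checking the month of a known row in this way is a quick guard against day and month being swapped silently.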


## Saving the data

In [17]:
cleaned_data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,2010-02-05,1643690.9,False,42.31,2.572,211.096358,8.106
1,1,2010-02-12,1641957.44,True,38.51,2.548,211.24217,8.106
2,1,2010-02-19,1611968.17,False,39.93,2.514,211.289143,8.106
3,1,2010-02-26,1409727.59,False,46.63,2.561,211.319643,8.106
4,1,2010-03-05,1554806.68,False,46.5,2.625,211.350143,8.106


In [18]:
cleaned_data.to_csv(r"walmart_cleaned_data.csv", index=False)

In [19]:
pd.read_csv(r"walmart_cleaned_data.csv")

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,2010-02-05,1643690.90,False,42.31,2.572,211.096358,8.106
1,1,2010-02-12,1641957.44,True,38.51,2.548,211.242170,8.106
2,1,2010-02-19,1611968.17,False,39.93,2.514,211.289143,8.106
3,1,2010-02-26,1409727.59,False,46.63,2.561,211.319643,8.106
4,1,2010-03-05,1554806.68,False,46.50,2.625,211.350143,8.106
...,...,...,...,...,...,...,...,...
6430,45,2012-09-28,713173.95,False,64.88,3.997,192.013558,8.684
6431,45,2012-10-05,733455.07,False,64.89,3.985,192.170412,8.667
6432,45,2012-10-12,734464.36,False,54.47,4.000,192.327265,8.667
6433,45,2012-10-19,718125.53,False,56.47,3.969,192.330854,8.667
