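### Introduction: This dataset contains Walmart sales data. Walmart operates many retail stores nationwide and struggles with inventory management, so how can supply be matched to demand? As a data scientist, you can use the data to provide useful insights and build a predictive model that forecasts sales for the next X months or years.
Variable meanings:
- Store: store number
- Date: week of sales
- Weekly_Sales: sales for the store in that week
- Holiday_Flag: whether the week is a holiday week
- Temperature: temperature on the day of sale
- Fuel_Price: cost of fuel in the region
- CPI: consumer price index
- Unemployment: unemployment rate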

In [1]:
import pandas as pd

In [2]:
original_data=pd.read_csv("walmart_stores_data.csv")

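## Assessing the data
In this section, I assess the data in the `original_data` DataFrame created in the previous section.

The assessment covers two aspects: structure and content, that is, tidiness and cleanliness. A structural problem means the data violates one of the three criteria "each column is a variable, each row is an observation, each cell is a value". Content problems include missing data, duplicate data, and invalid data.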
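To make the three tidiness criteria concrete, here is a minimal sketch of a table that violates them and one way to repair it. The `wide` DataFrame and its values are hypothetical and are not taken from the Walmart file; `melt` is one common way to reshape such data, not a step this notebook needs.

```python
import pandas as pd

# Hypothetical untidy table: the "week" variable is spread across column headers
wide = pd.DataFrame({
    "Store": [1, 2],
    "2010-02-05": [100.0, 200.0],
    "2010-02-12": [110.0, 210.0],
})

# melt() gathers the week columns into a single Date variable and a single
# Weekly_Sales variable, so each row becomes one (store, week) observation
tidy = wide.melt(id_vars="Store", var_name="Date", value_name="Weekly_Sales")
print(tidy)
```

After reshaping, each column holds one variable, which is the layout `original_data` already has.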

In [3]:
original_data.sample(10)

|  | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5575 | 39 | 19-10-2012 | 1577486.33 | 0 | 71.45 | 3.594 | 222.095172 | 6.228 |
| 169 | 2 | 06-08-2010 | 1991909.98 | 0 | 89.53 | 2.627 | 211.160805 | 8.099 |
| 5585 | 40 | 02-04-2010 | 1041202.13 | 0 | 41.39 | 2.826 | 131.901968 | 5.435 |
| 2052 | 15 | 21-01-2011 | 487311.03 | 0 | 21.84 | 3.391 | 133.028516 | 7.771 |
| 3542 | 25 | 16-03-2012 | 638204.27 | 0 | 50.64 | 3.862 | 214.016713 | 6.961 |
| 6278 | 44 | 27-07-2012 | 319855.26 | 0 | 80.42 | 3.537 | 130.719581 | 5.407 |
| 221 | 2 | 05-08-2011 | 1876704.26 | 0 | 93.34 | 3.684 | 215.197852 | 7.852 |
| 1441 | 11 | 23-04-2010 | 1283766.55 | 0 | 68.37 | 2.795 | 213.722185 | 7.343 |
| 139 | 1 | 05-10-2012 | 1670785.97 | 0 | 68.55 | 3.617 | 223.181477 | 6.573 |
| 1941 | 14 | 02-09-2011 | 1750891.47 | 0 | 70.63 | 3.703 | 186.618927 | 8.625 |


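#### The sample suggests that the data meets the three criteria "each column is a variable, each row is an observation, each cell is a value", so there are no structural problems

### Assessing data cleanliness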

In [4]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         6435 non-null   int64  
 1   Date          6435 non-null   object 
 2   Weekly_Sales  6435 non-null   float64
 3   Holiday_Flag  6435 non-null   int64  
 4   Temperature   6435 non-null   float64
 5   Fuel_Price    6435 non-null   float64
 6   CPI           6435 non-null   float64
 7   Unemployment  6435 non-null   float64
dtypes: float64(5), int64(2), object(1)
memory usage: 402.3+ KB


###### The output shows no missing values: every column has 6435 non-null entries out of 6435 rows
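The same conclusion can be checked numerically rather than read off the `info()` listing. The sketch below uses a small made-up DataFrame (`toy`) with one deliberately missing value; on the real data the call would be `original_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy data with one deliberately missing sales value (not from the Walmart file)
toy = pd.DataFrame({
    "Store": [1, 2, 3],
    "Weekly_Sales": [100.0, np.nan, 300.0],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A result of zero for every column would match the non-null counts reported by `info()`.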

###### In addition, Store should be a string, and Holiday_Flag should be a yes/no label
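The `info()` output also lists `Date` as `object`, meaning the dates are stored as text. This notebook does not convert them, but if time-based analysis were needed later, a conversion could look like the sketch below. It assumes the day-month-year layout that sampled values such as `19-10-2012` imply; the `dates` Series is a small stand-in for the real column.

```python
import pandas as pd

# A few date strings in the format seen in the sample output
dates = pd.Series(["05-02-2010", "12-02-2010", "19-10-2012"])

# An explicit format avoids day/month ambiguity ("05-02-2010" is 5 February here)
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
print(parsed.dt.month.tolist())
```

Giving the format explicitly matters because a value like `05-02-2010` could otherwise be read as 2 May.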

#### Assessing invalid or incorrect data

In [5]:
original_data.describe()

|  | Store | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 6435.0 | 6435.0 | 6435.0 | 6435.0 | 6435.0 | 6435.0 | 6435.0 |
| mean | 23.0 | 1046965.0 | 0.06993 | 60.663782 | 3.358607 | 171.578394 | 7.999151 |
| std | 12.988182 | 564366.6 | 0.255049 | 18.444933 | 0.45902 | 39.356712 | 1.875885 |
| min | 1.0 | 209986.2 | 0.0 | -2.06 | 2.472 | 126.064 | 3.879 |
| 25% | 12.0 | 553350.1 | 0.0 | 47.46 | 2.933 | 131.735 | 6.891 |
| 50% | 23.0 | 960746.0 | 0.0 | 62.67 | 3.445 | 182.616521 | 7.874 |
| 75% | 34.0 | 1420159.0 | 0.0 | 74.94 | 3.735 | 212.743293 | 8.622 |
| max | 45.0 | 3818686.0 | 1.0 | 100.14 | 4.468 | 227.232807 | 14.313 |


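#### The output shows negative Temperature values. The unit is probably Fahrenheit, in which case the values fall within a normal range, so no adjustment is needed

### Assessing duplicate data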
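The Fahrenheit guess can be sanity-checked by converting the minimum and maximum from the `describe()` output to Celsius. The helper `fahrenheit_to_celsius` is introduced here for illustration only and is not part of the original notebook.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# Minimum and maximum Temperature values taken from the describe() output
for label, temp_f in [("min", -2.06), ("max", 100.14)]:
    print(label, round(fahrenheit_to_celsius(temp_f), 2))
```

The extremes come out at roughly -18.9 °C and 37.9 °C, which is plausible weather. Read as Celsius, a maximum of 100.14 would not be.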

In [6]:
original_data.duplicated(subset=["Store","Weekly_Sales","Date"])

0       False
1       False
2       False
3       False
4       False
        ...  
6430    False
6431    False
6432    False
6433    False
6434    False
Length: 6435, dtype: bool

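#### Every row shown is False, which suggests there is no duplicate data. The display is truncated, so counting the True values (for example with `.sum()`) would confirm this across all 6435 rows

### Cleaning the data

Based on the conclusions of the assessment above, the cleaning steps are: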
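Because the printed Series is truncated, a count is a more reliable check than reading the rows by eye. The sketch below uses a made-up three-row DataFrame (`toy`) containing one deliberate duplicate; on the real data the same pattern would be applied to `original_data`.

```python
import pandas as pd

# Toy data: the second row repeats the first (not from the Walmart file)
toy = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["05-02-2010", "05-02-2010", "05-02-2010"],
    "Weekly_Sales": [100.0, 100.0, 200.0],
})

# duplicated() flags every repeat after the first occurrence;
# summing the boolean mask counts the flagged rows
dup_mask = toy.duplicated(subset=["Store", "Weekly_Sales", "Date"])
duplicate_count = int(dup_mask.sum())
print(duplicate_count)
```

A count of zero on the full dataset would confirm the conclusion for all rows, including those hidden by the `...` in the display.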
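- Convert Store to the string type
- Change the Holiday_Flag values to yes or no (`是` or `否`)

To keep the cleaned data separate from the original data, we create a new variable, `cleaned_data`, as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.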
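The reason for working on a copy can be shown with a minimal sketch. The `original` and `cleaned` names and the two-row frame are illustrative stand-ins for `original_data` and `cleaned_data`.

```python
import pandas as pd

original = pd.DataFrame({"Store": [1, 2]})

# copy() returns an independent DataFrame, so edits to it leave the source untouched
cleaned = original.copy()
cleaned["Store"] = cleaned["Store"].astype(str)

print(original["Store"].tolist(), cleaned["Store"].tolist())
```

The original frame keeps its integer values while the copy holds strings, so the raw data stays available for comparison.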

In [7]:
cleaned_data = original_data.copy()
cleaned_data.head()

|  | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 05-02-2010 | 1643690.9 | 0 | 42.31 | 2.572 | 211.096358 | 8.106 |
| 1 | 1 | 12-02-2010 | 1641957.44 | 1 | 38.51 | 2.548 | 211.24217 | 8.106 |
| 2 | 1 | 19-02-2010 | 1611968.17 | 0 | 39.93 | 2.514 | 211.289143 | 8.106 |
| 3 | 1 | 26-02-2010 | 1409727.59 | 0 | 46.63 | 2.561 | 211.319643 | 8.106 |
| 4 | 1 | 05-03-2010 | 1554806.68 | 0 | 46.5 | 2.625 | 211.350143 | 8.106 |


In [8]:
cleaned_data["Store"]=cleaned_data["Store"].astype(str)
cleaned_data["Store"]

0        1
1        1
2        1
3        1
4        1
        ..
6430    45
6431    45
6432    45
6433    45
6434    45
Name: Store, Length: 6435, dtype: object

In [9]:
cleaned_data["Holiday_Flag"]=cleaned_data["Holiday_Flag"].astype(str)

In [11]:
cleaned_data["Holiday_Flag"]=cleaned_data["Holiday_Flag"].replace("1","是")  # "是" means "yes"

In [12]:
cleaned_data["Holiday_Flag"]=cleaned_data["Holiday_Flag"].replace("0","否")  # "否" means "no"
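The string conversion and the two `replace` calls above can also be collapsed into a single mapping. This is an alternative sketch, not the notebook's own method; the `flags` Series is a small stand-in for the original integer `Holiday_Flag` column.

```python
import pandas as pd

# Stand-in for the integer Holiday_Flag column before any conversion
flags = pd.Series([0, 1, 0], name="Holiday_Flag")

# map() looks each value up in the dict: 0 -> "否" (no), 1 -> "是" (yes)
labels = flags.map({0: "否", 1: "是"})
print(labels.tolist())
```

One difference to keep in mind: `map` turns any value missing from the dict into `NaN`, whereas `replace` leaves unmatched values unchanged.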

In [13]:
cleaned_data

|  | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 05-02-2010 | 1643690.90 | 否 | 42.31 | 2.572 | 211.096358 | 8.106 |
| 1 | 1 | 12-02-2010 | 1641957.44 | 是 | 38.51 | 2.548 | 211.242170 | 8.106 |
| 2 | 1 | 19-02-2010 | 1611968.17 | 否 | 39.93 | 2.514 | 211.289143 | 8.106 |
| 3 | 1 | 26-02-2010 | 1409727.59 | 否 | 46.63 | 2.561 | 211.319643 | 8.106 |
| 4 | 1 | 05-03-2010 | 1554806.68 | 否 | 46.50 | 2.625 | 211.350143 | 8.106 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6430 | 45 | 28-09-2012 | 713173.95 | 否 | 64.88 | 3.997 | 192.013558 | 8.684 |
| 6431 | 45 | 05-10-2012 | 733455.07 | 否 | 64.89 | 3.985 | 192.170412 | 8.667 |
| 6432 | 45 | 12-10-2012 | 734464.36 | 否 | 54.47 | 4.000 | 192.327265 | 8.667 |
| 6433 | 45 | 19-10-2012 | 718125.53 | 否 | 56.47 | 3.969 | 192.330854 | 8.667 |


## Saving the cleaned data

With cleaning complete, save the clean, tidy data to a new file named `walmart_cleaned.csv`.
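The save step is described but no code is shown, so here is a minimal sketch of how it might look. The small `cleaned_data` frame is a toy stand-in, and the temporary directory is used only to keep the example self-contained. In the notebook the call would presumably be `cleaned_data.to_csv("walmart_cleaned.csv", index=False)`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned DataFrame
cleaned_data = pd.DataFrame({"Store": ["1", "2"], "Holiday_Flag": ["否", "是"]})

# index=False keeps the row index out of the file, so reloading the CSV
# does not produce an extra "Unnamed: 0" column
output_path = os.path.join(tempfile.mkdtemp(), "walmart_cleaned.csv")
cleaned_data.to_csv(output_path, index=False)

reloaded = pd.read_csv(output_path)
print(reloaded.columns.tolist())
```

CSV files do not store column types, so `Store` is read back as an integer by default. Passing `dtype={"Store": str}` to `pd.read_csv` restores the string type if it is needed downstream.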