### Project Description
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams! 


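### Problem Description
This project is taken from the Kaggle competition [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales). The goal is to predict Rossmann's daily sales per store. This is a supervised regression problem, so it can be approached with algorithms such as linear regression (`LinearRegression`), XGBoost, or neural networks.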
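
To make "supervised regression" concrete, here is a minimal sketch that fits ordinary least squares, the model behind `LinearRegression`, using plain NumPy. The eight rows are made-up toy values, not competition data, and `predict` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

# Made-up toy rows: columns are Promo (0/1) and DayOfWeek (1-7).
X = np.array(
    [[0, 1], [1, 1], [0, 2], [1, 2], [0, 3], [1, 3], [0, 4], [1, 4]],
    dtype=float,
)
# Made-up daily sales for those rows; this is the regression target.
y = np.array([5000, 6500, 4900, 6400, 4800, 6300, 4700, 6200], dtype=float)

# Prepend an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)


def predict(promo, day_of_week):
    """Predict sales for one (Promo, DayOfWeek) pair with the fitted weights."""
    return float(coef @ np.array([1.0, promo, day_of_week]))


print(predict(promo=1, day_of_week=5))
```

The real problem has many more features and a far less linear target, which is why tree ensembles such as XGBoost are also listed as candidates.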
  
### Downloading the Dataset
All of the data can be downloaded from [Kaggle](https://www.kaggle.com/c/rossmann-store-sales/data). Four files are provided:

- train.csv - historical data including Sales (the training set)
- test.csv - historical data excluding Sales (the test set)
- sample_submission.csv - a sample submission file in the correct format
- store.csv - supplemental information about the stores
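
Before running the notebook cells, it can help to confirm that all four files are in place. The sketch below assumes they were unpacked into a `data/` directory, matching the paths used in the loading code; `missing_files` and `EXPECTED_FILES` are hypothetical names introduced here.

```python
from pathlib import Path

# The four files listed above.
EXPECTED_FILES = ["train.csv", "test.csv", "sample_submission.csv", "store.csv"]


def missing_files(data_dir="data"):
    """Return the expected competition files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means every file was found.
print(missing_files("data"))
```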

#### Data Description

- Id - an Id that represents a (Store, Date) pair within the test set (in effect a composite key of store and date)
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month in which the nearest competitor opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - describes the consecutive intervals at which Promo2 is started, naming the months in which the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
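
As an illustration of how the `PromoInterval` string could be turned into a usable feature, the sketch below checks whether a given date falls in one of the months where a Promo2 round starts. `is_promo2_start_month` and `MONTH_NUMBERS` are hypothetical names, and the set of month abbreviations is an assumption that should be checked against the actual values in store.csv.

```python
import pandas as pd

# Assumed month abbreviations. "Sept" is accepted alongside "Sep" as a
# precaution; verify the spellings that really occur in store.csv.
MONTH_NUMBERS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Sept": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def is_promo2_start_month(promo_interval, date):
    """Return True if `date` falls in a month named in `promo_interval`."""
    if pd.isna(promo_interval):
        return False  # no interval recorded, e.g. a store without Promo2
    months = {MONTH_NUMBERS[m.strip()] for m in promo_interval.split(",")}
    return pd.Timestamp(date).month in months


print(is_promo2_start_month("Feb,May,Aug,Nov", "2015-05-14"))  # True
print(is_promo2_start_month("Feb,May,Aug,Nov", "2015-07-31"))  # False
print(is_promo2_start_month(float("nan"), "2015-07-31"))       # False
```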

With this overview of the project in place, the next step is to explore the data.

### Data Exploration

- First, merge the two feature files (train.csv and store.csv) on the `Store` key, then call `describe` and `head` to take a look at the merged data.

In [27]:

    import pandas as pd

    train_df = pd.read_csv('data/train.csv')
    store_df = pd.read_csv('data/store.csv')

    # Merge the per-store information into the training rows on the Store key
    features_df = pd.merge(train_df, store_df, on='Store')
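
The merge above uses pandas' default inner join. As a hedged variation, the sketch below adds `how="left"` and `validate="many_to_one"` so that no training row is silently dropped and a duplicated `Store` in the store table raises an error. These two arguments are not part of the original cell, and the two small frames are made-up stand-ins for train.csv and store.csv.

```python
import pandas as pd

# Made-up stand-ins for train.csv and store.csv.
train_df = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
store_df = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

features_df = pd.merge(
    train_df,
    store_df,
    on="Store",
    how="left",              # keep every training row
    validate="many_to_one",  # raise MergeError if a Store repeats in store_df
)

# A many-to-one left merge must not change the number of training rows.
assert len(features_df) == len(train_df)
print(features_df)
```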



In [28]:

    features_df.describe()

| | Store | DayOfWeek | Sales | Customers | Open | Promo | SchoolHoliday | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1017209.0 | 1017209.0 | 1017209.0 | 1017209.0 | 1017209.0 | 1017209.0 | 1017209.0 | 1014567.0 | 693861.0 | 693861.0 | 1017209.0 | 509178.0 | 509178.0 |
| mean | 558.4297 | 3.998341 | 5773.819 | 633.1459 | 0.8301067 | 0.3815145 | 0.1786467 | 5430.086 | 7.222866 | 2008.690228 | 0.5005638 | 23.269093 | 2011.752774 |
| std | 321.9087 | 1.997391 | 3849.926 | 464.4117 | 0.3755392 | 0.4857586 | 0.3830564 | 7715.324 | 3.211832 | 5.992644 | 0.4999999 | 14.095973 | 1.66287 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 1.0 | 1900.0 | 0.0 | 1.0 | 2009.0 |
| 25% | 280.0 | 2.0 | 3727.0 | 405.0 | 1.0 | 0.0 | 0.0 | 710.0 | 4.0 | 2006.0 | 0.0 | 13.0 | 2011.0 |
| 50% | 558.0 | 4.0 | 5744.0 | 609.0 | 1.0 | 0.0 | 0.0 | 2330.0 | 8.0 | 2010.0 | 1.0 | 22.0 | 2012.0 |
| 75% | 838.0 | 6.0 | 7856.0 | 837.0 | 1.0 | 1.0 | 0.0 | 6890.0 | 10.0 | 2013.0 | 1.0 | 37.0 | 2013.0 |
| max | 1115.0 | 7.0 | 41551.0 | 7388.0 | 1.0 | 1.0 | 1.0 | 75860.0 | 12.0 | 2015.0 | 1.0 | 50.0 | 2015.0 |
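
The `count` row shows that some columns are incomplete: `CompetitionDistance` has 1,014,567 values out of 1,017,209 rows, the `CompetitionOpenSince*` columns have 693,861, and the `Promo2Since*` columns have 509,178. A minimal sketch of how those gaps could be listed directly, using a made-up miniature frame in place of the real `features_df`:

```python
import numpy as np
import pandas as pd

# Made-up miniature of the merged frame.
features_df = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, 570.0, np.nan, 620.0],
    "Promo2": [0, 1, 1, 0],
    "Promo2SinceYear": [np.nan, 2010.0, 2011.0, np.nan],
})

# Number of missing values per column, showing only columns that have any.
missing = features_df.isna().sum()
print(missing[missing > 0])

# Check whether Promo2SinceYear is missing exactly where Promo2 == 0.
consistent = (
    features_df["Promo2SinceYear"].isna() == (features_df["Promo2"] == 0)
).all()
print(consistent)
```

Running the same consistency check on the real data would show whether the missing `Promo2Since*` values simply mark stores that do not participate in Promo2, which the matching `count` and `Promo2` mean in the table suggest.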


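In [29]:

    features_df.head()

| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 1 | 1 | 4 | 2015-07-30 | 5020 | 546 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 2 | 1 | 3 | 2015-07-29 | 4782 | 523 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 3 | 1 | 2 | 2015-07-28 | 5011 | 560 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 4 | 1 | 1 | 2015-07-27 | 6102 | 612 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |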
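
The `Date` column is read in as text. One possible next step, sketched here on two rows copied from the output above, is to parse it and derive calendar features. The `Year`, `Month`, and `Day` column names are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Two rows taken from the head() output above.
features_df = pd.DataFrame({
    "Store": [1, 1],
    "DayOfWeek": [5, 1],
    "Date": ["2015-07-31", "2015-07-27"],
})

# Parse the ISO-formatted strings into datetime64 values.
features_df["Date"] = pd.to_datetime(features_df["Date"], format="%Y-%m-%d")

# Derive simple calendar features (hypothetical column names).
features_df["Year"] = features_df["Date"].dt.year
features_df["Month"] = features_df["Date"].dt.month
features_df["Day"] = features_df["Date"].dt.day

# pandas counts Monday as 0, while these rows imply DayOfWeek uses 1 = Monday.
matches = (features_df["Date"].dt.dayofweek + 1 == features_df["DayOfWeek"]).all()
print(features_df)
print(matches)
```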


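In [32]:

    # Split the data into features and their corresponding labels.
    # Customers is kept separately and dropped from the features as well.
    sales_raw = features_df['Sales']
    customer_raw = features_df['Customers']
    features_raw = features_df.drop(['Sales', 'Customers'], axis=1)

    features_raw.head()

| | Store | DayOfWeek | Date | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 1 | 1 | 4 | 2015-07-30 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 2 | 1 | 3 | 2015-07-29 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 3 | 1 | 2 | 2015-07-28 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 4 | 1 | 1 | 2015-07-27 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |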
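
Because the task is to forecast six weeks ahead, one way to hold out validation data is by date rather than at random. The sketch below is an assumption about how that could be done, not a step taken in the original notebook. It uses made-up daily rows and expects `Date` to be parsed as a datetime already; `train_part` and `valid_part` are hypothetical names.

```python
import pandas as pd

# Made-up daily rows for a single store, 2015-05-01 through 2015-07-31.
features_df = pd.DataFrame({
    "Date": pd.date_range("2015-05-01", "2015-07-31", freq="D"),
})
features_df["Sales"] = range(len(features_df))  # placeholder target values

# Hold out the final six weeks as a validation set.
cutoff = features_df["Date"].max() - pd.Timedelta(weeks=6)
train_part = features_df[features_df["Date"] <= cutoff]
valid_part = features_df[features_df["Date"] > cutoff]

print(len(train_part), len(valid_part))
```

Splitting by date keeps every validation row later than every training row, which mirrors how the model would be used on the test period.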
