# Background

## Description/Evaluation

Competition link: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview

We are asking you to predict total sales for every product and store in the next month.

Evaluation: Submissions are evaluated by root mean squared error (RMSE). True target values are clipped into the [0, 20] range.
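As a rough illustration of the metric, here is a minimal sketch of RMSE with clipping. The function name `clipped_rmse` is hypothetical. Clipping the predictions as well as the targets is an assumption on my part: the description above only says that the true values are clipped.

```python
import numpy as np


def clipped_rmse(y_true, y_pred, lower=0.0, upper=20.0):
    """RMSE after clipping targets into [lower, upper].

    Assumption: predictions are clipped too, since a prediction outside
    the range can only increase the error against a clipped target.
    """
    y_true = np.clip(np.asarray(y_true, dtype=float), lower, upper)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lower, upper)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values: the target 50 and the prediction 25 both clip to 20.
score = clipped_rmse([0, 3, 50], [1, 3, 25])
```

Only the first pair contributes an error here, so the score is the square root of 1/3.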

Submission File: For each id in the test set, you must predict a total number of sales.

## Data

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

File descriptions

- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
- sample_submission.csv - a sample submission file in the correct format.
- items.csv - supplemental information about the items/products.
- item_categories.csv - supplemental information about the item categories.
- shops.csv - supplemental information about the shops.

Data fields

- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd/mm/yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category
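Since `item_cnt_day` is daily but the target is monthly, the daily rows have to be summed per month, shop, and item. The sketch below shows one way to do that with pandas on a few made-up rows. The column name `item_cnt_month` is hypothetical, and clipping the result into [0, 20] is an assumption that mirrors the evaluation rule.

```python
import pandas as pd

# Toy rows that mimic the sales_train.csv columns (values are made up).
daily = pd.DataFrame(
    {
        "date_block_num": [0, 0, 0, 1, 1],
        "shop_id": [5, 5, 6, 5, 5],
        "item_id": [10, 10, 10, 10, 11],
        "item_cnt_day": [1.0, 2.0, 1.0, 3.0, -1.0],
    }
)

# Sum the daily counts for every (month, shop, item) combination.
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# Assumption: clip the monthly target into the evaluation range [0, 20].
monthly["item_cnt_month"] = monthly["item_cnt_month"].clip(0, 20)
```

Negative daily counts (the `-1.0` above) are kept in the sum and only removed by the final clip, so a month with more returns than sales ends up at zero.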

## Assessment

- For this assessment you will have 2 days to submit.
- You are required to use Python for this assessment.
- You are required to include an EDA, with a link to the dataset/s you used.
- Get as far as you can in the 2 days and submit what you have. This is to see where your skills are strongest and weakest, so we can put together a well-balanced team.

Please fulfill the following instructions:

- Find a data set on Kaggle … any data set of interest to you (You are welcome to use more than one dataset if you find that one or more sets might add value to the insights of the original one you chose)
- Go through your data cleaning and data exploration as per normal
- Build an XGBoost regression model for your data set
- In a way that is comfortable to you, do a 3-step forecast (depending on the data you chose, this will be 3 days, 3 hours etc.)
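One common way to get a 3-step forecast from a one-step regression model is recursive forecasting: predict one step, append the prediction to the history, and repeat. The sketch below is only an illustration of that idea, not necessarily the approach used later. `recursive_forecast` and `LastValuesMean` are hypothetical names, and the stand-in model simply averages its lag features so the example runs without extra dependencies. Any fitted regressor with a scikit-learn style `predict` method (for example `xgboost.XGBRegressor`) could be passed instead.

```python
import numpy as np


class LastValuesMean:
    """Hypothetical stand-in model: predicts the mean of each row of lag features."""

    def predict(self, X):
        return np.asarray(X, dtype=float).mean(axis=1)


def recursive_forecast(model, history, n_lags, steps=3):
    """Forecast `steps` values ahead, feeding each prediction back in as a lag."""
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(steps):
        # The most recent n_lags values form a single feature row.
        features = np.array(window[-n_lags:], dtype=float).reshape(1, -1)
        next_value = float(model.predict(features)[0])
        forecasts.append(next_value)
        window.append(next_value)
    return forecasts


forecast = recursive_forecast(LastValuesMean(), history=[2.0, 4.0, 6.0], n_lags=2, steps=3)
```

Because each step consumes earlier predictions, errors can compound over the horizon. Training a separate model per step is an alternative if that becomes a problem.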
 
Submission:

- Please upload all your code, documentation, and data sets to your GitHub profile.
- Email your GitHub link to us.
- Include a readme file in your repository.

# EDA

Files and notes:

- sales_train.csv - training data with the features date, date block number, shop ID, item ID, item price, and item count per day.
- items.csv - maps item names to item IDs and category IDs.
- item_categories.csv - maps item category names to item category IDs. The names are in Russian and probably unimportant.
- shops.csv - maps shop IDs to shop names. Some shop names look similar (different branches of the same company?), so this could feed some feature engineering if time allows.
- test.csv - the (shop, item) pairs for which total sales in the next month must be predicted. We'll ignore this for now because we want to do forecasting, but we can make a submission at the end.
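As a possible starting point for the shop-name feature engineering mentioned above, the sketch below groups shops by the first word of `shop_name`. It assumes that the first word identifies related shops, which is a guess based on the sample names rather than something the data description states. The column name `shop_group` is hypothetical.

```python
import pandas as pd

# A few shop names copied from shops.csv.
shops = pd.DataFrame(
    {
        "shop_name": [
            "!Якутск Орджоникидзе, 56 фран",
            '!Якутск ТЦ "Центральный" фран',
            'Адыгея ТЦ "Мега"',
        ],
        "shop_id": [0, 1, 2],
    }
)

# Assumption: the first word (minus a leading "!") identifies related shops.
shops["shop_group"] = shops["shop_name"].str.lstrip("!").str.split(" ").str[0]

# Integer codes are easier to feed to a tree-based model than raw strings.
shops["shop_group_code"] = shops["shop_group"].astype("category").cat.codes
```

With these three rows, shops 0 and 1 fall into the same group and shop 2 into another.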


# Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [5]:
df = pd.read_csv("sales_train.csv")

In [27]:
df.head()


In [28]:
df.describe()

| | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| count | 2935849.0 | 2935849.0 | 2935849.0 | 2935849.0 | 2935849.0 |
| mean | 14.56991 | 33.00173 | 10197.23 | 890.8532 | 1.242641 |
| std | 9.422988 | 16.22697 | 6324.297 | 1729.8 | 2.618834 |
| min | 0.0 | 0.0 | 0.0 | -1.0 | -22.0 |
| 25% | 7.0 | 22.0 | 4476.0 | 249.0 | 1.0 |
| 50% | 14.0 | 31.0 | 9343.0 | 399.0 | 1.0 |
| 75% | 23.0 | 47.0 | 15684.0 | 999.0 | 1.0 |
| max | 33.0 | 59.0 | 22169.0 | 307980.0 | 2169.0 |


In [34]:
# Auxiliary data: lookup tables for shops, items, and item categories
shops = pd.read_csv("shops.csv")
items = pd.read_csv("items.csv")
item_categories = pd.read_csv("item_categories.csv")

# Preview the first rows of each lookup table
for dataframe in [shops, items, item_categories]:
    display(dataframe.head())


| | shop_name | shop_id |
|---|---|---|
| 0 | !Якутск Орджоникидзе, 56 фран | 0 |
| 1 | !Якутск ТЦ "Центральный" фран | 1 |
| 2 | Адыгея ТЦ "Мега" | 2 |
| 3 | Балашиха ТРК "Октябрь-Киномир" | 3 |
| 4 | Волжский ТЦ "Волга Молл" | 4 |


| | item_name | item_id | item_category_id |
|---|---|---|---|
| 0 | ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D | 0 | 40 |
| 1 | !ABBYY FineReader 12 Professional Edition Full... | 1 | 76 |
| 2 | ***В ЛУЧАХ СЛАВЫ (UNV) D | 2 | 40 |
| 3 | ***ГОЛУБАЯ ВОЛНА (Univ) D | 3 | 40 |
| 4 | ***КОРОБКА (СТЕКЛО) D | 4 | 40 |


| | item_category_name | item_category_id |
|---|---|---|
| 0 | PC - Гарнитуры/Наушники | 0 |
| 1 | Аксессуары - PS2 | 1 |
| 2 | Аксессуары - PS3 | 2 |
| 3 | Аксессуары - PS4 | 3 |
| 4 | Аксессуары - PSP | 4 |
