## DSR Mini-Competition 
### Rossmann Sales Prediction
#### Team 1: John Enevoldsen, Sara Ghasemi, Mena Nasr

This notebook reproduces the results submitted for the competition. Data understanding, exploration, and visualisation are not included here.

In [5]:
import sys

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import xgboost
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model

In [6]:
sys.path.insert(0, './modules/')
import cleaning as cln
import feature_eng as feng

## 1. Initial Data Preparation

Read the files and make one data frame to be used for training and cross validation, and one to be used for the final test.

In [11]:
#Read the data from 'train', 'store' and 'holdout' files:

full_df_train = pd.read_csv("./data/train.csv")
full_df_store = pd.read_csv("./data/store.csv")
full_df_holdout = pd.read_csv("./data/holdout.csv")

In [12]:
#Merge the 'train' and 'store' data frames, to be used for training and cross validation:

full_df_train_cv = cln.merge(full_df_train, full_df_store)

In [13]:
#Merge the 'holdout' and 'store' data frames, to be used for final test:

full_df_test = cln.merge(full_df_holdout, full_df_store)
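
The `cln.merge` helper lives in the project's `cleaning` module and is not shown in this notebook. As a rough sketch of what such a merge could look like, the example below left-joins store metadata onto the sales rows. The function name `merge_on_store` and the join key `Store` are assumptions made for illustration, not the module's actual code.

```python
import pandas as pd


def merge_on_store(sales_df: pd.DataFrame, store_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for cln.merge (assumes a shared 'Store' key)."""
    # A left join keeps every sales row, even if a store has no metadata.
    return sales_df.merge(store_df, on="Store", how="left")


# Toy data, made up for the example.
sales = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 6064, 8314]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

merged = merge_on_store(sales, stores)
print(merged)
```

A left join is chosen here so that the number of sales rows does not change during the merge.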

## 2. Data Cleaning

In [14]:
# Remove the 'Customers' column, as instructed for the competition

df_train_cv = cln.drop_column(full_df_train_cv, column='Customers')

In [15]:
# Target cleaning: remove rows with zero or null values in the target ('Sales')

df_train_cv = cln.clean_targets(df_train_cv, target='Sales')
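
The target cleaning described in the comment above can be sketched in a few lines of pandas. The function `drop_zero_or_null_targets` is a hypothetical illustration; the real `cln.clean_targets` may differ in details such as index handling.

```python
import numpy as np
import pandas as pd


def drop_zero_or_null_targets(df: pd.DataFrame, target: str = "Sales") -> pd.DataFrame:
    """Hypothetical sketch: keep only rows with a non-null, non-zero target."""
    keep = df[target].notna() & (df[target] != 0)
    return df.loc[keep].reset_index(drop=True)


# Toy data, made up for the example.
toy = pd.DataFrame({"Store": [1, 2, 3, 4], "Sales": [5263.0, 0.0, np.nan, 8314.0]})

cleaned = drop_zero_or_null_targets(toy)
print(cleaned)
```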

In [16]:
# Feature cleaning:
# - If a feature has only a small share of null values, remove the affected rows.
# - Otherwise, drop the whole feature column.
# - The threshold between the two cases is 10% null values.
# - In the 'StateHoliday' column, convert 0.0 to 0 and cast the whole column to string.

df_train_cv = cln.rough_features_cleaning(df_train_cv, threshold=0.10, drop_columns=True, verbose=False)

Total number of rows before cleaning:  531983
Total number of rows after cleaning:  425689
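
The 10% rule described in the cell above can be illustrated with a minimal sketch. The function `clean_null_features` is hypothetical and covers only the threshold logic; the `StateHoliday` normalisation and the `drop_columns` and `verbose` options of `cln.rough_features_cleaning` are left out.

```python
import numpy as np
import pandas as pd


def clean_null_features(df: pd.DataFrame, threshold: float = 0.10) -> pd.DataFrame:
    """Hypothetical sketch of the threshold rule for null feature values."""
    # Share of null values per column, between 0 and 1.
    null_share = df.isna().mean()
    # Columns above the threshold are dropped entirely.
    sparse_cols = null_share[null_share > threshold].index
    out = df.drop(columns=sparse_cols)
    # In the remaining columns, rows with any null value are removed.
    return out.dropna().reset_index(drop=True)


# Toy data: 'A' has 5% nulls (rows dropped), 'B' has 50% nulls (column dropped).
toy = pd.DataFrame({
    "A": [np.nan] + [1.0] * 19,
    "B": [np.nan] * 10 + [2.0] * 10,
    "C": range(20),
})

cleaned = clean_null_features(toy, threshold=0.10)
print(cleaned.shape)
```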


## 3. Feature Engineering

Some feature engineering steps (namely mean encoding) are done after the data is split into training and cross-validation sets, to avoid data leakage. Others (new date features, one-hot encoding, etc.) can be done before the split.

For the data split, the last 3 months (from 2014-05-01 to 2014-07-31) are kept for the cross validation set and the rest (from 2013-01-01 to 2014-04-30) for training.
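
To illustrate why mean encoding has to wait until after the split, the sketch below computes per-store target means on the training rows only and then maps them onto the cross-validation rows. The function `mean_encode`, the `Store` key, and the fallback to the global training mean for unseen stores are all assumptions for illustration, not the team's actual implementation.

```python
import pandas as pd


def mean_encode(train: pd.DataFrame, other: pd.DataFrame,
                key: str = "Store", target: str = "Sales") -> pd.Series:
    """Hypothetical sketch: encode `key` by the training-set mean of `target`."""
    # Statistics come from the training rows only, so no information
    # from the cross-validation targets leaks into the feature.
    means = train.groupby(key)[target].mean()
    global_mean = train[target].mean()
    # Keys that never occur in training fall back to the global mean.
    return other[key].map(means).fillna(global_mean)


# Toy data, made up for the example.
train = pd.DataFrame({"Store": [1, 1, 2, 2], "Sales": [100.0, 200.0, 300.0, 500.0]})
cv = pd.DataFrame({"Store": [1, 2, 3], "Sales": [999.0, 999.0, 999.0]})

cv_encoded = mean_encode(train, cv)
print(cv_encoded.tolist())
```

The cross-validation targets (all 999.0 in the toy data) never influence the encoded values, which is the point of doing this step after the split.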

In [17]:
# Make new date features, before split into training and cross validation sets

df_train_cv = feng.dates_features(df_train_cv)
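
The exact features produced by `feng.dates_features` are not listed in this notebook. As a hedged sketch, date parts can be derived from a date column as shown below; the column name `Date` and the new feature names `Year`, `Month`, and `DayOfMonth` are assumptions for illustration.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Hypothetical sketch: derive simple calendar features from a date column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["Year"] = dates.dt.year
    out["Month"] = dates.dt.month
    out["DayOfMonth"] = dates.dt.day
    return out


# Toy data, made up for the example.
toy = pd.DataFrame({"Date": ["2013-01-01", "2014-07-31"], "Sales": [5263, 8314]})

with_dates = add_date_features(toy)
print(with_dates)
```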

In [18]:
# Add one hot encoding of StateHoliday, StoreType, Assortment before split

df_train_cv = feng.one_hot_encoding(df_train_cv, 'StateHoliday')
df_train_cv = feng.one_hot_encoding(df_train_cv, 'StoreType')
df_train_cv = feng.one_hot_encoding(df_train_cv, 'Assortment')
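
One-hot encoding of a categorical column can be sketched with `pandas.get_dummies`. The function `one_hot` below is a hypothetical illustration; whether `feng.one_hot_encoding` keeps or drops the original column is not shown in this notebook, and the sketch keeps it.

```python
import pandas as pd


def one_hot(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical sketch: append one indicator column per category value."""
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    return pd.concat([df, dummies], axis=1)


# Toy data, made up for the example.
toy = pd.DataFrame({"StoreType": ["a", "c", "a"]})

encoded = one_hot(toy, "StoreType")
print(encoded)
```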

In [20]:
# Split the data into training and cross validation sets

df_train, df_cv = feng.date_split_train_test(df_train_cv, '2014-05-01')
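
A time-based split like the one above can be sketched as follows. The function `split_by_date` and the `Date` column name are assumptions for illustration. The sketch treats the split date as the first day of the cross-validation set, which matches the date ranges stated earlier (training up to 2014-04-30, cross validation from 2014-05-01).

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, split_date: str, date_col: str = "Date"):
    """Hypothetical sketch: rows before split_date go to training, the rest to CV."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(split_date)
    return df.loc[dates < cutoff], df.loc[dates >= cutoff]


# Toy data, made up for the example.
toy = pd.DataFrame({
    "Date": ["2014-04-29", "2014-04-30", "2014-05-01", "2014-07-31"],
    "Sales": [1, 2, 3, 4],
})

train_part, cv_part = split_by_date(toy, "2014-05-01")
print(len(train_part), len(cv_part))
```

Splitting by date rather than at random keeps the cross-validation period strictly after the training period, which mirrors how the model is used on the later holdout data.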