## Problem Statement

![](https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/cover_1_3vEBqwk-thumbnail-1200x1200.png)

One of the largest retail chains in the world wants to use its vast data source to build an efficient forecasting model that predicts the sales of each SKU in its portfolio at its 76 stores, using week-by-week historical sales data for the past 3 years. Sales and promotional information is also available for each week, by product and by store.

However, no other information about the stores or products is available. Can you still accurately forecast the sales for every product/SKU-store combination for the next 12 weeks? If yes, then dive right in!

**Data Dictionary**

| **Column**  | **Description** |
| --- | --- |
| record_ID | Unique ID for each week store sku combination |
| week |  Starting Date of the week |
| store_id | Unique ID for each store (no numerical order to be assumed) |
| sku_id | Unique ID for each product (no numerical order to be assumed) |
| total_price | Sales Price of the product |
| base_price | Base price of the product |
| is_featured_sku | Was part of the featured item of the week |
| is_display_sku | Product was on display at a prominent place at the store |
| units_sold | (Target) Total Units sold for that week-store-sku combination |

**Evaluation Metric**

The evaluation metric for this competition is 100*RMSLE (Root Mean Squared Log Error).
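For reference, the metric can be sketched in a few lines of NumPy. This is a minimal sketch that assumes the common `log(1 + x)` form of RMSLE (the statement above does not spell out the exact formula), and the function name `rmsle_100` is purely illustrative.

```python
import numpy as np

def rmsle_100(y_true, y_pred):
    """Return 100 * RMSLE, assuming the log(1 + x) form of the metric."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Difference of the log-transformed values, element by element.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return 100.0 * np.sqrt(np.mean(log_diff ** 2))

print(rmsle_100([20, 28, 19], [20, 28, 19]))  # identical inputs give 0.0
```

Because the error is measured on a log scale, it penalises relative rather than absolute differences, so a miss of 10 units matters more for a slow-selling SKU than for a fast-selling one.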

In [372]:
## Import necessary Libraries.

import pandas as pd ## Pandas Library (will use to load data,create data frame...etc).
import numpy as np ## Numpy Library ( will use to convert data frame to array or creating array etc...).
import os ## For connecting to machine to get path for reading/writing files.
from sklearn.model_selection import train_test_split ## For splitting data into train and validation.
from sklearn.preprocessing import LabelEncoder ## For label encoding(converting categorical values to label).
from sklearn.model_selection import GridSearchCV ##For Grid search(cross validation).
import re ## For regular expressions.
from statsmodels.stats.outliers_influence import variance_inflation_factor ## For VIF.
from sklearn.linear_model import LinearRegression ## For regression model.
from sklearn.metrics import mean_squared_log_error ## For MSLE
from math import sqrt ## For square root.
from sklearn.tree import DecisionTreeRegressor ## For Decision tree model.
from sklearn.ensemble import RandomForestRegressor ## For Random Forest model.
from sklearn.neighbors import KNeighborsRegressor ## For KNN model.
from sklearn.svm import SVR ## For SVR model.
from sklearn.ensemble import AdaBoostRegressor ## For Adaboost model.
from sklearn.ensemble import GradientBoostingRegressor ## For GBR model.
from xgboost.sklearn import XGBRegressor ## For  XGB model.
from keras.models import Sequential ## For sequential model
from keras.layers import Dense ## For fully connected layer.
from sklearn.linear_model import Ridge ## For Ridge model.
from sklearn.linear_model import Lasso ## For Lasso model.

In [373]:
## Get current working directory.
os.getcwd()

'D:\\DataScience\\Pratice\\Demand Forecasting'

In [374]:
## Set working directory.
os.chdir(r"D:\DataScience\Pratice\Demand Forecasting") ## Raw string keeps the backslashes literal.
os.getcwd()

'D:\\DataScience\\Pratice\\Demand Forecasting'

In [375]:
## Load train and test data sets.
train = pd.read_csv('train.csv',header='infer',sep=',')
test = pd.read_csv('test.csv',header='infer',sep=',')

In [376]:
## Get train and test data dimensions.
print('Train Dimensions',train.shape)
print('Test Dimensions',test.shape)

Train Dimensions (150150, 9)
Test Dimensions (13860, 8)


In [377]:
## Get first 5 records of train data.
train.head()

Unnamed: 0,record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,units_sold
0,1,17/01/11,8091,216418,99.0375,111.8625,0,0,20
1,2,17/01/11,8091,216419,99.0375,99.0375,0,0,28
2,3,17/01/11,8091,216425,133.95,133.95,0,0,19
3,4,17/01/11,8091,216233,133.95,133.95,0,0,44
4,5,17/01/11,8091,217390,141.075,141.075,0,0,52


In [378]:
## Get first 5 records of test data.
test.head()

Unnamed: 0,record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku
0,212645,16/07/13,8091,216418,108.3,108.3,0,0
1,212646,16/07/13,8091,216419,109.0125,109.0125,0,0
2,212647,16/07/13,8091,216425,133.95,133.95,0,0
3,212648,16/07/13,8091,216233,133.95,133.95,0,0
4,212649,16/07/13,8091,217390,176.7,176.7,0,0


In [379]:
## Get last 5 records of train data.
train.tail()

Unnamed: 0,record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,units_sold
150145,212638,09/07/13,9984,223245,235.8375,235.8375,0,0,38
150146,212639,09/07/13,9984,223153,235.8375,235.8375,0,0,30
150147,212642,09/07/13,9984,245338,357.675,483.7875,1,1,31
150148,212643,09/07/13,9984,547934,141.7875,191.6625,0,1,12
150149,212644,09/07/13,9984,679023,234.4125,234.4125,0,0,15


In [380]:
## Get last 5 records of test data.
test.tail()

Unnamed: 0,record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku
13855,232281,01/10/13,9984,223245,241.5375,241.5375,0,0
13856,232282,01/10/13,9984,223153,240.825,240.825,0,0
13857,232285,01/10/13,9984,245338,382.6125,401.85,1,1
13858,232286,01/10/13,9984,547934,191.6625,191.6625,0,0
13859,232287,01/10/13,9984,679023,234.4125,234.4125,0,0


In [381]:
## Get summary statistics of train data.
train.describe(include = 'all')

Unnamed: 0,record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,units_sold
count,150150.0,150150,150150.0,150150.0,150149.0,150150.0,150150.0,150150.0,150150.0
unique,,130,,,,,,,
top,,11/09/12,,,,,,,
freq,,1155,,,,,,,
mean,106271.555504,,9199.422511,254761.132468,206.626751,219.425927,0.095611,0.1332,51.674206
std,61386.037861,,615.591445,85547.306447,103.308516,110.961712,0.294058,0.339792,60.207904
min,1.0,,8023.0,216233.0,41.325,61.275,0.0,0.0,1.0
25%,53111.25,,8562.0,217217.0,130.3875,133.2375,0.0,0.0,20.0
50%,106226.5,,9371.0,222087.0,198.075,205.9125,0.0,0.0,35.0
75%,159452.75,,9731.0,245338.0,233.7,234.4125,0.0,0.0,62.0


In [382]:
## Get summary statistics of test data.
test.describe(include='all')

Unnamed: 0,record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku
count,13860.0,13860,13860.0,13860.0,13860.0,13860.0,13860.0,13860.0
unique,,12,,,,,,
top,,10/09/13,,,,,,
freq,,1155,,,,,,
mean,222460.146392,,9199.422511,254761.132468,212.188874,223.92266,0.08658,0.133333
std,5668.25849,,615.611603,85550.107852,93.138162,103.429522,0.281229,0.339947
min,212645.0,,8023.0,216233.0,65.55,70.5375,0.0,0.0
25%,217557.75,,8562.0,217217.0,132.525,137.5125,0.0,0.0
50%,222466.5,,9371.0,222087.0,213.0375,218.7375,0.0,0.0
75%,227367.25,,9731.0,245338.0,241.5375,261.4875,0.0,0.0


In [383]:
## Get column names of train data.
train.columns

Index(['record_ID', 'week', 'store_id', 'sku_id', 'total_price', 'base_price',
       'is_featured_sku', 'is_display_sku', 'units_sold'],
      dtype='object')

In [384]:
## Get column names of test data.
test.columns

Index(['record_ID', 'week', 'store_id', 'sku_id', 'total_price', 'base_price',
       'is_featured_sku', 'is_display_sku'],
      dtype='object')

In [385]:
## Get column data types of train data.
train.dtypes

record_ID            int64
week                object
store_id             int64
sku_id               int64
total_price        float64
base_price         float64
is_featured_sku      int64
is_display_sku       int64
units_sold           int64
dtype: object

In [386]:
## Get column data types of test data.
test.dtypes

record_ID            int64
week                object
store_id             int64
sku_id               int64
total_price        float64
base_price         float64
is_featured_sku      int64
is_display_sku       int64
dtype: object

In [387]:
## Get index range of train data.
train.index

RangeIndex(start=0, stop=150150, step=1)

In [388]:
## Get index range of test data.
test.index

RangeIndex(start=0, stop=13860, step=1)

In [389]:
## Check NA values of train data.
train.isna().sum()

record_ID          0
week               0
store_id           0
sku_id             0
total_price        1
base_price         0
is_featured_sku    0
is_display_sku     0
units_sold         0
dtype: int64

In [390]:
## Check NA values of test data.
test.isna().sum()

record_ID          0
week               0
store_id           0
sku_id             0
total_price        0
base_price         0
is_featured_sku    0
is_display_sku     0
dtype: int64

In [391]:
## Drop the single record with a missing total_price; one row out of 150,150 has negligible impact.
train.dropna(inplace=True)

In [392]:
## Check NA values after dropping NAs from train data.
train.isna().sum()

record_ID          0
week               0
store_id           0
sku_id             0
total_price        0
base_price         0
is_featured_sku    0
is_display_sku     0
units_sold         0
dtype: int64

In [393]:
## This method will return number of levels,null values,unique values,data types.

def statistics(df):
    return(pd.DataFrame({'dtypes' : df.dtypes,
                         'levels' : [df[x].unique() for x in df.columns],
                         'null_values' : df.isnull().sum(),
                         'Unique Values': df.nunique()
                        }))

In [394]:
## Get train data statistics.
statistics(train)

Unnamed: 0,dtypes,levels,null_values,Unique Values
record_ID,int64,"[1, 2, 3, 4, 5, 9, 10, 13, 14, 17, 18, 19, 22,...",0,150149
week,object,"[17/01/11, 24/01/11, 31/01/11, 07/02/11, 14/02...",0,130
store_id,int64,"[8091, 8095, 8094, 8063, 8023, 8058, 8222, 812...",0,76
sku_id,int64,"[216418, 216419, 216425, 216233, 217390, 21900...",0,28
total_price,float64,"[99.0375, 133.95, 141.075, 227.2875, 327.0375,...",0,646
base_price,float64,"[111.8625, 99.0375, 133.95, 141.075, 227.2875,...",0,572
is_featured_sku,int64,"[0, 1]",0,2
is_display_sku,int64,"[0, 1]",0,2
units_sold,int64,"[20, 28, 19, 44, 52, 18, 47, 50, 82, 99, 120, ...",0,708


In [395]:
## Get test data statistics.
statistics(test)

Unnamed: 0,dtypes,levels,null_values,Unique Values
record_ID,int64,"[212645, 212646, 212647, 212648, 212649, 21265...",0,13860
week,object,"[16/07/13, 23/07/13, 30/07/13, 06/08/13, 13/08...",0,12
store_id,int64,"[8091, 8095, 8094, 8063, 8023, 8058, 8222, 812...",0,76
sku_id,int64,"[216418, 216419, 216425, 216233, 217390, 21900...",0,28
total_price,float64,"[108.3, 109.0125, 133.95, 176.7, 218.7375, 341...",0,442
base_price,float64,"[108.3, 109.0125, 133.95, 176.7, 218.7375, 341...",0,370
is_featured_sku,int64,"[0, 1]",0,2
is_display_sku,int64,"[0, 1]",0,2


In [396]:
## Below logic is used for checking special characters in numeric columns.

def specialCharcterVerification_NumCol(data):
    for col in data.select_dtypes(['int64','float64']).columns: 
        print('\n',col,'----->')
        ## Iterate over the actual index labels so the first row is included
        ## and gaps left by dropped rows do not raise a KeyError.
        for index in data.index:
            try:
                skip=float(data.loc[index,col])
                skip=int(data.loc[index,col])
            except ValueError :
                print(index,data.loc[index,col])
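Columns that pandas has already typed as `int64` or `float64` cannot hold stray characters, so a check like this is mostly useful on columns that were read as `object`. A vectorised alternative is sketched below, assuming `pd.to_numeric(..., errors='coerce')` is an acceptable way to flag unparsable entries; the helper name `non_numeric_entries` and the toy data are hypothetical.

```python
import pandas as pd

def non_numeric_entries(df, columns):
    """Map each column name to the entries that cannot be parsed as numbers."""
    problems = {}
    for col in columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        # Entries that were present originally but became NaN after coercion.
        bad = df.loc[converted.isna() & df[col].notna(), col]
        if not bad.empty:
            problems[col] = bad
    return problems

# Toy frame: the second price contains a stray '#'.
toy = pd.DataFrame({'total_price': ['99.0375', '13#.95', '141.075'],
                    'units_sold': [20, 28, 19]})
issues = non_numeric_entries(toy, ['total_price', 'units_sold'])
print(issues)
```

This avoids a Python-level loop over every row, which matters on a frame with roughly 150,000 records.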

In [397]:
## Check special characters for test data numeric columns.
specialCharcterVerification_NumCol(test)


 record_ID ----->

 store_id ----->

 sku_id ----->

 total_price ----->

 base_price ----->

 is_featured_sku ----->

 is_display_sku ----->


In [398]:
### calculate variance column wise for numeric columns.
def variance(x):
        return(pd.DataFrame({'Datatype' : x.dtypes,
                            'Variance': [round(x[i].var()) for i in x] }))

In [399]:
## Get variance for train data numeric columns.
variance(train.select_dtypes(['int64','float64']))

Unnamed: 0,Datatype,Variance
record_ID,int64,3768219582
store_id,int64,378955
sku_id,int64,7318389790
total_price,float64,10673
base_price,float64,12312
is_featured_sku,int64,0
is_display_sku,int64,0
units_sold,int64,3625


In [400]:
## Get variance for test data numeric columns.
variance(test.select_dtypes(['int64','float64']))

Unnamed: 0,Datatype,Variance
record_ID,int64,32129154
store_id,int64,378978
sku_id,int64,7318820954
total_price,float64,8675
base_price,float64,10698
is_featured_sku,int64,0
is_display_sku,int64,0


In [401]:
## Check special characters for categorical columns.
def checkSpclCharcters(df):
    for col in df.select_dtypes(['object']).columns:
        print('\n',col,'----->')
        ## Iterate over the actual index labels (includes the first row).
        for index in df.index:
            value = str(df.loc[index,col])
            stripped = re.sub(r'[\s+]', '', value) ## Raw string avoids an invalid escape sequence.
            if  value.isdigit() or value==' ' or \
                value.isalpha() or stripped.isalpha() or \
                stripped.replace('-','').isalnum() or value.isalnum():
                skip = True
            else:
                print("Index ",index,"\tSpecial Character ",df.loc[index,col])

In [402]:
## Set record_ID column as index to train and test.
train.set_index('record_ID',inplace=True)
test.set_index('record_ID',inplace=True)

In [403]:
## Check train first record after setting index.
train.head(1)

record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,units_sold
1,17/01/11,8091,216418,99.0375,111.8625,0,0,20


In [404]:
## Check test first record after setting index.
test.head(1)

record_ID,week,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku
212645,16/07/13,8091,216418,108.3,108.3,0,0


In [405]:
## Extract features from week column.

## Converting week into datetime format for train data.
train['date'] = pd.to_datetime(train['week'])

In [406]:
## Converting week into datetime format for test data.
test['date'] = pd.to_datetime(test['week'])
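The `week` strings look like day/month/two-digit-year values (for example, `17/01/11` cannot have month 17). Depending on the pandas version, an ambiguous string such as `11/09/12` may be read month-first when no format is given; the first `X_train` row shown later (year 2012, month 11, day 9) is consistent with that reading. A minimal sketch of parsing with an explicit format, assuming the column really is `dd/mm/yy` throughout:

```python
import pandas as pd

# Sample values taken from the week column shown in this notebook.
weeks = pd.Series(['17/01/11', '11/09/12', '09/07/13'])

# An explicit day/month/two-digit-year format removes any ambiguity.
parsed = pd.to_datetime(weeks, format='%d/%m/%y')

print(parsed.dt.year.tolist())   # [2011, 2012, 2013]
print(parsed.dt.month.tolist())  # [1, 9, 7]
print(parsed.dt.day.tolist())    # [17, 11, 9]
```

The `.dt` accessor also gives the year, month and day directly, without converting each timestamp to a Python `date` first.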

In [407]:
## Extract date from date for train data.
train['date'] = [d.date() for d in train['date']]

In [408]:
## Extract date from date for test data.
test['date'] = [d.date() for d in test['date']]

In [409]:
## Drop the week column from train data because the parsed date column
## now carries the same information.
train.drop(['week'], axis=1, inplace=True)

In [410]:
## Drop the week column from test data because the parsed date column
## now carries the same information.
test.drop(['week'], axis=1, inplace=True)

In [411]:
## Extract day,month,year features from date column of train data and also drop 
## date column after feature extraction.
train['year'] = train['date'].apply(lambda x: x.year)
train['month'] = train['date'].apply(lambda x: x.month)
train['day'] = train['date'].apply(lambda x: x.day)
train.drop(['date'], axis=1, inplace=True)

In [412]:
## Extract day,month,year features from date column of test data and also drop 
## date column after feature extraction.
test['year'] = test['date'].apply(lambda x: x.year)
test['month'] = test['date'].apply(lambda x: x.month)
test['day'] = test['date'].apply(lambda x: x.day)
test.drop(['date'], axis=1, inplace=True)

In [413]:
## Check first record of train data after feature extractions.
train.head(1)

record_ID,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,units_sold,year,month,day
1,8091,216418,99.0375,111.8625,0,0,20,2011,1,17


In [414]:
## Check first record of test data after feature extractions.
test.head(1)

record_ID,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,year,month,day
212645,8091,216418,108.3,108.3,0,0,2013,7,16


In [415]:
## Drop duplicate records from train data. With keep=False every copy of a
## duplicated row is removed (not just the extras); the record_ID index is ignored.
train.drop_duplicates(keep = False, inplace = True)

In [416]:
## Drop duplicate records from test data. With keep=False every copy of a
## duplicated row is removed (not just the extras); the record_ID index is ignored.
test.drop_duplicates(keep = False, inplace = True)
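The `keep` argument changes how many rows survive, which is easy to overlook. The toy frame below (hypothetical data, not taken from the competition files) contrasts `keep=False` with the default `keep='first'`:

```python
import pandas as pd

# Rows 0 and 1 are identical; row 2 is unique.
toy = pd.DataFrame({'store_id': [8091, 8091, 8095],
                    'units_sold': [20, 20, 28]})

all_copies_dropped = toy.drop_duplicates(keep=False)   # removes both identical rows
first_copy_kept = toy.drop_duplicates(keep='first')    # keeps one of the identical rows

print(len(all_copies_dropped))  # 1
print(len(first_copy_kept))     # 2
```

For a test set that needs one prediction per `record_ID`, it is worth confirming that the row count is unchanged after this step; the `test.shape` outputs later in the notebook still show 13860 rows.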

In [417]:
## Data type conversion.

## Convert object/int64/float64 columns into category (non-numeric columns).
cat_column = ['store_id','sku_id','is_featured_sku','is_display_sku','year','month','day']

## Convert object type to category data type.
def dtypeConversion(df):  
    for i in cat_column:
        df[i]=df[i].astype('category')

In [418]:
## Convert object data type to category data type for train data.
dtypeConversion(train)

In [419]:
dtypeConversion(test)

In [420]:
## Check column data types for train after conversion.
train.dtypes

store_id           category
sku_id             category
total_price         float64
base_price          float64
is_featured_sku    category
is_display_sku     category
units_sold            int64
year               category
month              category
day                category
dtype: object

In [421]:
## Check column data types for test after conversion.
test.dtypes

store_id           category
sku_id             category
total_price         float64
base_price          float64
is_featured_sku    category
is_display_sku     category
year               category
month              category
day                category
dtype: object

In [422]:
## Check correlation between numeric columns of train data.
train.select_dtypes(['int64','float64']).corr()

Unnamed: 0,total_price,base_price,units_sold
total_price,1.0,0.958885,-0.235625
base_price,0.958885,1.0,-0.140022
units_sold,-0.235625,-0.140022,1.0


In [423]:
## Check correlation between numeric columns of test data.
test.select_dtypes(['int64','float64']).corr()

Unnamed: 0,total_price,base_price
total_price,1.0,0.96406
base_price,0.96406,1.0


In [433]:
## Split data into train and validation(70:30 ratio).
X_train,X_test,y_train,y_test = train_test_split(train.drop('units_sold',axis=1),train['units_sold'],test_size=0.3,random_state=123)
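A random 70:30 split mixes past and future weeks, whereas the task is to forecast the 12 weeks after the training period. One alternative, sketched here as an assumption rather than as the method used in this notebook, is to hold out the most recent weeks for validation. The helper `chronological_split` and the toy data are hypothetical, and the sketch assumes a parsed date column is still available.

```python
import pandas as pd

def chronological_split(df, date_col, n_holdout_weeks):
    """Hold out the most recent n_holdout_weeks distinct dates for validation."""
    # The n-th most recent distinct date marks the start of the hold-out period.
    cutoff = df[date_col].drop_duplicates().sort_values().iloc[-n_holdout_weeks]
    is_holdout = df[date_col] >= cutoff
    return df[~is_holdout], df[is_holdout]

# Toy frame: four weekly dates, two records per week.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-06-18', '2013-06-25',
                            '2013-07-02', '2013-07-09'] * 2),
    'units_sold': [20, 28, 19, 44, 52, 18, 47, 50],
})
fit_part, holdout_part = chronological_split(toy, 'date', n_holdout_weeks=1)
print(len(fit_part), len(holdout_part))  # 6 2
```

With a split like this, the validation error is measured on weeks the model has never seen, which is closer to how the test set is constructed.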

In [434]:
## Check first record of train data.
X_train.head(1)

record_ID,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,year,month,day
142235,9984,219029,312.7875,312.7875,0,0,2012,11,9


In [435]:
## Check first record of validation data.
X_test.head(1)

record_ID,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,year,month,day
194013,9498,219029,327.0375,327.0375,0,0,2013,4,23


In [436]:
## Check first record of train target data.
y_train.head(1)

record_ID
142235    15
Name: units_sold, dtype: int64

In [437]:
## Check first record of validation target data.
y_test.head(1)

record_ID
194013    21
Name: units_sold, dtype: int64

In [438]:
## Check NA values for train data.
X_train.isna().sum()

store_id           0
sku_id             0
total_price        0
base_price         0
is_featured_sku    0
is_display_sku     0
year               0
month              0
day                0
dtype: int64

In [439]:
## Check NA values for validation data.
X_test.isna().sum()

store_id           0
sku_id             0
total_price        0
base_price         0
is_featured_sku    0
is_display_sku     0
year               0
month              0
day                0
dtype: int64

In [None]:
## Create an empty dataframe and calculate VIF for train data.
vif=pd.DataFrame()
vif['Vif']=[variance_inflation_factor(X_train.values,i) for i in range(X_train.shape[1])]
vif['Variables']=X_train.columns.values

In [None]:
## Create an empty dataframe and calculate VIF for validation data.
vif=pd.DataFrame()
vif['Vif']=[variance_inflation_factor(X_test.values,i) for i in range(X_test.shape[1])]
vif['Variables']=X_test.columns.values
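The VIF of a column is `1 / (1 - R^2)`, where `R^2` comes from regressing that column on all the others. It needs a purely numeric matrix, so it is most meaningful for `total_price` and `base_price` here rather than for the category-typed columns. The NumPy-only sketch below illustrates the calculation; `vif_numpy` is a hypothetical helper (not the statsmodels function), it adds an intercept to each regression, and the data are synthetic.

```python
import numpy as np

def vif_numpy(X):
    """VIF for each column of a 2-D numeric array: 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept so R^2 is measured around the column mean.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic data: total_price is built from base_price, the third column is noise.
rng = np.random.default_rng(0)
base_price = rng.uniform(60, 500, size=200)
total_price = base_price * rng.uniform(0.8, 1.0, size=200)
unrelated = rng.normal(size=200)
vif_values = vif_numpy(np.column_stack([total_price, base_price, unrelated]))
print(vif_values)
```

The two price columns should show large VIFs, in line with the 0.96 correlation reported above, while the unrelated column stays close to 1.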

In [None]:
#################################################### Label Encoding ###########################################################

In [442]:
## Instantiate label encoders.

le_storeId = LabelEncoder()
le_skuId = LabelEncoder()
le_featuresSKU = LabelEncoder()
le_displaySKU = LabelEncoder()
le_year = LabelEncoder()
le_month = LabelEncoder()
le_day = LabelEncoder()

In [443]:
## Do label encoding on train data.
X_train['store_id'] = le_storeId.fit_transform(X_train['store_id'])
X_train['sku_id'] = le_skuId.fit_transform(X_train['sku_id'])
X_train['is_featured_sku'] = le_featuresSKU.fit_transform(X_train['is_featured_sku'])
X_train['is_display_sku'] = le_displaySKU.fit_transform(X_train['is_display_sku'])
X_train['year'] = le_year.fit_transform(X_train['year'])
X_train['month'] = le_month.fit_transform(X_train['month'])
X_train['day'] = le_day.fit_transform(X_train['day'])

In [444]:
## Do label encoding on validation data.
X_test['store_id'] = le_storeId.transform(X_test['store_id'])
X_test['sku_id'] = le_skuId.transform(X_test['sku_id'])
X_test['is_featured_sku'] = le_featuresSKU.transform(X_test['is_featured_sku'])
X_test['is_display_sku'] = le_displaySKU.transform(X_test['is_display_sku'])
X_test['year'] = le_year.transform(X_test['year'])
X_test['month'] = le_month.transform(X_test['month'])
X_test['day'] = le_day.transform(X_test['day'])

In [445]:
## Do label encoding on test data.
test['store_id'] = le_storeId.transform(test['store_id'])
test['sku_id'] = le_skuId.transform(test['sku_id'])
test['is_featured_sku'] = le_featuresSKU.transform(test['is_featured_sku'])
test['is_display_sku'] = le_displaySKU.transform(test['is_display_sku'])
test['year'] = le_year.transform(test['year'])
test['month'] = le_month.transform(test['month'])
test['day'] = le_day.transform(test['day'])
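`LabelEncoder.transform` raises a `ValueError` when it meets a value that was not present when the encoder was fitted. The cells here ran without that error, but a different split or test period could introduce, for example, a day of the month never seen in `X_train`. A small pre-check is sketched below in plain Python; `unseen_levels` is a hypothetical helper and the values are made up.

```python
def unseen_levels(fit_values, new_values):
    """Return the sorted levels present in new_values but absent from fit_values."""
    return sorted(set(new_values) - set(fit_values))

# Days of the month seen when fitting versus days in the data to transform.
fit_days = [17, 24, 31, 7, 14]
new_days = [17, 16, 23, 7]

print(unseen_levels(fit_days, new_days))  # [16, 23]
```

An empty list means every value can be transformed safely; a non-empty one shows which levels need handling before encoding.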

In [446]:
## Check train data after doing label encoding.
X_train.head(1)

record_ID,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,year,month,day
142235,75,8,312.7875,312.7875,0,0,1,10,8


In [447]:
## Check validation data after doing label encoding.
X_test.head(1)

record_ID,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,year,month,day
194013,47,8,327.0375,327.0375,0,0,2,3,22


In [448]:
## Check test data after doing label encoding.
test.head(1)

record_ID,store_id,sku_id,total_price,base_price,is_featured_sku,is_display_sku,year,month,day
212645,3,1,108.3,108.3,0,0,2,6,15


In [449]:
########################################## Build Different Models #############################################################

In [450]:
################################################# Linear Regression ###########################################################

In [155]:
## Instantiate the regression model and fit it.
linreg=LinearRegression()
linear_model=linreg.fit(X_train,y_train)

In [156]:
## Get the predictions on train and validation data.
pred_train = linear_model.predict(X_train)
pred_test = linear_model.predict(X_test)

In [157]:
## Get predictions on test data.
test_pred = linear_model.predict(test)

In [180]:
## Below function is used to calculate RMSLE * 100.
def rmsle(y, y0):
    return np.sqrt(np.mean(np.square(np.log1p(y) - np.log1p(y0))))*100

In [183]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",rmsle(y_train, pred_train))
print("Test Error:",rmsle(y_test, pred_test))

Train Error: 70.63058666321805
Test Error: 70.37652179854565


In [184]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [185]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [186]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [187]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('LinearModel.csv',index=False)

In [172]:
############################################### Decision Tree ##################################################################

In [188]:
## Instantiate and fit a regression model.
dtr = DecisionTreeRegressor(max_depth=5,min_samples_leaf=10,min_samples_split=5,random_state=123)
dtr.fit(X_train,y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=5,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=10, min_samples_split=5,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=123, splitter='best')

In [189]:
## Get the predictions on train and validation data.
pred_train = dtr.predict(X_train)
pred_test = dtr.predict(X_test)

In [190]:
## Get predictions on test data.
test_pred = dtr.predict(test) ## Use the decision tree (not the linear model) for the test predictions.

## Note: mean_squared_log_error() raises a ValueError when y_true or y_pred contains negative values.
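A model such as linear regression can produce negative predictions, which would trigger that error. One common workaround, shown here as an assumption rather than as part of the original pipeline, is to clip predictions at zero before scoring or writing a submission. The helper name `clip_for_msle` is hypothetical.

```python
import numpy as np

def clip_for_msle(predictions, lower=0.0):
    """Replace predictions below `lower` with `lower` so log1p is defined."""
    return np.clip(np.asarray(predictions, dtype=float), lower, None)

raw = np.array([-3.2, 0.0, 12.5])
print(clip_for_msle(raw).tolist())  # [0.0, 0.0, 12.5]
```

Clipping changes only the invalid predictions, so scores for models that never predict below zero are unaffected.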

In [191]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)

Train Error: 70.63058666321871
Test Error: 70.37652179854706


In [192]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [193]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [194]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [195]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('DecisionTree.csv',index=False)

In [197]:
############################################## Random Forest ##################################################################

In [270]:
## Instantiate a regressor model.
rc = RandomForestRegressor(n_estimators= 200, max_depth= 10 ,min_samples_leaf = 4 ,max_features='sqrt')

In [271]:
## Fit a model.
rc.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=10, max_features='sqrt', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=4,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=200, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [272]:
## Get the predictions on train and validation data.
pred_train = rc.predict(X_train)
pred_test = rc.predict(X_test)

In [273]:
## Get predictions on test data.
test_pred = rc.predict(test)

In [219]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)

Train Error: 70.63058666321871
Test Error: 70.37652179854706


In [210]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [211]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [212]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [213]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('RandomForest.csv',index=False)

In [216]:
################################################### KNN #######################################################################

In [220]:
## Instantiate KNN model and fit it.
knn = KNeighborsRegressor(algorithm = 'brute', n_neighbors = 10,
                           metric = "euclidean")
knn.fit(X_train, y_train)

KNeighborsRegressor(algorithm='brute', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                    weights='uniform')

In [221]:
## Get the predictions on train and validation.
pred_train = knn.predict(X_train)
pred_test = knn.predict(X_test)

In [222]:
## Get predictions on test data.
test_pred = knn.predict(test)

In [224]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)

Train Error: 57.54764848075348
Test Error: 62.36310237996467


In [225]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [226]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [227]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [228]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('KNN.csv',index=False)

In [229]:
#################################################### SVM #######################################################################

In [230]:
## Instantiate SVR model.
svr_model = SVR()
svr_model

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [231]:
## Fit a model.
svr_model.fit(X = X_train, y = y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [232]:
## Get the predictions on train and validation.
pred_train = svr_model.predict(X_train)
pred_test = svr_model.predict(X_test)

In [238]:
## Get predictions on test data.
test_pred = svr_model.predict(test) ## Use the SVR model (not KNN) for the test predictions.

In [None]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)

In [240]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [241]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [242]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [243]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('SVM.csv',index=False)

In [244]:
################################################### AdaBoost ##################################################################

In [245]:
## Instantiate regressor model and fit it.
Adaboost_model = AdaBoostRegressor(n_estimators=200,learning_rate=0.001)
%time Adaboost_model.fit(X_train, y_train)

Wall time: 22.2 s


AdaBoostRegressor(base_estimator=None, learning_rate=0.001, loss='linear',
                  n_estimators=200, random_state=None)

In [246]:
## Get the predictions on train and validation data.
pred_train = Adaboost_model.predict(X_train)
pred_test = Adaboost_model.predict(X_test)

In [247]:
## Get predictions on test data.
test_pred = Adaboost_model.predict(test)

In [249]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)

Train Error: 76.92122447609304
Test Error: 76.62514792680571


In [250]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [251]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [252]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [253]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('AdaBoost.csv',index=False)

In [254]:
##################################################### GradientBoosting #########################################################

In [255]:
## Instantiate GBR and fit it.
gbm = GradientBoostingRegressor(n_estimators=200,learning_rate=0.001,random_state=474)
%time gbm.fit(X=X_train, y=y_train)

Wall time: 13.8 s


GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.001, loss='ls',
                          max_depth=3, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=200,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=474, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [256]:
## Get the predictions on train and validation.
pred_train = gbm.predict(X_train)
pred_test = gbm.predict(X_test)

In [257]:
## Get predictions on test data.
test_pred = gbm.predict(test)

In [259]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)

Train Error: 89.55891229538047
Test Error: 89.36619239123944
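
The high error above is consistent with the very small `learning_rate`. In gradient boosting with squared-error loss, each stage adds `learning_rate` times a tree fitted to the current residuals. Even if every tree matched its residuals exactly, the residual would shrink only by a factor of `(1 - learning_rate)` per stage. The sketch below works through that idealised arithmetic for the settings used here. The function name is hypothetical, and real depth-3 trees would close less of the gap.

```python
def fraction_of_gap_closed(n_estimators, learning_rate):
    """Idealised share of the initial residual removed after boosting.

    Assumes every tree fits the current residual perfectly, so this is an
    optimistic illustration rather than a statement about the fitted model.
    """
    remaining = (1.0 - learning_rate) ** n_estimators
    return 1.0 - remaining

# Settings from the cell above: 200 stages with learning_rate=0.001.
print(round(fraction_of_gap_closed(200, 0.001), 3))  # about 0.181
# For comparison, scikit-learn's default learning_rate of 0.1.
print(fraction_of_gap_closed(200, 0.1))
```

Under this idealisation, roughly 18% of the distance between the initial prediction and the target is covered, which suggests trying a larger `learning_rate` or many more estimators.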


In [260]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [261]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [262]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [263]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('GB.csv',index=False)

In [264]:
################################################## XGradient Boosting ##########################################################

In [265]:
## Model Building with Grid Search.
xgb = XGBRegressor() ## Instantiate XGB model.

optimization_dict = {'max_depth': [2,3,4,5,6,7,10,15], ## Try different max_depth and n_estimators values to find the best model.
                      'n_estimators': [50,60,70,80,90,100,150,200]} 

## Build best model with Grid Search params.
model = GridSearchCV(xgb, ## XGB model.
                     optimization_dict, ## Dictionary with the max_depth and n_estimators candidates.
                     verbose=1, ## Print progress messages.
                     n_jobs=-1) ## Number of jobs to run in parallel; -1 means use all processors.

%time model.fit(X_train, y_train) ## Fit a model.
print(model.best_score_) ## Display best score value.
print(model.best_params_) ## Display best parameters.

Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   20.4s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:  9.0min finished


Wall time: 9min 28s
0.8119970054960417
{'max_depth': 10, 'n_estimators': 200}
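
The `best_score_` printed above is a mean cross-validated R², because `GridSearchCV` falls back to the estimator's own `score` method when `scoring` is not set. To select hyperparameters on the competition metric instead, a scorer can be passed explicitly. The following is a minimal sketch on synthetic data, using `DecisionTreeRegressor` as a stand-in for `XGBRegressor` so that it runs without xgboost. The data, the helper `rmsle`, and the small grid are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def rmsle(y_true, y_pred):
    # Clip at zero because mean_squared_log_error rejects negative values.
    return np.sqrt(mean_squared_log_error(y_true, np.clip(y_pred, 0, None)))

# greater_is_better=False makes GridSearchCV maximise the negated RMSLE.
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Synthetic, non-negative target standing in for units_sold.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = np.round(5 + 3 * X_demo[:, 0] + rng.uniform(0, 2, size=200))

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {'max_depth': [2, 4, 6]},
                      scoring=rmsle_scorer,
                      cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSLE of the best setting
```

Because the scorer is negated, `best_score_` is at most zero and its sign has to be flipped before reporting.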


In [309]:
## Instantiate XGBR and fit it.
## Note: max_depth=7 is used here although the grid search above selected max_depth=10.
xgb_model=XGBRegressor(n_estimators=200,learning_rate=0.001,max_depth=7)
%time xgb_model.fit(X_train,y_train,verbose=True)



Wall time: 17.1 s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.001, max_delta_step=0,
             max_depth=7, min_child_weight=1, missing=None, n_estimators=200,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [310]:
## Get the predictions on train and validation.
pred_train = xgb_model.predict(X_train)
pred_test = xgb_model.predict(X_test)

In [311]:
## Get predictions on test data.
test_pred = xgb_model.predict(test)

In [313]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)

Train Error: 149.7562801456566
Test Error: 149.62871162091162


In [314]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [315]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [316]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [317]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('XGB.csv',index=False)

In [318]:
############################################# Neural Network Linear Algorithm ##################################################

In [363]:
## Instantiate a sequential model.
model = Sequential()

## Add a single dense layer with one output unit (a linear model).
model.add(Dense(1, input_dim=X_train.shape[1]))

## Compile the model with MSE loss and the RMSprop optimizer.
model.compile(loss='mse', optimizer='rmsprop')

## Fit a model.
model.fit(X_train, y_train, epochs=50, batch_size=32)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.callbacks.History at 0x19c4f249828>

In [364]:
## Get the predictions on train and validation.
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

In [365]:
## Get predictions on test data.
test_pred = model.predict(test)

In [313]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)



In [366]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred.tolist()})

In [367]:
## Each prediction is a one-element list; extract its scalar value.
dataframe['units_sold'] = dataframe.units_sold.apply(lambda x : x[0])
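
Keras returns predictions for a single-output `Dense` layer as a 2-D array of shape `(n_samples, 1)`, which is why the cells above go through `tolist()` and a `lambda`. A shorter route, sketched below with a stand-in array rather than real model output, is to flatten the array with `ravel()` before building the dataframe.

```python
import numpy as np

# Stand-in for model.predict(test): shape (n_samples, 1); values are made up.
test_pred_2d = np.array([[55.5], [12.0], [30.25]])

# ravel() flattens to 1-D, giving one value per row.
test_pred_1d = test_pred_2d.ravel()

print(test_pred_2d.shape)  # (3, 1)
print(test_pred_1d.shape)  # (3,)
```

The flattened array can then be used directly as the `units_sold` column.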

In [368]:
## Check first record of dataframe.
dataframe.head(1)

|  | record_ID | units_sold |
| --- | --- | --- |
| 0 | 212645 | 55.559364 |


In [369]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [370]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [371]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('NeuralNetworks.csv',index=False)

In [335]:
############################### Perform Grid Search,Ridge,Lasso ###############################################################

In [336]:
##################################################### Ridge ###################################################################

In [337]:
## Ridge regression takes a regularization parameter alpha. Larger values of alpha shrink the coefficient magnitudes more strongly.
## We also need to check which value of alpha gives the best predictions on held-out data, so we try several values and pick the best.
## We do this by performing a grid search over several values of alpha.
alphas = np.array([1,0.1,0.01,0.001,0.0001,0,1.5,2]) ## Candidate values; the best one is picked by the grid search.
## Create and fit a ridge regression model, testing each alpha.
model_ridge = Ridge()
grid = GridSearchCV(estimator=model_ridge, param_grid=dict(alpha=alphas),cv=10) ## cv=10 means the score is computed on 10 folds (chunks) of the data and the average is reported.
grid.fit(X_train,y_train)
print(grid)

GridSearchCV(cv=10, error_score=nan,
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=None,
                             solver='auto', tol=0.001),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': array([1.0e+00, 1.0e-01, 1.0e-02, 1.0e-03, 1.0e-04, 0.0e+00, 1.5e+00,
       2.0e+00])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)


In [338]:
## Display best params.
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.25372441841201543
2.0
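
To make the `cv=10` averaging concrete, the sketch below reproduces it by hand for a single `alpha` on synthetic data and compares the result with what `GridSearchCV` reports. The data and the `_demo` names are assumptions for illustration, and the scores are R² values because no `scoring` argument is given.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Ten R^2 values, one per held-out fold, for a single alpha.
fold_scores = cross_val_score(Ridge(alpha=2.0), X_demo, y_demo, cv=10)
manual_mean = fold_scores.mean()

# The grid search reports the same average for that alpha.
grid_demo = GridSearchCV(Ridge(), {'alpha': [2.0]}, cv=10).fit(X_demo, y_demo)
print(manual_mean, grid_demo.best_score_)
```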


In [355]:
## Instantiate Ridge and fit it.
Ridge_model= Ridge(alpha=2,normalize=False)
Ridge_model.fit(X_train,y_train) ## Applying it on the train data, to obtain the coefficients.

Ridge(alpha=2, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [356]:
## Get the predictions on train and validation data.
pred_train = Ridge_model.predict(X_train)
pred_test = Ridge_model.predict(X_test)

In [357]:
## Get predictions on test data.
test_pred = Ridge_model.predict(test)

In [313]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)



In [359]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [360]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [361]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [362]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('Ridge.csv',index=False)

In [343]:
####################################################### Lasso #################################################################

In [344]:
## Get best parameter values by doing grid search.
model_lasso = Lasso()
grid = GridSearchCV(estimator=model_lasso, param_grid=dict(alpha=alphas),cv=10) ## cv=10 means the score is computed on 10 folds (chunks) of the data and the average is reported.
grid.fit(X_train,y_train)
print(grid)



GridSearchCV(cv=10, error_score=nan,
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=None,
                             selection='cyclic', tol=0.0001, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': array([1.0e+00, 1.0e-01, 1.0e-02, 1.0e-03, 1.0e-04, 0.0e+00, 1.5e+00,
       2.0e+00])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)


In [346]:
## Display best parameters.
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.2537252610628664
0.01
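
With `refit=True` (the default, visible in the printed `GridSearchCV` above), the search object already holds a model refitted on the full training data with the winning `alpha`. A minimal sketch on synthetic data follows. The data and the `_demo` names are assumptions, and `alpha=0` is left out of the grid because scikit-learn advises against it for `Lasso`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

search = GridSearchCV(Lasso(), {'alpha': [1.0, 0.1, 0.01, 0.001]}, cv=5)
search.fit(X_demo, y_demo)

# best_estimator_ is already refitted with the selected alpha.
best_lasso = search.best_estimator_
print(search.best_params_['alpha'], best_lasso.alpha)
pred_demo = best_lasso.predict(X_demo)
```

Reusing `best_estimator_` avoids retyping the selected value by hand when the final model is built.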


In [347]:
## Instantiate Lasso and fit it.
## Note: alpha=2.0 is used here although the grid search above selected alpha=0.01.
Lasso_model= Lasso(alpha=2.0,normalize=False)
Lasso_model.fit(X_train,y_train) ## Applying it on the train data, to obtain the coefficients.

Lasso(alpha=2.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [348]:
## Get the predictions on train and validation data.
pred_train = Lasso_model.predict(X_train)
pred_test = Lasso_model.predict(X_test)

In [349]:
## Get predictions on test data.
test_pred = Lasso_model.predict(test)

In [313]:
## Display RMSLE * 100 value for train and validation data.
print("Train Error:",sqrt(mean_squared_log_error(y_train, pred_train))*100)
print("Test Error:",sqrt(mean_squared_log_error(y_test, pred_test))*100)



In [351]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'record_ID' : test.index,
                          'units_sold' : test_pred})

In [352]:
## Check dimensions of test data.
test.shape

(13860, 9)

In [353]:
## Check dimensions of dataframe.
dataframe.shape

(13860, 2)

In [354]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('Lasso.csv',index=False)