Problem statement

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.

Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.

Note that the data may contain missing values, since some stores might not report every field because of technical glitches. These gaps must be handled before modelling.

Import libraries

In [253]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')


Load and read datasets

In [254]:
# Load the training set
train=pd.read_csv("https://datahack-prod.s3.amazonaws.com/train_file/train_v9rqX0R.csv")
train

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
...,...,...,...,...,...,...,...,...,...,...,...,...
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,,Tier 2,Supermarket Type1,549.2850
8520,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976


In [255]:
train.shape

(8523, 12)

In [256]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [257]:
train.describe()

,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [258]:
# Load the test set
test=pd.read_csv("https://datahack-prod.s3.amazonaws.com/test_file/test_AbJTz2l.csv")
test

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDW58,20.750,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,FDW14,8.300,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1
2,NCN55,14.600,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.0340,OUT017,2007,,Tier 2,Supermarket Type1
4,FDY38,,Regular,0.118599,Dairy,234.2300,OUT027,1985,Medium,Tier 3,Supermarket Type3
...,...,...,...,...,...,...,...,...,...,...,...
5676,FDB58,10.500,Regular,0.013496,Snack Foods,141.3154,OUT046,1997,Small,Tier 1,Supermarket Type1
5677,FDD47,7.600,Regular,0.142991,Starchy Foods,169.1448,OUT018,2009,Medium,Tier 3,Supermarket Type2
5678,NCO17,10.000,Low Fat,0.073529,Health and Hygiene,118.7440,OUT045,2002,,Tier 2,Supermarket Type1
5679,FDJ26,15.300,Regular,0.000000,Canned,214.6218,OUT017,2007,,Tier 2,Supermarket Type1


In [259]:
test.shape

(5681, 11)

In [260]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5681 entries, 0 to 5680
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            5681 non-null   object 
 1   Item_Weight                4705 non-null   float64
 2   Item_Fat_Content           5681 non-null   object 
 3   Item_Visibility            5681 non-null   float64
 4   Item_Type                  5681 non-null   object 
 5   Item_MRP                   5681 non-null   float64
 6   Outlet_Identifier          5681 non-null   object 
 7   Outlet_Establishment_Year  5681 non-null   int64  
 8   Outlet_Size                4075 non-null   object 
 9   Outlet_Location_Type       5681 non-null   object 
 10  Outlet_Type                5681 non-null   object 
dtypes: float64(3), int64(1), object(7)
memory usage: 488.3+ KB


In [261]:
test.describe()

,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year
count,4705.0,5681.0,5681.0,5681.0
mean,12.695633,0.065684,141.023273,1997.828903
std,4.664849,0.051252,61.809091,8.372256
min,4.555,0.0,31.99,1985.0
25%,8.645,0.027047,94.412,1987.0
50%,12.5,0.054154,141.4154,1999.0
75%,16.7,0.093463,186.0266,2004.0
max,21.35,0.323637,266.5884,2009.0


Identify datatypes

In [262]:
train.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

In [263]:
test.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
dtype: object

Check for null values

In [264]:
train.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [265]:
test.isnull().sum()

Item_Identifier                 0
Item_Weight                   976
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1606
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64
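
The raw counts above are easier to compare between the two sets as percentages. A minimal sketch, using a small made-up frame in place of train; the missing_summary helper and the toy values are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0]

# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})
summary = missing_summary(toy)
print(summary)
```

Applied to train, the counts shown above correspond to roughly 17% of Item_Weight values (1463 of 8523) and roughly 28% of Outlet_Size values (2410 of 8523).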

Fill missing values with a sentinel

In [266]:
# Replace every missing value (in both Item_Weight and Outlet_Size) with the sentinel -999
train.fillna(-999, inplace=True)
test.fillna(-999, inplace=True)

Iterative Impute Item Weight
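
The cell for this step is not present in the notebook, and the -999 fill in the previous cell already removes every NaN. What follows is therefore only a sketch of how iterative imputation of Item_Weight could look with scikit-learn's IterativeImputer, assuming the weights are still NaN (that is, the imputer runs before, or instead of, the sentinel fill). The toy frame and the choice of predictor columns are assumptions, not the author's method.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric frame standing in for the numeric columns of `train` (made-up values).
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan, 8.93],
    "Item_Visibility": [0.016, 0.019, 0.017, 0.0, 0.057, 0.0],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 214.5, 53.9],
})

# Each column with gaps is modelled from the other columns, repeated up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)
print(imputed["Item_Weight"])
```

To avoid leaking test information, the imputer would be fitted on the training rows and then applied to the test rows with transform.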

Define X, y and X_test

In [268]:
# Define X, y and X_test; keep the test Item_Identifier column for the submission file

Item_Identifier = test.Item_Identifier

y = train["Item_Outlet_Sales"]
X = train.drop(["Item_Outlet_Sales", "Item_Identifier"], axis=1)
X_test = test.drop(["Item_Identifier"], axis=1)

Split train set for training and validating

In [269]:
# Split into validation and training data
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1, test_size=.1)
X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape

((7670, 10), (7670,), (853, 10), (853,), (5681, 10))

In [270]:
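# np.float was removed in NumPy 1.24; the builtin float is equivalent here.
# Every non-float column (including the integer Outlet_Establishment_Year) is treated as categorical.
categorical_features_indices = np.where(X.dtypes != float)[0]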
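
To see which columns a selection like this picks up, here is a small sketch on a made-up frame. It uses pd.api.types.is_float_dtype as an equivalent way to express "not a float column"; the toy_X frame and variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for X with a mix of float, string and integer columns.
toy_X = pd.DataFrame({
    "Item_Weight": [9.3, 5.92],
    "Item_Fat_Content": ["Low Fat", "Regular"],
    "Item_MRP": [249.8092, 48.2692],
    "Outlet_Establishment_Year": [1999, 2009],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2"],
})

# Positions of all columns whose dtype is not a float dtype.
cat_idx = [i for i, dtype in enumerate(toy_X.dtypes)
           if not pd.api.types.is_float_dtype(dtype)]
cat_names = [toy_X.columns[i] for i in cat_idx]
print(cat_idx, cat_names)
```

The integer year column lands in the categorical list along with the string columns, which is also what happens in the notebook.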

Define model

In [271]:
!pip install catboost



In [272]:
# Import CatBoost and build the model
from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=1000, depth=5, learning_rate=0.1, loss_function='RMSE')
# Fit on the training split and track RMSE on the validation split
model.fit(X_train, y_train, cat_features=categorical_features_indices,
          eval_set=(X_val, y_val), plot=True)
# R^2 on the training data (an optimistic figure, not a validation score)
print(model.score(X_train, y_train))

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	learn: 1623.3000472	test: 1607.5395329	best: 1607.5395329 (0)	total: 25.8ms	remaining: 25.8s
1:	learn: 1546.3159797	test: 1523.2227107	best: 1523.2227107 (1)	total: 43ms	remaining: 21.5s
2:	learn: 1477.6001591	test: 1447.9885533	best: 1447.9885533 (2)	total: 67.5ms	remaining: 22.4s
3:	learn: 1418.7923466	test: 1385.8556254	best: 1385.8556254 (3)	total: 98ms	remaining: 24.4s
4:	learn: 1365.1582144	test: 1331.6080586	best: 1331.6080586 (4)	total: 110ms	remaining: 22s
5:	learn: 1319.5639861	test: 1283.3053371	best: 1283.3053371 (5)	total: 131ms	remaining: 21.7s
6:	learn: 1282.3100836	test: 1244.1761209	best: 1244.1761209 (6)	total: 153ms	remaining: 21.7s
7:	learn: 1249.6937841	test: 1210.7988342	best: 1210.7988342 (7)	total: 177ms	remaining: 22s
8:	learn: 1221.8378614	test: 1181.4504544	best: 1181.4504544 (8)	total: 201ms	remaining: 22.1s
9:	learn: 1199.4875358	test: 1158.1305818	best: 1158.1305818 (9)	total: 215ms	remaining: 21.3s
10:	learn: 1180.0184432	test: 1137.3842443	best: 1137.
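
With loss_function='RMSE', the learn and test columns in this log are RMSE values on the training and validation splits. The same quantity can be computed by hand; a minimal sketch with NumPy, where the rmse helper and the sample numbers are illustrative:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up sales figures and predictions, only to show the call.
print(rmse([3735.1, 443.4, 2097.3], [3600.0, 500.0, 2100.0]))
```

In the notebook this would be called as rmse(y_val, model.predict(X_val)) to get a single validation figure for the fitted model.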

In [273]:
predictions = model.predict(X_test)
predictions

array([1712.25185939, 1372.92021476,  665.19090366, ..., 1856.41004552,
       3550.81744645, 1212.4570794 ])

In [275]:
#Load sample submission
sample=pd.read_csv("https://datahack-prod.s3.amazonaws.com/sample_submission/sample_submission_8RXa3c6.csv")
sample

,Item_Identifier,Outlet_Identifier,Item_Outlet_Sales
0,FDW58,OUT049,1000
1,FDW14,OUT017,1000
2,NCN55,OUT010,1000
3,FDQ58,OUT017,1000
4,FDY38,OUT027,1000
...,...,...,...
5676,FDB58,OUT046,1000
5677,FDD47,OUT018,1000
5678,NCO17,OUT045,1000
5679,FDJ26,OUT017,1000


In [276]:
output = pd.DataFrame({'Item_Identifier': Item_Identifier,'Outlet_Identifier': test.Outlet_Identifier,
                       'Item_Outlet_Sales': predictions})
output.to_csv('submission.csv', index=False)
output

,Item_Identifier,Outlet_Identifier,Item_Outlet_Sales
0,FDW58,OUT049,1712.251859
1,FDW14,OUT017,1372.920215
2,NCN55,OUT010,665.190904
3,FDQ58,OUT017,2505.614295
4,FDY38,OUT027,5859.422900
...,...,...,...
5676,FDB58,OUT046,2087.946737
5677,FDD47,OUT018,2413.512981
5678,NCO17,OUT045,1856.410046
5679,FDJ26,OUT017,3550.817446
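
Before uploading, it can help to confirm that the file matches the layout of the sample submission. A minimal sketch, with a hypothetical check_submission helper and toy frames standing in for sample and output:

```python
import pandas as pd

def check_submission(output: pd.DataFrame, sample: pd.DataFrame) -> bool:
    """Raise AssertionError if `output` does not match the sample's layout."""
    keys = ["Item_Identifier", "Outlet_Identifier"]
    assert list(output.columns) == list(sample.columns), "column names or order differ"
    assert len(output) == len(sample), "row count differs"
    # Rows must appear in the same order with the same item/outlet pairs.
    assert output[keys].reset_index(drop=True).equals(
        sample[keys].reset_index(drop=True)), "item/outlet keys differ"
    assert output["Item_Outlet_Sales"].notnull().all(), "missing predictions"
    return True

# Toy frames mirroring the first rows shown above.
toy_sample = pd.DataFrame({
    "Item_Identifier": ["FDW58", "FDW14"],
    "Outlet_Identifier": ["OUT049", "OUT017"],
    "Item_Outlet_Sales": [1000, 1000],
})
toy_output = toy_sample.assign(Item_Outlet_Sales=[1712.251859, 1372.920215])
ok = check_submission(toy_output, toy_sample)
print(ok)
```

In the notebook the call would be check_submission(output, sample).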
