# 1. Problem Statement 
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim of this data science project is to build a predictive model and find out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.
 The data has missing values as some stores do not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

# 2. Hypothesis Generation 
We could get some hypotheses that might affect the sales, and they are as following: 
1. Sales might be higher at weekends 
2. Sales might be higher on occasions (e.g., Ramadan Month, Feasts, Festivals and so on...)
3. The offers, discounts and packages the store offers makes sales higher 
4. The location of the store and the population in this location 
5. Ads and posters  
6. studying people needs and priorities 
7. maintaining the relationship with the customers 
8. Store Size 
9. Products that the store offers, the more products, the higher the sales 
10. offering children's products 
11. Store age

# 3. Loading Packages and Data 

In [1]:
# import modules
import pandas as pd 
import numpy as np

import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer

In [44]:
def get_mae(X_t, X_v, y_t, y_v):
    # specifying the model 
    model = RandomForestRegressor(random_state=0)

    # fit model 
    model.fit(X_t, y_t)

    # make validation predictions and calculate mae 
    pred_v = model.predict(X_v)
    mae_v = mean_absolute_error(y_v, pred_v)
    
    return mae_v 

In [45]:
# Read the data 
train_data_file_path = '../data/Train.csv'
test_data_file_path  = '../data/Test.csv'

train_data = pd.read_csv(train_data_file_path)
test_data  = pd.read_csv(test_data_file_path)

print("Training data shape: ", train_data.shape)
print("Test data shape: ",     test_data.shape)

Training data shape:  (8523, 12)
Test data shape:  (5681, 11)


In [46]:
train_data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [47]:
test_data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDW58,20.75,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,FDW14,8.3,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1
2,NCN55,14.6,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,,Tier 2,Supermarket Type1
4,FDY38,,Regular,0.118599,Dairy,234.23,OUT027,1985,Medium,Tier 3,Supermarket Type3


# 4. Data Structure and Content 

# 8.Missing Value Treatment 

In [48]:
# get the number of missing data points per column
missing_vals_counts = train_data.isnull().sum()
print("Missing values counts per columns: \n", missing_vals_counts)

Missing values counts per columns: 
 Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64


In [49]:
# get percentage of missing data 
total_cells = np.product(train_data.shape)
total_missing_vals = missing_vals_counts.sum()

percent_missing = (total_missing_vals/total_cells) * 100
print("Percentage of missing data = {:.4f} ".format(percent_missing))

Percentage of missing data = 3.7868 


In [50]:
# get the number of missing data points per column
missing_vals_counts_test = test_data.isnull().sum()
print("Missing values counts per columns: \n", missing_vals_counts_test)

Missing values counts per columns: 
 Item_Identifier                 0
Item_Weight                   976
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1606
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64


In [51]:
# get percentage of missing data 
total_cells_test = np.product(test_data.shape)
total_missing_vals_test = missing_vals_counts_test.sum()

percent_missing_test = (total_missing_vals_test/total_cells_test) * 100
print("Percentage of missing data in test set = {:.4f} ".format(percent_missing_test))

Percentage of missing data in test set = 4.1318 


We see that the missing values are in just two columns, Item_weight and Outlet_size and the percentage of the missing data is small comparing to whole datasets (training and test sets)

There are two ways to deal with this problem, We just can ignore those two columns and drop them or we can perform imputation add fill those null values with some value and get the score for each one to select what is the best choice for us 

## 1. Dropping the missing Values 
drop the two columns `Item_Weight` and `Outlet_Size` from the `train_data` and `test_data`

In [52]:
train_data_dropped = train_data.dropna(axis=1, inplace=True)
test_data_dropped  = test_data.dropna(axis=1, inplace=True)

In [53]:
train_data_dropped.head()

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Tier 1,Supermarket Type1,3735.138
1,DRC01,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Tier 3,Supermarket Type2,443.4228
2,FDN15,Low Fat,0.01676,Meat,141.618,OUT049,1999,Tier 1,Supermarket Type1,2097.27
3,FDX07,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Tier 3,Grocery Store,732.38
4,NCD19,Low Fat,0.0,Household,53.8614,OUT013,1987,Tier 3,Supermarket Type1,994.7052


In [54]:
test_data_dropped.head()

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type
0,FDW58,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Tier 1,Supermarket Type1
1,FDW14,reg,0.038428,Dairy,87.3198,OUT017,2007,Tier 2,Supermarket Type1
2,NCN55,Low Fat,0.099575,Others,241.7538,OUT010,1998,Tier 3,Grocery Store
3,FDQ58,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,Tier 2,Supermarket Type1
4,FDY38,Regular,0.118599,Dairy,234.23,OUT027,1985,Tier 3,Supermarket Type3


In [55]:
# getting only the nmerical data for testing the model 
features = ['Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']
X_data = train_data_dropped[features]
X_test = test_data_dropped[features]
y_data = train_data['Item_Outlet_Sales']

# Splitting training data into train and validation data 
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, train_size=0.8)

mae = get_mae(X_train, X_val, y_train, y_val)
print("Mean absolute error = {:.4f}".format(mae))

Mean absolute error = 981.7385


## 2. Imputation

In [59]:
# strategies = ['mean', 'median', 'most_frequent', 'constant']

# for strategy in strategies:
imputer = SimpleImputer(strategy='median')
imputed_X_train = pd.DataFrame(imputer.fit_transform(train_data[features]))
imputed_X_test  = pd.DataFrame(imputer.transform(test_data[features]))

imputed_X_train.columns = features
imputed_X_test.columns  = features

# Splitting training data into train and validation data 
X_train, X_val, y_train, y_val = train_test_split(imputed_X_train, y_data, 
                                                  test_size=0.2, train_size=0.8)

mae = get_mae(X_train, X_val, y_train, y_val)
print("Strategy: {}, Mean absolute error = {:.4f}".format(strategy, mae))

Strategy: constant, Mean absolute error = 937.6367


It's more efficient if we use imputation with median