# Regression Benchmark 

### Probem Example (Regression) - Big Mart Sales:
To build a predictive model and find out the sales of each product at the store.

Good starting point:
- Mean - whats has been total sales of each product month on month
- Mean with respect to another variable

The two most commonly used measures of central tendency for numerical data are the mean and the median. Since the regression problem deals with continuous data, mean and median are the correct measures.


To evaluate the model:

Mean Absolute Error -> sum of abs different between every observation, divided by number of obs


In [1]:
#importing libraries 

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('train_bm.csv')

In [4]:
data.shape

(8523, 12)

In [56]:
data.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [6]:
data.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

### Shuffling and Creating Train and Test Set

In [8]:
from sklearn.utils import shuffle

In [12]:
#shuffle dataset

data = shuffle(data, random_state=42)

#creating 4 division of data
div = int(data.shape[0]/4)

# 3 parts to train set and 1 part to test set
train = data.iloc[:3*div+1,:]
test = data.iloc[3*div+1:]

In [13]:
train.shape, test.shape, data.shape

((6391, 12), (2132, 12), (8523, 12))

## simple mean model (benchmark)

Find out the simple mean of train model and save it as a column in test dataset


In [15]:
test['simple_mean'] = train.Item_Outlet_Sales.mean()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['simple_mean'] = train.Item_Outlet_Sales.mean()


Calculate the error between the mean values created in line above with the actual value in Item_Outlet_Sales()


In [57]:
from sklearn.metrics import mean_absolute_error as MAE

simple_mean_error = MAE(test.Item_Outlet_Sales, test.simple_mean)
simple_mean_error

1348.3091635746123

This is the benchmark value for model we will create to predict accuracy of the models.

## Mean Item Outlet Sales with respect to Outlet_Type

Now, we will try to improve the above prediction by predicting based on outlet tyoe


In [27]:
out_type = pd.pivot_table(train, values = 'Item_Outlet_Sales', index=['Outlet_Type'], aggfunc=np.mean)

out_type

Unnamed: 0_level_0,Item_Outlet_Sales
Outlet_Type,Unnamed: 1_level_1
Grocery Store,334.106148
Supermarket Type1,2293.636762
Supermarket Type2,2034.330733
Supermarket Type3,3684.008727


In [29]:
# initializing new column to zero
test['Out_type_mean'] = 0

for i in test.Outlet_Type.unique():
    test['Out_type_mean'][test.Outlet_Type == str(i)] = train['Item_Outlet_Sales'][train.Outlet_Type == str(i)].mean()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['Out_type_mean'] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['Out_type_mean'][test.Outlet_Type == str(i)] = train['Item_Outlet_Sales'][train.Outlet_Type == str(i)].mean()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(~key, value, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-d

In [30]:
test.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,simple_mean,Out_type_mean
432,FDF10,15.5,Regular,0.157172,Snack Foods,149.1418,OUT049,1999,Medium,Tier 1,Supermarket Type1,588.5672,2169.533,2293.636762
4451,FDZ37,,Regular,0.019673,Canned,86.4198,OUT027,1985,Medium,Tier 3,Supermarket Type3,1918.8356,2169.533,3684.008727
1412,DRF23,4.61,Low Fat,0.123346,Hard Drinks,172.5396,OUT017,2007,,Tier 2,Supermarket Type1,3663.2316,2169.533,2293.636762
1329,NCQ41,,Low Fat,0.019386,Health and Hygiene,194.5794,OUT027,1985,Medium,Tier 3,Supermarket Type3,3511.4292,2169.533,3684.008727
6874,NCM17,7.93,Low Fat,0.071426,Health and Hygiene,45.9086,OUT018,2009,Medium,Tier 3,Supermarket Type2,1070.6064,2169.533,2034.330733


In [32]:
#mean suqare error

err = MAE(test.Item_Outlet_Sales, test.Out_type_mean)
err

1114.8889656414237

Conclusion: As we can see, the MAE did improve by using Outlet Type.

## Mean Item Outlet Sales with respect to Outlet_Establishment_Year


In [35]:
establis_year = pd.pivot_table(data, values='Item_Outlet_Sales', index= ['Outlet_Establishment_Year'], aggfunc=np.mean)

establis_year

Unnamed: 0_level_0,Item_Outlet_Sales
Outlet_Establishment_Year,Unnamed: 1_level_1
1985,2483.677474
1987,2298.995256
1997,2277.844267
1998,339.351662
1999,2348.354635
2002,2192.384798
2004,2438.841866
2007,2340.675263
2009,1995.498739


In [42]:
test['establisment_year_mean'] = 0

for i in data.Outlet_Establishment_Year.unique():
    test['establisment_year_mean'][test.Outlet_Establishment_Year == str(i)] = train['Item_Outlet_Sales'][train['Outlet_Establishment_Year'] == str(i)].mean()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['establisment_year_mean'] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['establisment_year_mean'][test.Outlet_Establishment_Year == str(i)] = train['Item_Outlet_Sales'][train['Outlet_Establishment_Year'] == str(i)].mean()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(~key, value, inplace=True)


In [43]:
errorrr = MAE(test.Item_Outlet_Sales, test.establisment_year_mean)

errorrr

2216.5290828330203

Conclusion: As we can see, the MAE did not improve by using establishment year.

## Mean Item_Outlet_Sales with respect to both Outlet_Location_Type and Outlet_Establishment_Year

Now we will use 2 features to create our model. Making the model more complex

In [46]:
combo_mean = pd.pivot_table(train, values='Item_Outlet_Sales', index = ['Outlet_Location_Type', 'Outlet_Establishment_Year'], aggfunc=np.mean)

combo_mean

Unnamed: 0_level_0,Unnamed: 1_level_0,Item_Outlet_Sales
Outlet_Location_Type,Outlet_Establishment_Year,Unnamed: 2_level_1
Tier 1,1985,332.70906
Tier 1,1997,2249.438082
Tier 1,1999,2368.598566
Tier 2,2002,2105.096784
Tier 2,2004,2435.711052
Tier 2,2007,2350.448072
Tier 3,1985,3684.008727
Tier 3,1987,2254.35211
Tier 3,1998,335.469243
Tier 3,2009,2034.330733


In [47]:
# Initiating new empty column
test['Super_mean'] = 0

# Assigning variables to strings ( to shorten code length)
s2 = 'Outlet_Location_Type'
s1 = 'Outlet_Establishment_Year'

# For every Unique Value in s1
for i in test[s1].unique():
  # For every Unique Value in s2
  for j in test[s2].unique():
    # Calculate and Assign mean to new column, corresponding to both unique values of s1 and s2 simultaneously
    test['Super_mean'][(test[s1] == i) & (test[s2]==str(j))] = train['Item_Outlet_Sales'][(train[s1] == i) & (train[s2]==str(j))].mean()

here(~key, value, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['Super_mean'][(test[s1] == i) & (test[s2]==str(j))] = train['Item_Outlet_Sales'][(train[s1] == i) & (train[s2]==str(j))].mean()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(~key, value, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['Super_mean'][(test[s1] == i) & (test[s2]==str(j))] = train['Item_Outlet_Sales'][(train[s1] == i) & (train[s2]==str(j))].mean()
A value is trying to be set on a copy o

In [48]:
#calculating mean absolute error
super_mean_error = MAE(test['Item_Outlet_Sales'] , test['Super_mean'] )
super_mean_error

1118.0230715619844

# Classification Benchmark

### Probem Example (Classification) - Titanic:
To predict wether a passenger of titanic would have survived or not?

Good point to start: Mode

To Evaluate: Accuracy - Correctly predict observation upon total observation


In [50]:
#importing libraries 
import pandas as pd 
import numpy as np
from sklearn.metrics import accuracy_score

In [54]:
data2 = pd.read_csv("train.csv")
data2.shape

(891, 12)

In [55]:
data2.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
