## Baseline Regressor (DummyRegressor)

A baseline regressor makes predictions with simple rules that ignore the input features. It serves as a reference point: a real model is only useful if it scores better than this baseline.
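As a rough illustration of how such a baseline is typically used, the sketch below compares a mean-predicting `DummyRegressor` with a `LinearRegression` on a held-out split of the diabetes data. The choice of `LinearRegression` as the candidate model, the 25% test size, and `random_state=0` are assumptions made for this example only.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out a test set so both models are judged on unseen data.
# (test_size and random_state are arbitrary choices for this sketch.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The baseline ignores X and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
# The candidate model (an assumption for illustration) uses the features.
candidate = LinearRegression().fit(X_train, y_train)

baseline_r2 = baseline.score(X_test, y_test)
candidate_r2 = candidate.score(X_test, y_test)
print(f"baseline R^2:  {baseline_r2:.3f}")
print(f"candidate R^2: {candidate_r2:.3f}")
```

A candidate model that cannot beat the baseline score is not learning anything useful from the features.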

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor

### Load datasets

In [2]:
# we are using scikit-learn's diabetes toy dataset
dataset = load_diabetes()

print("Dataset features")
print(dataset.feature_names)
print("Total sample in data ", len(dataset.data))

dataset.data[:5, :]

Dataset features
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Total sample in data  442


array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])

### Dataset description

In [3]:
X_df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
X_df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |


In [4]:
X_df.mean(axis=0)

age   -3.634285e-16
sex    1.308343e-16
bmi   -8.045349e-16
bp     1.281655e-16
s1    -8.835316e-17
s2     1.327024e-16
s3    -4.574646e-16
s4     3.777301e-16
s5    -3.830854e-16
s6    -3.412882e-16
dtype: float64

In [5]:
np.linalg.norm(X_df, axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Every feature column has a mean of zero (up to floating-point error) and a unit L2 norm, so this dataset is already mean-centered and normalized.
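The two checks above can also be written as explicit boolean tests rather than read off the printed arrays. This is a minimal sketch; the variable names `is_centered` and `has_unit_norm` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data

# Column means should be zero up to floating-point error.
is_centered = np.allclose(X.mean(axis=0), 0.0)
# Column-wise L2 norms should all equal one.
has_unit_norm = np.allclose(np.linalg.norm(X, axis=0), 1.0)

print("mean-centered:", is_centered)
print("unit norm:    ", has_unit_norm)
```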

### Dummy Regressor Strategies

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html

The dummy regressor offers several prediction strategies. The following four are supported by scikit-learn's `DummyRegressor` class.

* “mean”: always predicts the mean of the training set
* “median”: always predicts the median of the training set
* “quantile”: always predicts a specified quantile of the training set, provided with the quantile parameter.
* “constant”: always predicts a constant value that is provided by the user.
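To make the four strategies concrete, the sketch below fits each one on a tiny hand-made target vector and prints the single value it predicts. The toy targets, the all-zero feature matrix, the quantile of 0.25, and the constant of 5.0 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy targets with one large value, so mean and median differ.
y = np.array([1.0, 2.0, 3.0, 10.0])
# The features are ignored by every strategy; zeros are enough.
X = np.zeros((len(y), 1))

models = {
    "mean": DummyRegressor(strategy="mean"),
    "median": DummyRegressor(strategy="median"),
    "quantile": DummyRegressor(strategy="quantile", quantile=0.25),
    "constant": DummyRegressor(strategy="constant", constant=5.0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    # Every row gets the same prediction, so keep the full array
    # for checking and print only the first entry.
    predictions[name] = model.predict(X)
    print(f"{name:>8}: {predictions[name][0]}")
```

Each model returns the same number for every input row, which is what makes these regressors a feature-free baseline.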

### Strategy: Mean

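This strategy predicts the mean of the training targets for every sample. The `score` method returns the coefficient of determination R², so a mean predictor evaluated on its own training data scores exactly 0.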

In [6]:
x = dataset.data
y = dataset.target

# the mean is the constant prediction that maximizes R^2 on the training data
dummy_model = DummyRegressor(strategy='mean')

dummy_model.fit(x,y)

print(dummy_model.score(x,y))


0.0
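The score of 0.0 follows directly from the definition R² = 1 - SS_res / SS_tot: when every prediction equals the target mean, the residual sum of squares equals the total sum of squares. The sketch below computes R² by hand on a small made-up target vector; the helper `r2` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, computed from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 10.0])

# Predicting the mean everywhere makes ss_res equal to ss_tot.
mean_score = r2(y, np.full_like(y, y.mean()))
# Any other constant, such as the median, has a larger ss_res.
median_score = r2(y, np.full_like(y, np.median(y)))

print("mean strategy R^2:  ", mean_score)
print("median strategy R^2:", median_score)
```

Any constant other than the mean gives a negative training score, which is why the other strategies in this notebook score below zero.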


### Strategy: Median

This strategy predicts the median of the training targets for every sample.


In [7]:
x = dataset.data
y = dataset.target

# predict the median of the training targets
dummy_model = DummyRegressor(strategy='median')

dummy_model.fit(x,y)

print(dummy_model.score(x,y))


-0.02282303217029802


### Strategy: constant

This strategy predicts a user-supplied constant for every sample.


In [8]:
x = dataset.data
y = dataset.target

# always predict the user-supplied constant (here 3)
dummy_model = DummyRegressor(strategy='constant', constant=3)

dummy_model.fit(x,y)

print(dummy_model.score(x,y))


-3.7506286353302976
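A strongly negative score like this one appears when the constant lies far from the target mean. For a constant prediction c, the definition of R² reduces to R² = -n (ȳ - c)² / SS_tot, so the penalty grows with the squared distance between c and the mean. The sketch below checks that identity numerically on made-up targets; the toy data and variable names are assumptions for illustration.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 10.0])
c = 3.0  # the constant predicted for every sample

ss_tot = np.sum((y - y.mean()) ** 2)

# R^2 from its definition, with every prediction equal to c.
r2_direct = 1.0 - np.sum((y - c) ** 2) / ss_tot
# R^2 from the closed form for a constant prediction.
r2_closed_form = -len(y) * (y.mean() - c) ** 2 / ss_tot

print(r2_direct, r2_closed_form)
```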


### Strategy: quantile

Given a quantile, this strategy predicts the corresponding quantile of the training targets for every sample.


In [9]:
x = dataset.data
y = np.asarray(dataset.target, dtype='float')

# we choose the best quantile to get the best prediction.
dummy_model = DummyRegressor(strategy='quantile', quantile=0.554)

dummy_model.fit(x,y)

print(dummy_model.score(x,y))


-0.00011324912791921271
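As a sanity check on what the `quantile` parameter means, the sketch below shows that a quantile of 0.5 reproduces the median strategy, while 0.0 and 1.0 return the smallest and largest training targets. The toy targets and the all-zero feature matrix are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

y = np.array([1.0, 2.0, 3.0, 10.0])
X = np.zeros((len(y), 1))  # features are ignored by the dummy regressor

def constant_prediction(**params):
    """Fit a DummyRegressor and return the single value it predicts."""
    model = DummyRegressor(**params).fit(X, y)
    return model.predict(X)[0]

q_half = constant_prediction(strategy="quantile", quantile=0.5)
median = constant_prediction(strategy="median")
q_low = constant_prediction(strategy="quantile", quantile=0.0)
q_high = constant_prediction(strategy="quantile", quantile=1.0)

print("quantile=0.5:", q_half, "| median:", median)
print("quantile=0.0:", q_low, "| quantile=1.0:", q_high)
```

Because the training score of a constant predictor depends only on its distance from the target mean, tuning the quantile on the training data amounts to finding the quantile whose value lies closest to that mean.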
