<a href="https://colab.research.google.com/github/ShaunakSen/Data-Science-and-Machine-Learning/blob/master/Probablilistic_Prediction_NGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NGBoost: Natural Gradient Boosting for Probabilistic Prediction

[by Tony Duan*, Anand Avati*, Daisy Yi Ding, Sanjay Basu, Andrew Ng, Alejandro Schuler](https://stanfordmlgroup.github.io/projects/ngboost/)

Related article: [interpreting the probabilistic predictions from NGBoost](https://towardsdatascience.com/interpreting-the-probabilistic-predictions-from-ngboost-868d6f3770b2)

In [1]:
!pip install --upgrade git+https://github.com/stanfordmlgroup/ngboost.git

Collecting git+https://github.com/stanfordmlgroup/ngboost.git
  Cloning https://github.com/stanfordmlgroup/ngboost.git to /tmp/pip-req-build-23wledcl
  Running command git clone -q https://github.com/stanfordmlgroup/ngboost.git /tmp/pip-req-build-23wledcl
Collecting tqdm>=4.36.1
  Downloading https://files.pythonhosted.org/packages/72/c9/7fc20feac72e79032a7c8138fd0d395dc6d8812b5b9edf53c3afd0b31017/tqdm-4.41.1-py2.py3-none-any.whl (56kB)
     |████████████████████████████████| 61kB 3.1MB/s
Collecting lifelines>=0.22.8
  Downloading https://files.pythonhosted.org/packages/a5/8e/56c7d3bba5cf2f579a664c553900a2273802e0582bd4bdd809cdd6755b01/lifelines-0.23.6-py2.py3-none-any.whl (407kB)
     |████████████████████████████████| 409kB 15.2MB/s
Collecting autograd-gamma>=0.3
  Downloading https://files.pythonhosted.org/packages/3e/87/788c4bf90cc5c534cb3b7fdb5b719175e33e2658decce75e35e2ce69766f/autograd_gamma-0.4.1-py2.py3-none-any.whl
Building wheels for collected packa

In [0]:
from ngboost import NGBRegressor

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [5]:
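# Load the Boston housing data as feature matrix X and target vector Y
X, Y = load_boston(return_X_y=True)

print(X.shape, Y.shape)

# Hold out 20% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)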

(506, 13) (506,)
(404, 13) (102, 13) (404,) (102,)


In [6]:
ngb = NGBRegressor().fit(X_train, Y_train)

[iter 0] loss=3.5974 val_loss=0.0000 scale=0.5000 norm=3.2173
[iter 100] loss=3.0258 val_loss=0.0000 scale=1.0000 norm=3.5887
[iter 200] loss=2.3739 val_loss=0.0000 scale=2.0000 norm=3.8411
[iter 300] loss=1.9809 val_loss=0.0000 scale=2.0000 norm=3.0917
[iter 400] loss=1.8184 val_loss=0.0000 scale=1.0000 norm=1.4123


In [8]:
# Point predictions (one value per test row)
Y_preds = ngb.predict(X_test)
print(Y_preds.shape)

# Full predictive distributions for the same rows
Y_dists = ngb.pred_dist(X_test)
print(Y_dists)

(102,)
<ngboost.distns.normal.Normal object at 0x7fdb730348d0>


In [12]:
# test Mean Squared Error (scikit-learn's argument order is y_true, y_pred)
test_MSE = mean_squared_error(Y_test, Y_preds)
print('Test MSE', test_MSE)

# test Negative Log Likelihood: mean of -log p(y) under the predicted distributions
test_NLL = -Y_dists.logpdf(Y_test.flatten()).mean()
print('Test NLL', test_NLL)

Test MSE 18.75183128890201
Test NLL 4.9482649598617385
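
The NLL reported above is the mean of $-\log p(y_i \mid x_i)$ over the test rows, where each $p$ is the predicted distribution for that row. As a rough sketch of what `logpdf` computes when the predictive distribution is Normal, the snippet below compares SciPy's log-density with the closed-form expression. The arrays `y_true`, `mu`, and `sigma` are made-up illustrative values, not taken from the notebook.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical targets and predicted Normal parameters (illustrative only).
y_true = np.array([21.0, 15.5, 30.2])
mu = np.array([20.0, 16.0, 28.0])
sigma = np.array([2.0, 1.5, 3.0])

# NLL via SciPy's log-density, averaged over the rows.
nll_scipy = -norm.logpdf(y_true, loc=mu, scale=sigma).mean()

# Same quantity from the closed-form Normal log-density:
# -log p(y) = 0.5 * log(2 * pi * sigma^2) + (y - mu)^2 / (2 * sigma^2)
nll_manual = np.mean(
    0.5 * np.log(2 * np.pi * sigma**2) + (y_true - mu) ** 2 / (2 * sigma**2)
)

print(nll_scipy, nll_manual)
```

Unlike MSE, this score also depends on the predicted spread `sigma`, so an overconfident (too narrow) distribution is penalised even when its mean is close to the target.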


To find the median of each predicted distribution, we can use the percent point function `ppf`, which is the inverse of the CDF; evaluating it at 0.5 gives the median (see the [SciPy stats tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html)).
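
To make the inverse relationship concrete, here is a minimal sketch using a frozen `scipy.stats.norm` as a stand-in for the predicted distributions. The `loc` and `scale` values are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical frozen Normal distributions, one per "prediction".
dist = norm(loc=np.array([10.0, 25.0]), scale=np.array([2.0, 5.0]))

# ppf(0.5) is the value below which 50% of the probability mass lies.
median = dist.ppf(0.5)
print(median)

# Round trip: applying the cdf to ppf(q) recovers q for every distribution.
q = 0.9
print(dist.cdf(dist.ppf(q)))
```

The same call with other quantiles, for example `ppf(0.025)` and `ppf(0.975)`, gives the endpoints of a central interval.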

In [32]:
# Y_dists.dist is the underlying frozen SciPy distribution
print(type(Y_dists.dist))

# Median of each predicted distribution
Y_dists.dist.ppf(0.5)

<class 'scipy.stats._distn_infrastructure.rv_frozen'>


array([21.72825831, 15.54410556, 20.38922011, 20.41156068, 26.12046596,
       23.14957797,  8.46899507, 37.44732939, 23.04267279, 15.87599073,
       24.62382431, 15.47866909,  8.5290854 , 40.64048433, 16.8322865 ,
       21.29569447, 27.2787986 , 15.0343379 , 15.12239093, 25.70018231,
       16.478637  , 43.96227489, 14.36452463, 16.01983683, 16.32283119,
       11.70507119, 19.19341592, 20.66641506, 33.27494618, 16.49127694,
       25.00087217, 34.32202973, 21.11291837, 12.46654848, 10.50409889,
        9.88669404, 45.30848307, 21.6108767 , 19.01225634, 19.93676444,
       22.04976044, 23.10629344, 47.42229759, 21.16876433, 20.07038595,
       19.73106263, 13.26867924, 44.01917015, 20.69665256, 15.68032965,
       18.70953643, 24.7439399 , 44.85483777, 10.89654967, 11.85616995,
       26.01056295, 22.63461401, 23.38867706, 22.39245795, 13.41065386,
       15.81649381, 23.10967332, 41.77723496, 19.37421739, 20.9184772 ,
       28.47108849, 48.80310534, 17.60671394,  9.4210476 , 35.76

In [22]:
Y_preds

array([21.72825831, 15.54410556, 20.38922011, 20.41156068, 26.12046596,
       23.14957797,  8.46899507, 37.44732939, 23.04267279, 15.87599073,
       24.62382431, 15.47866909,  8.5290854 , 40.64048433, 16.8322865 ,
       21.29569447, 27.2787986 , 15.0343379 , 15.12239093, 25.70018231,
       16.478637  , 43.96227489, 14.36452463, 16.01983683, 16.32283119,
       11.70507119, 19.19341592, 20.66641506, 33.27494618, 16.49127694,
       25.00087217, 34.32202973, 21.11291837, 12.46654848, 10.50409889,
        9.88669404, 45.30848307, 21.6108767 , 19.01225634, 19.93676444,
       22.04976044, 23.10629344, 47.42229759, 21.16876433, 20.07038595,
       19.73106263, 13.26867924, 44.01917015, 20.69665256, 15.68032965,
       18.70953643, 24.7439399 , 44.85483777, 10.89654967, 11.85616995,
       26.01056295, 22.63461401, 23.38867706, 22.39245795, 13.41065386,
       15.81649381, 23.10967332, 41.77723496, 19.37421739, 20.9184772 ,
       28.47108849, 48.80310534, 17.60671394,  9.4210476 , 35.76
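
The values in `Y_preds` match the medians from `ppf(0.5)`, presumably because the predicted distributions are Normal and a Normal distribution's median equals its mean. The sketch below illustrates this with SciPy, and contrasts it with a skewed distribution where the two differ. The parameters are hypothetical, and the log-normal is used purely for contrast, not because the notebook fits one.

```python
import numpy as np
from scipy.stats import lognorm, norm

# Symmetric case: median and mean coincide.
symmetric = norm(loc=20.0, scale=3.0)
print(symmetric.ppf(0.5), symmetric.mean())

# Skewed case (illustrative): for lognorm the median equals `scale`,
# while the mean is pulled upward by the long right tail.
skewed = lognorm(s=1.0, scale=20.0)
print(skewed.ppf(0.5), skewed.mean())
```

With a non-symmetric output distribution, the point prediction and the median would therefore need to be reported separately.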