<a href="https://colab.research.google.com/github/Nili3005/ML/blob/main/Project_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [50]:
!pip install --upgrade ngboost
!pip install ucimlrepo



# Usage

We'll start with a probabilistic regression example on the Boston housing dataset:

In [51]:
import sys
sys.path.append('/Users/c242587/Desktop/projects/git/ngboost')

In [52]:
from ngboost import NGBRegressor

import pandas as pd
import numpy as np

#from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#X, Y = load_boston(True)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
Y = raw_df.values[1::2, 2]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

[iter 0] loss=3.6195 val_loss=0.0000 scale=1.0000 norm=6.5324
[iter 100] loss=2.7106 val_loss=0.0000 scale=2.0000 norm=4.9724
[iter 200] loss=2.1570 val_loss=0.0000 scale=2.0000 norm=3.4388
[iter 300] loss=1.8890 val_loss=0.0000 scale=2.0000 norm=2.9728
[iter 400] loss=1.7590 val_loss=0.0000 scale=1.0000 norm=1.3779
Test MSE 13.086311863020699
Test NLL 3.7126910473017074


Getting the estimated distributional parameters at a set of points is easy. This returns the predicted mean and standard deviation of the first five observations in the test set:

In [53]:
Y_dists[0:5].params

{'loc': array([34.03382626, 35.97574997, 30.90244133, 34.35309589, 21.91208885]),
 'scale': array([1.593843  , 1.47643172, 1.31619038, 1.30195658, 1.39903428])}

## Distributions

NGBoost can be used with a variety of distributions, broken down into those for regression (support on an infinite set) and those for classification (support on a finite set).

### Regression Distributions

| Distribution | Parameters | Implemented Scores | Reference |
| --- | --- | --- | --- |
| `Normal` | `loc`, `scale` | `LogScore`, `CRPScore` | [`scipy.stats` normal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) |
| `LogNormal` | `s`, `scale` | `LogScore`, `CRPScore` | [`scipy.stats` lognormal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) |
| `Exponential` | `scale` | `LogScore`, `CRPScore` | [`scipy.stats` exponential](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html) |

Regression distributions can be used through the `NGBRegressor()` constructor by passing the appropriate class as the `Dist` argument. `Normal` is the default.

In [54]:
from ngboost.distns import Exponential, Normal

#X, Y = load_boston(True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)

There are two prediction methods for `NGBRegressor` objects: `predict()`, which returns point predictions as one would expect from a standard regressor, and `pred_dist()`, which returns a distribution object representing the conditional distribution of $Y|X=x_i$ at the points $x_i$ in the test set.

In [55]:
ngb_norm.predict(X_reg_test)[0:5]

array([13.31615402,  9.87803185, 19.78090786, 11.65408641, 17.00388446])

In [56]:
ngb_exp.predict(X_reg_test)[0:5]

array([12.95094405,  9.37525238, 20.445523  , 10.6854388 , 17.27326115])

In [57]:
ngb_exp.pred_dist(X_reg_test)[0:5].params

{'scale': array([12.95094405,  9.37525238, 20.445523  , 10.6854388 , 17.27326115])}

#### Survival Regression

NGBoost supports analyses of right-censored data. Any distribution that can be used for regression in NGBoost can also be used for survival analysis in theory, but this requires the implementation of the right-censored version of the appropriate score. At the moment, `LogNormal` and `Exponential` have these scores implemented. To do survival analysis, use `NGBSurvival` and pass both the time-to-event (or censoring) and event indicator vectors to  `fit()`:

In [58]:
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

#X, Y = load_boston(True)
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30) # time of an event or censoring
E_surv_train = Y_surv_train > 30 # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)

[iter 0] loss=1.2640 val_loss=0.0000 scale=4.0000 norm=2.4457
[iter 100] loss=0.5586 val_loss=0.0000 scale=2.0000 norm=0.8612
[iter 200] loss=0.3064 val_loss=0.0000 scale=2.0000 norm=0.5166
[iter 300] loss=0.1419 val_loss=0.0000 scale=2.0000 norm=0.2929
[iter 400] loss=-0.0028 val_loss=0.0000 scale=2.0000 norm=0.2143


The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.

### Classification Distributions

| Distribution | Parameters | Implemented Scores | Reference |
| --- | --- | --- | --- |
| `k_categorical(K)` | `p0`, `p1`... `p{K-1}` | `LogScore` | [Categorical distribution on Wikipedia](https://en.wikipedia.org/wiki/Categorical_distribution) |
| `Bernoulli` | `p` | `LogScore` | [Bernoulli distribution on Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_distribution) |

Classification distributions can be used through the `NGBClassifier()` constructor by passing the appropriate class as the `Dist` argument. `Bernoulli` is the default and is equivalent to `k_categorical(2)`.

In [59]:
from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

#X, y = load_breast_cancer(True)
data = load_breast_cancer()
X, y = data.data, data.target
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}

When using NGBoost for classification, the outcome vector `Y` must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification standards in sklearn.

`NGBClassifier` objects have three prediction methods: `predict()` returns the most likely class, `predict_proba()` returns the class probabilities, and `pred_dist()` returns the distribution object.

In [60]:
ngb_cat.predict(X_cls_test)[0:5]

array([1, 1, 1, 1, 1])

In [61]:
ngb_cat.predict_proba(X_cls_test)[0:5]

array([[3.14557799e-03, 9.96659347e-01, 1.95074652e-04],
       [3.14557799e-03, 9.96659347e-01, 1.95074652e-04],
       [1.29217245e-01, 8.63272497e-01, 7.51025775e-03],
       [1.34545715e-02, 9.86223058e-01, 3.22370342e-04],
       [3.14561652e-03, 9.96671557e-01, 1.82826919e-04]])

In [62]:
ngb_cat.pred_dist(X_cls_test)[0:5].params

{'p0': array([0.00314558, 0.00314558, 0.12921725, 0.01345457, 0.00314562]),
 'p1': array([0.99665935, 0.99665935, 0.8632725 , 0.98622306, 0.99667156]),
 'p2': array([0.00019507, 0.00019507, 0.00751026, 0.00032237, 0.00018283])}

## Scores

NGBoost supports the log score (`LogScore`, also known as negative log-likelihood) and CRPS (`CRPScore`), although each score may not be implemented for each distribution. The score is specified by the `Score` argument in the constructor.

In [63]:
from ngboost.scores import LogScore, CRPScore
from sklearn.metrics import accuracy_score

#NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
#NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)

reg_model = NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
cls_model = NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)

#For Regression model - Prediction and MSE for accuracy
reg_predictions= reg_model.predict(X_reg_test)
print(reg_predictions)

mse = mean_squared_error(Y_reg_test, reg_predictions)
print(f'Mean Squared Error: {mse}')


#For Classifier model - Prediction and accuracy
cls_predictions = cls_model.predict(X_cls_test)
print(cls_predictions)

predicted_labels = np.round(cls_predictions).astype(int)
accuracy = accuracy_score(Y_cls_test, predicted_labels)
print(f'Accuracy: {accuracy}')


[14.17693471 13.28605331 20.79865025 13.93046904 16.92209987 27.61295736
 36.51742064 16.92209987 26.8370973  19.48135766 18.12556832 28.48264484
 20.12121998 18.59218993 27.03332705 18.8487727  29.40795682 18.2974402
 13.19500119 21.85550529 20.46808549 24.08839659 19.82831636 40.6764783
 20.53458014 28.16397455 14.97565519 26.20634875 17.01909086 22.43035361
 15.97739817 24.08839659 15.59258743 15.152676   16.84709104 26.26426617
 26.54678062 20.12121998 22.42232367 18.39270282 18.25591858 13.93046904
 18.59218993 19.82831636 37.33462563 36.45875522 20.12121998 21.53735576
 24.92782823 21.53735576 22.42232367 29.54386428 40.20427171 28.91507638
 20.08090466 19.82831636 22.095923   22.42232367 26.47204993 22.43035361
 15.77907275 18.29870551 19.82831636 21.53735576 27.32609216 21.53735576
 16.11670173 22.42232367 20.53458014 29.54386428 16.66211311 16.80738655
 20.53458014 20.34929452 18.58400491 22.43035361 21.55594542 24.71312458
 15.04703298 21.53735576 30.25587799 16.30877251 19.8

## Base Learners

NGBoost can be used with any sklearn regressor as the base learner, specified with the `Base` argument. The default is a depth-3 regression tree.

In [64]:
from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)

## Other Arguments

The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:

In [65]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
             minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)

[iter 0] loss=3.6546 val_loss=0.0000 scale=1.0000 norm=6.8256


Sample weights (for training) are set using the `sample_weight` argument to `fit`.

In [66]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
             minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)

[iter 0] loss=3.7596 val_loss=0.0000 scale=1.0000 norm=7.5870
