<a href="https://colab.research.google.com/github/Nili3005/ML/blob/main/Project_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [36]:
!pip install --upgrade ngboost
!pip install ucimlrepo



# Usage

We'll start with a probabilistic regression example on the Boston housing dataset:

In [37]:
import sys
sys.path.append('/Users/c242587/Desktop/projects/git/ngboost')

In [38]:
from ngboost import NGBRegressor

import pandas as pd
import numpy as np

#from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from ucimlrepo import fetch_ucirepo

#Fetch dataset
wine_quality = fetch_ucirepo(id=186)

X = wine_quality.data.features
Y = wine_quality.data.targets

#Flatten Y using ravel()
Y = np.ravel(Y)

#X, Y = load_boston(True)
#data_url = "http://lib.stat.cmu.edu/datasets/boston"
#raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
#X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
#Y = raw_df.values[1::2, 2]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

[iter 0] loss=1.2815 val_loss=0.0000 scale=1.0000 norm=0.9147
[iter 100] loss=1.0967 val_loss=0.0000 scale=1.0000 norm=0.7851
[iter 200] loss=1.0266 val_loss=0.0000 scale=2.0000 norm=1.5252
[iter 300] loss=0.9854 val_loss=0.0000 scale=1.0000 norm=0.7536
[iter 400] loss=0.9604 val_loss=0.0000 scale=1.0000 norm=0.7507
Test MSE 0.4960218270619618
Test NLL 1.0471328009304808


Getting the estimated distributional parameters at a set of points is easy. This returns the predicted mean and standard deviation of the first five observations in the test set:

In [39]:
Y_dists[0:5].params

{'loc': array([6.54208419, 5.67759248, 6.07989233, 5.85556052, 5.62922047]),
 'scale': array([0.7595685 , 0.57332335, 0.62923854, 0.60181325, 0.57764623])}

## Distributions

NGBoost can be used with a variety of distributions, broken down into those for regression (support on an infinite set) and those for classification (support on a finite set).

### Regression Distributions

| Distribution | Parameters | Implemented Scores | Reference |
| --- | --- | --- | --- |
| `Normal` | `loc`, `scale` | `LogScore`, `CRPScore` | [`scipy.stats` normal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) |
| `LogNormal` | `s`, `scale` | `LogScore`, `CRPScore` | [`scipy.stats` lognormal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) |
| `Exponential` | `scale` | `LogScore`, `CRPScore` | [`scipy.stats` exponential](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html) |

Regression distributions can be used through the `NGBRegressor()` constructor by passing the appropriate class as the `Dist` argument. `Normal` is the default.

In [40]:
from ngboost.distns import Exponential, Normal

#X, Y = load_boston(True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)

There are two prediction methods for `NGBRegressor` objects: `predict()`, which returns point predictions as one would expect from a standard regressor, and `pred_dist()`, which returns a distribution object representing the conditional distribution of $Y|X=x_i$ at the points $x_i$ in the test set.

In [41]:
ngb_norm.predict(X_reg_test)[0:5]

array([5.32616578, 5.49701783, 5.90284105, 6.09607629, 5.83945417])

In [42]:
ngb_exp.predict(X_reg_test)[0:5]

array([5.28220534, 5.48476816, 5.79223312, 6.08844029, 5.79566338])

In [43]:
ngb_exp.pred_dist(X_reg_test)[0:5].params

{'scale': array([5.28220534, 5.48476816, 5.79223312, 6.08844029, 5.79566338])}

#### Survival Regression

NGBoost supports analyses of right-censored data. Any distribution that can be used for regression in NGBoost can also be used for survival analysis in theory, but this requires the implementation of the right-censored version of the appropriate score. At the moment, `LogNormal` and `Exponential` have these scores implemented. To do survival analysis, use `NGBSurvival` and pass both the time-to-event (or censoring) and event indicator vectors to  `fit()`:

In [44]:
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

#X, Y = load_boston(True)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
Y = raw_df.values[1::2, 2]
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30) # time of an event or censoring
E_surv_train = Y_surv_train > 30 # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)

[iter 0] loss=1.2481 val_loss=0.0000 scale=4.0000 norm=2.4215
[iter 100] loss=0.5287 val_loss=0.0000 scale=2.0000 norm=0.8484
[iter 200] loss=0.2843 val_loss=0.0000 scale=2.0000 norm=0.4982
[iter 300] loss=0.1246 val_loss=0.0000 scale=2.0000 norm=0.2687
[iter 400] loss=-0.0351 val_loss=0.0000 scale=2.0000 norm=0.1915


The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.

### Classification Distributions

| Distribution | Parameters | Implemented Scores | Reference |
| --- | --- | --- | --- |
| `k_categorical(K)` | `p0`, `p1`... `p{K-1}` | `LogScore` | [Categorical distribution on Wikipedia](https://en.wikipedia.org/wiki/Categorical_distribution) |
| `Bernoulli` | `p` | `LogScore` | [Bernoulli distribution on Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_distribution) |

Classification distributions can be used through the `NGBClassifier()` constructor by passing the appropriate class as the `Dist` argument. `Bernoulli` is the default and is equivalent to `k_categorical(2)`.

In [45]:
from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import LabelEncoder
from sklearn.datasets import fetch_openml

#Load the heart disease dataset
heart_disease = fetch_openml(name='heart', version=1)
X, y = heart_disease.data, heart_disease.target

#X, y = load_breast_cancer(True)
#data = load_breast_cancer()
#X, y = data.data, data.target
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(X, y, test_size=0.2)

#Encode the target variable
label_encoder = LabelEncoder()
Y_cls_train_encoded = label_encoder.fit_transform(Y_cls_train)

#Check the unique values in the encoded target variable
print("Unique values in Y_cls_train_encoded:", set(Y_cls_train_encoded))

#Check the shapes of X_cls_train and Y_cls_train_encoded
print("X_cls_train shape:", X_cls_train.shape)
print("Y_cls_train_encoded shape:", Y_cls_train_encoded.shape)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train_encoded) # Y should have only 3 values: {0,1,2}

  warn(


Unique values in Y_cls_train_encoded: {0, 1, 2}
X_cls_train shape: (216, 13)
Y_cls_train_encoded shape: (216,)


When using NGBoost for classification, the outcome vector `Y` must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification standards in sklearn.

`NGBClassifier` objects have three prediction methods: `predict()` returns the most likely class, `predict_proba()` returns the class probabilities, and `pred_dist()` returns the distribution object.

In [46]:
ngb_cat.predict(X_cls_test)[0:5]

array([1, 0, 0, 1, 0])

In [47]:
ngb_cat.predict_proba(X_cls_test)[0:5]

array([[1.02014021e-02, 9.89552586e-01, 2.46011803e-04],
       [8.05147365e-01, 3.33608049e-03, 1.91516555e-01],
       [9.91401647e-01, 3.90426820e-03, 4.69408437e-03],
       [3.52302960e-03, 9.96342756e-01, 1.34214619e-04],
       [9.66194197e-01, 3.35886922e-02, 2.17110410e-04]])

In [48]:
ngb_cat.pred_dist(X_cls_test)[0:5].params

{'p0': array([0.0102014 , 0.80514736, 0.99140165, 0.00352303, 0.9661942 ]),
 'p1': array([0.98955259, 0.00333608, 0.00390427, 0.99634276, 0.03358869]),
 'p2': array([2.46011803e-04, 1.91516555e-01, 4.69408437e-03, 1.34214619e-04,
        2.17110410e-04])}

## Scores

NGBoost supports the log score (`LogScore`, also known as negative log-likelihood) and CRPS (`CRPScore`), although each score may not be implemented for each distribution. The score is specified by the `Score` argument in the constructor.

In [49]:
from ngboost.scores import LogScore, CRPScore
from sklearn.metrics import accuracy_score

#NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
#NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)

reg_model = NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
cls_model = NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train_encoded)

#For Regression model - Prediction and MSE for accuracy
reg_predictions= reg_model.predict(X_reg_test)
print(reg_predictions)

mse = mean_squared_error(Y_reg_test, reg_predictions)
print(f'Mean Squared Error: {mse}')


#For Classifier model - Prediction and accuracy
cls_predictions = cls_model.predict(X_cls_test)
print(cls_predictions)

predicted_labels = np.round(cls_predictions).astype(int)
accuracy = accuracy_score(Y_cls_test, predicted_labels)
print(f'Accuracy: {accuracy}')


[5.80347314 5.80347314 5.80657505 ... 5.80657505 5.8084494  5.8084494 ]
Mean Squared Error: 0.7627637194214171
[1 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0 0
 0 1 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0]
Accuracy: 0.24074074074074073


## Base Learners

NGBoost can be used with any sklearn regressor as the base learner, specified with the `Base` argument. The default is a depth-3 regression tree.

In [50]:
from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)

## Other Arguments

The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:

In [51]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
             minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)

[iter 0] loss=1.2592 val_loss=0.0000 scale=1.0000 norm=0.8927


Sample weights (for training) are set using the `sample_weight` argument to `fit`.

In [52]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
             minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)

[iter 0] loss=1.2921 val_loss=0.0000 scale=1.0000 norm=0.9179
