<a href="https://colab.research.google.com/github/Nili3005/ML/blob/main/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
!pip install --upgrade ngboost
!pip install ucimlrepo

Collecting ngboost
  Downloading ngboost-0.4.2-py3-none-any.whl (33 kB)
Collecting lifelines>=0.25 (from ngboost)
  Downloading lifelines-0.27.8-py3-none-any.whl (350 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m350.7/350.7 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
Collecting autograd-gamma>=0.3 (from lifelines>=0.25->ngboost)
  Downloading autograd-gamma-0.5.0.tar.gz (4.0 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting formulaic>=0.2.2 (from lifelines>=0.25->ngboost)
  Downloading formulaic-0.6.6-py3-none-any.whl (91 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m91.0/91.0 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
Collecting astor>=0.8 (from formulaic>=0.2.2->lifelines>=0.25->ngboost)
  Downloading astor-0.8.1-py2.py3-none-any.whl (27 kB)
Collecting interface-meta>=1.2.0 (from formulaic>=0.2.2->lifelines>=0.25->ngboost)
  Downloading interface_meta-1.3.0-py3-none-any.whl (14 kB)
Building wheels for collected pack

# Usage

We'll start with a probabilistic regression example on the Boston housing dataset:

In [10]:
import sys
sys.path.append('/Users/c242587/Desktop/projects/git/ngboost')

In [11]:
from ngboost import NGBRegressor

import pandas as pd
import numpy as np

#from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from ucimlrepo import fetch_ucirepo

# fetch dataset
#wine_quality = fetch_ucirepo(id=186)

# data (as pandas dataframes)
#X = wine_quality.data.features
#Y = wine_quality.data.targets

# Flatten Y using ravel()
#Y = np.ravel(Y)
#X, Y = load_boston(True)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
Y = raw_df.values[1::2, 2]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

[iter 0] loss=3.6238 val_loss=0.0000 scale=1.0000 norm=6.7170
[iter 100] loss=2.7043 val_loss=0.0000 scale=2.0000 norm=4.9132
[iter 200] loss=2.1597 val_loss=0.0000 scale=2.0000 norm=3.4042
[iter 300] loss=1.9104 val_loss=0.0000 scale=1.0000 norm=1.5005
[iter 400] loss=1.7806 val_loss=0.0000 scale=2.0000 norm=2.8023
Test MSE 7.299824632033597
Test NLL 2.895569380271196


Getting the estimated distributional parameters at a set of points is easy. This returns the predicted mean and standard deviation of the first five observations in the test set:

In [12]:
Y_dists[0:5].params

{'loc': array([21.16888241, 18.29786186, 42.64954321, 22.11837228, 13.23325624]),
 'scale': array([1.18749569, 1.60423796, 1.60792276, 1.48885109, 1.29549688])}

## Distributions

NGBoost can be used with a variety of distributions, broken down into those for regression (support on an infinite set) and those for classification (support on a finite set).

### Regression Distributions

| Distribution | Parameters | Implemented Scores | Reference |
| --- | --- | --- | --- |
| `Normal` | `loc`, `scale` | `LogScore`, `CRPScore` | [`scipy.stats` normal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) |
| `LogNormal` | `s`, `scale` | `LogScore`, `CRPScore` | [`scipy.stats` lognormal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html) |
| `Exponential` | `scale` | `LogScore`, `CRPScore` | [`scipy.stats` exponential](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html) |

Regression distributions can be used through the `NGBRegressor()` constructor by passing the appropriate class as the `Dist` argument. `Normal` is the default.

In [13]:
from ngboost.distns import Exponential, Normal

#X, Y = load_boston(True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)

There are two prediction methods for `NGBRegressor` objects: `predict()`, which returns point predictions as one would expect from a standard regressor, and `pred_dist()`, which returns a distribution object representing the conditional distribution of $Y|X=x_i$ at the points $x_i$ in the test set.

In [14]:
ngb_norm.predict(X_reg_test)[0:5]

array([10.96172215, 22.55216199, 21.84292561, 22.90781053, 34.34409991])

In [15]:
ngb_exp.predict(X_reg_test)[0:5]

array([10.80557037, 22.99913963, 22.31504771, 23.41282165, 34.53682112])

In [16]:
ngb_exp.pred_dist(X_reg_test)[0:5].params

{'scale': array([10.80557037, 22.99913963, 22.31504771, 23.41282165, 34.53682112])}

#### Survival Regression

NGBoost supports analyses of right-censored data. Any distribution that can be used for regression in NGBoost can also be used for survival analysis in theory, but this requires the implementation of the right-censored version of the appropriate score. At the moment, `LogNormal` and `Exponential` have these scores implemented. To do survival analysis, use `NGBSurvival` and pass both the time-to-event (or censoring) and event indicator vectors to  `fit()`:

In [17]:
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

#X, Y = load_boston(True)
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30) # time of an event or censoring
E_surv_train = Y_surv_train > 30 # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)

[iter 0] loss=1.2823 val_loss=0.0000 scale=4.0000 norm=2.2866
[iter 100] loss=0.5736 val_loss=0.0000 scale=2.0000 norm=0.8491
[iter 200] loss=0.3152 val_loss=0.0000 scale=2.0000 norm=0.4991
[iter 300] loss=0.1258 val_loss=0.0000 scale=2.0000 norm=0.2803
[iter 400] loss=-0.0325 val_loss=0.0000 scale=2.0000 norm=0.2189


The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.

### Classification Distributions

| Distribution | Parameters | Implemented Scores | Reference |
| --- | --- | --- | --- |
| `k_categorical(K)` | `p0`, `p1`... `p{K-1}` | `LogScore` | [Categorical distribution on Wikipedia](https://en.wikipedia.org/wiki/Categorical_distribution) |
| `Bernoulli` | `p` | `LogScore` | [Bernoulli distribution on Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_distribution) |

Classification distributions can be used through the `NGBClassifier()` constructor by passing the appropriate class as the `Dist` argument. `Bernoulli` is the default and is equivalent to `k_categorical(2)`.

In [18]:
from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

from ucimlrepo import fetch_ucirepo

from ucimlrepo import fetch_ucirepo

# fetch dataset
#heart_disease = fetch_ucirepo(id=45)

# data (as pandas dataframes)
#X = heart_disease.data.features
#y = heart_disease.data.targets


#X, y = load_breast_cancer(True)
data = load_breast_cancer()
X, y = data.data, data.target
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}

When using NGBoost for classification, the outcome vector `Y` must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification standards in sklearn.

`NGBClassifier` objects have three prediction methods: `predict()` returns the most likely class, `predict_proba()` returns the class probabilities, and `pred_dist()` returns the distribution object.

In [19]:
ngb_cat.predict(X_cls_test)[0:5]

array([0, 0, 1, 1, 1])

In [20]:
ngb_cat.predict_proba(X_cls_test)[0:5]

array([[9.48355272e-01, 5.12332294e-02, 4.11498347e-04],
       [9.92796726e-01, 6.83575542e-03, 3.67518953e-04],
       [4.24328073e-03, 9.95494360e-01, 2.62359331e-04],
       [4.34911250e-03, 9.95381985e-01, 2.68902842e-04],
       [4.24328073e-03, 9.95494360e-01, 2.62359331e-04]])

In [21]:
ngb_cat.pred_dist(X_cls_test)[0:5].params

{'p0': array([0.94835527, 0.99279673, 0.00424328, 0.00434911, 0.00424328]),
 'p1': array([0.05123323, 0.00683576, 0.99549436, 0.99538198, 0.99549436]),
 'p2': array([0.0004115 , 0.00036752, 0.00026236, 0.0002689 , 0.00026236])}

## Scores

NGBoost supports the log score (`LogScore`, also known as negative log-likelihood) and CRPS (`CRPScore`), although each score may not be implemented for each distribution. The score is specified by the `Score` argument in the constructor.

In [22]:
from ngboost.scores import LogScore, CRPScore

reg_model = NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
cls_model = NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)

# Make predictions
reg_predictions = reg_model.predict(X_reg_test)
print(reg_predictions)
cls_predictions = cls_model.predict(X_cls_test)
print(cls_predictions)

[13.82521871 21.71930169 20.47514449 21.10050302 30.14969866 13.65722278
 20.14196318 15.42448388 20.27361031 20.01984544 16.47141149 13.91228161
 28.99495239 22.4247005  21.70226441 21.71930169 26.67748557 17.55476288
 16.41458287 28.99495239 39.90239156 24.61445674 20.40887017 15.50802302
 22.28874286 31.00805106 22.44230493 20.13192051 38.09619339 22.28874286
 14.9229364  20.42175861 24.61445674 20.41576013 29.55777856 13.65722278
 20.0029289  19.21753973 23.55880711 20.31699508 27.72214565 13.96597069
 18.3870906  16.84365601 15.97434119 30.72990128 20.13192051 26.94302197
 22.4247005  25.56938106 20.39414792 16.40769175 23.82147528 21.71930169
 13.65722278 31.03516533 16.41458287 21.71930169 22.64228256 28.88673674
 22.44230493 16.05131273 17.59841907 14.15501533 17.77132751 17.49419655
 18.47304828 22.44230493 13.65722278 17.35865527 19.67899171 14.08341499
 13.65722278 17.73556233 22.4247005  28.80670903 17.95228385 22.11649658
 21.67139707 26.94302197 14.03697394 21.67139707 21

## Base Learners

NGBoost can be used with any sklearn regressor as the base learner, specified with the `Base` argument. The default is a depth-3 regression tree.

In [23]:
from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)


## Other Arguments

The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:

In [24]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
             minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)

[iter 0] loss=3.5966 val_loss=0.0000 scale=2.0000 norm=12.9112


Sample weights (for training) are set using the `sample_weight` argument to `fit`.

In [25]:
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
             minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)

[iter 0] loss=3.6098 val_loss=0.0000 scale=1.0000 norm=6.3839


In [26]:
# Step 4: Add, Commit, and Push Changes
!git add .
!git commit -m "Add Colab code to Demo branch"  # Adjust the commit message as needed
!git push origin Demo  # Use the branch name you created

On branch Demo
nothing to commit, working tree clean
fatal: could not read Username for 'https://github.com': No such device or address
