Generalized Linear Models, and Poisson loss for gradient boosting

Long-awaited Generalized Linear Models with non-normal loss functions are now available. In particular, three new regressors were implemented: PoissonRegressor, GammaRegressor, and TweedieRegressor. The Poisson regressor can be used to model non-negative integer counts (zeros included) or relative frequencies. Read more in the User Guide. Additionally, HistGradientBoostingRegressor now supports a new 'poisson' loss.

In [6]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# non-negative integer target correlated with X[:, 5], with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss='poisson', learning_rate=.01)
glm.fit(X_train, y_train)
gbdt.fit(X_train, y_train)
# Note: the two scores are different metrics. PoissonRegressor.score returns
# D^2 (fraction of deviance explained), while
# HistGradientBoostingRegressor.score returns R^2.
print(glm.score(X_test, y_test))
print(gbdt.score(X_test, y_test))

0.35776189065725783
0.42425183539869415
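
The cell above only exercises PoissonRegressor. As a minimal sketch of the other two regressors, the snippet below fits a TweedieRegressor with power=1 and a log link, which the scikit-learn documentation describes as equivalent to PoissonRegressor, and a GammaRegressor on a strictly positive target. The synthetic data, coefficients, and variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import (GammaRegressor, PoissonRegressor,
                                  TweedieRegressor)

rng = np.random.RandomState(0)
X = rng.randn(500, 3)

# Count target (illustrative): Poisson-distributed, zeros allowed.
y_counts = rng.poisson(lam=np.exp(0.5 * X[:, 0]))

# Same penalty strength for both models so the objectives match.
poisson = PoissonRegressor(alpha=1.0).fit(X, y_counts)
tweedie = TweedieRegressor(power=1, link='log', alpha=1.0).fit(X, y_counts)

# The two fits should agree up to solver tolerance.
coef_match = np.allclose(poisson.coef_, tweedie.coef_, atol=1e-3)
print(coef_match)

# Strictly positive continuous target (illustrative) for the Gamma model.
y_positive = rng.gamma(shape=2.0, scale=np.exp(0.3 * X[:, 0]) / 2)
gamma = GammaRegressor().fit(X, y_positive)

# The log link keeps every prediction strictly positive.
gamma_pred = gamma.predict(X)
print(gamma_pred.min() > 0)
```

GammaRegressor requires strictly positive targets, so it is not a drop-in replacement for the count data used with PoissonRegressor.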


Rich visual representation of estimators

Estimators can now be visualized in notebooks by enabling the display='diagram' option.

This is particularly useful to summarise the structure of pipelines and other composite estimators, with interactive elements that reveal more detail. Click on the example image below to expand Pipeline elements. See Visualizing Composite Estimators for how you can use this feature.

In [1]:
import sklearn
sklearn.__version__

'0.23.1'

In [7]:
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
set_config(display='diagram')

num_proc = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())

cat_proc = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(handle_unknown='ignore'))

preprocessor = make_column_transformer((num_proc, ('feat1', 'feat3')),
                                       (cat_proc, ('feat0', 'feat2')))

clf = make_pipeline(preprocessor, LogisticRegression())
clf
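
Outside a notebook there is no rich output to display the diagram. One possible approach, sketched here, is to render the same HTML with sklearn.utils.estimator_html_repr and write it to a file that can be opened in a browser. The pipeline below is a simplified stand-in for the one above, and the output file name is a hypothetical choice.

```python
import os
import tempfile

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Simplified pipeline used only for illustration.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Returns the HTML diagram as a plain string; no fitting is needed.
html = estimator_html_repr(pipe)

# Hypothetical output location: a file in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), 'pipeline_diagram.html')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(html)
```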

Scalability and stability improvements to KMeans

The KMeans estimator was entirely reworked, and it is now significantly faster and more stable. In addition, the Elkan algorithm is now compatible with sparse matrices. The estimator uses OpenMP-based parallelism instead of relying on joblib, so the n_jobs parameter no longer has any effect. For more details on how to control the number of threads, please refer to our Parallelism notes.
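
As a minimal sketch of controlling the thread count now that n_jobs is ignored, the snippet below caps OpenMP threads with the threadpoolctl package, which scikit-learn depends on. The limit of 2 threads and the random data are arbitrary assumptions. Setting the OMP_NUM_THREADS environment variable before starting Python is another option.

```python
import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits

rng = np.random.RandomState(0)
X = rng.randn(300, 2)

# Limit OpenMP-backed code to 2 threads inside this block only.
with threadpool_limits(limits=2, user_api='openmp'):
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (3, 2)
```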